SageMaker handles the undifferentiated heavy lifting of ML training: provisioning instances, pulling containers, managing storage, and tearing everything down when the job finishes. You write a training script, point SageMaker at your data, and let it run.
Here’s the fastest path to a training job:
```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

role = get_execution_role()
session = sagemaker.Session()

estimator = PyTorch(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py310",
    output_path=f"s3://{session.default_bucket()}/models/",
)
estimator.fit({"train": "s3://my-bucket/data/train/"})
```
That’s it. SageMaker spins up a p3.2xlarge with a V100, runs your train.py script, saves the model artifact to S3, and shuts down the instance. You pay only for the compute time used.
## Setting Up Your SageMaker Session
Every SageMaker workflow starts with a session and an IAM role. If you’re running inside a SageMaker notebook instance or Studio, get_execution_role() grabs the attached role automatically. Running locally requires you to specify the role ARN directly.
```python
import boto3
import sagemaker
from sagemaker import get_execution_role

# Inside SageMaker notebook/Studio
role = get_execution_role()

# Running locally: pass the role ARN explicitly
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

session = sagemaker.Session(boto_session=boto3.Session(region_name="us-east-1"))
bucket = session.default_bucket()
print(f"Using bucket: {bucket}")
print(f"Region: {session.boto_region_name}")
```
Your IAM role needs AmazonSageMakerFullAccess at minimum. For production, scope it down to the specific S3 buckets and ECR repos your pipeline touches.
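As a sketch of what "scoped down" might look like, here is a minimal S3 permissions statement; the bucket name is a placeholder, and a real role also needs ECR, CloudWatch Logs, and other permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-ml-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-ml-bucket/*"
    }
  ]
}
```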
## Training with Built-in Algorithms
SageMaker ships with optimized containers for common algorithms. XGBoost is probably the most popular one, and it’s a solid choice for tabular data before you reach for deep learning.
```python
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput

role = get_execution_role()
session = sagemaker.Session()
bucket = session.default_bucket()

# Use the SageMaker-managed XGBoost container
xgb_container = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=session.boto_region_name,
    version="1.7-1",
)

xgb_estimator = sagemaker.estimator.Estimator(
    image_uri=xgb_container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/xgboost-output/",
    sagemaker_session=session,
)

xgb_estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=200,
    max_depth=6,
    eta=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
)

train_input = TrainingInput(
    s3_data=f"s3://{bucket}/data/train/",
    content_type="text/csv",
)
validation_input = TrainingInput(
    s3_data=f"s3://{bucket}/data/validation/",
    content_type="text/csv",
)

xgb_estimator.fit({"train": train_input, "validation": validation_input})
```
The built-in XGBoost container expects CSV or libsvm input: the first column is the label, and there is no header row. Upload your data to S3 in that format before running the job.
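To make the expected layout concrete, here's a minimal stdlib sketch that writes rows in the label-first, headerless format; the file path and feature values are made up:

```python
import csv

# Hypothetical (label, features) pairs
rows = [
    (1, [0.5, 1.2, 3.4]),
    (0, [0.1, 0.9, 2.2]),
]

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for label, features in rows:
        # Label goes in the first column; no header row is written
        writer.writerow([label, *features])
```

From there, upload the file to your training prefix (e.g. with `aws s3 cp` or `session.upload_data`).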
One thing to watch: sagemaker.image_uris.retrieve() replaced the old get_image_uri function. If you see code using the old pattern, update it.
## Custom Training Scripts with PyTorch
Built-in algorithms get you surprisingly far, but most real projects need custom code. SageMaker’s PyTorch estimator lets you bring your own training script while SageMaker handles the infrastructure.
Your training script needs to follow a simple contract. SageMaker passes paths for input data and model output through environment variables, and the Python SDK wraps these into argparse arguments.
Here’s a working train.py:
```python
import argparse
import os

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)


def train(args):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # SageMaker puts training data in /opt/ml/input/data/<channel_name>/
    train_dir = args.train
    X_train = torch.load(os.path.join(train_dir, "X_train.pt"))
    y_train = torch.load(os.path.join(train_dir, "y_train.pt"))
    dataset = TensorDataset(X_train, y_train)
    loader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True)

    model = SimpleClassifier(
        input_dim=args.input_dim,
        hidden_dim=args.hidden_dim,
        output_dim=args.output_dim,
    ).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=args.lr)

    for epoch in range(args.epochs):
        model.train()
        total_loss = 0.0
        for X_batch, y_batch in loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            optimizer.zero_grad()
            output = model(X_batch)
            loss = criterion(output, y_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(loader)
        print(f"Epoch {epoch+1}/{args.epochs} - Loss: {avg_loss:.4f}")

    # SageMaker expects the model at /opt/ml/model/
    model_path = os.path.join(args.model_dir, "model.pth")
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--input-dim", type=int, default=784)
    parser.add_argument("--hidden-dim", type=int, default=256)
    parser.add_argument("--output-dim", type=int, default=10)
    # SageMaker-specific arguments (injected automatically)
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    args = parser.parse_args()
    train(args)
```
Now launch it from your notebook or local machine:
```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src/",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py310",
    hyperparameters={
        "epochs": 20,
        "batch-size": 128,
        "lr": 0.0005,
        "input-dim": 784,
        "hidden-dim": 512,
        "output-dim": 10,
    },
    output_path=f"s3://{bucket}/pytorch-output/",
)
estimator.fit({"train": f"s3://{bucket}/data/train/"})
```
The source_dir parameter lets you include other Python files your training script imports. SageMaker tars up the directory and sends it to the training instance.
## Spot Instances for Cost Savings
Managed spot training is one of the best SageMaker features. You can cut training costs by up to 90% compared to on-demand pricing. The tradeoff: your job might get interrupted if AWS needs the capacity back. SageMaker handles checkpointing and restarts automatically if you configure it.
```python
from sagemaker.pytorch import PyTorch

spot_estimator = PyTorch(
    entry_point="train.py",
    source_dir="src/",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 50, "batch-size": 128, "lr": 0.001},
    output_path=f"s3://{bucket}/spot-output/",
    # Spot instance configuration
    use_spot_instances=True,
    max_run=3600 * 4,   # Max 4 hours total runtime
    max_wait=3600 * 5,  # Max 5 hours including wait for capacity
    checkpoint_s3_uri=f"s3://{bucket}/checkpoints/",
)
spot_estimator.fit({"train": f"s3://{bucket}/data/train/"})
```
The max_wait must be greater than max_run. If SageMaker can’t get spot capacity within the difference, the job fails. Set max_wait to at least max_run plus an hour of buffer.
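That rule of thumb can be encoded in a tiny helper (the function name is mine, not part of the SDK):

```python
def spot_wait_budget(max_run_seconds, buffer_seconds=3600):
    """Return a max_wait that leaves buffer_seconds of capacity-wait headroom beyond max_run."""
    return max_run_seconds + buffer_seconds

max_run = 3600 * 4
max_wait = spot_wait_budget(max_run)  # 5 hours: 4h runtime cap + 1h capacity buffer
assert max_wait > max_run  # SageMaker rejects jobs where max_wait <= max_run
```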
For checkpointing to work, your training script needs to save and load checkpoints from /opt/ml/checkpoints/. SageMaker syncs this directory to your checkpoint_s3_uri. Add something like this to your training loop:
```python
import os

import torch

checkpoint_dir = "/opt/ml/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(checkpoint_dir, "checkpoint.pth")

# Save checkpoint at the end of each epoch
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": avg_loss,
}, checkpoint_path)

# Load checkpoint at the start of training (if resuming)
if os.path.isfile(checkpoint_path):
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1
    print(f"Resuming from epoch {start_epoch}")
```
My recommendation: always use spot instances for training jobs that take longer than 30 minutes. The savings are substantial and interruptions are rare in practice. For short jobs under 30 minutes, on-demand is fine since the overhead of checkpointing isn’t worth it.
## Deploying to an Endpoint
Once training completes, deploy the model to a real-time endpoint with one call:
```python
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Run inference
test_data = {"inputs": [[0.1, 0.2, 0.3, 0.4]]}
result = predictor.predict(test_data)
print(result)
```
For the PyTorch model to serve predictions, you also need model_fn, input_fn, and predict_fn handlers in your script (or a separate inference.py):
```python
import json
import os

import torch

# SimpleClassifier is the model class from train.py; if you use a separate
# inference.py, import it from your source_dir or redefine it here.


def model_fn(model_dir):
    model = SimpleClassifier(input_dim=784, hidden_dim=512, output_dim=10)
    model.load_state_dict(torch.load(os.path.join(model_dir, "model.pth")))
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    data = json.loads(request_body)
    return torch.tensor(data["inputs"], dtype=torch.float32)


def predict_fn(input_data, model):
    with torch.no_grad():
        output = model(input_data)
        predictions = torch.argmax(output, dim=1)
    return predictions.tolist()
```
Don’t forget to delete the endpoint when you’re done testing. SageMaker endpoints bill by the hour whether they’re receiving traffic or not:
```python
predictor.delete_endpoint()
```
## Common Errors and Fixes
**`ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation`**
Usually a permissions issue. Your SageMaker execution role needs access to the S3 bucket containing your training data. Check that the role's trust policy allows sagemaker.amazonaws.com to assume it, and that its permissions policy grants s3:GetObject on your data prefix.
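For reference, the trust policy that lets SageMaker assume the role looks like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "sagemaker.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
```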
**`AlgorithmError: Framework Error... No module named 'your_module'`**
Your source_dir is missing dependencies. Add a requirements.txt file inside source_dir; SageMaker installs those packages before running your script.
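A hypothetical src/requirements.txt might look like this (pin versions so training runs are reproducible; the packages and versions here are illustrative):

```text
numpy==1.26.4
scikit-learn==1.4.2
```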
**`ResourceLimitExceeded: The account-level service limit 'ml.p3.2xlarge for training job usage' is 0`**
New AWS accounts have zero quota for GPU instances. Go to Service Quotas in the AWS Console, search for SageMaker, and request an increase for the instance type you need. Approvals typically take 1-2 business days.
**`UnexpectedStatusException: Error for Training job ... Training job failed`**
Check CloudWatch Logs under /aws/sagemaker/TrainingJobs. The actual Python traceback is there. Most common causes: wrong data format, missing files in S3, or out-of-memory errors. For OOM, either reduce batch size or use a larger instance.
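If you'd rather pull the log tail from a script than click through the console, here's a sketch using the CloudWatch Logs API; the function name is mine, and you'd pass in `boto3.client("logs")`:

```python
def latest_training_log_events(logs_client, job_name, limit=50):
    """Return the newest messages from the most recent log stream of a training job."""
    group = "/aws/sagemaker/TrainingJobs"
    streams = logs_client.describe_log_streams(
        logGroupName=group,
        logStreamNamePrefix=job_name,
        orderBy="LogStreamName",
        descending=True,
        limit=1,
    )["logStreams"]
    if not streams:
        return []
    events = logs_client.get_log_events(
        logGroupName=group,
        logStreamName=streams[0]["logStreamName"],
        limit=limit,
        startFromHead=False,
    )["events"]
    return [e["message"] for e in events]
```

Call it as `latest_training_log_events(boto3.client("logs"), job_name)`; the Python traceback usually sits in the last handful of messages.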
**Spot instance job stuck in `Waiting` status**
Capacity isn’t available. Either increase max_wait, switch to a different instance type (try ml.p3.8xlarge or ml.g4dn.xlarge as alternatives), or switch regions. us-east-1 and us-west-2 tend to have the best availability.
**`ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation`**
Your inference functions (model_fn, input_fn, predict_fn) have a bug. Test them locally before deploying. The serializer/deserializer on the predictor must match what your inference code expects.
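One lightweight way to test locally is a tiny harness (the function name is mine) that drives your handlers with the same JSON body the JSONSerializer would send over the wire:

```python
import json

def smoke_test_handlers(input_fn, predict_fn, model, payload):
    """Exercise the inference contract locally: serialize -> input_fn -> predict_fn."""
    request_body = json.dumps(payload)  # what JSONSerializer sends to the endpoint
    model_input = input_fn(request_body, "application/json")
    return predict_fn(model_input, model)
```

Run it against your real model_fn output and a sample payload before deploying; a bug here surfaces as a traceback on your machine instead of an opaque ModelError from the endpoint.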