SageMaker handles the undifferentiated heavy lifting of ML training: provisioning instances, pulling containers, managing storage, and tearing everything down when the job finishes. You write a training script, point SageMaker at your data, and let it run.
Here’s the fastest path to a training job:
```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

role = get_execution_role()
session = sagemaker.Session()

estimator = PyTorch(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py310",
    output_path=f"s3://{session.default_bucket()}/models/",
)
estimator.fit({"train": "s3://my-bucket/data/train/"})
```
That’s it. SageMaker spins up a p3.2xlarge with a V100, runs your train.py script, saves the model artifact to S3, and shuts down the instance. You pay only for the compute time used.
## Setting Up Your SageMaker Session
Every SageMaker workflow starts with a session and an IAM role. If you’re running inside a SageMaker notebook instance or Studio, get_execution_role() grabs the attached role automatically. Running locally requires you to specify the role ARN directly.
```python
import boto3
import sagemaker
from sagemaker import get_execution_role

# Inside SageMaker notebook/Studio
role = get_execution_role()

# Running locally: pass the role ARN explicitly
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

session = sagemaker.Session(boto_session=boto3.Session(region_name="us-east-1"))
bucket = session.default_bucket()
print(f"Using bucket: {bucket}")
print(f"Region: {session.boto_region_name}")
```
Your IAM role needs AmazonSageMakerFullAccess at minimum. For production, scope it down to the specific S3 buckets and ECR repos your pipeline touches.
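As a sketch of what "scoped down" might look like, here is a minimal S3 permissions statement; the bucket name is a placeholder, and a real role also needs ECR, CloudWatch Logs, and other permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-ml-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-ml-bucket/*"
    }
  ]
}
```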
## Training with Built-in Algorithms
SageMaker ships with optimized containers for common algorithms. XGBoost is probably the most popular one, and it’s a solid choice for tabular data before you reach for deep learning.
```python
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput

role = get_execution_role()
session = sagemaker.Session()
bucket = session.default_bucket()

# Use the SageMaker-managed XGBoost container
xgb_container = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=session.boto_region_name,
    version="1.7-1",
)

xgb_estimator = sagemaker.estimator.Estimator(
    image_uri=xgb_container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/xgboost-output/",
    sagemaker_session=session,
)

xgb_estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=200,
    max_depth=6,
    eta=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
)

train_input = TrainingInput(
    s3_data=f"s3://{bucket}/data/train/",
    content_type="text/csv",
)
validation_input = TrainingInput(
    s3_data=f"s3://{bucket}/data/validation/",
    content_type="text/csv",
)

xgb_estimator.fit({"train": train_input, "validation": validation_input})
```
The built-in XGBoost container expects CSV or libsvm input: the first column is the label, and there is no header row. Upload your data to S3 in that format before running the job.
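To make the expected layout concrete, here's a minimal stdlib sketch that writes rows in the label-first, headerless format; the file path and feature values are made up:

```python
import csv

# Hypothetical (label, features) pairs
rows = [
    (1, [0.5, 1.2, 3.4]),
    (0, [0.1, 0.9, 2.2]),
]

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for label, features in rows:
        # Label goes in the first column; no header row is written
        writer.writerow([label, *features])
```

From there, upload the file to your training prefix (e.g. with `aws s3 cp` or `session.upload_data`).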
One thing to watch: sagemaker.image_uris.retrieve() replaced the old get_image_uri function. If you see code using the old pattern, update it.
## Custom Training Scripts with PyTorch
Built-in algorithms get you surprisingly far, but most real projects need custom code. SageMaker’s PyTorch estimator lets you bring your own training script while SageMaker handles the infrastructure.
Your training script needs to follow a simple contract. SageMaker passes paths for input data and model output through environment variables, and the Python SDK wraps these into argparse arguments.
Here’s a working train.py:
```python
import argparse
import os

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)


def train(args):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # SageMaker puts training data in /opt/ml/input/data/<channel_name>/
    train_dir = args.train
    X_train = torch.load(os.path.join(train_dir, "X_train.pt"))
    y_train = torch.load(os.path.join(train_dir, "y_train.pt"))
    dataset = TensorDataset(X_train, y_train)
    loader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True)

    model = SimpleClassifier(
        input_dim=args.input_dim,
        hidden_dim=args.hidden_dim,
        output_dim=args.output_dim,
    ).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=args.lr)

    for epoch in range(args.epochs):
        model.train()
        total_loss = 0.0
        for X_batch, y_batch in loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            optimizer.zero_grad()
            output = model(X_batch)
            loss = criterion(output, y_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(loader)
        print(f"Epoch {epoch+1}/{args.epochs} - Loss: {avg_loss:.4f}")

    # SageMaker expects the model at /opt/ml/model/
    model_path = os.path.join(args.model_dir, "model.pth")
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--input-dim", type=int, default=784)
    parser.add_argument("--hidden-dim", type=int, default=256)
    parser.add_argument("--output-dim", type=int, default=10)
    # SageMaker-specific arguments (injected automatically)
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    args = parser.parse_args()
    train(args)
```
Now launch it from your notebook or local machine:
```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src/",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py310",
    hyperparameters={
        "epochs": 20,
        "batch-size": 128,
        "lr": 0.0005,
        "input-dim": 784,
        "hidden-dim": 512,
        "output-dim": 10,
    },
    output_path=f"s3://{bucket}/pytorch-output/",
)
estimator.fit({"train": f"s3://{bucket}/data/train/"})
```
The source_dir parameter lets you include other Python files your training script imports. SageMaker tars up the directory and sends it to the training instance.
## Spot Instances for Cost Savings
Managed spot training is one of the best SageMaker features. You can cut training costs by up to 90% compared to on-demand pricing. The tradeoff: your job might get interrupted if AWS needs the capacity back. SageMaker handles checkpointing and restarts automatically if you configure it.
```python
from sagemaker.pytorch import PyTorch

spot_estimator = PyTorch(
    entry_point="train.py",
    source_dir="src/",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 50, "batch-size": 128, "lr": 0.001},
    output_path=f"s3://{bucket}/spot-output/",
    # Spot instance configuration
    use_spot_instances=True,
    max_run=3600 * 4,   # Max 4 hours total runtime
    max_wait=3600 * 5,  # Max 5 hours including wait for capacity
    checkpoint_s3_uri=f"s3://{bucket}/checkpoints/",
)
spot_estimator.fit({"train": f"s3://{bucket}/data/train/"})
```
The max_wait must be greater than max_run. If SageMaker can’t get spot capacity within the difference, the job fails. Set max_wait to at least max_run plus an hour of buffer.
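That rule of thumb can be encoded in a tiny helper (the function name is mine, not part of the SDK):

```python
def spot_wait_budget(max_run_seconds, buffer_seconds=3600):
    """Return a max_wait that leaves buffer_seconds of capacity-wait headroom beyond max_run."""
    return max_run_seconds + buffer_seconds

max_run = 3600 * 4
max_wait = spot_wait_budget(max_run)  # 5 hours: 4h runtime cap + 1h capacity buffer
assert max_wait > max_run  # SageMaker rejects jobs where max_wait <= max_run
```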
For checkpointing to work, your training script needs to save and load checkpoints from /opt/ml/checkpoints/. SageMaker syncs this directory to your checkpoint_s3_uri. Add something like this to your training loop:
```python
import os

import torch

checkpoint_dir = "/opt/ml/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(checkpoint_dir, "checkpoint.pth")

# Save checkpoint at the end of each epoch
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": avg_loss,
}, checkpoint_path)

# Load checkpoint at the start of training (if resuming)
if os.path.isfile(checkpoint_path):
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1
    print(f"Resuming from epoch {start_epoch}")
```
My recommendation: always use spot instances for training jobs that take longer than 30 minutes. The savings are substantial and interruptions are rare in practice. For short jobs under 30 minutes, on-demand is fine since the overhead of checkpointing isn’t worth it.
## Deploying to an Endpoint
Once training completes, deploy the model to a real-time endpoint with one call:
```python
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Run inference
test_data = {"inputs": [[0.1, 0.2, 0.3, 0.4]]}
result = predictor.predict(test_data)
print(result)
```
For the PyTorch model to serve predictions, you also need model_fn, input_fn, and predict_fn handlers in your script (or a separate inference.py):
```python
import json
import os

import torch

# SimpleClassifier is the model class from train.py; if you use a separate
# inference.py, import it from your source_dir or redefine it here.


def model_fn(model_dir):
    model = SimpleClassifier(input_dim=784, hidden_dim=512, output_dim=10)
    model.load_state_dict(torch.load(os.path.join(model_dir, "model.pth")))
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    data = json.loads(request_body)
    return torch.tensor(data["inputs"], dtype=torch.float32)


def predict_fn(input_data, model):
    with torch.no_grad():
        output = model(input_data)
        predictions = torch.argmax(output, dim=1)
    return predictions.tolist()
```
Don’t forget to delete the endpoint when you’re done testing. SageMaker endpoints bill by the hour whether they’re receiving traffic or not:
```python
predictor.delete_endpoint()
```
## Common Errors and Fixes
**`ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation`**
Usually a permissions issue. Your SageMaker execution role needs access to the S3 bucket containing your training data. Check that the role's trust policy allows sagemaker.amazonaws.com to assume it, and that its permissions policy grants s3:GetObject on your data prefix.
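For reference, the trust policy that lets SageMaker assume the role looks like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "sagemaker.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
```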
**`AlgorithmError: Framework Error... No module named 'your_module'`**
Your source_dir is missing dependencies. Add a requirements.txt file inside source_dir; SageMaker installs those packages before running your script.
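A hypothetical src/requirements.txt might look like this (pin versions so training runs are reproducible; the packages and versions here are illustrative):

```text
numpy==1.26.4
scikit-learn==1.4.2
```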
**`ResourceLimitExceeded: The account-level service limit 'ml.p3.2xlarge for training job usage' is 0`**
New AWS accounts have zero quota for GPU instances. Go to Service Quotas in the AWS Console, search for SageMaker, and request an increase for the instance type you need. Approvals typically take 1-2 business days.
**`UnexpectedStatusException: Error for Training job ... Training job failed`**
Check CloudWatch Logs under /aws/sagemaker/TrainingJobs. The actual Python traceback is there. Most common causes: wrong data format, missing files in S3, or out-of-memory errors. For OOM, either reduce batch size or use a larger instance.
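If you'd rather pull the log tail from a script than click through the console, here's a sketch using the CloudWatch Logs API; the function name is mine, and you'd pass in `boto3.client("logs")`:

```python
def latest_training_log_events(logs_client, job_name, limit=50):
    """Return the newest messages from the most recent log stream of a training job."""
    group = "/aws/sagemaker/TrainingJobs"
    streams = logs_client.describe_log_streams(
        logGroupName=group,
        logStreamNamePrefix=job_name,
        orderBy="LogStreamName",
        descending=True,
        limit=1,
    )["logStreams"]
    if not streams:
        return []
    events = logs_client.get_log_events(
        logGroupName=group,
        logStreamName=streams[0]["logStreamName"],
        limit=limit,
        startFromHead=False,
    )["events"]
    return [e["message"] for e in events]
```

Call it as `latest_training_log_events(boto3.client("logs"), job_name)`; the Python traceback usually sits in the last handful of messages.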
**Spot instance job stuck in `Waiting` status**
Capacity isn’t available. Either increase max_wait, switch to a different instance type (try ml.p3.8xlarge or ml.g4dn.xlarge as alternatives), or switch regions. us-east-1 and us-west-2 tend to have the best availability.
**`ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation`**
Your inference functions (model_fn, input_fn, predict_fn) have a bug. Test them locally before deploying. The serializer/deserializer on the predictor must match what your inference code expects.
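One lightweight way to test locally is a tiny harness (the function name is mine) that drives your handlers with the same JSON body the JSONSerializer would send over the wire:

```python
import json

def smoke_test_handlers(input_fn, predict_fn, model, payload):
    """Exercise the inference contract locally: serialize -> input_fn -> predict_fn."""
    request_body = json.dumps(payload)  # what JSONSerializer sends to the endpoint
    model_input = input_fn(request_body, "application/json")
    return predict_fn(model_input, model)
```

Run it against your real model_fn output and a sample payload before deploying; a bug here surfaces as a traceback on your machine instead of an opaque ModelError from the endpoint.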