AWS SageMaker
Amazon SageMaker is AWS's fully managed machine learning platform that covers the entire ML lifecycle, from data labeling and feature engineering to training, tuning, deployment, and monitoring. It is one of the most widely adopted cloud ML platforms in enterprise settings.
SageMaker's Core Value Proposition
SageMaker removes infrastructure work from the ML lifecycle: it provisions compute for training and inference on demand, scales it, and tears it down when a job completes, so teams pay only for what they use and focus on model code rather than servers.
SageMaker Architecture Overview
SageMaker is organized around several key components:
SageMaker Studio
The integrated IDE for ML development. It provides Jupyter notebooks, experiment tracking, a model registry, and pipeline management in a unified web interface.
Training Jobs
Managed training infrastructure that provisions instances on demand, runs your training script in a container, writes the model artifact to S3, and terminates the instances when the job completes.
Endpoints
Managed real-time inference infrastructure: an HTTPS endpoint backed by one or more instances that hosts your model for low-latency predictions.
Pipelines
ML workflow orchestration: chain preprocessing, training, evaluation, and conditional deployment steps into a graph that SageMaker executes and tracks.

```python
# SageMaker Training Job - Complete Example
# This shows how to train an XGBoost model on SageMaker

import sagemaker
from sagemaker import Session
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost import XGBoost

# Initialize SageMaker session
session = Session()
role = sagemaker.get_execution_role()  # IAM role for SageMaker
bucket = session.default_bucket()

# --- Step 1: Upload data to S3 ---
train_path = session.upload_data(
    path="data/train.csv",
    bucket=bucket,
    key_prefix="demo/train",
)
val_path = session.upload_data(
    path="data/val.csv",
    bucket=bucket,
    key_prefix="demo/val",
)

# --- Step 2: Configure the training job ---
xgb_estimator = XGBoost(
    entry_point="train.py",  # Your training script
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",  # CPU instance
    framework_version="1.7-1",
    py_version="py3",
    hyperparameters={
        "max_depth": 5,
        "eta": 0.2,
        "gamma": 4,
        "min_child_weight": 6,
        "subsample": 0.8,
        "objective": "binary:logistic",
        "num_round": 200,
    },
    output_path=f"s3://{bucket}/demo/output",
)

# --- Step 3: Launch training ---
xgb_estimator.fit({
    "train": TrainingInput(train_path, content_type="csv"),
    "validation": TrainingInput(val_path, content_type="csv"),
})

# SageMaker provisions an instance, runs training,
# saves the model to S3, and terminates the instance.
print(f"Model artifact: {xgb_estimator.model_data}")
```
Deploying to a Real-Time Endpoint
Once a model is trained, deploying it to a real-time endpoint takes a single API call:
```python
# Deploy the trained model to a real-time endpoint
from sagemaker.serializers import CSVSerializer

predictor = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="my-xgb-endpoint",
    serializer=CSVSerializer(),  # serialize NumPy arrays as CSV
)

# Make predictions
import numpy as np
test_data = np.array([[25, 50000, 3], [45, 120000, 7]])
predictions = predictor.predict(test_data)
print(f"Predictions: {predictions}")

# IMPORTANT: Delete the endpoint when done to stop charges!
predictor.delete_endpoint()
```
Built-In Algorithms
SageMaker provides optimized implementations of common algorithms that are often faster and more cost-effective to run than custom implementations:
| Algorithm | Use Case | Key Advantage |
|---|---|---|
| XGBoost | Classification, regression | Distributed training, GPU support |
| Linear Learner | Linear/logistic regression | Highly optimized for large datasets |
| K-Means | Clustering | Distributed training |
| Image Classification | Image recognition | Built on ResNet, transfer learning |
| BlazingText | Text classification, Word2Vec | Orders of magnitude faster |
| DeepAR | Time series forecasting | Handles multiple related time series |
| Object Detection | Finding objects in images | Single-shot detection |
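One practical detail worth knowing: the built-in tabular algorithms such as XGBoost and Linear Learner expect CSV training data with the target in the first column and no header row. A minimal sketch of producing that layout (the column names here are hypothetical):

```python
# Prepare a CSV in the layout SageMaker's built-in tabular
# algorithms expect: label first, no header row, no index.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 45, 33],
    "income": [50000, 120000, 75000],
    "purchased": [0, 1, 0],  # target column
})

# Move the target to the first column, then write without headers
ordered = df[["purchased", "age", "income"]]
ordered.to_csv("train.csv", header=False, index=False)

print(open("train.csv").read())
```

A file in any other layout (header row present, or label in the last column) will train without error but learn the wrong target.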
SageMaker JumpStart
JumpStart is SageMaker's model hub: a collection of pre-trained models that you can deploy with one click or fine-tune on your own data.
Cost Optimization Strategies
| Strategy | Savings | Trade-off |
|---|---|---|
| Spot Instances for training | Up to 90% | Job may be interrupted |
| SageMaker Savings Plans | Up to 64% | 1-3 year commitment |
| Multi-model Endpoints | Share one endpoint across models | Slightly higher latency |
| Serverless Inference | Pay per request | Cold start latency |
| Right-sizing instances | Variable | Requires benchmarking |
| Managed Warm Pools | Reduce startup time | Ongoing instance cost |
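To see what the table's percentages mean in dollars, a quick sketch of the arithmetic. The hourly rate and discount figures below are illustrative assumptions, not current AWS list prices:

```python
# Illustrative comparison of training-cost strategies.
# Rates and discounts are assumptions for the arithmetic,
# not current AWS prices.
on_demand_rate = 3.825   # assumed GPU-instance hourly rate
training_hours = 10

on_demand_cost = on_demand_rate * training_hours

# Spot: up to ~90% off, but the job may be interrupted;
# assume a 70% observed discount here
spot_cost = on_demand_cost * (1 - 0.70)

# Savings Plan: assume a 60% discount for a multi-year commitment
plan_cost = on_demand_cost * (1 - 0.60)

print(f"On-demand:    ${on_demand_cost:.2f}")
print(f"Spot:         ${spot_cost:.2f}")
print(f"Savings Plan: ${plan_cost:.2f}")
```

For long training runs, spot plus checkpointing usually dominates; the savings plan only wins when utilization is steady enough to justify the commitment.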
Cost Trap: Idle Endpoints
A real-time endpoint bills for its provisioned instances every hour it is running, whether or not it serves any traffic. Forgotten endpoints are a common source of surprise SageMaker bills, so delete any endpoint you are not actively using.
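To make the trap concrete, a small sketch of what an idle endpoint accrues per month. The hourly rates below are assumptions for illustration, not current AWS list prices:

```python
# What an idle endpoint costs per month. Hourly rates are
# illustrative assumptions, not current AWS prices.
ASSUMED_HOURLY_RATE = {
    "ml.m5.large": 0.115,
    "ml.m5.xlarge": 0.23,
    "ml.g4dn.xlarge": 0.736,
}

def idle_monthly_cost(instance_type, instance_count=1, hours=730):
    """Cost of keeping an endpoint up, traffic or not."""
    return ASSUMED_HOURLY_RATE[instance_type] * instance_count * hours

for itype in ASSUMED_HOURLY_RATE:
    print(f"{itype}: ${idle_monthly_cost(itype):.2f}/month")
```

Even a modest CPU endpoint left running after a demo quietly adds up; a GPU endpoint does so several times faster.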
```python
# SageMaker Pipeline - Multi-Step ML Workflow
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.conditions import ConditionGreaterThan
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet
from sagemaker.processing import ScriptProcessor, ProcessingOutput

# Step 1: Data preprocessing
sklearn_processor = ScriptProcessor(
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    command=["python3"],
    image_uri=sagemaker.image_uris.retrieve(
        "sklearn", session.boto_region_name, "1.2-1"
    ),
)

preprocess_step = ProcessingStep(
    name="PreprocessData",
    processor=sklearn_processor,
    code="scripts/preprocess.py",
    outputs=[
        # Named output so the training step can reference its S3 URI
        ProcessingOutput(output_name="train",
                         source="/opt/ml/processing/train"),
    ],
)

# Step 2: Model training
train_step = TrainingStep(
    name="TrainModel",
    estimator=xgb_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=preprocess_step.properties.ProcessingOutputConfig
            .Outputs["train"].S3Output.S3Uri,
            content_type="csv",
        ),
    },
)

# Step 3: Conditional deployment (only if accuracy > 0.8).
# JsonGet reads from a PropertyFile; a complete pipeline would add
# an evaluation ProcessingStep that registers a PropertyFile named
# "evaluation" containing the metrics JSON.
condition = ConditionGreaterThan(
    left=JsonGet(
        step_name=train_step.name,
        property_file="evaluation",
        json_path="metrics.accuracy",
    ),
    right=0.8,
)

cond_step = ConditionStep(
    name="CheckAccuracy",
    conditions=[condition],
    if_steps=[],    # deploy steps would go here
    else_steps=[],  # alert/retrain steps
)

# Create and execute the pipeline
pipeline = Pipeline(
    name="my-ml-pipeline",
    steps=[preprocess_step, train_step, cond_step],
    sagemaker_session=session,
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()
print(f"Pipeline execution: {execution.arn}")
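The accuracy condition reads its value from a JSON report that an evaluation step writes. A minimal sketch of what such an evaluation script might emit, with a structure matching the `metrics.accuracy` JSON path used above (the file name and output directory are assumptions):

```python
# Sketch of an evaluation script's output: a JSON report whose
# structure matches the JsonGet path "metrics.accuracy".
import json
import os

def write_evaluation_report(accuracy, output_dir="."):
    """Write the metrics JSON an evaluation ProcessingStep would emit."""
    report = {"metrics": {"accuracy": accuracy}}
    path = os.path.join(output_dir, "evaluation.json")
    with open(path, "w") as f:
        json.dump(report, f)
    return path

# In a real pipeline, output_dir would be a processing output
# directory such as /opt/ml/processing/evaluation.
path = write_evaluation_report(0.87)
print(json.load(open(path))["metrics"]["accuracy"])
```

The pipeline would register this file as a PropertyFile on the evaluation step so that `JsonGet` can resolve `metrics.accuracy` at execution time.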