05 — Fine-tuning pipeline¶
Repeatable pipeline to fine-tune custom models on top of foundation models, with versioned datasets, deterministic training jobs, automated evaluations, and one-click rollback via the model registry.
Problem statement¶
Off-the-shelf foundation models are great until they aren't — specific tone, jargon, or output formats often need fine-tuning. You need a pipeline that (1) tracks datasets and their versions, (2) runs deterministic training jobs, (3) evaluates against a golden set, (4) registers new models, (5) promotes them through environments only when evals beat the incumbent.
This is mostly MLOps, not "AI" — the LLM is one component in a CI/CD pipeline for models.
Components¶
- Amazon S3. Raw and processed datasets; model artifacts; eval reports. The single source of truth.
- AWS Glue. ETL for raw → processed dataset transformations (deduping, formatting to the model's expected schema).
- Amazon SageMaker Training Jobs. Where fine-tuning runs. Either custom containers or SageMaker JumpStart for popular base models.
- Amazon SageMaker Model Registry. Tracks model versions with approval gates between staging and production.
- MLflow on Amazon SageMaker. Experiment tracking — metrics, parameters, artifacts per training run.
- AWS Step Functions. Orchestrates: ingest → ETL → train → evaluate → register → (optional) deploy.
- Amazon EventBridge Scheduler. Cron triggers for periodic re-training.
- Amazon SageMaker Endpoints (real-time) or Batch Transform (offline) — depending on inference pattern.
- Amazon CloudWatch. Job-level metrics + custom eval metrics.
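As a sketch of the scheduled trigger: EventBridge Scheduler can invoke the Step Functions state machine directly. The payload below is what you would pass to the `CreateSchedule` API; the ARNs, schedule name, and cron expression are placeholders, not values from this project.

```python
import json

# Hypothetical ARNs -- replace with your own.
PIPELINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:finetune-pipeline"
SCHEDULER_ROLE_ARN = "arn:aws:iam::123456789012:role/finetune-scheduler"

def build_retrain_schedule(dataset_version: str) -> dict:
    """Request payload for EventBridge Scheduler's CreateSchedule API,
    triggering the pipeline weekly (the M sizing in the cost table)."""
    return {
        "Name": "weekly-finetune",
        "ScheduleExpression": "cron(0 2 ? * MON *)",  # Mondays 02:00 UTC
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": PIPELINE_ARN,
            "RoleArn": SCHEDULER_ROLE_ARN,
            # Pinning the dataset version in the execution input keeps runs reproducible.
            "Input": json.dumps({"dataset_version": dataset_version}),
        },
    }

schedule = build_retrain_schedule("v2025-01-15")
```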
Diagram¶
flowchart TB
subgraph DataPlane[Data plane]
Raw[(S3 - raw datasets)]
Glue[AWS Glue ETL]
Curated[(S3 - curated)]
Golden[(S3 - golden eval set)]
end
subgraph Pipeline[Step Functions pipeline]
ETL[Run Glue job]
Train[Train: SageMaker Training Job]
Eval[Eval: SageMaker Processing Job]
Compare{Better than<br/>incumbent?}
Register[Register in Model Registry<br/>pending approval]
Reject[Reject + emit alarm]
end
subgraph Registry[Model registry]
MR[(SageMaker Model Registry)]
MLF[MLflow tracking]
end
Trigger([Schedule or manual]) --> ETL
Raw --> ETL --> Curated
Curated --> Train
Train --> Eval
Golden --> Eval
Eval --> Compare
Compare -->|yes| Register
Compare -->|no| Reject
Register --> MR
Train --> MLF
Eval --> MLF
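The "Better than incumbent?" gate deserves to be explicit code rather than a judgment call. A minimal sketch, assuming the eval job emits a flat dict of metrics; the metric names, the 1-point margin, and the guardrail list are illustrative choices, not project values:

```python
def beats_incumbent(candidate: dict, incumbent: dict,
                    primary: str = "exact_match",
                    min_gain: float = 0.01,
                    guardrails: tuple = ("toxicity",)) -> bool:
    """Decision logic behind the 'Better than incumbent?' choice state.

    The candidate must improve the primary metric by at least `min_gain`
    AND not regress on any guardrail metric (lower is better there).
    A margin avoids promoting models that win by noise alone.
    """
    if candidate[primary] < incumbent[primary] + min_gain:
        return False
    return all(candidate[g] <= incumbent[g] for g in guardrails)
```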
Decisions¶
D1 — SageMaker, not Bedrock fine-tuning¶
Context. Bedrock supports fine-tuning a small set of models. SageMaker supports anything you can put in a container.
Decision. SageMaker by default. Bedrock fine-tuning is a candidate when (a) the base model is supported, (b) you do not need custom training code, and (c) you want managed deployment.
Alternatives. Bedrock fine-tuning is simpler when applicable, but custom Bedrock models require Provisioned Throughput for inference, which means always-on hosting cost.
Consequences. More flexibility, more responsibility for instance types and training scripts.
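"More responsibility for instance types and training scripts" concretely means owning the `CreateTrainingJob` request. A hedged sketch of that payload, wiring in the Spot and checkpointing settings discussed in the Well-Architected review; bucket names, image URI, and hyperparameters are placeholders:

```python
def training_job_request(job_name: str, dataset_version: str,
                         git_sha: str, image_uri: str, role_arn: str) -> dict:
    """Request payload for SageMaker's CreateTrainingJob API (M sizing)."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {"TrainingImage": image_uri,
                                   "TrainingInputMode": "File"},
        # The API requires hyperparameter values to be strings.
        "HyperParameters": {"epochs": "3", "learning_rate": "2e-5"},
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://curated-datasets/{dataset_version}/",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": "s3://model-artifacts/"},
        "ResourceConfig": {"InstanceType": "ml.g5.12xlarge",
                           "InstanceCount": 1, "VolumeSizeInGB": 200},
        # Spot capacity needs checkpointing so interrupted jobs can resume.
        "EnableManagedSpotTraining": True,
        "CheckpointConfig": {"S3Uri": f"s3://model-artifacts/checkpoints/{job_name}/"},
        "StoppingCondition": {"MaxRuntimeInSeconds": 43200,
                              "MaxWaitTimeInSeconds": 86400},
        # Tags make every run reproducible (dataset_version, git_sha).
        "Tags": [{"Key": "dataset_version", "Value": dataset_version},
                 {"Key": "git_sha", "Value": git_sha}],
    }

req = training_job_request("ft-001", "v2025-01-15", "abc1234",
                           "123456789012.dkr.ecr.eu-west-1.amazonaws.com/ft:latest",
                           "arn:aws:iam::123456789012:role/finetune-training")
```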
D2 — Model registry approval gate, not auto-deploy¶
Context. "Beat the incumbent on the golden set" is a necessary condition, not a sufficient one: aggregate metrics can hide edge-case regressions, and user-facing models carry governance requirements.
Decision. Promotion is gated on the SageMaker Model Registry Approved status, which requires a human; the pipeline only ever registers versions as pending approval.
Alternatives. Full CD with auto-deploy on metric improvement: fine for low-risk use cases, dangerous for user-facing models.
Consequences. Adds latency between training and production. The right trade-off when production matters.
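The human step maps to SageMaker's `UpdateModelPackage` API. A minimal sketch of the payload a reviewer (or a tool acting on an explicit human decision) would submit; the ARN and reviewer field are illustrative:

```python
def approval_request(model_package_arn: str, reviewer: str, approve: bool) -> dict:
    """Payload for SageMaker's UpdateModelPackage API. The pipeline itself
    never calls this -- it registers versions as PendingManualApproval and
    stops; flipping the status is reserved for a human decision."""
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": "Approved" if approve else "Rejected",
        "ApprovalDescription": f"Reviewed by {reviewer}",
    }
```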
D3 — Golden eval set is versioned and immutable¶
Context. If the eval set drifts, you cannot compare model versions trained months apart.
Decision. Golden eval set lives in S3 with Object Lock (compliance mode). Updates create a new version with an explicit name; old versions stay.
Alternatives. Mutable golden set + Git versioning — fine for small text sets, ugly for binary data.
Consequences. Slight storage cost. Trustworthy comparisons.
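One simple way to make "explicit name, old versions stay" self-enforcing is content-addressed keys: embed a digest of the file in the object key, so a changed eval set cannot silently reuse an old name. A sketch (the key layout is an assumption, not this project's convention):

```python
import hashlib

def golden_set_key(content: bytes, label: str) -> str:
    """Versioned, content-addressed S3 key for the golden eval set.

    The digest in the key makes silent drift impossible: any change to
    the file produces a new key, and Object Lock keeps old versions
    readable for comparisons across model generations.
    """
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"golden/{label}-{digest}.jsonl"

key = golden_set_key(b'{"q": "example", "a": "expected"}\n', "v1")
```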
D4 — Step Functions, not SageMaker Pipelines¶
Context. SageMaker Pipelines is purpose-built for ML. Step Functions is general-purpose.
Decision. Step Functions. Reason: Step Functions is more flexible (we have non-ML steps — emit EventBridge, push to a Slack webhook, gate on a human approval), and the team already uses it for other workflows.
Alternatives. SageMaker Pipelines — better integrated with the Studio UI, but a second orchestrator to learn and maintain.
Consequences. Less SageMaker-UI integration. Lower cognitive load.
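For concreteness, the diagram's happy path as an Amazon States Language skeleton, built as a Python dict. The `.sync` service integrations are real Step Functions patterns; state names and the eval-output path (`$.eval.beats_incumbent`) are assumptions for this sketch:

```python
import json

STATE_MACHINE = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {"Type": "Task",
                       "Resource": "arn:aws:states:::glue:startJobRun.sync",
                       "Next": "Train"},
        "Train": {"Type": "Task",
                  "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                  # Exponential backoff on transient training failures.
                  "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                             "IntervalSeconds": 60, "MaxAttempts": 2,
                             "BackoffRate": 2.0}],
                  "Next": "Evaluate"},
        "Evaluate": {"Type": "Task",
                     "Resource": "arn:aws:states:::sagemaker:createProcessingJob.sync",
                     "Next": "BetterThanIncumbent"},
        "BetterThanIncumbent": {"Type": "Choice",
                                "Choices": [{"Variable": "$.eval.beats_incumbent",
                                             "BooleanEquals": True,
                                             "Next": "Register"}],
                                "Default": "Reject"},
        "Register": {"Type": "Task",
                     "Resource": "arn:aws:states:::aws-sdk:sagemaker:createModelPackage",
                     "End": True},
        "Reject": {"Type": "Fail", "Error": "EvalRegression"},
    },
}

definition = json.dumps(STATE_MACHINE)  # what you'd pass to CreateStateMachine
```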
Cost analysis¶
The pipeline itself is cheap; training jobs dominate. Costs are highly model-dependent.
| Sizing | Re-training frequency | Approx. monthly USD (pipeline + training) |
|---|---|---|
| S — prototype | monthly, ml.g5.xlarge, 8h | ~$120 |
| M — production | weekly, ml.g5.12xlarge, 12h | ~$2,800 |
| L — multi-model | daily, ml.g5.48xlarge, 6h, 3 models | ~$22,000 |
Inputs (M sizing):
- SageMaker Training: 4 runs × 12 h × ml.g5.12xlarge @ ~$7/hr = ~$340; with ~50% Spot savings → ~$170
- SageMaker Processing (eval): 4 × 1 h × ml.g5.xlarge → ~$5
- Glue ETL: 4 runs × ~30 min of DPU time → ~$20
- S3: 200 GB datasets + model artifacts → ~$5
- Step Functions transitions: trivial
- MLflow on SageMaker: small tracking server → ~$50
- Endpoint hosting (if real-time): 1 × ml.g5.xlarge × 730 h → ~$2,200
If the endpoint is on-demand (Serverless Inference) instead of always-on, costs drop dramatically for low-traffic models.
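The training line items reduce to simple arithmetic worth keeping in a sizing script. A sketch using the ~$7/hr rate quoted above (an approximation; check current SageMaker pricing for your region):

```python
def monthly_training_cost(runs: int, hours: float, hourly_rate: float,
                          spot_discount: float = 0.0) -> float:
    """Back-of-the-envelope monthly training cost.

    spot_discount is the fraction saved with managed Spot capacity
    (the document assumes ~50% when checkpointing is enabled).
    """
    return runs * hours * hourly_rate * (1 - spot_discount)

on_demand = monthly_training_cost(4, 12, 7.0)        # the "~$340" line above
with_spot = monthly_training_cost(4, 12, 7.0, 0.5)   # the "~$170" line above
```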
Well-Architected review¶
Operational excellence. Training jobs are tagged with dataset_version, base_model, git_sha. The Model Registry version is linked to those tags so anyone can reproduce a model.
Security. Training role has read-only on the dataset bucket and write-only on the model artifact bucket — separate buckets. KMS encryption everywhere. SageMaker network isolation enabled (no internet access for training jobs).
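The read-only/write-only split above is easier to review as policy documents. A sketch with hypothetical bucket names (your Terraform would render these from variables):

```python
# Read-only on the dataset bucket: list + get, nothing else.
READ_DATASETS = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::curated-datasets",
                     "arn:aws:s3:::curated-datasets/*"],
    }],
}

# Write-only on the artifact bucket: deliberately no GetObject, so a
# compromised training job cannot read other models' artifacts.
WRITE_ARTIFACTS = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": ["arn:aws:s3:::model-artifacts/*"],
    }],
}
```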
Reliability. Step Functions retries the training task with exponential backoff. Failures emit to EventBridge with payload — a separate Slack consumer pages the ML team.
Performance efficiency. Spot capacity for training: 50–70% savings when checkpointing is enabled. Right-size instance types — bigger isn't always faster for fine-tuning.
Cost optimization. Use SageMaker Serverless Inference for intermittent traffic, accepting cold starts in exchange for not paying for idle capacity. Auto-tag everything for cost allocation.
Sustainability. Train on the smallest model that meets the eval bar. Re-train only when data drift triggers it (compute drift score as part of ETL).
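One common choice for the drift score computed during ETL is the Population Stability Index over binned feature distributions. A sketch; the 0.1/0.25 thresholds are a widely used convention, not a guarantee, and PSI itself is an assumption here since the document does not name a drift metric:

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index over pre-binned frequency counts.

    Rule of thumb: PSI < 0.1 is stable, > 0.25 is significant drift
    (tune thresholds for your data before using them as a re-train trigger).
    """
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_pct = max(e / e_total, 1e-6)  # floor avoids log(0) on empty bins
        a_pct = max(a / a_total, 1e-6)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score
```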
Trade-offs¶
Use this when:
- Off-the-shelf prompting + RAG doesn't get you to the quality bar.
- You have at least a few hundred high-quality training examples and a meaningful eval set.
- Output format / domain language matter enough to justify ongoing pipeline ownership.
Do NOT use this when:
- You haven't seriously tried prompt engineering + few-shot + RAG first. Fine-tuning is the last lever, not the first.
- Your training data is < 100 examples — adapt prompts, not weights.
- You only have a one-off need — a few thousand dollars in Bedrock calls beats a permanent pipeline.
Terraform skeleton¶
See terraform/ — buckets with Object Lock for the golden set, IAM roles, Step Functions skeleton. Training scripts live in your own ML repo and are referenced by S3 URI.