The Core Workflow
ML CI/CD with GitHub Actions boils down to this: every push or pull request triggers a workflow that trains your model, evaluates it against a baseline, and posts results as a PR comment. If metrics pass your threshold, the model gets pushed to a registry.
Here’s the minimal workflow file that does all three. Create .github/workflows/ml-pipeline.yml:
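A minimal sketch of that file might look like the following. The script names (`train.py`, `register_model.py`) and the metrics file are placeholders for your own code; adjust versions and steps to taste:

```yaml
name: ml-pipeline
on: [push, pull_request]

permissions:
  contents: read
  pull-requests: write

jobs:
  train-and-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - uses: iterative/setup-cml@v2
      - name: Train and evaluate
        run: |
          pip install -r requirements.txt
          python train.py   # assumed to write metrics.txt
      - name: Post results to the PR
        if: github.event_name == 'pull_request'
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Model Metrics" > report.md
          cat metrics.txt >> report.md
          cml comment create report.md
      - name: Push model to registry
        if: github.ref == 'refs/heads/main'
        run: python register_model.py   # placeholder promotion script
```

The `if:` guards keep PR comments on pull requests and registry pushes on main, matching the flow described above.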
The permissions block is critical. Without pull-requests: write, you’ll hit this error immediately:
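The exact wording varies with the action doing the commenting, but the GitHub API error typically reads:

```
RequestError [HttpError]: Resource not accessible by integration
```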
That error means the GITHUB_TOKEN doesn’t have permission to comment on PRs. The fix is always the same: add the permissions block at the top of your workflow, or go to Settings > Actions > General > Workflow permissions and switch to “Read and write.”
Setting Up Data Versioning with DVC
You don’t want to store training data or model artifacts in Git. DVC (Data Version Control) tracks them in remote storage while Git tracks the metadata. This way your CI pipeline can pull exactly the right dataset version for each commit.
Install DVC and initialize it in your repo:
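A typical setup sequence looks like this; the bucket name is a placeholder, and the `[s3]` extra should match whichever remote you use:

```bash
pip install "dvc[s3]"            # include the extra for your remote (s3, gs, azure, ...)
dvc init                         # creates .dvc/ and stages it for commit
git commit -m "Initialize DVC"
# point DVC at remote storage (-d makes it the default remote)
dvc remote add -d storage s3://your-bucket/dvc-store
git commit .dvc/config -m "Configure DVC remote"
```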
Track your dataset and model output:
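For example, with a dataset at `data/train.csv` and a trained model at `models/model.pkl` (both paths are placeholders):

```bash
dvc add data/train.csv models/model.pkl   # contents go under DVC control
# DVC writes small .dvc pointer files; Git tracks those, not the data
git add data/train.csv.dvc models/model.pkl.dvc data/.gitignore models/.gitignore
git commit -m "Track data and model with DVC"
dvc push                                  # upload the actual bytes to the remote
```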
Now update the workflow to pull DVC-tracked data before training:
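A step like this, placed before the training step and fed credentials from repository secrets, is one way to do it:

```yaml
      - name: Pull DVC-tracked data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          pip install "dvc[s3]"
          dvc pull    # fetches exactly the versions the current commit's .dvc files point at
```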
Add your AWS credentials (or GCS/Azure equivalents) as repository secrets under Settings > Secrets and variables > Actions. DVC supports S3, GCS, Azure Blob, SSH, and HTTP remotes.
Automated Model Evaluation with CML
CML (Continuous Machine Learning) by Iterative is the tool that ties GitHub Actions to ML workflows. It posts training metrics, plots, and images directly to your PR comments so reviewers can see exactly what changed.
Here’s a training script that outputs metrics CML can pick up:
Update your workflow to include the plot in the CML report:
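The reporting step might look like this, assuming the training step produced `metrics.txt` and `confusion_matrix.png`:

```yaml
      - name: Create CML report
        if: github.event_name == 'pull_request'
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Model Metrics" > report.md
          cat metrics.txt >> report.md
          echo "## Confusion Matrix" >> report.md
          cml asset publish confusion_matrix.png --md >> report.md
          cml comment create report.md
```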
The cml asset publish command uploads the image and returns a markdown image reference. Your PR comment now shows the confusion matrix directly in GitHub.
Model Registry Integration
Once your model passes evaluation, push it to a registry. MLflow’s model registry is the most common choice. Add a promotion step that only runs on the main branch:
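One shape for that step, with `register_model.py` as a placeholder script and the tracking URI stored as a secret:

```yaml
      - name: Register model in MLflow
        if: github.ref == 'refs/heads/main'
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: python register_model.py
```

Inside that script, the registration itself can be as little as a call to `mlflow.register_model(...)`, or logging the model with a `registered_model_name` argument.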
The if: github.ref == 'refs/heads/main' guard prevents registration on PRs. You only want models promoted to the registry after they’ve been reviewed and merged.
Handling GPU Training in CI
GitHub-hosted runners don’t have GPUs. For GPU-intensive training, you have three options:
Self-hosted runners are the simplest. Install the GitHub runner agent on a GPU machine and tag your workflow:
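The job then targets the labels you assigned when registering the runner; `gpu` here is whatever label you chose:

```yaml
jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - run: python train.py   # placeholder training script
```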
CML cloud provisioning spins up cloud instances on demand. Add this to your workflow:
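A sketch of the two-job pattern, with the region and instance type as example values (note that `cml runner launch` needs a personal access token with repo scope, not the default `GITHUB_TOKEN`):

```yaml
jobs:
  launch-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v2
      - name: Provision a GPU instance
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
            --cloud=aws \
            --cloud-region=us-west-2 \
            --cloud-type=g4dn.xlarge \
            --labels=cml-gpu
  train:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    steps:
      - uses: actions/checkout@v4
      - run: python train.py   # placeholder training script
```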
GitHub’s larger runners now include GPU options on Team and Enterprise plans. You define a label when creating the GPU runner in your organization settings, then reference that label with runs-on.
Common Failures and Fixes
Out of disk space during training. GitHub-hosted runners have about 14 GB free. Large datasets or model checkpoints fill that up fast.
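The failure usually surfaces as the standard errno 28 message, for example:

```
OSError: [Errno 28] No space left on device
```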
Fix it by cleaning up pre-installed tools before your training step:
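A commonly used cleanup step removes the largest pre-installed toolchains; the paths below are the usual offenders on ubuntu runners:

```yaml
      - name: Free disk space
        run: |
          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android
          sudo rm -rf "$AGENT_TOOLSDIRECTORY"
          df -h /    # confirm how much space you reclaimed
```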
DVC pull fails with authentication errors. This usually means your secrets aren’t set or have the wrong scope.
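With an S3 remote, the failure often looks something like this (the second half is the underlying boto3 message):

```
ERROR: failed to pull data from the cloud - Unable to locate credentials
```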
Double-check that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in repository secrets (not environment secrets, unless you’ve configured environments). For GCS, use a service account JSON stored as a secret.
Workflow hangs during model training. GitHub Actions has a 6-hour job timeout by default. Set something shorter so you don’t burn minutes:
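A job-level timeout is a one-line change; 60 minutes here is just an example value:

```yaml
jobs:
  train:
    runs-on: ubuntu-latest
    timeout-minutes: 60   # fail fast instead of hitting the 6-hour default
```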
Putting It All Together
A production-grade ML CI/CD pipeline combines all of these pieces: DVC for data versioning, CML for automated reporting, threshold-based evaluation gates, and model registry promotion on merge. The workflow runs on every PR, gives reviewers concrete metrics to evaluate, and automatically ships approved models to production.
Start with the basic train-and-evaluate workflow. Add DVC when your data gets too large for Git. Layer in CML reporting when you want metrics in PR comments. Add registry integration last, once you’ve established what “good enough” means for your model.
Related Guides
- How to Build a Model CI Pipeline with GitHub Actions and DVC
- How to Version and Deploy Models with MLflow Model Registry
- How to Implement Canary Deployments for ML Models
- How to Build a Model Dependency Scanner and Vulnerability Checker
- How to Build a Model Feature Store Pipeline with Redis and FastAPI
- How to Build a Model Metadata Store with SQLite and FastAPI
- How to A/B Test LLM Prompts and Models in Production
- How to Build Blue-Green Deployments for ML Models
- How to Serve ML Models with BentoML and Build Prediction APIs
- How to Build a Shadow Deployment Pipeline for ML Models