The Quick Version
Install Ultralytics, organize your images and labels in YOLO format, point a data.yaml at them, and run one command. Here’s the entire training loop in under 10 lines:
```python
from ultralytics import YOLO

# Start from pretrained COCO weights and fine-tune on your dataset
model = YOLO("yolo11s.pt")

model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    name="safety_detector_v1",  # results land in runs/detect/safety_detector_v1/
)
```
That’s it. Ultralytics handles augmentation, learning rate scheduling, mixed precision, and checkpointing automatically. Your best weights land in runs/detect/safety_detector_v1/weights/best.pt.
Preparing Your Dataset
YOLO format is dead simple. Each image gets a matching .txt file with one bounding box per line. The format is:
```
class_id x_center y_center width height
```
All coordinates are normalized to [0, 1] relative to image dimensions. A bounding box for class 0 (say, “hardhat”) centered at 60% across and 30% down the image, spanning 20% width and 15% height looks like:
```
0 0.60 0.30 0.20 0.15
```
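The pixel-to-normalized conversion is where most hand-rolled labeling scripts go wrong. Here's a minimal helper (hypothetical, not part of Ultralytics) that emits a YOLO label line from pixel-space corners:

```python
def to_yolo_line(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to a normalized YOLO label line."""
    cx = (x_min + x_max) / 2 / img_w   # box center x, normalized
    cy = (y_min + y_max) / 2 / img_h   # box center y, normalized
    w = (x_max - x_min) / img_w        # box width, normalized
    h = (y_max - y_min) / img_h        # box height, normalized
    return f"{cls_id} {cx:.4f} {cy:.4f} {w:.4f} {h:.4f}"

# The hardhat example above, on a 1000x800 image:
print(to_yolo_line(0, 500, 180, 700, 300, 1000, 800))
# → 0 0.6000 0.3000 0.2000 0.1500
```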
Directory Structure
Organize your data like this:
```
dataset/
├── images/
│   ├── train/
│   │   ├── img_001.jpg
│   │   └── ...
│   └── val/
│       └── ...
└── labels/
    ├── train/
    │   ├── img_001.txt
    │   └── ...
    └── val/
        └── ...
```
The key rule: label filenames must match image filenames exactly (minus the extension). img_001.jpg pairs with img_001.txt. Get this wrong and Ultralytics silently treats images as having no annotations — you’ll see training “work” but metrics stay near zero.
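Because the failure mode is silent, it's worth checking the pairing yourself before training. A small sketch (a hypothetical helper, not part of Ultralytics):

```python
from pathlib import Path

def unmatched_stems(images_dir, labels_dir, img_exts=(".jpg", ".jpeg", ".png")):
    """Return (image stems missing labels, label stems missing images)."""
    imgs = {p.stem for p in Path(images_dir).iterdir() if p.suffix.lower() in img_exts}
    lbls = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(imgs - lbls), sorted(lbls - imgs)
```

Run it on images/train against labels/train (and again for val); anything in the first list will silently train as a background image.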
The data.yaml File
```yaml
path: /absolute/path/to/dataset
train: images/train
val: images/val

names:
  0: hardhat
  1: vest   # example classes; order must match your label files
```
Use absolute paths for path to avoid headaches. The names dict maps class IDs to human-readable labels. Order matters — it must match the class IDs in your .txt label files.
Choosing a Model Size
Ultralytics offers five sizes for both YOLOv8 and YOLO11:
| Model | Params | Speed (T4 GPU) | mAP (COCO) | Use Case |
|---|---|---|---|---|
| yolo11n | 2.6M | ~1.5ms | 39.5 | Edge devices, real-time |
| yolo11s | 9.4M | ~2.5ms | 47.0 | Balanced speed/accuracy |
| yolo11m | 20.1M | ~4.7ms | 51.5 | General production |
| yolo11l | 25.3M | ~6.2ms | 53.4 | High accuracy needed |
| yolo11x | 56.9M | ~11.3ms | 54.7 | Maximum accuracy |
My recommendation: start with yolo11s or yolo11m. The nano model loses too much accuracy for most real-world tasks, and the large/xlarge models rarely justify the extra compute unless you have a genuinely hard detection problem with small or overlapping objects.
Training with the CLI
If you prefer the command line, the yolo CLI works great:
```bash
yolo detect train data=data.yaml model=yolo11s.pt epochs=100 imgsz=640 batch=16 name=safety_detector_v1
```
This is functionally identical to the Python API. Every argument you pass here maps to a parameter in model.train().
Hyperparameter Tuning
Don’t guess hyperparameters — use Ultralytics’ built-in tuner. It runs a genetic algorithm over your hyperparameter space:
```python
from ultralytics import YOLO

model = YOLO("yolo11s.pt")

# Genetic-algorithm search: short training runs, many candidates
model.tune(
    data="data.yaml",
    epochs=30,        # epochs per candidate run
    iterations=300,   # number of candidates to evaluate
    plots=False,
    save=False,
    val=False,
)
```
The tuner adjusts learning rate, momentum, augmentation intensity, and more. It’s expensive — budget 30x your normal training time — but it regularly finds 2-5% mAP improvements that manual tuning misses.
Parameters Worth Tweaking Manually
A few knobs have outsized impact:
- imgsz: Bigger images catch smaller objects. Try 1280 if you're missing detections on small targets.
- batch: As large as your GPU allows. Larger batches give more stable gradients.
- lr0: Starting learning rate. Lower it (0.001) if training is unstable. Raise it (0.02) with large batch sizes.
- mosaic: Mosaic augmentation intensity (default 1.0). Disabling it for the final epochs is a known trick that stabilizes final metrics.
- close_mosaic: Number of final epochs without mosaic augmentation. Default is 10, which is usually fine.
Evaluating Your Model
After training, check your metrics:
```python
from ultralytics import YOLO

model = YOLO("runs/detect/safety_detector_v1/weights/best.pt")
metrics = model.val()  # evaluates on the val split from data.yaml

print(metrics.box.map50)  # mAP at IoU 0.50
print(metrics.box.map)    # mAP averaged over IoU 0.50-0.95
print(metrics.box.mp)     # mean precision
print(metrics.box.mr)     # mean recall
```
What the numbers mean:
- mAP50: Mean average precision at IoU 0.50. Your primary metric for most applications. Aim for 0.85+ on well-labeled data.
- mAP50-95: Averaged across IoU thresholds from 0.50 to 0.95. Much stricter — 0.55+ is solid.
- Precision: Of all detections, how many were correct. High precision = few false positives.
- Recall: Of all ground truth objects, how many did the model find. High recall = few missed objects.
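All four metrics hinge on IoU, the overlap test that decides whether a detection counts as a match. A minimal sketch for axis-aligned boxes in (x_min, y_min, x_max, y_max) form:

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two boxes offset by half their width overlap with IoU = 1/3,
# which fails the 0.50 threshold behind mAP50
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```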
If precision is high but recall is low, your model is conservative — lower the confidence threshold. If recall is high but precision is low, you’re getting too many false positives — raise the threshold or add more negative examples to training.
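The trade-off is easy to see with a toy threshold sweep (fabricated detections for illustration, not output from a real model):

```python
def precision_recall(dets, n_gt, conf):
    """dets: (confidence, is_true_positive) pairs; n_gt: ground-truth count."""
    kept = [tp for c, tp in dets if c >= conf]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 1.0
    recall = tp / n_gt
    return precision, recall

# Fabricated detections: the high-confidence ones are mostly correct
dets = [(0.9, True), (0.8, True), (0.7, True), (0.6, False),
        (0.4, True), (0.3, False), (0.2, False)]
for conf in (0.25, 0.5, 0.75):
    p, r = precision_recall(dets, n_gt=5, conf=conf)
    print(f"conf={conf}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold keeps only high-confidence detections (precision up, recall down); lowering it admits more hits and more mistakes.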
Exporting Your Model
Training produces a PyTorch .pt file. For production, export to a runtime-optimized format:
```python
from ultralytics import YOLO

model = YOLO("runs/detect/safety_detector_v1/weights/best.pt")

model.export(format="onnx", dynamic=True)  # dynamic batch size and image dims
model.export(format="engine")              # TensorRT (requires an NVIDIA GPU)
model.export(format="coreml")              # Apple devices
```
For server-side inference, TensorRT is the clear winner — expect 2-4x speedup over PyTorch on the same GPU. For edge and mobile, ONNX Runtime or CoreML depending on your target hardware. The dynamic=True flag on ONNX export lets you vary batch size and image dimensions at inference time, which is critical for production flexibility.
Common Errors
No labels found during training
Your label .txt files aren’t where Ultralytics expects them. Double-check that labels/ is a sibling to images/ and filenames match exactly. Also verify data.yaml paths are correct — use absolute paths.
Training loss goes to NaN
Usually caused by a learning rate that's too high or corrupted label files. Check for label lines with coordinates outside [0, 1] or negative values. A quick validation pass (path assumed to match the layout above):

```python
from pathlib import Path

for txt in Path("dataset/labels").rglob("*.txt"):
    for i, line in enumerate(txt.read_text().splitlines(), 1):
        parts = line.split()
        if len(parts) != 5:
            print(f"{txt}:{i} malformed line: {line!r}")
        elif any(not 0.0 <= float(v) <= 1.0 for v in parts[1:]):
            print(f"{txt}:{i} coordinate out of range: {line!r}")
```
Low mAP despite good training loss
This almost always means label quality issues. Common causes: inconsistent labeling (sometimes you label an object, sometimes you don’t), bounding boxes that are too tight or too loose, or class confusion between similar categories. Review a sample of your annotations in a tool like Label Studio or CVAT before blaming the model.
CUDA out of memory
Reduce batch size or imgsz. If you're on a 12GB GPU, batch=8 with imgsz=640 using yolo11m is a safe starting point. You can also pass batch=-1 to let Ultralytics auto-select the largest batch that fits in memory (AutoBatch). Note that the trainer already accumulates gradients internally toward a nominal batch size, so a small per-step batch still trains stably, just a bit slower.
Export fails with opset version error
Bump the opset parameter. ONNX opset 17 works for most YOLO models. If you’re targeting an older inference runtime, try opset 13 or 14, but some operations may not be supported.
Tips from Production
A few things I’ve learned shipping YOLO models to production:
- Start with fewer classes. A 3-class model will outperform a 20-class model on the same data budget. Merge similar classes early, split them later when you have enough examples.
- 500 images per class is the minimum. Below that, you’re fighting data scarcity. Above 2000 per class, you’ll see diminishing returns unless the class has high visual variance.
- Freeze the backbone for small datasets. If you have under 1000 images total, freeze the first 10 layers with freeze=10. This prevents overfitting the feature extractor and typically adds 3-5 mAP points on small datasets.
- Always validate on a separate set. Never train and validate on the same images. Sounds obvious, but I've seen teams accidentally leak validation images into training through augmentation pipelines.
- Version your datasets. When you retrain with more data, keep the old dataset around. If the new model regresses on a specific class, you need to compare annotations.
Related Guides
- How to Build Multi-Object Tracking with DeepSORT and YOLOv8
- How to Build Video Analytics Pipelines with OpenCV and Deep Learning
- How to Build a Product Defect Detector with YOLOv8 and OpenCV
- How to Detect Objects in Images with YOLOv8
- How to Build a Wildlife Camera Trap Classifier with YOLOv8 and FastAPI
- How to Build a Lane Detection Pipeline with OpenCV and YOLO
- How to Build a Visual Grounding Pipeline with Grounding DINO
- How to Build Real-Time Object Segmentation with SAM 2 and WebSocket
- How to Build a Video Shot Boundary Detection Pipeline with PySceneDetect
- How to Build a Visual Inspection Pipeline with Anomaly Detection