The Core Idea in 30 Seconds
Active learning lets your model pick which samples to label next, instead of labeling everything randomly. You train on a tiny labeled set, ask the model which unlabeled points it's most confused about, label those, retrain, and repeat. In practice this can cut labeling costs by 50-80%.
The modAL library wraps scikit-learn estimators with active learning query strategies out of the box. It’s the fastest path from zero to a working active learning loop in Python.
Pool-Based Active Learning Loop
The most common setup is pool-based active learning. You start with a large pool of unlabeled data, a small labeled seed set, and a model. Each iteration, the model scores every unlabeled sample, picks the most informative ones, and you (or an oracle) provide labels for just those.
After 50 queries you'll typically see accuracy jump from around 0.55-0.65 to 0.80+, depending on the dataset. That's 70 total labeled samples (a 20-sample seed plus 50 queries) doing the work of hundreds labeled randomly.
Query Strategies: Picking the Right One
The query strategy decides which samples the model asks about. This is where active learning either shines or falls flat. modAL ships with three uncertainty-based strategies, and they’re not interchangeable.
Least Confidence
Picks the sample whose most likely class has the lowest probability, i.e. where even the model's best guess is weak. A good default for multiclass problems.
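modAL implements this as `uncertainty_sampling`; the underlying measure is simple enough to sketch in plain NumPy. The probabilities below are made up for illustration:

```python
import numpy as np

def least_confidence(proba):
    """Utility = 1 - P(most likely class); higher means more uncertain."""
    return 1.0 - proba.max(axis=1)

# Three pool samples with predicted class distributions:
proba = np.array([
    [0.90, 0.05, 0.05],  # confident -> utility 0.1
    [0.40, 0.35, 0.25],  # unsure    -> utility 0.6
    [0.50, 0.30, 0.20],  #           -> utility 0.5
])
query_idx = np.argmax(least_confidence(proba))  # picks row 1
```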
Margin Sampling
Looks at the gap between the top two predicted class probabilities. A small margin means the model can’t decide between two classes. This is my preferred strategy for most classification tasks because it targets the actual decision boundary rather than general confusion.
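modAL ships this as `margin_sampling`; the measure itself, sketched in NumPy with made-up probabilities:

```python
import numpy as np

def margin(proba):
    """Gap between the top two class probabilities; smaller = harder."""
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]

proba = np.array([
    [0.90, 0.05, 0.05],  # margin 0.85: easy
    [0.40, 0.35, 0.25],  # margin 0.05: right on a decision boundary
    [0.50, 0.30, 0.20],  # margin 0.20
])
query_idx = np.argmin(margin(proba))  # picks row 1
```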
Entropy Sampling
Computes the Shannon entropy of the predicted class distribution. High entropy means the model is spread across many classes. Use this when you have many classes (10+) and want to capture overall confusion rather than pairwise ambiguity.
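modAL provides this as `entropy_sampling`; a NumPy sketch of the measure, again with illustrative probabilities:

```python
import numpy as np

def predictive_entropy(proba):
    """Shannon entropy of each row's predicted class distribution."""
    p = np.clip(proba, 1e-12, 1.0)  # avoid log(0) for hard-zero probabilities
    return -(p * np.log(p)).sum(axis=1)

proba = np.array([
    [0.90, 0.05, 0.05],  # peaked distribution: low entropy
    [0.34, 0.33, 0.33],  # near-uniform: high entropy
])
query_idx = np.argmax(predictive_entropy(proba))  # picks row 1
```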
My recommendation: start with margin sampling. It consistently outperforms least confidence across datasets I’ve worked with because it focuses on the hardest decision boundary. Entropy is better for high-cardinality problems but can waste budget on samples that are confusing across irrelevant classes.
Batch Active Learning
Querying one sample at a time is clean but slow. In practice, you want to query batches because sending one sample at a time to human annotators is impractical. The trick is avoiding redundancy within the batch.
Batch sizes of 5-20 work well for most setups. Go larger than 50 and you start losing the benefit of active learning because you’re approaching random sampling territory.
Evaluating Labeling Efficiency
The whole point of active learning is to label less. You need to compare your active learner against random sampling to prove it’s actually helping.
If the active learning curve sits above the random curve, you’re saving labeling effort. The horizontal gap between the curves at a given accuracy tells you exactly how many labels you saved. In a typical run, active learning reaches 85% accuracy with 40 labels while random sampling needs 120+ to get there.
Writing a Custom Query Strategy
Sometimes the built-in strategies aren’t enough. Maybe you want to combine uncertainty with diversity, or weight samples by their domain. modAL makes custom strategies straightforward.
The 70/30 split between uncertainty and diversity is a good starting point. Pure uncertainty sampling can get stuck querying clusters of nearly identical hard samples. Adding a diversity term pushes the learner to explore underrepresented regions of the feature space.
Common Errors
ValueError: query_idx is out of range for the pool. You’re querying more samples than remain in the pool. Check len(X_pool) before calling learner.query() and cap n_instances accordingly.
NotFittedError: This RandomForestClassifier instance is not fitted yet. The estimator needs to be fitted before querying. Make sure you pass X_training and y_training to ActiveLearner(), or call learner.fit(X, y) before the first learner.query().
Active learning performs worse than random sampling. This happens more than people admit. Common causes: your seed set doesn’t cover all classes (fix by stratified sampling), your model is too simple to give meaningful uncertainty estimates (try a deeper model), or your feature space is so noisy that uncertainty doesn’t correlate with informativeness.
ModuleNotFoundError: No module named 'modAL'. The package name on PyPI is modAL-python, not modAL. Run pip install modAL-python.
Memory errors with large pools. Calling predict_proba on 500K samples at once will blow up RAM. Process the pool in chunks of 10K-50K and concatenate the results, or use np.memmap for the pool array.
When Active Learning Is Not Worth It
Active learning adds complexity. If labeling is cheap (automated rules, crowdsourcing at $0.01/label), the engineering overhead of an active learning loop probably isn’t worth the savings. It pays off when labels are expensive: medical imaging where a radiologist charges $50/annotation, legal document review, or any domain where expert time is the bottleneck.
Also skip it if your dataset is tiny (under 500 samples total). Active learning needs enough unlabeled data to meaningfully select from. With a small pool, random sampling and active learning converge to the same result.
Related Guides
- How to Generate Synthetic Training Data with Hugging Face’s Synthetic Data Generator Without Triggering Model Collapse
- How to Build a Data Sampling Pipeline for Large-Scale ML Training
- How to Anonymize Training Data for ML Privacy
- How to Build a Data Reconciliation Pipeline for ML Training Sets
- How to Build a Data Contamination Detection Pipeline for LLM Training
- How to Label Training Data with LLM-Assisted Annotation
- How to Build a Data Annotation Pipeline with Argilla
- How to Augment Training Data with Albumentations and NLP Augmenter
- How to Create Synthetic Training Data with LLMs
- How to Handle Imbalanced Datasets for ML Training