The Core Idea in 30 Seconds

Active learning lets your model pick which samples to label next, instead of labeling everything randomly. You train on a tiny labeled set, ask the model which unlabeled points it’s most confused about, label those, retrain, and repeat. This cuts labeling costs by 50-80% in practice.

pip install modAL-python scikit-learn numpy matplotlib

The modAL library wraps scikit-learn estimators with active learning query strategies out of the box. It’s the fastest path from zero to a working active learning loop in Python.

Pool-Based Active Learning Loop

The most common setup is pool-based active learning. You start with a large pool of unlabeled data, a small labeled seed set, and a model. Each iteration, the model scores every unlabeled sample, picks the most informative ones, and you (or an oracle) provide labels for just those.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

# Generate a dataset — pretend only a few samples are labeled
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=10,
    n_classes=3,
    random_state=42,
)

# Hold out 10% for evaluation; the remaining 90% becomes the unlabeled pool
X_test, X_pool, y_test, y_pool = train_test_split(
    X, y, test_size=0.9, random_state=42,
)

# Start with just 20 labeled samples as the seed
n_initial = 20
initial_idx = np.random.choice(len(X_pool), size=n_initial, replace=False)
X_initial = X_pool[initial_idx]
y_initial = y_pool[initial_idx]

# Remove initial samples from the pool
X_pool = np.delete(X_pool, initial_idx, axis=0)
y_pool = np.delete(y_pool, initial_idx, axis=0)

# Create the active learner with uncertainty sampling
learner = ActiveLearner(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    query_strategy=uncertainty_sampling,
    X_training=X_initial,
    y_training=y_initial,
)

print(f"Initial accuracy: {learner.score(X_test, y_test):.3f}")

# Active learning loop — query one sample per iteration
n_queries = 50
accuracies = [learner.score(X_test, y_test)]

for i in range(n_queries):
    # Model picks the most uncertain sample from the pool
    query_idx, query_inst = learner.query(X_pool)

    # Simulate labeling (in reality, a human labels this)
    learner.teach(X_pool[query_idx], y_pool[query_idx])

    # Remove queried sample from pool
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

    accuracies.append(learner.score(X_test, y_test))

print(f"Final accuracy after {n_queries} queries: {accuracies[-1]:.3f}")

After 50 queries you’ll typically see accuracy jump from around 0.55-0.65 to 0.80+, depending on the dataset. That’s 70 total labeled samples doing the work of hundreds labeled randomly.

Query Strategies: Picking the Right One

The query strategy decides which samples the model asks about. This is where active learning either shines or falls flat. modAL ships with three uncertainty-based strategies, and they’re not interchangeable.

Least Confidence

Picks the sample where the model’s most confident class still has the lowest probability. Good default for multiclass problems.

from modAL.uncertainty import uncertainty_sampling

# This is least confidence under the hood
# Selects sample where max(P(class)) is lowest
learner = ActiveLearner(
    estimator=RandomForestClassifier(n_estimators=100),
    query_strategy=uncertainty_sampling,
    X_training=X_initial,
    y_training=y_initial,
)

Margin Sampling

Looks at the gap between the top two predicted class probabilities. A small margin means the model can’t decide between two classes. This is my preferred strategy for most classification tasks because it targets the actual decision boundary rather than general confusion.

from modAL.uncertainty import margin_sampling

learner = ActiveLearner(
    estimator=RandomForestClassifier(n_estimators=100),
    query_strategy=margin_sampling,
    X_training=X_initial,
    y_training=y_initial,
)

Entropy Sampling

Computes the Shannon entropy of the predicted class distribution. High entropy means the model is spread across many classes. Use this when you have many classes (10+) and want to capture overall confusion rather than pairwise ambiguity.

from modAL.uncertainty import entropy_sampling

learner = ActiveLearner(
    estimator=RandomForestClassifier(n_estimators=100),
    query_strategy=entropy_sampling,
    X_training=X_initial,
    y_training=y_initial,
)

My recommendation: start with margin sampling. It consistently outperforms least confidence across datasets I’ve worked with because it focuses on the hardest decision boundary. Entropy is better for high-cardinality problems but can waste budget on samples that are confusing across irrelevant classes.
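To make the differences concrete, here is a numpy-only sketch that scores a made-up probability matrix by all three criteria (the numbers are invented for illustration; this is not modAL code):

```python
import numpy as np

# Hypothetical predicted probabilities for four samples over three classes
proba = np.array([
    [0.400, 0.350, 0.250],   # moderately spread out
    [0.495, 0.490, 0.015],   # near-tie between the top two classes
    [0.900, 0.050, 0.050],   # confident
    [0.340, 0.330, 0.330],   # close to uniform
])

least_confidence = 1.0 - proba.max(axis=1)      # higher = more uncertain
sorted_p = np.sort(proba, axis=1)
margin = sorted_p[:, -1] - sorted_p[:, -2]      # smaller = more uncertain
entropy = -(proba * np.log(proba)).sum(axis=1)  # higher = more uncertain

print(np.argmax(least_confidence))  # 3 — lowest top-class probability
print(np.argmin(margin))            # 1 — tightest two-way race
print(np.argmax(entropy))           # 3 — closest to uniform
```

Least confidence and entropy both chase the near-uniform sample, while margin zeroes in on the two-way tie — exactly the decision-boundary focus described above.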

Batch Active Learning

Querying one sample at a time is clean but slow. In practice, you want to query batches because sending one sample at a time to human annotators is impractical. The trick is avoiding redundancy within the batch.

from modAL.batch import uncertainty_batch_sampling

learner = ActiveLearner(
    estimator=RandomForestClassifier(n_estimators=100),
    query_strategy=uncertainty_batch_sampling,
    X_training=X_initial,
    y_training=y_initial,
)

# Query 5 samples at once
query_idx, query_inst = learner.query(X_pool, n_instances=5)

# Label and teach the entire batch
learner.teach(X_pool[query_idx], y_pool[query_idx])
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx, axis=0)

Batch sizes of 5-20 work well for most setups. Go larger than 50 and you start losing the benefit of active learning because you’re approaching random sampling territory.

Evaluating Labeling Efficiency

The whole point of active learning is to label less. You need to compare your active learner against random sampling to prove it’s actually helping.

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Run random sampling for comparison, starting from the same seed set.
# Note: X_pool was already depleted by the active loop above; for a strictly
# fair comparison, rerun both loops from the original pool.
X_rand_pool = X_pool.copy()
y_rand_pool = y_pool.copy()

random_clf = RandomForestClassifier(n_estimators=100, random_state=42)
random_clf.fit(X_initial, y_initial)
random_accuracies = [random_clf.score(X_test, y_test)]

X_rand_train = X_initial.copy()
y_rand_train = y_initial.copy()

for i in range(n_queries):
    # Random sampling — pick a sample at random
    rand_idx = np.random.randint(0, len(X_rand_pool))
    X_rand_train = np.vstack([X_rand_train, X_rand_pool[rand_idx].reshape(1, -1)])
    y_rand_train = np.append(y_rand_train, y_rand_pool[rand_idx])

    X_rand_pool = np.delete(X_rand_pool, rand_idx, axis=0)
    y_rand_pool = np.delete(y_rand_pool, rand_idx, axis=0)

    random_clf.fit(X_rand_train, y_rand_train)
    random_accuracies.append(random_clf.score(X_test, y_test))

# Plot the learning curves
plt.figure(figsize=(10, 6))
plt.plot(range(len(accuracies)), accuracies, label="Active Learning (uncertainty)")
plt.plot(range(len(random_accuracies)), random_accuracies, label="Random Sampling")
plt.xlabel("Number of Labeled Samples Added")
plt.ylabel("Test Accuracy")
plt.title("Active Learning vs Random Sampling")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("active_learning_curve.png", dpi=150, bbox_inches="tight")
plt.show()

If the active learning curve sits above the random curve, you’re saving labeling effort. The horizontal gap between the curves at a given accuracy tells you exactly how many labels you saved. In a typical run, active learning reaches 85% accuracy with 40 labels while random sampling needs 120+ to get there.
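Reading that horizontal gap off programmatically takes only a few lines. A minimal sketch with invented curves standing in for accuracies and random_accuracies (labels_to_reach is a hypothetical helper, not part of modAL):

```python
import numpy as np

# Invented learning curves: test accuracy after each added label
active_acc = np.array([0.55, 0.70, 0.78, 0.84, 0.87, 0.89])
random_acc = np.array([0.55, 0.60, 0.66, 0.71, 0.76, 0.80])

def labels_to_reach(accuracies, target):
    """Index of the first point at or above the target accuracy, or None."""
    hits = np.nonzero(accuracies >= target)[0]
    return int(hits[0]) if len(hits) else None

target = 0.80
active_n = labels_to_reach(active_acc, target)
random_n = labels_to_reach(random_acc, target)
print(f"Labels saved at {target:.0%}: {random_n - active_n}")
```

Run this on your real accuracy lists at a few target accuracies to quantify the savings, and report the gap rather than a single endpoint accuracy.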

Writing a Custom Query Strategy

Sometimes the built-in strategies aren’t enough. Maybe you want to combine uncertainty with diversity, or weight samples by their domain. modAL makes custom strategies straightforward.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def combined_uncertainty_diversity(classifier, X, n_instances=1):
    """Pick uncertain samples that are also far from existing training data."""

    # Get uncertainty scores
    proba = classifier.predict_proba(X)
    uncertainty = 1 - np.max(proba, axis=1)

    # Get diversity scores (distance to nearest training sample)
    distances = euclidean_distances(X, classifier.X_training)
    min_distances = np.min(distances, axis=1)

    # Normalize both to [0, 1]
    uncertainty = (uncertainty - uncertainty.min()) / (uncertainty.max() - uncertainty.min() + 1e-10)
    min_distances = (min_distances - min_distances.min()) / (min_distances.max() - min_distances.min() + 1e-10)

    # Combined score: 70% uncertainty, 30% diversity
    scores = 0.7 * uncertainty + 0.3 * min_distances

    top_idx = np.argsort(scores)[-n_instances:]
    return top_idx, X[top_idx]

learner = ActiveLearner(
    estimator=RandomForestClassifier(n_estimators=100),
    query_strategy=combined_uncertainty_diversity,
    X_training=X_initial,
    y_training=y_initial,
)

The 70/30 split between uncertainty and diversity is a good starting point. Pure uncertainty sampling can get stuck querying clusters of nearly identical hard samples. Adding a diversity term pushes the learner to explore underrepresented regions of the feature space.

Common Errors

ValueError: query_idx is out of range for the pool. You’re querying more samples than remain in the pool. Check len(X_pool) before calling learner.query() and cap n_instances accordingly.

NotFittedError: This RandomForestClassifier instance is not fitted yet. The estimator needs to be fitted before querying. Make sure you pass X_training and y_training to ActiveLearner(), or call learner.fit(X, y) before the first learner.query().

Active learning performs worse than random sampling. This happens more than people admit. Common causes: your seed set doesn’t cover all classes (fix by stratified sampling), your model is too simple to give meaningful uncertainty estimates (try a deeper model), or your feature space is so noisy that uncertainty doesn’t correlate with informativeness.
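The seed-coverage failure in particular has a cheap fix: draw the seed set with stratified sampling so every class appears from the start. A sketch with a synthetic pool (the data here is random; this assumes a simulated setup like the one above, where pool labels are known):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic pool with known labels, standing in for the real pool
rng = np.random.default_rng(42)
X_pool = rng.random((1000, 20))
y_pool = rng.integers(0, 3, size=1000)

# stratify=y_pool guarantees every class shows up in the 20-sample seed
X_rest, X_initial, y_rest, y_initial = train_test_split(
    X_pool, y_pool, test_size=20, stratify=y_pool, random_state=42,
)
print(np.bincount(y_initial))  # roughly balanced counts over the 3 classes
```

When pool labels are unknown, the same idea applies at annotation time: have the oracle label a seed batch and keep drawing until every class is represented.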

ModuleNotFoundError: No module named 'modAL'. The package name on PyPI is modAL-python, not modAL. Run pip install modAL-python.

Memory errors with large pools. Calling predict_proba on 500K samples at once will blow up RAM. Process the pool in chunks of 10K-50K and concatenate the results, or use np.memmap for the pool array.
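A sketch of the chunked approach, assuming least-confidence scoring (chunked_uncertainty is an illustrative helper, not a modAL function):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def chunked_uncertainty(model, X_pool, chunk_size=10_000):
    """Least-confidence scores for a large pool, computed one chunk at a time."""
    scores = np.empty(len(X_pool))
    for start in range(0, len(X_pool), chunk_size):
        chunk = X_pool[start:start + chunk_size]
        proba = model.predict_proba(chunk)
        scores[start:start + chunk_size] = 1.0 - proba.max(axis=1)
    return scores

# Sanity check on a small pool: chunked scores match scoring all at once
rng = np.random.default_rng(0)
X, y = rng.random((200, 5)), rng.integers(0, 3, size=200)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
pool = rng.random((50, 5))
assert np.allclose(chunked_uncertainty(clf, pool, chunk_size=16),
                   1.0 - clf.predict_proba(pool).max(axis=1))
```

np.argsort on the resulting scores then gives the query order, and the same pattern carries over to margin or entropy scoring.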

When Active Learning Is Not Worth It

Active learning adds complexity. If labeling is cheap (automated rules, crowdsourcing at $0.01/label), the engineering overhead of an active learning loop probably isn’t worth the savings. It pays off when labels are expensive: medical imaging where a radiologist charges $50/annotation, legal document review, or any domain where expert time is the bottleneck.

Also skip it if your dataset is tiny (under 500 samples total). Active learning needs enough unlabeled data to meaningfully select from. With a small pool, random sampling and active learning converge to the same result.