
Active Learning Pipelines: Reducing Labeling Costs Without Sacrificing Quality

khaled · October 5, 2025

Labeling data is expensive. For a text classification task, professional annotation at $0.10-0.50 per example means that 10,000 labeled examples cost $1,000-5,000. For image annotation with bounding boxes, costs are higher; for specialised medical or legal annotation, they are higher still. Active learning is the strategy of being smart about which examples you ask annotators to label — selecting the examples that will teach the model the most, rather than sampling randomly.

The Core Principle

A randomly selected batch of 1,000 examples contains many easy, redundant examples the model already handles correctly. An actively selected batch of 1,000 examples contains the hard, informative examples near the decision boundary — the ones the model is most uncertain about. Training on the second batch produces larger performance gains per annotation dollar.

Research consistently shows that active learning achieves the same model performance as random sampling with 30-70% fewer labeled examples, depending on the task and query strategy.

Query Strategies

Uncertainty Sampling

Select the examples the current model is most uncertain about. For a binary classifier, uncertainty is maximum at predicted probability = 0.5. For multiclass, use entropy:

from scipy.stats import entropy

# Class probabilities for every unlabeled example, shape (n_examples, n_classes)
probabilities = model.predict_proba(unlabeled_pool)

# Row-wise entropy: 0 for a fully confident prediction, log(n_classes) for a uniform one
uncertainty_scores = entropy(probabilities, axis=1)

# argsort is ascending, so the last batch_size indices are the most uncertain
top_uncertain_indices = uncertainty_scores.argsort()[-batch_size:]

Uncertainty sampling is the default strategy and works well for most tasks. Its weakness: it can select redundant examples that are all uncertain for the same reason (e.g., all from the same confusing domain).

Query by Committee (QBC)

Train multiple models (a "committee") on different subsets of the labeled data. Select examples where the committee members disagree most. Disagreement indicates genuine ambiguity rather than a single model's blind spot.
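
A minimal sketch of vote-entropy QBC, assuming a scikit-learn-style estimator and numpy arrays for the labeled data (the function name and all parameters here are illustrative):

import numpy as np
from sklearn.base import clone

def qbc_select(base_model, X_labeled, y_labeled, unlabeled_pool,
               batch_size, committee_size=5, seed=0):
    """Vote-entropy QBC: pick the examples the committee disagrees about most."""
    rng = np.random.default_rng(seed)
    n = len(X_labeled)
    votes = []
    for _ in range(committee_size):
        # Each committee member trains on a bootstrap resample of the labeled set
        idx = rng.integers(0, n, size=n)
        member = clone(base_model).fit(X_labeled[idx], y_labeled[idx])
        votes.append(member.predict(unlabeled_pool))
    votes = np.stack(votes)                                   # (committee, pool)
    classes = np.unique(y_labeled)
    # Fraction of committee votes each class receives, per example
    vote_dist = np.stack([(votes == c).mean(axis=0) for c in classes], axis=1)
    # Vote entropy: 0 when unanimous, maximal when the votes split evenly
    with np.errstate(divide="ignore", invalid="ignore"):
        disagreement = -np.nansum(vote_dist * np.log(vote_dist), axis=1)
    return disagreement.argsort()[-batch_size:]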

Diversity Sampling (Core-Set)

Select examples that are maximally diverse — spread across the unlabeled feature space to ensure coverage rather than focusing on the most uncertain region. The Core-Set approach selects the smallest set of points such that every unlabeled point is within a given distance of a selected point.
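
The exact Core-Set (k-center) problem is NP-hard, so a common practical choice is the greedy 2-approximation: repeatedly pick the unlabeled point farthest from every already-covered point. A minimal sketch over precomputed feature embeddings (for large pools the distance matrix should be computed in chunks rather than broadcast at once):

import numpy as np

def coreset_select(labeled_feats, unlabeled_feats, batch_size):
    """Greedy k-center: each pick is the point farthest from all covered points."""
    # Distance from every unlabeled point to its nearest labeled point
    dists = np.linalg.norm(
        unlabeled_feats[:, None, :] - labeled_feats[None, :, :], axis=2
    ).min(axis=1)
    selected = []
    for _ in range(batch_size):
        i = int(dists.argmax())      # the worst-covered point in the pool
        selected.append(i)
        # The new center may now be the nearest cover for other points
        new_d = np.linalg.norm(unlabeled_feats - unlabeled_feats[i], axis=1)
        dists = np.minimum(dists, new_d)
    return selected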

BADGE (Batch Active Learning by Diverse Gradient Embeddings)

A state-of-the-art approach that combines uncertainty (using gradient magnitudes as uncertainty proxies) with diversity (using k-means++ on gradient embeddings to ensure diverse selection). BADGE consistently outperforms pure uncertainty sampling by ~10% in labeling efficiency.
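
A minimal sketch of the BADGE idea, assuming access to class probabilities and penultimate-layer features from the network. The "hallucinated" last-layer gradient at the pseudo-label (the argmax class) is the outer product of (p - onehot(argmax p)) with the features; its norm is large for uncertain points, and k-means++ seeding over these embeddings spreads the batch out. (Some implementations draw the first center at random; starting from the largest-gradient point is one common variant.)

import numpy as np

def badge_select(probs, feats, batch_size, seed=0):
    """BADGE: k-means++ seeding over hallucinated last-layer gradient embeddings."""
    n, _ = probs.shape
    # Gradient of cross-entropy at the pseudo-label: (p - onehot(argmax p)) x h(x)
    residual = probs.copy()
    residual[np.arange(n), probs.argmax(axis=1)] -= 1.0
    g = (residual[:, :, None] * feats[:, None, :]).reshape(n, -1)

    rng = np.random.default_rng(seed)
    selected = [int(np.linalg.norm(g, axis=1).argmax())]      # largest-gradient start
    d2 = np.linalg.norm(g - g[selected[0]], axis=1) ** 2
    for _ in range(batch_size - 1):
        if d2.sum() == 0:            # no distinct points left to pick
            break
        # k-means++: sample the next center proportionally to squared distance
        i = int(rng.choice(n, p=d2 / d2.sum()))
        selected.append(i)
        d2 = np.minimum(d2, np.linalg.norm(g - g[i], axis=1) ** 2)
    return selected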

Building an Active Learning Pipeline

A production active learning loop (a code skeleton follows the stopping criterion below):

  1. Initial labeled set: start with 200-500 randomly sampled, labeled examples. Random sampling for the initial pool prevents biases in the first model.
  2. Train initial model: fit on the labeled set.
  3. Score unlabeled pool: run inference on all unlabeled examples; compute query scores.
  4. Select batch: choose the top-K examples by query score (typically 200-500 per iteration).
  5. Send to annotators: integrate with your annotation platform (Label Studio, Scale AI, Prodigy).
  6. Receive labeled examples: add to the labeled set.
  7. Retrain: fine-tune or retrain on the full labeled set.
  8. Evaluate: run on the held-out evaluation set; decide whether to continue or stop.

The stopping criterion is typically: diminishing returns on evaluation performance per annotation batch, or reaching a target evaluation metric.
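
A skeleton of the loop above, wiring the eight steps to the stopping rule. All of train_model, select_batch (any strategy from the previous section), request_labels (your annotation-platform integration), and evaluate (here assumed to return a scalar metric such as F1) are hypothetical hooks into your own stack:

def active_learning_loop(labeled, unlabeled, eval_set,
                         batch_size=300, target_f1=0.90, min_gain=0.002):
    history = []
    model = None
    while unlabeled:
        model = train_model(labeled)                          # steps 2 and 7
        score = evaluate(model, eval_set)                     # step 8
        history.append(score)
        # Stop at the target metric, or when gains per batch flatten out
        if score >= target_f1:
            break
        if len(history) >= 2 and history[-1] - history[-2] < min_gain:
            break
        batch_ids = set(select_batch(model, unlabeled, batch_size))  # steps 3-4
        new_examples = request_labels(                        # steps 5-6
            [unlabeled[i] for i in batch_ids])
        labeled.extend(new_examples)
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in batch_ids]
    return model, history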

Practical Considerations

Annotation platform integration: the query-annotate loop should be automated. Build APIs or webhooks between your model inference pipeline and your annotation platform so that selected examples flow directly to annotators without manual CSV exports.

Evaluation set must be random: the evaluation set should be randomly sampled (not actively selected) to avoid evaluation bias. A model evaluated on actively selected examples appears better than it is.

Monitor for bias: active learning can skew the labeled distribution — if the model is uncertain about a specific demographic group, it will over-sample from that group. Track demographic and domain composition of the labeled set.
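
A minimal sketch of such tracking, assuming the labeled set and the pool are pandas DataFrames with a group column (domain, demographic, or similar; all names here are illustrative):

import pandas as pd

def composition_report(labeled_df, pool_df, column="domain"):
    """Share of each group in the labeled set vs. the full unlabeled pool."""
    labeled_share = labeled_df[column].value_counts(normalize=True)
    pool_share = pool_df[column].value_counts(normalize=True)
    report = pd.DataFrame({"labeled": labeled_share, "pool": pool_share}).fillna(0.0)
    # Ratios far from 1.0 flag groups being over- or under-sampled by the queries
    report["ratio"] = report["labeled"] / report["pool"].replace(0.0, float("nan"))
    return report.sort_values("ratio", ascending=False)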

Cold start problem: the initial model trained on random examples may be poor, leading to poor initial query selection. Mitigate with a larger random initial pool, or seed with examples across all known class categories.

Conclusion

Active learning is one of the highest-ROI investments a data science team can make when labeling costs are a constraint. A 50% reduction in annotation costs for the same model quality is achievable for most NLP and computer vision tasks with standard uncertainty sampling. BADGE and QBC offer incremental improvements. The pipeline investment is modest; the payoff, particularly for tasks requiring expensive specialised annotation, is substantial.

Keywords: active learning, annotation efficiency, query strategy, uncertainty sampling, BADGE, machine learning labeling, data annotation, active learning pipeline