Semi-Supervised Learning: Making the Most of Unlabeled Data
The most common bottleneck in supervised ML is not model architecture or compute — it is labeled data. Annotation is expensive, slow, and requires domain expertise. Yet in most domains, vast quantities of unlabeled data are freely available: product reviews without sentiment labels, medical images without diagnoses, code repositories without bug classifications. Semi-supervised learning exploits this unlabeled abundance to improve models trained on a small labeled set.
The Core Assumption
Semi-supervised learning rests on the smoothness assumption: data points that are close in the feature space should have similar labels. If unlabeled data reveals the geometric structure of the feature space, that structure should inform the decision boundary even in regions where no labeled points are observed.
A related idea is the low-density separation assumption: decision boundaries should lie in low-density regions of the feature space, not cut through clusters of points. Unlabeled data reveals where the density is low; the classifier should prefer those regions for its boundaries.
Self-Training (Pseudo-Labeling)
The simplest semi-supervised approach:
- Train a classifier on your labeled dataset
- Run inference on unlabeled data
- Add high-confidence predictions as "pseudo-labels" to the training set
- Retrain on labeled + pseudo-labeled data
- Repeat
The key is the confidence threshold: only add pseudo-labels where the model's predicted probability exceeds a threshold (typically 0.9-0.95). High-confidence predictions are more likely to be correct, so they enrich the training set with reliable examples the model was already fairly certain about. The main risk is confirmation bias: a wrong pseudo-label, once added, reinforces the model's original mistake on every subsequent retraining round, which is why the threshold is set so high.
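The loop above can be sketched in a few lines of scikit-learn. This is a minimal toy illustration on synthetic blob data, not a production recipe; the classifier, the dataset, and the five-round cap are all arbitrary choices for the sketch:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy setup: two well-separated clusters, but only 5 labeled points per class.
X, y = make_blobs(n_samples=500, centers=2, random_state=0)
rng = np.random.default_rng(0)
labeled = np.concatenate(
    [rng.choice(np.where(y == c)[0], size=5, replace=False) for c in (0, 1)]
)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

X_lab, y_lab = X[labeled], y[labeled]
X_unlab = X[unlabeled]

THRESHOLD = 0.95  # only trust very confident predictions

for _ in range(5):  # a few self-training rounds
    clf = LogisticRegression().fit(X_lab, y_lab)
    probs = clf.predict_proba(X_unlab)
    confident = probs.max(axis=1) >= THRESHOLD
    if not confident.any():
        break
    # Promote confident predictions to pseudo-labels and retrain on the union.
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, probs[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]
```

scikit-learn also ships a ready-made `SelfTrainingClassifier` wrapper that implements the same loop around any estimator with `predict_proba`.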
Self-training is simple, requires no changes to the model architecture, and works with any classifier. It is particularly effective when the labeled set is small but representative of the class distribution.
FixMatch and Consistency Regularisation
Consistency regularisation exploits the idea that a good classifier should produce consistent predictions for augmented versions of the same unlabeled example. FixMatch (Sohn et al., 2020) applies this principle aggressively:
- Apply a weak augmentation to an unlabeled image; compute the model's prediction
- If the weak-augmentation prediction confidence exceeds a threshold, use it as a pseudo-label
- Apply a strong augmentation to the same image; train the model to predict the pseudo-label on the strongly augmented version
This forces the model to produce consistent predictions across very different views of the same image, leveraging unlabeled data to improve robustness and generalisation. FixMatch achieves remarkable accuracy on image classification with as few as 40 labeled examples in total — four per class on CIFAR-10.
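The unlabeled-data term in the FixMatch objective can be sketched framework-free with numpy. This is a simplified illustration of the masked pseudo-label cross-entropy, with hypothetical array shapes; a real implementation would sit inside a training loop with actual augmentations and backpropagation:

```python
import numpy as np

def fixmatch_unlabeled_loss(probs_weak, logits_strong, threshold=0.95):
    """FixMatch-style unlabeled loss over a batch of N examples, C classes.

    probs_weak:    (N, C) softmax outputs on the weakly augmented views
    logits_strong: (N, C) raw logits on the strongly augmented views
    """
    pseudo = probs_weak.argmax(axis=1)           # hard pseudo-labels from weak views
    mask = probs_weak.max(axis=1) >= threshold   # keep only confident examples
    # Cross-entropy of the strong-view predictions against the pseudo-labels.
    log_probs = logits_strong - np.log(
        np.exp(logits_strong).sum(axis=1, keepdims=True)
    )
    ce = -log_probs[np.arange(len(pseudo)), pseudo]
    # Average over the confident examples only (0 if none pass the threshold).
    return (ce * mask).sum() / max(mask.sum(), 1)
```

Low-confidence examples contribute nothing to the loss, so early in training — when few predictions clear the threshold — the model learns mostly from the labeled set, and the unlabeled signal ramps up as confidence grows.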
Graph-Based Methods
Graph-based methods construct a similarity graph over all data points (labeled and unlabeled) and propagate labels through the graph based on edge weights. The Label Propagation algorithm iteratively spreads label information from labeled nodes to neighboring unlabeled nodes.
Graph methods are most effective when the feature space has meaningful cluster structure and similarity computation is reliable. They struggle with high-dimensional raw data where nearest-neighbour relationships are noisy.
Self-Supervised Pretraining
A different but related approach: train a model on a self-supervised task (predicting masked tokens in text, predicting the rotation of an image) on large unlabeled datasets, then fine-tune on the small labeled set. This is the foundation of BERT, GPT, and SimCLR.
The distinction from other semi-supervised methods: self-supervised pretraining does not use unlabeled data to improve a supervised classifier — it uses unlabeled data to learn representations that transfer to supervised fine-tuning. In practice, this is often the most effective approach when a large unlabeled corpus and a suitable self-supervised objective are available.
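The "free labels" idea behind a pretext task like rotation prediction can be sketched in a few lines. The function below is a hypothetical helper for illustration: it manufactures supervised training pairs from unlabeled images alone, which is the essence of self-supervision:

```python
import numpy as np

def rotation_pretext(images, seed=0):
    """Self-supervised pretext task: predict rotation (0/90/180/270 degrees).

    The labels are generated from the data itself -- no annotation needed.
    images: (N, H, W) array of square grayscale images.
    Returns (rotated_images, rotation_class) as a supervised training pair.
    """
    rng = np.random.default_rng(seed)
    k = rng.integers(0, 4, size=len(images))  # rotation class per image
    rotated = np.stack([np.rot90(img, ki) for img, ki in zip(images, k)])
    return rotated, k
```

A model pretrained to solve this task must learn features that capture object orientation and structure; those features are then reused by fine-tuning on the small labeled set.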
When Semi-Supervised Learning Helps Most
- Labeled data is scarce (< 1000 examples) and unlabeled data is plentiful
- Labeled and unlabeled data are from the same distribution — if the unlabeled data comes from a different domain, semi-supervised methods can hurt performance
- The task has structure in the unlabeled data — image classification, text classification, and audio classification all have strong cluster structure that semi-supervised methods can exploit
Conclusion
Semi-supervised learning is not a substitute for good labeled data, but it is a powerful tool for making the most of what you have. Self-training is the practical first choice for most teams: simple to implement, model-agnostic, and effective when labeled data is the primary bottleneck. For image classification specifically, FixMatch-style consistency regularisation represents the state of the art with very limited labels.
Keywords: semi-supervised learning, pseudo-labeling, self-training, FixMatch, label propagation, consistency regularisation, machine learning unlabeled data, data labeling