Building a Custom Text Classifier Without a Single Line of Deep Learning Code
The standard narrative around NLP in 2024 involves fine-tuning BERT, managing GPU clusters, and wrangling PyTorch DataLoaders. This is appropriate when you have large datasets and strict performance requirements. But for a surprisingly large range of real-world text classification problems, a simpler approach produces 90% of the performance with 10% of the complexity: use pretrained embeddings as features and train a lightweight classical classifier on top.
This article walks through a complete, production-ready approach that requires only Python, scikit-learn, and an API call or a lightweight embedding library.
When This Approach Works
This approach is appropriate when:
- You have 50 to 50,000 labeled examples (for very large datasets, full fine-tuning begins to win decisively)
- You need fast iteration — training in seconds, not hours
- You are CPU-only or have minimal compute available
- You want interpretable features and a simple deployment story
- Your classes are reasonably distinct in semantic space
It works poorly for tasks requiring deep contextual understanding (long-document classification, nuanced sentiment) or for domains with very specialized vocabulary not covered by your embedding model.
Step 1: Choose an Embedding Model
The quality of your classifier depends almost entirely on the quality of your embeddings. Good choices:
- OpenAI text-embedding-3-small (via API): excellent quality, easy to call, costs ~$0.02 per million tokens
- sentence-transformers/all-MiniLM-L6-v2: fast, open-source, runs on CPU in milliseconds per sentence, good general quality
- sentence-transformers/all-mpnet-base-v2: higher quality, slower, still CPU-feasible
- FastText or GloVe word vectors (average pooled): much weaker baseline, only use if other options are unavailable
For most teams, all-MiniLM-L6-v2 from the sentence-transformers library is the right default. Install with pip install sentence-transformers.
Step 2: Embed Your Training Data
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
# your_dataset: any iterable of records with 'text' and 'label' fields
texts = [row['text'] for row in your_dataset]
labels = [row['label'] for row in your_dataset]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)
# embeddings.shape: (n_samples, 384)
This produces a 384-dimensional vector for each text. These vectors are your features.
Step 3: Train a Classifier
Logistic Regression is a strong default. Linear SVMs are also excellent for this use case.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(
embeddings, labels, test_size=0.2, random_state=42, stratify=labels
)
clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
Training takes seconds on a CPU for up to ~100,000 samples.
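The linear SVM mentioned above is nearly a drop-in replacement for LogisticRegression. A minimal sketch, using synthetic 384-dimensional features from make_classification as a stand-in for real sentence embeddings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Synthetic stand-in for sentence embeddings: 384-dim features, 3 classes
X, y = make_classification(n_samples=1000, n_features=384, n_informative=50,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# LinearSVC swaps in where LogisticRegression was used above
svm = LinearSVC(C=1.0, max_iter=5000)
svm.fit(X_train, y_train)
print(f1_score(y_test, svm.predict(X_test), average='macro'))
```

One practical difference: LinearSVC has no predict_proba out of the box; if you need calibrated probabilities, wrap it in sklearn's CalibratedClassifierCV or stay with LogisticRegression.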
Step 4: Handle Class Imbalance
Real-world classification data is almost never balanced. If your positive class is 5% of the data, a model that always predicts negative achieves 95% accuracy — but is useless.
Solutions:
- Set class_weight='balanced' in LogisticRegression to automatically reweight
- Oversample the minority class with SMOTE (applied to the embedding vectors, not raw text)
- Collect more labeled examples for underrepresented classes
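The reweighting option is a one-line change. A minimal sketch on synthetic imbalanced data (again standing in for real embeddings), comparing minority-class recall with and without class_weight='balanced':

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 5% positive class
X, y = make_classification(n_samples=2000, n_features=384, n_informative=40,
                           weights=[0.95, 0.05], class_sep=0.8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_train, y_train)

# Reweighting typically trades some precision for better minority-class recall
print('plain recall:   ', recall_score(y_test, plain.predict(X_test)))
print('balanced recall:', recall_score(y_test, balanced.predict(X_test)))
```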
Step 5: Evaluate Correctly
Accuracy is the wrong metric for imbalanced problems. Use:
- Precision and Recall per class — what fraction of predicted positives are true positives? What fraction of actual positives did you find?
- F1-score — harmonic mean of precision and recall
- Confusion matrix — visualize systematic errors
Pay attention to which classes the model confuses. Confusion between similar classes (e.g., "billing question" vs "refund request") is expected; confusion between obviously different classes indicates an embedding or data quality problem.
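A confusion matrix makes these systematic errors visible at a glance. A small sketch with hypothetical gold labels and predictions for three support-ticket classes (the label names are illustrative only):

```python
from sklearn.metrics import confusion_matrix

labels = ['billing', 'refund', 'shipping']
# Hypothetical gold labels and model predictions
y_true = ['billing', 'billing', 'refund', 'refund', 'shipping', 'shipping',
          'billing', 'refund']
y_pred = ['billing', 'refund', 'refund', 'billing', 'shipping', 'shipping',
          'billing', 'refund']

cm = confusion_matrix(y_true, y_pred, labels=labels)
# Rows are true classes, columns are predicted classes
for name, row in zip(labels, cm):
    print(f'{name:>8}: {row}')
```

Here the off-diagonal mass sits entirely between billing and refund (semantically close classes), while shipping is never confused with either, which is exactly the expected pattern.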
Step 6: Iterate on Your Data First, Not Your Model
The single most impactful improvement is usually more labeled data for underperforming classes, not a better model. Before reaching for BERT fine-tuning, ask: does my test set have at least 100 examples per class? Are my label guidelines clear enough that two annotators would agree 90% of the time?
Fixing data quality and quantity almost always beats model switching.
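The annotator-agreement question can be answered empirically: have two annotators label the same sample and measure both raw agreement and a chance-corrected statistic. A minimal sketch with hypothetical labels, using sklearn's cohen_kappa_score:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same 10 texts
annotator_a = ['pos', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos']
annotator_b = ['pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos']

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)  # agreement corrected for chance
print(raw_agreement, kappa)
```

Raw agreement overstates reliability when one class dominates; kappa corrects for the agreement two annotators would reach by guessing, so it is the safer number to track against your 90% target.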
Deployment
Export the embedding model and classifier:
import pickle
with open('classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)
# The SentenceTransformer model is loaded at inference time
At inference: embed incoming text with the same model, run clf.predict_proba(), return the class probabilities. The entire pipeline runs in under 50ms on a standard CPU, making it compatible with synchronous API serving.
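The serialize-and-serve round trip can be sketched end to end. This example trains a tiny classifier on synthetic 384-dimensional vectors and pickles it; in a real service, the incoming text would first be embedded with the same SentenceTransformer before calling predict_proba:

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a tiny classifier on synthetic 384-dim "embeddings"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))
y = (X[:, 0] > 0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize and reload, as at deploy time
blob = pickle.dumps(clf)
restored = pickle.loads(blob)

# Inference: a same-shaped vector in, class probabilities out
probs = restored.predict_proba(rng.normal(size=(1, 384)))
print(probs.shape, float(probs.sum()))
```

Note that pickle ties you to the scikit-learn version used at training time; pin the version in your deployment environment, or consider a format like ONNX if you need cross-version stability.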
Conclusion
Deep learning is not always the right tool. For text classification with moderate data sizes, the combination of pretrained sentence embeddings and a logistic regression classifier is fast, interpretable, maintainable, and often competitive with fine-tuned transformers. Reach for this approach first — and only graduate to full fine-tuning when you have a clear, measured need.
Keywords: text classification, NLP, sentence embeddings, sentence-transformers, scikit-learn, logistic regression, text classifier, no GPU NLP, machine learning classification