Natural Language Processing

Building a Custom Text Classifier Without a Single Line of Deep Learning Code

khaled · December 2, 2024

The standard narrative around NLP in 2024 involves fine-tuning BERT, managing GPU clusters, and wrangling PyTorch DataLoaders. This is appropriate when you have large datasets and strict performance requirements. But for a surprisingly large range of real-world text classification problems, a simpler approach produces 90% of the performance with 10% of the complexity: use pretrained embeddings as features and train a lightweight classical classifier on top.

This article walks through a complete, production-ready approach that requires only Python, scikit-learn, and an API call or a lightweight embedding library.

When This Approach Works

This approach is appropriate when:

  • You have 50 to 50,000 labeled examples (for very large datasets, full fine-tuning begins to win decisively)
  • You need fast iteration — training in seconds, not hours
  • You are CPU-only or have minimal compute available
  • You want interpretable features and a simple deployment story
  • Your classes are reasonably distinct in semantic space

It works poorly for tasks requiring deep contextual understanding (long-document classification, nuanced sentiment) or for domains with very specialized vocabulary not covered by your embedding model.

Step 1: Choose an Embedding Model

The quality of your classifier depends almost entirely on the quality of your embeddings. Good choices:

  • OpenAI text-embedding-3-small (via API): excellent quality, easy to call, costs ~$0.02 per million tokens
  • sentence-transformers/all-MiniLM-L6-v2: fast, open-source, runs on CPU in milliseconds per sentence, good general quality
  • sentence-transformers/all-mpnet-base-v2: higher quality, slower, still CPU-feasible
  • FastText or GloVe word vectors (average-pooled): a much weaker baseline; use only if the other options are unavailable

For most teams, all-MiniLM-L6-v2 from the sentence-transformers library is the right default. Install with pip install sentence-transformers.

Step 2: Embed Your Training Data

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# your_dataset is assumed to be a list of dicts with 'text' and 'label' keys
texts  = [row['text'] for row in your_dataset]
labels = [row['label'] for row in your_dataset]

embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)
# embeddings.shape: (n_samples, 384)

This produces a 384-dimensional vector for each text. These vectors are your features.

Step 3: Train a Classifier

Logistic Regression is a strong default. Linear SVMs are also excellent for this use case.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))

Training takes seconds on a CPU for up to ~100,000 samples.

Step 4: Handle Class Imbalance

Real-world classification data is almost never balanced. If your positive class is 5% of the data, a model that always predicts negative achieves 95% accuracy — but is useless.

Solutions:

  • Set class_weight='balanced' in LogisticRegression to automatically reweight
  • Oversample the minority class with SMOTE (applied to the embedding vectors, not raw text)
  • Collect more labeled examples for underrepresented classes
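The first option is a one-line change. A minimal sketch of it, using synthetic vectors in place of real embeddings so the example is self-contained (the 384-dimensional arrays and the 95/5 split are stand-ins, not real data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for embedded text: 95% negative, 5% positive
rng = np.random.default_rng(42)
X_neg = rng.normal(loc=0.0, scale=1.0, size=(950, 384))
X_pos = rng.normal(loc=1.0, scale=1.0, size=(50, 384))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 950 + [1] * 50)

# class_weight='balanced' reweights each class inversely to its frequency,
# so the 5% positive class is not drowned out during training
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X, y)

# Oversampling alternative (requires the separate imbalanced-learn package),
# applied to the embedding matrix, never to raw text:
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```

Start with class_weight='balanced'; it needs no extra dependency and no change to your data pipeline.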

Step 5: Evaluate Correctly

Accuracy is the wrong metric for imbalanced problems. Use:

  • Precision and Recall per class — what fraction of predicted positives are true positives? What fraction of actual positives did you find?
  • F1-score — harmonic mean of precision and recall
  • Confusion matrix — visualize systematic errors

Pay attention to which classes the model confuses. Confusion between similar classes (e.g., "billing question" vs "refund request") is expected; confusion between obviously different classes indicates an embedding or data quality problem.
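A short sketch of reading a confusion matrix, using hypothetical labels for a support-ticket classifier (the class names and predictions here are invented for illustration):

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true labels and model predictions
y_true = ['billing', 'billing', 'refund', 'refund', 'shipping', 'shipping']
y_pred = ['billing', 'refund', 'refund', 'billing', 'shipping', 'shipping']

labels = ['billing', 'refund', 'shipping']
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# Rows are true labels, columns are predictions. Off-diagonal counts
# between 'billing' and 'refund' are the expected near-class confusion;
# a nonzero cell between, say, 'billing' and 'shipping' would be a red flag.

print(f1_score(y_true, y_pred, average='macro'))
```

Macro-averaged F1 weights every class equally, which is usually what you want when the classes are imbalanced.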

Step 6: Iterate on Your Data First, Not Your Model

The single most impactful improvement is usually more labeled data for underperforming classes, not a better model. Before reaching for BERT fine-tuning, ask: does my test set have at least 100 examples per class? Are my label guidelines clear enough that two annotators would agree 90% of the time?

Fixing data quality and quantity almost always beats model switching.
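The 90% agreement check is easy to quantify. A minimal sketch, with invented annotator labels, that computes raw percent agreement alongside Cohen's kappa (which corrects for agreement by chance):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same ten texts
ann_a = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos']
ann_b = ['pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg']

# Raw agreement: fraction of texts where both annotators chose the same label
raw_agreement = sum(a == b for a, b in zip(ann_a, ann_b)) / len(ann_a)

# Cohen's kappa discounts the agreement expected from label frequencies alone
kappa = cohen_kappa_score(ann_a, ann_b)

print(raw_agreement, kappa)
```

If agreement is well below 90%, fix the label guidelines before touching the model: no classifier can exceed the consistency of its training labels.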

Deployment

Export the embedding model and classifier:

import pickle
with open('classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)
# The SentenceTransformer model is loaded separately at inference time

At inference: embed incoming text with the same model, run clf.predict_proba(), return the class probabilities. The entire pipeline runs in under 50ms on a standard CPU, making it compatible with synchronous API serving.
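The inference step can be sketched as a small wrapper. In production the embed function would be SentenceTransformer('all-MiniLM-L6-v2').encode; here it is injected as a parameter, and a random stand-in embedder and tiny synthetic classifier are used so the example runs without downloading the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classify(texts, embed, clf):
    """Embed incoming texts with the given function and return
    one {class: probability} dict per text."""
    X = np.asarray(embed(texts))
    proba = clf.predict_proba(X)
    return [dict(zip(clf.classes_, p)) for p in proba]

# --- tiny demonstration with stand-in data and embedder ---
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(3, 1, (20, 8))])
y_train = ['neg'] * 20 + ['pos'] * 20
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

dummy_embed = lambda texts: rng.normal(3, 1, (len(texts), 8))
print(classify(['great product!'], dummy_embed, clf))
```

Keeping the embedder injected like this also makes the classifier logic trivially unit-testable without the model weights on disk.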

Conclusion

Deep learning is not always the right tool. For text classification with moderate data sizes, the combination of pretrained sentence embeddings and a logistic regression classifier is fast, interpretable, maintainable, and often competitive with fine-tuned transformers. Reach for this approach first — and only graduate to full fine-tuning when you have a clear, measured need.

Keywords: text classification, NLP, sentence embeddings, sentence-transformers, scikit-learn, logistic regression, text classifier, no GPU NLP, machine learning classification