Tokenization Tactics: Why Your NLP Pipeline Starts With This Critical Step
Every NLP pipeline begins with the same invisible step: tokenization — splitting raw text into discrete units a model can process. It sounds trivial. It is not. The tokenizer you choose determines your vocabulary size, out-of-vocabulary behavior, multilingual capability, and even how well your model handles code, numbers, and punctuation. A poorly designed tokenizer silently degrades model performance in ways that are difficult to diagnose downstream.
What Is a Token?
A token is the atomic unit of text that a language model sees. Tokens are not necessarily words. In GPT-4's tokenizer, "unbelievable" might be a single token, while "ChatGPT" might be split into "Chat" and "GPT". The rule that governs how text is segmented is the tokenization algorithm, and different algorithms make very different choices.
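A quick way to see this for yourself is to run a string through a real tokenizer. The sketch below uses the tiktoken library's GPT-4 encoding; the exact splits it prints depend on the encoding version, so treat the output as illustrative rather than canonical.

```python
import tiktoken

# GPT-4-family encoding; exact splits depend on the encoding version.
enc = tiktoken.encoding_for_model("gpt-4")
ids = enc.encode("unbelievable ChatGPT")
print([enc.decode([i]) for i in ids])  # one decoded string per token
```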
The Three Major Approaches
1. Word-Level Tokenization
The simplest approach: split on whitespace and punctuation. This produces an intuitive vocabulary but has a critical flaw — out-of-vocabulary (OOV) words. Any word not seen during training gets mapped to a generic <UNK> token, losing all information. A model trained on English news articles will silently fail on medical jargon, code snippets, or tweets.
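A toy word-level tokenizer makes the OOV failure concrete. The five-word vocabulary below is a made-up example, not taken from any real model.

```python
# Toy word-level tokenizer: anything outside the training vocabulary collapses to <UNK>.
vocab = {"<UNK>": 0, "the": 1, "model": 2, "reads": 3, "text": 4}

def word_tokenize(sentence: str) -> list[int]:
    return [vocab.get(w.lower(), vocab["<UNK>"]) for w in sentence.split()]

print(word_tokenize("The model reads text"))        # fully covered
print(word_tokenize("The model reads tokenizers"))  # unseen word -> <UNK>, information lost
```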
2. Character-Level Tokenization
The other extreme: each character is a token. This eliminates OOV entirely but produces extremely long sequences — a sentence becomes 200+ tokens — making it computationally prohibitive for transformer models with quadratic attention complexity. Rare characters in non-Latin scripts also remain problematic.
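The cost is easy to see by counting. The snippet below compares the character-token length of a sentence with its whitespace word count; the sentence itself is just an arbitrary example.

```python
sentence = "Character-level tokenization eliminates out-of-vocabulary words entirely."
char_tokens = list(sentence)    # one token per character
words = sentence.split()        # rough word-level comparison
print(len(char_tokens), "character tokens vs", len(words), "words")
```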
3. Subword Tokenization (the industry standard)
Byte-Pair Encoding (BPE), WordPiece, and SentencePiece all fall into this category. They segment text into variable-length subword units, finding a middle ground between word and character level.
- BPE (used by GPT models): starts with individual characters, then iteratively merges the most frequent adjacent pair until a vocabulary size limit is reached (a minimal merge-learning sketch follows this list).
- WordPiece (used by BERT): similar to BPE, but merges based on the pair that maximizes likelihood of the training corpus rather than raw frequency.
- SentencePiece (used by LLaMA, T5): language-agnostic; treats the input as a raw character stream with whitespace handled as an ordinary symbol, so it needs no language-specific pre-tokenization and works across scripts natively.
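To make the BPE merge loop concrete, here is a minimal, self-contained sketch of the training step: it repeatedly counts adjacent symbol pairs and fuses the most frequent one. The toy corpus and merge count are illustrative only; production tokenizers add byte-level handling, special tokens, and far larger vocabularies.

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Start with every word as a sequence of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the (weighted) corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word before the next iteration.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

print(bpe_merges(["lower", "lowest", "low", "newer", "wider"], num_merges=5))
```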
Why Tokenization Choices Matter in Practice
Vocabulary Efficiency
A 50,000-token vocabulary sounds generous until you are working in a morphologically rich language like Turkish or Finnish, where words can have 30+ inflected forms. BPE handles this by learning common morpheme-like subwords, but the efficiency gap with English is still significant. Models trained primarily on English corpora tend to over-tokenize non-English text, using more tokens per semantic unit than English requires, effectively reducing the functional context window for those languages.
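A rough way to measure this on your own data is a "fertility" check: tokens emitted per whitespace-separated word. The sketch below uses tiktoken's cl100k_base encoding and two made-up sentences with roughly parallel meaning; substitute your own text and tokenizer, and treat the printed numbers as indicative only.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family encoding

def tokens_per_word(text: str) -> float:
    """Rough fertility metric: tokens emitted per whitespace-separated word."""
    words = text.split()
    return len(enc.encode(text)) / max(len(words), 1)

# Compare an English sample against a non-English sample of similar meaning.
print(tokens_per_word("The children were playing in the garden."))
print(tokens_per_word("Çocuklar bahçede oynuyordu."))  # Turkish
```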
Numbers and Special Domains
Numbers are notoriously poorly handled by standard tokenizers. "10,000" might tokenize as five separate tokens. This is a known pain point for arithmetic, financial modeling, and scientific text. Some research tokenizers introduce number-specific schemes, but most production tokenizers still struggle here.
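You can inspect this directly by decoding each token id on its own. The snippet below uses tiktoken's cl100k_base encoding; the exact splits vary by tokenizer, so treat it as a probe rather than a rule.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["7", "10,000", "3.14159", "1234567890"]:
    ids = enc.encode(s)
    # Decode each id separately to see where the number gets split.
    print(s, "->", [enc.decode([i]) for i in ids])
```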
Tokenizer Mismatch at Fine-Tuning
A subtle but consequential bug: if you fine-tune a model with a different tokenizer than was used at pretraining, or if you apply a pretrained tokenizer to a domain (legal, medical, code) with very different character distributions, you can silently degrade performance. Always verify that the tokenizer accompanying a checkpoint matches the one used during its original training.
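A cheap sanity check before fine-tuning is to confirm that the tokenizer's vocabulary fits the model's embedding table and that ordinary text round-trips cleanly. The sketch below uses the Hugging Face transformers API with gpt2 purely as a stand-in checkpoint; swap in whatever model you are actually fine-tuning.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in; substitute the checkpoint you are fine-tuning
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# The embedding matrix must cover every id the tokenizer can emit;
# a mismatch here usually means the wrong tokenizer was loaded.
assert model.get_input_embeddings().num_embeddings >= len(tok), (
    "Tokenizer vocabulary is larger than the model's embedding table"
)

# Round-trip check: encoding then decoding should preserve ordinary text.
sample = "Fine-tuning sanity check."
assert tok.decode(tok(sample)["input_ids"], skip_special_tokens=True).strip() == sample
```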
Practical Tips for NLP Engineers
- Inspect token counts before training: pass a representative sample of your data through the tokenizer and check average tokens per sentence; outliers indicate tokenization surprises (see the sketch after this list).
- Avoid truncating blindly: if your model's context window is 512 tokens, truncating long legal documents at token 512 may cut off the most relevant clause. Use sliding windows or hierarchical chunking instead.
- Test multilingual pipelines with non-ASCII characters, right-to-left scripts, and mixed-script strings before deploying.
- Cache tokenized datasets: tokenization is CPU-intensive. Pre-tokenize and cache to disk to avoid repeating the work during every training epoch.
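For the first tip, a small profiling pass is usually enough to surface surprises. The sketch below assumes the Hugging Face transformers library and uses bert-base-uncased purely as an example; the sample_texts list stands in for a representative slice of your data.

```python
import statistics
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # substitute your tokenizer

# Stand-in for a representative sample of your training data.
sample_texts = [
    "Short sentence.",
    "A much longer sentence that will naturally produce more subword tokens.",
]

lengths = [len(tok(t)["input_ids"]) for t in sample_texts]
print("mean tokens:", statistics.mean(lengths))
print("max tokens: ", max(lengths))
# Sentences far above the mean are where truncation and chunking decisions bite.
```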
Conclusion
Tokenization is not glamorous, but it sets the ceiling for everything that follows. Understanding why your tokenizer splits the way it does — and where it fails — is a fundamental skill for any NLP practitioner. The best model in the world cannot recover information that was lost before it ever saw the input.
Keywords: tokenization, NLP pipeline, byte-pair encoding, BPE, WordPiece, SentencePiece, subword tokenization, language model tokenizer, out-of-vocabulary