GroveAI
Glossary

Tokenisation

Tokenisation is the process of breaking text into smaller units called tokens (words, sub-words, or characters) that AI models can process. Tokens are the fundamental units of input and output for language models.

What is Tokenisation?

Tokenisation is the first step in how a language model processes text. Before a model can work with text, it must convert it into a sequence of numerical tokens — discrete units that the model's architecture can manipulate. A token might represent a whole word ("hello"), a sub-word ("un" + "likely"), a single character, or even a punctuation mark. The specific mapping between text and tokens is defined by the model's tokeniser, which is trained alongside the model. Different models use different tokenisers, which means the same text can be split into different tokens depending on the model being used. Understanding tokenisation is important because tokens directly determine context limits, costs, and processing speed.
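As a minimal sketch of the text-to-token mapping described above, a tokeniser's vocabulary can be thought of as a lookup from sub-word strings to integer ids. The vocabulary below is invented purely for illustration; real vocabularies contain tens of thousands of entries, and a real tokeniser also decides how to split the text in the first place.

```python
# Toy vocabulary: sub-word strings mapped to integer ids (invented for illustration).
vocab = {"un": 11, "likely": 12, "hello": 13, "!": 14}

def to_ids(tokens):
    # Look up each sub-word unit in the vocabulary; the model only ever
    # sees this sequence of integers, never the raw characters.
    return [vocab[t] for t in tokens]

print(to_ids(["hello", "!", "un", "likely"]))  # [13, 14, 11, 12]
```

Note how "unlikely" arrives as two tokens ("un" + "likely"): the model processes the word as two separate units, which is why the same text can cost different amounts under different tokenisers.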

How Tokenisation Works

Most modern language models use subword tokenisation algorithms like Byte-Pair Encoding (BPE) or SentencePiece. These algorithms learn to split text into a vocabulary of common subword units. Frequently used words remain as single tokens ("the", "and"), while rare or complex words are split into multiple tokens ("tokenisation" might become "token" + "isation"). A typical tokeniser vocabulary contains 30,000 to 100,000 tokens. The vocabulary size represents a trade-off: larger vocabularies mean fewer tokens per text (more efficient) but require more model parameters. Smaller vocabularies are more compact but produce longer token sequences. As a rough guide, one token corresponds to approximately 3-4 characters of English text, or about 75% of a word. A 1,000-word document is typically around 1,300-1,500 tokens, though this varies significantly by language and content type. Code, technical text, and non-English languages often require more tokens per word.
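The merge step at the heart of BPE can be sketched in a few lines. The merge table below is hand-written rather than learned from a corpus, and the resulting split is purely illustrative; a trained tokeniser would derive its merges (and their priority order) from frequency statistics.

```python
def bpe_tokenise(word, merges):
    # Start from individual characters, then repeatedly apply the
    # highest-priority learned merge (lower rank = higher priority).
    tokens = list(word)
    while True:
        best, best_rank = None, None
        for i in range(len(tokens) - 1):
            rank = merges.get((tokens[i], tokens[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best, best_rank = (tokens[i], tokens[i + 1]), rank
        if best is None:
            return tokens  # no applicable merges remain
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

# Toy merge table (rank = priority); a real one is learned from data.
merges = {("t", "o"): 0, ("k", "e"): 1, ("to", "ke"): 2, ("toke", "n"): 3}
print(bpe_tokenise("tokenisation", merges))
# ['token', 'i', 's', 'a', 't', 'i', 'o', 'n']
```

The common prefix "token" collapses into a single unit while the rarer suffix stays as individual characters, mirroring how frequent words end up as one token and rare words as several.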

Why Tokenisation Matters for Business

Tokenisation directly impacts three critical aspects of AI deployment: cost, context limits, and performance. Most AI API providers charge per token, so understanding how your content tokenises is essential for cost forecasting. A document that tokenises into twice as many tokens costs twice as much to process. Context windows are measured in tokens, not words. A model with a 128,000-token context window can process roughly 96,000 words — but this varies based on the content. Understanding this relationship is important for designing applications that stay within context limits. Tokenisation quality also affects model performance. Models perform better on text that tokenises efficiently (common words, standard English) and may struggle with heavily tokenised content (unusual terminology, non-Latin scripts, code in uncommon languages). This has practical implications for multilingual applications and specialised domains.
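The cost arithmetic above can be sketched with the rough characters-per-token heuristic. The price used here is hypothetical, chosen only to show the calculation; real per-token prices vary by provider and model, and real token counts should come from the model's own tokeniser.

```python
def estimate_tokens(text, chars_per_token=4):
    # Rough heuristic from the guide: ~3-4 characters of English per token.
    return max(1, len(text) // chars_per_token)

def estimate_cost_usd(n_tokens, price_per_million_tokens):
    # Per-token billing, expressed per million tokens (a common pricing unit).
    return n_tokens / 1_000_000 * price_per_million_tokens

doc = "word " * 1000                    # a 1,000-word stand-in document
tokens = estimate_tokens(doc)           # 1,250 tokens at 4 chars/token
cost = estimate_cost_usd(tokens, 3.00)  # assuming a hypothetical $3 per million tokens
print(tokens, round(cost, 5))           # 1250 0.00375
```

A document that tokenises into twice as many tokens doubles `cost` directly, which is why token-efficiency matters for high-volume workloads.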

Practical Implications

When building AI applications, tokenisation considerations include estimating API costs based on average token counts per request, designing prompts that are token-efficient without sacrificing clarity, chunking documents appropriately for RAG systems (splitting at meaningful token boundaries), and testing how domain-specific terminology tokenises in your chosen model. Tools like OpenAI's tiktoken library allow developers to count tokens before making API calls, enabling precise cost estimation and context management. Most AI development frameworks provide similar tokenisation utilities for their respective models.
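One of the considerations above, chunking documents for RAG within a token budget, can be sketched with the standard library alone. This uses the rough 4-characters-per-token estimate in place of a real tokeniser; in production you would count tokens with the model's actual tokeniser (e.g. tiktoken for OpenAI models), and the sentence-splitting regex is a simplification.

```python
import re

def chunk_by_token_budget(text, max_tokens=200, chars_per_token=4):
    # Split on sentence boundaries, then pack sentences into chunks whose
    # estimated token count stays under the budget.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_est = [], [], 0
    for sentence in sentences:
        est = max(1, len(sentence) // chars_per_token)
        if current and current_est + est > max_tokens:
            chunks.append(" ".join(current))
            current, current_est = [], 0
        current.append(sentence)
        current_est += est
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_by_token_budget("One. Two. Three.", max_tokens=2))
# ['One. Two.', 'Three.']
```

Splitting at sentence boundaries rather than at arbitrary character offsets keeps each chunk semantically coherent, which generally improves retrieval quality.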

FAQ


How many tokens is a word?

In English, a word averages about 1.3 tokens. Common short words like "the" or "and" are single tokens, while longer or uncommon words may be split into 2-4 tokens. Numbers, code, and non-English text typically have higher token-to-word ratios.

Do all models tokenise text the same way?

No. Each model uses its own tokeniser with a different vocabulary. The same text might be split differently by GPT-4's tokeniser versus Claude's or Llama's. This means token counts and costs are not directly comparable across providers without checking their specific tokenisation.

Can tokenisation affect model performance?

Yes. Text that tokenises into very small fragments (individual characters or byte-level tokens) can be harder for models to process effectively. This is one reason models sometimes struggle with unusual formatting, rare languages, or heavily encoded text.

Need help implementing this?

Our team can help you apply these concepts to your business. Book a free strategy call.