Tokenisation
Tokenisation is the process of breaking text into smaller units called tokens (words, sub-words, or characters) that AI models can process. Tokens are the fundamental unit of input and output for language models.
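To make this concrete, here is a minimal sketch of tokenisation in Python. It assumes the open-source tiktoken library is installed (pip install tiktoken), which provides the byte-pair-encoding tokenisers used by several OpenAI models; other models use different tokenisers, so the exact splits and token IDs shown are illustrative only.

```python
# Minimal sketch: splitting text into tokens with tiktoken (assumes `pip install tiktoken`).
import tiktoken

# Load a byte-pair-encoding tokeniser; cl100k_base is used by several OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenisation breaks text into smaller units."
token_ids = enc.encode(text)                           # list of integer token IDs
token_strings = [enc.decode([t]) for t in token_ids]   # the text fragment each ID represents

print(token_ids)       # exact IDs depend on the tokeniser
print(token_strings)   # e.g. ['Token', 'isation', ' breaks', ' text', ...]
print(len(token_ids), "tokens for", len(text.split()), "words")
```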
Frequently asked questions
How many tokens is a word?
In English, a word averages about 1.3 tokens. Common short words like "the" or "and" are single tokens, while longer or uncommon words may be split into 2-4 tokens. Numbers, code, and non-English text typically have higher token-to-word ratios.
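As a rough check of that ratio, the sketch below (again assuming the tiktoken library) counts tokens and words for a sample sentence; 1.3 tokens per word is an average, so individual sentences will vary.

```python
# Rough tokens-per-word estimate for English text (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sample = ("The quick brown fox jumps over the lazy dog, "
          "demonstrating roughly how English prose tokenises.")

tokens = enc.encode(sample)
words = sample.split()

print(f"{len(tokens)} tokens / {len(words)} words "
      f"= {len(tokens) / len(words):.2f} tokens per word")
```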
Do all models use the same tokeniser?
No. Each model uses its own tokeniser with a different vocabulary. The same text might be split differently by GPT-4's tokeniser versus Claude's or Llama's, so token counts and costs are not directly comparable across providers without checking their specific tokenisation.
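The sketch below illustrates this by running the same sentence through several tokeniser vocabularies. All of these are OpenAI encodings shipped with tiktoken (Claude's and Llama's tokenisers live in other libraries), but the point carries over: the token count depends on the tokeniser, not just the text.

```python
# Compare how different tokenisers split the same text (assumes `pip install tiktoken`).
# These are all OpenAI vocabularies shipped with tiktoken; o200k_base requires a recent
# tiktoken version. Other providers' tokenisers show the same kind of variation.
import tiktoken

text = "Tokenisation costs depend on the tokeniser, not just the text."

for name in ["gpt2", "p50k_base", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name:12s} -> {len(enc.encode(text))} tokens")
```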
Can tokenisation cause problems?
Yes. Text that tokenises into very small fragments (individual characters or byte-level tokens) can be harder for models to process effectively. This is one reason models sometimes struggle with unusual formatting, rare languages, or heavily encoded text.
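The sketch below illustrates the effect by comparing plain English with the same sentence base64-encoded: the encoded version fragments into far more tokens per character. It again assumes the tiktoken library, and the exact counts depend on the tokeniser used.

```python
# Show how heavily encoded text fragments into many more tokens per character
# (assumes `pip install tiktoken`; exact counts depend on the tokeniser).
import base64
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

plain = "Tokenisation is the first step in processing text."
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")  # heavily encoded text

for label, text in [("plain English", plain), ("base64-encoded", encoded)]:
    tokens = enc.encode(text)
    print(f"{label:15s}: {len(tokens)} tokens for {len(text)} characters "
          f"({len(tokens) / len(text):.2f} tokens per character)")
```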