In real-world terms, the context length of a language model is measured not in words, but in tokens. To understand how context windows work in practice, it’s important to understand how these tokens work.
The way LLMs process language is fundamentally different from the way humans do. Whereas the smallest unit of information we use to represent language is a single character—such as a letter, number or punctuation mark—the smallest unit of language that AI models use is a token. Each token is assigned an ID number, and it is these ID numbers, rather than the words or even the tokens themselves, that the model is actually trained on. This tokenization of language significantly reduces the computational power needed to process and learn from text.
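As a concrete illustration, the short Python sketch below converts a sentence into tokens and then into ID numbers. It assumes the Hugging Face `transformers` library is installed and uses the GPT-2 tokenizer, a choice made purely for demonstration; any tokenizer would illustrate the same mapping.

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer is used here only as an example; any tokenizer on the
# Hugging Face Hub exposes the same interface.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization turns text into numbers."

# Split the text into tokens, then map each token to its ID number
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

for token, token_id in zip(tokens, token_ids):
    print(f"{token!r} -> {token_id}")

# The model never sees the raw text, only sequences of IDs like this one
print(tokenizer(text)["input_ids"])
```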
There is a wide variance in the amount of text that one token can represent: a token can stand in for a single character, a part of a word (such as a suffix or prefix), a whole word or even a short multiword phrase. Consider the different roles played by the letter “a” in the following examples:
“Jeff drove a car.”
Here, “a” is an entire word. In this situation, it would be represented by a distinct token.
“Jeff is amoral.”
Here, “a” is not a word, but its addition to “moral” significantly changes the meaning of the word. “Amoral” would therefore be represented by two distinct tokens: a token for “a” and another for “moral.”
"Jeff loves his cat."
Here, is simply a letter in the word “.” It carries no semantic meaning unto itself and would, therefore, not need to be represented by a distinct token.
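Running the three sentences above through an actual tokenizer makes these distinctions visible. The sketch below again assumes the Hugging Face `transformers` library and the GPT-2 tokenizer; the exact splits it prints are tokenizer-specific, so other models may divide the sentences differently.

```python
from transformers import AutoTokenizer

# Tokenizer choice is illustrative; different tokenizers may split differently
tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentences = [
    "Jeff drove a car.",
    "Jeff is amoral.",
    "Jeff loves his cat.",
]

for sentence in sentences:
    # Inspect whether "a" surfaces as its own token, as part of a word piece,
    # or simply as a character inside a larger token
    print(sentence, "->", tokenizer.tokenize(sentence))
```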
There is no fixed word-to-token “exchange rate,” and different models or tokenizers—the modular component of a larger model pipeline responsible for tokenization—might tokenize the same passage of writing differently. Efficient tokenization can help increase the actual amount of text that fits within the confines of a context window. For general purposes, though, a decent estimate is roughly 1.5 tokens per word. The Tokenizer Playground on Hugging Face is an easy way to see and experiment with how different models tokenize text inputs.
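The same comparison can be made programmatically. The sketch below, which assumes the `transformers` library and two arbitrarily chosen tokenizers (GPT-2 and BERT), counts how many tokens each one produces for the same passage and reports the resulting tokens-per-word ratio.

```python
from transformers import AutoTokenizer

passage = (
    "There is no fixed word-to-token exchange rate, and different "
    "tokenizers might tokenize the same passage of writing differently."
)
word_count = len(passage.split())

# Both model names are examples; any tokenizer on the Hugging Face Hub works
for model_name in ["gpt2", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # add_special_tokens=False counts only the tokens for the text itself
    token_count = len(tokenizer.encode(passage, add_special_tokens=False))
    print(f"{model_name}: {token_count} tokens, "
          f"{token_count / word_count:.2f} tokens per word")
```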
Variations in linguistic structure and in how well a language is represented in training data can result in some languages being tokenized more efficiently than others. For example, an October 2024 study compared the same sentence tokenized in English and in Telugu: despite the Telugu translation having significantly fewer characters than its English equivalent, it consumed over 7 times as many tokens of context.
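The measurement behind that comparison is straightforward to reproduce: tokenize parallel sentences with a single tokenizer and compare character counts with token counts. The sketch below is a stand-in, not the study’s data; it assumes the `transformers` library, the multilingual `xlm-roberta-base` tokenizer, and an English/Spanish sentence pair chosen purely for illustration.

```python
from transformers import AutoTokenizer

# A multilingual tokenizer; the model and the sentence pair are illustrative
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Spanish": "El rápido zorro marrón salta sobre el perro perezoso.",
}

for language, sentence in samples.items():
    token_count = len(tokenizer.encode(sentence, add_special_tokens=False))
    print(f"{language}: {len(sentence)} characters, {token_count} tokens")
```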