LLM Tokenizers
This is something that is obvious to those who work with Large Language Models, but it was not immediately evident to me when someone asked me about it.
Why does the tokenizer require training? Coming from tokenizers for computer languages, this is indeed a bit surprising: there the tokenizer is fixed and does not change as more data about the language becomes available. Those tokenizers are often a bit ugly, and the interface to the language grammar is not totally clean, but they are simple and fixed.
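For contrast, a computer-language tokenizer can be written down once as a handful of fixed rules. Here is a toy sketch (not any particular compiler's lexer, the token set is invented for illustration):

```python
# Toy example of a fixed computer-language tokenizer: the token rules are
# written by hand once and never change, no matter how much code exists.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source: str):
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(lex("x = 41 + foo")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '41'), ('OP', '+'), ('IDENT', 'foo')]
```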
An LLM tries to map each token to an internal representation, and from it "predict" the next token. But what is a token in this case? Thinking about what an LLM does, it should be clear that an ideal tokenizer does not just separate out words, a few structural elements, and keywords (as for computer languages), but should try to extract the common sequences: things like the root of a word, common suffixes, and in general the grammar and structure of the language. An LLM will have a much simpler and more efficient representation of a sentence if it bases it on the roots of words and their modifiers (prefixes and suffixes), instead of having to learn every variation as a completely different word. This should be especially clear to a German-speaking person, with all its compound words, but other languages also have words and verbs with common roots and common suffixes.
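To make this concrete, here is a toy sketch with an entirely made-up subword vocabulary (real tokenizers learn their vocabulary and segment differently): a greedy longest-match split showing how a few roots and suffixes can cover many word variations.

```python
# Toy illustration: greedy longest-match segmentation over a tiny,
# hand-picked subword vocabulary of roots and suffixes.
VOCAB = {"run", "ning", "ner", "s", "walk", "ed", "ing", "er"}

def tokenize(word: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(word):
        # take the longest vocabulary entry that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # unknown character: fall back to a single-character token
            tokens.append(word[i])
            i += 1
    return tokens

for w in ["running", "runner", "runs", "walked", "walking"]:
    print(w, "->", tokenize(w))
# running -> ['run', 'ning']
# runner -> ['run', 'ner']
# runs -> ['run', 's']
```

With eight subword tokens the sketch covers five different surface forms; learning each full word as its own token would need five entries and generalize to none of their other variations.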
The similarities in the sounds of words within the big language families make one think that trying to take advantage of them is probably useful. A surprisingly simple and good way to do this is to find a dictionary that compresses the language well: common suffixes and prefixes will be represented with a single token, and the same goes for repeated roots. This is not perfect, but it is simple and language agnostic; for it to work, though, it has to be trained on the language.
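Below is a minimal sketch of this idea in the spirit of byte-pair encoding (BPE), one common way such vocabularies are learned; the tiny corpus and the number of merges are invented for illustration. It repeatedly merges the most frequent adjacent pair of symbols, so character sequences shared across many words end up as single tokens.

```python
# Minimal BPE-style training sketch: start from characters and repeatedly
# merge the most frequent adjacent pair of symbols into a new token.
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    words = Counter(tuple(w) for w in corpus)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # rewrite every word, replacing the best pair with a merged symbol
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# A made-up mini corpus of related German-like forms: the shared
# character sequences ("auf", "en", ...) get merged into single tokens.
corpus = ["laufen", "laufend", "läufer", "kaufen", "kaufend", "käufer"]
print(train_bpe(corpus, 5))
```

Nothing in this procedure knows anything about German morphology; the shared roots and endings fall out purely because merging frequent sequences is what compresses the corpus best, which is why the tokenizer has to see data from the language.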
The idea of using compression is not so surprising given that prediction and compression are closely related activities (see for example this post).