Tokenization Matters?
TL;DR
Tokenization matters!
Introduction
It’s become a fairly mainstream idea that many model errors can be blamed on tokenization. Recent examples include “9.11 is larger than 9.9” and “there are 2 r’s in strawberry”. I’m not so sure about this view. I think that for every scenario where a model failure can be attributed to bad tokenization, there are hundreds of scenarios where models succeed despite bad tokenization.
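To make the strawberry example concrete, here is a quick peek at what the model actually receives. This is just a sketch assuming the tiktoken package and its cl100k_base vocabulary; any BPE tokenizer makes the same point.

```python
# Inspect how a BPE tokenizer chunks "strawberry".
# Assumes the `tiktoken` package with the cl100k_base vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print([enc.decode([i]) for i in ids])
# The model sees a couple of multi-character chunks, never individual
# letters, which is part of why counting r's is harder than it looks.
```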
But the influence of tokenization is very real, and possibly underappreciated. I’ve noticed a common frustration across multiple works that is quite evident from their titles: Tokenization Matters:…, Tokenization counts:…, Tokenization Matters!…, and of course wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??. Clearly, many believe the effects of tokenization are being overlooked. I tend to agree!
As model training runs become more and more extravagant, we should be paying attention to how the tokenizer shapes the model. Keep in mind that once a model is trained with a particular tokenizer, it is not trivial to substitute tokenizers afterwards.
It’s also very challenging to evaluate tokenizers. There are some intrinsic evaluations such as fertility and compression effectiveness, but they are not perfect indicators of performance. On the other hand, downstream performance evaluations require fully training a corresponding model (which would be subject to its own hyperparameters), making them prohibitively costly.
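To be concrete about the first of those metrics: fertility is usually computed as the average number of tokens per word. A minimal sketch below, assuming a Hugging Face tokenizer; the checkpoint and the two-sentence corpus are placeholders.

```python
# Fertility: average number of tokens produced per whitespace-delimited word.
# Lower is better intrinsically, but it says little about downstream quality.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    return n_tokens / n_words

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder checkpoint
corpus = ["Tokenization matters!", "토큰화는 중요하다!"]  # placeholder corpus
print(fertility(tok, corpus))
```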
My thoughts
Some quick writeups of my personal thoughts regarding tokenization.
The tokenizer has so many design choices, and many current implementations may be underengineered. Even great models have some odd tokenizer oversights. Many heuristic tricks are inherited from much older models (e.g., reusing GPT2’s regex), while new tricks are… confusing.
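For reference, the regex in question is GPT-2’s pre-tokenization pattern from encoder.py, which later tokenizers keep inheriting with small tweaks. It needs the regex package for the \p{...} Unicode classes.

```python
# GPT-2's pre-tokenization pattern: splits text into contractions, word-ish
# chunks (with a leading space attached), digit runs, punctuation runs, and
# whitespace, before BPE merges are applied within each chunk.
import regex

GPT2_PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)
print(GPT2_PAT.findall("Tokenization matters, doesn't it?"))
# e.g. ['Tokenization', ' matters', ',', ' doesn', "'t", ' it', '?']
```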
I am very optimistic about tokenizer transfer, as it would enable building efficient specialist models from well-trained base models. At the moment, we know so little about what’s in a token representation. I think we’re getting closer though, like understanding the mechanism by which tokens become words.
LLMs are still remarkably robust to suboptimal tokenization. I think this has allowed tokenization to go underappreciated, since the good ol’ “stack more layers/training data” approach seems to be good enough at preventing problems. However, even slightly adversarial tokenization scenarios can fully mislead models. We should think about what does and doesn’t happen during training, and how it can shape behaviors such as bias or fragility.
Currently, tokenizers are mostly linguistics-agnostic. For instance, the BPE algorithm is purely optimized for data compression. Admittedly, this works very well, but surely it is not optimal? I am becoming more convinced that tokenization is more than compression, and feel that morpheme-backed subwords are bound to be better. Point 3 above (LLMs’ robustness to suboptimal tokenization) makes this very tricky to verify, though.
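To show how compression-centric this is, the entire BPE training objective fits in a few lines: greedily merge the most frequent adjacent pair, over and over. A toy sketch (not any particular library’s implementation):

```python
# Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair.
# Nothing in this objective knows about morphemes; frequency is all there is.
from collections import Counter

def bpe_train(words, num_merges):
    vocab = Counter(tuple(w) for w in words)  # word (as a tuple of symbols) -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(bpe_train(["lower", "lowest", "newer", "wider", "lowly"], 6))
```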
To make safer models, we will need safer tokenizers. Most tokenizers have a bunch of undertrained tokens that can cause model errors or aid in jailbreaks. I really appreciate this paper; it kickstarted a lot of systematic tokenizer analysis. But I am of the opinion that undertrained tokens are only one type of vulnerability rooted in the tokenizer. I discovered one such vulnerability in byte-level BPE tokenizers recently.
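As a flavor of what such analysis can look like, here is a loose sketch inspired by embedding-based undertrained-token detection (not the paper’s actual method; the checkpoint and the cutoff of 20 are arbitrary assumptions): rank vocabulary entries by how close their input embedding sits to the mean embedding, since tokens that were rarely updated during training tend to stay bunched near where they started.

```python
# Rough undertrained-token hunt: rank tokens by how close their input
# embedding is to the mean embedding vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

emb = model.get_input_embeddings().weight.detach()  # (vocab_size, hidden_dim)
dist = (emb - emb.mean(dim=0)).norm(dim=1)          # distance from the mean vector
suspects = torch.argsort(dist)[:20]                 # 20 most suspicious token ids
print(tok.convert_ids_to_tokens(suspects.tolist()))
```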
Multilingual (Korean) tokenization
I have been thinking a lot about multilingual tokenization, mostly kickstarted by some intuitions afforded by seeing how bad Korean tokenization is. Modern Korean characters are each represented by 3 bytes in UTF-8, and it’s a shame that these 3 bytes bear almost no relation to the character’s internal structure. This is especially upsetting since each Korean character is a syllable block composed of 2–3 jamo anyway, and tokenization would most likely improve if we used the jamo as building blocks instead of bytes. I am very surprised that no major Korean company has put out a model with jamo-level Korean tokenization, since this research direction has been established in the literature.
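To illustrate the mismatch: the UTF-8 bytes of a syllable block are opaque, while the jamo fall out of simple arithmetic via the standard Unicode Hangul decomposition (the jamo tables below are just written out by hand).

```python
# One Hangul syllable block at the byte level vs. the jamo level.
ch = "한"
print(ch.encode("utf-8"))  # b'\xed\x95\x9c' -- three opaque bytes

# Unicode Hangul decomposition: offset = lead * 588 + vowel * 28 + tail.
code = ord(ch) - 0xAC00
lead, vowel, tail = code // 588, (code % 588) // 28, code % 28
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
print(LEADS[lead], VOWELS[vowel], TAILS[tail])  # ㅎ ㅏ ㄴ
```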
Other languages with multi-byte characters suffer from similar problems, and I wonder how much is being overlooked because these issues are not immediately obvious to many in the English-centric research community. Needing multiple bytes to represent a single character puts a lot of burden on the tokenizer and the model, opening the door to a lot of potential fragility. Some languages, like Korean, may have characters with compositional structure that is lost after being processed by the grinder that is UTF-8. I think many languages could benefit from specialized tokenizers that exploit linguistic features of the language, and tokenizer transfer would be very important in that case.