When you talk to an LLM, you type in words. But computers don’t understand words the way humans do. They only work with numbers. So, how does your sentence, “Good morning, how are you?” turn into something a machine can process? The answer is tokenization.
Tokenization is the process of breaking down text into smaller pieces called tokens. These tokens are then converted into numbers so that the model can work with them. A token can be as small as a single character, like “a” or “!”, or as large as a word or even part of a word. For example, the sentence “cats are running” might be split into the tokens [“cats”, “are”, “run”, “ning”]. Notice how the word “running” got split into two tokens—models often use this trick to handle all sorts of words, including rare or very long ones.
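To make this concrete, here is a minimal sketch of subword splitting using a greedy longest-match rule over a tiny, made-up vocabulary. Real tokenizers (such as BPE-based ones) learn their vocabularies from large amounts of text, so the actual pieces will differ, but the idea of breaking "running" into "run" + "ning" is the same.

```python
# Toy subword tokenizer: greedily match the longest piece found in a
# small, hand-picked vocabulary. Purely illustrative; real vocabularies
# are learned from data and contain tens of thousands of pieces.

VOCAB = {"cats", "are", "run", "ning", "cat", "s"}

def tokenize_word(word: str) -> list[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible substring first, then shrink until a match is found.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            pieces.append(word[i])
            i += 1
    return pieces

def tokenize(text: str) -> list[str]:
    return [piece for word in text.split() for piece in tokenize_word(word)]

print(tokenize("cats are running"))  # ['cats', 'are', 'run', 'ning']
```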
Once split into tokens, each token is mapped to a unique number using something like a dictionary. Imagine a giant lookup table where “cat” might correspond to 105, “run” to 876, and “morning” to 4521. Now, instead of working with words, the computer sees a sequence of numbers. In this way, “Good morning” could become [210, 4521]. Numbers are the language of machines, and this transformation allows the LLM to start doing its work.
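The lookup step itself is nothing more exotic than a dictionary. Below is a small sketch using the example IDs mentioned above; a real vocabulary would map tens of thousands of tokens, but the mechanics are identical.

```python
# Token-to-ID lookup: a plain dictionary in each direction.
# The specific numbers here are the illustrative ones from the text.

token_to_id = {"Good": 210, "morning": 4521, "cat": 105, "run": 876}
id_to_token = {i: t for t, i in token_to_id.items()}

def encode(tokens: list[str]) -> list[int]:
    """Turn a list of tokens into the numbers the model actually sees."""
    return [token_to_id[t] for t in tokens]

def decode(ids: list[int]) -> list[str]:
    """Map numbers back to tokens, e.g. when reading the model's output."""
    return [id_to_token[i] for i in ids]

print(encode(["Good", "morning"]))  # [210, 4521]
print(decode([210, 4521]))          # ['Good', 'morning']
```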
Think of tokenization like translating human language into a secret code that a machine understands. A nice analogy is LEGO bricks. Each word you write is like a toy structure, and tokenization breaks it down into smaller LEGO pieces. Some words stay whole, like a single block, while others are split into multiple smaller bricks. Once in this form, the computer can reassemble and rearrange the bricks to generate something new.
Another helpful analogy is cooking. Imagine you have a recipe written in full sentences. Before you can cook, you first chop the vegetables, measure the spices, and prepare each ingredient. Tokenization is like this preparation step—it breaks sentences into neat, measurable pieces so the “chef” (the model) can mix them and create something meaningful.
How text is tokenized affects how well an LLM can understand and generate language. If you tokenize poorly, the model struggles. If you tokenize efficiently, the model can capture meaning and relationships between words more accurately. For example, tokenizing "New York" into the pieces ["New", "York"] lets the model learn that these two tokens together usually name a place, whereas shredding the phrase into individual characters would make that relationship much harder to pick up.
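A tiny illustration of that granularity trade-off, assuming a word-level split versus a character-level split of the same phrase: the coarser split hands the model two meaningful pieces, while the finer one hands it seven fragments that individually say little.

```python
# Compare two ways of splitting the same phrase. This is only a sketch of
# granularity, not of any particular tokenizer's behavior.

phrase = "New York"

word_level = phrase.split()                    # ['New', 'York']
char_level = [c for c in phrase if c != " "]   # ['N', 'e', 'w', 'Y', 'o', 'r', 'k']

print(len(word_level), "tokens:", word_level)
print(len(char_level), "tokens:", char_level)
```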
Before an LLM can generate fluent sentences or answer your questions, it first needs to turn your words into numbers through tokenization. It’s the very first step in the pipeline that makes everything else possible. Without tokenization, an LLM would be staring at raw letters without any way to process them.