Friday, April 25, 2025

How Do Tokenizers Work for Large Language Models (LLMs)?

=========

Introduction

=========


Tokenization is the process of splitting text into smaller units called tokens. Tokens can be words, subwords, or characters. Tokenization is crucial for Large Language Models (LLMs) because it converts human-readable text into numerical representations that the model can process.


For example, the sentence:

"I love machine learning!"

can be tokenized into:

["I", "love", "machine", "learning", "!"]


However, modern tokenizers typically use subword tokenization methods such as Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. These methods break rare or unseen words into smaller, more frequent pieces, so the model can still represent words it has never seen as a whole instead of collapsing them into a single unknown token.
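To make this concrete, here is a minimal sketch (assuming the transformers package and the GPT-2 tokenizer that is also used later in this post) showing how a common word and a rare, made-up word are handled. The exact pieces you get depend on the tokenizer's learned vocabulary, so treat the output as illustrative only.

# Subword tokenization in action (assumes: pip install transformers)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 uses byte-level BPE

# A frequent word typically stays as a single token, while a rare or made-up
# word is broken into smaller pieces that do exist in the vocabulary.
print(tokenizer.tokenize("learning"))
print(tokenizer.tokenize("untokenizable"))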


===============================

Using a Hugging Face Tokenizer in Python

===============================


Hugging Face's transformers library provides easy-to-use tokenizers for popular pretrained language models.


First, install the necessary library:

pip install transformers


With the library installed, the following script loads the GPT-2 tokenizer and converts a sentence first into tokens and then into token IDs:


# Importing the tokenizer class from Hugging Face transformers

from transformers import AutoTokenizer


# Load a pretrained tokenizer (for example, GPT-2 tokenizer)

tokenizer = AutoTokenizer.from_pretrained("gpt2")


# Sample text to tokenize

text = "Hello, tokenizer! How are you doing today?"


# Tokenize the input text

tokens = tokenizer.tokenize(text)


# Convert tokens to token IDs (numerical representation)

token_ids = tokenizer.convert_tokens_to_ids(tokens)


# Display the results

print("Original text:", text)

print("Tokens:", tokens)

print("Token IDs:", token_ids)


When running the above code, you might see output similar to:

Original text: Hello, tokenizer! How are you doing today?

Tokens: ['Hello', ',', 'Ġtoken', 'izer', '!', 'ĠHow', 'Ġare', 'Ġyou', 'Ġdoing', 'Ġtoday', '?']

Token IDs: [15496, 11, 11241, 7509, 0, 2437, 389, 345, 1833, 1591, 30]


(Note: The character 'Ġ' marks a leading space in the GPT-2 tokenizer's byte-level BPE vocabulary.)
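In practice you rarely call tokenize() and convert_tokens_to_ids() separately. As a quick sketch, continuing with the tokenizer and text variables defined above, the more common one-step calls look like this:

# One-step encoding: text -> token IDs, and back again
ids = tokenizer.encode(text)      # tokenize and convert to IDs in one call
print("Token IDs:", ids)

decoded = tokenizer.decode(ids)   # IDs -> text (the 'Ġ' markers become spaces again)
print("Decoded:", decoded)

# Calling the tokenizer object directly returns model-ready inputs,
# including the token IDs and an attention mask
encoded = tokenizer(text)
print(encoded["input_ids"])
print(encoded["attention_mask"])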


============================

Building a Custom Tokenizer in Python

============================


Sometimes, you might need a simple custom tokenizer tailored to specific needs. Let's create a basic tokenizer from scratch in Python that splits text into words and punctuation marks separately.


Here is a simple custom tokenizer implementation:


# Importing regular expressions library

import re


# Custom tokenizer function

def custom_tokenizer(text):

    # Define a regular expression pattern to match words and punctuation

    pattern = r"\w+|[^\w\s]"

    

    # Find all matches of the pattern in the input text

    tokens = re.findall(pattern, text)

    

    return tokens


# Example usage

sample_text = "Hello, tokenizer! How are you today?"


# Tokenize the sample text using our custom tokenizer

custom_tokens = custom_tokenizer(sample_text)


# Display the results

print("Original text:", sample_text)

print("Custom tokens:", custom_tokens)


Output:

Original text: Hello, tokenizer! How are you today?

Custom tokens: ['Hello', ',', 'tokenizer', '!', 'How', 'are', 'you', 'today', '?']


Explanation of the custom tokenizer:

- The regular expression pattern "\w+|[^\w\s]" matches either:

  1. "\w+": one or more word characters (letters, digits, underscores)

  2. "[^\w\s]": any single character that is not a word character or whitespace (thus, punctuation marks)

- The re.findall() method returns all matching substrings as a list, effectively splitting the text into words and punctuation separately.
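One limitation worth knowing: the pattern has no special handling for apostrophes or decimal points, so it treats them as separate punctuation tokens. A quick, purely illustrative check with the custom_tokenizer function above:

# The pattern treats the apostrophe and the decimal point as punctuation
print(custom_tokenizer("Don't stop!"))   # ['Don', "'", 't', 'stop', '!']
print(custom_tokenizer("2.5 million"))   # ['2', '.', '5', 'million']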


===============================================

Custom Tokenizer Implementation in Python (Real-World Usage)

===============================================


The basic tokenizer above only splits text. To be useful in a real pipeline, a tokenizer also needs a vocabulary and a way to map tokens to IDs and back. The class below adds those pieces and performs the following tasks:

- Splits text into tokens based on words and punctuation

- Builds a vocabulary from given text corpus

- Encodes text into numerical token IDs

- Decodes numerical token IDs back into text

- Saves and loads vocabulary for reuse


=================

Complete Python Code

=================


# Import required libraries

import re

import json

import os


class CustomTokenizer:

    def __init__(self, vocab=None):

        """

        Initialize the tokenizer with an optional existing vocabulary.

        If no vocabulary is provided, start with empty dictionaries.

        """

        if vocab:

            self.token_to_id = vocab

            self.id_to_token = {id_: tok for tok, id_ in vocab.items()}

        else:

            self.token_to_id = {}

            self.id_to_token = {}


    def tokenize(self, text):

        """

        Tokenize input text into words and punctuation marks.

        """

        # Regular expression pattern to match words and punctuation

        pattern = r"\w+|[^\w\s]"

        tokens = re.findall(pattern, text)

        return tokens


    def build_vocab(self, texts, min_freq=1):

        """

        Build vocabulary from a list of texts.

        Only include tokens that appear at least 'min_freq' times.

        """

        token_freq = {}


        # Count frequency of each token across all texts

        for text in texts:

            tokens = self.tokenize(text.lower())

            for token in tokens:

                token_freq[token] = token_freq.get(token, 0) + 1


        # Filter tokens by minimum frequency and sort by frequency

        filtered_tokens = sorted(

            [tok for tok, freq in token_freq.items() if freq >= min_freq],

            key=lambda x: -token_freq[x]

        )


        # Assign unique IDs to tokens

        self.token_to_id = {tok: idx for idx, tok in enumerate(filtered_tokens, start=1)}

        self.token_to_id["<UNK>"] = 0  # Unknown token


        # Create inverse mapping

        self.id_to_token = {id_: tok for tok, id_ in self.token_to_id.items()}


    def encode(self, text):

        """

        Convert input text into a list of token IDs.

        Unknown tokens are mapped to the <UNK> token ID (0).

        """

        tokens = self.tokenize(text.lower())

        token_ids = [self.token_to_id.get(tok, 0) for tok in tokens]

        return token_ids


    def decode(self, token_ids):

        """

        Convert a list of token IDs back into a readable string.

        """

        tokens = [self.id_to_token.get(id_, "<UNK>") for id_ in token_ids]

        

        # Reconstruct text with proper spacing

        text = ""

        for tok in tokens:

            if re.match(r"[^\w\s]", tok):  # punctuation

                text += tok

            elif len(text) == 0:

                text += tok

            else:

                text += " " + tok

        return text


    def save_vocab(self, filepath):

        """

        Save vocabulary to a JSON file.

        """

        with open(filepath, "w", encoding="utf-8") as f:

            json.dump(self.token_to_id, f, ensure_ascii=False, indent=4)


    def load_vocab(self, filepath):

        """

        Load vocabulary from a JSON file.

        """

        if not os.path.exists(filepath):

            raise FileNotFoundError(f"Vocabulary file {filepath} not found.")


        with open(filepath, "r", encoding="utf-8") as f:

            self.token_to_id = json.load(f)

            self.id_to_token = {id_: tok for tok, id_ in self.token_to_id.items()}


===========================

Example Usage of Custom Tokenizer

===========================


# Example corpus to build vocabulary

corpus = [

    "Hello, tokenizer! How are you today?",

    "I am testing a custom tokenizer.",

    "Tokenizers are crucial for NLP tasks.",

    "Hello again, tokenizer!"

]


# Create tokenizer instance

tokenizer = CustomTokenizer()


# Build vocabulary from corpus (minimum frequency = 1)

tokenizer.build_vocab(corpus, min_freq=1)


# Display vocabulary

print("Vocabulary:", tokenizer.token_to_id)


# Encode a sample sentence into token IDs

sample_text = "Hello, how is your tokenizer doing today?"

encoded_ids = tokenizer.encode(sample_text)

print("Encoded token IDs:", encoded_ids)


# Decode the token IDs back into readable text

decoded_text = tokenizer.decode(encoded_ids)

print("Decoded text:", decoded_text)


# Save vocabulary to disk

vocab_filepath = "vocab.json"

tokenizer.save_vocab(vocab_filepath)

print(f"Vocabulary saved to {vocab_filepath}")


# Load vocabulary from disk (demonstration)

new_tokenizer = CustomTokenizer()

new_tokenizer.load_vocab(vocab_filepath)

print("Loaded vocabulary:", new_tokenizer.token_to_id)


===========================

Example Output of the Above Usage

===========================


When running the example usage above, you might see output similar to this:


Vocabulary: {'tokenizer': 1, 'hello': 2, ',': 3, '!': 4, 'are': 5, '.': 6, 'how': 7, 'you': 8, 'today': 9, '?': 10, 'i': 11, 'am': 12, 'testing': 13, 'a': 14, 'custom': 15, 'tokenizers': 16, 'crucial': 17, 'for': 18, 'nlp': 19, 'tasks': 20, 'again': 21, '<UNK>': 0}


Encoded token IDs: [2, 3, 7, 0, 0, 1, 0, 9, 10]


Decoded text: hello, how <UNK> <UNK> tokenizer <UNK> today?


Vocabulary saved to vocab.json


Loaded vocabulary: {'tokenizer': 1, 'hello': 2, ',': 3, '!': 4, 'are': 5, '.': 6, 'how': 7, 'you': 8, 'today': 9, '?': 10, 'i': 11, 'am': 12, 'testing': 13, 'a': 14, 'custom': 15, 'tokenizers': 16, 'crucial': 17, 'for': 18, 'nlp': 19, 'tasks': 20, 'again': 21, '<UNK>': 0}
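As a small sanity check (assuming the usage example above ran in the same session, so tokenizer, new_tokenizer, and sample_text still exist), the reloaded vocabulary should produce exactly the same encoding as the original tokenizer:

# The reloaded vocabulary should give identical encodings to the original one
assert new_tokenizer.encode(sample_text) == tokenizer.encode(sample_text)
print("Round trip:", new_tokenizer.decode(new_tokenizer.encode(sample_text)))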


============================

Summary

============================


Tokenization is a fundamental step when working with Large Language Models. Hugging Face provides convenient tokenizers that are pretrained and optimized for popular models. However, sometimes you may need a custom tokenizer tailored to your specific requirements. Python's regular expressions offer an easy way to implement simple custom tokenizers.
