=========
Introduction
=========
Tokenization is the process of splitting text into smaller units called tokens. Tokens can be words, subwords, or characters. Tokenization is crucial for Large Language Models (LLMs) because it converts human-readable text into numerical representations that the model can process.
For example, the sentence:
"I love machine learning!"
can be tokenized into:
["I", "love", "machine", "learning", "!"]
However, modern tokenizers often use subword tokenization methods such as Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. These methods break words down into smaller units, allowing the model to handle rare or unknown words effectively. For instance, a subword tokenizer might split an uncommon word like "tokenization" into ["token", "ization"], so even words it has never seen as whole units can still be represented.
===============================
Using a Hugging Face Tokenizer in Python
===============================
Hugging Face provides easy-to-use tokenizers for popular pretrained language models. Here is how you can use one in Python:
First, install the necessary library:
pip install transformers
Now, let's see how to use a tokenizer from Hugging Face with Python:
# Importing the tokenizer class from Hugging Face transformers
from transformers import AutoTokenizer
# Load a pretrained tokenizer (for example, GPT-2 tokenizer)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Sample text to tokenize
text = "Hello, tokenizer! How are you doing today?"
# Tokenize the input text
tokens = tokenizer.tokenize(text)
# Convert tokens to token IDs (numerical representation)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
# Display the results
print("Original text:", text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)
When running the above code, you might see output similar to:
Original text: Hello, tokenizer! How are you doing today?
Tokens: ['Hello', ',', 'Ġtoken', 'izer', '!', 'ĠHow', 'Ġare', 'Ġyou', 'Ġdoing', 'Ġtoday', '?']
Token IDs: [15496, 11, 11241, 7509, 0, 2437, 389, 345, 1833, 1591, 30]
(Note: The character 'Ġ' marks a leading space in the GPT-2 tokenizer; tokens starting with 'Ġ' were preceded by a space in the original text.)
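As a side note, the explicit tokenize / convert_tokens_to_ids steps above are mainly useful for inspecting what the tokenizer does. In everyday use you would typically call the tokenizer object directly and decode the IDs back to text. Here is a minimal sketch, assuming the same GPT-2 tokenizer as above (the exact IDs depend on the model's vocabulary):
# One-step usage of the Hugging Face tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, tokenizer! How are you doing today?"

# Calling the tokenizer directly returns the token IDs (plus an attention mask)
encoding = tokenizer(text)
print("Input IDs:", encoding["input_ids"])

# decode() converts the IDs back into the original string
print("Decoded:", tokenizer.decode(encoding["input_ids"]))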
============================
Building a Custom Tokenizer in Python
============================
Sometimes, you might need a simple custom tokenizer tailored to specific needs. Let's create a basic tokenizer from scratch in Python that splits text into words and punctuation marks separately.
Here is a simple custom tokenizer implementation:
# Importing regular expressions library
import re
# Custom tokenizer function
def custom_tokenizer(text):
    # Define a regular expression pattern to match words and punctuation
    pattern = r"\w+|[^\w\s]"
    # Find all matches of the pattern in the input text
    tokens = re.findall(pattern, text)
    return tokens
# Example usage
sample_text = "Hello, tokenizer! How are you today?"
# Tokenize the sample text using our custom tokenizer
custom_tokens = custom_tokenizer(sample_text)
# Display the results
print("Original text:", sample_text)
print("Custom tokens:", custom_tokens)
Output:
Original text: Hello, tokenizer! How are you today?
Custom tokens: ['Hello', ',', 'tokenizer', '!', 'How', 'are', 'you', 'today', '?']
Explanation of the custom tokenizer:
- The regular expression pattern "\w+|[^\w\s]" matches either:
  1. "\w+": one or more word characters (letters, digits, underscores)
  2. "[^\w\s]": any single character that is not a word character or whitespace (i.e., a punctuation mark)
- The re.findall() function returns all matching substrings as a list, effectively splitting the text into words and punctuation separately.
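One limitation worth noting: because every non-word character becomes its own token, contractions, hyphenated words, and decimal numbers are split apart. A quick sketch of this behavior, using the same custom_tokenizer defined above:
# Edge cases: apostrophes, hyphens, and decimal points all become separate tokens
print(custom_tokenizer("Don't over-tokenize!"))
# ['Don', "'", 't', 'over', '-', 'tokenize', '!']
print(custom_tokenizer("Version 2.0 costs $5"))
# ['Version', '2', '.', '0', 'costs', '$', '5']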
===============================================
Custom Tokenizer Implementation in Python (Real-World Usage)
===============================================
This tokenizer performs the following tasks:
- Splits text into tokens based on words and punctuation
- Builds a vocabulary from a given text corpus
- Encodes text into numerical token IDs
- Decodes numerical token IDs back into text
- Saves and loads vocabulary for reuse
=================
Complete Python Code
=================
# Import required libraries
import re
import json
import os
class CustomTokenizer:
    def __init__(self, vocab=None):
        """
        Initialize the tokenizer with an optional existing vocabulary.
        If no vocabulary is provided, start with empty dictionaries.
        """
        if vocab:
            self.token_to_id = vocab
            self.id_to_token = {id_: tok for tok, id_ in vocab.items()}
        else:
            self.token_to_id = {}
            self.id_to_token = {}

    def tokenize(self, text):
        """
        Tokenize input text into words and punctuation marks.
        """
        # Regular expression pattern to match words and punctuation
        pattern = r"\w+|[^\w\s]"
        tokens = re.findall(pattern, text)
        return tokens

    def build_vocab(self, texts, min_freq=1):
        """
        Build vocabulary from a list of texts.
        Only include tokens that appear at least 'min_freq' times.
        """
        token_freq = {}
        # Count frequency of each token across all texts
        for text in texts:
            tokens = self.tokenize(text.lower())
            for token in tokens:
                token_freq[token] = token_freq.get(token, 0) + 1
        # Filter tokens by minimum frequency and sort by frequency (most frequent first)
        filtered_tokens = sorted(
            [tok for tok, freq in token_freq.items() if freq >= min_freq],
            key=lambda x: -token_freq[x]
        )
        # Assign unique IDs to tokens, reserving ID 0 for the unknown token
        self.token_to_id = {tok: idx for idx, tok in enumerate(filtered_tokens, start=1)}
        self.token_to_id["<UNK>"] = 0  # Unknown token
        # Create the inverse mapping from IDs back to tokens
        self.id_to_token = {id_: tok for tok, id_ in self.token_to_id.items()}

    def encode(self, text):
        """
        Convert input text into a list of token IDs.
        Unknown tokens are mapped to the <UNK> token ID (0).
        """
        tokens = self.tokenize(text.lower())
        token_ids = [self.token_to_id.get(tok, 0) for tok in tokens]
        return token_ids

    def decode(self, token_ids):
        """
        Convert a list of token IDs back into a readable string.
        """
        tokens = [self.id_to_token.get(id_, "<UNK>") for id_ in token_ids]
        # Reconstruct text with proper spacing
        text = ""
        for tok in tokens:
            if re.fullmatch(r"[^\w\s]", tok):  # single punctuation mark: attach without a space
                text += tok
            elif len(text) == 0:
                text += tok
            else:
                text += " " + tok
        return text

    def save_vocab(self, filepath):
        """
        Save vocabulary to a JSON file.
        """
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(self.token_to_id, f, ensure_ascii=False, indent=4)

    def load_vocab(self, filepath):
        """
        Load vocabulary from a JSON file.
        """
        if not os.path.exists(filepath):
            raise FileNotFoundError(f"Vocabulary file {filepath} not found.")
        with open(filepath, "r", encoding="utf-8") as f:
            self.token_to_id = json.load(f)
        self.id_to_token = {id_: tok for tok, id_ in self.token_to_id.items()}
===========================
Example Usage of Custom Tokenizer
===========================
# Example corpus to build vocabulary
corpus = [
    "Hello, tokenizer! How are you today?",
    "I am testing a custom tokenizer.",
    "Tokenizers are crucial for NLP tasks.",
    "Hello again, tokenizer!"
]
# Create tokenizer instance
tokenizer = CustomTokenizer()
# Build vocabulary from corpus (minimum frequency = 1)
tokenizer.build_vocab(corpus, min_freq=1)
# Display vocabulary
print("Vocabulary:", tokenizer.token_to_id)
# Encode a sample sentence into token IDs
sample_text = "Hello, how is your tokenizer doing today?"
encoded_ids = tokenizer.encode(sample_text)
print("Encoded token IDs:", encoded_ids)
# Decode the token IDs back into readable text
decoded_text = tokenizer.decode(encoded_ids)
print("Decoded text:", decoded_text)
# Save vocabulary to disk
vocab_filepath = "vocab.json"
tokenizer.save_vocab(vocab_filepath)
print(f"Vocabulary saved to {vocab_filepath}")
# Load vocabulary from disk (demonstration)
new_tokenizer = CustomTokenizer()
new_tokenizer.load_vocab(vocab_filepath)
print("Loaded vocabulary:", new_tokenizer.token_to_id)
===========================
Example Output of the Above Usage
===========================
When running the example usage above, you should see output like this (the vocabulary is shown across multiple lines here for readability; print() displays it on a single line):
Vocabulary: {
    "tokenizer": 1,
    "hello": 2,
    ",": 3,
    "!": 4,
    "are": 5,
    ".": 6,
    "how": 7,
    "you": 8,
    "today": 9,
    "?": 10,
    "i": 11,
    "am": 12,
    "testing": 13,
    "a": 14,
    "custom": 15,
    "tokenizers": 16,
    "crucial": 17,
    "for": 18,
    "nlp": 19,
    "tasks": 20,
    "again": 21,
    "<UNK>": 0
}
Encoded token IDs: [2, 3, 7, 0, 0, 1, 0, 9, 10]
Decoded text: hello, how <UNK> <UNK> tokenizer <UNK> today?
Vocabulary saved to vocab.json
Loaded vocabulary: {'tokenizer': 1, 'hello': 2, ',': 3, '!': 4, 'are': 5, '.': 6, 'how': 7, 'you': 8, 'today': 9, '?': 10, 'i': 11, 'am': 12, 'testing': 13, 'a': 14, 'custom': 15, 'tokenizers': 16, 'crucial': 17, 'for': 18, 'nlp': 19, 'tasks': 20, 'again': 21, '<UNK>': 0}
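As a quick sanity check, the reloaded tokenizer should produce exactly the same IDs as the original one, and you can raise min_freq to drop tokens that occur only once. A short sketch, building on the new_tokenizer, sample_text, encoded_ids, and corpus objects created above:
# The reloaded vocabulary should encode text exactly like the original tokenizer
assert new_tokenizer.encode(sample_text) == encoded_ids
print("Round-trip check passed")

# With min_freq=2, tokens that appear only once in the corpus are dropped
# from the vocabulary and will map to <UNK> when encoding
small_tokenizer = CustomTokenizer()
small_tokenizer.build_vocab(corpus, min_freq=2)
print("Smaller vocabulary:", small_tokenizer.token_to_id)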
============================
Summary
============================
Tokenization is a fundamental step when working with Large Language Models. Hugging Face provides convenient tokenizers that are pretrained and optimized for popular models. However, sometimes you may need a custom tokenizer tailored to your specific requirements. Python's regular expressions offer an easy way to implement simple custom tokenizers.