FOREWORD: WHY WOULD ANYONE DO THIS?
Let us be honest with each other right from the start. The overwhelming majority of developers and architects who read this tutorial will never actually train a frontier-scale LLM from scratch. OpenAI, Google DeepMind, Anthropic, Meta, and Mistral have already spent hundreds of millions of dollars doing exactly that, and the resulting models are freely available or cheaply accessible via API. So why bother learning how it all works?
The answer is surprisingly practical. Understanding the full construction pipeline of an LLM or VLM makes you a dramatically better consumer, integrator, fine-tuner, and evaluator of these systems. When a model hallucinates, you will know exactly why. When a model fails on German legal text, you will know precisely which stage of the pipeline to blame. When your company decides to fine-tune an open-weight model on proprietary data, you will know what you are actually doing and what can go wrong. And if your organization ever does decide to build a domain-specific model, perhaps a mid-sized 7B or 13B model focused on industrial automation, medical coding, or multilingual customer service, you will have the knowledge to lead that project with confidence.
This tutorial is therefore both a practical engineering guide and a conceptual map. We will walk through every major stage of building an LLM or VLM: choosing the right model type for your goals, designing the architecture, collecting and processing training data, tokenizing text and images, running pre-training, performing alignment through fine-tuning and reinforcement learning, evaluating the result, and finally deploying it. We will cover dense models, Mixture-of-Experts models, reasoning/thinking models, and Vision-Language Models. We will also discuss how to target specific capabilities like mathematics and coding, and how to make a model genuinely multilingual across English, German, Spanish, and beyond.
CHAPTER ONE: THE BIG PICTURE - WHAT IS AN LLM AND WHAT ARE WE ACTUALLY BUILDING?
Before we dive into architecture diagrams and training loops, let us make sure we share a precise mental model of what a Large Language Model actually is, because the popular conception is often subtly wrong in ways that cause real engineering mistakes.
An LLM is, at its mathematical core, a function that takes a sequence of tokens as input and produces a probability distribution over the next token as output. That is it. Everything else, the apparent intelligence, the ability to write code, the multilingual fluency, the capacity to reason through a math problem step by step, all of it emerges from training this relatively simple function on enormous amounts of text data using a specific neural network architecture called the Transformer.
The Transformer, introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, replaced recurrent neural networks with a mechanism called self-attention. Self-attention allows every token in a sequence to directly attend to every other token, regardless of distance, which solved the long-range dependency problem that plagued earlier architectures. This single architectural insight, combined with massive scale in data and compute, produced the LLM revolution we are living through today.
A Vision-Language Model (VLM) extends this idea by adding a vision encoder, a neural network that converts images into sequences of embeddings that the language model can then process alongside text tokens. The result is a model that can look at an image and answer questions about it, generate captions, reason about visual content, or even write code that processes the depicted data.
Now, when we say we are "building" an LLM, we mean something quite specific. We are making decisions across six major dimensions: the model type and architecture, the training data and its composition, the tokenization strategy, the training procedure including pre-training and alignment, the evaluation methodology, and the deployment infrastructure. Each of these dimensions has profound consequences for the final model's behavior, and they interact with each other in non-obvious ways.
Let us now look at the different model types you can choose from, because this is the first major architectural decision and it shapes everything that follows.
CHAPTER TWO: MODEL TYPES - CHOOSING YOUR WEAPON
The landscape of LLM architectures has diversified enormously since 2020. Where once there was essentially one dominant paradigm (the dense autoregressive Transformer decoder), there are now at least four distinct model types that a builder must choose between, each with different strengths, weaknesses, and engineering requirements.
DENSE DECODER-ONLY MODELS
The dense decoder-only Transformer is the original and still most common LLM architecture. Models like GPT-2, GPT-3, LLaMA, LLaMA 2, LLaMA 3, Mistral 7B, Falcon, and Gemma all belong to this family. The word "dense" means that every parameter in the model is used for every token processed. The word "decoder-only" means that the model uses a causal (left-to-right) attention mask, so each token can only attend to previous tokens, not future ones. This makes the model naturally suited for text generation.
The architecture of a dense decoder-only model consists of an embedding layer that converts token IDs into vectors, followed by a stack of Transformer decoder blocks, followed by a final linear projection and softmax that produces the next-token probability distribution. Each decoder block contains a multi-head self-attention layer and a feed-forward network (FFN), with layer normalization applied before or after each sub-layer (the modern preference, following the LLaMA design, is "pre-norm" using RMSNorm rather than LayerNorm).
Here is a schematic of the full model, with the repeated decoder block drawn as the box in the middle, to make this concrete:
INPUT TOKENS (sequence of token IDs)
|
[Embedding Layer] -- converts each token ID to a d_model-dimensional vector
|
[Positional Encoding / RoPE] -- injects position information
|
+------+------+
| |
| [RMSNorm] |
| | |
| [Multi-Head Self-Attention]
| | |
| [Residual Add]
| | |
| [RMSNorm] |
| | |
| [Feed-Forward Network (SwiGLU)]
| | |
| [Residual Add]
| |
+------+------+
|
(repeated N times, where N is the number of layers)
|
[Final RMSNorm]
|
[Linear Projection to vocab_size]
|
[Softmax -> probability distribution over next token]
The multi-head self-attention mechanism is the heart of the Transformer. For each token, it computes three vectors: a Query (Q), a Key (K), and a Value (V), each derived by multiplying the token's embedding by learned weight matrices. The attention score between two tokens is computed as the dot product of the Query of one token with the Key of another, scaled by the square root of the key dimension. These scores are then passed through a softmax to produce attention weights, which are used to compute a weighted sum of the Value vectors. This entire process happens in parallel across multiple "heads," each learning to attend to different aspects of the input.
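To make the computation described above tangible, here is a minimal sketch of a single attention head with a causal mask, written in plain Python so that every step is visible. The function names and the toy Q, K, and V values are our own illustrative choices; a real implementation would use batched tensor operations and several heads in parallel.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head causal attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Token i may attend only to positions 0..i (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        out.append([
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return out

# Toy example: three tokens, key dimension 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = causal_attention(Q, K, V)
print(result[0])  # the first token sees only itself, so this equals V[0]
```

Because of the causal mask, the first output row is exactly the first value vector, while later rows are mixtures of all the value vectors up to and including their own position.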
The feed-forward network in modern LLMs typically uses a gated activation function called SwiGLU (introduced by Noam Shazeer in 2020), which has been shown empirically to outperform the original ReLU-based FFN. The SwiGLU FFN computes:
FFN(x) = (Swish(x * W1) * (x * W3)) * W2
where W1, W2, and W3 are learned weight matrices, the product between Swish(x * W1) and (x * W3) is element-wise, and Swish is the smooth activation function Swish(x) = x * sigmoid(x).
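The formula above can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the helper names (swish, matvec, swiglu_ffn) and the tiny weight matrices are invented for the example, and x is treated as a row vector multiplied from the left, matching the notation x * W1.

```python
import math

def swish(z):
    """Swish(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m) gives a length-m list."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, W3, W2):
    gate = [swish(g) for g in matvec(x, W1)]     # Swish(x * W1)
    up = matvec(x, W3)                           # x * W3
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise product
    return matvec(hidden, W2)                    # project back to d_model

# Toy sizes: d_model = 2, hidden size = 3 (weights chosen arbitrarily).
W1 = [[0.5, -1.0, 0.2], [0.3, 0.8, -0.5]]
W3 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
W2 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y = swiglu_ffn([1.0, 2.0], W1, W3, W2)
print(y)
```

Note that the output has the same dimension as the input, which is what allows the FFN to sit inside a residual connection.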
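One critical modern addition is Rotary Position Embedding (RoPE), introduced by Su et al. in 2021 and adopted by LLaMA and most subsequent models. Unlike the original sinusoidal position embeddings, which are added to the token embeddings, RoPE encodes position information by rotating the Query and Key vectors in the attention computation, so that the attention score between two tokens depends only on their relative distance. On its own, RoPE does not reliably extrapolate far beyond the sequence lengths seen during training, but it lends itself well to context-extension techniques such as position interpolation, which is one reason it has become the default choice.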
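The rotation idea is easy to check numerically. The following sketch applies RoPE to a query and a key using the interleaved-pair convention (dimensions 0 and 1 form the first pair, 2 and 3 the second, and so on) and the commonly used base of 10000. Both choices are assumptions of this example; production implementations differ in how they pair dimensions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative distance (2), different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(score_near, score_far)  # the two scores agree up to rounding error
```

The two scores match because rotating both vectors by position-dependent angles leaves a dot product that depends only on the difference between the positions.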
The dense model is your safest, most well-understood choice. If you are building a model for the first time, start here. The engineering complexity is manageable, the training dynamics are well-studied, and there is an enormous amount of open-source tooling available.
MIXTURE OF EXPERTS MODELS
The Mixture of Experts (MoE) architecture is one of the most exciting developments in LLM design of the past few years. The core idea is elegant: instead of having one large feed-forward network in each Transformer block, you have multiple smaller feed-forward networks called "experts," and a learned routing mechanism called a "gating network" or "router" that decides, for each token, which experts to activate.
The key insight is that not all experts need to be active for every token. In practice, models like Mixtral 8x7B (released by Mistral AI in December 2023) use 8 experts per layer but activate only 2 for each token. This means the model has approximately 47 billion total parameters but only uses about 13 billion of them for any given token. The result is a model that approaches the representational capacity of a much larger dense model at roughly the inference cost of a 13B dense model, at least in terms of floating-point operations per token; all 47 billion parameters must still be held in memory.
Here is how the MoE layer replaces the FFN in a standard Transformer block:
TOKEN EMBEDDING (vector of dimension d_model)
|
[Router / Gating Network]
-- computes a score for each of the N experts
-- selects the top-K experts (typically K=2)
-- computes softmax weights for the selected experts
|
+------+------+------+------+
| | | | |
[Exp1] [Exp2] [Exp3] ... [ExpN] (only top-K are actually computed)
| | | | |
+------+------+------+------+
|
[Weighted Sum of selected expert outputs]
|
OUTPUT (same dimension as input)
The router is typically a simple linear layer followed by a softmax. For a given token embedding x, it computes scores s_i = softmax(W_r * x) for each expert i, selects the top-K experts by score, and then computes the output as a weighted sum of those experts' outputs, where the weights are the normalized softmax scores of the selected experts.
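Here is a minimal sketch of the routing step just described, for a single token. The names route and moe_layer, the router matrix W_r, and the toy "experts" (which simply scale their input) are all hypothetical and exist only to show the control flow; real experts are full feed-forward networks and routing is done for whole batches at once.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, W_r, k=2):
    """Return [(expert_index, weight)] for the top-k experts of one token."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_r]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)          # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, W_r, experts, k=2):
    out = [0.0] * len(x)
    for i, w in route(x, W_r, k):
        y = experts[i](x)                      # only the selected experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Four toy experts; expert i multiplies its input by (i + 1).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
W_r = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]  # n_experts x d_model
x = [1.0, 0.0]

selected = route(x, W_r)
output = moe_layer(x, W_r, experts)
print(selected, output)
```

With these toy weights the router picks experts 1 and 2, and the layer output is a blend of their two outputs weighted by the renormalized router scores.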
The training of MoE models introduces a subtle but important challenge: load balancing. If the router consistently sends all tokens to the same few experts, those experts become overloaded while the others are never trained. This is called "expert collapse" and it is a real failure mode. To prevent it, a load-balancing auxiliary loss is added to the training objective. This loss penalizes imbalanced routing by encouraging the router to distribute tokens roughly equally across experts. The auxiliary loss coefficient is a hyperparameter that must be tuned carefully: too large and it dominates the training signal, too small and expert collapse occurs.
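One widely used form of this auxiliary loss comes from the Switch Transformer paper: multiply, for each expert, the fraction of tokens routed to it by the average router probability it receives, sum over experts, and scale by the number of experts. The sketch below assumes that formulation and top-1 routing; it is one option among several, and the function and variable names are our own.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: per-token list of router probabilities (one per expert)
    assignments:  per-token index of the expert the token was sent to
    """
    n_tokens = len(router_probs)
    # f[e]: fraction of tokens dispatched to expert e
    f = [assignments.count(e) / n_tokens for e in range(n_experts)]
    # P[e]: mean router probability assigned to expert e
    P = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, P))

# Perfectly balanced routing across two experts.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], n_experts=2)

print(balanced, collapsed)
```

The loss reaches its minimum of 1.0 under uniform routing and grows as routing concentrates on fewer experts. During training it would be multiplied by the auxiliary loss coefficient and added to the language modeling loss.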
DeepSeek-MoE, introduced by DeepSeek AI in 2024, took the MoE concept further with a "fine-grained" approach: instead of 8 large experts, it uses many more, smaller experts (e.g., 64 experts with 6 activated per token). This finer granularity allows for more flexible specialization and has been shown to improve performance. DeepSeek-V3, released in late 2024, used 256 experts with top-8 routing and achieved remarkable results.
From an engineering perspective, MoE models are significantly more complex to train and deploy than dense models. The routing mechanism introduces communication overhead in distributed training because different tokens on the same GPU may be routed to experts on different GPUs, requiring all-to-all communication. Frameworks like MegaBlocks and specialized MoE implementations in DeepSpeed and Megatron-LM handle this, but the engineering complexity is real. If you are building your first LLM, MoE is probably not where you want to start. But if you need a model that is efficient at inference time while having a large total parameter count, MoE is the right architecture.
REASONING AND THINKING MODELS
The "thinking" or "reasoning" model is not a fundamentally different architecture but rather a different training paradigm applied on top of a standard dense or MoE model. The key idea, popularized by OpenAI's o1 (released September 2024) and DeepSeek-R1 (released January 2025), is to train the model to produce an explicit chain of thought before giving its final answer. This chain of thought, sometimes called a "scratchpad" or "thinking trace," allows the model to work through complex problems step by step, dramatically improving performance on tasks that require multi-step reasoning, such as mathematics, formal logic, and complex coding.
The training of a reasoning model typically proceeds in multiple stages. The first stage is standard pre-training on a large text corpus, exactly as for any other LLM. The second stage is supervised fine-tuning (SFT) on a dataset of problems paired with high-quality chain-of-thought solutions. The third and most distinctive stage is reinforcement learning, where the model is rewarded for producing correct final answers and penalized for incorrect ones, without necessarily specifying what the reasoning trace should look like. This allows the model to develop its own reasoning strategies.
DeepSeek-R1's training procedure, described in their January 2025 paper, is particularly instructive. They used a technique called GRPO (Group Relative Policy Optimization), first introduced in the earlier DeepSeek-Math work, which works as follows: for a given problem, the model generates a group of candidate responses (e.g., 8 responses). Each response is evaluated by a reward function that checks whether the final answer is correct. The advantage of each response is then computed relative to the average reward of the group (and normalized by the group's standard deviation), rather than relative to a baseline predicted by a separate learned value model. This group-relative signal is then used to update the model's parameters via policy gradient. The result is a model that learns to produce longer, more careful reasoning traces when faced with difficult problems.
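The group-relative step is simple enough to sketch directly. The snippet below computes one advantage per sampled response by subtracting the group mean and dividing by the group standard deviation, which is the normalization described in the GRPO papers. The function name, the epsilon guard, and the example rewards are our own illustrative choices, and the policy-gradient update itself is not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Eight sampled responses to one problem; 1.0 = correct final answer.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, and when every response in a group earns the same reward all advantages are zero, so that group contributes no learning signal.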
Here is a simplified illustration of what a reasoning model's output looks like for a math problem:
USER: What is the sum of all integers from 1 to 100?
MODEL (thinking trace, not shown to user):
Let me think about this carefully. I need to find the sum
1 + 2 + 3 + ... + 100. I can use Gauss's formula for
arithmetic series. The sum of an arithmetic series is
(n/2) * (first_term + last_term). Here n=100, first_term=1,
last_term=100. So the sum is (100/2) * (1 + 100) = 50 * 101
= 5050. Let me verify: the pairs (1,100), (2,99), ..., (50,51)
each sum to 101, and there are 50 such pairs, so 50*101=5050.
Correct.
MODEL (final answer, shown to user):
The sum of all integers from 1 to 100 is 5050, using the
formula S = n(n+1)/2 = 100*101/2 = 5050.
The thinking trace is generated using special tokens that delimit the reasoning section. In DeepSeek-R1, these are the <think> and </think> tags.
One fascinating emergent behavior observed in reasoning models is that they spontaneously develop behaviors like self-correction (going back and revising an earlier step when they detect an error), exploration of multiple solution paths, and even something resembling doubt ("Wait, let me reconsider..."). These behaviors were not explicitly programmed; they emerged from the reinforcement learning process because they turned out to be useful strategies for producing correct final answers.
VISION-LANGUAGE MODELS
A Vision-Language Model (VLM) extends a standard LLM by adding the ability to process images. The architecture consists of three main components: a vision encoder, a connector (also called a projector or adapter), and a language model backbone.
The vision encoder is a neural network that converts an image into a sequence of vector embeddings. The most common choice is a Vision Transformer (ViT), which divides the image into a grid of fixed-size patches (e.g., 14x14 pixel patches for a 224x224 image, giving 256 patches) and processes them with a Transformer encoder. CLIP (Contrastive Language-Image Pre-training), developed by OpenAI in 2021, is a particularly popular vision encoder because it was pre-trained to align image and text representations, making it naturally suited for VLM applications.
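The patch arithmetic is worth verifying once by hand. This small sketch counts patches for a square image and splits a toy 4x4 grid into flattened 2x2 patches in row-major order. The function names are hypothetical, and a real ViT would follow this step with a learned linear embedding of each patch.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side

def patchify(image, patch):
    """Split a square 2-D grid (list of rows) into flattened patches."""
    n = len(image)
    patches = []
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            patches.append([
                image[r + dr][c + dc]
                for dr in range(patch)
                for dc in range(patch)
            ])
    return patches

print(num_patches(224, 14))  # 16 patches per side, 256 in total
print(num_patches(336, 14))  # 24 patches per side, 576 in total

# A 4x4 "image" whose pixel values are just their row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
print(patches[0])  # top-left patch: [0, 1, 4, 5]
```

The jump from 256 to 576 visual tokens when moving from 224x224 to 336x336 input shows how quickly higher resolution consumes the language model's context window.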
The connector bridges the gap between the vision encoder's output space and the language model's input space. The simplest connector is a linear projection: a single matrix multiplication that maps each visual embedding from the vision encoder's dimension to the language model's embedding dimension. This is the approach used by the original LLaVA (Large Language and Vision Assistant), one of the most influential open-source VLMs. More sophisticated connectors include the Q-Former introduced in BLIP-2 and reused in InstructBLIP, which uses a learned set of query vectors to extract the most task-relevant visual features, and the cross-attention layers used in Flamingo (DeepMind, 2022), which interleave visual information into the language model at multiple layers.
Here is the VLM data flow for a visual question answering task:
IMAGE (e.g., a photo of a Siemens S7-1500 PLC)
|
[Vision Encoder: ViT-L/14 or CLIP]
-- divides image into NxN patches
-- processes patches with Transformer encoder
-- outputs sequence of M visual embeddings
-- each embedding has dimension d_vision (e.g., 1024)
|
[Connector / Projector]
-- linear projection: d_vision -> d_model
-- outputs M visual tokens in language model space
|
TEXT TOKENS: "What model is this PLC?" -> tokenized
|
[Concatenate visual tokens + text tokens]
|
[Language Model Backbone (e.g., LLaMA-3)]
-- processes the combined sequence autoregressively
-- visual tokens appear before text tokens in the sequence
|
OUTPUT TOKENS: "This is a Siemens SIMATIC S7-1500 PLC,
part of the S7-1500 series..."
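The connector and concatenation steps in the data flow above can be sketched as follows. This is a toy illustration with tiny dimensions (d_vision = 3, d_model = 2) and hand-picked numbers; the function names and the projection matrix W are hypothetical, and the vision encoder and language model themselves are not modeled.

```python
def project(visual_embeddings, W):
    """Linear connector: map each d_vision vector to d_model via matrix W."""
    return [
        [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]
        for v in visual_embeddings
    ]

def build_input_sequence(visual_embeddings, text_embeddings, W):
    """Visual tokens first, then the text tokens, as one sequence."""
    return project(visual_embeddings, W) + text_embeddings

# M = 2 visual embeddings of dimension d_vision = 3.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Projection matrix of shape d_vision x d_model = 3 x 2.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Three text token embeddings already in the language model's space.
text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

sequence = build_input_sequence(visual, text, W)
print(len(sequence), sequence[0])
```

After projection, the language model backbone cannot tell visual tokens from text tokens by shape alone: both are d_model-dimensional vectors in one sequence, which is why the connector can be so simple.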
The training of a VLM typically proceeds in two stages. In the first stage, only the connector is trained while the vision encoder and language model are frozen. This teaches the connector to translate visual embeddings into a representation the language model can understand. In the second stage, the connector and the language model are fine-tuned together on a large dataset of image-text pairs, visual question answering examples, and image captioning data. The vision encoder is either kept frozen throughout (as in the original LLaVA paper) or fine-tuned in the second stage (as in some more recent models such as InternVL).
A particularly important design decision for VLMs is how to handle high-resolution images. A standard ViT processes images at a fixed resolution (e.g., 224x224 or 336x336), which loses fine-grained detail. Modern VLMs like LLaVA-HD and InternVL use a "dynamic resolution" approach: the image is divided into multiple tiles, each processed separately by the vision encoder, and the resulting visual tokens are concatenated. This allows the model to handle high-resolution images without changing the vision encoder architecture.
CHAPTER THREE: DEFINING YOUR MODEL'S GOALS - CAPABILITIES AND LANGUAGES
Before you write a single line of training code, you need to answer two fundamental questions: what should your model be good at, and in which languages? These decisions will determine your data collection strategy, your tokenizer design, your fine-tuning approach, and your evaluation benchmarks. Getting them wrong at this stage is expensive to fix later.
CAPABILITY TARGETING
Suppose you want to build a model that excels at mathematics and formal reasoning. This is a well-defined and achievable goal, and it has been pursued successfully by models like DeepSeek-Math, Qwen-Math, and the reasoning models discussed above. To achieve strong mathematical capability, you need to make specific choices at every stage of the pipeline.
At the data level, you need a large amount of high-quality mathematical text. This includes formal mathematics from sources like arXiv (which hosts millions of papers in mathematics, physics, and computer science), mathematical textbooks, competition problem sets (AMC, AIME, IMO), and solutions to those problems. You also need a large amount of mathematical reasoning traces, step-by-step solutions that demonstrate how to approach problems. The DeepSeek-Math model was trained on a dataset of 120 billion math-related tokens, curated from Common Crawl using a math-specific classifier.
At the tokenization level, mathematical notation presents special challenges. Mathematical expressions contain symbols like Greek letters, fractions, integrals, and summations that are rare in general text. A tokenizer trained on general text may split these into many small tokens, making them harder for the model to process. Some math-focused models use specialized tokenizers that handle LaTeX notation more efficiently.
At the architecture level, there is evidence that longer context windows help with mathematical reasoning, because complex proofs and derivations can be very long. Models like DeepSeek-Math-7B use a context length of 4096 tokens, while reasoning models like DeepSeek-R1 use 32768 tokens or more to accommodate long thinking traces.
At the fine-tuning level, mathematical capability is dramatically improved by training on datasets of problems paired with step-by-step solutions. The key insight, reported by multiple research groups, is that the quality of the solutions tends to matter more than the quantity. A dataset of 100,000 high-quality, verified solutions can be more valuable than 1 million solutions of unknown quality.
Now suppose instead that you want to build a model focused on code generation. This is perhaps the most commercially important capability specialization, pursued by models like Code Llama, DeepSeek-Coder, Qwen-Coder, and GitHub Copilot's underlying models. Code-focused models require a large amount of code data from sources like GitHub, Stack Overflow, and documentation websites. The code data should cover multiple programming languages (Python, JavaScript, TypeScript, Java, C++, Rust, Go, SQL, etc.) and should include not just code but also comments, docstrings, and associated natural language descriptions.
A critical technique for code models is "fill-in-the-middle" (FIM) training, introduced by Bavarian et al. at OpenAI in 2022. In standard autoregressive training, the model learns to predict the next token given all previous tokens. In FIM training, a portion of the code is randomly masked out, and the model is trained to predict the masked portion given both the preceding and following context. This teaches the model to perform code completion in the middle of a file, which is the most common use case in code editors.
Here is an illustration of FIM training:
ORIGINAL CODE:
    def calculate_area(radius):
        pi = 3.14159
        area = pi * radius * radius
        return area
FIM TRAINING EXAMPLE (prefix, suffix, middle format):
    PREFIX: "def calculate_area(radius):\n    pi = 3.14159\n"
    SUFFIX: "\n    return area"
    MIDDLE: "    area = pi * radius * radius"
The model sees: <PRE> PREFIX <SUF> SUFFIX <MID>
And must predict: "    area = pi * radius * radius"
This training format is used by Code Llama and DeepSeek-Coder and is essential for achieving good code completion performance.
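Constructing such training examples is mostly string slicing. The sketch below picks two random cut points at the character level and rearranges the document into prefix-suffix-middle order. The sentinel strings are plain-text placeholders borrowed from the illustration above; real tokenizers reserve dedicated special tokens for them, and the function name and return format are our own.

```python
import random

def make_fim_example(code, rng, pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Split code at two random points and build a prefix-suffix-middle example."""
    # Two distinct cut points guarantee a non-empty middle span.
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return {
        "prefix": prefix,
        "middle": middle,
        "suffix": suffix,
        "prompt": f"{pre}{prefix}{suf}{suffix}{mid}",  # what the model sees
        "target": middle,                              # what it must predict
    }

code = (
    "def calculate_area(radius):\n"
    "    pi = 3.14159\n"
    "    area = pi * radius * radius\n"
    "    return area\n"
)
example = make_fim_example(code, random.Random(0))
print(example["prompt"])
print(repr(example["target"]))
```

Because the three spans always concatenate back to the original file, no information is lost; the model simply learns to generate the middle after having seen both sides.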
LANGUAGE TARGETING
Making a model genuinely multilingual is harder than it sounds, and most general-purpose LLMs are significantly weaker in non-English languages than their benchmark scores suggest. The reason is simple: the internet is dominated by English text, and most training datasets reflect this imbalance. Common Crawl, the largest publicly available web crawl dataset, is estimated to be roughly 45-50% English by token count. German, Spanish, French, and other major European languages each account for only a few percent.
If you want your model to be genuinely strong in German, Spanish, and English, you need to deliberately engineer your data mix to compensate for this imbalance. This involves two complementary strategies: upsampling low-resource languages and sourcing high-quality language-specific data.
Upsampling means that during training, you present the model with more examples from German and Spanish text than their natural proportion in the dataset would suggest. For example, if German text makes up 3% of your raw dataset but you want it to account for roughly 15% of the training mix, you would upsample German data by a factor of about 5 to 6 (a factor of exactly 5 falls slightly short, because the repeated German tokens also enlarge the total). The exact upsampling ratios are determined empirically through ablation studies on smaller models.
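A quick calculation shows what repetition factor is needed when all other data is left unchanged. If a language has raw share r and should reach target share t, the factor f must satisfy f*r / (f*r + (1 - r)) = t, which gives f = t(1 - r) / (r(1 - t)). The helper names below are our own, and the sketch ignores the interaction that arises when several languages are upsampled at once.

```python
def upsample_factor(raw_share, target_share):
    """Repetition factor that lifts one language from raw_share to target_share,
    assuming the rest of the dataset is sampled exactly once."""
    return target_share * (1 - raw_share) / (raw_share * (1 - target_share))

def resulting_share(raw_share, factor):
    """Share of the mix a language ends up with after upsampling by factor."""
    return factor * raw_share / (factor * raw_share + (1 - raw_share))

factor = upsample_factor(0.03, 0.15)
print(round(factor, 2))                        # about 5.71
print(round(resulting_share(0.03, 5.0), 4))    # a factor of 5 yields about 13.4%
print(round(resulting_share(0.03, factor), 4))
```

In practice the target shares themselves are the quantity being tuned, so a helper like this is mainly useful for translating a desired mix into sampling weights.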
High-quality language-specific data sources for German include the German Wikipedia (about 2.8 million articles), the OSCAR corpus (a large multilingual web corpus), the German portion of Common Crawl, German news archives, German legal texts (which are publicly available through EUR-Lex and national legal databases), and German-language books from Project Gutenberg and similar sources. For Spanish, analogous sources exist, including the Spanish Wikipedia, the Spanish portion of OSCAR, and a large body of Spanish-language literature and journalism.
The tokenizer design is also critical for multilingual models. A tokenizer trained primarily on English text will be inefficient for other languages: it will represent German or Spanish words as many more tokens than English words of equivalent length, because the tokenizer has not learned the common subword patterns of those languages. This "tokenizer fertility" problem directly affects the model's effective context length for non-English languages and can significantly degrade performance. The solution is to train the tokenizer on a multilingual corpus that reflects the target language distribution, ensuring that common words and subwords in each target language are represented as single tokens.
Here is a concrete illustration of the tokenizer fertility problem:
ENGLISH: "The transformer architecture is powerful."
TOKENS: ["The", " transformer", " architecture", " is", " powerful", "."]
COUNT: 6 tokens
GERMAN (with English-trained tokenizer):
"Die Transformer-Architektur ist leistungsstark."
TOKENS: ["Die", " Trans", "form", "er", "-", "Arch", "itektur",
" ist", " le", "ist", "ung", "sst", "ark", "."]
COUNT: 14 tokens (more than double!)
GERMAN (with multilingual tokenizer):
"Die Transformer-Architektur ist leistungsstark."
TOKENS: ["Die", " Transformer", "-", "Architektur", " ist",
" leistungsstark", "."]
COUNT: 7 tokens (comparable to English)
The difference is dramatic. A model using an English-biased tokenizer will "use up" its context window much faster when processing German text, effectively giving it a shorter memory for German than for English. This is why models like LLaMA 3 (which uses a 128,000-token vocabulary with much better multilingual coverage than LLaMA 1's 32,000-token vocabulary) perform significantly better on non-English languages.
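Fertility is usually reported as the average number of tokens per word. The sketch below computes it for the three illustrative tokenizations shown above, using whitespace-separated words as the denominator. The token lists are copied from the illustration rather than produced by a real tokenizer, so the numbers are illustrative only.

```python
def fertility(tokens, text):
    """Average number of tokens per whitespace-separated word."""
    return len(tokens) / len(text.split())

english = "The transformer architecture is powerful."
german = "Die Transformer-Architektur ist leistungsstark."

english_tokens = ["The", " transformer", " architecture", " is", " powerful", "."]
german_english_tok = ["Die", " Trans", "form", "er", "-", "Arch", "itektur",
                      " ist", " le", "ist", "ung", "sst", "ark", "."]
german_multilingual_tok = ["Die", " Transformer", "-", "Architektur", " ist",
                           " leistungsstark", "."]

print(fertility(english_tokens, english))            # 1.2
print(fertility(german_english_tok, german))         # 3.5
print(fertility(german_multilingual_tok, german))    # 1.75
```

To evaluate a real tokenizer, you would run the same calculation over a held-out sample of several thousand sentences per target language and compare the averages.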
CHAPTER FOUR: DATA - THE FOUNDATION OF EVERYTHING
If you take away only one lesson from this entire tutorial, make it this: data is the most important factor in determining the quality of your LLM. Architecture matters. Training procedure matters. But data quality and composition matter more than either of those. A mediocre architecture trained on excellent data will usually outperform an excellent architecture trained on mediocre data.
This is not a theoretical claim. It is supported by the empirical findings of many major LLM papers published since 2022. The LLaMA paper (Touvron et al., Meta, 2023) reported that its 13B parameter model, trained on 1 trillion tokens of carefully curated data, outperformed GPT-3 (175B parameters) on most benchmarks. The Phi series of models from Microsoft Research (Phi-1, Phi-1.5, Phi-2, Phi-3) demonstrated even more dramatically that small models trained on extremely high-quality "textbook-quality" data can punch far above their weight class.
Let us now walk through the complete data pipeline, from raw collection to training-ready token sequences.
DATA SOURCES
The raw materials for an LLM's training data come from a variety of sources, each with different characteristics in terms of quality, domain coverage, and licensing.
Web crawl data is the largest and most diverse source. The Common Crawl project (commoncrawl.org) has been crawling the web since 2008 and makes its data freely available. As of 2024, Common Crawl contains petabytes of raw web data. However, raw web data is extremely noisy: it contains spam, duplicate content, machine-generated text, adult content, and low-quality pages. Significant processing is required to extract useful training data from it. The C4 dataset (Colossal Clean Crawled Corpus), created by Google, was an early attempt at cleaning Common Crawl for LLM training. More recent efforts include RefinedWeb (Falcon), RedPajama, Dolma, and FineWeb (Hugging Face, 2024), which apply increasingly sophisticated filtering pipelines.
Books and long-form text provide a qualitatively different kind of data: coherent, long-range reasoning, rich vocabulary, and narrative structure. The Books3 dataset (a large collection of books) and Project Gutenberg (public domain books) are commonly used sources. However, copyright issues are a significant concern with book data, and several major lawsuits have been filed against AI companies for training on copyrighted books without permission. If you are building a commercial model, you need to be very careful about the licensing of your book data.
Code repositories, primarily from GitHub, are essential for any model that needs coding capability. The Stack (BigCode project, 2022) is a large dataset of permissively licensed code from GitHub, covering over 350 programming languages. It is the primary training data source for StarCoder and related models. The key challenge with code data is license filtering: GitHub contains code under many different licenses, some of which (like GPL) may impose restrictions on derivative works.
Scientific papers from arXiv provide high-quality technical and mathematical text. The arXiv dataset contains millions of papers in LaTeX format, which is valuable both for the mathematical content and for the LaTeX notation that the model can learn to produce. PubMed provides biomedical literature. The Semantic Scholar Open Research Corpus provides a broad collection of scientific papers.
Wikipedia is a high-quality, multilingual, encyclopedic source that is almost universally included in LLM training data. Its structured format (with sections, references, and infoboxes) and its coverage of virtually every topic make it invaluable. The Wikipedia dump is freely available in multiple languages.
Curated high-quality datasets like OpenWebText, The Pile (EleutherAI), and Dolma (Allen AI) are pre-assembled collections that combine multiple sources with varying degrees of filtering. These are good starting points for a new project, though you will likely want to supplement them with domain-specific data for your target capabilities.
DATA PROCESSING PIPELINE
Raw data from any of these sources cannot be fed directly into a training pipeline. It must go through a multi-stage processing pipeline that includes language identification, quality filtering, deduplication, and content filtering. This pipeline is often more complex and time-consuming to build than the model training code itself.
Language identification is the first step. You need to know what language each document is in, both to filter for your target languages and to apply language-specific processing. The most widely used tool for this is fastText's language identification model (lid.176.bin), which can identify 176 languages with high accuracy. For web data, you also need to handle the common case of mixed-language documents (e.g., a German webpage with English technical terms).
Quality filtering removes low-quality documents using a combination of heuristic rules and learned models. Heuristic rules include filtering out documents that are too short (fewer than 100 words), have too high a proportion of non-alphabetic characters (indicating spam or machine-generated content), have too many repeated n-grams (indicating boilerplate or templated content), or contain too few stop words (indicating keyword lists or other text that is not natural prose). Learned filters come in two common forms. The CCNet pipeline scores each document by its perplexity under a language model trained on Wikipedia and discards high-perplexity documents. Other pipelines train a lightweight classifier, often with fastText, to distinguish high-quality text (e.g., Wikipedia) from low-quality text (e.g., spam), and then filter out documents with low classifier scores.
Here is a simplified example of a quality filtering heuristic applied to web documents:
DOCUMENT QUALITY CHECKS (applied sequentially):
Check 1: Length filter
-- Minimum 100 words, maximum 100,000 words
-- Reject: "Buy cheap shoes now! Click here! Best prices!"
(too short, 8 words)
Check 2: Character ratio filter
-- Maximum 20% non-alphabetic characters
-- Reject: "!!!BUY NOW!!! $$$SALE$$$ 50% OFF!!! CLICK HERE!!!"
(too many special characters)
Check 3: Repetition filter
-- Maximum 20% duplicate lines
-- Reject: "Terms apply. Terms apply. Terms apply. Terms apply."
(highly repetitive)
Check 4: Language score filter
-- Minimum fastText language confidence 0.65
-- Reject documents with ambiguous language
Check 5: Quality classifier score
-- Train classifier on Wikipedia (positive) vs random web (negative)
-- Keep only documents above threshold score
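The first three checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production filter: the function name and return format are hypothetical, the thresholds are simply the ones listed above, and checks 4 and 5 are omitted because they need a fastText model and a trained classifier.

```python
def passes_quality_checks(text, min_words=100, max_words=100_000,
                          max_non_alpha=0.20, max_dup_lines=0.20):
    """Apply checks 1-3 in order and return (ok, reason_for_rejection)."""
    # Check 1: length filter on whitespace-separated words.
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False, "length"

    # Check 2: share of non-whitespace characters that are not letters.
    visible = [c for c in text if not c.isspace()]
    non_alpha = sum(1 for c in visible if not c.isalpha())
    if visible and non_alpha / len(visible) > max_non_alpha:
        return False, "char_ratio"

    # Check 3: fraction of non-empty lines that repeat an earlier line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and 1 - len(set(lines)) / len(lines) > max_dup_lines:
        return False, "repetition"

    return True, "ok"


print(passes_quality_checks("Buy cheap shoes now! Click here! Best prices!"))
```

Returning the rejection reason makes it easy to log how many documents each rule removes, which is usually the first thing you want to inspect when tuning thresholds.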
Deduplication is perhaps the most important and most underappreciated step in data processing. Training data contains enormous amounts of duplicate and near-duplicate content: the same news article reprinted on hundreds of websites, the same code snippet copied across thousands of repositories, the same Wikipedia paragraph quoted in countless blog posts. Training on duplicate data wastes compute and, more importantly, causes the model to memorize specific texts rather than learning general patterns. It also increases the risk of the model reproducing copyrighted text verbatim.
Exact deduplication removes documents that are byte-for-byte identical. Near-duplicate deduplication removes documents that are very similar but not identical, using techniques like MinHash LSH (Locality-Sensitive Hashing) or SimHash. The paper Deduplicating Training Data Makes Language Models Better by Lee et al. (2022) provides strong evidence that aggressive deduplication reduces verbatim memorization and improves model quality.
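To make the MinHash idea concrete, here is a small sketch using only the standard library. It computes MinHash signatures over word 3-grams and estimates Jaccard similarity from them; the LSH bucketing step that makes this scale to billions of documents is left out. The function names, the shingle size, and the use of SHA-1 as the hash family are illustrative assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping word k-grams for one document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


original = "the same news article was reprinted on hundreds of different websites across the web"
reprint = original + " today"
unrelated = "tokenizers convert raw text into integer ids for neural networks"

sig = minhash_signature(original)
print(estimated_jaccard(sig, minhash_signature(reprint)))
print(estimated_jaccard(sig, minhash_signature(unrelated)))
```

In a real pipeline you would treat pairs above some similarity threshold (a value around 0.8 is a common choice) as near-duplicates and keep only one copy.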
Content filtering removes harmful, toxic, or illegal content from the training data. This includes pornographic content, hate speech, instructions for creating weapons, and personally identifiable information (PII) like names, email addresses, and phone numbers. Tools like the Perspective API (Google), Detoxify, and custom classifiers are used for this purpose. PII removal is particularly important for privacy compliance (GDPR in Europe, CCPA in California) and is typically done using named entity recognition (NER) models to identify and redact personal information.
After all filtering steps, the processed documents are tokenized (we will cover tokenization in the next section) and packed into fixed-length sequences for training. The packing process concatenates multiple documents end-to-end, separated by a special end-of-document token, until the desired sequence length (e.g., 4096 or 8192 tokens) is reached. This maximizes GPU utilization by ensuring that every training example is a full-length sequence.
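The packing step is simple enough to sketch directly. The version below is a minimal illustration with hypothetical names; it assumes documents are already tokenized into lists of integer IDs and that eod_id is the ID of the end-of-document token.

```python
def pack_sequences(tokenized_docs, seq_len, eod_id):
    """Concatenate documents (each followed by eod_id) and cut fixed-length chunks.

    Returns the list of full sequences plus the leftover tokens that did not
    fill a complete sequence.
    """
    buffer, packed = [], []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eod_id)
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed, buffer


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed, leftover = pack_sequences(docs, seq_len=4, eod_id=0)
print(packed)    # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
print(leftover)  # []
```

Note that a document can be split across two sequences (the third document above starts in one sequence and finishes in the next). Some training setups additionally mask attention across document boundaries; that refinement is not shown here.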
DATA MIX AND PROPORTIONS
Once you have processed data from multiple sources, you need to decide how to mix them for training. This is called the "data mix" or "data recipe," and it is one of the most important hyperparameters of the entire training process. The data mix determines what the model knows, how it reasons, and what languages it speaks.
The exact data mix used by major models is often a closely guarded secret, but some information is available from published papers. The LLaMA 3 paper (Meta, 2024) reports a final data mix of roughly 50% general knowledge, 25% mathematical and reasoning data, 17% code, and 8% multilingual text. The high proportion of code and reasoning data even for a general-purpose model reflects the widely reported finding that code training improves reasoning ability beyond coding tasks. A common explanation is that code has a highly structured, logical format that exposes the model to systematic, step-by-step reasoning.
The Chinchilla scaling laws, published by Hoffmann et al. at DeepMind in 2022, provide important guidance on the relationship between model size and training data. The key finding was that previous large models (including GPT-3) were significantly undertrained: they used too many parameters relative to the amount of training data. The Chinchilla-optimal training ratio is approximately 20 tokens of training data per model parameter. So a 7B parameter model should be trained on approximately 140 billion tokens for optimal compute efficiency. However, subsequent work (including the LLaMA papers) showed that training on more tokens than the Chinchilla optimum continues to improve model quality, even if it is not compute-optimal. LLaMA 3's 8B model was trained on 15 trillion tokens, far more than the Chinchilla optimum, because the goal was to maximize model quality rather than compute efficiency.
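The arithmetic in this paragraph is easy to reproduce. The helper names below are hypothetical, and the 20 tokens-per-parameter ratio is the rule of thumb quoted above rather than an exact constant.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param


def overtraining_factor(n_params, n_tokens, tokens_per_param=20):
    """How many times more tokens than the Chinchilla rule of thumb."""
    return n_tokens / chinchilla_optimal_tokens(n_params, tokens_per_param)


print(chinchilla_optimal_tokens(7_000_000_000))                   # 140000000000
print(overtraining_factor(8_000_000_000, 15_000_000_000_000))     # 93.75
```

By this rule of thumb, LLaMA 3's 8B model saw roughly 94 times more tokens than the compute-optimal amount, which illustrates how far modern small models are deliberately overtrained to make inference cheaper.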
CHAPTER FIVE: TOKENIZATION - TEACHING THE MODEL TO READ
Tokenization is the process of converting raw text into a sequence of integer IDs that the model can process. It sits at the interface between human-readable text and the mathematical world of neural networks, and getting it right has a surprisingly large impact on model quality.
The dominant tokenization approach for modern LLMs is Byte Pair Encoding (BPE), originally developed for text compression and adapted for NLP by Sennrich et al. in 2016. BPE works by starting with a vocabulary of individual bytes (or characters) and iteratively merging the most frequent pair of adjacent tokens into a new token, until the desired vocabulary size is reached. The result is a vocabulary of subword units that balances between character-level flexibility (the ability to represent any text) and word-level efficiency (common words are represented as single tokens).
Here is a step-by-step illustration of BPE training on a tiny corpus:
CORPUS: "low low low lower lower newest newest widest"
STEP 0: Initial vocabulary (characters + space marker)
Tokens: {l, o, w, e, r, n, s, t, i, d, _}
Frequencies: {l:5, o:5, w:8, e:7, r:2, n:2, s:3, t:3, i:1, d:1}
STEP 1: Find most frequent pair -> (l, o) appears 5 times (tied with (o, w); take the first)
Merge: l + o -> lo
New vocabulary: {l, o, w, e, r, n, s, t, i, d, _, lo}
STEP 2: Find most frequent pair -> (lo, w) appears 5 times
Merge: lo + w -> low
New vocabulary: {l, o, w, e, r, n, s, t, i, d, _, lo, low}
STEP 3: Find most frequent pair -> (e, s) appears 3 times (tied with (s, t))
Merge: e + s -> es
New vocabulary: {l, o, w, e, r, n, s, t, i, d, _, lo, low, es}
... and so on until vocabulary size is reached.
FINAL RESULT: Frequent words like "low" become single tokens after a few merges,
while rare words like "widest" remain split into subword pieces.
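The merge procedure can be checked with a short script. This is a minimal sketch of BPE training, assuming pairs are counted within words (weighted by word frequency), the space marker is ignored, and ties are broken by first occurrence; real tokenizer libraries differ in these details. The function names are hypothetical.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Return the ordered list of learned merges."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties resolved by first occurrence
        merges.append(best)
        words = merge_pair(words, best)
    return merges


corpus = "low low low lower lower newest newest widest"
print(train_bpe(corpus, 3))  # [('l', 'o'), ('lo', 'w'), ('e', 's')]
```

The ordered merge list is the actual output of tokenizer training: at encoding time, the same merges are replayed in the same order on new text.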
Modern LLMs use vocabulary sizes ranging from 32,000 tokens (LLaMA 1, 2) through roughly 100,000 tokens (GPT-4) and 128,000 tokens (LLaMA 3) to 256,000 tokens (Gemma 2). Larger vocabularies mean that more words and subwords are represented as single tokens, improving efficiency and reducing the sequence length for a given text. However, larger vocabularies also mean larger embedding tables and output projection matrices, which increases parameter count and memory usage.
SentencePiece is a popular tokenization library (developed by Google) that implements both BPE and Unigram tokenization in a language-agnostic way. It operates on raw Unicode text without requiring pre-tokenization (splitting on whitespace), which makes it particularly well-suited for multilingual models and languages like Chinese, Japanese, and Korean that do not use spaces to separate words. LLaMA 1, LLaMA 2, and many of their derivatives use SentencePiece with BPE; LLaMA 3 switched to a tiktoken-based BPE tokenizer.
Tiktoken is OpenAI's tokenization library, used by GPT-3.5, GPT-4, and related models. It implements a variant of BPE that operates on bytes rather than characters, ensuring that any input (including arbitrary binary data) can be tokenized without errors.
For a multilingual model targeting English, German, and Spanish, the tokenizer training corpus should reflect the target language distribution. A practical approach is to sample approximately equal amounts of text from each target language for tokenizer training, even if the model training data has a different distribution. This ensures that the tokenizer is efficient for all target languages.
Special tokens are an important part of the tokenizer design. These are tokens with special meanings that are not part of the natural language vocabulary. Common special tokens include the beginning-of-sequence token (BOS), the end-of-sequence token (EOS), the padding token (PAD), and tokens for marking the boundaries of different parts of a conversation (e.g., <|user|>, <|assistant|>, <|system|>). For reasoning models, special tokens are added to mark the boundaries of the reasoning trace, and for VLMs, special tokens are added to mark the position of visual tokens in the sequence.
CHAPTER SIX: MODEL ARCHITECTURE IN DETAIL
Now that we have our data and tokenizer, we can design the model architecture. We have already introduced the high-level structure of dense, MoE, reasoning, and VLM models. In this section, we go deeper into the specific architectural choices that you will need to make when implementing a model.
HYPERPARAMETER SELECTION
The first set of decisions concerns the model's size and shape. The key hyperparameters are the number of layers (depth), the model dimension (d_model, also called the hidden size), the number of attention heads, the head dimension, and the FFN intermediate dimension. These hyperparameters collectively determine the total number of parameters and the computational cost of training and inference.
Here are the hyperparameters for several well-known models to give you a sense of the design space:
MODEL          PARAMS  LAYERS  d_model  HEADS  HEAD_DIM  FFN_DIM
LLaMA-3-8B     8B      32      4096     32     128       14336
LLaMA-3-70B    70B     80      8192     64     128       28672
Mistral-7B     7B      32      4096     32     128       14336
Mixtral-8x7B   47B     32      4096     32     128       14336 (x8 experts)
DeepSeek-V3    671B    61      7168     128    128       18432 (dense layers; 2048 per expert, x256 routed experts)
Qwen2.5-72B    72B     80      8192     64     128       29568
The relationship between these hyperparameters follows some general rules. The FFN intermediate dimension is typically 4x the model dimension for dense models (though SwiGLU FFNs use a slightly different ratio, approximately 8/3 * d_model, rounded to a multiple of 64 for hardware efficiency). The number of attention heads is typically chosen so that the head dimension (d_model / num_heads) is 64 or 128. Larger head dimensions have been shown to improve performance on long-context tasks.
Grouped Query Attention (GQA), introduced by Ainslie et al. in 2023 and adopted by LLaMA 3, Mistral, and most modern models, is an important optimization. In standard multi-head attention (MHA), each attention head has its own Query, Key, and Value matrices. In GQA, multiple Query heads share a single Key-Value head. For example, LLaMA 3-8B uses 32 Query heads but only 8 Key-Value heads (a ratio of 4:1). This reduces the size of the KV cache during inference by a factor of 4, which is crucial for serving the model efficiently at scale.
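The KV cache saving is straightforward to quantify. The sketch below uses a hypothetical helper and assumes 2 bytes per value (BF16 or FP16) and a cache that stores one key and one value vector per layer, per KV head, per token.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of the KV cache; the leading factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# LLaMA 3-8B shape from the table above, with an 8192-token context.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
print(gqa / 1024**3, "GiB with GQA")   # 1.0 GiB with GQA
print(mha / 1024**3, "GiB with MHA")   # 4.0 GiB with MHA
```

These numbers are per sequence. When serving many concurrent requests, the cache is multiplied by the batch size, which is why the 4x reduction matters so much in practice.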
The context length (maximum sequence length) is another critical hyperparameter. Longer context lengths allow the model to process longer documents and conversations, but they are expensive: the attention score matrix is sequence_length x sequence_length, so attention compute grows quadratically with sequence length (and so does attention memory unless a memory-efficient kernel such as Flash Attention is used), while the KV cache and activations grow linearly. Modern models typically use context lengths of 4096, 8192, or 32768 tokens during pre-training, with techniques like RoPE scaling used to extend the context length after training.
ATTENTION MECHANISMS IN DETAIL
The self-attention mechanism deserves a more detailed treatment because it is the computational core of the Transformer and the source of most of its power and most of its computational cost.
For a sequence of n tokens, each represented as a vector of dimension d_model, the attention mechanism computes:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where Q (queries), K (keys), and V (values) are matrices of shape (n, d_k), (n, d_k), and (n, d_v) respectively, obtained by projecting the input with learned weight matrices W_Q, W_K, and W_V. The scaling factor 1/sqrt(d_k) prevents the dot products from becoming too large in magnitude, which would cause the softmax to saturate and produce very small gradients.
The causal mask in decoder-only models ensures that position i can only attend to positions 1 through i, not to future positions. This is implemented by adding a large negative number (typically -infinity) to the attention scores for positions j > i before the softmax, so that those positions receive zero attention weight after the softmax.
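To see the formula and the causal mask working together, here is a deliberately slow, pure-Python sketch of single-head causal attention on nested lists. It is meant for reading, not for performance; real implementations use batched tensor operations. The function name is hypothetical.

```python
import math


def causal_attention(Q, K, V):
    """Single-head attention where position i attends only to positions <= i."""
    n, d_k = len(Q), len(Q[0])
    outputs = []
    for i in range(n):
        # Scaled dot-product scores; future positions get -infinity.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            if j <= i else float("-inf")
            for j in range(n)
        ]
        # Numerically stable softmax: exp(-inf) evaluates to 0.0.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return outputs


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(causal_attention(Q, K, V)[0])  # [1.0, 2.0]
```

The first position can only see itself, so its output is exactly its own value vector regardless of what comes later in the sequence.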
Flash Attention, developed by Dao et al. at Stanford (2022, 2023), is a crucial algorithmic optimization that makes training large models practical. Standard attention requires materializing the full n x n attention matrix in GPU memory, which becomes prohibitively expensive for long sequences. Flash Attention computes the attention output in tiles, never materializing the full attention matrix, using a technique called online softmax. Flash Attention 2 (2023) further improved performance by optimizing the GPU memory access patterns. Flash Attention is now essentially universal in LLM training and is supported by PyTorch natively as of version 2.0.
POSITIONAL ENCODING: ROPE IN DETAIL
Rotary Position Embedding (RoPE) is the positional encoding scheme used by virtually all modern LLMs. The key idea is to encode the position of a token by rotating its Query and Key vectors in the complex plane, such that the dot product between a Query at position m and a Key at position n depends only on their relative position (m - n), not on their absolute positions. This relative-position property is a large part of why RoPE-based models can be extended to longer sequences than seen during training, although in practice this requires the scaling techniques described below rather than working out of the box.
RoPE applies a rotation matrix R_theta(m) to the Query vector at position m and R_theta(n) to the Key vector at position n, where theta is a set of per-dimension rotation frequencies derived from a single base value (theta_i = base^(-2i/d) for dimension pair i). The dot product Q_m * K_n then automatically encodes the relative position (m - n). The base is typically set to 10000 (as in the original RoPE paper) or to larger values (e.g., 500000 in LLaMA 3) to improve long-context performance.
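The relative-position property can be verified numerically. The sketch below treats each consecutive pair of vector components as one complex number and rotates it by an angle proportional to the position, assuming the frequency schedule theta_i = base^(-2i/d). Function names are hypothetical, and real implementations vectorize this and may pair dimensions differently.

```python
import cmath


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive component pairs of `vec` by position-dependent angles."""
    d = len(vec)
    rotated = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)           # rotation frequency for pair i
        z = complex(vec[2 * i], vec[2 * i + 1])
        z *= cmath.exp(1j * position * theta)   # rotate by angle position * theta
        rotated.extend([z.real, z.imag])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Same relative offset (3), very different absolute positions.
near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
far = dot(rope_rotate(q, 103), rope_rotate(k, 100))
print(abs(near - far) < 1e-9)  # True
```

Both dot products agree because only the difference of the two rotation angles survives in the product, which is exactly the (m - n) dependence described above.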
YaRN (Yet another RoPE extensioN) and other RoPE scaling techniques allow a model trained with a short context length to be extended to a longer context length without full retraining, by adjusting the RoPE frequencies. This is a practically important technique because training with very long context lengths is expensive.
CHAPTER SEVEN: PRE-TRAINING - THE BIG BANG OF YOUR MODEL
Pre-training is the stage where the model learns from the vast training corpus. It is the most computationally expensive stage by far, often accounting for 90-95% of the total compute budget for a model. It is also the stage that determines the model's fundamental knowledge, language understanding, and reasoning capabilities.
The pre-training objective for decoder-only LLMs is next-token prediction, also called causal language modeling. Given a sequence of tokens [t_1, t_2, ..., t_n], the model is trained to predict t_{i+1} given [t_1, ..., t_i], for all positions i simultaneously. The loss function is cross-entropy:
L = -(1/(n-1)) * sum_{i=1}^{n-1} log P(t_{i+1} | t_1, ..., t_i)
This simple objective, applied at massive scale, is sufficient to produce models with remarkable capabilities. The model learns grammar, facts, reasoning patterns, code syntax, mathematical notation, and much more, all from the signal of predicting the next token.
TRAINING INFRASTRUCTURE
Training a large LLM requires a cluster of GPUs (or TPUs) connected by a high-speed network. The standard GPU for LLM training as of 2024-2025 is the NVIDIA H100 (80GB HBM3 memory, 3.35 TB/s memory bandwidth, roughly 989 TFLOPS of dense BF16 tensor core performance, or 1979 TFLOPS with structured sparsity). Training a 7B parameter model for 1 trillion tokens on a cluster of 64 H100 GPUs takes approximately 3-4 weeks. Training a 70B parameter model for 1 trillion tokens requires approximately 512-1024 H100 GPUs for several weeks.
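Estimates like these can be reproduced with a back-of-the-envelope calculation. The sketch below uses the common approximation that training costs about 6 FLOPs per parameter per token. The peak throughput of 989 TFLOPS per GPU (dense BF16) and the 40% utilization figure are assumptions; real utilization varies widely with model size, parallelism strategy, and network quality.

```python
def training_days(n_params, n_tokens, n_gpus,
                  peak_flops_per_gpu=989e12, utilization=0.40):
    """Rough wall-clock estimate in days using the 6 * N * D rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    sustained_flops_per_second = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_second / 86_400


print(round(training_days(7e9, 1e12, n_gpus=64), 1))    # 19.2
print(round(training_days(70e9, 1e12, n_gpus=512), 1))  # 24.0
```

The estimate ignores restarts, evaluation runs, and checkpoint overhead, so actual schedules are usually somewhat longer.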
Distributed training is essential for models that do not fit in the memory of a single GPU. There are three main parallelism strategies, which are typically combined:
Data parallelism (DP) replicates the model across multiple GPUs and splits the training data across them. Each GPU processes a different batch of data, computes gradients, and then the gradients are averaged across all GPUs using an all-reduce operation. This is the simplest form of parallelism, but in its plain form it requires the entire model (plus gradients and optimizer states) to fit in a single GPU's memory. FSDP (Fully Sharded Data Parallel), PyTorch's native implementation, extends data parallelism by sharding the model parameters, gradients, and optimizer states across GPUs, allowing models larger than a single GPU's memory to be trained with data parallelism.
Tensor parallelism (TP) splits individual layers across multiple GPUs. For example, the attention heads can be split across GPUs so that each GPU computes a subset of the attention heads, and the FFN weight matrices can be split column-wise and row-wise. This requires all-reduce communication within each Transformer layer, which requires a very fast interconnect (NVLink within a node, InfiniBand across nodes). Megatron-LM (NVIDIA) implements tensor parallelism efficiently.
Pipeline parallelism (PP) assigns different layers of the model to different GPUs. GPU 0 processes layers 1-8, GPU 1 processes layers 9-16, and so on. This reduces the memory requirement per GPU but introduces pipeline bubbles (periods where some GPUs are idle waiting for input from the previous stage). Techniques like micro-batching and interleaved scheduling reduce the bubble overhead.
In practice, large-scale LLM training combines several of these strategies. For example, according to its technical report, DeepSeek-V3 was trained on a cluster of 2048 H800 GPUs using 16-way pipeline parallelism, 64-way expert parallelism (distributing the MoE experts across GPUs), and ZeRO-1 data parallelism, without tensor parallelism.
The optimizer used for LLM pre-training is almost universally AdamW (Adam with weight decay), with typical hyperparameters beta_1=0.9, beta_2=0.95, epsilon=1e-8, and weight decay=0.1. The learning rate schedule typically uses a linear warmup for the first 1-2% of training steps, followed by a cosine decay to a minimum learning rate of approximately 10% of the peak learning rate. The peak learning rate depends on the model size: larger models require smaller learning rates. Typical values range from 3e-4 for small models (1B parameters) to 1e-4 for large models (70B parameters).
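The warmup-plus-cosine schedule described here fits in one small function. This is a sketch with hypothetical parameter names, assuming a 1% linear warmup and a floor at 10% of the peak learning rate, as in the paragraph above.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-4,
               warmup_frac=0.01, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_lr_ratio * peak_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine


total = 10_000
for step in (0, 99, 5_000, 10_000):
    print(step, f"{lr_at_step(step, total):.2e}")
```

With these settings the learning rate climbs to 3e-4 over the first 100 steps and ends at 3e-5.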
Mixed precision training uses BFloat16 (BF16) for the forward and backward passes, while maintaining a full-precision (FP32) copy of the model parameters for the optimizer update. BF16 has the same exponent range as FP32 but fewer mantissa bits, making it more numerically stable than FP16 for training. Gradient scaling is not required with BF16, simplifying the training code.
TRAINING STABILITY
One of the most challenging aspects of LLM pre-training is maintaining training stability over the course of hundreds of thousands of training steps. Several failure modes can occur:
Loss spikes are sudden, large increases in the training loss that may or may not recover. They are often caused by particularly difficult or unusual batches of training data, numerical instability in the attention computation (especially for long sequences), or excessively large gradient norms. Gradient clipping (clipping the global gradient norm to a maximum value, typically 1.0) is essential for preventing loss spikes from diverging.
Loss divergence is a more severe failure where the loss increases without recovering. This is often caused by a learning rate that is too high, insufficient gradient clipping, or numerical overflow in the model computations. The solution is to restart training from the most recent checkpoint with a reduced learning rate.
Checkpoint saving is therefore critical. Modern training frameworks save checkpoints every few hundred training steps, allowing recovery from failures without losing too much progress. For very large training runs, checkpoint saving itself can be a significant overhead, and techniques like asynchronous checkpointing are used to minimize the impact.
Here is a simplified training loop in PyTorch to illustrate the key components:
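import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# MyTransformerModel, config, dataloader, total_steps, and save_checkpoint
# are placeholders assumed to be defined elsewhere.
model = MyTransformerModel(config).to("cuda")
optimizer = AdamW(model.parameters(), lr=3e-4,
                  betas=(0.9, 0.95), weight_decay=0.1)
# Cosine decay to 10% of the peak LR (linear warmup omitted for brevity)
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps,
                              eta_min=3e-5)

for step, batch in enumerate(dataloader):
    input_ids = batch["input_ids"].to("cuda")
    # Labels are assumed to be already shifted by one position
    labels = batch["labels"].to("cuda")

    # Forward pass in BF16
    with torch.autocast("cuda", dtype=torch.bfloat16):
        logits = model(input_ids)  # (batch, seq_len, vocab_size)
        # Flatten so every position is one classification example
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1))

    # Backward pass
    loss.backward()
    # Gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    # Optimizer step
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)

    if step % 100 == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")
    if step > 0 and step % 500 == 0:
        save_checkpoint(model, optimizer, scheduler, step)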
CHAPTER EIGHT: FINE-TUNING AND ALIGNMENT - MAKING THE MODEL USEFUL
After pre-training, you have a model that is very good at predicting the next token in a document, but it is not yet a useful assistant. It will complete text in whatever direction the training data suggests, which might not be what you want. The alignment stage transforms the raw pre-trained model into a helpful, harmless, and honest assistant that follows instructions, refuses harmful requests, and produces outputs in the format users expect.
Alignment is typically done in three stages: Supervised Fine-Tuning (SFT), Reward Modeling, and Reinforcement Learning from Human Feedback (RLHF). More recent approaches like Direct Preference Optimization (DPO) and GRPO simplify or replace some of these stages.
SUPERVISED FINE-TUNING
SFT is the first alignment step. You collect a dataset of (instruction, response) pairs, where the instruction is what a user might ask and the response is a high-quality answer. The model is then fine-tuned on this dataset using the same next-token prediction objective as pre-training, but only computing the loss on the response tokens (not the instruction tokens). This teaches the model to follow instructions and produce responses in the desired format.
The quality of the SFT dataset is critical. A small dataset of 10,000-100,000 very high-quality examples is generally more valuable than a large dataset of millions of mediocre examples. The LIMA paper (Zhou et al., Meta, 2023) demonstrated this dramatically: a model fine-tuned on just 1,000 carefully curated examples could match or exceed models fine-tuned on much larger datasets. The key is that the examples should be diverse (covering many different types of instructions), high-quality (the responses should be genuinely helpful, accurate, and well-written), and representative of the target use cases.
For a model targeting mathematics, the SFT dataset should include many examples of math problems with step-by-step solutions. For a model targeting code generation, it should include many examples of programming tasks with correct, well-documented code. For a multilingual model, it should include examples in all target languages.
The SFT data format uses a chat template to structure the conversation. Here is an example using the LLaMA 3 chat template format:
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant specializing in mathematics
and programming. You provide clear, step-by-step solutions.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Write a Python function to compute the Fibonacci sequence
up to n terms.
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Here is a Python function to compute the Fibonacci sequence:
def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    sequence = [0, 1]
    for i in range(2, n):
        sequence.append(sequence[-1] + sequence[-2])
    return sequence
This function handles edge cases (n <= 0, n == 1) and
builds the sequence iteratively for efficiency.
<|eot_id|>
During SFT training, the loss is computed only on the assistant's response tokens (the tokens after the last <|start_header_id|>assistant<|end_header_id|> token). The instruction tokens are used as context but do not contribute to the loss. This is implemented by setting the labels for instruction tokens to -100 (the ignore index in PyTorch's cross-entropy loss).
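The masking step can be sketched without any framework. The helper names below are hypothetical, the token IDs are toy values, and the code assumes that the shift between inputs and targets is handled later by the training code.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def find_last(seq, sub):
    """Index of the last occurrence of list `sub` inside list `seq`, or -1."""
    for start in range(len(seq) - len(sub), -1, -1):
        if seq[start:start + len(sub)] == sub:
            return start
    return -1


def mask_prompt_tokens(input_ids, assistant_header_ids):
    """Copy input_ids into labels, masking everything up to the response."""
    pos = find_last(input_ids, assistant_header_ids)
    if pos < 0:
        raise ValueError("assistant header not found in input_ids")
    response_start = pos + len(assistant_header_ids)
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


# Toy example: [10, 12] stands in for the tokenized assistant header.
input_ids = [1, 10, 11, 5, 6, 10, 12, 7, 8, 9]
print(mask_prompt_tokens(input_ids, assistant_header_ids=[10, 12]))
# [-100, -100, -100, -100, -100, -100, -100, 7, 8, 9]
```

For multi-turn conversations, a common variation is to unmask every assistant turn rather than only the last one; the approach shown here follows the single-response description above.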
REWARD MODELING
After SFT, the model can follow instructions, but it may not always produce the best possible response. To improve it further, we need a way to measure response quality automatically. This is the purpose of the reward model (RM).
A reward model is trained on a dataset of human preference comparisons. For a given instruction, human annotators are shown two or more responses generated by the SFT model and asked to rank them by quality. The reward model is then trained to predict which response a human would prefer. Technically, the reward model is typically initialized from the SFT model with the final language modeling head replaced by a scalar regression head, and trained using a Bradley-Terry preference model:
L_RM = -log(sigmoid(r(x, y_w) - r(x, y_l)))
where r(x, y) is the reward model's score for response y to instruction x, y_w is the preferred (winning) response, and y_l is the less preferred (losing) response. This loss encourages the reward model to assign higher scores to preferred responses.
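For a single preference pair, the loss is a one-liner. The sketch below uses scalar rewards and a numerically stable form of -log(sigmoid(d)); the function names are hypothetical, and a real reward model would compute the same quantity over batches of tensors.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))."""
    return softplus(-(r_chosen - r_rejected))


print(round(reward_model_loss(0.0, 0.0), 4))  # 0.6931 (no preference learned yet)
print(round(reward_model_loss(2.0, 0.0), 4))  # 0.1269 (correct ranking, low loss)
print(round(reward_model_loss(0.0, 2.0), 4))  # 2.1269 (wrong ranking, high loss)
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary. This is one reason reward scores are often normalized before being used in reinforcement learning.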
Collecting human preference data is expensive and time-consuming. A typical RLHF pipeline requires tens of thousands to hundreds of thousands of human preference comparisons. Companies like Scale AI, Surge AI, and Appen provide human annotation services for this purpose. For domain-specific models, it is important that the annotators have relevant expertise: a reward model for mathematical reasoning should be trained on comparisons made by mathematicians, not general-purpose crowd workers.
RLHF WITH PPO
Once we have a reward model, we can use it to further fine-tune the SFT model using reinforcement learning. The standard algorithm for this is PPO (Proximal Policy Optimization), a policy gradient algorithm that is stable and sample-efficient. In the RLHF context:
The "policy" is the LLM being trained. The "action" is generating a response token by token. The "reward" is the reward model's score for the complete response. A KL divergence penalty is added to the reward to prevent the model from diverging too far from the SFT model (which would cause it to produce responses that are high-scoring according to the reward model but incoherent or degenerate).
Total Reward = RM_score(response) - beta * KL(policy || SFT_model)
where beta is a coefficient that controls the strength of the KL penalty. This penalty is crucial: without it, the model will quickly learn to "game" the reward model by producing responses that score highly but are not actually useful (a phenomenon called "reward hacking").
RLHF with PPO is complex to implement and computationally expensive, requiring the simultaneous maintenance of four models: the policy (being trained), the reference policy (the frozen SFT model), the reward model, and the value function (a critic model used by PPO). This complexity has motivated the development of simpler alternatives.
DIRECT PREFERENCE OPTIMIZATION
DPO (Direct Preference Optimization), introduced by Rafailov et al. at Stanford in 2023, is an elegant simplification of RLHF that achieves comparable results with much less engineering complexity. The key insight of DPO is that the reward function implied by the RLHF objective can be rewritten in terms of the optimal policy itself, so the policy can be optimized directly on preference data with a simple classification-style loss, without explicitly training a reward model or running reinforcement learning.
The DPO loss is:
L_DPO = -log(sigmoid(beta * (log P_policy(y_w|x) - log P_ref(y_w|x))
- beta * (log P_policy(y_l|x) - log P_ref(y_l|x))))
This loss directly fine-tunes the policy model on preference data, using the SFT model as a reference. It increases the probability of preferred responses relative to the reference model and decreases the probability of rejected responses. DPO is now widely used in open-source LLM fine-tuning pipelines and is supported by libraries like TRL (Transformer Reinforcement Learning, by Hugging Face).
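The DPO loss for one preference pair can be written out with plain floats. In this sketch, each argument is the summed log-probability of a full response under the policy or the reference model; the names and the default beta of 0.1 are illustrative assumptions.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return softplus(-margin)


# At initialization the policy equals the reference model: loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
# Raising the chosen response's log-probability lowers the loss.
print(dpo_loss(-9.0, -12.0, -10.0, -12.0) < dpo_loss(-10.0, -12.0, -10.0, -12.0))  # True
```

Because the loss depends only on log-probability ratios against the reference model, the reference log-probabilities can be precomputed once for the whole dataset, which saves memory during training.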
GRPO FOR REASONING MODELS
GRPO (Group Relative Policy Optimization), introduced in the DeepSeekMath paper and used to train DeepSeek-R1, is particularly suited for training reasoning models because it works well with rewards that can be checked automatically, without requiring a learned reward model. In this setting it uses a verifiable reward function: for mathematical problems, the reward is 1 if the final answer is correct (verified by a symbolic math checker or by comparing to the ground truth) and 0 otherwise. For code problems, the reward is 1 if the generated code passes all test cases.
The GRPO algorithm generates a group of G responses for each problem, computes the reward for each response, and then updates the policy to increase the probability of high-reward responses relative to the group average. This group-relative reward signal is more stable than absolute rewards and avoids the need for a separate value function (unlike PPO).
Here is a conceptual illustration of GRPO for a math problem:
PROBLEM: "Solve for x: 2x + 5 = 13"
GENERATE GROUP OF 8 RESPONSES:
Response 1: "x = 4" -> Reward: 1 (CORRECT)
Response 2: "x = 4" -> Reward: 1 (CORRECT)
Response 3: "x = 9" -> Reward: 0 (WRONG: added 5 instead of subtracting it)
Response 4: "x = 4" -> Reward: 1 (CORRECT)
Response 5: "x = 6.5" -> Reward: 0 (WRONG: forgot to subtract 5 first)
Response 6: "x = 4" -> Reward: 1 (CORRECT)
Response 7: "x = 3" -> Reward: 0 (WRONG)
Response 8: "x = 4" -> Reward: 1 (CORRECT)
GROUP AVERAGE REWARD: 5/8 = 0.625
NORMALIZED REWARDS (relative to group average):
Responses 1,2,4,6,8: reward - avg = 1 - 0.625 = +0.375
Responses 3,5,7: reward - avg = 0 - 0.625 = -0.625
UPDATE: Increase probability of responses with positive normalized reward.
Decrease probability of responses with negative normalized reward.
This process, repeated over millions of problems, teaches the model to produce correct reasoning traces and correct final answers.
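The group-relative reward computation from the illustration can be sketched as follows. The function name is hypothetical. The optional division by the group's standard deviation reflects how the advantage is usually defined for GRPO; the illustration above shows only the mean-centering step.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=False, eps=1e-8):
    """Center rewards on the group mean; optionally scale by the group std."""
    avg = mean(rewards)
    centered = [r - avg for r in rewards]
    if normalize_std:
        std = pstdev(rewards)
        return [c / (std + eps) for c in centered]
    return centered


rewards = [1, 1, 0, 1, 0, 1, 0, 1]  # the eight responses from the example
print(group_relative_advantages(rewards))
# [0.375, 0.375, -0.625, 0.375, -0.625, 0.375, -0.625, 0.375]
```

If every response in a group receives the same reward (all correct or all wrong), all advantages are zero and that problem contributes no learning signal, which is why problem difficulty matters when assembling the training set.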
PARAMETER-EFFICIENT FINE-TUNING
Full fine-tuning of a large model requires updating all of its parameters, which is expensive in terms of memory and compute. For many practical applications, parameter-efficient fine-tuning (PEFT) methods allow you to adapt a pre-trained model to a specific task by updating only a small fraction of the parameters.
LoRA (Low-Rank Adaptation), introduced by Hu et al. at Microsoft in 2021, is the most widely used PEFT method. The idea is to represent the weight update for each layer as the product of two low-rank matrices: delta_W = A * B, where A has shape (d, r) and B has shape (r, d), and r is the rank (a small number, typically 4-64). Only A and B are trained; the original weight matrix W is frozen. The total number of trainable parameters is 2 * d * r per layer, which is much smaller than d^2 for typical values of r.
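The parameter savings are easy to compute. This sketch assumes a square weight matrix of size d x d, as in the paragraph above; the function names are hypothetical.

```python
def lora_trainable_params(d, r):
    """Parameters in A (d x r) plus B (r x d)."""
    return 2 * d * r


def full_finetune_params(d):
    """Parameters in the full d x d weight matrix."""
    return d * d


d, r = 4096, 16
lora = lora_trainable_params(d, r)
full = full_finetune_params(d)
print(lora, full, f"{100 * lora / full:.2f}%")  # 131072 16777216 0.78%
```

At rank 16, each adapted 4096 x 4096 matrix needs less than 1% of the parameters that full fine-tuning would update, and the optimizer state shrinks by the same factor.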
LoRA is particularly useful for fine-tuning large models on domain-specific data with limited compute. For example, you could take LLaMA 3-70B (which requires approximately 140GB of GPU memory for inference) and fine-tune it on a dataset of Siemens technical documentation using LoRA with rank 16, which would require only a few GB of additional memory for the LoRA parameters and could be done on a single 8xH100 node.
QLoRA (Quantized LoRA), introduced by Dettmers et al. in 2023, extends LoRA by quantizing the frozen base model to 4-bit precision (using a technique called NF4 quantization), dramatically reducing memory requirements. With QLoRA, a 70B parameter model can be fine-tuned on a single 8xA100 (40GB) node, making it accessible to organizations without massive GPU clusters.
CHAPTER NINE: BUILDING A VLM - ADDING EYES TO YOUR MODEL
Now that we have covered the full pipeline for building a text-only LLM, let us turn to the additional steps required to build a Vision-Language Model. We will assume that you start with a pre-trained LLM backbone (e.g., LLaMA 3-8B) and a pre-trained vision encoder (e.g., CLIP ViT-L/14) and want to connect them to create a VLM.
VISION ENCODER SELECTION AND CONFIGURATION
The vision encoder converts an image into a sequence of visual embeddings. The most important choices are the encoder architecture, the image resolution, and whether to fine-tune the encoder.
CLIP ViT-L/14 (Large Vision Transformer with 14x14 pixel patches) is the most commonly used vision encoder for VLMs. It was pre-trained by OpenAI on 400 million image-text pairs using a contrastive objective that aligns image and text representations. Its output for a 224x224 image is a sequence of 256 visual tokens (the image is divided into a 16x16 grid of 14x14 pixel patches) plus a global [CLS] token, each of dimension 1024.
SigLIP (Sigmoid Loss for Language-Image Pre-training), developed by Google in 2023, is a more recent alternative that uses a pairwise sigmoid loss instead of the softmax contrastive loss used by CLIP. SigLIP has been reported to produce better visual representations for VLMs and is used by models like PaliGemma.
For high-resolution image understanding (e.g., reading text in images, analyzing detailed diagrams, or processing engineering schematics), the standard 224x224 resolution is insufficient. The dynamic resolution approach used by LLaVA-HD and InternVL divides the image into multiple tiles (e.g., a 1344x1344 image into 16 tiles of 336x336 pixels), processes each tile separately with the vision encoder, and concatenates the resulting visual tokens. This produces a much larger number of visual tokens (e.g., 16 * 576 = 9216 tokens for this image, since each 336x336 tile yields a 24x24 grid of 14x14 pixel patches), which requires a large context window in the language model.
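The visual token count follows directly from the image size, the tile size, and the patch size. The short sketch below does that arithmetic for square images; the function names are hypothetical, and it ignores extras such as a [CLS] token or a downscaled overview tile that some models add.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

def tiled_tokens(image_size: int, tile_size: int, patch_size: int) -> tuple[int, int]:
    """Return (number of tiles, total patch tokens) for a tiled square image."""
    if image_size % tile_size != 0:
        raise ValueError("image_size must be a multiple of tile_size")
    tiles_per_side = image_size // tile_size
    n_tiles = tiles_per_side * tiles_per_side
    return n_tiles, n_tiles * patch_tokens(tile_size, patch_size)

print(patch_tokens(224, 14))        # 256
print(tiled_tokens(1344, 336, 14))  # (16, 9216)
```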
THE CONNECTOR / PROJECTOR
The connector maps visual embeddings from the vision encoder's space to the language model's embedding space. The design of the connector is a key architectural decision.
The simplest connector is a two-layer MLP (Multi-Layer Perceptron) with a GELU activation function:
visual_tokens = GELU(W1 * vision_embeddings + b1)
projected_tokens = W2 * visual_tokens + b2
where W1 maps from d_vision (e.g., 1024) to d_model (e.g., 4096) and W2 maps from d_model to d_model. This is the connector used by LLaVA-1.5, and despite its simplicity, it works remarkably well. The MLP connector is trained during the VLM fine-tuning stage to learn the mapping between visual and language representations.
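As a rough sketch of the two equations above, the following NumPy code projects 256 visual tokens from d_vision = 1024 to d_model = 4096. It uses the row-vector convention (x @ W instead of W * x), the tanh approximation of GELU, and random weights; all three are assumptions made for illustration, since a trained connector would load learned weights.

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_connector(vision_embeddings, W1, b1, W2, b2):
    """Two-layer MLP: d_vision -> d_model -> d_model, applied to each token."""
    hidden = gelu(vision_embeddings @ W1 + b1)
    return hidden @ W2 + b2

d_vision, d_model, n_tokens = 1024, 4096, 256
rng = np.random.default_rng(0)
# Random float32 weights stand in for trained parameters.
W1 = rng.standard_normal((d_vision, d_model), dtype=np.float32) * 0.02
b1 = np.zeros(d_model, dtype=np.float32)
W2 = rng.standard_normal((d_model, d_model), dtype=np.float32) * 0.02
b2 = np.zeros(d_model, dtype=np.float32)

vision_embeddings = rng.standard_normal((n_tokens, d_vision), dtype=np.float32)
projected_tokens = mlp_connector(vision_embeddings, W1, b1, W2, b2)
print(projected_tokens.shape)  # (256, 4096)
```

The number of tokens is unchanged by the connector; only the per-token dimension changes, which is why high-resolution inputs translate directly into long language-model contexts.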
The Q-Former connector, used by InstructBLIP and BLIP-2, is more sophisticated. It uses a set of N learned query vectors (e.g., N=32) that attend to the visual tokens via cross-attention. The output is a fixed-size sequence of N visual tokens, regardless of the number of visual tokens produced by the vision encoder. This is useful when you want to limit the number of visual tokens passed to the language model, but it can lose fine-grained visual detail.
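The fixed-size output is easiest to see in code. The sketch below implements only the core idea, a single-head cross-attention from N = 32 learned queries to the visual tokens; the real Q-Former stacks several Transformer layers with self-attention and feed-forward blocks, and the dimension d = 64 and the random weights here are purely illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, visual_tokens, Wq, Wk, Wv):
    """Learned queries attend over visual tokens; output has one row per query."""
    Q = queries @ Wq
    K = visual_tokens @ Wk
    V = visual_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights over visual tokens sum to 1
    return weights @ V

rng = np.random.default_rng(0)
d, n_queries = 64, 32
queries = rng.normal(size=(n_queries, d))  # stands in for the learned query vectors
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

shapes = {}
for n_visual in (256, 576):
    visual_tokens = rng.normal(size=(n_visual, d))
    shapes[n_visual] = query_cross_attention(queries, visual_tokens, Wq, Wk, Wv).shape
print(shapes)
```

Whether the encoder emits 256 or 576 tokens, the language model receives 32, which is exactly the compression (and the potential loss of detail) described above.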
VLM TRAINING STAGES
VLM training typically proceeds in two stages, as mentioned earlier. Let us now describe each stage in detail.
Stage 1 (Connector Pre-training): Only the connector is trained, while the vision encoder and language model are frozen. The training data consists of a large number of image-caption pairs (e.g., LAION-CC-SBU, a combination of LAION, Conceptual Captions, and SBU Captions, totaling approximately 558,000 image-caption pairs as used for LLaVA-1.5). The task is simple image captioning: given an image, generate a description. This stage teaches the connector to produce visual tokens that the language model can interpret as meaningful visual descriptions.
Stage 2 (Visual Instruction Tuning): The connector and the language model are jointly fine-tuned on a diverse dataset of visual instruction following examples. This dataset includes visual question answering (VQA), image captioning with detailed descriptions, optical character recognition (OCR), chart and diagram understanding, and multi-turn visual conversations. The vision encoder may be kept frozen or fine-tuned with a small learning rate. This stage teaches the model to follow instructions about images and to integrate visual and textual reasoning.
Here is a sample visual instruction tuning record for an industrial use case:
IMAGE: [Photo of a Siemens SIMATIC S7-1500 PLC with a
fault indicator LED lit in red]
INSTRUCTION: "What does the red LED on this PLC indicate,
and what troubleshooting steps should I take?"
RESPONSE: "The red LED on the SIMATIC S7-1500 PLC indicates
a fault condition. The ERROR LED lights or flashes
red when the CPU has detected an error. To troubleshoot:
First, connect to the PLC using TIA Portal and check
the diagnostics buffer for the specific error code.
Common causes include a program error, hardware fault,
or communication error with a connected module. Check
that all module connectors are properly seated and that
the power supply voltage is within specification (24V DC
for most S7-1500 CPUs). If the error persists, consult
the SIMATIC S7-1500 system manual for the specific
error code."
This kind of domain-specific visual instruction tuning data is what transforms a general-purpose VLM into a specialized industrial assistant. Creating this data requires domain experts who can write accurate, detailed responses to visual questions in the target domain.
CHAPTER TEN: EVALUATION - HOW DO YOU KNOW IF IT'S GOOD?
Building an LLM or VLM without a rigorous evaluation framework is like flying blind. You need to know, at every stage of the pipeline, whether your model is improving and in what dimensions. Evaluation in LLM development is both an art and a science, and it is more complex than it might initially appear.
AUTOMATIC BENCHMARKS
Automatic benchmarks are standardized datasets with ground-truth answers that allow you to measure specific capabilities reproducibly. Here are the most important benchmarks for different capability areas:
MMLU (Massive Multitask Language Understanding) tests general knowledge across 57 subjects including mathematics, science, law, medicine, and humanities. It consists of 14,079 multiple-choice questions. A random baseline scores 25% (since there are 4 choices per question). GPT-4 scores approximately 86%, and LLaMA 3-70B scores approximately 82%.
GSM8K (Grade School Math 8K) tests mathematical reasoning with 8,500 grade school math word problems. These problems require multi-step arithmetic reasoning and are a standard benchmark for math capability. GPT-4 scores approximately 92%, and LLaMA 3-70B scores approximately 93%.
MATH (Hendrycks et al., 2021) is a much harder math benchmark with 12,500 competition-level math problems from AMC, AIME, and similar competitions. It requires advanced mathematical reasoning. GPT-4 scores approximately 52%, while reasoning models like DeepSeek-R1 score approximately 97% on the widely used MATH-500 subset.
HumanEval (Chen et al., OpenAI, 2021) tests code generation with 164 Python programming problems. The model must generate a function that passes a set of unit tests. GPT-4 scores approximately 87%, and LLaMA 3-70B scores approximately 81%.
MBPP (Mostly Basic Python Problems) is a similar code benchmark with about 1,000 crowd-sourced Python problems (974 in the released dataset, of which 500 form the test split), slightly easier than HumanEval.
HellaSwag tests commonsense reasoning with sentence completion tasks. TruthfulQA tests whether the model produces truthful answers to questions that humans commonly answer incorrectly. BIG-Bench Hard is a collection of 23 challenging tasks that require multi-step reasoning.
For multilingual evaluation, MGSM (Multilingual Grade School Math) tests math reasoning in 11 languages, and XCOPA tests commonsense reasoning in 11 languages. For German specifically, the GermanBench suite includes several German-language benchmarks.
For VLMs, key benchmarks include MMBench (a comprehensive VQA benchmark), MMMU (Massive Multidisciplinary Multimodal Understanding), TextVQA (reading text in images), DocVQA (understanding document images), and ChartQA (understanding charts and graphs).
PERPLEXITY
Perplexity is a fundamental metric for language models that measures how well the model predicts a held-out test set. It is defined as the exponentiated average negative log-likelihood:
Perplexity = exp(-(1/n) * sum_{i=1}^{n} log P(t_i | t_1, ..., t_{i-1}))
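Lower perplexity indicates better prediction. Perplexity is useful for comparing models that share a tokenizer and are evaluated on the same data distribution, but it is less useful for comparing models across different tokenizers or data distributions, because it is a per-token quantity and therefore sensitive to tokenization differences. A model with a larger vocabulary splits the same text into fewer, longer tokens, each of which is harder to predict, so its per-token perplexity will generally be higher than that of a model with a smaller vocabulary, even if the two have similar capabilities. Normalizing by bytes or characters instead of tokens (e.g., bits per byte) makes such comparisons fairer.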
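The formula translates almost directly into code. The sketch below assumes you already have the natural-log probability the model assigned to each token of the test text; the bits_per_byte helper is an added illustration of a tokenizer-independent normalization, not something defined earlier in this tutorial.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

def bits_per_byte(token_log_probs, n_bytes):
    """Total negative log-likelihood in bits, divided by the byte length of the text."""
    total_bits = -sum(token_log_probs) / math.log(2)
    return total_bits / n_bytes

# A model that is always uniformly unsure between 4 options has perplexity 4.
uniform_guess = [math.log(1 / 4)] * 10
print(perplexity(uniform_guess))
```

Two models with different tokenizers produce different numbers of tokens for the same text, so their per-token perplexities are not comparable, while their bits-per-byte values are.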
HUMAN EVALUATION
Automatic benchmarks, while essential, do not capture everything that matters about a model's quality. Human evaluation is necessary to assess qualities like helpfulness, clarity, tone, and the ability to handle ambiguous or open-ended requests.
The most common form of human evaluation for LLMs is pairwise comparison: human evaluators are shown two responses to the same instruction and asked which one they prefer. This is the same format used to collect RLHF training data. Platforms like Chatbot Arena (LMSYS, 2023) have collected millions of human preference comparisons between different LLMs, producing an Elo-style ranking that is widely used as a measure of overall model quality.
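To show how pairwise preferences turn into a ranking, here is a minimal sketch of the classic online Elo update. The K-factor of 32, the 400-point scale, the starting rating of 1000, and the model names are conventional or hypothetical choices; leaderboards such as Chatbot Arena use their own, more elaborate statistical procedures, so treat this only as an illustration of the principle.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

ratings = {"model_x": 1000.0, "model_y": 1000.0}
# Each tuple: (model A, model B, score for A) from one human comparison.
comparisons = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]
for a, b, score_a in comparisons:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
print(ratings)
```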
For domain-specific models, human evaluation should be conducted by domain experts. A model for industrial automation should be evaluated by automation engineers, not by general-purpose crowd workers. The evaluation criteria should be tailored to the domain: for a Siemens industrial assistant, criteria might include technical accuracy, adherence to Siemens product terminology, and the quality of troubleshooting guidance.
EVALUATION DURING TRAINING
It is important to evaluate the model regularly during training, not just at the end. This allows you to detect problems early (like a capability regression or a language imbalance) and adjust the training accordingly. A practical approach is to run a lightweight evaluation suite every 1,000-5,000 training steps, covering a small but representative set of benchmarks. This gives you a training curve for each benchmark, allowing you to track progress and detect regressions.
CHAPTER ELEVEN: DEPLOYMENT - GETTING YOUR MODEL INTO THE WORLD
After pre-training, fine-tuning, and evaluation, you have a model that you are happy with. Now you need to deploy it so that users can actually interact with it. Deployment introduces a new set of engineering challenges around efficiency, latency, throughput, and cost.
QUANTIZATION
The first step in preparing a model for deployment is typically quantization: reducing the precision of the model's weights from 16-bit (BF16 or FP16) to 8-bit (INT8) or 4-bit (INT4). Quantization reduces memory usage and increases inference speed, often with only a small loss in model quality.
GPTQ (Generative Pre-trained Transformer Quantization) is a popular post-training quantization method that quantizes the model's weights to 4-bit or 8-bit precision using a second-order optimization technique. AWQ (Activation-aware Weight Quantization) is a more recent method that identifies and protects the most important weights during quantization, achieving better quality at the same bit-width. GPTQ and AWQ are implemented in the AutoGPTQ and AutoAWQ libraries, respectively.
A 7B parameter model in BF16 requires approximately 14GB of GPU memory for its weights. Quantized to 4-bit with GPTQ or AWQ, it requires only approximately 4GB, so it runs on a single consumer GPU like the NVIDIA RTX 4090 (24GB) with ample room left for the KV cache, and even fits on cards with 8GB of memory. This is a dramatic reduction in deployment cost.
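The memory figures come from simple arithmetic on the weight count. The sketch below computes weight storage only; it deliberately ignores quantization metadata (scales and zero-points), activations, and the KV cache, which is why real 4-bit checkpoints are somewhat larger than the raw number it prints.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for n_params, name in ((7e9, "7B"), (70e9, "70B")):
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: {weight_memory_gb(n_params, bits):.1f} GB")
```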
INFERENCE OPTIMIZATION
Several algorithmic optimizations can dramatically improve inference throughput and latency:
KV cache management is critical for efficient autoregressive generation. During generation, the model computes Key and Value matrices for each token in the context. These can be cached and reused for subsequent tokens, avoiding redundant computation. However, the KV cache grows linearly with the sequence length and the number of layers, and can become very large for long sequences. Techniques like sliding window attention (used by Mistral) and KV cache quantization help manage this.
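A quick estimate shows why the KV cache matters. The sketch below counts one Key and one Value vector per layer, per KV head, per token. The example configuration (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is an assumption chosen to resemble an 8B-class model with grouped-query attention; substitute the values from your own model's configuration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: Keys and Values (factor 2) for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Assumed configuration, see the note above.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)
for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len=seq_len, **config) / 2**30
    print(f"{seq_len:>7} tokens: {gib:.1f} GiB per sequence")
```

The cache grows linearly with both sequence length and batch size, so at long contexts or high concurrency it can exceed the memory used by the weights themselves.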
Continuous batching (also called dynamic batching) is a technique for serving multiple users simultaneously. Instead of waiting for all requests in a batch to complete before starting new ones, continuous batching allows new requests to be added to the batch as soon as a slot becomes available. This dramatically improves GPU utilization and throughput. vLLM (Virtual LLM), developed at UC Berkeley in 2023, implements continuous batching with a novel memory management technique called PagedAttention that treats the KV cache like virtual memory, allowing efficient sharing and reuse of KV cache pages across requests.
Speculative decoding is a technique for reducing latency in autoregressive generation. A small, fast "draft" model generates a sequence of candidate tokens, and the large "target" model verifies them in parallel in a single forward pass. The longest prefix of draft tokens that the target model agrees with is accepted; at the first disagreement, the target model's own token is used instead and the remaining draft tokens are discarded. This can reduce latency by 2-3x for tasks where the draft model is often correct (e.g., code completion).
Here is a simplified illustration of speculative decoding:
TARGET MODEL: LLaMA-3-70B (slow but accurate)
DRAFT MODEL: LLaMA-3-8B (fast but less accurate)
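STEP 1: Draft model generates 8 candidate tokens:
"The capital of France is Paris , which"
STEP 2: Target model verifies all 8 tokens in ONE forward pass:
"The" -> ACCEPT (target agrees)
"capital" -> ACCEPT
"of" -> ACCEPT
"France" -> ACCEPT
"is" -> ACCEPT
"Paris" -> ACCEPT
"," -> REJECT (target would have said ".")
"which" -> DISCARD (follows a rejected token)
RESULT: 6 draft tokens are accepted and the target's own "." replaces
the rejected token, so one target forward pass yields 7 tokens
instead of 7 separate forward passes. The end-to-end speedup is
smaller than 7x, because the draft model's passes are not free and
acceptance rates vary; 2-3x is typical.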
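The accept/reject logic can be sketched with toy "models" that simply replay a fixed sentence. This is the greedy variant only: real systems compare token probabilities with a rejection-sampling rule and score all draft positions in one batched target forward pass, whereas the loop below calls the target once per position for clarity. The helper names and the replay models are hypothetical.

```python
from typing import Callable, List

Model = Callable[[List[str]], str]  # maps a context to its greedy next token

def speculative_step(context: List[str], draft: Model, target: Model, k: int) -> List[str]:
    """Return the tokens produced by one draft-then-verify round."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    # 2. The target model checks the proposal position by position.
    accepted: List[str] = []
    for token in proposal:
        target_token = target(context + accepted)
        if target_token != token:
            accepted.append(target_token)  # target's correction ends the round
            return accepted
        accepted.append(token)
    # 3. Everything matched: the target's pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted

def replay_model(text: List[str]) -> Model:
    """Toy model that emits text[i] as the token at position i."""
    return lambda context: text[min(len(context), len(text) - 1)]

draft = replay_model(["The", "capital", "of", "France", "is", "Paris", ",", "which", "<eos>"])
target = replay_model(["The", "capital", "of", "France", "is", "Paris", ".", "<eos>"])
result = speculative_step([], draft, target, k=8)
print(result)
```

Note that the output is always exactly what the target model would have generated on its own; the draft model only changes how many target passes are needed to get there.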
SERVING INFRASTRUCTURE
For production deployment, you need a serving infrastructure that handles request routing, load balancing, authentication, rate limiting, and monitoring. Popular serving frameworks for LLMs include vLLM (which provides an OpenAI-compatible API), TGI (Text Generation Inference, by Hugging Face), and TensorRT-LLM (by NVIDIA, optimized for NVIDIA GPUs).
For cloud deployment, the major cloud providers (AWS, Azure, GCP) offer managed LLM serving services. AWS SageMaker, Azure AI Studio, and Google Cloud Vertex AI all support deploying custom LLMs with auto-scaling and monitoring. For on-premises deployment (which may be required for data privacy reasons in industrial or healthcare settings), Kubernetes-based deployments with GPU node pools are the standard approach.
CHAPTER TWELVE: PUTTING IT ALL TOGETHER - A REALISTIC PROJECT PLAN
Let us now synthesize everything we have covered into a realistic project plan for building a domain-specific LLM. We will use a concrete example: a 7B parameter model targeting industrial automation, with strong capabilities in German and English, and specialized knowledge of Siemens products and industrial protocols.
PROJECT PHASES AND TIMELINE
Phase 1: Planning and Infrastructure Setup (4-6 weeks). During this phase, you define the model's target capabilities and languages, select the base architecture (in this case, a dense decoder-only model based on the LLaMA 3 architecture), set up the training infrastructure (GPU cluster or cloud compute), and establish the evaluation framework. You also begin assembling the team: you will need ML engineers with experience in distributed training, data engineers for the data pipeline, domain experts for data curation and evaluation, and DevOps engineers for the serving infrastructure.
Phase 2: Data Collection and Processing (8-12 weeks). This is often the longest and most labor-intensive phase. You collect data from all target sources (web crawl, technical documentation, standards documents, code repositories, multilingual corpora), build and run the data processing pipeline (language identification, quality filtering, deduplication, content filtering), and assemble the final training dataset. For a domain-specific model, you also collect and curate domain-specific data: Siemens product manuals, TIA Portal documentation, IEC 61131-3 standard documents, industrial automation forums, and similar sources.
Phase 3: Tokenizer Training (1-2 weeks). You train a BPE tokenizer on a representative sample of your training data, with a vocabulary size of 64,000-128,000 tokens. You verify that the tokenizer handles German, English, and technical terminology (including PLC programming syntax and industrial protocol names) efficiently.
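Phase 4: Pre-training (8-16 weeks, depending on compute). You pre-train the model on the full training dataset. For a 7B parameter model trained on 1 trillion tokens on a cluster of 64 H100 GPUs, the main training run alone takes approximately 3-4 weeks at good hardware utilization; the longer phase estimate leaves room for smaller clusters, preliminary small-scale runs, and restarts after failures. You monitor training loss, benchmark scores, and training stability throughout, saving checkpoints regularly.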
Phase 5: Supervised Fine-Tuning (2-4 weeks). You collect or create a dataset of instruction-response pairs covering the target use cases: answering questions about Siemens products, explaining PLC programming concepts, troubleshooting industrial automation issues, and generating code in structured text (ST) and ladder diagram (LD) formats. You fine-tune the pre-trained model on this dataset.
Phase 6: Alignment (RLHF or DPO) (2-4 weeks). You collect human preference data from domain experts (automation engineers) who compare pairs of model responses and indicate which is better. You then use DPO (or RLHF with PPO for more control) to align the model with these preferences.
Phase 7: Evaluation (2-4 weeks, ongoing). You evaluate the model on the standard benchmarks (MMLU, GSM8K, HumanEval) to ensure that domain-specific fine-tuning has not degraded general capabilities, and on domain-specific benchmarks (questions about Siemens products, PLC programming tasks, industrial troubleshooting scenarios) to measure domain-specific performance. You also conduct human evaluation with domain experts.
Phase 8: Deployment (2-4 weeks). You quantize the model to 4-bit or 8-bit precision, set up the serving infrastructure (vLLM or TGI on Kubernetes), and deploy the model. You set up monitoring for latency, throughput, error rates, and model quality (using a sample of production requests for offline evaluation).
Total timeline: approximately 6-12 months for a well-resourced team. Compute for a single clean pre-training run of a 7B model on 1 trillion tokens is comparatively modest: with H100s at approximately $2-3/hour per GPU, 64 GPUs for 4 weeks come to 64 * 24 * 28 = 43,008 GPU-hours, or approximately $107,520 at $2.50/hour. A realistic pre-training budget is several times higher, because it also has to cover preliminary experiments, restarted or failed runs, evaluation, and storage; $500,000-$1,000,000 is a more cautious planning figure, and it scales with the actual token count and cluster size. Fine-tuning and alignment are much cheaper, typically $10,000-$50,000 in compute.
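The duration and cost of the main training run can be sanity-checked with a back-of-the-envelope calculation. The sketch below uses the common approximation of 6 FLOPs per parameter per token; the peak throughput of 989 TFLOP/s per H100 (dense BF16) and the 40% model FLOPs utilization are assumptions, and real runs vary considerably around them.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

def training_days(flops: float, n_gpus: int, peak_flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days at a given model FLOPs utilization (mfu between 0 and 1)."""
    seconds = flops / (n_gpus * peak_flops_per_gpu * mfu)
    return seconds / 86_400

def rental_cost(n_gpus: int, days: float, usd_per_gpu_hour: float) -> float:
    """Cloud rental cost for the given cluster and duration."""
    return n_gpus * days * 24 * usd_per_gpu_hour

flops = training_flops(n_params=7e9, n_tokens=1e12)
days = training_days(flops, n_gpus=64, peak_flops_per_gpu=989e12, mfu=0.40)  # assumed values
print(f"compute: {flops:.2e} FLOPs, about {days:.0f} days on 64 GPUs")
print(f"4 weeks of 64 GPUs at $2.50/h: ${rental_cost(64, 28, 2.5):,.0f}")
```

Under these assumptions the run lands at roughly three weeks, which is consistent with the 3-4 week figure quoted for Phase 4.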
For most organizations, the more practical approach is to start with an existing open-weight model (LLaMA 3, Mistral, Qwen 2.5, etc.) and fine-tune it on domain-specific data, skipping the pre-training phase entirely. This reduces the timeline to 2-4 months and the compute cost to $10,000-$100,000, while still achieving excellent domain-specific performance.
CHAPTER THIRTEEN: COMMON PITFALLS AND HOW TO AVOID THEM
Building an LLM is a complex endeavor with many opportunities for things to go wrong. Here are the most common pitfalls, based on the collective experience of the research community.
Data contamination occurs when your evaluation benchmarks appear in your training data. If the model has seen the answers to GSM8K problems during training, its GSM8K score will be inflated and will not reflect its true mathematical reasoning ability. To prevent this, you should deduplicate your training data against your evaluation benchmarks before training. This is called "benchmark decontamination" and is a standard step in responsible LLM development.
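One simple way to implement decontamination is n-gram overlap: drop any training document that shares a long word n-gram with a benchmark item. The sketch below assumes whitespace-and-punctuation normalization and n = 8 for the demo (longer n-grams such as 13 are also common); the benchmark question and documents are invented for illustration.

```python
import re

def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_items, n: int) -> set:
    """Union of all n-grams that occur in any benchmark item."""
    index = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index

def is_contaminated(document: str, index: set, n: int) -> bool:
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(index)

# Invented benchmark item and training documents.
benchmark = ["A farmer has 12 cows and buys 7 more cows. How many cows does the farmer have now?"]
documents = {
    "clean": "Dairy farming guide: a farmer has to check the herd every morning before milking.",
    "leaked": "Quiz answers: a farmer has 12 cows and buys 7 more cows. how many cows does the farmer have now? 19",
}
index = build_benchmark_index(benchmark, n=8)
kept = [name for name, doc in documents.items() if not is_contaminated(doc, index, n=8)]
print(kept)  # ['clean']
```

Exact n-gram matching misses paraphrased or translated benchmark items, so it is a lower bound on contamination rather than a guarantee of a clean dataset.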
Catastrophic forgetting occurs when fine-tuning on domain-specific data causes the model to forget its general capabilities. For example, fine-tuning aggressively on German industrial text might cause the model to perform worse on English benchmarks. To prevent this, you should include a mixture of general-purpose data in your fine-tuning dataset (typically 10-20% general data mixed with 80-90% domain-specific data), use a lower learning rate for fine-tuning than for pre-training, and monitor general benchmarks throughout fine-tuning.
Reward hacking in RLHF occurs when the model learns to produce responses that score highly according to the reward model but are not actually useful. This can happen because the reward model is imperfect and can be fooled by superficial features like response length, confident tone, or the use of specific phrases that the reward model associates with quality. To prevent this, use a KL penalty in RLHF (as discussed earlier), monitor the distribution of response lengths and other surface features during RL training, and regularly evaluate the model on held-out human preference data.
Tokenizer mismatch occurs when you use a pre-trained model with a different tokenizer than the one it was trained with. This is a surprisingly common mistake when adapting open-weight models. Always use the exact tokenizer that the base model was trained with, and if you need to add new tokens (e.g., for domain-specific terminology or special tokens), be careful to initialize the new token embeddings appropriately (e.g., by averaging the embeddings of related tokens).
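Averaging the embeddings of related tokens can be sketched in a few lines of NumPy. The vocabulary, the new token "PROFINET", and the choice of its sub-pieces as the related tokens are hypothetical; with a real framework you would resize the model's embedding matrix (and output layer) through its own API and then overwrite the new rows in the same way.

```python
import numpy as np

def add_token_with_mean_init(embeddings, vocab, new_token, related_tokens):
    """Append one embedding row initialized to the mean of related tokens' rows."""
    if new_token in vocab:
        raise ValueError(f"{new_token!r} is already in the vocabulary")
    related_ids = [vocab[token] for token in related_tokens]
    new_row = embeddings[related_ids].mean(axis=0, keepdims=True)
    new_vocab = dict(vocab)
    new_vocab[new_token] = embeddings.shape[0]  # new id is the next free row
    return np.vstack([embeddings, new_row]), new_vocab

# Hypothetical 4-token vocabulary with 3-dimensional embeddings.
vocab = {"PRO": 0, "FI": 1, "NET": 2, "the": 3}
embeddings = np.arange(12, dtype=float).reshape(4, 3)

embeddings2, vocab2 = add_token_with_mean_init(
    embeddings, vocab, "PROFINET", related_tokens=["PRO", "FI", "NET"]
)
print(vocab2["PROFINET"], embeddings2[-1])
```

Starting from the mean of the pieces the tokenizer previously produced keeps the new token close to what the model already "saw" for that word, which tends to be gentler than a random initialization.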
Training instability in MoE models is more common than in dense models, due to the routing mechanism and the load-balancing auxiliary loss. If you observe loss spikes or expert collapse during MoE training, try increasing the load-balancing loss coefficient, reducing the learning rate, or switching to a more stable routing algorithm.
CONCLUSION: THE ROAD AHEAD
We have now covered the complete pipeline for building an LLM or VLM, from the initial architectural decision through data collection, tokenization, pre-training, alignment, evaluation, and deployment. We have seen how different model types (dense models, Mixture-of-Experts models, reasoning models, and Vision-Language Models) each make different trade-offs and require different engineering approaches. We have seen how to target specific capabilities like mathematics and coding, and how to build genuinely multilingual models that serve German, English, and Spanish speakers equally well.
The field is moving at an extraordinary pace. Techniques that were cutting-edge in 2022 (like RLHF with PPO) have already been partially superseded by simpler alternatives (DPO, GRPO). New architectural innovations (like GQA, RoPE, and MoE with fine-grained routing) have become standard practice in just a few years. The compute required to train a competitive model has been decreasing as training efficiency improves, even as the frontier models continue to grow larger.
What will not change is the fundamental importance of data quality, the elegance of the Transformer architecture, and the power of scale. Understanding these foundations deeply, as you now do after reading this tutorial, gives you the ability to navigate the rapidly evolving landscape of LLM and VLM development with confidence. Whether you are fine-tuning an existing model for a specific domain, evaluating a commercial model for your organization's use case, or leading a full model development project, the knowledge you have gained here will serve you well.
The most important thing to remember is that building an LLM is not magic. It is engineering, applied at scale, guided by careful empirical observation. Every decision in the pipeline, from the data mix to the learning rate schedule to the reward model design, has measurable consequences that can be studied, understood, and improved. That is what makes this field so fascinating, and so rewarding to work in.
Good luck building your model. The world needs more people who understand how these systems actually work.
REFERENCES AND FURTHER READING
"Attention Is All You Need" by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin (Google, 2017) is the original Transformer paper and remains essential reading. It is available at arxiv.org/abs/1706.03762.
"LLaMA: Open and Efficient Foundation Language Models" by Touvron et al. (Meta AI, 2023) describes the architecture and training of the original LLaMA model family, ranging from 7B to 65B parameters, trained exclusively on publicly available data. It is available at arxiv.org/abs/2302.13971. The follow-up technical report, "The Llama 3 Herd of Models" by Dubey et al. (Meta AI, 2024), describes the LLaMA 3 model family and is available at arxiv.org/abs/2407.21783.
"Mixtral of Experts" by Jiang et al. (Mistral AI, 2024) describes the Mixtral 8x7B Sparse Mixture of Experts model. It is available at arxiv.org/abs/2401.04088.
"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" by DeepSeek AI (2025) describes the training of the DeepSeek-R1 reasoning model using GRPO. It is available at arxiv.org/abs/2501.12948.
"Visual Instruction Tuning" by Liu, Li, Wu, and Lee (LLaVA, 2023) describes the LLaVA Vision-Language Model training approach, combining a CLIP vision encoder with a Vicuna language model via a linear projection connector. It is available at arxiv.org/abs/2304.08485.
"Direct Preference Optimization: Your Language Model is Secretly a Reward Model" by Rafailov, Sharma, Mitchell, Ermon, Manning, and Finn (Stanford, 2023) describes the DPO alignment technique. It is available at arxiv.org/abs/2305.18290.
"Training Language Models to Follow Instructions with Human Feedback" by Ouyang et al. (OpenAI, 2022) describes the original InstructGPT RLHF approach using supervised fine-tuning, reward modeling, and PPO. It is available at arxiv.org/abs/2203.02155.
"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" by Dao (2023) describes the Flash Attention 2 algorithm, which achieves up to 2x speedup over the original FlashAttention and up to 10x over standard PyTorch attention. It is available at arxiv.org/abs/2307.08691.
"Scaling Laws for Neural Language Models" by Kaplan et al. (OpenAI, 2020) provides the theoretical foundation for understanding the power-law relationships between model size, dataset size, compute, and language model performance. It is available at arxiv.org/abs/2001.08361.
"Training Compute-Optimal Large Language Models" by Hoffmann et al. (DeepMind, 2022), known as the Chinchilla paper, demonstrated that many large models were significantly undertrained and that model size and training tokens should be scaled equally for compute-optimal training. It is available at arxiv.org/abs/2203.15556.
The Hugging Face documentation at huggingface.co/docs provides practical guides for tokenization, model training with the Transformers library, and fine-tuning with TRL. The Hugging Face blog at huggingface.co/blog contains excellent explanations of Mixture of Experts, Vision-Language Models, and many other topics covered in this tutorial.