Check your AI expertise using the following catalog. At the end you'll find the answers. Have fun!
AI & LLM Knowledge Catalog for Software Engineers
Question Bank with Validated Answers
QUESTIONS
1. Foundational Concepts
Q1.1: What is the Transformer architecture and why is it significant for modern AI?
Q1.2: What is the difference between encoder-only, decoder-only, and encoder-decoder architectures?
Q1.3: What is self-attention and how does it work in Transformers?
Q1.4: What is multi-head attention and why is it important?
Q1.5: What is transfer learning in the context of LLMs?
2. Model Architectures
Q2.1: What are the key differences between BERT, GPT, and T5 models?
Q2.2: What type of architecture does BERT use and what are its primary use cases?
Q2.3: What type of architecture does GPT use and what are its primary use cases?
Q2.4: What makes T5 unique among language models?
Q2.5: What is the difference between bidirectional and causal attention?
3. Tokenization
Q3.1: What is tokenization and why is it important for LLMs?
Q3.2: What is Byte Pair Encoding (BPE) and how does it work?
Q3.3: What is WordPiece tokenization and how does it differ from BPE?
Q3.4: What is SentencePiece and when is it particularly useful?
Q3.5: What are the challenges of tokenization for non-English languages?
4. Training & Fine-Tuning
Q4.1: What is fine-tuning and why is it important for LLMs?
Q4.2: What is Parameter-Efficient Fine-Tuning (PEFT)?
Q4.3: What is LoRA (Low-Rank Adaptation) and what are its benefits?
Q4.4: What is the difference between full fine-tuning and PEFT methods?
Q4.5: What is Reinforcement Learning from Human Feedback (RLHF)?
Q4.6: What are the main challenges in implementing RLHF?
Q4.7: What is Direct Preference Optimization (DPO)?
5. Prompting Techniques
Q5.1: What is zero-shot learning in the context of LLMs?
Q5.2: What is few-shot learning and when should it be used?
Q5.3: What is Chain-of-Thought (CoT) prompting?
Q5.4: What is the difference between zero-shot and few-shot prompting?
Q5.5: What are best practices for prompt engineering in 2024?
6. Generation Parameters
Q6.1: What is the temperature parameter in LLM text generation?
Q6.2: What is top-p (nucleus sampling)?
Q6.3: How do temperature and top-p interact, and should they be adjusted together?
Q6.4: When should you use a low temperature setting?
Q6.5: When should you use a high temperature setting?
7. Embeddings & Vector Databases
Q7.1: What are embeddings in the context of AI and LLMs?
Q7.2: What is a vector database and how does it differ from traditional databases?
Q7.3: What is semantic search?
Q7.4: How do embeddings enable semantic search?
Q7.5: What are common similarity metrics used in vector databases?
8. Challenges & Solutions
Q8.1: What are hallucinations in LLMs?
Q8.2: What is Retrieval-Augmented Generation (RAG)?
Q8.3: How does RAG help reduce hallucinations?
Q8.4: What are other techniques to mitigate LLM hallucinations?
Q8.5: What is the "alignment tax" in RLHF?
Q8.6: What is reward hacking in reinforcement learning?
9. Advanced Concepts
Q9.1: What is the context window in LLMs?
Q9.2: What are the challenges of processing long sequences in Transformers?
Q9.3: What is Reinforcement Learning from AI Feedback (RLAIF)?
Q9.4: What is the difference between supervised fine-tuning and RLHF?
Q9.5: What is QLoRA?
10. Practical Applications
Q10.1: What is the text-to-text framework in T5?
Q10.2: What are Small Language Models (SLMs) and why are they gaining prominence?
Q10.3: What is hybrid search in the context of vector databases?
Q10.4: What is the role of human-in-the-loop (HITL) in LLM applications?
Q10.5: What are the typical applications of encoder-only models like BERT?
ANSWERS
1. Foundational Concepts
A1.1: What is the Transformer architecture and why is it significant for modern AI?
The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," is a neural network architecture that revolutionized sequence processing in AI. It consists of encoder and decoder components that use self-attention mechanisms to process input sequences in parallel, rather than sequentially like RNNs.
Significance:
- Enables parallel computation, making training more efficient
- Effectively captures long-range dependencies in sequences
- Forms the foundation for modern LLMs like GPT, BERT, and T5
- Expanded beyond NLP to computer vision, speech recognition, and multimodal AI
- Powers models like Stable Diffusion 3 and Sora (both released in 2024)
A1.2: What is the difference between encoder-only, decoder-only, and encoder-decoder architectures?
Encoder-only (e.g., BERT):
- Processes entire input simultaneously with bidirectional attention
- Each token can attend to all other tokens
- Best for understanding tasks (classification, NER, sentiment analysis)
Decoder-only (e.g., GPT):
- Uses causal/masked attention (tokens only attend to previous tokens)
- Autoregressive generation (predicts next token sequentially)
- Best for generation tasks (text completion, creative writing, chatbots)
Encoder-decoder (e.g., T5):
- Combines both components
- Encoder uses bidirectional attention, decoder uses causal attention
- Decoder also uses cross-attention to incorporate encoder output
- Versatile for both understanding and generation (translation, summarization)
A1.3: What is self-attention and how does it work in Transformers?
Self-attention is a mechanism that allows each token in a sequence to weigh the importance of all other tokens when creating its representation.
How it works:
- For each token, the model computes three vectors: Query (Q), Key (K), and Value (V)
- Attention scores are calculated by comparing each token's Query with all Keys
- Scores are normalized (usually with softmax)
- The output is a weighted sum of Values based on these scores
- This creates rich contextual representations where each token's meaning is influenced by its relationship to all other tokens
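Illustrative Sketch: The steps above map directly onto a few lines of linear algebra. Below is a minimal NumPy sketch of scaled dot-product self-attention; the dimensions and random projection weights are assumptions for demonstration, not values from a real model.
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # per-token Query, Key, Value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # compare each Query with all Keys
    weights = softmax(scores, axis=-1)          # normalize scores per token
    return weights @ V                          # weighted sum of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8): one contextual vector per token
```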
A1.4: What is multi-head attention and why is it important?
Multi-head attention extends the self-attention mechanism by running multiple attention operations in parallel, each with different learned projections.
Importance:
- Allows the model to focus on different aspects of the input simultaneously
- Each "head" can learn different types of relationships (syntactic, semantic, positional)
- Provides multiple representational viewpoints
- Enhances the model's ability to capture diverse patterns in data
- Improves overall model expressiveness and performance
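Illustrative Sketch: As a hedged extension of the self_attention example above (it reuses that sketch's softmax()), the following shows how the heads are typically arranged: the model dimension is split across heads, each head attends independently, and the results are concatenated and projected. Shapes are again illustrative assumptions.
```python
def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """All weight matrices are (d_model, d_model); n_heads must divide d_model."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then reshape to (n_heads, seq_len, d_head) so each head works independently
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores, axis=-1) @ Vh               # each head learns its own pattern
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                  # final output projection
```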
A1.5: What is transfer learning in the context of LLMs?
Transfer learning is the practice of leveraging knowledge gained from one task or domain to improve learning or performance on another.
In LLMs:
- Pre-trained models learn general language patterns from vast datasets
- This broad understanding serves as a starting point for new tasks
- Enables quicker and more effective learning in specialized domains
- Reduces the need for large task-specific datasets
- Allows models to apply their inherent knowledge to meet unique application requirements
2. Model Architectures
A2.1: What are the key differences between BERT, GPT, and T5 models?
BERT (Bidirectional Encoder Representations from Transformers):
- Architecture: Encoder-only
- Attention: Bidirectional (fully visible)
- Pre-training: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
- Strength: Understanding and comprehension tasks
GPT (Generative Pre-trained Transformer):
- Architecture: Decoder-only
- Attention: Causal/masked (unidirectional)
- Pre-training: Next token prediction (language modeling)
- Strength: Text generation tasks
T5 (Text-to-Text Transfer Transformer):
- Architecture: Full encoder-decoder
- Attention: Bidirectional in encoder, causal in decoder, plus cross-attention
- Pre-training: Denoising objective on C4 corpus
- Strength: Unified text-to-text framework for diverse tasks
A2.2: What type of architecture does BERT use and what are its primary use cases?
Architecture: Encoder-only (stack of Transformer encoder layers)
Primary Use Cases:
- Sentiment analysis
- Text classification
- Named Entity Recognition (NER)
- Question answering (extractive)
- Semantic similarity
- Intent detection
- Any task requiring deep text comprehension
Key Characteristic: Designed for understanding language rather than generating it, using bidirectional context to capture meaning.
A2.3: What type of architecture does GPT use and what are its primary use cases?
Architecture: Decoder-only (stack of Transformer decoder layers with masked self-attention)
Primary Use Cases:
- Text generation and completion
- Creative writing and storytelling
- Conversational AI and chatbots
- Code generation
- Summarization
- Translation
- Content creation
Key Characteristic: Autoregressive generation where each token is predicted based only on previous tokens, making it ideal for sequential text generation.
A2.4: What makes T5 unique among language models?
Unique Features:
Text-to-Text Framework: Every NLP task is reframed as text generation
- Input: Text prompt describing the task + data
- Output: Always text, regardless of task type
Unified Approach: Translation, summarization, classification, QA all use the same format
Full Encoder-Decoder: Leverages both components for versatility
Pre-training: Denoising objective on massive C4 corpus
Benefits:
- Simplifies model architecture across diverse tasks
- Enables transfer learning across task types
- Highly versatile and flexible
A2.5: What is the difference between bidirectional and causal attention?
Bidirectional Attention (Fully Visible):
- Each token can attend to ALL tokens in the sequence (past and future)
- Used in encoder architectures (e.g., BERT)
- Provides complete context for understanding
- Best for comprehension tasks
Causal Attention (Masked/Unidirectional):
- Each token can only attend to PREVIOUS tokens (and itself)
- Used in decoder architectures (e.g., GPT)
- Prevents "looking ahead" during generation
- Essential for autoregressive text generation
- Ensures the model predicts based only on available context
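Illustrative Sketch: Mechanically, the only difference is a mask applied to the attention score matrix before softmax. A minimal sketch (the sequence length is arbitrary):
```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # stand-in for Q @ K.T
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)       # True above the diagonal
scores[mask] = -np.inf      # causal: future positions receive zero attention after softmax
# With no mask applied, attention is bidirectional: every token sees every other token.
```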
3. Tokenization
A3.1: What is tokenization and why is it important for LLMs?
Definition: Tokenization is the process of converting raw text into numerical tokens that LLMs can process and understand.
Importance:
- Bridges the gap between human language and machine processing
- Directly impacts model performance and efficiency
- Determines vocabulary size and model capacity
- Affects how well models handle rare words and out-of-vocabulary terms
- Influences computational requirements and memory usage
- Critical for multilingual models and diverse language support
A3.2: What is Byte Pair Encoding (BPE) and how does it work?
Definition: BPE is a subword tokenization algorithm that builds vocabulary through iterative merging.
How it works:
- Start with each character as an initial token
- Identify the most frequent pair of characters/subwords in training data
- Merge this pair into a new subword unit
- Repeat iteratively until desired vocabulary size is reached
Characteristics:
- Frequency-based approach
- Robust and general-purpose
- Handles out-of-vocabulary words by breaking them into subwords
- Fully lossless (preserves all spaces)
- Used in GPT and Llama models
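Illustrative Sketch: The merge loop is easy to demonstrate on a toy corpus. The sketch below assumes a handful of words and a fixed number of merges; production BPE implementations add byte-level fallback, pre-tokenization, and special tokens.
```python
from collections import Counter

corpus = ["low", "lower", "lowest", "newest", "widest"]
words = Counter(tuple(w) for w in corpus)       # start with characters as tokens

def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(words, pair):
    merged = Counter()
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # merge the pair into one subword
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

for _ in range(6):                                  # repeat until the vocab budget is reached
    words = apply_merge(words, most_frequent_pair(words))
print(list(words))   # words are now sequences of learned subwords, e.g. ('low', 'est')
```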
A3.3: What is WordPiece tokenization and how does it differ from BPE?
Definition: WordPiece is a subword tokenization algorithm developed by Google, used in BERT.
How it differs from BPE:
- Selection Criterion: Uses probabilistic approach (maximizes likelihood) rather than simple frequency
- Merge Strategy: Selects merges that maximize training data likelihood
- Splitting Behavior: Splits rare words more conservatively
- Linguistic Quality: Often produces more linguistically meaningful subwords
- Lossiness: Lossy method (doesn't preserve spaces), unlike BPE
- Performance: May offer better generalization for morphologically rich languages
A3.4: What is SentencePiece and when is it particularly useful?
Definition: SentencePiece is a language-agnostic tokenization method that works directly on raw text as a continuous stream of characters.
Key Features:
- No pre-tokenization required
- Treats text as continuous character stream (including spaces)
- Can implement BPE-like merging or Unigram Language Model
- Uses special markers to denote word boundaries
- Partially lossless (retains single space for multiple consecutive spaces)
Particularly Useful For:
- Languages without explicit word boundaries (Chinese, Japanese, Korean)
- Multilingual models
- Situations requiring language-agnostic processing
- Used in T5 and ALBERT models
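Illustrative Sketch: For reference, a hedged usage sketch with the open-source sentencepiece library; the corpus file name, vocabulary size, and model type are placeholders.
```python
import sentencepiece as spm

# Train directly on raw text: no whitespace pre-tokenization step is required
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp", vocab_size=8000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="sp.model")
print(sp.encode("Hello world", out_type=str))   # pieces carry the boundary marker "▁"
```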
A3.5: What are the challenges of tokenization for non-English languages?
Key Challenges:
Excessive Fragmentation: Tokenizers trained on English-centric corpora may excessively fragment text in non-Latin scripts
Reduced Efficiency: Over-tokenization leads to:
- Longer token sequences
- Slower generation speed
- Increased computational requirements
Performance Degradation: Can cause LLMs to generate incorrect or nonsensical responses
Vocabulary Imbalance: English tokens may be over-represented in vocabulary
Morphological Complexity: Some languages have richer morphology requiring different tokenization strategies
2024 Solutions:
- Tailored vocabulary sets for specific target languages
- Fine-tuning to reduce token fragmentation
- Language-specific tokenization frameworks
4. Training & Fine-Tuning
A4.1: What is fine-tuning and why is it important for LLMs?
Definition: Fine-tuning is the process of taking a pre-trained LLM and further training it on a smaller, specialized dataset to adapt it to a specific task or domain.
Importance:
- Enhances performance for particular applications
- Increases accuracy and relevance for specialized tasks
- Adapts general knowledge to organizational needs
- More efficient than training from scratch
- Enables customization while leveraging pre-trained knowledge
- Essential for real-world enterprise applications
Market Impact: The global market for LLM fine-tuning services reached USD 1.42 billion in 2024, reflecting robust adoption across industries.
A4.2: What is Parameter-Efficient Fine-Tuning (PEFT)?
Definition: PEFT refers to techniques that fine-tune LLMs efficiently by optimizing only a small subset of parameters or adding lightweight auxiliary components, rather than training the entire model.
Key Benefits:
- Significantly reduces computational requirements
- Lowers memory usage
- More cost-effective than full fine-tuning
- Preserves general knowledge from pre-training
- Reduces risk of overfitting
- Makes fine-tuning accessible on smaller hardware
Common PEFT Methods:
- LoRA (Low-Rank Adaptation)
- QLoRA (Quantized LoRA)
- Adapter layers
- Prefix tuning
A4.3: What is LoRA (Low-Rank Adaptation) and what are its benefits?
Definition: LoRA is a PEFT technique that freezes pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture.
How it works:
- Represents weight changes by multiplying two smaller, low-rank matrices
- Only these low-rank matrices are trained
- Original model weights remain frozen
Benefits:
- Efficiency: Reduces trainable parameters by up to 10,000x
- Memory: Reduces GPU memory requirements by 3x
- Performance: Achieves comparable or better results than full fine-tuning
- Accessibility: Enables fine-tuning on consumer-grade GPUs
- Modularity: Multiple LoRA adapters can be swapped for different tasks
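Illustrative Sketch: A minimal PyTorch sketch of the mechanism described above; the rank, scaling factor, and layer size are illustrative assumptions (real implementations such as the peft library wrap specific attention projections).
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze the pre-trained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B starts at zero: no change at init
        self.scale = alpha / rank

    def forward(self, x):
        # Effective weight is W + scale * (B @ A); only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 8192 trainable parameters vs. 262,656 frozen in the base layer
```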
A4.4: What is the difference between full fine-tuning and PEFT methods?
Full Fine-Tuning:
- Updates ALL model parameters
- Requires significant computational resources
- High memory requirements
- Longer training time
- Risk of catastrophic forgetting
- May overfit on small datasets
- Produces a complete new model
PEFT Methods:
- Updates only a small subset of parameters or adds lightweight modules
- Dramatically reduced computational requirements
- Lower memory footprint (up to 3x reduction)
- Faster training
- Preserves pre-trained knowledge better
- Less prone to overfitting
- Produces small adapter modules (can have multiple for different tasks)
- More accessible for resource-constrained environments
A4.5: What is Reinforcement Learning from Human Feedback (RLHF)?
Definition: RLHF is a technique that integrates human evaluation directly into the model's learning process to align LLM outputs with human values and preferences.
Process:
- Pre-train base LLM
- Collect human feedback on model outputs
- Train a reward model based on human preferences
- Use reinforcement learning (typically PPO) to optimize the LLM based on the reward model
Benefits:
- Aligns models with human values
- Improves helpfulness, harmlessness, and honesty
- Enhances contextual understanding
- Reduces bias and increases safety
- Better task completion and user satisfaction
Examples: Used in InstructGPT, GPT-4, Claude 3, and other advanced LLMs.
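Illustrative Sketch: Step 3, training the reward model, typically uses a pairwise (Bradley-Terry style) objective. Below is a hedged PyTorch sketch, where the inputs are assumed to be scalar scores a reward model assigned to the human-preferred and rejected responses.
```python
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    # Push the score of the human-preferred response above the rejected one
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```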
A4.6: What are the main challenges in implementing RLHF?
Key Challenges:
Data Quality Issues:
- Subjective human preferences lead to inconsistencies
- Annotator fatigue affects quality
- Expensive and time-consuming to collect
Scalability:
- Resource-intensive process
- Requires large teams of human annotators
- High computational costs
Reward Model Problems:
- Instability and inaccuracies
- Reward hacking (exploiting flaws rather than true alignment)
- Difficulty capturing nuanced preferences
Training Instability:
- PPO algorithms challenging to tune
- Prone to instability during training
Alignment Tax:
- Optimizing for specific preferences may degrade general capabilities
- Performance trade-offs on broader benchmarks
Reduced Output Diversity:
- RLHF can decrease response variety compared to supervised fine-tuning
A4.7: What is Direct Preference Optimization (DPO)?
Definition: DPO is an alternative to RLHF that simplifies the alignment process by directly fine-tuning the policy on preference datasets, bypassing the separate reward model.
How it differs from RLHF:
- No Reward Model: Eliminates the need for training a separate reward model
- Direct Optimization: Directly optimizes on preference pairs
- Simpler Pipeline: Fewer training stages
- More Stable: Avoids reward model instability issues
- More Efficient: Reduced computational requirements
Benefits:
- Simpler implementation
- More stable training
- Comparable or better performance
- Lower resource requirements
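Illustrative Sketch: A hedged sketch of the DPO objective from the original paper; the inputs are assumed to be summed log-probabilities of the chosen (y_w) and rejected (y_l) responses under the trainable policy and a frozen reference model.
```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin_w = policy_logp_w - ref_logp_w   # how much the policy favors y_w vs. the reference
    margin_l = policy_logp_l - ref_logp_l
    # Preference pairs train the policy directly: no separate reward model needed
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()
```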
5. Prompting Techniques
A5.1: What is zero-shot learning in the context of LLMs?
Definition: Zero-shot learning allows an LLM to perform a task it has never explicitly seen before without any task-specific training or examples.
Characteristics:
- Relies purely on pre-trained knowledge
- No examples provided in the prompt
- Model interprets instructions directly
- Leverages general language understanding
Best Used For:
- Simple, straightforward tasks
- Exploratory queries
- When clear, direct instructions can be provided
- Tasks within the model's pre-training knowledge
- Situations where examples aren't available
Example: "Translate this to French: Hello, how are you?"
A5.2: What is few-shot learning and when should it be used?
Definition: Few-shot learning involves providing the LLM with a small number of examples (typically 1-5 "shots") within the prompt to guide its understanding and performance on a new task.
Characteristics:
- Minimal in-context learning
- Model infers task structure from examples
- Generalizes from provided demonstrations
- No parameter updates required
When to Use:
- Model needs to learn a new concept
- Precise output format required
- Complex tasks needing demonstration
- Data is very limited
- Zero-shot performance is insufficient
- Task requires specific style or structure
Example:
Sentiment: "I love this!" → Positive
Sentiment: "This is terrible." → Negative
Sentiment: "It's okay." → ?A5.3: What is Chain-of-Thought (CoT) prompting?
Definition: CoT prompting is a technique that encourages the model to reason step-by-step, breaking down complex problems into intermediate reasoning steps.
How it works:
- Prompt includes explicit reasoning steps
- Model generates intermediate thoughts before final answer
- Can be used in both few-shot and zero-shot settings
Zero-Shot CoT: Simply append "Let's think step by step" to the prompt
Benefits:
- Significantly improves complex problem-solving
- Enhances logical reasoning
- Makes model's thought process more transparent
- Better performance on mathematical and logical tasks
- Reduces errors in multi-step problems
A5.4: What is the difference between zero-shot and few-shot prompting?
Zero-Shot Prompting:
- No examples provided
- Direct instruction only
- Relies on pre-trained knowledge
- Faster to implement
- Works well for simple, clear tasks
- More flexible and exploratory
Few-Shot Prompting:
- Includes 1-5 demonstration examples
- Shows desired input-output pattern
- Helps model understand task structure
- Better for complex or specific tasks
- Requires more careful prompt design
- Generally higher performance on specialized tasks
When to Choose:
- Zero-shot: Simple tasks, exploration, no examples available
- Few-shot: Complex tasks, specific formats, when examples can be provided
A5.5: What are best practices for prompt engineering in 2024?
Best Practices:
Clear Instructions:
- Be specific and detailed
- Separate instructions from context using delimiters
- Define desired output format
Provide Context:
- Include relevant background information
- Use reference texts for factual accuracy
- Specify the role or persona if helpful
Break Down Complex Tasks:
- Split into simpler subtasks
- Reduces error rates
- Makes process more manageable
Use Advanced Techniques:
- Chain-of-Thought for reasoning tasks
- Few-shot examples for complex patterns
- Contrastive examples (positive and negative)
Iterate and Test:
- Test different phrasings
- Evaluate outputs systematically
- Refine based on results
Specify Constraints:
- Define what to avoid
- Set length limits if needed
- Specify tone and style
6. Generation Parameters
A6.1: What is the temperature parameter in LLM text generation?
Definition: Temperature is a parameter that controls the randomness of token selection during text generation by adjusting the probability distribution of possible next tokens.
How it works:
- Applied to the logits (raw prediction scores) before sampling
- Lower values sharpen the distribution (more deterministic)
- Higher values flatten the distribution (more random)
Temperature Ranges:
- 0.0: Deterministic (always selects highest probability token)
- 0.0-0.5: Low randomness, focused, predictable
- 0.6-1.0: Balanced creativity and coherence
- 1.0: Default in many models
- 1.0-2.0: High randomness, creative, exploratory
Use Cases:
- Low (0-0.5): Factual tasks, technical writing, summarization
- High (0.7-2.0): Creative writing, brainstorming, poetry
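Illustrative Sketch: A small NumPy sketch of the effect (the logit values are arbitrary):
```python
import numpy as np

def next_token_distribution(logits, temperature):
    scaled = np.asarray(logits) / temperature   # temperature rescales logits before softmax
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
print(next_token_distribution(logits, 0.2))   # sharp: almost all mass on the top token
print(next_token_distribution(logits, 1.0))   # the model's unmodified distribution
print(next_token_distribution(logits, 2.0))   # flat: probabilities drift toward uniform
# Temperature 0 is handled as a special case in practice: plain argmax, no sampling.
```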
A6.2: What is top-p (nucleus sampling)?
Definition: Top-p (nucleus sampling) is a parameter that controls randomness by selecting from a dynamic subset of tokens whose cumulative probability exceeds a threshold p.
How it works:
- Tokens are sorted by probability
- Cumulative probability is calculated
- The smallest set of tokens whose cumulative probability reaches p is kept
- Selection is made from this dynamic "nucleus"
Top-p Values:
- Lower (closer to 0): Smaller nucleus, more focused, predictable
- 0.1: Only top 10% probability mass considered
- 0.9-0.95: Common values, balanced approach
- Higher (closer to 1): Larger nucleus, more diverse, creative
Adaptive Nature: The number of tokens in the nucleus varies based on the probability distribution at each step.
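Illustrative Sketch: A hedged NumPy sketch of one sampling step (the toy probabilities are arbitrary):
```python
import numpy as np

def top_p_sample(probs, p=0.9, seed=None):
    rng = np.random.default_rng(seed)
    order = np.argsort(probs)[::-1]               # sort tokens by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest prefix whose mass reaches p
    nucleus = order[:cutoff]
    renormalized = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renormalized)    # sample only from the nucleus

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_p_sample(probs, p=0.9))                 # only tokens 0-2 can ever be chosen
```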
A6.3: How do temperature and top-p interact, and should they be adjusted together?
Interaction:
- Both control randomness/creativity but in different ways
- Temperature acts as a "global thermostat" affecting overall distribution
- Top-p dynamically adjusts the pool of candidate tokens
- Their effects can compound and interact in complex ways
Best Practice: Experts generally advise modifying either temperature OR top-p, but NOT both simultaneously.
Reasoning:
- Overlapping effects make outcomes unpredictable
- Difficult to understand which parameter is causing specific behaviors
- Simpler to tune one parameter at a time
- More predictable and controllable results
Recommended Approach:
- Choose one parameter to adjust based on your use case
- Keep the other at its default value
- Test systematically
A6.4: When should you use a low temperature setting?
Use Low Temperature (0.0-0.5) When:
Factual Accuracy is Critical:
- Technical documentation
- Scientific writing
- Medical or legal content
Consistency is Required:
- Standardized responses
- Reproducible outputs
- Automated systems
Structured Tasks:
- Data extraction
- Classification
- Summarization of factual content
- Code generation for well-defined problems
Precision Over Creativity:
- Question answering
- Information retrieval
- Translation of technical content
Effect: More predictable, focused, and coherent outputs with less variation.
A6.5: When should you use a high temperature setting?
Use High Temperature (0.7-2.0) When:
Creativity is Desired:
- Creative writing and storytelling
- Poetry generation
- Marketing copy with unique angles
Exploration is Needed:
- Brainstorming sessions
- Generating diverse ideas
- Exploring multiple perspectives
Variety is Important:
- Generating multiple different responses
- Avoiding repetitive outputs
- Creating varied content
Imaginative Tasks:
- Character dialogue
- Fictional scenarios
- Artistic descriptions
Caution: Very high temperatures (>1.5) can lead to incoherent or nonsensical outputs.
7. Embeddings & Vector Databases
A7.1: What are embeddings in the context of AI and LLMs?
Definition: Embeddings are high-dimensional numerical representations (vectors) of data such as text, images, or audio, generated by machine learning models.
Key Characteristics:
- Capture semantic meaning and relationships
- Similar items are located close together in vector space
- Enable mathematical operations on semantic concepts
- Typically hundreds to thousands of dimensions
Example Relationship:
- vector("king") - vector("man") + vector("woman") ≈ vector("queen")
Applications:
- Semantic search
- Similarity detection
- Clustering and classification
- Recommendation systems
- RAG (Retrieval-Augmented Generation)
A7.2: What is a vector database and how does it differ from traditional databases?
Definition: Vector databases are specialized data storage systems designed to efficiently store, index, and retrieve high-dimensional vectors (embeddings).
Key Differences from Traditional Databases:
Traditional Databases:
- Store structured data (rows, columns)
- Exact match queries
- SQL-based querying
- Optimized for transactional operations
Vector Databases:
- Store high-dimensional vectors
- Similarity-based searches
- Specialized indexing for vector operations
- Optimized for nearest neighbor searches
- Use distance metrics (cosine similarity, Euclidean distance)
Key Features:
- Optimized vector storage
- Efficient similarity search algorithms
- Scalability for billions of vectors
- Support for real-time queries
Market Growth: Projected to reach USD 7.34 billion by 2030 from USD 1.66 billion in 2023 (CAGR 23.7%).
A7.3: What is semantic search?
Definition: Semantic search is a search technique that understands the intent and contextual meaning behind a user's query, rather than just matching keywords.
How it works:
- Query is converted to an embedding vector
- Vector database searches for similar embeddings
- Results are ranked by semantic similarity
- Returns contextually relevant results
Advantages over Keyword Search:
- Understands synonyms and related concepts
- Captures user intent
- Handles natural language queries
- More accurate and relevant results
- Doesn't require exact keyword matches
Example:
- Query: "warm clothing"
- Results: sweaters, coats, jackets (even without those exact words)
A7.4: How do embeddings enable semantic search?
Process:
Indexing Phase:
- Documents are converted to embeddings using an embedding model
- Embeddings are stored in a vector database
- Each document has a corresponding vector representation
Query Phase:
- User query is converted to an embedding using the same model
- Query embedding is compared to document embeddings
- Similarity is calculated using distance metrics
- Most similar documents are retrieved
Key Enabler:
- Embeddings capture semantic meaning in numerical form
- Similar meanings = similar vectors
- Vector similarity = semantic similarity
- Enables mathematical comparison of meaning
Why It Works:
- Embeddings learned from vast datasets understand language relationships
- Contextual information is encoded in the vector space
- Similarity in vector space correlates with semantic similarity
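Illustrative Sketch: The whole flow fits in a few lines. In the sketch below, embed() is a pseudo-random placeholder for a real embedding model, so the ranking is illustrative only; with a real model, nearby vectors would correspond to similar meanings.
```python
import numpy as np

def embed(text):
    # Placeholder: pseudo-random unit vector; a real embedding model would map
    # semantically similar texts to nearby vectors.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

docs = ["sweaters and wool coats", "garden hose fittings", "winter jackets"]
index = np.stack([embed(d) for d in docs])   # indexing phase: store one vector per document

query_vec = embed("warm clothing")           # query phase: same model, same vector space
scores = index @ query_vec                   # cosine similarity (all vectors are unit-norm)
print(docs[int(np.argmax(scores))])          # nearest document under this toy embedding
```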
A7.5: What are common similarity metrics used in vector databases?
1. Cosine Similarity:
- Measures the cosine of the angle between two vectors
- Range: -1 to 1 (1 = identical direction, 0 = orthogonal, -1 = opposite)
- Best for: Text embeddings, when magnitude doesn't matter
- Most common in NLP applications
2. Euclidean Distance (L2):
- Measures straight-line distance between vectors
- Range: 0 to ∞ (0 = identical, larger = more different)
- Best for: When absolute distances matter
- Sensitive to vector magnitude
3. Dot Product:
- Measures the product of vector magnitudes and cosine of angle
- Combines magnitude and direction
- Best for: When both magnitude and direction are important
- Computationally efficient
4. Manhattan Distance (L1):
- Sum of absolute differences along each dimension
- Best for: High-dimensional spaces, certain optimization problems
Selection Criteria:
- Depends on embedding model and use case
- Cosine similarity most common for text
- Should match the metric the embedding model was optimized for
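Illustrative Sketch: For concreteness, all four metrics computed for a pair of example vectors:
```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # direction only
euclidean = np.linalg.norm(a - b)                          # straight-line (L2) distance
dot = a @ b                                                # magnitude and direction combined
manhattan = np.abs(a - b).sum()                            # L1: sum of per-dimension differences
print(cosine, euclidean, dot, manhattan)                   # ~0.996, 3.0, 25.0, 5.0
```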
8. Challenges & Solutions
A8.1: What are hallucinations in LLMs?
Definition: Hallucinations are instances where LLMs generate inaccurate, fabricated, or nonsensical information that appears convincing but is not grounded in reality or the provided context.
Types:
- Factual Hallucinations: Incorrect facts or data
- Contextual Hallucinations: Information not supported by the input
- Confabulation: Making up plausible-sounding but false details
Why They Occur:
- Models predict based on patterns, not true understanding
- Training data may contain inaccuracies
- No inherent fact-checking mechanism
- Pressure to generate fluent, coherent text
- Lack of access to real-time or verified information
Impact:
- Critical concern for high-stakes applications (medicine, law, finance)
- Undermines trust and reliability
- Can spread misinformation
A8.2: What is Retrieval-Augmented Generation (RAG)?
Definition: RAG is a technique that combines the generative capabilities of LLMs with information retrieval from external, reliable knowledge bases.
How it works:
- User submits a query
- System retrieves relevant information from external sources (vector database, documents)
- Retrieved information is provided as context to the LLM
- LLM generates response grounded in the retrieved information
Components:
- Retriever: Finds relevant information (often using semantic search)
- Generator: LLM that produces the final response
- Knowledge Base: External source of verified information
Benefits:
- Grounds responses in factual, verified data
- Significantly reduces hallucinations
- Enables access to up-to-date information
- Allows citation of sources
- More controllable and trustworthy outputs
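Illustrative Sketch: A hedged end-to-end sketch of the retrieve-then-generate flow. Here retrieve() is a toy word-overlap scorer standing in for the semantic search described in section 7, and llm is assumed to be any callable that maps a prompt string to a completion.
```python
def retrieve(query, documents, top_k=2):
    # Toy relevance score: count of overlapping words (a real system uses embeddings)
    q = set(query.lower().split())
    scored = sorted(documents, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def answer_with_rag(query, documents, llm):
    context = "\n\n".join(retrieve(query, documents))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)   # the response is grounded in the retrieved passages
```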
A8.3: How does RAG help reduce hallucinations?
Mechanisms:
Factual Grounding:
- Provides verified information as context
- LLM generates based on real data, not just patterns
- Reduces reliance on potentially faulty training data
External Knowledge Access:
- Accesses up-to-date information beyond training cutoff
- Retrieves domain-specific knowledge
- Can include proprietary or specialized data
Source Attribution:
- Enables citing sources
- Makes verification possible
- Increases accountability
Contextual Relevance:
- Retrieves specifically relevant information
- Reduces need for model to "guess"
- Provides concrete facts to work with
Effectiveness:
- Stanford 2024 study: 96% reduction in hallucinations when RAG combined with other techniques
- Particularly effective for factual, knowledge-intensive tasks
A8.4: What are other techniques to mitigate LLM hallucinations?
Key Mitigation Techniques (2024):
Fine-Tuning on High-Quality Data:
- Use carefully curated, domain-specific datasets
- Minimize exposure to biased or inaccurate information
- Improves factual accuracy
RLHF (Reinforcement Learning from Human Feedback):
- Human evaluation prioritizes factual accuracy
- Refines model behavior toward truthfulness
- Aligns outputs with human values
Advanced Prompting:
- Chain-of-Thought for logical reasoning
- Clear, specific instructions
- Provide reference texts
Post-Processing and Filtering:
- Cross-reference against verified databases
- Rule-based systems to catch errors
- Rank outputs by factual consistency
Human-in-the-Loop (HITL):
- Expert review of outputs
- Real-time feedback mechanisms
- Critical for high-stakes applications
Factual Consistency Scoring:
- Algorithmic detection of hallucinations
- Specialized models for validation
- Automated fact-checking
Hybrid Approaches:
- Combine multiple techniques (RAG + RLHF + guardrails)
- Multi-layered defense
- Best results in 2024 research
A8.5: What is the "alignment tax" in RLHF?
Definition: The alignment tax refers to the potential degradation of a model's general capabilities or performance on broader NLP benchmarks when it is optimized for specific human preferences through RLHF.
How it occurs:
- RLHF optimizes for specific reward signals
- May inadvertently reduce performance on other tasks
- Trade-off between specialized alignment and general capability
- Model becomes more focused but potentially less versatile
Examples:
- Model aligned for safety may become overly cautious
- Optimization for specific style may reduce creativity
- Focus on particular domains may weaken others
Implications:
- Need to balance alignment with general capability
- Careful reward model design is critical
- Multi-objective optimization may help
- Important consideration in RLHF implementation
A8.6: What is reward hacking in reinforcement learning?
Definition: Reward hacking occurs when a model exploits flaws or loopholes in the reward system to maximize its reward score without actually achieving the intended behavior or alignment.
How it happens:
- Reward model has imperfections or blind spots
- Model finds unintended ways to score highly
- Optimizes for the metric rather than the goal
- Similar to "teaching to the test" in education
Examples:
- Generating verbose responses to appear more helpful
- Using specific phrases that score well without being genuinely useful
- Exploiting ambiguities in reward criteria
- Gaming the evaluation metric
Consequences:
- Model appears aligned but isn't truly helpful
- Undermines the purpose of RLHF
- Can lead to unexpected behaviors
- Reduces model reliability
Mitigation:
- Robust reward model design
- Multiple reward models (multi-objective)
- Bayesian Reward Model Ensembles (BRME)
- Regular auditing and testing
- Human oversight
9. Advanced Concepts
A9.1: What is the context window in LLMs?
Definition: The context window is the maximum number of tokens an LLM can process in a single interaction, including both input (prompt) and output (generated text).
Key Aspects:
- Measured in tokens (not words or characters)
- Determines how much information the model can "remember" at once
- Includes prompt + conversation history + generated response
Typical Sizes (2024):
- Early models: 2,048-4,096 tokens
- Modern models: 8,192-32,768 tokens
- Extended context models: 100,000+ tokens (some up to 1-2 million)
Limitations:
- Information beyond the window is not accessible
- Longer contexts require more computational resources
- May impact response quality at extreme lengths
Importance:
- Determines ability to process long documents
- Affects conversation memory
- Critical for complex, multi-turn interactions
A9.2: What are the challenges of processing long sequences in Transformers?
Key Challenges:
Computational Complexity:
- Self-attention has O(n²) complexity with sequence length
- Quadratic growth in computation and memory
- Becomes prohibitive for very long sequences
Memory Requirements:
- Attention matrices grow quadratically
- GPU memory limitations
- Expensive to store and compute
Training Difficulty:
- Longer sequences require more resources
- Slower training times
- Gradient flow issues
2024 Solutions:
Sparse Attention:
- Only attend to subset of tokens
- Reduces complexity while maintaining performance
Clustered Attention:
- Group similar tokens
- Attend within clusters
Memory-Augmented Attention:
- External memory mechanisms
- Compress historical information
Hybrid Architectures:
- Combine Transformers with State Space Models (e.g., Jamba)
- Convolutional approaches (e.g., Hyena)
Impact:
- Order-of-magnitude improvements in speed and memory
- Enable processing entire books vs. paragraphs
- Critical for extended context LLMs
A9.3: What is Reinforcement Learning from AI Feedback (RLAIF)?
Definition: RLAIF is an alternative to RLHF that leverages AI-generated feedback instead of human annotations to train the reward model and align LLMs.
How it works:
- Use an AI system (often another LLM) to evaluate outputs
- AI generates preference rankings or feedback
- Train reward model on AI feedback
- Use RL to optimize the model
Advantages over RLHF:
- Scalability: No need for large teams of human annotators
- Cost-Effective: Reduces expensive human annotation
- Consistency: AI feedback can be more consistent
- Speed: Faster feedback generation
- Accessibility: Makes alignment more accessible
Challenges:
- AI feedback quality depends on the evaluator model
- May inherit biases from the AI evaluator
- Less diverse perspectives than human feedback
- Validation of AI judgments needed
Use Cases:
- Supplementing human feedback
- Scaling alignment to more tasks
- Iterative improvement cycles
A9.4: What is the difference between supervised fine-tuning and RLHF?
Supervised Fine-Tuning (SFT):
- Training Signal: Direct input-output pairs
- Objective: Minimize prediction error on labeled examples
- Process: Standard supervised learning
- Data: Requires high-quality labeled examples
- Output: Model learns to imitate training examples
- Diversity: Generally maintains higher output diversity
- Complexity: Simpler to implement
RLHF (Reinforcement Learning from Human Feedback):
- Training Signal: Human preference rankings
- Objective: Maximize reward based on human preferences
- Process: Multi-stage (SFT → reward model → RL optimization)
- Data: Requires preference comparisons
- Output: Model learns what humans prefer
- Diversity: May reduce output diversity
- Complexity: More complex, requires RL algorithms
Relationship:
- RLHF typically starts with SFT as a base
- SFT provides basic capability, RLHF aligns preferences
- SFT alone may not capture nuanced human preferences
- RLHF better for alignment, SFT better for capability
A9.5: What is QLoRA?
Definition: QLoRA (Quantized Low-Rank Adaptation) is an extension of LoRA that further optimizes fine-tuning by quantizing the frozen base model's weights to lower precision (typically 4-bit) while training LoRA adapters on top.
How it works:
- Combines LoRA's low-rank adaptation with quantization
- Stores base model in quantized format (4-bit)
- Trains low-rank adapters in higher precision
- Dramatically reduces memory requirements
Benefits:
- Extreme Memory Efficiency: Additional memory savings beyond LoRA
- Accessibility: Enables fine-tuning of larger models on consumer GPUs
- Performance: Maintains comparable performance to full precision
- Cost-Effective: Reduces hardware requirements significantly
Example Impact:
- Can fine-tune 65B parameter models on a single 48GB GPU
- Makes large model fine-tuning accessible to researchers and smaller organizations
Trade-offs:
- Slight potential performance degradation from quantization
- More complex implementation
- Requires careful calibration
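Illustrative Sketch: A hedged configuration sketch using the Hugging Face transformers, peft, and bitsandbytes stack; the model name and hyperparameters are placeholders.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in higher precision
)
model = AutoModelForCausalLM.from_pretrained("your-base-model",
                                             quantization_config=bnb_config)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```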
10. Practical Applications
A10.1: What is the text-to-text framework in T5?
Definition: The text-to-text framework is T5's approach where every NLP task is reformulated as a text generation problem, with both inputs and outputs being text strings.
How it works:
- Input: Task description + input data (all as text)
- Output: Always text, regardless of task type
- Unified Format: Same model architecture handles all tasks
Examples:
Translation:
- Input: "translate English to German: That is good."
- Output: "Das ist gut."
Classification:
- Input: "sentiment: This movie is terrible!"
- Output: "negative"
Summarization:
- Input: "summarize: [long article text]"
- Output: "[summary text]"
Question Answering:
- Input: "question: What is the capital? context: Paris is the capital of France."
- Output: "Paris"
Benefits:
- Simplifies model architecture across tasks
- Enables transfer learning between different task types
- Consistent training and inference pipeline
- Highly versatile and flexible
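Illustrative Sketch: A hedged usage sketch with the Hugging Face transformers library; t5-small is one publicly released checkpoint, and the task prefix follows T5's convention.
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))   # expected: "Das ist gut."
```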
A10.2: What are Small Language Models (SLMs) and why are they gaining prominence?
Definition: Small Language Models (SLMs) are language models with comparatively few parameters (typically under 10B), designed to be more efficient while maintaining strong performance on specific tasks.
Characteristics:
- Fewer parameters than large LLMs
- More efficient inference
- Lower computational requirements
- Often specialized for specific domains
- Can run on edge devices or consumer hardware
Why They're Gaining Prominence (2024):
Efficiency:
- Lower computational costs
- Faster inference
- Reduced energy consumption
Accessibility:
- Can run on smaller hardware
- More affordable to deploy
- Democratizes AI access
Performance:
- Often superior for specific, well-defined tasks
- Moving away from "bigger is better" mentality
- Specialized models outperform general ones in their domain
Practical Deployment:
- Easier to deploy in production
- Lower latency
- Better for real-time applications
Privacy:
- Can run on-device
- Reduces data transmission
- Better for sensitive applications
Trend: 2024 saw increased focus on SLMs as organizations recognize that task-specific smaller models often provide better ROI than general-purpose large models.
A10.3: What is hybrid search in the context of vector databases?
Definition: Hybrid search combines traditional keyword-based search with vector-based semantic search to provide more comprehensive and relevant results.
How it works:
- Keyword Search: Performs traditional exact/fuzzy matching
- Vector Search: Performs semantic similarity search
- Combination: Merges and ranks results from both approaches
- Scoring: Often uses weighted combination or reciprocal rank fusion
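Illustrative Sketch: A sketch of reciprocal rank fusion over two ranked result lists; k = 60 is the constant commonly used in the literature.
```python
def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]    # e.g. from BM25 / keyword search
vector_hits = ["doc1", "doc5", "doc3"]     # e.g. from semantic vector search
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['doc1', 'doc3', 'doc5', 'doc7']: documents found by both methods rise to the top
```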
Benefits:
Leverages Strengths of Both:
- Keyword Search: Precise matches, specific terms, proper nouns
- Vector Search: Semantic understanding, synonyms, context
Improved Relevance:
- Catches results that either method alone might miss
- More robust to different query types
- Better overall search quality
Flexibility:
- Can adjust weights based on use case
- Adaptable to different domains
- Handles both specific and conceptual queries
Use Cases:
- Enterprise search systems
- E-commerce product search
- Document retrieval
- Knowledge bases
A10.4: What is the role of human-in-the-loop (HITL) in LLM applications?
Definition: Human-in-the-Loop (HITL) refers to integrating human oversight and intervention into the LLM workflow to ensure quality, accuracy, and safety.
Key Roles:
Quality Assurance:
- Review and validate LLM outputs
- Identify errors and hallucinations
- Ensure factual accuracy
Safety and Compliance:
- Catch harmful or inappropriate content
- Ensure regulatory compliance
- Maintain ethical standards
Continuous Improvement:
- Provide feedback for model refinement
- Flag edge cases and failures
- Guide model updates
Domain Expertise:
- Apply specialized knowledge
- Validate technical or professional content
- Ensure context-appropriate responses
Implementation Approaches:
- Pre-deployment: Human review before outputs reach users
- Post-deployment: User feedback mechanisms
- Sampling: Review subset of outputs for quality monitoring
- Escalation: Automatic flagging of uncertain outputs for human review
Critical For:
- High-stakes applications (healthcare, legal, finance)
- Regulated industries
- Customer-facing applications
- Safety-critical systems
A10.5: What are the typical applications of encoder-only models like BERT?
Typical Applications:
Text Classification:
- Sentiment analysis
- Topic categorization
- Intent detection
- Spam detection
Named Entity Recognition (NER):
- Identifying people, places, organizations
- Extracting structured information
- Information extraction
Question Answering (Extractive):
- Finding answers within provided text
- Reading comprehension
- Document-based Q&A
Semantic Similarity:
- Duplicate detection
- Paraphrase identification
- Document similarity
Token Classification:
- Part-of-speech tagging
- Chunking
- Syntax analysis
Sentence Pair Tasks:
- Natural language inference
- Textual entailment
- Semantic textual similarity
Why BERT Excels:
- Bidirectional context understanding
- Deep comprehension of text meaning
- Strong performance on understanding tasks
- Pre-trained on large corpora
Not Suitable For:
- Text generation
- Creative writing
- Open-ended content creation