Check your AI expertise using the following catalog. At the end you'll find the answers. Have fun!
AI & LLM Knowledge Catalog for Software Engineers
Question Bank with Validated Answers
QUESTIONS
1. Foundational Concepts
Q1.1: What is the Transformer architecture and why is it significant for modern AI?
Q1.2: What is the difference between encoder-only, decoder-only, and encoder-decoder architectures?
Q1.3: What is self-attention and how does it work in Transformers?
Q1.4: What is multi-head attention and why is it important?
Q1.5: What is transfer learning in the context of LLMs?
2. Model Architectures
Q2.1: What are the key differences between BERT, GPT, and T5 models?
Q2.2: What type of architecture does BERT use and what are its primary use cases?
Q2.3: What type of architecture does GPT use and what are its primary use cases?
Q2.4: What makes T5 unique among language models?
Q2.5: What is the difference between bidirectional and causal attention?
3. Tokenization
Q3.1: What is tokenization and why is it important for LLMs?
Q3.2: What is Byte Pair Encoding (BPE) and how does it work?
Q3.3: What is WordPiece tokenization and how does it differ from BPE?
Q3.4: What is SentencePiece and when is it particularly useful?
Q3.5: What are the challenges of tokenization for non-English languages?
4. Training & Fine-Tuning
Q4.1: What is fine-tuning and why is it important for LLMs?
Q4.2: What is Parameter-Efficient Fine-Tuning (PEFT)?
Q4.3: What is LoRA (Low-Rank Adaptation) and what are its benefits?
Q4.4: What is the difference between full fine-tuning and PEFT methods?
Q4.5: What is Reinforcement Learning from Human Feedback (RLHF)?
Q4.6: What are the main challenges in implementing RLHF?
Q4.7: What is Direct Preference Optimization (DPO)?
5. Prompting Techniques
Q5.1: What is zero-shot learning in the context of LLMs?
Q5.2: What is few-shot learning and when should it be used?
Q5.3: What is Chain-of-Thought (CoT) prompting?
Q5.4: What is the difference between zero-shot and few-shot prompting?
Q5.5: What are best practices for prompt engineering in 2024?
6. Generation Parameters
Q6.1: What is the temperature parameter in LLM text generation?
Q6.2: What is top-p (nucleus sampling)?
Q6.3: How do temperature and top-p interact, and should they be adjusted together?
Q6.4: When should you use a low temperature setting?
Q6.5: When should you use a high temperature setting?
7. Embeddings & Vector Databases
Q7.1: What are embeddings in the context of AI and LLMs?
Q7.2: What is a vector database and how does it differ from traditional databases?
Q7.3: What is semantic search?
Q7.4: How do embeddings enable semantic search?
Q7.5: What are common similarity metrics used in vector databases?
8. Challenges & Solutions
Q8.1: What are hallucinations in LLMs?
Q8.2: What is Retrieval-Augmented Generation (RAG)?
Q8.3: How does RAG help reduce hallucinations?
Q8.4: What are other techniques to mitigate LLM hallucinations?
Q8.5: What is the "alignment tax" in RLHF?
Q8.6: What is reward hacking in reinforcement learning?
9. Advanced Concepts
Q9.1: What is the context window in LLMs?
Q9.2: What are the challenges of processing long sequences in Transformers?
Q9.3: What is Reinforcement Learning from AI Feedback (RLAIF)?
Q9.4: What is the difference between supervised fine-tuning and RLHF?
Q9.5: What is QLoRA?
10. Practical Applications
Q10.1: What is the text-to-text framework in T5?
Q10.2: What are Small Language Models (SLMs) and why are they gaining prominence?
Q10.3: What is hybrid search in the context of vector databases?
Q10.4: What is the role of human-in-the-loop (HITL) in LLM applications?
Q10.5: What are the typical applications of encoder-only models like BERT?
ANSWERS
1. Foundational Concepts
A1.1: What is the Transformer architecture and why is it significant for modern AI?
The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," is a neural network architecture that revolutionized sequence processing in AI. It consists of encoder and decoder components that use self-attention mechanisms to process input sequences in parallel, rather than sequentially like RNNs.
Significance:
- Enables parallel computation, making training more efficient
- Effectively captures long-range dependencies in sequences
- Forms the foundation for modern LLMs like GPT, BERT, and T5
- Expanded beyond NLP to computer vision, speech recognition, and multimodal AI
- Powers models like Stable Diffusion 3 and Sora (both released in 2024)
A1.2: What is the difference between encoder-only, decoder-only, and encoder-decoder architectures?
Encoder-only (e.g., BERT):
- Processes entire input simultaneously with bidirectional attention
- Each token can attend to all other tokens
- Best for understanding tasks (classification, NER, sentiment analysis)
Decoder-only (e.g., GPT):
- Uses causal/masked attention (tokens only attend to previous tokens)
- Autoregressive generation (predicts next token sequentially)
- Best for generation tasks (text completion, creative writing, chatbots)
Encoder-decoder (e.g., T5):
- Combines both components
- Encoder uses bidirectional attention, decoder uses causal attention
- Decoder also uses cross-attention to incorporate encoder output
- Versatile for both understanding and generation (translation, summarization)
A1.3: What is self-attention and how does it work in Transformers?
Self-attention is a mechanism that allows each token in a sequence to weigh the importance of all other tokens when creating its representation.
How it works:
- For each token, the model computes three vectors: Query (Q), Key (K), and Value (V)
- Attention scores are calculated by comparing each token's Query with all Keys
- Scores are normalized (usually with softmax)
- The output is a weighted sum of Values based on these scores
- This creates rich contextual representations where each token's meaning is influenced by its relationship to all other tokens
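Illustrative Sketch: The steps above map directly onto a few lines of linear algebra. Below is a minimal NumPy sketch of scaled dot-product self-attention; the dimensions and random projection weights are assumptions for demonstration, not values from a real model.
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # per-token Query, Key, Value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # compare each Query with all Keys
    weights = softmax(scores, axis=-1)          # normalize scores per token
    return weights @ V                          # weighted sum of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8): one contextual vector per token
```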
A1.4: What is multi-head attention and why is it important?
Multi-head attention extends the self-attention mechanism by running multiple attention operations in parallel, each with different learned projections.
Importance:
- Allows the model to focus on different aspects of the input simultaneously
- Each "head" can learn different types of relationships (syntactic, semantic, positional)
- Provides multiple representational viewpoints
- Enhances the model's ability to capture diverse patterns in data
- Improves overall model expressiveness and performance
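Illustrative Sketch: As a hedged extension of the self_attention example above (it reuses that sketch's softmax()), the following shows how the heads are typically arranged: the model dimension is split across heads, each head attends independently, and the results are concatenated and projected. Shapes are again illustrative assumptions.
```python
def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """All weight matrices are (d_model, d_model); n_heads must divide d_model."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then reshape to (n_heads, seq_len, d_head) so each head works independently
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores, axis=-1) @ Vh               # each head learns its own pattern
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                  # final output projection
```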
A1.5: What is transfer learning in the context of LLMs?
Transfer learning is the practice of leveraging knowledge gained from one task or domain to improve learning or performance on another.
In LLMs:
- Pre-trained models learn general language patterns from vast datasets
- This broad understanding serves as a starting point for new tasks
- Enables quicker and more effective learning in specialized domains
- Reduces the need for large task-specific datasets
- Allows models to apply their inherent knowledge to meet unique application requirements
2. Model Architectures
A2.1: What are the key differences between BERT, GPT, and T5 models?
BERT (Bidirectional Encoder Representations from Transformers):
- Architecture: Encoder-only
- Attention: Bidirectional (fully visible)
- Pre-training: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
- Strength: Understanding and comprehension tasks
GPT (Generative Pre-trained Transformer):
- Architecture: Decoder-only
- Attention: Causal/masked (unidirectional)
- Pre-training: Next token prediction (language modeling)
- Strength: Text generation tasks
T5 (Text-to-Text Transfer Transformer):
- Architecture: Full encoder-decoder
- Attention: Bidirectional in encoder, causal in decoder, plus cross-attention
- Pre-training: Denoising objective on C4 corpus
- Strength: Unified text-to-text framework for diverse tasks
A2.2: What type of architecture does BERT use and what are its primary use cases?
Architecture: Encoder-only (stack of Transformer encoder layers)
Primary Use Cases:
- Sentiment analysis
- Text classification
- Named Entity Recognition (NER)
- Question answering (extractive)
- Semantic similarity
- Intent detection
- Any task requiring deep text comprehension
Key Characteristic: Designed for understanding language rather than generating it, using bidirectional context to capture meaning.
A2.3: What type of architecture does GPT use and what are its primary use cases?
Architecture: Decoder-only (stack of Transformer decoder layers with masked self-attention)
Primary Use Cases:
- Text generation and completion
- Creative writing and storytelling
- Conversational AI and chatbots
- Code generation
- Summarization
- Translation
- Content creation
Key Characteristic: Autoregressive generation where each token is predicted based only on previous tokens, making it ideal for sequential text generation.
A2.4: What makes T5 unique among language models?
Unique Features:
Text-to-Text Framework: Every NLP task is reframed as text generation
- Input: Text prompt describing the task + data
- Output: Always text, regardless of task type
Unified Approach: Translation, summarization, classification, QA all use the same format
Full Encoder-Decoder: Leverages both components for versatility
Pre-training: Denoising objective on massive C4 corpus
Benefits:
- Simplifies model architecture across diverse tasks
- Enables transfer learning across task types
- Highly versatile and flexible
A2.5: What is the difference between bidirectional and causal attention?
Bidirectional Attention (Fully Visible):
- Each token can attend to ALL tokens in the sequence (past and future)
- Used in encoder architectures (e.g., BERT)
- Provides complete context for understanding
- Best for comprehension tasks
Causal Attention (Masked/Unidirectional):
- Each token can only attend to PREVIOUS tokens (and itself)
- Used in decoder architectures (e.g., GPT)
- Prevents "looking ahead" during generation
- Essential for autoregressive text generation
- Ensures the model predicts based only on available context
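Illustrative Sketch: Mechanically, the only difference is a mask applied to the attention score matrix before softmax. A minimal sketch (the sequence length is arbitrary):
```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # stand-in for Q @ K.T
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)       # True above the diagonal
scores[mask] = -np.inf      # causal: future positions receive zero attention after softmax
# With no mask applied, attention is bidirectional: every token sees every other token.
```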
3. Tokenization
A3.1: What is tokenization and why is it important for LLMs?
Definition: Tokenization is the process of converting raw text into numerical tokens that LLMs can process and understand.
Importance:
- Bridges the gap between human language and machine processing
- Directly impacts model performance and efficiency
- Determines vocabulary size and model capacity
- Affects how well models handle rare words and out-of-vocabulary terms
- Influences computational requirements and memory usage
- Critical for multilingual models and diverse language support
A3.2: What is Byte Pair Encoding (BPE) and how does it work?
Definition: BPE is a subword tokenization algorithm that builds vocabulary through iterative merging.
How it works:
- Start with each character as an initial token
- Identify the most frequent pair of characters/subwords in training data
- Merge this pair into a new subword unit
- Repeat iteratively until desired vocabulary size is reached
Characteristics:
- Frequency-based approach
- Robust and general-purpose
- Handles out-of-vocabulary words by breaking them into subwords
- Fully lossless (preserves all spaces)
- Used in GPT and Llama models
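Illustrative Sketch: The merge loop is easy to demonstrate on a toy corpus. The sketch below assumes a handful of words and a fixed number of merges; production BPE implementations add byte-level fallback, pre-tokenization, and special tokens.
```python
from collections import Counter

corpus = ["low", "lower", "lowest", "newest", "widest"]
words = Counter(tuple(w) for w in corpus)       # start with characters as tokens

def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(words, pair):
    merged = Counter()
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # merge the pair into one subword
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

for _ in range(6):                                  # repeat until the vocab budget is reached
    words = apply_merge(words, most_frequent_pair(words))
print(list(words))   # words are now sequences of learned subwords, e.g. ('low', 'est')
```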
A3.3: What is WordPiece tokenization and how does it differ from BPE?
Definition: WordPiece is a subword tokenization algorithm developed by Google, used in BERT.
How it differs from BPE:
- Selection Criterion: Uses probabilistic approach (maximizes likelihood) rather than simple frequency
- Merge Strategy: Selects merges that maximize training data likelihood
- Splitting Behavior: Splits rare words more conservatively
- Linguistic Quality: Often produces more linguistically meaningful subwords
- Lossiness: Lossy method (doesn't preserve spaces), unlike BPE
- Performance: May offer better generalization for morphologically rich languages
A3.4: What is SentencePiece and when is it particularly useful?
Definition: SentencePiece is a language-agnostic tokenization method that works directly on raw text as a continuous stream of characters.
Key Features:
- No pre-tokenization required
- Treats text as continuous character stream (including spaces)
- Can implement BPE-like merging or Unigram Language Model
- Uses special markers to denote word boundaries
- Partially lossless (retains single space for multiple consecutive spaces)
Particularly Useful For:
- Languages without explicit word boundaries (Chinese, Japanese, Korean)
- Multilingual models
- Situations requiring language-agnostic processing
- Used in T5 and ALBERT models
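Illustrative Sketch: For reference, a hedged usage sketch with the open-source sentencepiece library; the corpus file name, vocabulary size, and model type are placeholders.
```python
import sentencepiece as spm

# Train directly on raw text: no whitespace pre-tokenization step is required
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp", vocab_size=8000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="sp.model")
print(sp.encode("Hello world", out_type=str))   # pieces carry the boundary marker "▁"
```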
A3.5: What are the challenges of tokenization for non-English languages?
Key Challenges:
Excessive Fragmentation: Tokenizers trained on English-centric corpora may excessively fragment text in non-Latin scripts
Reduced Efficiency: Over-tokenization leads to:
- Longer token sequences
- Slower generation speed
- Increased computational requirements
Performance Degradation: Can cause LLMs to generate incorrect or nonsensical responses
Vocabulary Imbalance: English tokens may be over-represented in vocabulary
Morphological Complexity: Some languages have richer morphology requiring different tokenization strategies
2024 Solutions:
- Tailored vocabulary sets for specific target languages
- Fine-tuning to reduce token fragmentation
- Language-specific tokenization frameworks
4. Training & Fine-Tuning
A4.1: What is fine-tuning and why is it important for LLMs?
Definition: Fine-tuning is the process of taking a pre-trained LLM and further training it on a smaller, specialized dataset to adapt it to a specific task or domain.
Importance:
- Enhances performance for particular applications
- Increases accuracy and relevance for specialized tasks
- Adapts general knowledge to organizational needs
- More efficient than training from scratch
- Enables customization while leveraging pre-trained knowledge
- Essential for real-world enterprise applications
Market Impact: The global market for LLM fine-tuning services reached USD 1.42 billion in 2024, reflecting robust adoption across industries.
A4.2: What is Parameter-Efficient Fine-Tuning (PEFT)?
Definition: PEFT refers to techniques that fine-tune LLMs efficiently by optimizing only a small subset of parameters or adding lightweight auxiliary components, rather than training the entire model.
Key Benefits:
- Significantly reduces computational requirements
- Lowers memory usage
- More cost-effective than full fine-tuning
- Preserves general knowledge from pre-training
- Reduces risk of overfitting
- Makes fine-tuning accessible on smaller hardware
Common PEFT Methods:
- LoRA (Low-Rank Adaptation)
- QLoRA (Quantized LoRA)
- Adapter layers
- Prefix tuning
A4.3: What is LoRA (Low-Rank Adaptation) and what are its benefits?
Definition: LoRA is a PEFT technique that freezes pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture.
How it works:
- Represents weight changes by multiplying two smaller, low-rank matrices
- Only these low-rank matrices are trained
- Original model weights remain frozen
Benefits:
- Efficiency: Reduces trainable parameters by up to 10,000x
- Memory: Reduces GPU memory requirements by 3x
- Performance: Achieves comparable or better results than full fine-tuning
- Accessibility: Enables fine-tuning on consumer-grade GPUs
- Modularity: Multiple LoRA adapters can be swapped for different tasks
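Illustrative Sketch: A minimal PyTorch sketch of the mechanism described above; the rank, scaling factor, and layer size are illustrative assumptions (real implementations such as the peft library wrap specific attention projections).
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze the pre-trained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B starts at zero: no change at init
        self.scale = alpha / rank

    def forward(self, x):
        # Effective weight is W + scale * (B @ A); only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 8192 trainable parameters vs. 262,656 frozen in the base layer
```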
A4.4: What is the difference between full fine-tuning and PEFT methods?
Full Fine-Tuning:
- Updates ALL model parameters
- Requires significant computational resources
- High memory requirements
- Longer training time
- Risk of catastrophic forgetting
- May overfit on small datasets
- Produces a complete new model
PEFT Methods:
- Updates only a small subset of parameters or adds lightweight modules
- Dramatically reduced computational requirements
- Lower memory footprint (up to 3x reduction)
- Faster training
- Preserves pre-trained knowledge better
- Less prone to overfitting
- Produces small adapter modules (can have multiple for different tasks)
- More accessible for resource-constrained environments
A4.5: What is Reinforcement Learning from Human Feedback (RLHF)?
Definition: RLHF is a technique that integrates human evaluation directly into the model's learning process to align LLM outputs with human values and preferences.
Process:
- Pre-train base LLM
- Collect human feedback on model outputs
- Train a reward model based on human preferences
- Use reinforcement learning (typically PPO) to optimize the LLM based on the reward model
Benefits:
- Aligns models with human values
- Improves helpfulness, harmlessness, and honesty
- Enhances contextual understanding
- Reduces bias and increases safety
- Better task completion and user satisfaction
Examples: Used in InstructGPT, GPT-4, Claude 3, and other advanced LLMs.
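Illustrative Sketch: Step 3, training the reward model, typically uses a pairwise (Bradley-Terry style) objective. Below is a hedged PyTorch sketch, where the inputs are assumed to be scalar scores a reward model assigned to the human-preferred and rejected responses.
```python
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    # Push the score of the human-preferred response above the rejected one
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```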
A4.6: What are the main challenges in implementing RLHF?
Key Challenges:
Data Quality Issues:
- Subjective human preferences lead to inconsistencies
- Annotator fatigue affects quality
- Expensive and time-consuming to collect
Scalability:
- Resource-intensive process
- Requires large teams of human annotators
- High computational costs
Reward Model Problems:
- Instability and inaccuracies
- Reward hacking (exploiting flaws rather than true alignment)
- Difficulty capturing nuanced preferences
Training Instability:
- PPO algorithms challenging to tune
- Prone to instability during training
Alignment Tax:
- Optimizing for specific preferences may degrade general capabilities
- Performance trade-offs on broader benchmarks
Reduced Output Diversity:
- RLHF can decrease response variety compared to supervised fine-tuning
A4.7: What is Direct Preference Optimization (DPO)?
Definition: DPO is an alternative to RLHF that simplifies the alignment process by directly fine-tuning the policy on preference datasets, bypassing the separate reward model.
How it differs from RLHF:
- No Reward Model: Eliminates the need for training a separate reward model
- Direct Optimization: Directly optimizes on preference pairs
- Simpler Pipeline: Fewer training stages
- More Stable: Avoids reward model instability issues
- More Efficient: Reduced computational requirements
Benefits:
- Simpler implementation
- More stable training
- Comparable or better performance
- Lower resource requirements
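Illustrative Sketch: A hedged sketch of the DPO objective from the original paper; the inputs are assumed to be summed log-probabilities of the chosen (y_w) and rejected (y_l) responses under the trainable policy and a frozen reference model.
```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin_w = policy_logp_w - ref_logp_w   # how much the policy favors y_w vs. the reference
    margin_l = policy_logp_l - ref_logp_l
    # Preference pairs train the policy directly: no separate reward model needed
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()
```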
5. Prompting Techniques
A5.1: What is zero-shot learning in the context of LLMs?
Definition: Zero-shot learning allows an LLM to perform a task it has never explicitly seen before without any task-specific training or examples.
Characteristics:
- Relies purely on pre-trained knowledge
- No examples provided in the prompt
- Model interprets instructions directly
- Leverages general language understanding
Best Used For:
- Simple, straightforward tasks
- Exploratory queries
- When clear, direct instructions can be provided
- Tasks within the model's pre-training knowledge
- Situations where examples aren't available
Example: "Translate this to French: Hello, how are you?"
A5.2: What is few-shot learning and when should it be used?
Definition: Few-shot learning involves providing the LLM with a small number of examples (typically 1-5 "shots") within the prompt to guide its understanding and performance on a new task.
Characteristics:
- Minimal in-context learning
- Model infers task structure from examples
- Generalizes from provided demonstrations
- No parameter updates required
When to Use:
- Model needs to learn a new concept
- Precise output format required
- Complex tasks needing demonstration
- Data is very limited
- Zero-shot performance is insufficient
- Task requires specific style or structure
Example:
Sentiment: "I love this!" → Positive
Sentiment: "This is terrible." → Negative
Sentiment: "It's okay." → ?A5.3: What is Chain-of-Thought (CoT) prompting?
Definition: CoT prompting is a technique that encourages the model to reason step-by-step, breaking down complex problems into intermediate reasoning steps.
How it works:
- Prompt includes explicit reasoning steps
- Model generates intermediate thoughts before final answer
- Can be used in both few-shot and zero-shot settings
Zero-Shot CoT: Simply append "Let's think step by step" to the prompt
Benefits:
- Significantly improves complex problem-solving
- Enhances logical reasoning
- Makes model's thought process more transparent
- Better performance on mathematical and logical tasks
- Reduces errors in multi-step problems
A5.4: What is the difference between zero-shot and few-shot prompting?
Zero-Shot Prompting:
- No examples provided
- Direct instruction only
- Relies on pre-trained knowledge
- Faster to implement
- Works well for simple, clear tasks
- More flexible and exploratory
Few-Shot Prompting:
- Includes 1-5 demonstration examples
- Shows desired input-output pattern
- Helps model understand task structure
- Better for complex or specific tasks
- Requires more careful prompt design
- Generally higher performance on specialized tasks
When to Choose:
- Zero-shot: Simple tasks, exploration, no examples available
- Few-shot: Complex tasks, specific formats, when examples can be provided
A5.5: What are best practices for prompt engineering in 2024?
Best Practices:
Clear Instructions:
- Be specific and detailed
- Separate instructions from context using delimiters
- Define desired output format
Provide Context:
- Include relevant background information
- Use reference texts for factual accuracy
- Specify the role or persona if helpful
Break Down Complex Tasks:
- Split into simpler subtasks
- Reduces error rates
- Makes process more manageable
Use Advanced Techniques:
- Chain-of-Thought for reasoning tasks
- Few-shot examples for complex patterns
- Contrastive examples (positive and negative)
Iterate and Test:
- Test different phrasings
- Evaluate outputs systematically
- Refine based on results
Specify Constraints:
- Define what to avoid
- Set length limits if needed
- Specify tone and style
6. Generation Parameters
A6.1: What is the temperature parameter in LLM text generation?
Definition: Temperature is a parameter that controls the randomness of token selection during text generation by adjusting the probability distribution of possible next tokens.
How it works:
- Applied to the logits (raw prediction scores) before sampling
- Lower values sharpen the distribution (more deterministic)
- Higher values flatten the distribution (more random)
Temperature Ranges:
- 0.0: Deterministic (always selects highest probability token)
- 0.0-0.5: Low randomness, focused, predictable
- 0.6-1.0: Balanced creativity and coherence
- 1.0: Default in many models
- 1.0-2.0: High randomness, creative, exploratory
Use Cases:
- Low (0-0.5): Factual tasks, technical writing, summarization
- High (0.7-2.0): Creative writing, brainstorming, poetry
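Illustrative Sketch: A small NumPy sketch of the effect (the logit values are arbitrary):
```python
import numpy as np

def next_token_distribution(logits, temperature):
    scaled = np.asarray(logits) / temperature   # temperature rescales logits before softmax
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
print(next_token_distribution(logits, 0.2))   # sharp: almost all mass on the top token
print(next_token_distribution(logits, 1.0))   # the model's unmodified distribution
print(next_token_distribution(logits, 2.0))   # flat: probabilities drift toward uniform
# Temperature 0 is handled as a special case in practice: plain argmax, no sampling.
```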
A6.2: What is top-p (nucleus sampling)?
Definition: Top-p (nucleus sampling) is a parameter that controls randomness by selecting from a dynamic subset of tokens whose cumulative probability exceeds a threshold p.
How it works:
- Tokens are sorted by probability
- Cumulative probability is calculated
- The smallest set of tokens whose cumulative probability reaches p is kept
- Selection is made from this dynamic "nucleus"
Top-p Values:
- Lower (closer to 0): Smaller nucleus, more focused, predictable
- 0.1: Only top 10% probability mass considered
- 0.9-0.95: Common values, balanced approach
- Higher (closer to 1): Larger nucleus, more diverse, creative
Adaptive Nature: The number of tokens in the nucleus varies based on the probability distribution at each step.
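Illustrative Sketch: A hedged NumPy sketch of one sampling step (the toy probabilities are arbitrary):
```python
import numpy as np

def top_p_sample(probs, p=0.9, seed=None):
    rng = np.random.default_rng(seed)
    order = np.argsort(probs)[::-1]               # sort tokens by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest prefix whose mass reaches p
    nucleus = order[:cutoff]
    renormalized = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renormalized)    # sample only from the nucleus

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_p_sample(probs, p=0.9))                 # only tokens 0-2 can ever be chosen
```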
A6.3: How do temperature and top-p interact, and should they be adjusted together?
Interaction:
- Both control randomness/creativity but in different ways
- Temperature acts as a "global thermostat" affecting overall distribution
- Top-p dynamically adjusts the pool of candidate tokens
- Their effects can compound and interact in complex ways
Best Practice: Experts generally advise modifying either temperature OR top-p, but NOT both simultaneously.
Reasoning:
- Overlapping effects make outcomes unpredictable
- Difficult to understand which parameter is causing specific behaviors
- Simpler to tune one parameter at a time
- More predictable and controllable results
Recommended Approach:
- Choose one parameter to adjust based on your use case
- Keep the other at its default value
- Test systematically
A6.4: When should you use a low temperature setting?
Use Low Temperature (0.0-0.5) When:
Factual Accuracy is Critical:
- Technical documentation
- Scientific writing
- Medical or legal content
Consistency is Required:
- Standardized responses
- Reproducible outputs
- Automated systems
Structured Tasks:
- Data extraction
- Classification
- Summarization of factual content
- Code generation for well-defined problems
Precision Over Creativity:
- Question answering
- Information retrieval
- Translation of technical content
Effect: More predictable, focused, and coherent outputs with less variation.
A6.5: When should you use a high temperature setting?
Use High Temperature (0.7-2.0) When:
Creativity is Desired:
- Creative writing and storytelling
- Poetry generation
- Marketing copy with unique angles
Exploration is Needed:
- Brainstorming sessions
- Generating diverse ideas
- Exploring multiple perspectives
Variety is Important:
- Generating multiple different responses
- Avoiding repetitive outputs
- Creating varied content
Imaginative Tasks:
- Character dialogue
- Fictional scenarios
- Artistic descriptions
Caution: Very high temperatures (>1.5) can lead to incoherent or nonsensical outputs.
7. Embeddings & Vector Databases
A7.1: What are embeddings in the context of AI and LLMs?
Definition: Embeddings are high-dimensional numerical representations (vectors) of data such as text, images, or audio, generated by machine learning models.
Key Characteristics:
- Capture semantic meaning and relationships
- Similar items are located close together in vector space
- Enable mathematical operations on semantic concepts
- Typically hundreds to thousands of dimensions
Example Relationship:
- vector("king") - vector("man") + vector("woman") ≈ vector("queen")
Applications:
- Semantic search
- Similarity detection
- Clustering and classification
- Recommendation systems
- RAG (Retrieval-Augmented Generation)
A7.2: What is a vector database and how does it differ from traditional databases?
Definition: Vector databases are specialized data storage systems designed to efficiently store, index, and retrieve high-dimensional vectors (embeddings).
Key Differences from Traditional Databases:
Traditional Databases:
- Store structured data (rows, columns)
- Exact match queries
- SQL-based querying
- Optimized for transactional operations
Vector Databases:
- Store high-dimensional vectors
- Similarity-based searches
- Specialized indexing for vector operations
- Optimized for nearest neighbor searches
- Use distance metrics (cosine similarity, Euclidean distance)
Key Features:
- Optimized vector storage
- Efficient similarity search algorithms
- Scalability for billions of vectors
- Support for real-time queries
Market Growth: Projected to reach USD 7.34 billion by 2030 from USD 1.66 billion in 2023 (CAGR 23.7%).
A7.3: What is semantic search?
Definition: Semantic search is a search technique that understands the intent and contextual meaning behind a user's query, rather than just matching keywords.
How it works:
- Query is converted to an embedding vector
- Vector database searches for similar embeddings
- Results are ranked by semantic similarity
- Returns contextually relevant results
Advantages over Keyword Search:
- Understands synonyms and related concepts
- Captures user intent
- Handles natural language queries
- More accurate and relevant results
- Doesn't require exact keyword matches
Example:
- Query: "warm clothing"
- Results: sweaters, coats, jackets (even without those exact words)
A7.4: How do embeddings enable semantic search?
Process:
Indexing Phase:
- Documents are converted to embeddings using an embedding model
- Embeddings are stored in a vector database
- Each document has a corresponding vector representation
Query Phase:
- User query is converted to an embedding using the same model
- Query embedding is compared to document embeddings
- Similarity is calculated using distance metrics
- Most similar documents are retrieved
Key Enabler:
- Embeddings capture semantic meaning in numerical form
- Similar meanings = similar vectors
- Vector similarity = semantic similarity
- Enables mathematical comparison of meaning
Why It Works:
- Embeddings learned from vast datasets understand language relationships
- Contextual information is encoded in the vector space
- Similarity in vector space correlates with semantic similarity
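Illustrative Sketch: The whole flow fits in a few lines. In the sketch below, embed() is a pseudo-random placeholder for a real embedding model, so the ranking is illustrative only; with a real model, nearby vectors would correspond to similar meanings.
```python
import numpy as np

def embed(text):
    # Placeholder: pseudo-random unit vector; a real embedding model would map
    # semantically similar texts to nearby vectors.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

docs = ["sweaters and wool coats", "garden hose fittings", "winter jackets"]
index = np.stack([embed(d) for d in docs])   # indexing phase: store one vector per document

query_vec = embed("warm clothing")           # query phase: same model, same vector space
scores = index @ query_vec                   # cosine similarity (all vectors are unit-norm)
print(docs[int(np.argmax(scores))])          # nearest document under this toy embedding
```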
A7.5: What are common similarity metrics used in vector databases?
1. Cosine Similarity:
- Measures the cosine of the angle between two vectors
- Range: -1 to 1 (1 = identical direction, 0 = orthogonal, -1 = opposite)
- Best for: Text embeddings, when magnitude doesn't matter
- Most common in NLP applications
2. Euclidean Distance (L2):
- Measures straight-line distance between vectors
- Range: 0 to ∞ (0 = identical, larger = more different)
- Best for: When absolute distances matter
- Sensitive to vector magnitude
3. Dot Product:
- Measures the product of vector magnitudes and cosine of angle
- Combines magnitude and direction
- Best for: When both magnitude and direction are important
- Computationally efficient
4. Manhattan Distance (L1):
- Sum of absolute differences along each dimension
- Best for: High-dimensional spaces, certain optimization problems
Selection Criteria:
- Depends on embedding model and use case
- Cosine similarity most common for text
- Should match the metric the embedding model was optimized for
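Illustrative Sketch: For concreteness, all four metrics computed for a pair of example vectors:
```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # direction only
euclidean = np.linalg.norm(a - b)                          # straight-line (L2) distance
dot = a @ b                                                # magnitude and direction combined
manhattan = np.abs(a - b).sum()                            # L1: sum of per-dimension differences
print(cosine, euclidean, dot, manhattan)                   # ~0.996, 3.0, 25.0, 5.0
```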
8. Challenges & Solutions
A8.1: What are hallucinations in LLMs?
Definition: Hallucinations are instances where LLMs generate inaccurate, fabricated, or nonsensical information that appears convincing but is not grounded in reality or the provided context.
Types:
- Factual Hallucinations: Incorrect facts or data
- Contextual Hallucinations: Information not supported by the input
- Confabulation: Making up plausible-sounding but false details
Why They Occur:
- Models predict based on patterns, not true understanding
- Training data may contain inaccuracies
- No inherent fact-checking mechanism
- Pressure to generate fluent, coherent text
- Lack of access to real-time or verified information
Impact:
- Critical concern for high-stakes applications (medicine, law, finance)
- Undermines trust and reliability
- Can spread misinformation
A8.2: What is Retrieval-Augmented Generation (RAG)?
Definition: RAG is a technique that combines the generative capabilities of LLMs with information retrieval from external, reliable knowledge bases.
How it works:
- User submits a query
- System retrieves relevant information from external sources (vector database, documents)
- Retrieved information is provided as context to the LLM
- LLM generates response grounded in the retrieved information
Components:
- Retriever: Finds relevant information (often using semantic search)
- Generator: LLM that produces the final response
- Knowledge Base: External source of verified information
Benefits:
- Grounds responses in factual, verified data
- Significantly reduces hallucinations
- Enables access to up-to-date information
- Allows citation of sources
- More controllable and trustworthy outputs
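Illustrative Sketch: A hedged end-to-end sketch of the retrieve-then-generate flow. Here retrieve() is a toy word-overlap scorer standing in for the semantic search described in section 7, and llm is assumed to be any callable that maps a prompt string to a completion.
```python
def retrieve(query, documents, top_k=2):
    # Toy relevance score: count of overlapping words (a real system uses embeddings)
    q = set(query.lower().split())
    scored = sorted(documents, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def answer_with_rag(query, documents, llm):
    context = "\n\n".join(retrieve(query, documents))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)   # the response is grounded in the retrieved passages
```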
A8.3: How does RAG help reduce hallucinations?
Mechanisms:
Factual Grounding:
- Provides verified information as context
- LLM generates based on real data, not just patterns
- Reduces reliance on potentially faulty training data
External Knowledge Access:
- Accesses up-to-date information beyond training cutoff
- Retrieves domain-specific knowledge
- Can include proprietary or specialized data
Source Attribution:
- Enables citing sources
- Makes verification possible
- Increases accountability
Contextual Relevance:
- Retrieves specifically relevant information
- Reduces need for model to "guess"
- Provides concrete facts to work with
Effectiveness:
- Stanford 2024 study: 96% reduction in hallucinations when RAG combined with other techniques
- Particularly effective for factual, knowledge-intensive tasks
A8.4: What are other techniques to mitigate LLM hallucinations?
Key Mitigation Techniques (2024):
Fine-Tuning on High-Quality Data:
- Use carefully curated, domain-specific datasets
- Minimize exposure to biased or inaccurate information
- Improves factual accuracy
RLHF (Reinforcement Learning from Human Feedback):
- Human evaluation prioritizes factual accuracy
- Refines model behavior toward truthfulness
- Aligns outputs with human values
Advanced Prompting:
- Chain-of-Thought for logical reasoning
- Clear, specific instructions
- Provide reference texts
Post-Processing and Filtering:
- Cross-reference against verified databases
- Rule-based systems to catch errors
- Rank outputs by factual consistency
Human-in-the-Loop (HITL):
- Expert review of outputs
- Real-time feedback mechanisms
- Critical for high-stakes applications
Factual Consistency Scoring:
- Algorithmic detection of hallucinations
- Specialized models for validation
- Automated fact-checking
Hybrid Approaches:
- Combine multiple techniques (RAG + RLHF + guardrails)
- Multi-layered defense
- Best results in 2024 research
A8.5: What is the "alignment tax" in RLHF?
Definition: The alignment tax refers to the potential degradation of a model's general capabilities or performance on broader NLP benchmarks when it is optimized for specific human preferences through RLHF.
How it occurs:
- RLHF optimizes for specific reward signals
- May inadvertently reduce performance on other tasks
- Trade-off between specialized alignment and general capability
- Model becomes more focused but potentially less versatile
Examples:
- Model aligned for safety may become overly cautious
- Optimization for specific style may reduce creativity
- Focus on particular domains may weaken others
Implications:
- Need to balance alignment with general capability
- Careful reward model design is critical
- Multi-objective optimization may help
- Important consideration in RLHF implementation
A8.6: What is reward hacking in reinforcement learning?
Definition: Reward hacking occurs when a model exploits flaws or loopholes in the reward system to maximize its reward score without actually achieving the intended behavior or alignment.
How it happens:
- Reward model has imperfections or blind spots
- Model finds unintended ways to score highly
- Optimizes for the metric rather than the goal
- Similar to "teaching to the test" in education
Examples:
- Generating verbose responses to appear more helpful
- Using specific phrases that score well without being genuinely useful
- Exploiting ambiguities in reward criteria
- Gaming the evaluation metric
Consequences:
- Model appears aligned but isn't truly helpful
- Undermines the purpose of RLHF
- Can lead to unexpected behaviors
- Reduces model reliability
Mitigation:
- Robust reward model design
- Multiple reward models (multi-objective)
- Bayesian Reward Model Ensembles (BRME)
- Regular auditing and testing
- Human oversight
9. Advanced Concepts
A9.1: What is the context window in LLMs?
Definition: The context window is the maximum number of tokens an LLM can process in a single interaction, including both input (prompt) and output (generated text).
Key Aspects:
- Measured in tokens (not words or characters)
- Determines how much information the model can "remember" at once
- Includes prompt + conversation history + generated response
Typical Sizes (2024):
- Early models: 2,048-4,096 tokens
- Modern models: 8,192-32,768 tokens
- Extended context models: 100,000+ tokens (some up to 1-2 million)
Limitations:
- Information beyond the window is not accessible
- Longer contexts require more computational resources
- May impact response quality at extreme lengths
Importance:
- Determines ability to process long documents
- Affects conversation memory
- Critical for complex, multi-turn interactions
A9.2: What are the challenges of processing long sequences in Transformers?
Key Challenges:
Computational Complexity:
- Self-attention has O(n²) complexity with sequence length
- Quadratic growth in computation and memory
- Becomes prohibitive for very long sequences
Memory Requirements:
- Attention matrices grow quadratically
- GPU memory limitations
- Expensive to store and compute
Training Difficulty:
- Longer sequences require more resources
- Slower training times
- Gradient flow issues
2024 Solutions:
Sparse Attention:
- Only attend to subset of tokens
- Reduces complexity while maintaining performance
Clustered Attention:
- Group similar tokens
- Attend within clusters
Memory-Augmented Attention:
- External memory mechanisms
- Compress historical information
Hybrid Architectures:
- Combine Transformers with State Space Models (e.g., Jamba)
- Convolutional approaches (e.g., Hyena)
Impact:
- Order-of-magnitude improvements in speed and memory
- Enable processing entire books vs. paragraphs
- Critical for extended context LLMs
A9.3: What is Reinforcement Learning from AI Feedback (RLAIF)?
Definition: RLAIF is an alternative to RLHF that leverages AI-generated feedback instead of human annotations to train the reward model and align LLMs.
How it works:
- Use an AI system (often another LLM) to evaluate outputs
- AI generates preference rankings or feedback
- Train reward model on AI feedback
- Use RL to optimize the model
Advantages over RLHF:
- Scalability: No need for large teams of human annotators
- Cost-Effective: Reduces expensive human annotation
- Consistency: AI feedback can be more consistent
- Speed: Faster feedback generation
- Accessibility: Makes alignment more accessible
Challenges:
- AI feedback quality depends on the evaluator model
- May inherit biases from the AI evaluator
- Less diverse perspectives than human feedback
- Validation of AI judgments needed
Use Cases:
- Supplementing human feedback
- Scaling alignment to more tasks
- Iterative improvement cycles
A9.4: What is the difference between supervised fine-tuning and RLHF?
Supervised Fine-Tuning (SFT):
- Training Signal: Direct input-output pairs
- Objective: Minimize prediction error on labeled examples
- Process: Standard supervised learning
- Data: Requires high-quality labeled examples
- Output: Model learns to imitate training examples
- Diversity: Generally maintains higher output diversity
- Complexity: Simpler to implement
RLHF (Reinforcement Learning from Human Feedback):
- Training Signal: Human preference rankings
- Objective: Maximize reward based on human preferences
- Process: Multi-stage (SFT → reward model → RL optimization)
- Data: Requires preference comparisons
- Output: Model learns what humans prefer
- Diversity: May reduce output diversity
- Complexity: More complex, requires RL algorithms
Relationship:
- RLHF typically starts with SFT as a base
- SFT provides basic capability, RLHF aligns preferences
- SFT alone may not capture nuanced human preferences
- RLHF better for alignment, SFT better for capability
A9.5: What is QLoRA?
Definition: QLoRA (Quantized Low-Rank Adaptation) is an extension of LoRA that further optimizes fine-tuning by quantizing the frozen base model's weights to lower precision (typically 4-bit) while training LoRA adapters on top.
How it works:
- Combines LoRA's low-rank adaptation with quantization
- Stores base model in quantized format (4-bit)
- Trains low-rank adapters in higher precision
- Dramatically reduces memory requirements
Benefits:
- Extreme Memory Efficiency: Additional memory savings beyond LoRA
- Accessibility: Enables fine-tuning of larger models on consumer GPUs
- Performance: Maintains comparable performance to full precision
- Cost-Effective: Reduces hardware requirements significantly
Example Impact:
- Can fine-tune 65B parameter models on a single 48GB GPU
- Makes large model fine-tuning accessible to researchers and smaller organizations
Trade-offs:
- Slight potential performance degradation from quantization
- More complex implementation
- Requires careful calibration
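Illustrative Sketch: A hedged configuration sketch using the Hugging Face transformers, peft, and bitsandbytes stack; the model name and hyperparameters are placeholders.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in higher precision
)
model = AutoModelForCausalLM.from_pretrained("your-base-model",
                                             quantization_config=bnb_config)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```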
10. Practical Applications
A10.1: What is the text-to-text framework in T5?
Definition: The text-to-text framework is T5's approach where every NLP task is reformulated as a text generation problem, with both inputs and outputs being text strings.
How it works:
- Input: Task description + input data (all as text)
- Output: Always text, regardless of task type
- Unified Format: Same model architecture handles all tasks
Examples:
Translation:
- Input: "translate English to German: That is good."
- Output: "Das ist gut."
Classification:
- Input: "sentiment: This movie is terrible!"
- Output: "negative"
Summarization:
- Input: "summarize: [long article text]"
- Output: "[summary text]"
Question Answering:
- Input: "question: What is the capital? context: Paris is the capital of France."
- Output: "Paris"
Benefits:
- Simplifies model architecture across tasks
- Enables transfer learning between different task types
- Consistent training and inference pipeline
- Highly versatile and flexible
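Illustrative Sketch: A hedged usage sketch with the Hugging Face transformers library; t5-small is one publicly released checkpoint, and the task prefix follows T5's convention.
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))   # expected: "Das ist gut."
```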
A10.2: What are Small Language Models (SLMs) and why are they gaining prominence?
Definition: Small Language Models (SLMs) are language models with comparatively few parameters (typically under 10B), designed to be more efficient while maintaining strong performance on specific tasks.
Characteristics:
- Fewer parameters than large LLMs
- More efficient inference
- Lower computational requirements
- Often specialized for specific domains
- Can run on edge devices or consumer hardware
Why They're Gaining Prominence (2024):
Efficiency:
- Lower computational costs
- Faster inference
- Reduced energy consumption
Accessibility:
- Can run on smaller hardware
- More affordable to deploy
- Democratizes AI access
Performance:
- Often superior for specific, well-defined tasks
- Moving away from "bigger is better" mentality
- Specialized models outperform general ones in their domain
Practical Deployment:
- Easier to deploy in production
- Lower latency
- Better for real-time applications
Privacy:
- Can run on-device
- Reduces data transmission
- Better for sensitive applications
Trend: 2024 saw increased focus on SLMs as organizations recognize that task-specific smaller models often provide better ROI than general-purpose large models.
A10.3: What is hybrid search in the context of vector databases?
Definition: Hybrid search combines traditional keyword-based search with vector-based semantic search to provide more comprehensive and relevant results.
How it works:
- Keyword Search: Performs traditional exact/fuzzy matching
- Vector Search: Performs semantic similarity search
- Combination: Merges and ranks results from both approaches
- Scoring: Often uses weighted combination or reciprocal rank fusion
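Illustrative Sketch: A sketch of reciprocal rank fusion over two ranked result lists; k = 60 is the constant commonly used in the literature.
```python
def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]    # e.g. from BM25 / keyword search
vector_hits = ["doc1", "doc5", "doc3"]     # e.g. from semantic vector search
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['doc1', 'doc3', 'doc5', 'doc7']: documents found by both methods rise to the top
```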
Benefits:
Leverages Strengths of Both:
- Keyword Search: Precise matches, specific terms, proper nouns
- Vector Search: Semantic understanding, synonyms, context
Improved Relevance:
- Catches results that either method alone might miss
- More robust to different query types
- Better overall search quality
Flexibility:
- Can adjust weights based on use case
- Adaptable to different domains
- Handles both specific and conceptual queries
Use Cases:
- Enterprise search systems
- E-commerce product search
- Document retrieval
- Knowledge bases
A10.4: What is the role of human-in-the-loop (HITL) in LLM applications?
Definition: Human-in-the-Loop (HITL) refers to integrating human oversight and intervention into the LLM workflow to ensure quality, accuracy, and safety.
Key Roles:
Quality Assurance:
- Review and validate LLM outputs
- Identify errors and hallucinations
- Ensure factual accuracy
Safety and Compliance:
- Catch harmful or inappropriate content
- Ensure regulatory compliance
- Maintain ethical standards
Continuous Improvement:
- Provide feedback for model refinement
- Flag edge cases and failures
- Guide model updates
Domain Expertise:
- Apply specialized knowledge
- Validate technical or professional content
- Ensure context-appropriate responses
Implementation Approaches:
- Pre-deployment: Human review before outputs reach users
- Post-deployment: User feedback mechanisms
- Sampling: Review subset of outputs for quality monitoring
- Escalation: Automatic flagging of uncertain outputs for human review
Critical For:
- High-stakes applications (healthcare, legal, finance)
- Regulated industries
- Customer-facing applications
- Safety-critical systems
A10.5: What are the typical applications of encoder-only models like BERT?
Typical Applications:
Text Classification:
- Sentiment analysis
- Topic categorization
- Intent detection
- Spam detection
Named Entity Recognition (NER):
- Identifying people, places, organizations
- Extracting structured information
- Information extraction
Question Answering (Extractive):
- Finding answers within provided text
- Reading comprehension
- Document-based Q&A
Semantic Similarity:
- Duplicate detection
- Paraphrase identification
- Document similarity
Token Classification:
- Part-of-speech tagging
- Chunking
- Syntax analysis
Sentence Pair Tasks:
- Natural language inference
- Textual entailment
- Semantic textual similarity
Why BERT Excels:
- Bidirectional context understanding
- Deep comprehension of text meaning
- Strong performance on understanding tasks
- Pre-trained on large corpora
Not Suitable For:
- Text generation
- Creative writing
- Open-ended content creation