Friday, April 25, 2025

How to See What's Happening Inside an LLM When It Receives a Prompt

1. Attention Visualization

Transformer-based language models use attention mechanisms to decide how different words relate to each other. You can visualize attention weights to see how the model connects input tokens.

Tools you can use: BertViz (https://github.com/jessevig/bertviz), ExBERT (https://exbert.net)

What you see: Relationships between input and output tokens. Where the model focuses when generating each token.
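
If you want a quick look at the raw attention weights without any external tool, Hugging Face models can return them directly. This is a minimal sketch, assuming the same GPT-2 model used in the next section; output_attentions=True makes the model return one attention tensor per layer.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# One attention tensor per layer, each of shape (batch, num_heads, seq_len, seq_len)
print(len(outputs.attentions), outputs.attentions[0].shape)

# Attention paid by the last token to every earlier position, averaged over heads
print(outputs.attentions[0][0].mean(dim=0)[-1])

Each row of these matrices sums to 1, so the values can be read as how much one position attends to the others; tools like BertViz simply render these same tensors graphically.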

2. Logits and Probability Distribution

The model outputs raw scores (logits) that become probabilities after applying a softmax function. You can inspect these probabilities to understand why certain tokens are chosen.

Example Python code (using Hugging Face Transformers):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model(**inputs)

# Logits for the last position, turned into probabilities over the vocabulary
logits = outputs.logits
probabilities = torch.softmax(logits[:, -1, :], dim=-1)

# Five most likely next tokens and their probabilities
top_probs, top_indices = torch.topk(probabilities, 5)
for prob, idx in zip(top_probs[0], top_indices[0]):
    print(tokenizer.decode(idx), prob.item())


What you see: Probabilities of possible next tokens. Insight into the token-selection process.

3. Hidden States and Embeddings

Language models convert tokens into embeddings and pass them through multiple layers, producing hidden states.

Example (using Hugging Face Transformers):

outputs = model(**inputs, output_hidden_states=True)

hidden_states = outputs.hidden_states  # contains hidden states from all layers

What you see: Numerical representations of input tokens at each layer. How those representations evolve through layers.
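
As a quick sanity check (a small continuation of the snippet above, assuming the GPT-2 model from section 2), you can print the shape of every entry; for GPT-2 there are 13, the token embeddings plus one per transformer layer, each of shape (batch, sequence length, 768).

for i, layer in enumerate(hidden_states):
    print(f"layer {i}: {tuple(layer.shape)}")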

4. Activation Visualization and Analysis

You can analyze neuron activations to understand how specific neurons respond to inputs.

Tools: OpenAI's Activation Atlas (https://distill.pub/2019/activation-atlas/), TransformerLens (https://github.com/neelnanda-io/TransformerLens)

What you see: Activation patterns of neurons. Neurons associated with specific linguistic or semantic features.
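
If you prefer to stay in plain PyTorch rather than installing a dedicated tool, forward hooks give similar access to activations. The sketch below is an illustration of that approach, not part of the tools listed above: it records the MLP output of GPT-2's first transformer block for the example prompt.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

captured = {}

def save_activation(module, inputs, output):
    # Called during the forward pass; stores the MLP output of block 0
    captured['mlp_block_0'] = output.detach()

handle = model.transformer.h[0].mlp.register_forward_hook(save_activation)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

print(captured['mlp_block_0'].shape)  # (batch, seq_len, hidden_size)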

5. Explainability and Interpretability Tools

Specialized frameworks help interpret model predictions.

Tools: SHAP (https://github.com/slundberg/shap), LIME (https://github.com/marcotcr/lime)

What you see: Importance scores for each input word or token. Which parts of the input influence the model's output most.
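
As one possible starting point (a sketch, not the canonical workflow of either project), SHAP can wrap a Hugging Face text-classification pipeline and attribute a prediction to individual tokens. The model name below is just an illustrative choice.

import shap
from transformers import pipeline

# Any text-classification pipeline works here; this model is an example choice
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

explainer = shap.Explainer(classifier)
shap_values = explainer(["The capital of France is Paris."])

# Per-token contribution scores for the predicted labels
print(shap_values.values[0])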

6. Prompt Engineering and Ablation Studies

Change prompts systematically to observe how the model's behavior changes.

What you see: How prompt variations affect model responses. Sensitivity of the model to specific words or phrasing.
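
A simple ablation can be run with the same GPT-2 setup from section 2: change one part of the prompt, keep everything else fixed, and compare the most likely next token (the prompt variants below are only illustrative).

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

prompts = [
    "The capital of France is",
    "The largest city in France is",
    "The capital of Germany is",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[:, -1, :], dim=-1)
    top_prob, top_idx = torch.max(probs[0], dim=-1)
    # Show how the most likely continuation shifts with each prompt variant
    print(f"{prompt!r} -> {tokenizer.decode(top_idx.item())!r} ({top_prob.item():.3f})")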

Recommended workflow to get started:

  1. Begin with attention visualization.
  2. Inspect logits and probabilities.
  3. Explore hidden states and neuron activations.

Using these methods and tools, you can gain insights into how language models process prompts and generate responses.
