Thursday, May 29, 2025

UNDERSTANDING SPECIAL PROMPT MARKERS IN LARGE LANGUAGE MODELS (LLMs): WHAT, WHY, HOW


Imagine you’re directing a play. You need a way to tell the actors who’s speaking, what the scene is, what the mood should be, and when to start and stop talking. Large Language Models (LLMs), being mathematically glorified parrots with attention spans measured in tokens, need a similar stage script — and this is where special markers come in.


Markers such as [INST], <<SYS>>, or even custom delimiter tokens like <|user|> and <|assistant|> are not just syntactic candy. They serve a critical function: they structure the input, signal roles and intent, and ensure the model doesn't hallucinate that you're a banana asking for stock predictions.


WHY DO LLMs NEED SPECIAL MARKERS?


An LLM receives a string of tokens and tries to predict the next most likely token. It has no semantic concept of users, assistants, or system messages unless explicitly trained or prompted with patterns that encode such distinctions.


These patterns are implemented through prompt formatting, a fancy way of saying: wrap everything in clear markers so the model knows what to do with it.


Without these markers:

The model might misinterpret system instructions as user queries.

It might not differentiate between past and present user messages.

It could reply using the wrong tone or persona.

Multi-turn conversation contexts can fall apart.


Thus, special prompt markers segment and annotate the raw input string in a way that aligns with the model’s expectations — especially in fine-tuned models trained with instruction-following datasets.


COMMON TYPES OF SPECIAL MARKERS AND THEIR MEANING


Different models use different formatting conventions. Let’s explore the most common ones:


(1) [INST] … [/INST]


This format was popularized by Llama 2 Chat and Mistral-Instruct, and is used by many models fine-tuned on instruction-following datasets. Here's the layout:


[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France?
[/INST]


Explanation:

[INST] opens the instruction block.

<<SYS>> defines the system prompt: the high-level rules and persona.

The user prompt follows.

[/INST] closes the user instruction. What comes next is expected to be the assistant’s reply.


This structure is designed to guide the model’s behavior with clarity, especially in fine-tuned transformers that expect this exact token pattern during inference.
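To make this concrete, here is a minimal sketch of a helper that assembles a Llama-2-style prompt string. The function name build_llama2_prompt is purely illustrative, not part of any library:

def build_llama2_prompt(system_prompt, user_message):
    # Llama-2-style layout: system rules inside <<SYS>> ... <</SYS>>,
    # with the whole first turn wrapped in [INST] ... [/INST].
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(build_llama2_prompt("You are a helpful assistant.",
                          "What is the capital of France?"))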


(2) <|user|>, <|assistant|>, <|system|>


These role tokens are used by Hugging Face chat templates such as Zephyr; OpenAI's ChatML format uses the closely related <|im_start|> and <|im_end|> delimiters:


<|system|>
You are an AI tutor that teaches quantum computing to beginners.
<|user|>
Can you explain what a qubit is?
<|assistant|>
Certainly! A qubit is...


These pseudo-HTML tags act as role specifiers, telling the model who is speaking and what the context is.
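In practice you rarely hand-write these tags: Hugging Face tokenizers ship a chat template that renders them for you. A quick sketch, assuming the Zephyr tokenizer (whose template uses exactly these role tags):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are an AI tutor that teaches quantum computing to beginners."},
    {"role": "user", "content": "Can you explain what a qubit is?"},
]

# Renders the <|system|>/<|user|> tags and appends the <|assistant|> header
# so the model knows it is its turn to speak.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)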


(3) <s> and </s> (or BOS/EOS)


These are start and end-of-sequence tokens often injected automatically by the tokenizer. While you don’t typically write these manually, some prompt formats (especially when training from scratch) make them explicit to help the model understand turn-taking.
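You can watch the tokenizer do this injection yourself. A small sketch, assuming a Llama/Mistral-family tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

print(tokenizer.bos_token, tokenizer.eos_token)  # e.g. '<s>' '</s>'

with_special = tokenizer("Hello")["input_ids"]
without_special = tokenizer("Hello", add_special_tokens=False)["input_ids"]
print(with_special)     # starts with the BOS id (1 for Llama-family models)
print(without_special)  # raw content tokens only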


HOW DO MODELS LEARN TO UNDERSTAND THESE MARKERS?


The model only “understands” special tags if it has seen them during training or fine-tuning.

If you fine-tune a model on instruction pairs using [INST]…[/INST], then the model learns to expect and interpret these symbols as boundaries between instruction and response.

If a model is trained using raw text without role annotations, then these markers will mean nothing unless you add them consistently during fine-tuning.

In inference mode, using the wrong format (or omitting expected markers) can cause the model to output gibberish, ignore the system prompt, or blend roles incorrectly.


In other words: you must format your prompts in the way the model was trained to expect.
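For fine-tuning, that means every training example must be serialized with the same markers. A minimal sketch of what such a formatting step might look like (the record layout here is just an illustration):

def format_training_example(record):
    # Serialize one (instruction, response) pair with the Llama-2-style
    # markers the model is being taught to expect.
    return (
        f"<s>[INST] {record['instruction'].strip()} [/INST] "
        f"{record['response'].strip()} </s>"
    )

example = {
    "instruction": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}
print(format_training_example(example))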



EXAMPLES FROM REAL MODELS AND LIBRARIES


Mistral-Instruct Prompt (Llama 2-style):


<s>[INST] <<SYS>>
You are a coding assistant.
<</SYS>>

Write a Python function to reverse a string.
[/INST]


Zephyr-Style Role-Tag Prompt:


<|system|>
You are a poetic assistant.
<|user|>
Write a haiku about mountains.
<|assistant|>
Snow on distant peaks
Whispers lost in icy winds
Silence climbs with dawn


LLaMA 2 Chat Format (Meta’s style):


Uses the <s> and </s> sequence tokens (mapped to special IDs) together with [INST] … [/INST] and <<SYS>> … <</SYS>> blocks; libraries like transformers and llama.cpp apply this template for you when you pass role-based message structs.



BEST PRACTICES FOR USING MARKERS


Always check your model’s documentation or tokenizer behavior to see what prompt format it expects (see the snippet after this list).


Use consistent markers across your dataset if fine-tuning or training from scratch.


Avoid putting instructions outside the markers — the model may ignore or misinterpret them.


If mixing multiple turns, wrap each turn with role identifiers so the model tracks who said what.


Never assume the model will “just get it” if you skip the opening tag — it’s not psychic, it’s statistical.
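The easiest way to follow the first rule is to ask the tokenizer directly. A sketch of how you might inspect a model's expected format with transformers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Special tokens the tokenizer will inject or recognize
print(tokenizer.special_tokens_map)

# The Jinja chat template (if the model ships one) shows exactly how
# roles and markers are rendered
print(tokenizer.chat_template)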



FUTURE OF PROMPT MARKERS: WILL WE STILL NEED THEM?


Many emerging models now support structured JSON-style messages where you don’t need manual markers. The OpenAI Chat API, Anthropic’s Claude Messages API, and Google’s Gemini all use rich structured input schemas under the hood.


However, for raw model access, open-source models, or low-level inference frameworks (like llama.cpp, transformers, ctranslate2), you still need to wrap everything in these textual markers.


So until all models speak fluent protobufs or JSON natively, these humble tags will remain the duct tape of the prompting world.



CONCLUSION


Special markers like [INST], <<SYS>>, <|user|>, and their kin may seem like arcane syntax, but they are the glue that holds instruction-following and role-based prompting together. Without them, LLMs are like actors reading a script without knowing who they’re playing or what scene they’re in.


If you want your model to follow your instructions, take on a persona, or remember who’s talking — use the right tags, in the right order, and your model will thank you by behaving a lot less like a deranged improv artist and more like a helpful, coherent assistant.




CODE EXAMPLE


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # or any other instruct-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Define the system prompt. Note: Mistral-Instruct has no official system
# token; the Llama-2-style <<SYS>> block is simply inlined into the first
# instruction here.
system_prompt = """<<SYS>>
You are a helpful and honest AI assistant. Always provide clear, concise, and accurate answers.
<</SYS>>"""

# Define a multi-turn chat session
turn_1 = "What is the capital of Italy?"
turn_2 = "Can you also give me a fun fact about Rome?"

# Wrap each turn in [INST] ... [/INST], generate a reply, and append it to
# the running transcript before the next turn.
def run_chat(system, turns):
    full_prompt = ""
    assistant_reply = ""
    output_text = ""
    for idx, user_input in enumerate(turns):
        if idx == 0:
            full_prompt += f"<s>[INST] {system.strip()}\n\n{user_input.strip()} [/INST]"
        else:
            full_prompt += f"\n{assistant_reply.strip()} </s>\n<s>[INST] {user_input.strip()} [/INST]"
        # We write <s> ourselves, so stop the tokenizer from adding a second BOS.
        inputs = tokenizer(full_prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
        with torch.no_grad():
            # Greedy decoding; a temperature setting is only meaningful
            # with do_sample=True, so it is omitted here.
            outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Extract the last assistant reply (naive split on the closing marker)
        assistant_reply = output_text.split("[/INST]")[-1].strip()
    return output_text

# Build the conversation
chat_history = [turn_1, turn_2]
response = run_chat(system_prompt, chat_history)

# Print the full response
print("\n=== MODEL RESPONSE ===\n")
print(response)



WHAT DOES THIS SCRIPT DO?


1. It loads a Mistral-Instruct style model using Hugging Face.

2. It prepares a system-level directive wrapped in <<SYS>> … <</SYS>>.

3. It builds a multi-turn user conversation, incrementally appending model replies after each [INST] … [/INST] block.

4. It uses the [INST] and </s> special markers to denote instruction boundaries and model response turn ends.

5. It invokes the model via .generate() and prints the decoded output.
