Monday, April 28, 2025

The LLM Memory Estimator – Predicting Hardware Requirements with Context Size Consideration

Introduction

Large Language Models (LLMs) have transformed AI by enabling powerful tasks like text generation, code assistance, and complex knowledge queries. However, using LLMs efficiently requires careful planning of hardware resources, especially memory and compute power.


A critical mistake many users make is to consider only the number of model parameters. In reality, two main memory consumers must be accounted for:

  • Model parameters (fixed size)
  • Context activations (growing with prompt length)

Ignoring context size can easily cause out-of-memory errors when working with long prompts or during fine-tuning.


The LLM Memory Estimator solves this problem by taking into account both model parameters and context size to predict hardware needs properly.


Memory Breakdown

Memory consumption can be split into:

1. Model Parameters

2. Context Activations

3. System Overhead


Estimating Model Parameters

Each parameter occupies a fixed number of bytes, depending on the precision used:

  • FP32: 4 bytes
  • FP16: 2 bytes
  • INT8 / FP8: 1 byte

Model Memory = Number of Parameters × Bytes per Parameter ÷ (1024^3)
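For example, a 7-billion-parameter model in FP16: 7 × 10⁹ × 2 bytes ÷ 1024³ ≈ 13 GB.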


Estimating Context Activations

Activation Memory = Context Length × Hidden Size × Number of Layers × Bytes per Activation ÷ (1024^3)
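For example, a 4096-token context with hidden size 4096, 32 layers, and FP16 activations: 4096 × 4096 × 32 × 2 bytes ÷ 1024³ = 1 GB.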


System Overhead Reserve

Typically reserve 8 GB or more for system processes.


Sample Code for the Estimator

def estimate_memory_requirements(total_memory_gb, billion_params,
                                 precision="FP16",
                                 context_length=2048,
                                 hidden_size=4096,
                                 num_layers=32,
                                 overhead_gb=8,
                                 mode="inference"):
    # Bytes per value for each supported precision
    bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "FP8": 1}
    bytes_per_activation = {"FP32": 4, "FP16": 2, "INT8": 1, "FP8": 1}

    if precision not in bytes_per_param:
        raise ValueError("Unsupported precision")

    # Keep a reserve for the operating system and other processes
    available_memory_gb = total_memory_gb - overhead_gb
    if available_memory_gb <= 0:
        raise ValueError("Not enough memory after overhead")

    # Model weights: parameters x bytes per parameter
    num_params = billion_params * 1e9
    model_memory_gb = (num_params * bytes_per_param[precision]) / (1024**3)

    # Context activations: tokens x hidden size x layers x bytes per activation
    activation_memory_gb = (context_length * hidden_size * num_layers
                            * bytes_per_activation[precision]) / (1024**3)

    # Fine-tuning needs extra room for gradients, optimizer state,
    # and larger activation footprints (rough multipliers)
    if mode == "fine-tune":
        model_memory_gb *= 2.5
        activation_memory_gb *= 1.5

    total_memory_needed_gb = model_memory_gb + activation_memory_gb
    fits = total_memory_needed_gb <= available_memory_gb

    return {
        "Model Memory (GB)": model_memory_gb,
        "Activation Memory (GB)": activation_memory_gb,
        "Total Required (GB)": total_memory_needed_gb,
        "Available (GB)": available_memory_gb,
        "Fits": fits,
    }
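For instance, a call along these lines (a 64 GB machine, a 7-billion-parameter model in FP16, and a 4096-token context; the exact inputs behind the output below are assumed here, so the figures may differ slightly):

result = estimate_memory_requirements(total_memory_gb=64,
                                      billion_params=7,
                                      precision="FP16",
                                      context_length=4096)
for key, value in result.items():
    if isinstance(value, bool):
        print(f"{key}: {value}")
    else:
        print(f"{key}: {value:.2f}")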


Example Output:

Model Memory (GB): 13.06
Activation Memory (GB): 1.00
Total Required (GB): 14.06
Available (GB): 56.00
Fits: True


Impact of Large Contexts


Longer context = More activation memory.

At 32,768 tokens, activation memory can easily reach 8 GB or more, significantly affecting total requirements.
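For example, with the default hidden size of 4096 and 32 layers in FP16: 32768 × 4096 × 32 × 2 bytes ÷ 1024³ = 8 GB of activations alone, and larger hidden sizes or layer counts push this even higher.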


Practical Observations


- With short contexts (2048–4096 tokens), total memory is dominated by the model parameters.

- Long contexts (32k+) significantly increase memory needs.

- Fine-tuning increases memory demands dramatically.


Suggestions for Best Practice

  • Subtract operating system overhead.
  • Estimate separately for model and activations.
  • Quantize when possible (see the comparison sketch after this list).
  • Minimize batch size during inference.
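To illustrate the quantization point, the estimator above can compare FP16 against INT8 for the same model; the 16 GB machine and 7B model here are assumed purely for illustration:

# Compare FP16 vs. INT8 for an assumed 7B model on a 16 GB machine
for precision in ("FP16", "INT8"):
    result = estimate_memory_requirements(total_memory_gb=16,
                                          billion_params=7,
                                          precision=precision,
                                          context_length=4096)
    print(precision, "->", round(result["Total Required (GB)"], 2), "GB,",
          "fits" if result["Fits"] else "does not fit")

In this sketch, dropping from FP16 to INT8 roughly halves the required memory, turning a model that does not fit into one that does.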


Summary Table


| Scenario                  | Memory Dominated By           |
|---------------------------|-------------------------------|
| Short Context + Inference | Model Parameters              |
| Long Context + Inference  | Activations + Parameters      |
| Fine-Tuning               | Parameters + Optimizer State  |


Conclusion


The updated LLM Memory Estimator allows better prediction of whether your hardware can support a specific model. Considering context size is crucial for realistic planning.


Cheat-Sheet: Typical Memory Requirements


- Mistral 7B

  - Parameters: 7B

  - Hidden size: 4096

  - Layers: 32

  - FP16, 4096 tokens: ~14 GB

- LLaMA 2 13B

  - Parameters: 13B

  - Hidden size: 5120

  - Layers: 40

  - FP16, 4096 tokens: ~26 GB

- LLaMA 2 70B

  - Parameters: 70B

  - Hidden size: 8192

  - Layers: 80

  - FP16, 4096 tokens: ~135 GB


For fine-tuning, multiply memory needs by 3 to 5x!
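For example, Mistral 7B at roughly 14 GB for FP16 inference would land somewhere around 42–70 GB for full fine-tuning.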
