Monday, March 09, 2026

UNDERSTANDING AND MANAGING LLM CONTEXT MEMORY: A GUIDE FOR DEVELOPERS




INTRODUCTION


Large Language Models operate within strict memory constraints known as context windows. These windows define the maximum number of tokens a model can process and retain during a single interaction session. Understanding how context memory works and what happens when it becomes full is crucial for developers building robust AI applications.


Context memory represents the temporary storage space where an LLM maintains awareness of the current conversation, previous interactions, and relevant information needed to generate coherent responses. When this memory space becomes exhausted, the model faces significant challenges that can severely impact application performance and user experience.


TECHNICAL ARCHITECTURE OF CONTEXT MEMORY


Context memory in Large Language Models operates through a token-based system where each piece of text gets converted into numerical representations called tokens. The context window represents the maximum number of these tokens the model can actively process and maintain in memory during a single session.


The tokenization process breaks down text into smaller units, which may be words, subwords, or characters depending on the model’s tokenizer. For example, the sentence “Hello world” might be tokenized as [“Hello”, “ world”] using two tokens, while more complex text with special characters or uncommon words may require significantly more tokens.


# Simple tokenization example

def estimate_tokens(text):
    """
    Basic token estimation for demonstration purposes.
    Real implementations use model-specific tokenizers.
    """
    # Rough approximation: 1 token per 4 characters for English text
    return len(text) // 4 + 1

sample_text = "The quick brown fox jumps over the lazy dog."
estimated_tokens = estimate_tokens(sample_text)
print(f"Text: {sample_text}")
print(f"Estimated tokens: {estimated_tokens}")


In practice, applications treat the context window as a sliding buffer: new tokens enter at one end, and when space runs out, the oldest tokens are dropped from the other. This keeps the model within its memory limits, but it introduces the challenge of deciding which information to retain and which to discard.


Modern LLMs typically have context windows ranging from 2,048 tokens in smaller models to over 128,000 tokens in advanced models. However, even these large windows can become insufficient when dealing with lengthy conversations, large documents, or complex multi-turn interactions that accumulate significant token counts over time.


WHAT HAPPENS WHEN CONTEXT MEMORY IS FULL


When an LLM’s context window reaches capacity, several critical issues emerge that directly impact the model’s functionality and the user experience. The most immediate consequence is token truncation, where the model must discard older tokens to accommodate new input.


Token truncation typically follows a “first-in, first-out” approach, meaning the earliest parts of the conversation or input get removed from memory. This creates a sliding window effect where the model loses access to information that occurred earlier in the session, potentially causing it to forget important context, user preferences, or previously established facts.


class ContextWindow:
    def __init__(self, max_tokens=1000):
        """
        Simple context window implementation showing truncation behavior.
        """
        self.max_tokens = max_tokens
        self.current_tokens = []
        self.messages_processed = 0

    def add_message(self, message, token_count):
        """
        Add a new message to the context window.
        Truncates older messages if necessary.
        """
        self.current_tokens.append({
            'message': message,
            'tokens': token_count,
            'sequence_id': self.messages_processed
        })

        # Calculate total tokens in current context
        current_total = sum(item['tokens'] for item in self.current_tokens)

        # Truncate from the beginning (oldest first) if over the limit
        while current_total > self.max_tokens and self.current_tokens:
            removed_item = self.current_tokens.pop(0)
            current_total -= removed_item['tokens']
            print(f"Truncated message {removed_item['sequence_id']}: {removed_item['message'][:50]}...")

        self.messages_processed += 1

    def get_current_context_size(self):
        """Return the current number of tokens in context."""
        return sum(item['tokens'] for item in self.current_tokens)


Performance degradation is another consequence of context saturation. Because each new token must attend to every token already in the window, inference cost and latency grow as the context fills. At the same time, the application layer must decide what to truncate, and the model must maintain coherence despite the missing information, which slows responses and can reduce the quality of generated content.


The loss of conversational coherence becomes particularly problematic in applications requiring long-term memory or reference to earlier parts of a conversation. Users may experience frustration when the AI appears to forget previously discussed topics, agreed-upon facts, or established preferences, leading to repetitive conversations and reduced user satisfaction.


CONTEXT MANAGEMENT STRATEGIES


Effective context management requires implementing multiple strategies that work together to optimize memory usage while preserving important information. The first strategy involves intelligent content prioritization, where developers implement systems that identify and retain the most critical information while allowing less important content to be truncated.


Content prioritization can be achieved through various approaches including recency weighting, where more recent messages receive higher priority, importance scoring based on content analysis, and user-defined importance markers that allow users to flag critical information for retention.


class PriorityContextManager:
    def __init__(self, max_tokens=1000):
        """
        Context manager with priority-based retention.
        """
        self.max_tokens = max_tokens
        self.messages = []
        self.importance_keywords = ['important', 'remember', 'key', 'critical']

    def calculate_priority(self, message, recency_factor):
        """
        Calculate priority score for a message based on various factors.
        """
        base_score = recency_factor  # More recent = higher score

        # Boost score if message contains importance keywords
        importance_boost = sum(1 for keyword in self.importance_keywords
                               if keyword in message.lower())

        # Boost score for questions or requests
        question_boost = 1 if '?' in message else 0

        return base_score + (importance_boost * 2) + question_boost

    def add_message(self, message, token_count):
        """Add message with priority calculation."""
        recency_factor = len(self.messages) + 1  # Higher for more recent
        priority = self.calculate_priority(message, recency_factor)

        self.messages.append({
            'message': message,
            'tokens': token_count,
            'priority': priority,
            'timestamp': len(self.messages)
        })

        self.manage_context_size()

    def manage_context_size(self):
        """Remove lowest priority messages when over token limit."""
        current_total = sum(msg['tokens'] for msg in self.messages)

        while current_total > self.max_tokens and len(self.messages) > 1:
            # Remove the lowest-priority message without reordering the
            # conversation (sorting the list in place would scramble it)
            lowest = min(self.messages, key=lambda msg: msg['priority'])
            self.messages.remove(lowest)
            current_total -= lowest['tokens']
            print(f"Removed low priority message: {lowest['message'][:30]}...")


Context summarization represents another powerful strategy where lengthy conversations or documents are condensed into shorter summaries that capture the essential information while using fewer tokens. This approach allows the system to maintain awareness of earlier interactions without consuming excessive context space.


Summarization can be implemented through extractive methods that select the most important sentences or phrases from the original content, or through abstractive methods that generate new concise summaries capturing the key points. The choice between these approaches depends on the specific requirements of the application and the computational resources available.


class ContextSummarizer:
    def __init__(self, max_tokens=1000, summary_trigger_ratio=0.8):
        """
        Context manager with automatic summarization capabilities.
        """
        self.max_tokens = max_tokens
        self.summary_trigger_ratio = summary_trigger_ratio
        self.messages = []
        self.summaries = []

    def add_message(self, message, token_count):
        """Add message and trigger summarization if needed."""
        self.messages.append({
            'message': message,
            'tokens': token_count,
            'timestamp': len(self.messages)
        })

        current_total = sum(msg['tokens'] for msg in self.messages)
        trigger_threshold = self.max_tokens * self.summary_trigger_ratio

        if current_total > trigger_threshold:
            self.create_summary_and_prune()

    def create_summary_and_prune(self):
        """
        Create summary of older messages and remove them.
        This is a simplified version - real implementation would use
        more sophisticated summarization techniques.
        """
        if len(self.messages) < 3:
            return

        # Take oldest half of messages for summarization
        messages_to_summarize = self.messages[:len(self.messages) // 2]

        # Simple extractive summary - take first and last messages
        summary_content = f"Earlier conversation summary: {messages_to_summarize[0]['message']} ... {messages_to_summarize[-1]['message']}"

        # Estimate summary tokens (much smaller than original)
        summary_tokens = len(summary_content) // 4

        # Store summary and remove original messages
        self.summaries.append({
            'summary': summary_content,
            'tokens': summary_tokens,
            'original_count': len(messages_to_summarize)
        })

        # Remove summarized messages
        self.messages = self.messages[len(messages_to_summarize):]

        print(f"Created summary of {len(messages_to_summarize)} messages")

    def get_full_context(self):
        """Get complete context including summaries and current messages."""
        full_context = []

        # Add all summaries first
        for summary in self.summaries:
            full_context.append(f"[SUMMARY] {summary['summary']}")

        # Add current messages
        for msg in self.messages:
            full_context.append(msg['message'])

        return full_context


Chunking strategies involve breaking large inputs into smaller segments that can be processed individually while maintaining relevant context between chunks. This approach proves particularly useful when dealing with long documents, extensive code bases, or detailed technical documentation that exceeds the model’s context window.


Effective chunking requires careful consideration of content boundaries to avoid breaking related information across different chunks. Semantic chunking approaches analyze content meaning to identify natural break points, while syntactic chunking focuses on structural elements like paragraphs, sections, or code functions.
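To make the syntactic approach concrete, here is a minimal sketch of a paragraph-boundary chunker. The `estimate_tokens` helper and the 200-token budget are illustrative assumptions, not part of any particular library:

```python
def estimate_tokens(text):
    # Rough heuristic: 1 token per 4 characters of English text
    return len(text) // 4 + 1

def chunk_by_paragraph(document, max_chunk_tokens=200):
    """Split a document on paragraph boundaries so related text stays together."""
    chunks = []
    current = []
    current_tokens = 0
    for paragraph in document.split("\n\n"):
        ptokens = estimate_tokens(paragraph)
        # Start a new chunk if adding this paragraph would exceed the budget
        if current and current_tokens + ptokens > max_chunk_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += ptokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "\n\n".join(f"Paragraph {i}. " + "word " * 100 for i in range(5))
chunks = chunk_by_paragraph(doc)
print(f"Split into {len(chunks)} chunks")
```

Splitting only at paragraph boundaries means a chunk may run slightly under budget, but it never severs a sentence; a semantic chunker would replace the paragraph split with a similarity-based break-point search.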


IMPLEMENTING CONTEXT-AWARE APPLICATIONS


Building applications that effectively manage context memory requires implementing monitoring systems that track token usage and provide early warnings when context limits are approaching. These monitoring systems should provide real-time feedback about context utilization and enable proactive management before truncation becomes necessary.


class ContextMonitor:
    def __init__(self, max_tokens=1000, warning_threshold=0.8):
        """
        Monitor context usage and provide warnings.
        """
        self.max_tokens = max_tokens
        self.warning_threshold = warning_threshold
        self.current_tokens = 0
        self.message_history = []

    def add_content(self, content, token_count):
        """Add content and monitor usage."""
        self.current_tokens += token_count
        self.message_history.append({
            'content': content,
            'tokens': token_count,
            'cumulative_tokens': self.current_tokens
        })

        # Check warning threshold
        usage_ratio = self.current_tokens / self.max_tokens
        if usage_ratio >= self.warning_threshold:
            self.emit_warning(usage_ratio)

        # Check if over limit
        if self.current_tokens > self.max_tokens:
            self.handle_overflow()

    def emit_warning(self, usage_ratio):
        """Emit context usage warning."""
        percentage = usage_ratio * 100
        print(f"WARNING: Context usage at {percentage:.1f}% of capacity")
        print(f"Current tokens: {self.current_tokens}/{self.max_tokens}")

    def handle_overflow(self):
        """Handle context overflow situation."""
        print(f"ERROR: Context overflow! {self.current_tokens}/{self.max_tokens} tokens")
        print("Consider implementing context management strategies")

    def get_usage_stats(self):
        """Return current usage statistics."""
        return {
            'current_tokens': self.current_tokens,
            'max_tokens': self.max_tokens,
            'usage_percentage': (self.current_tokens / self.max_tokens) * 100,
            'messages_count': len(self.message_history),
            'average_tokens_per_message': self.current_tokens / max(1, len(self.message_history))
        }


User interface considerations become crucial when implementing context management features. Applications should provide users with visibility into context usage, options to mark important content for retention, and controls for managing conversation length. Clear communication about context limitations helps users understand why certain behaviors occur and how they can optimize their interactions.


The user experience should seamlessly integrate context management without creating cognitive burden for users. This includes providing automatic suggestions for conversation management, offering options to save important information outside the context window, and maintaining conversation flow despite necessary context adjustments.


ADVANCED CONTEXT MANAGEMENT TECHNIQUES


Vector-based context storage represents an advanced technique where important information is stored as dense vector representations that can be retrieved when needed without consuming context window space. This approach allows applications to maintain long-term memory while keeping the active context focused on immediate needs.


Vector storage systems typically use embedding models to convert text into numerical vectors that capture semantic meaning. These vectors can then be stored in vector databases and retrieved based on similarity searches when relevant context is needed for generating responses.


import numpy as np

class VectorContextStore:
    def __init__(self, max_active_tokens=1000, vector_dim=128):
        """
        Context store using vector embeddings for long-term memory.
        """
        self.max_active_tokens = max_active_tokens
        self.vector_dim = vector_dim
        self.active_context = []
        self.vector_store = {}  # In practice, use a proper vector database
        self.next_id = 0

    def create_simple_embedding(self, text):
        """
        Create a simple embedding for demonstration.
        Real implementation would use proper embedding models.
        """
        # Hash-based embedding; RandomState seeds must fit in 32 bits,
        # and Python's hash() can be negative
        seed = abs(hash(text)) % (2 ** 32)
        embedding = np.random.RandomState(seed).randn(self.vector_dim)
        return embedding / np.linalg.norm(embedding)  # Normalize

    def store_in_vector_memory(self, message, tokens):
        """Store message in vector memory for long-term retention."""
        embedding = self.create_simple_embedding(message)

        self.vector_store[self.next_id] = {
            'message': message,
            'tokens': tokens,
            'embedding': embedding,
            'timestamp': self.next_id
        }

        print(f"Stored message {self.next_id} in vector memory")
        self.next_id += 1

    def retrieve_relevant_context(self, query, top_k=3):
        """Retrieve relevant messages from vector memory."""
        if not self.vector_store:
            return []

        query_embedding = self.create_simple_embedding(query)
        similarities = []

        for doc_id, doc_data in self.vector_store.items():
            similarity = np.dot(query_embedding, doc_data['embedding'])
            similarities.append((doc_id, similarity, doc_data))

        # Sort by similarity and return top_k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [item[2] for item in similarities[:top_k]]

    def add_message_with_vector_management(self, message, tokens):
        """Add message with intelligent vector storage management."""
        # Add to active context
        self.active_context.append({
            'message': message,
            'tokens': tokens
        })

        # Check if active context is getting full
        current_tokens = sum(item['tokens'] for item in self.active_context)

        if current_tokens > self.max_active_tokens * 0.8:  # 80% threshold
            # Move oldest messages to vector storage
            while (current_tokens > self.max_active_tokens * 0.6 and
                   len(self.active_context) > 1):
                oldest_message = self.active_context.pop(0)
                self.store_in_vector_memory(
                    oldest_message['message'],
                    oldest_message['tokens']
                )
                current_tokens -= oldest_message['tokens']


Hierarchical context management involves organizing context information into different levels of importance and accessibility. Critical information remains in the immediate context, important information moves to a secondary layer, and background information is stored in long-term memory systems that can be accessed when needed.


This hierarchical approach allows applications to maintain multiple levels of context awareness while optimizing the use of the limited context window for the most relevant information. The system can dynamically promote or demote information between levels based on usage patterns and relevance to current interactions.
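A minimal sketch of such tier promotion and demotion might look like the following. The three-tier layout, the token budgets, and all names here are illustrative assumptions rather than an established API:

```python
class HierarchicalContext:
    """Three illustrative tiers: immediate, secondary, and long-term storage."""

    def __init__(self, immediate_budget=500, secondary_budget=1000):
        self.immediate = []    # full text, always sent to the model
        self.secondary = []    # candidates for inclusion when relevant
        self.long_term = []    # background facts, fetched only on demand
        self.immediate_budget = immediate_budget
        self.secondary_budget = secondary_budget

    def add(self, message, tokens):
        self.immediate.append({'message': message, 'tokens': tokens})
        self._rebalance()

    def _rebalance(self):
        # Demote oldest immediate messages once the budget is exceeded
        while sum(m['tokens'] for m in self.immediate) > self.immediate_budget:
            self.secondary.append(self.immediate.pop(0))
        # Demote from secondary to long-term storage in the same way
        while sum(m['tokens'] for m in self.secondary) > self.secondary_budget:
            self.long_term.append(self.secondary.pop(0))

    def promote(self, index):
        """Pull a secondary-tier message back into the immediate tier."""
        self.immediate.append(self.secondary.pop(index))
        self._rebalance()

ctx = HierarchicalContext(immediate_budget=100)
for i in range(10):
    ctx.add(f"message {i}", tokens=30)
print(len(ctx.immediate), len(ctx.secondary))
```

A production system would demote by priority rather than age and back the long-term tier with the vector store described above.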


PERFORMANCE OPTIMIZATION AND BEST PRACTICES


Token estimation accuracy plays a crucial role in effective context management. Developers should implement precise token counting mechanisms that account for the specific tokenizer used by their chosen model. Inaccurate token estimates can lead to unexpected truncation or inefficient context utilization.


Different models use different tokenization schemes, and token counts can vary significantly between models even for the same text. Applications should be designed to work with the specific tokenizer of their target model and should include safety margins in their context calculations to account for minor variations in token counting.


class AccurateTokenCounter:
    def __init__(self, model_type="gpt-3.5-turbo"):
        """
        Token counter with model-specific estimation.
        """
        self.model_type = model_type
        # In practice, use tiktoken or similar libraries for accurate counting

    def count_tokens(self, text):
        """
        Count tokens for specific model type.
        This is a simplified version - use proper tokenizer libraries.
        """
        # English text averages roughly 1.3 tokens per word (the common
        # rule of thumb is 1 token ~ 0.75 words); per-model values below
        # are rough heuristics
        token_ratios = {
            "gpt-3.5-turbo": 1.33,
            "gpt-4": 1.33,
            "claude": 1.25,
            "llama": 1.2
        }

        word_count = len(text.split())
        ratio = token_ratios.get(self.model_type, 1.33)
        estimated_tokens = int(word_count * ratio)

        # Add buffer for special tokens and formatting
        return max(1, estimated_tokens + 5)

    def estimate_conversation_tokens(self, messages):
        """Estimate total tokens for a conversation."""
        total_tokens = 0

        for message in messages:
            content_tokens = self.count_tokens(message.get('content', ''))
            # Add tokens for message formatting and metadata
            formatting_tokens = 10  # Approximate overhead per message
            total_tokens += content_tokens + formatting_tokens

        return total_tokens


Memory-efficient data structures should be employed to minimize overhead in context management systems. Using appropriate data types, avoiding unnecessary data duplication, and implementing efficient storage mechanisms can significantly improve application performance when dealing with large amounts of context data.


Applications should also implement garbage collection strategies that regularly clean up unused context data and optimize memory usage. This includes removing expired summaries, cleaning up temporary data structures, and ensuring that vector storage systems do not grow unbounded over time.
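One way to sketch such a cleanup pass, assuming summaries carry a `created_at` timestamp and an illustrative age-based expiry policy (both assumptions for this example):

```python
import time

def prune_expired_summaries(summaries, max_age_seconds=3600, now=None):
    """Drop summaries older than the expiry window; return the survivors."""
    now = time.time() if now is None else now
    kept = [s for s in summaries if now - s['created_at'] <= max_age_seconds]
    removed = len(summaries) - len(kept)
    if removed:
        print(f"Garbage collected {removed} expired summaries")
    return kept

now = 10_000.0
summaries = [
    {'summary': 'old topic', 'created_at': now - 7200},   # past expiry
    {'summary': 'recent topic', 'created_at': now - 60},  # still fresh
]
summaries = prune_expired_summaries(summaries, max_age_seconds=3600, now=now)
print([s['summary'] for s in summaries])
```

Running the same pass on a schedule (or after every N messages) keeps summary and vector stores from growing without bound.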


ERROR HANDLING AND RECOVERY STRATEGIES


Robust error handling becomes essential when dealing with context memory limitations. Applications should gracefully handle situations where context management fails, token estimation proves inaccurate, or external storage systems become unavailable.


Fallback mechanisms should be implemented to ensure application functionality continues even when advanced context management features fail. This might include reverting to simple truncation strategies, providing user notifications about reduced functionality, or implementing retry mechanisms for transient failures.


class RobustContextManager:
    def __init__(self, max_tokens=1000):
        """Context manager with comprehensive error handling."""
        self.max_tokens = max_tokens
        self.messages = []
        self.error_count = 0
        self.fallback_mode = False

    def add_message_safely(self, message, estimated_tokens=None):
        """Add message with error handling and recovery."""
        try:
            # Attempt accurate token counting
            if estimated_tokens is None:
                estimated_tokens = self.estimate_tokens_safely(message)

            # Validate inputs
            if not isinstance(message, str) or estimated_tokens <= 0:
                raise ValueError("Invalid message or token count")

            # Attempt advanced context management
            if not self.fallback_mode:
                self.advanced_context_management(message, estimated_tokens)
            else:
                self.simple_context_management(message, estimated_tokens)

            # Reset error count on success
            self.error_count = 0

        except Exception as e:
            self.handle_context_error(e, message, estimated_tokens)

    def estimate_tokens_safely(self, message):
        """Estimate tokens with error handling."""
        try:
            # Word-based estimate: roughly 1.33 tokens per English word
            return int(len(message.split()) * 1.33) + 5
        except Exception:
            # Fallback to character-based estimation
            return len(message) // 4 + 1

    def advanced_context_management(self, message, tokens):
        """Implement advanced context management with error handling."""
        # This would include summarization, priority management, etc.
        # For demonstration, we'll use simple management
        self.simple_context_management(message, tokens)

    def simple_context_management(self, message, tokens):
        """Simple fallback context management."""
        self.messages.append({'message': message, 'tokens': tokens})

        # Simple truncation if over limit
        current_total = sum(msg['tokens'] for msg in self.messages)
        while current_total > self.max_tokens and len(self.messages) > 1:
            removed = self.messages.pop(0)
            current_total -= removed['tokens']

    def handle_context_error(self, error, message, tokens):
        """Handle context management errors."""
        self.error_count += 1
        print(f"Context management error: {error}")

        # Switch to fallback mode after multiple errors
        if self.error_count >= 3:
            self.fallback_mode = True
            print("Switching to fallback context management mode")

        # Attempt to add message using simple method
        try:
            self.simple_context_management(message, tokens or len(message) // 4)
        except Exception as fallback_error:
            print(f"Critical context error: {fallback_error}")
            # Log error but continue operation


MONITORING AND ANALYTICS


Context usage analytics provide valuable insights into application performance and user behavior patterns. Implementing comprehensive monitoring systems helps developers identify optimization opportunities and understand how context limitations impact user experience.


Key metrics to track include average context utilization, frequency of truncation events, user session lengths, and the effectiveness of different context management strategies. This data enables continuous improvement of context management algorithms and helps guide feature development priorities.
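As an illustration, a small aggregator over per-session records might compute those metrics as follows; the record fields (`context_tokens`, `truncations`, `messages`) are assumptions for this sketch:

```python
def summarize_context_metrics(sessions, max_tokens=2000):
    """Aggregate context-usage metrics across recorded sessions."""
    if not sessions:
        return {}
    total = len(sessions)
    return {
        # Mean context fill level as a percentage of the window
        'avg_utilization_pct': sum(s['context_tokens'] for s in sessions) / total / max_tokens * 100,
        # Fraction of sessions that hit at least one truncation event
        'truncation_rate': sum(1 for s in sessions if s['truncations'] > 0) / total,
        # Mean conversation length in messages
        'avg_session_messages': sum(s['messages'] for s in sessions) / total,
    }

sessions = [
    {'context_tokens': 1500, 'truncations': 2, 'messages': 12},
    {'context_tokens': 500,  'truncations': 0, 'messages': 4},
]
print(summarize_context_metrics(sessions))
```

Feeding these aggregates into a dashboard or alerting system makes it easy to spot when a rising truncation rate signals that the context management strategy needs tuning.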


CONCLUSION


Managing LLM context memory effectively requires a comprehensive approach combining multiple strategies and robust implementation practices. Developers must understand the technical constraints, implement appropriate management strategies, and provide seamless user experiences despite these limitations.


Success in context management comes from careful planning, implementation of multiple complementary strategies, and continuous monitoring and optimization. The techniques and code examples presented in this article provide a foundation for building robust context-aware applications that can handle the challenges of limited context memory while delivering excellent user experiences.


The future of context management continues to evolve with advances in model architectures, more efficient tokenization schemes, and improved memory management techniques. Developers should stay informed about these developments while focusing on implementing solid fundamental practices that will remain relevant regardless of technological advances.


COMPLETE RUNNING EXAMPLE: INTELLIGENT CHATBOT WITH COMPREHENSIVE CONTEXT MANAGEMENT


import numpy as np
import json
from typing import List, Dict, Optional, Tuple
from datetime import datetime
import hashlib

class IntelligentChatbot:
    """
    Complete chatbot implementation with comprehensive context management.
    Demonstrates all context management strategies discussed in the article.
    """

    def __init__(self, max_context_tokens=2000, model_type="gpt-3.5-turbo"):
        # Core configuration
        self.max_context_tokens = max_context_tokens
        self.model_type = model_type
        self.warning_threshold = 0.8
        self.summarization_threshold = 0.75

        # Context storage
        self.active_messages = []
        self.conversation_summaries = []
        self.vector_memory = {}
        self.user_preferences = {}

        # Management components
        self.token_counter = TokenCounter(model_type)
        self.priority_manager = PriorityManager()
        self.summarizer = ConversationSummarizer()
        self.vector_store = VectorMemoryStore()

        # Monitoring and statistics
        self.stats = {
            'total_messages': 0,
            'truncated_messages': 0,
            'summaries_created': 0,
            'vector_stores': 0,
            'context_warnings': 0
        }

        # Error handling
        self.error_count = 0
        self.fallback_mode = False

        print(f"Intelligent Chatbot initialized with {max_context_tokens} token limit")

    def process_user_message(self, user_message: str, user_id: str = "default") -> str:
        """
        Process a user message with comprehensive context management.
        """
        try:
            # Update statistics
            self.stats['total_messages'] += 1

            # Count tokens for the new message
            message_tokens = self.token_counter.count_message_tokens(user_message, "user")

            # Create message object
            message_obj = {
                'content': user_message,
                'role': 'user',
                'tokens': message_tokens,
                'timestamp': datetime.now(),
                'user_id': user_id,
                'priority': self.priority_manager.calculate_priority(user_message),
                'message_id': self.generate_message_id()
            }

            # Add to active context
            self.add_message_to_context(message_obj)

            # Manage context size
            self.manage_context_overflow()

            # Generate response
            response = self.generate_response(user_message, user_id)

            # Add response to context
            response_tokens = self.token_counter.count_message_tokens(response, "assistant")
            response_obj = {
                'content': response,
                'role': 'assistant',
                'tokens': response_tokens,
                'timestamp': datetime.now(),
                'user_id': user_id,
                'priority': 5,  # Assistant messages get standard priority
                'message_id': self.generate_message_id()
            }

            self.add_message_to_context(response_obj)
            self.manage_context_overflow()

            return response

        except Exception as e:
            return self.handle_processing_error(e, user_message)

    def add_message_to_context(self, message_obj: Dict):
        """Add message to active context with priority management."""
        self.active_messages.append(message_obj)

        # Check for context warnings
        current_tokens = self.get_current_token_count()
        if current_tokens > self.max_context_tokens * self.warning_threshold:
            self.emit_context_warning(current_tokens)

    def manage_context_overflow(self):
        """Manage context when approaching or exceeding token limits."""
        current_tokens = self.get_current_token_count()

        if current_tokens > self.max_context_tokens * self.summarization_threshold:
            self.trigger_context_management()

    def trigger_context_management(self):
        """Trigger appropriate context management strategy."""
        try:
            current_tokens = self.get_current_token_count()

            print(f"Context management triggered: {current_tokens}/{self.max_context_tokens} tokens")

            # Strategy 1: Remove low-priority messages
            self.remove_low_priority_messages()

            # Strategy 2: Create summary if still over threshold
            if self.get_current_token_count() > self.max_context_tokens * 0.9:
                self.create_conversation_summary()

            # Strategy 3: Store important messages in vector memory
            if self.get_current_token_count() > self.max_context_tokens * 0.8:
                self.move_to_vector_memory()

            # Strategy 4: Final truncation if necessary
            if self.get_current_token_count() > self.max_context_tokens:
                self.perform_emergency_truncation()

        except Exception as e:
            print(f"Context management error: {e}")
            self.perform_emergency_truncation()

    def remove_low_priority_messages(self):
        """Remove messages with lowest priority scores."""
        if len(self.active_messages) <= 2:  # Keep at least 2 messages
            return

        # Sort by priority (ascending) and remove lowest
        messages_by_priority = sorted(self.active_messages, key=lambda x: x['priority'])

        removed_count = 0
        while (self.get_current_token_count() > self.max_context_tokens * 0.8 and
               len(self.active_messages) > 2 and
               removed_count < len(messages_by_priority) // 3):

            message_to_remove = messages_by_priority[removed_count]
            if message_to_remove in self.active_messages:
                self.active_messages.remove(message_to_remove)

                self.stats['truncated_messages'] += 1

                print(f"Removed low priority message: {message_to_remove['content'][:50]}...")

            

            removed_count += 1

    

    def create_conversation_summary(self):

        """Create summary of older conversation parts."""

        if len(self.active_messages) < 4:

            return

        

        # Take oldest 1/3 of messages for summarization

        messages_to_summarize = self.active_messages[:len(self.active_messages)//3]

        

        summary = self.summarizer.create_summary(messages_to_summarize)

        summary_tokens = self.token_counter.count_text_tokens(summary)

        

        # Store summary

        summary_obj = {

            'summary': summary,

            'tokens': summary_tokens,

            'original_message_count': len(messages_to_summarize),

            'created_at': datetime.now(),

            'summary_id': self.generate_message_id()

        }

        

        self.conversation_summaries.append(summary_obj)

        

        # Remove original messages

        for message in messages_to_summarize:

            if message in self.active_messages:

                self.active_messages.remove(message)

        

        self.stats['summaries_created'] += 1

        print(f"Created summary of {len(messages_to_summarize)} messages ({summary_tokens} tokens)")

    

    def move_to_vector_memory(self):

        """Move important messages to vector memory storage."""

        if len(self.active_messages) <= 3:

            return

        

        # Find messages suitable for vector storage (high importance, not recent)

        candidates = [msg for msg in self.active_messages[:-2]  # Keep last 2 messages

                     if msg['priority'] >= 7]  # High priority messages

        

        for message in candidates[:2]:  # Move up to 2 messages

            self.vector_store.store_message(message)

            if message in self.active_messages:

                self.active_messages.remove(message)

                self.stats['vector_stores'] += 1

                print(f"Moved to vector memory: {message['content'][:50]}...")

    

    def perform_emergency_truncation(self):

        """Perform simple truncation when other strategies fail."""

        target_tokens = int(self.max_context_tokens * 0.7)  # Target 70% capacity

        

        while self.get_current_token_count() > target_tokens and len(self.active_messages) > 1:

            removed_message = self.active_messages.pop(0)

            self.stats['truncated_messages'] += 1

            print(f"Emergency truncation: {removed_message['content'][:30]}...")

    

    def generate_response(self, user_message: str, user_id: str) -> str:

        """

        Generate response using current context and retrieved memory.

        """

        # Retrieve relevant information from vector memory

        relevant_memories = self.vector_store.retrieve_relevant(user_message, top_k=2)

        

        # Build context for response generation

        context_parts = []

        

        # Add conversation summaries

        for summary in self.conversation_summaries[-2:]:  # Last 2 summaries

            context_parts.append(f"[Previous conversation: {summary['summary']}]")

        

        # Add relevant memories

        for memory in relevant_memories:

            context_parts.append(f"[Relevant context: {memory['content']}]")

        

        # Add current active messages

        for message in self.active_messages:

            context_parts.append(f"{message['role']}: {message['content']}")

        

        # Simulate response generation (in practice, this would call actual LLM)

        response = self.simulate_llm_response(user_message, context_parts, user_id)

        

        return response

    

    def simulate_llm_response(self, user_message: str, context_parts: List[str], user_id: str) -> str:

        """

        Simulate LLM response generation.

        In practice, this would call the actual language model API.

        """

        # Simple response simulation based on user message

        if "hello" in user_message.lower():

            return "Hello! How can I help you today? I have access to our previous conversations and relevant context."

        elif "remember" in user_message.lower():

            return "I maintain context across our conversation and can reference relevant past discussions when needed."

        elif "context" in user_message.lower():

            return f"I'm currently managing {len(self.active_messages)} active messages, {len(self.conversation_summaries)} summaries, and {len(self.vector_store.memory_store)} vector memories."

        elif "?" in user_message:

            return "That's an interesting question. Based on our conversation context, here's what I think..."

        else:

            return f"I understand you're saying: '{user_message}'. Let me respond appropriately based on our conversation history."

    

    def get_current_token_count(self) -> int:

        """Calculate current total token count in active context."""

        total = sum(msg['tokens'] for msg in self.active_messages)

        total += sum(summary['tokens'] for summary in self.conversation_summaries)

        return total

    

    def emit_context_warning(self, current_tokens: int):

        """Emit warning when context usage is high."""

        usage_percentage = (current_tokens / self.max_context_tokens) * 100

        print(f"Context Warning: {usage_percentage:.1f}% capacity ({current_tokens}/{self.max_context_tokens} tokens)")

        self.stats['context_warnings'] += 1

    

    def generate_message_id(self) -> str:

        """Generate unique message ID."""

        return hashlib.md5(f"{datetime.now().isoformat()}{self.stats['total_messages']}".encode()).hexdigest()[:8]

    

    def handle_processing_error(self, error: Exception, user_message: str) -> str:

        """Handle errors during message processing."""

        self.error_count += 1

        print(f"Processing error: {error}")

        

        if self.error_count >= 3:

            self.fallback_mode = True

            print("Switching to fallback mode due to repeated errors")

        

        return "I apologize, but I encountered an error processing your message. Please try rephrasing your request."

    

    def get_conversation_stats(self) -> Dict:

        """Get comprehensive conversation statistics."""

        return {

            **self.stats,

            'current_active_messages': len(self.active_messages),

            'current_summaries': len(self.conversation_summaries),

            'current_vector_memories': len(self.vector_store.memory_store),

            'current_tokens': self.get_current_token_count(),

            'context_utilization': f"{(self.get_current_token_count() / self.max_context_tokens) * 100:.1f}%",

            'fallback_mode': self.fallback_mode

        }

    

    def export_conversation_history(self) -> Dict:

        """Export complete conversation history for analysis or backup."""

        return {

            'active_messages': self.active_messages,

            'summaries': self.conversation_summaries,

            'vector_memory': dict(self.vector_store.memory_store),

            'stats': self.get_conversation_stats(),

            'export_timestamp': datetime.now().isoformat()

        }


class TokenCounter:

    """Approximate word-based token counting for different model types."""

    

    def __init__(self, model_type: str):

        self.model_type = model_type

        self.token_ratios = {

            "gpt-3.5-turbo": 0.75,

            "gpt-4": 0.75,

            "claude": 0.8,

            "llama": 0.85

        }

    

    def count_text_tokens(self, text: str) -> int:

        """Count tokens in plain text."""

        if not text:

            return 0

        

        word_count = len(text.split())

        # Ratios store words-per-token, so divide: ~0.75 words per token
        # means roughly 1.33 tokens per word for GPT-style models
        ratio = self.token_ratios.get(self.model_type, 0.75)

        base_tokens = int(word_count / ratio)

        
 
        # Add buffer for special tokens

        return max(1, base_tokens + 3)

    

    def count_message_tokens(self, content: str, role: str) -> int:

        """Count tokens for a complete message including formatting."""

        content_tokens = self.count_text_tokens(content)

        formatting_tokens = 8  # Overhead for message structure

        role_tokens = self.count_text_tokens(role)

        

        return content_tokens + formatting_tokens + role_tokens
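
The word-count heuristic above is only an approximation. When exact counts matter, a model-specific tokenizer can be swapped in; the sketch below assumes the third-party `tiktoken` package (not used elsewhere in this article) and falls back to the same style of heuristic when the package is not installed:

```python
def count_tokens_exact(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Exact token count via a real tokenizer, with a heuristic fallback.

    Assumes the optional third-party `tiktoken` package; if it is
    missing, falls back to the words-per-token approximation.
    """
    try:
        import tiktoken  # optional dependency
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except ImportError:
        # ~0.75 words per token for GPT-style models
        return max(1, int(len(text.split()) / 0.75) + 3)
```

Either path returns a usable count, so the rest of the context-management pipeline does not need to know which tokenizer produced it.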


class PriorityManager:

    """Manage message priorities for intelligent context management."""

    

    def __init__(self):

        self.importance_keywords = [

            'important', 'critical', 'urgent', 'remember', 'key', 'essential',

            'don\'t forget', 'make sure', 'note that', 'please remember'

        ]

        self.question_indicators = ['?', 'how', 'what', 'when', 'where', 'why', 'who']

    

    def calculate_priority(self, message: str) -> int:

        """Calculate priority score for a message (1-10 scale)."""

        base_priority = 5  # Default priority

        

        message_lower = message.lower()

        

        # Boost for importance keywords

        importance_boost = sum(2 for keyword in self.importance_keywords 

                             if keyword in message_lower)

        

        # Boost for questions

        question_boost = 1 if any(indicator in message_lower for indicator in self.question_indicators) else 0

        

        # Boost for longer, detailed messages

        length_boost = 1 if len(message) > 100 else 0

        

        # Boost for messages with specific requests

        request_boost = 1 if any(word in message_lower for word in ['please', 'can you', 'would you']) else 0

        

        final_priority = min(10, base_priority + importance_boost + question_boost + length_boost + request_boost)

        

        return final_priority
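
As a quick illustration of the scoring rules above, here is a self-contained copy of the same arithmetic (a standalone function, not the class itself): a short message that combines importance keywords with a polite request saturates at the cap of 10, while plain small talk stays at the base score of 5.

```python
def score(message: str) -> int:
    """Standalone copy of the priority arithmetic, for illustration."""
    importance = ['important', 'critical', 'urgent', 'remember', 'key',
                  'essential', "don't forget", 'make sure', 'note that',
                  'please remember']
    questions = ['?', 'how', 'what', 'when', 'where', 'why', 'who']
    m = message.lower()
    total = 5                                             # base priority
    total += sum(2 for kw in importance if kw in m)       # importance boost
    total += 1 if any(q in m for q in questions) else 0   # question boost
    total += 1 if len(message) > 100 else 0               # length boost
    total += 1 if any(w in m for w in ['please', 'can you', 'would you']) else 0
    return min(10, total)

print(score("Important: please remember the Friday deadline"))  # capped at 10
print(score("hello there"))                                     # base 5
```

Note that keyword matching is substring-based, so "please remember" triggers three importance hits at once ('remember', 'please remember', and 'important' in this example), which is why scores reach the cap quickly.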


class ConversationSummarizer:

    """Create intelligent summaries of conversation segments."""

    

    def create_summary(self, messages: List[Dict]) -> str:

        """Create a concise summary of multiple messages."""

        if not messages:

            return ""

        

        # Extract key information from messages

        topics = set()

        questions = []

        important_statements = []

        

        for message in messages:

            content = message['content']

            

            # Identify questions

            if '?' in content:

                questions.append(content)

            

            # Identify important statements

            if message.get('priority', 0) >= 7:

                important_statements.append(content)

            

            # Extract potential topics (simple keyword extraction)

            words = content.lower().split()

            topics.update(word for word in words if len(word) > 4)

        

        # Build summary

        summary_parts = []

        

        if topics:

            main_topics = sorted(topics)[:3]  # First 3 alphabetically (set order is unspecified)

            summary_parts.append(f"Discussion topics: {', '.join(main_topics)}")

        

        if questions:

            summary_parts.append(f"Key question: {questions[-1][:100]}...")

        

        if important_statements:

            summary_parts.append(f"Important note: {important_statements[-1][:100]}...")

        

        if not summary_parts:

            # Fallback summary

            first_message = messages[0]['content'][:100]

            last_message = messages[-1]['content'][:100]

            summary_parts.append(f"Conversation from '{first_message}...' to '{last_message}...'")

        

        return " | ".join(summary_parts)


class VectorMemoryStore:

    """Store and retrieve messages using vector embeddings."""

    

    def __init__(self, embedding_dim: int = 128):

        self.embedding_dim = embedding_dim

        self.memory_store = {}

        self.next_id = 0

    

    def create_embedding(self, text: str) -> np.ndarray:

        """Create simple embedding for text (in practice, use proper embedding models)."""

        # Hash-based embedding for demonstration; hashlib keeps it
        # deterministic (built-in hash() is salted per process for strings)

        hash_value = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)

        embedding = np.random.RandomState(hash_value).randn(self.embedding_dim)

        return embedding / np.linalg.norm(embedding)

    

    def store_message(self, message: Dict):

        """Store message in vector memory."""

        embedding = self.create_embedding(message['content'])

        

        self.memory_store[self.next_id] = {

            'content': message['content'],

            'role': message['role'],

            'timestamp': message['timestamp'],

            'priority': message['priority'],

            'embedding': embedding,

            'tokens': message['tokens']

        }

        

        self.next_id += 1

    

    def retrieve_relevant(self, query: str, top_k: int = 3) -> List[Dict]:

        """Retrieve most relevant messages for a query."""

        if not self.memory_store:

            return []

        

        query_embedding = self.create_embedding(query)

        similarities = []

        

        for memory_id, memory_data in self.memory_store.items():

            similarity = np.dot(query_embedding, memory_data['embedding'])

            similarities.append((similarity, memory_data))

        

        # Sort by similarity and return top results

        similarities.sort(key=lambda x: x[0], reverse=True)

        return [item[1] for item in similarities[:top_k]]
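
The hash-based embedding above only makes identical strings match; a real system would use a learned embedding model so that related wording lands nearby in vector space. The dot-product ranking itself is just cosine similarity over unit vectors, and the same idea can be sketched without NumPy using toy bag-of-words vectors (illustration only, not the class's actual method):

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Toy bag-of-words 'embedding': maps each word to its count."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = ["transformer attention mechanisms",
        "class imbalance in the dataset",
        "attention is all you need"]
query = "how does attention work"
ranked = sorted(docs, key=lambda d: cosine(bow_vector(query), bow_vector(d)),
                reverse=True)
print(ranked[0])  # prints "transformer attention mechanisms"
```

Because every stored vector is normalized, ranking by raw dot product (as `retrieve_relevant` does) and ranking by cosine similarity give the same order.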


# Demonstration of the complete system

def demonstrate_intelligent_chatbot():

    """Demonstrate the chatbot with various context management scenarios."""

    print("=== INTELLIGENT CHATBOT DEMONSTRATION ===\n")

    

    # Initialize chatbot with small context window for demonstration

    chatbot = IntelligentChatbot(max_context_tokens=500, model_type="gpt-3.5-turbo")

    

    # Simulate conversation that will trigger context management

    test_messages = [

        "Hello, I'm working on a machine learning project about natural language processing.",

        "I need to implement a transformer model for text classification. Can you help?",

        "The dataset has 50,000 training samples and 10,000 test samples.",

        "Important: Remember that I'm using PyTorch and prefer clean, documented code.",

        "What are the key components of a transformer architecture?",

        "How do attention mechanisms work in transformers?",

        "Can you explain the difference between encoder and decoder architectures?",

        "I'm particularly interested in BERT-style models for my classification task.",

        "What preprocessing steps should I consider for text classification?",

        "How should I handle class imbalance in my dataset?",

        "Please remember: my deadline is next Friday, so time is important.",

        "What evaluation metrics would you recommend for text classification?",

        "How can I fine-tune a pre-trained model for my specific domain?",

        "What are some common pitfalls to avoid in NLP projects?",

        "Can you provide code examples for implementing the data loading pipeline?"

    ]

    

    print("Starting conversation simulation...\n")

    

    for i, message in enumerate(test_messages, 1):

        print(f"--- Message {i} ---")

        print(f"User: {message}")

        

        response = chatbot.process_user_message(message, user_id="demo_user")

        print(f"Bot: {response}")

        

        # Show context stats every few messages

        if i % 3 == 0:

            stats = chatbot.get_conversation_stats()

            print(f"\nContext Stats: {stats['current_tokens']}/{chatbot.max_context_tokens} tokens " +

                  f"({stats['context_utilization']})")

            print(f"Active messages: {stats['current_active_messages']}, " +

                  f"Summaries: {stats['current_summaries']}, " +

                  f"Vector memories: {stats['current_vector_memories']}")

        

        print()

    

    # Final statistics

    print("=== FINAL CONVERSATION STATISTICS ===")

    final_stats = chatbot.get_conversation_stats()

    for key, value in final_stats.items():

        print(f"{key}: {value}")

    

    return chatbot


# Run the demonstration

if __name__ == "__main__":

    demo_chatbot = demonstrate_intelligent_chatbot()