Hitchhiker's Guide to AI, Software Architecture, and Everything Else: Building a TypeScript Library for LLM Application Development

Preface

In this article I propose a similar library for handling common tasks in LLM applications in TypeScript to what I introduced in my last post for Python.

Introduction

Developing applications that integrate Large Language Models in TypeScript environments requires solving many recurring technical challenges. Whether building Node.js backend services, Electron desktop applications, or serverless functions, developers repeatedly implement GPU detection, configuration management, tool calling mechanisms, rate limiting, and other foundational components. This article presents a comprehensive TypeScript library designed to eliminate this redundant work by providing production-ready, reusable components that handle the most common requirements of LLM-based applications.

The library follows clean architecture principles with clear separation of concerns and leverages TypeScript's powerful type system to provide compile-time safety and excellent developer experience. Each component is designed to be independently usable while also working seamlessly with other components. The goal is to provide developers with a toolkit that accelerates development without imposing rigid constraints on application architecture.

This article explores each component in depth, explaining not just what it does but why it is designed in a particular way. We will examine the technical challenges each component addresses and provide concrete code examples. At the end, a complete running example demonstrates how all components integrate to create a functional LLM application.

GPU Detection and Optimization Component

Modern LLM inference requires significant computational resources, and utilizing available GPU acceleration is essential for acceptable performance. In TypeScript environments, particularly Node.js applications, detecting and configuring GPU acceleration requires interfacing with native libraries and system information. Different hardware platforms use different GPU technologies: Apple Silicon uses Metal Performance Shaders, NVIDIA uses CUDA, AMD uses ROCm, and Intel has its own acceleration framework.

The GPU detection component solves this problem by automatically identifying available acceleration hardware and providing the appropriate configuration for the inference engine. This component abstracts away platform-specific details, allowing the rest of the application to remain hardware-agnostic.

The detection process follows a priority order. CUDA is checked first because it offers the most mature ecosystem for LLM inference. If CUDA is unavailable, the component checks for ROCm on AMD hardware, then Metal Performance Shaders on Apple Silicon, and finally Intel acceleration. If no GPU is available, the component falls back to CPU inference with appropriate warnings.

Here is the core implementation of the GPU detector:

import { exec } from 'child_process';
import { promisify } from 'util';
import * as os from 'os';
import { createLogger, Logger } from './logger';

const execAsync = promisify(exec);

export enum AcceleratorType {
    CUDA = 'cuda',
    ROCM = 'rocm',
    MPS = 'mps',
    INTEL = 'intel',
    CPU = 'cpu'
}

export interface AcceleratorInfo {
    acceleratorType: AcceleratorType;
    deviceName: string;
    deviceCount: number;
    memoryAvailable?: number;
    computeCapability?: string;
}

export class GPUDetector {
    private logger: Logger;
    private cachedInfo?: AcceleratorInfo;

    constructor() {
        this.logger = createLogger('GPUDetector');
    }

    public async detect(): Promise<AcceleratorInfo> {
        if (this.cachedInfo) {
            return this.cachedInfo;
        }

        // Check for NVIDIA CUDA
        const cudaInfo = await this.detectCuda();
        if (cudaInfo) {
            this.cachedInfo = cudaInfo;
            this.logger.info(`Detected CUDA GPU: ${cudaInfo.deviceName}`);
            return cudaInfo;
        }

        // Check for AMD ROCm
        const rocmInfo = await this.detectRocm();
        if (rocmInfo) {
            this.cachedInfo = rocmInfo;
            this.logger.info(`Detected ROCm GPU: ${rocmInfo.deviceName}`);
            return rocmInfo;
        }

        // Check for Apple Metal Performance Shaders
        const mpsInfo = await this.detectMps();
        if (mpsInfo) {
            this.cachedInfo = mpsInfo;
            this.logger.info(`Detected Apple MPS: ${mpsInfo.deviceName}`);
            return mpsInfo;
        }

        // Fallback to CPU
        this.cachedInfo = this.detectCpu();
        this.logger.warn('No GPU acceleration available, using CPU');
        return this.cachedInfo;
    }

The detector implements lazy initialization through caching. Once hardware is detected, subsequent calls return the cached result rather than repeating the detection process. This is important because hardware detection can be expensive and the hardware configuration does not change during application runtime. The async/await pattern is used throughout because detection involves executing system commands and reading files, which are asynchronous operations in Node.js.

Each detection method gathers platform-specific information. For CUDA, this includes device count, memory, and compute capability. For MPS, it includes the chip architecture. This information helps the application make informed decisions about model loading and inference parameters.

    private async detectCuda(): Promise<AcceleratorInfo | null> {
        try {
            // Try to execute nvidia-smi to detect CUDA devices
            const { stdout } = await execAsync('nvidia-smi --query-gpu=name,memory.total --format=csv,noheader');
            
            const lines = stdout.trim().split('\n');
            if (lines.length === 0) {
                return null;
            }

            const firstLine = lines[0].split(',');
            const deviceName = firstLine[0].trim();
            const memoryStr = firstLine[1].trim();
            const memoryMB = parseInt(memoryStr.split(' ')[0]);

            // Get compute capability
            let computeCapability: string | undefined;
            try {
                const { stdout: capStdout } = await execAsync(
                    'nvidia-smi --query-gpu=compute_cap --format=csv,noheader'
                );
                computeCapability = capStdout.trim().split('\n')[0];
            } catch (error) {
                // Compute capability not critical
            }

            return {
                acceleratorType: AcceleratorType.CUDA,
                deviceName,
                deviceCount: lines.length,
                memoryAvailable: memoryMB * 1024 * 1024,
                computeCapability
            };
        } catch (error) {
            // nvidia-smi not available or failed
            return null;
        }
    }

    private async detectMps(): Promise<AcceleratorInfo | null> {
        if (os.platform() !== 'darwin') {
            return null;
        }

        try {
            // Check for Apple Silicon
            const { stdout } = await execAsync('sysctl -n machdep.cpu.brand_string');
            const cpuBrand = stdout.trim();

            if (cpuBrand.includes('Apple')) {
                return {
                    acceleratorType: AcceleratorType.MPS,
                    deviceName: `Apple Silicon ${cpuBrand}`,
                    deviceCount: 1
                };
            }
        } catch (error) {
            // Not Apple Silicon or command failed
        }

        return null;
    }

    private async detectRocm(): Promise<AcceleratorInfo | null> {
        try {
            // Try to execute rocm-smi to detect ROCm devices
            const { stdout } = await execAsync('rocm-smi --showproductname');
            
            if (stdout.includes('GPU')) {
                const lines = stdout.trim().split('\n');
                const deviceLine = lines.find(line => line.includes('GPU'));
                
                return {
                    acceleratorType: AcceleratorType.ROCM,
                    deviceName: deviceLine || 'AMD ROCm GPU',
                    deviceCount: 1
                };
            }
        } catch (error) {
            // rocm-smi not available or failed
        }

        return null;
    }

    private detectCpu(): AcceleratorInfo {
        const cpuCount = os.cpus().length;
        const cpuModel = os.cpus()[0].model;

        return {
            acceleratorType: AcceleratorType.CPU,
            deviceName: cpuModel,
            deviceCount: cpuCount
        };
    }

    public getDeviceString(): string {
        if (!this.cachedInfo) {
            throw new Error('GPU detection not performed. Call detect() first.');
        }

        const info = this.cachedInfo;
        switch (info.acceleratorType) {
            case AcceleratorType.CUDA:
                return 'cuda:0';
            case AcceleratorType.MPS:
                return 'mps';
            case AcceleratorType.ROCM:
                return 'rocm:0';
            default:
                return 'cpu';
        }
    }

    public getOptimizationHints(): Record<string, any> {
        if (!this.cachedInfo) {
            throw new Error('GPU detection not performed. Call detect() first.');
        }

        const hints: Record<string, any> = {
            device: this.getDeviceString(),
            deviceType: this.cachedInfo.acceleratorType
        };

        // Add memory-based recommendations
        if (this.cachedInfo.memoryAvailable) {
            const memoryGB = this.cachedInfo.memoryAvailable / (1024 * 1024 * 1024);
            
            if (memoryGB < 8) {
                hints.recommendQuantization = true;
                hints.maxBatchSize = 1;
            } else if (memoryGB < 16) {
                hints.recommendQuantization = false;
                hints.maxBatchSize = 4;
            } else {
                hints.recommendQuantization = false;
                hints.maxBatchSize = 8;
            }
        }

        return hints;
    }
}

The getDeviceString method provides the string that inference libraries expect when loading models. This abstraction means application code can simply call getDeviceString without knowing anything about the underlying hardware. The getOptimizationHints method provides recommendations based on detected hardware, such as whether to use quantization or what batch size to use.

The component uses TypeScript's type system to ensure type safety. The AcceleratorType enum provides a finite set of possible accelerator types, and the AcceleratorInfo interface defines the structure of hardware information. This prevents runtime errors from typos or incorrect data structures.

Abstract LLM Interface Component

LLM applications often need to support multiple model providers. An application might use OpenAI's GPT models in production but switch to a local Llama model for development or privacy-sensitive deployments. Alternatively, different parts of an application might use different models optimized for specific tasks.

Hard-coding dependencies on specific LLM providers creates tight coupling that makes the application brittle and difficult to modify. The abstract LLM interface component solves this by defining a common contract that all LLM implementations must satisfy. Application code depends on this interface rather than concrete implementations, enabling seamless model swapping.

The interface defines the essential operations that any LLM must support: generating completions, streaming responses, and managing conversation context. It also standardizes how parameters like temperature and top-k are passed to the model.

export interface Message {
    role: 'system' | 'user' | 'assistant';
    content: string;
    name?: string;
}

export interface CompletionResponse {
    content: string;
    model: string;
    finishReason: string;
    usage: {
        promptTokens: number;
        completionTokens: number;
        totalTokens: number;
    };
    rawResponse?: any;
}

export interface CompletionOptions {
    temperature?: number;
    maxTokens?: number;
    topP?: number;
    topK?: number;
    stop?: string[];
    presencePenalty?: number;
    frequencyPenalty?: number;
}

export abstract class BaseLLM {
    protected modelName: string;
    protected config: Record<string, any>;
    protected logger: Logger;

    constructor(modelName: string, config: Record<string, any> = {}) {
        this.modelName = modelName;
        this.config = config;
        this.logger = createLogger(this.constructor.name);
    }

    public abstract complete(
        messages: Message[],
        options?: CompletionOptions
    ): Promise<CompletionResponse>;

    public abstract streamComplete(
        messages: Message[],
        options?: CompletionOptions
    ): AsyncIterableIterator<string>;

    public getModelName(): string {
        return this.modelName;
    }

    protected validateMessages(messages: Message[]): void {
        if (!messages || messages.length === 0) {
            throw new Error('Messages array cannot be empty');
        }

        for (const message of messages) {
            if (!message.role || !message.content) {
                throw new Error('Each message must have role and content');
            }

            if (!['system', 'user', 'assistant'].includes(message.role)) {
                throw new Error(`Invalid role: ${message.role}`);
            }
        }
    }

    protected validateOptions(options?: CompletionOptions): void {
        if (!options) return;

        if (options.temperature !== undefined) {
            if (options.temperature < 0 || options.temperature > 2) {
                throw new Error('Temperature must be between 0 and 2');
            }
        }

        if (options.topP !== undefined) {
            if (options.topP < 0 || options.topP > 1) {
                throw new Error('topP must be between 0 and 1');
            }
        }

        if (options.topK !== undefined && options.topK < 1) {
            throw new Error('topK must be positive');
        }
    }
}

The interface uses TypeScript's type system extensively. The Message interface uses a union type for the role field, ensuring only valid roles can be specified. The CompletionOptions interface makes all fields optional with default values, providing flexibility while maintaining type safety.

The BaseLLM abstract class provides common functionality that all implementations can use. The validateMessages and validateOptions methods ensure that inputs are valid before they reach the implementation-specific code. This validation happens once in the base class rather than being duplicated in each implementation.

Concrete implementations of this interface handle the specifics of communicating with different LLM providers. Here is an implementation for OpenAI's API:

import OpenAI from 'openai';
import { Stream } from 'openai/streaming';

export class OpenAILLM extends BaseLLM {
    private client: OpenAI;

    constructor(modelName: string, apiKey: string, config: Record<string, any> = {}) {
        super(modelName, config);
        
        this.client = new OpenAI({
            apiKey,
            ...config
        });
    }

    public async complete(
        messages: Message[],
        options: CompletionOptions = {}
    ): Promise<CompletionResponse> {
        this.validateMessages(messages);
        this.validateOptions(options);

        const response = await this.client.chat.completions.create({
            model: this.modelName,
            messages: messages as OpenAI.Chat.ChatCompletionMessageParam[],
            temperature: options.temperature ?? 0.7,
            max_tokens: options.maxTokens,
            top_p: options.topP ?? 1.0,
            stop: options.stop,
            presence_penalty: options.presencePenalty,
            frequency_penalty: options.frequencyPenalty
        });

        const choice = response.choices[0];

        return {
            content: choice.message.content || '',
            model: response.model,
            finishReason: choice.finish_reason,
            usage: {
                promptTokens: response.usage?.prompt_tokens || 0,
                completionTokens: response.usage?.completion_tokens || 0,
                totalTokens: response.usage?.total_tokens || 0
            },
            rawResponse: response
        };
    }

    public async *streamComplete(
        messages: Message[],
        options: CompletionOptions = {}
    ): AsyncIterableIterator<string> {
        this.validateMessages(messages);
        this.validateOptions(options);

        const stream = await this.client.chat.completions.create({
            model: this.modelName,
            messages: messages as OpenAI.Chat.ChatCompletionMessageParam[],
            temperature: options.temperature ?? 0.7,
            max_tokens: options.maxTokens,
            top_p: options.topP ?? 1.0,
            stream: true,
            stop: options.stop,
            presence_penalty: options.presencePenalty,
            frequency_penalty: options.frequencyPenalty
        });

        for await (const chunk of stream) {
            const content = chunk.choices[0]?.delta?.content;
            if (content) {
                yield content;
            }
        }
    }
}

The OpenAI implementation demonstrates how the abstract interface adapts to a specific backend. The complete method makes an HTTP request to OpenAI's API and transforms the response into the standard CompletionResponse format. The streamComplete method uses async generators, a powerful TypeScript feature that allows yielding values asynchronously.

The streaming implementation is particularly elegant in TypeScript. The async generator syntax makes it easy to consume streams using for-await-of loops. Application code can iterate over the stream naturally without managing callbacks or event listeners.

Here is an implementation for local models using a hypothetical local inference library:

import { LlamaModel, LlamaContext, LlamaChatSession } from 'node-llama-cpp';

export class LocalLlamaLLM extends BaseLLM {
    private model?: LlamaModel;
    private context?: LlamaContext;
    private device: string;

    constructor(
        modelPath: string,
        device?: string,
        config: Record<string, any> = {}
    ) {
        super(modelPath, config);
        this.device = device || 'cpu';
    }

    public async initialize(): Promise<void> {
        this.logger.info(`Loading model from ${this.modelName} on ${this.device}`);

        this.model = new LlamaModel({
            modelPath: this.modelName,
            gpuLayers: this.device.startsWith('cuda') ? 35 : 0
        });

        this.context = new LlamaContext({
            model: this.model,
            contextSize: this.config.contextSize || 4096
        });

        this.logger.info('Model loaded successfully');
    }

    public async complete(
        messages: Message[],
        options: CompletionOptions = {}
    ): Promise<CompletionResponse> {
        if (!this.model || !this.context) {
            throw new Error('Model not initialized. Call initialize() first.');
        }

        this.validateMessages(messages);
        this.validateOptions(options);

        const session = new LlamaChatSession({
            context: this.context
        });

        // Format messages for the model
        const formattedPrompt = this.formatMessages(messages);

        const startTime = Date.now();
        const response = await session.prompt(formattedPrompt, {
            temperature: options.temperature ?? 0.7,
            topK: options.topK ?? 40,
            topP: options.topP ?? 0.95,
            maxTokens: options.maxTokens ?? 512
        });

        const endTime = Date.now();
        const duration = (endTime - startTime) / 1000;

        this.logger.info(`Generated completion in ${duration.toFixed(2)}s`);

        // Estimate token counts (simplified)
        const promptTokens = Math.ceil(formattedPrompt.length / 4);
        const completionTokens = Math.ceil(response.length / 4);

        return {
            content: response,
            model: this.modelName,
            finishReason: 'stop',
            usage: {
                promptTokens,
                completionTokens,
                totalTokens: promptTokens + completionTokens
            }
        };
    }

    public async *streamComplete(
        messages: Message[],
        options: CompletionOptions = {}
    ): AsyncIterableIterator<string> {
        if (!this.model || !this.context) {
            throw new Error('Model not initialized. Call initialize() first.');
        }

        this.validateMessages(messages);
        this.validateOptions(options);

        const session = new LlamaChatSession({
            context: this.context
        });

        const formattedPrompt = this.formatMessages(messages);

        const stream = session.promptWithMeta(formattedPrompt, {
            temperature: options.temperature ?? 0.7,
            topK: options.topK ?? 40,
            topP: options.topP ?? 0.95,
            maxTokens: options.maxTokens ?? 512,
            onToken: (tokens: number[]) => {
                // Tokens are yielded through the async generator
            }
        });

        for await (const token of stream) {
            yield token;
        }
    }

    private formatMessages(messages: Message[]): string {
        const formatted = messages.map(msg => {
            const prefix = msg.role === 'user' ? 'User' : 
                          msg.role === 'assistant' ? 'Assistant' : 
                          'System';
            return `${prefix}: ${msg.content}`;
        });

        return formatted.join('\n\n') + '\n\nAssistant:';
    }

    public async dispose(): Promise<void> {
        if (this.context) {
            this.context.dispose();
        }
        if (this.model) {
            this.model.dispose();
        }
        this.logger.info('Model resources disposed');
    }
}

The local implementation shows how the same interface adapts to a completely different backend. Instead of making HTTP requests, it loads models into memory and runs inference locally. The initialize method is specific to local models because they require loading before use, while API-based models are ready immediately.

The dispose method demonstrates proper resource management. Local models consume significant memory and GPU resources that must be explicitly released. TypeScript's type system helps ensure that dispose is called by making it part of the class interface.

Configuration Management Component

LLM applications require extensive configuration. Model parameters like temperature and top-k affect generation quality. Context window sizes determine how much conversation history the model can consider. API keys and endpoints vary between environments. Hard-coding these values makes applications inflexible and difficult to maintain.

The configuration management component provides a clean way to externalize all configuration into files that can be modified without changing code. It supports both JSON and YAML formats, validates configuration values, and provides sensible defaults.

The component uses a hierarchical structure where general settings can be overridden by environment-specific values. For example, a base configuration might specify default model parameters, while a production configuration overrides the model endpoint and API key.

import * as fs from 'fs/promises';
import * as path from 'path';
import * as yaml from 'js-yaml';
import { z } from 'zod';

// Define configuration schema using Zod for runtime validation
const LLMModelConfigSchema = z.object({
    modelName: z.string(),
    temperature: z.number().min(0).max(2).default(0.7),
    maxTokens: z.number().positive().optional(),
    topP: z.number().min(0).max(1).default(1.0),
    topK: z.number().positive().optional(),
    contextWindow: z.number().positive().default(4096),
    systemMessage: z.string().optional()
});

const ApplicationConfigSchema = z.object({
    llm: LLMModelConfigSchema,
    apiKeys: z.record(z.string()).default({}),
    loggingLevel: z.enum(['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL']).default('INFO'),
    enableStreaming: z.boolean().default(true),
    maxRetries: z.number().nonnegative().default(3),
    timeoutSeconds: z.number().positive().default(30)
});

export type LLMModelConfig = z.infer<typeof LLMModelConfigSchema>;
export type ApplicationConfig = z.infer<typeof ApplicationConfigSchema>;

export class ConfigurationManager {
    private configPath?: string;
    private config?: ApplicationConfig;
    private logger: Logger;

    constructor(configPath?: string) {
        this.configPath = configPath;
        this.logger = createLogger('ConfigurationManager');
    }

    public async load(configPath?: string): Promise<ApplicationConfig> {
        const pathToLoad = configPath || this.configPath;

        if (!pathToLoad) {
            this.logger.warn('No configuration file specified, using defaults');
            return this.createDefaultConfig();
        }

        try {
            await fs.access(pathToLoad);
        } catch (error) {
            throw new Error(`Configuration file not found: ${pathToLoad}`);
        }

        const ext = path.extname(pathToLoad).toLowerCase();
        let configData: any;

        if (ext === '.json') {
            configData = await this.loadJson(pathToLoad);
        } else if (ext === '.yaml' || ext === '.yml') {
            configData = await this.loadYaml(pathToLoad);
        } else {
            throw new Error(`Unsupported configuration format: ${ext}`);
        }

        // Validate and parse configuration
        try {
            this.config = ApplicationConfigSchema.parse(configData);
            this.logger.info(`Loaded configuration from ${pathToLoad}`);
            return this.config;
        } catch (error) {
            if (error instanceof z.ZodError) {
                const issues = error.issues.map(i => `${i.path.join('.')}: ${i.message}`);
                throw new Error(`Configuration validation failed:\n${issues.join('\n')}`);
            }
            throw error;
        }
    }

The configuration manager uses Zod, a TypeScript-first schema validation library. Zod schemas provide both runtime validation and compile-time type inference. The z.infer utility extracts TypeScript types from schemas, ensuring that the type definitions always match the validation rules.

This approach provides several benefits. Configuration is validated at load time, catching errors before they cause runtime failures. The type system ensures that code using the configuration accesses only valid fields. Default values are specified once in the schema rather than scattered throughout the code.

    private async loadJson(filePath: string): Promise<any> {
        const content = await fs.readFile(filePath, 'utf-8');
        return JSON.parse(content);
    }

    private async loadYaml(filePath: string): Promise<any> {
        const content = await fs.readFile(filePath, 'utf-8');
        return yaml.load(content);
    }

    private createDefaultConfig(): ApplicationConfig {
        return ApplicationConfigSchema.parse({
            llm: {
                modelName: 'gpt-3.5-turbo',
                temperature: 0.7,
                topP: 1.0,
                contextWindow: 4096
            },
            apiKeys: {},
            loggingLevel: 'INFO',
            enableStreaming: true,
            maxRetries: 3,
            timeoutSeconds: 30
        });
    }

    public async save(config: ApplicationConfig, outputPath: string): Promise<void> {
        const ext = path.extname(outputPath).toLowerCase();

        let content: string;
        if (ext === '.json') {
            content = JSON.stringify(config, null, 2);
        } else if (ext === '.yaml' || ext === '.yml') {
            content = yaml.dump(config);
        } else {
            throw new Error(`Unsupported output format: ${ext}`);
        }

        await fs.writeFile(outputPath, content, 'utf-8');
        this.logger.info(`Saved configuration to ${outputPath}`);
    }

    public getConfig(): ApplicationConfig {
        if (!this.config) {
            throw new Error('Configuration not loaded. Call load() first.');
        }
        return this.config;
    }

    public async merge(overrides: Partial<ApplicationConfig>): Promise<ApplicationConfig> {
        const current = this.config || this.createDefaultConfig();
        
        const merged = {
            ...current,
            ...overrides,
            llm: {
                ...current.llm,
                ...(overrides.llm || {})
            },
            apiKeys: {
                ...current.apiKeys,
                ...(overrides.apiKeys || {})
            }
        };

        this.config = ApplicationConfigSchema.parse(merged);
        return this.config;
    }

    public async loadWithEnvironmentOverrides(
        configPath: string,
        envPrefix: string = 'LLM_'
    ): Promise<ApplicationConfig> {
        const baseConfig = await this.load(configPath);

        const overrides: Partial<ApplicationConfig> = {};

        // Check for environment variable overrides
        if (process.env[`${envPrefix}MODEL_NAME`]) {
            overrides.llm = {
                ...baseConfig.llm,
                modelName: process.env[`${envPrefix}MODEL_NAME`]!
            };
        }

        if (process.env[`${envPrefix}TEMPERATURE`]) {
            overrides.llm = {
                ...(overrides.llm || baseConfig.llm),
                temperature: parseFloat(process.env[`${envPrefix}TEMPERATURE`]!)
            };
        }

        if (process.env[`${envPrefix}API_KEY`]) {
            overrides.apiKeys = {
                ...baseConfig.apiKeys,
                default: process.env[`${envPrefix}API_KEY`]!
            };
        }

        if (Object.keys(overrides).length > 0) {
            this.logger.info('Applied environment variable overrides');
            return this.merge(overrides);
        }

        return baseConfig;
    }
}

The save method enables round-tripping configuration. Applications can load configuration, modify it programmatically, and save the updated version. This is useful for tools that help users configure the application through a graphical interface.

The merge method provides a way to combine configurations. This is particularly useful when loading a base configuration and applying environment-specific overrides. The method performs a deep merge, ensuring that nested objects like llm and apiKeys are merged correctly rather than replaced entirely.

The loadWithEnvironmentOverrides method demonstrates a common pattern in production applications. Base configuration comes from a file, but sensitive values like API keys and environment-specific settings come from environment variables. This allows the same configuration file to be used across environments while keeping secrets out of version control.

A typical configuration file in YAML format might look like this:

llm:
  modelName: "meta-llama/Llama-2-7b-chat-hf"
  temperature: 0.8
  maxTokens: 2048
  topP: 0.95
  topK: 50
  contextWindow: 4096
  systemMessage: "You are a helpful AI assistant."

apiKeys:
  openai: "sk-..."
  anthropic: "sk-ant-..."

loggingLevel: "INFO"
enableStreaming: true
maxRetries: 3
timeoutSeconds: 60

The hierarchical structure makes configuration files readable and maintainable. Related settings are grouped together, and the structure mirrors the code's type definitions, making it easy to understand how configuration maps to application behavior.

Tool Calling Framework Component

Modern LLMs can use external tools to extend their capabilities beyond text generation. A model might call a web search API to find current information, execute code to perform calculations, or query a database to retrieve specific data. The tool calling framework component provides infrastructure for defining tools, invoking them safely, and integrating results back into the conversation.

The framework uses a plugin architecture where each tool is a self-contained unit with a clear interface. Tools declare their name, description, and parameters using a schema that the LLM can understand. When the model decides to use a tool, the framework validates the parameters, executes the tool, and formats the result.

import { z } from 'zod';

export interface ToolParameter {
    name: string;
    type: 'string' | 'number' | 'boolean' | 'array' | 'object';
    description: string;
    required: boolean;
    enum?: any[];
}

export interface ToolSchema {
    name: string;
    description: string;
    parameters: ToolParameter[];
}

export interface ToolExecutionResult {
    success: boolean;
    result?: any;
    error?: string;
}

export abstract class BaseTool {
    protected logger: Logger;

    constructor() {
        this.logger = createLogger(this.constructor.name);
    }

    public abstract getSchema(): ToolSchema;

    public abstract execute(parameters: Record<string, any>): Promise<any>;

    public async validateAndExecute(parameters: Record<string, any>): Promise<ToolExecutionResult> {
        const schema = this.getSchema();

        // Validate required parameters
        for (const param of schema.parameters) {
            if (param.required && !(param.name in parameters)) {
                return {
                    success: false,
                    error: `Missing required parameter: ${param.name}`
                };
            }
        }

        try {
            const result = await this.execute(parameters);
            return {
                success: true,
                result
            };
        } catch (error) {
            this.logger.error('Tool execution failed', error);
            return {
                success: false,
                error: error instanceof Error ? error.message : String(error)
            };
        }
    }

    public toOpenAIFormat(): Record<string, any> {
        const schema = this.getSchema();
        const properties: Record<string, any> = {};
        const required: string[] = [];

        for (const param of schema.parameters) {
            properties[param.name] = {
                type: param.type,
                description: param.description
            };

            if (param.enum) {
                properties[param.name].enum = param.enum;
            }

            if (param.required) {
                required.push(param.name);
            }
        }

        return {
            type: 'function',
            function: {
                name: schema.name,
                description: schema.description,
                parameters: {
                    type: 'object',
                    properties,
                    required
                }
            }
        };
    }
}

The BaseTool class defines the contract that all tools must implement. The getSchema method returns metadata that describes what the tool does and what parameters it accepts. The execute method performs the actual work. The validateAndExecute method adds a safety layer that checks parameters before execution and handles errors gracefully.

The toOpenAIFormat method shows how the schema can be adapted to different formats. This flexibility allows the same tool definitions to work with different LLM providers. The method transforms the internal schema representation into the format that OpenAI's function calling API expects.

Here is a concrete implementation of a web search tool using DuckDuckGo:

import axios from 'axios';

interface SearchResult {
    position: number;
    title: string;
    url: string;
    snippet: string;
}

interface WebSearchResult {
    query: string;
    timestamp: string;
    results: SearchResult[];
    count: number;
}

export class WebSearchTool extends BaseTool {
    private maxResults: number;

    constructor(maxResults: number = 5) {
        super();
        this.maxResults = maxResults;
    }

    public getSchema(): ToolSchema {
        return {
            name: 'web_search',
            description: 'Search the web for current information using DuckDuckGo. Use this when you need up-to-date information or facts that you do not have in your training data.',
            parameters: [
                {
                    name: 'query',
                    type: 'string',
                    description: 'The search query to execute',
                    required: true
                },
                {
                    name: 'maxResults',
                    type: 'number',
                    description: 'Maximum number of results to return (default: 5)',
                    required: false
                }
            ]
        };
    }

    public async execute(parameters: Record<string, any>): Promise<WebSearchResult> {
        const query = parameters.query as string;
        const maxResults = (parameters.maxResults as number) || this.maxResults;

        this.logger.info(`Executing web search: ${query}`);

        try {
            // Use DuckDuckGo HTML API
            const response = await axios.get('https://html.duckduckgo.com/html/', {
                params: { q: query },
                headers: {
                    'User-Agent': 'Mozilla/5.0 (compatible; LLMApp/1.0)'
                }
            });

            const results = this.parseSearchResults(response.data, maxResults);

            return {
                query,
                timestamp: new Date().toISOString(),
                results,
                count: results.length
            };
        } catch (error) {
            this.logger.error('Search failed', error);
            throw new Error(`Web search failed: ${error instanceof Error ? error.message : String(error)}`);
        }
    }

    private parseSearchResults(html: string, maxResults: number): SearchResult[] {
        const results: SearchResult[] = [];
        
        // Simple regex-based parsing (in production, use a proper HTML parser)
        const resultPattern = /<a class="result__a" href="([^"]+)">([^<]+)<\/a>[\s\S]*?<a class="result__snippet"[^>]*>([^<]+)</g;
        
        let match;
        let position = 1;
        
        while ((match = resultPattern.exec(html)) !== null && position <= maxResults) {
            results.push({
                position,
                title: this.decodeHtml(match[2]),
                url: this.decodeHtml(match[1]),
                snippet: this.decodeHtml(match[3])
            });
            position++;
        }

        return results;
    }

    private decodeHtml(text: string): string {
        return text
            .replace(/&amp;/g, '&')
            .replace(/&lt;/g, '<')
            .replace(/&gt;/g, '>')
            .replace(/&quot;/g, '"')
            .replace(/&#39;/g, "'")
            .trim();
    }
}

The web search tool demonstrates several important patterns. It encapsulates the complexity of interacting with the DuckDuckGo API behind a simple interface. The execute method returns structured data that is easy for both the LLM and application code to process. Error handling ensures that search failures are logged and reported rather than crashing the application.

The tool framework also includes a registry that manages available tools and routes execution requests:

export class ToolRegistry {
    private tools: Map<string, BaseTool>;
    private logger: Logger;

    constructor() {
        this.tools = new Map();
        this.logger = createLogger('ToolRegistry');
    }

    public register(tool: BaseTool): void {
        const schema = tool.getSchema();
        this.tools.set(schema.name, tool);
        this.logger.info(`Registered tool: ${schema.name}`);
    }

    public getTool(name: string): BaseTool | undefined {
        return this.tools.get(name);
    }

    public getAllSchemas(): ToolSchema[] {
        return Array.from(this.tools.values()).map(tool => tool.getSchema());
    }

    public getAllOpenAIFormats(): Record<string, any>[] {
        return Array.from(this.tools.values()).map(tool => tool.toOpenAIFormat());
    }

    public async executeTool(name: string, parameters: Record<string, any>): Promise<ToolExecutionResult> {
        const tool = this.getTool(name);
        
        if (!tool) {
            return {
                success: false,
                error: `Tool not found: ${name}`
            };
        }

        return tool.validateAndExecute(parameters);
    }

    public hasTools(): boolean {
        return this.tools.size > 0;
    }

    public getToolNames(): string[] {
        return Array.from(this.tools.keys());
    }
}

The registry provides a central point for managing tools. Applications register all available tools at startup, and the registry handles routing execution requests to the appropriate tool. This centralization makes it easy to add, remove, or modify tools without changing the core application logic.

For paid services like SERP API, the framework supports the same interface with different implementations:

export class SerpAPISearchTool extends BaseTool {
    private apiKey: string;
    private maxResults: number;

    constructor(apiKey: string, maxResults: number = 5) {
        super();
        this.apiKey = apiKey;
        this.maxResults = maxResults;
    }

    public getSchema(): ToolSchema {
        return {
            name: 'web_search',
            description: 'Search the web using SERP API for high-quality, structured results.',
            parameters: [
                {
                    name: 'query',
                    type: 'string',
                    description: 'The search query',
                    required: true
                },
                {
                    name: 'maxResults',
                    type: 'number',
                    description: 'Maximum results to return',
                    required: false
                }
            ]
        };
    }

    public async execute(parameters: Record<string, any>): Promise<WebSearchResult> {
        const query = parameters.query as string;
        const maxResults = (parameters.maxResults as number) || this.maxResults;

        const response = await axios.get('https://serpapi.com/search', {
            params: {
                q: query,
                api_key: this.apiKey,
                num: maxResults
            }
        });

        const organicResults = response.data.organic_results || [];

        const results: SearchResult[] = organicResults
            .slice(0, maxResults)
            .map((result: any, index: number) => ({
                position: index + 1,
                title: result.title || '',
                url: result.link || '',
                snippet: result.snippet || ''
            }));

        return {
            query,
            timestamp: new Date().toISOString(),
            results,
            count: results.length
        };
    }
}

Both search tools implement the same schema, making them interchangeable. An application can switch from the free DuckDuckGo service to the paid SERP API by simply registering a different tool instance, without changing any other code.

Model Context Protocol Integration Component

The Model Context Protocol, developed by Anthropic, provides a standardized way for LLM applications to access external context sources. MCP servers expose resources like files, databases, or APIs through a uniform interface. MCP clients consume these resources and make them available to LLMs.

The MCP integration component provides both client and server implementations, allowing applications to act as either consumers or providers of context. This enables sophisticated architectures where multiple applications share context through MCP.

export interface MCPResource {
    uri: string;
    name: string;
    description?: string;
    mimeType?: string;
}

export interface MCPTool {
    name: string;
    description: string;
    inputSchema: Record<string, any>;
}

export interface MCPResourceContent {
    uri: string;
    content: string;
    mimeType: string;
}

export class MCPClient {
    private serverUrl: string;
    private logger: Logger;
    private connected: boolean;

    constructor(serverUrl: string) {
        this.serverUrl = serverUrl;
        this.logger = createLogger('MCPClient');
        this.connected = false;
    }

    public async connect(): Promise<void> {
        this.logger.info(`Connecting to MCP server: ${this.serverUrl}`);
        
        // In a real implementation, this would establish a WebSocket or HTTP connection
        // For this example, we show the interface structure
        this.connected = true;
    }

    public async listResources(): Promise<MCPResource[]> {
        if (!this.connected) {
            throw new Error('Not connected to MCP server');
        }

        this.logger.info('Listing MCP resources');
        
        // Real implementation would make an RPC call
        return [];
    }

    public async readResource(uri: string): Promise<MCPResourceContent> {
        if (!this.connected) {
            throw new Error('Not connected to MCP server');
        }

        this.logger.info(`Reading resource: ${uri}`);
        
        // Real implementation would fetch the resource
        return {
            uri,
            content: '',
            mimeType: 'text/plain'
        };
    }

    public async listTools(): Promise<MCPTool[]> {
        if (!this.connected) {
            throw new Error('Not connected to MCP server');
        }

        this.logger.info('Listing MCP tools');
        return [];
    }

    public async callTool(name: string, arguments_: Record<string, any>): Promise<any> {
        if (!this.connected) {
            throw new Error('Not connected to MCP server');
        }

        this.logger.info(`Calling MCP tool: ${name}`);
        
        // Real implementation would make RPC call
        return { result: null };
    }

    public async disconnect(): Promise<void> {
        if (this.connected) {
            this.logger.info('Disconnecting from MCP server');
            this.connected = false;
        }
    }

    public isConnected(): boolean {
        return this.connected;
    }
}

The MCP client provides asynchronous methods for all operations because network communication is inherently asynchronous. The async/await pattern makes the code easy to read and maintain while handling the complexity of asynchronous operations.

The client separates resource access from tool invocation. Resources are passive data sources that the client reads. Tools are active operations that the server executes on behalf of the client. This distinction is important because it affects caching, permissions, and error handling.

The server implementation mirrors the client:

type ResourceHandler = () => Promise<string>;
type ToolHandler = (args: Record<string, any>) => Promise<any>;

interface RegisteredResource {
    name: string;
    handler: ResourceHandler;
    description?: string;
    mimeType?: string;
}

interface RegisteredTool {
    handler: ToolHandler;
    description: string;
    inputSchema: Record<string, any>;
}

export class MCPServer {
    private name: string;
    private version: string;
    private resources: Map<string, RegisteredResource>;
    private tools: Map<string, RegisteredTool>;
    private logger: Logger;

    constructor(name: string, version: string) {
        this.name = name;
        this.version = version;
        this.resources = new Map();
        this.tools = new Map();
        this.logger = createLogger('MCPServer');
    }

    public registerResource(
        uri: string,
        name: string,
        handler: ResourceHandler,
        options?: {
            description?: string;
            mimeType?: string;
        }
    ): void {
        this.resources.set(uri, {
            name,
            handler,
            description: options?.description,
            mimeType: options?.mimeType
        });
        this.logger.info(`Registered resource: ${uri}`);
    }

    public registerTool(
        name: string,
        handler: ToolHandler,
        description: string,
        inputSchema: Record<string, any>
    ): void {
        this.tools.set(name, {
            handler,
            description,
            inputSchema
        });
        this.logger.info(`Registered tool: ${name}`);
    }

    public async handleListResources(): Promise<MCPResource[]> {
        const resources: MCPResource[] = [];
        
        for (const [uri, info] of this.resources.entries()) {
            resources.push({
                uri,
                name: info.name,
                description: info.description,
                mimeType: info.mimeType
            });
        }

        return resources;
    }

    public async handleReadResource(uri: string): Promise<MCPResourceContent> {
        const resource = this.resources.get(uri);
        
        if (!resource) {
            throw new Error(`Resource not found: ${uri}`);
        }

        const content = await resource.handler();

        return {
            uri,
            content,
            mimeType: resource.mimeType || 'text/plain'
        };
    }

    public async handleCallTool(name: string, arguments_: Record<string, any>): Promise<any> {
        const tool = this.tools.get(name);
        
        if (!tool) {
            throw new Error(`Tool not found: ${name}`);
        }

        const result = await tool.handler(arguments_);
        return { result };
    }

    public getServerInfo(): { name: string; version: string } {
        return {
            name: this.name,
            version: this.version
        };
    }
}

The server uses a registration pattern where handlers are registered for specific URIs and tool names. This makes it easy to add new resources and tools dynamically. The handlers are async functions, allowing them to perform I/O operations efficiently.

An example of using the MCP server to expose a file system:

import * as fs from 'fs/promises';
import * as path from 'path';

export async function createFileServer(basePath: string): Promise<MCPServer> {
    const server = new MCPServer('file_server', '1.0.0');
    const baseDir = path.resolve(basePath);

    // Register resources for text files
    const files = await fs.readdir(baseDir, { recursive: true });
    
    for (const file of files) {
        const filePath = path.join(baseDir, file as string);
        const stat = await fs.stat(filePath);
        
        if (stat.isFile() && filePath.endsWith('.txt')) {
            const relativePath = path.relative(baseDir, filePath);
            const uri = `file:///${relativePath.replace(/\\/g, '/')}`;

            server.registerResource(
                uri,
                relativePath,
                async () => {
                    return await fs.readFile(filePath, 'utf-8');
                },
                {
                    description: `Text file: ${relativePath}`,
                    mimeType: 'text/plain'
                }
            );
        }
    }

    // Register a search tool
    server.registerTool(
        'search_files',
        async (args: Record<string, any>) => {
            const query = args.query as string;
            const results: string[] = [];

            for (const file of files) {
                const filePath = path.join(baseDir, file as string);
                const stat = await fs.stat(filePath);
                
                if (stat.isFile() && filePath.endsWith('.txt')) {
                    const content = await fs.readFile(filePath, 'utf-8');
                    if (content.toLowerCase().includes(query.toLowerCase())) {
                        results.push(path.relative(baseDir, filePath));
                    }
                }
            }

            return results;
        },
        'Search for files containing specific text',
        {
            type: 'object',
            properties: {
                query: {
                    type: 'string',
                    description: 'Text to search for'
                }
            },
            required: ['query']
        }
    );

    return server;
}

This example shows how MCP enables powerful integrations. The file server exposes an entire directory tree as MCP resources, making all files accessible to any MCP client. The search tool provides a way to find files by content, demonstrating how MCP tools can perform operations beyond simple data retrieval.

Message Management and Chat History Component

Conversational LLM applications must manage the flow of messages between users and the model. Each interaction involves system messages that set behavior, user messages containing requests, and assistant messages with responses. Managing this conversation state correctly is essential for coherent multi-turn interactions.

The message management component provides structures for representing messages and utilities for managing conversation history. It handles concerns like context window limits, message formatting, and conversation persistence.

export interface ConversationMessage {
    role: 'system' | 'user' | 'assistant';
    content: string;
    timestamp: Date;
    metadata: Record<string, any>;
}

type TokenCounter = (text: string) => number;

export class ChatHistory {
    private messages: ConversationMessage[];
    private maxContextTokens: number;
    private tokenCounter: TokenCounter;
    private logger: Logger;

    constructor(
        systemMessage?: string,
        maxContextTokens: number = 4096,
        tokenCounter?: TokenCounter
    ) {
        this.messages = [];
        this.maxContextTokens = maxContextTokens;
        this.tokenCounter = tokenCounter || this.defaultTokenCounter;
        this.logger = createLogger('ChatHistory');

        if (systemMessage) {
            this.addSystemMessage(systemMessage);
        }
    }

    private defaultTokenCounter(text: string): number {
        // Rough estimation: 1 token per 4 characters
        return Math.ceil(text.length / 4);
    }

    public addSystemMessage(content: string, metadata: Record<string, any> = {}): void {
        const message: ConversationMessage = {
            role: 'system',
            content,
            timestamp: new Date(),
            metadata
        };
        this.messages.push(message);
        this.logger.debug('Added system message');
    }

    public addUserMessage(content: string, metadata: Record<string, any> = {}): void {
        const message: ConversationMessage = {
            role: 'user',
            content,
            timestamp: new Date(),
            metadata
        };
        this.messages.push(message);
        this.logger.debug('Added user message');
    }

    public addAssistantMessage(content: string, metadata: Record<string, any> = {}): void {
        const message: ConversationMessage = {
            role: 'assistant',
            content,
            timestamp: new Date(),
            metadata
        };
        this.messages.push(message);
        this.logger.debug('Added assistant message');
    }

    public getMessagesForLLM(): Message[] {
        // Always include system messages
        const systemMessages = this.messages.filter(msg => msg.role === 'system');
        const conversationMessages = this.messages.filter(msg => msg.role !== 'system');

        // Count tokens for system messages
        const systemTokens = systemMessages.reduce(
            (sum, msg) => sum + this.tokenCounter(msg.content),
            0
        );

        // Calculate available tokens for conversation
        const availableTokens = this.maxContextTokens - systemTokens;

        // Add conversation messages from most recent, staying within limit
        const selectedMessages: ConversationMessage[] = [];
        let currentTokens = 0;

        for (let i = conversationMessages.length - 1; i >= 0; i--) {
            const msg = conversationMessages[i];
            const msgTokens = this.tokenCounter(msg.content);

            if (currentTokens + msgTokens > availableTokens) {
                break;
            }

            selectedMessages.unshift(msg);
            currentTokens += msgTokens;
        }

        // Combine system and selected conversation messages
        const allMessages = [...systemMessages, ...selectedMessages];

        // Convert to Message format
        return allMessages.map(msg => ({
            role: msg.role,
            content: msg.content
        }));
    }

The chat history component implements intelligent context window management. It ensures that the total tokens sent to the LLM never exceed the model's context limit. System messages are always included because they define the assistant's behavior. Conversation messages are included starting from the most recent, working backward until the token limit is reached.

This approach ensures that the model always has the most relevant context. Recent messages are more important for maintaining conversation coherence than older messages. If the conversation becomes very long, older messages are automatically dropped.

The token counter is pluggable. The default implementation uses a simple character-based estimation, but applications can provide more accurate counters using tokenizers specific to their model:

    public setTokenCounter(counter: TokenCounter): void {
        this.tokenCounter = counter;
        this.logger.info('Updated token counter');
    }

    public getTotalTokens(): number {
        return this.messages.reduce(
            (sum, msg) => sum + this.tokenCounter(msg.content),
            0
        );
    }

    public clear(): void {
        this.messages = this.messages.filter(msg => msg.role === 'system');
        this.logger.info('Cleared conversation history');
    }

    public async save(filePath: string): Promise<void> {
        const data = {
            maxContextTokens: this.maxContextTokens,
            messages: this.messages.map(msg => ({
                role: msg.role,
                content: msg.content,
                timestamp: msg.timestamp.toISOString(),
                metadata: msg.metadata
            }))
        };

        await fs.writeFile(filePath, JSON.stringify(data, null, 2), 'utf-8');
        this.logger.info(`Saved conversation to ${filePath}`);
    }

    public static async load(filePath: string): Promise<ChatHistory> {
        const content = await fs.readFile(filePath, 'utf-8');
        const data = JSON.parse(content);

        const history = new ChatHistory(undefined, data.maxContextTokens);
        history.messages = data.messages.map((msg: any) => ({
            role: msg.role,
            content: msg.content,
            timestamp: new Date(msg.timestamp),
            metadata: msg.metadata || {}
        }));

        return history;
    }

    public getMessages(): ConversationMessage[] {
        return [...this.messages];
    }

    public getMessageCount(): number {
        return this.messages.length;
    }

    public getLastMessage(): ConversationMessage | undefined {
        return this.messages[this.messages.length - 1];
    }
}

The save and load methods enable conversation persistence. Applications can save conversations to disk and resume them later. This is essential for applications that need to maintain state across sessions or allow users to review past conversations.

The metadata field in ConversationMessage provides extensibility. Applications can attach arbitrary data to messages, such as user IDs, confidence scores, or references to external resources. This metadata is preserved when saving and loading conversations.

Circuit Breaker and Rate Limiting Component

Production LLM applications must handle failures gracefully and respect API rate limits. External LLM services can experience outages, network issues can cause timeouts, and exceeding rate limits can result in blocked requests. The circuit breaker and rate limiting component provides resilience mechanisms that prevent cascading failures and ensure compliance with service quotas.

A circuit breaker monitors requests to an external service and automatically stops sending requests when the service appears to be failing. This prevents wasting resources on requests that will likely fail and gives the service time to recover. After a cooldown period, the circuit breaker allows a test request through to check if the service has recovered.

export enum CircuitState {
    CLOSED = 'closed',
    OPEN = 'open',
    HALF_OPEN = 'half_open'
}

export class CircuitBreaker {
    private failureThreshold: number;
    private recoveryTimeout: number;
    private expectedError: typeof Error;
    
    private failureCount: number;
    private lastFailureTime?: Date;
    private state: CircuitState;
    private logger: Logger;

    constructor(
        failureThreshold: number = 5,
        recoveryTimeout: number = 60,
        expectedError: typeof Error = Error
    ) {
        this.failureThreshold = failureThreshold;
        this.recoveryTimeout = recoveryTimeout;
        this.expectedError = expectedError;
        
        this.failureCount = 0;
        this.state = CircuitState.CLOSED;
        this.logger = createLogger('CircuitBreaker');
    }

    public async call<T>(func: () => Promise<T>): Promise<T> {
        if (this.state === CircuitState.OPEN) {
            if (this.shouldAttemptReset()) {
                this.logger.info('Circuit breaker entering half-open state');
                this.state = CircuitState.HALF_OPEN;
            } else {
                throw new Error('Circuit breaker is OPEN');
            }
        }

        try {
            const result = await func();
            this.onSuccess();
            return result;
        } catch (error) {
            if (error instanceof this.expectedError) {
                this.onFailure();
            }
            throw error;
        }
    }

    private shouldAttemptReset(): boolean {
        if (!this.lastFailureTime) {
            return false;
        }

        const elapsed = (Date.now() - this.lastFailureTime.getTime()) / 1000;
        return elapsed >= this.recoveryTimeout;
    }

    private onSuccess(): void {
        if (this.state === CircuitState.HALF_OPEN) {
            this.logger.info('Circuit breaker closing after successful test');
            this.state = CircuitState.CLOSED;
        }

        this.failureCount = 0;
    }

    private onFailure(): void {
        this.failureCount++;
        this.lastFailureTime = new Date();

        if (this.failureCount >= this.failureThreshold) {
            this.logger.warn(
                `Circuit breaker opening after ${this.failureCount} failures`
            );
            this.state = CircuitState.OPEN;
        }
    }

    public reset(): void {
        this.logger.info('Manually resetting circuit breaker');
        this.state = CircuitState.CLOSED;
        this.failureCount = 0;
        this.lastFailureTime = undefined;
    }

    public getState(): CircuitState {
        return this.state;
    }

    public getFailureCount(): number {
        return this.failureCount;
    }
}

The circuit breaker uses a state machine with three states. In the closed state, requests flow normally. When failures exceed the threshold, the circuit opens and blocks all requests. After the recovery timeout, the circuit enters half-open state and allows one test request. If the test succeeds, the circuit closes. If it fails, the circuit reopens.

This mechanism prevents overwhelming a failing service with requests while still allowing automatic recovery. The recovery timeout gives the service time to stabilize before testing whether it is healthy again.

Rate limiting complements the circuit breaker by preventing the application from exceeding service quotas:

export class RateLimiter {
    private maxRequests: number;
    private timeWindow: number;
    private burstSize: number;
    
    private tokens: number;
    private lastUpdate: number;
    private logger: Logger;

    constructor(
        maxRequests: number,
        timeWindow: number,
        burstSize?: number
    ) {
        this.maxRequests = maxRequests;
        this.timeWindow = timeWindow;
        this.burstSize = burstSize || maxRequests;
        
        this.tokens = this.burstSize;
        this.lastUpdate = Date.now();
        this.logger = createLogger('RateLimiter');
    }

    public acquire(tokens: number = 1): boolean {
        this.refill();

        if (this.tokens >= tokens) {
            this.tokens -= tokens;
            return true;
        }

        return false;
    }

    public async waitAndAcquire(tokens: number = 1, timeout?: number): Promise<void> {
        const startTime = Date.now();

        while (!this.acquire(tokens)) {
            if (timeout && (Date.now() - startTime) > timeout) {
                throw new Error('Rate limiter timeout exceeded');
            }

            const waitTime = this.timeUntilNextToken();
            await this.sleep(Math.min(waitTime, 100));
        }
    }

    private refill(): void {
        const now = Date.now();
        const elapsed = (now - this.lastUpdate) / 1000;

        const tokensToAdd = (elapsed / this.timeWindow) * this.maxRequests;
        this.tokens = Math.min(this.burstSize, this.tokens + tokensToAdd);
        this.lastUpdate = now;
    }

    private timeUntilNextToken(): number {
        if (this.tokens >= 1) {
            return 0;
        }

        const tokensNeeded = 1 - this.tokens;
        return (tokensNeeded / this.maxRequests) * this.timeWindow * 1000;
    }

    private sleep(ms: number): Promise<void> {
        return new Promise(resolve => setTimeout(resolve, ms));
    }

    public getAvailableTokens(): number {
        this.refill();
        return this.tokens;
    }

    public reset(): void {
        this.tokens = this.burstSize;
        this.lastUpdate = Date.now();
        this.logger.info('Rate limiter reset');
    }
}

The rate limiter implements the token bucket algorithm. Tokens are added to the bucket at a steady rate determined by maxRequests and timeWindow. Each request consumes one token. If no tokens are available, the request must wait.

The burstSize parameter allows short bursts of requests above the average rate. This is useful for handling legitimate traffic spikes without triggering rate limits. The bucket can hold more tokens than the steady-state rate, allowing accumulated tokens to be spent quickly.

The waitAndAcquire method blocks until tokens are available. This is convenient for applications that can afford to wait rather than failing immediately. The timeout parameter prevents indefinite blocking.

Combining circuit breaker and rate limiter creates robust request handling:

export class ResilientLLMClient {
    private llm: BaseLLM;
    private rateLimiter: RateLimiter;
    private circuitBreaker: CircuitBreaker;
    private logger: Logger;

    constructor(
        llm: BaseLLM,
        maxRequestsPerMinute: number = 60,
        circuitBreakerThreshold: number = 5
    ) {
        this.llm = llm;
        this.rateLimiter = new RateLimiter(maxRequestsPerMinute, 60);
        this.circuitBreaker = new CircuitBreaker(circuitBreakerThreshold);
        this.logger = createLogger('ResilientLLMClient');
    }

    public async complete(
        messages: Message[],
        options?: CompletionOptions
    ): Promise<CompletionResponse> {
        await this.rateLimiter.waitAndAcquire();

        return this.circuitBreaker.call(async () => {
            return this.llm.complete(messages, options);
        });
    }

    public async *streamComplete(
        messages: Message[],
        options?: CompletionOptions
    ): AsyncIterableIterator<string> {
        await this.rateLimiter.waitAndAcquire();

        const generator = this.llm.streamComplete(messages, options);
        
        for await (const chunk of generator) {
            yield chunk;
        }
    }

    public getCircuitState(): CircuitState {
        return this.circuitBreaker.getState();
    }

    public getAvailableTokens(): number {
        return this.rateLimiter.getAvailableTokens();
    }

    public resetCircuitBreaker(): void {
        this.circuitBreaker.reset();
    }
}

The resilient client wraps an LLM implementation with both rate limiting and circuit breaking. Every request first waits for rate limit tokens, then executes through the circuit breaker. This ensures that the application respects rate limits and handles failures gracefully.

Additional Useful Components

Beyond the core components already discussed, several additional utilities enhance LLM application development.

Prompt Template Component

Prompt engineering is critical for LLM applications, but hard-coding prompts makes them difficult to modify and test. A prompt template component provides a structured way to define, parameterize, and manage prompts.

export class PromptTemplate {
    private template: string;
    private description?: string;
    private logger: Logger;

    constructor(template: string, description?: string) {
        this.template = template;
        this.description = description;
        this.logger = createLogger('PromptTemplate');
    }

    public format(variables: Record<string, any>): string {
        let result = this.template;

        for (const [key, value] of Object.entries(variables)) {
            const placeholder = `\${${key}}`;
            result = result.replace(new RegExp(placeholder.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'g'), String(value));
        }

        // Check for unfilled placeholders
        const remainingPlaceholders = result.match(/\$\{[^}]+\}/g);
        if (remainingPlaceholders) {
            throw new Error(`Missing variables: ${remainingPlaceholders.join(', ')}`);
        }

        return result;
    }

    public getVariables(): string[] {
        const matches = this.template.match(/\$\{([^}]+)\}/g);
        if (!matches) return [];

        return matches.map(match => match.slice(2, -1));
    }

    public getDescription(): string | undefined {
        return this.description;
    }

    public getTemplate(): string {
        return this.template;
    }
}

export class PromptLibrary {
    private templates: Map<string, PromptTemplate>;
    private logger: Logger;

    constructor() {
        this.templates = new Map();
        this.logger = createLogger('PromptLibrary');
    }

    public register(name: string, template: PromptTemplate): void {
        this.templates.set(name, template);
        this.logger.info(`Registered prompt template: ${name}`);
    }

    public get(name: string): PromptTemplate {
        const template = this.templates.get(name);
        if (!template) {
            throw new Error(`Template not found: ${name}`);
        }
        return template;
    }

    public format(name: string, variables: Record<string, any>): string {
        const template = this.get(name);
        return template.format(variables);
    }

    public has(name: string): boolean {
        return this.templates.has(name);
    }

    public getAll(): string[] {
        return Array.from(this.templates.keys());
    }
}

Prompt templates use simple string substitution for variable replacement. This provides a straightforward syntax while preventing code injection vulnerabilities that could occur with more powerful templating systems.

A template library manages collections of prompts. Applications can define all prompts in one place and reference them by name. This separation of prompts from code makes it easy to experiment with different phrasings and maintain consistency across the application.

Logging Component

Understanding LLM application behavior requires comprehensive logging. A logging component standardizes log formatting and provides utilities for tracking LLM interactions.

export enum LogLevel {
    DEBUG = 0,
    INFO = 1,
    WARNING = 2,
    ERROR = 3,
    CRITICAL = 4
}

export interface Logger {
    debug(message: string, ...args: any[]): void;
    info(message: string, ...args: any[]): void;
    warn(message: string, ...args: any[]): void;
    error(message: string, ...args: any[]): void;
    critical(message: string, ...args: any[]): void;
}

class LoggerImpl implements Logger {
    private name: string;
    private level: LogLevel;
    private logFile?: string;

    constructor(name: string, level: LogLevel = LogLevel.INFO, logFile?: string) {
        this.name = name;
        this.level = level;
        this.logFile = logFile;
    }

    public debug(message: string, ...args: any[]): void {
        this.log(LogLevel.DEBUG, message, ...args);
    }

    public info(message: string, ...args: any[]): void {
        this.log(LogLevel.INFO, message, ...args);
    }

    public warn(message: string, ...args: any[]): void {
        this.log(LogLevel.WARNING, message, ...args);
    }

    public error(message: string, ...args: any[]): void {
        this.log(LogLevel.ERROR, message, ...args);
    }

    public critical(message: string, ...args: any[]): void {
        this.log(LogLevel.CRITICAL, message, ...args);
    }

    private log(level: LogLevel, message: string, ...args: any[]): void {
        if (level < this.level) return;

        const timestamp = new Date().toISOString();
        const levelName = LogLevel[level];
        const formattedMessage = `${timestamp} - ${this.name} - ${levelName} - ${message}`;

        console.log(formattedMessage, ...args);

        if (this.logFile) {
            this.writeToFile(formattedMessage, args);
        }
    }

    private async writeToFile(message: string, args: any[]): Promise<void> {
        if (!this.logFile) return;

        const fullMessage = args.length > 0 
            ? `${message} ${JSON.stringify(args)}\n`
            : `${message}\n`;

        try {
            await fs.appendFile(this.logFile, fullMessage, 'utf-8');
        } catch (error) {
            console.error('Failed to write to log file:', error);
        }
    }
}

export function createLogger(name: string, level?: LogLevel, logFile?: string): Logger {
    return new LoggerImpl(name, level, logFile);
}

export class LLMLogger {
    private logger: Logger;
    private logFile?: string;

    constructor(name: string, logFile?: string) {
        this.logger = createLogger(name);
        this.logFile = logFile;
    }

    public async logCompletion(
        messages: Message[],
        response: CompletionResponse,
        duration: number,
        metadata?: Record<string, any>
    ): Promise<void> {
        const logEntry = {
            timestamp: new Date().toISOString(),
            type: 'completion',
            model: response.model,
            durationSeconds: duration,
            tokenUsage: response.usage,
            messageCount: messages.length,
            finishReason: response.finishReason,
            metadata: metadata || {}
        };

        this.logger.info(`Completion: ${response.model} (${duration.toFixed(2)}s)`);

        if (this.logFile) {
            await this.writeStructuredLog(logEntry);
        }
    }

    private async writeStructuredLog(entry: Record<string, any>): Promise<void> {
        if (!this.logFile) return;

        try {
            await fs.appendFile(
                this.logFile,
                JSON.stringify(entry) + '\n',
                'utf-8'
            );
        } catch (error) {
            this.logger.error('Failed to write structured log', error);
        }
    }
}

The LLM logger creates structured logs that can be analyzed to understand usage patterns, costs, and performance. Each completion is logged with timing information, token usage, and custom metadata.

Complete Running Example

The following complete example demonstrates how all components integrate to create a functional LLM application. This application provides a conversational interface with web search capabilities, configuration management, and resilience mechanisms.

import * as readline from 'readline';
import * as path from 'path';

export class LLMApplication {
    private config!: ApplicationConfig;
    private configManager: ConfigurationManager;
    private gpuDetector: GPUDetector;
    private acceleratorInfo?: AcceleratorInfo;
    private llm!: BaseLLM;
    private toolRegistry: ToolRegistry;
    private chatHistory!: ChatHistory;
    private resilientClient!: ResilientLLMClient;
    private llmLogger: LLMLogger;
    private logger: Logger;

    constructor(configPath: string) {
        this.logger = createLogger('LLMApplication');
        this.logger.info('Initializing LLM Application');

        this.configManager = new ConfigurationManager(configPath);
        this.gpuDetector = new GPUDetector();
        this.toolRegistry = new ToolRegistry();
        this.llmLogger = new LLMLogger('llm_interactions', 'llm_interactions.jsonl');
    }

    public async initialize(): Promise<void> {
        // Load configuration
        this.config = await this.configManager.load();

        // Detect GPU
        this.acceleratorInfo = await this.gpuDetector.detect();
        this.logger.info(`Using accelerator: ${this.acceleratorInfo.acceleratorType}`);

        // Initialize LLM
        await this.initializeLLM();

        // Register tools
        this.registerTools();

        // Initialize chat history
        this.chatHistory = new ChatHistory(
            this.config.llm.systemMessage,
            this.config.llm.contextWindow
        );

        // Initialize resilient client
        this.resilientClient = new ResilientLLMClient(
            this.llm,
            60,
            5
        );

        this.logger.info('Application initialization complete');
    }

    private async initializeLLM(): Promise<void> {
        const modelName = this.config.llm.modelName;

        if (modelName.startsWith('gpt-') || modelName.startsWith('claude-')) {
            const apiKey = this.config.apiKeys.openai;
            if (!apiKey) {
                throw new Error('OpenAI API key required for GPT models');
            }

            this.logger.info(`Initializing OpenAI LLM: ${modelName}`);
            this.llm = new OpenAILLM(modelName, apiKey);
        } else {
            const device = this.gpuDetector.getDeviceString();
            this.logger.info(`Initializing local LLM: ${modelName} on ${device}`);

            this.llm = new LocalLlamaLLM(modelName, device, {
                contextSize: this.config.llm.contextWindow
            });

            if (this.llm instanceof LocalLlamaLLM) {
                await this.llm.initialize();
            }
        }
    }

    private registerTools(): void {
        const searchTool = new WebSearchTool(5);
        this.toolRegistry.register(searchTool);
        this.logger.info('Registered web search tool');
    }

    public async processUserInput(userInput: string): Promise<string> {
        this.logger.info(`Processing user input: ${userInput.substring(0, 50)}...`);

        this.chatHistory.addUserMessage(userInput);

        let responseText = '';

        if (userInput.toLowerCase().includes('search') || userInput.toLowerCase().includes('find')) {
            const searchResult = await this.toolRegistry.executeTool('web_search', {
                query: userInput,
                maxResults: 3
            });

            if (searchResult.success) {
                const resultsText = this.formatSearchResults(searchResult.result);

                const augmentedInput = `User asked: ${userInput}\n\nHere are relevant search results:\n${resultsText}\n\nPlease provide a helpful response based on this information.`;

                const messages = this.chatHistory.getMessages();
                messages[messages.length - 1].content = augmentedInput;
            }
        }

        const messages = this.chatHistory.getMessagesForLLM();

        const startTime = Date.now();

        try {
            const response = await this.resilientClient.complete(messages, {
                temperature: this.config.llm.temperature,
                maxTokens: this.config.llm.maxTokens,
                topP: this.config.llm.topP,
                topK: this.config.llm.topK
            });

            const duration = (Date.now() - startTime) / 1000;
            responseText = response.content;

            await this.llmLogger.logCompletion(messages, response, duration);

            this.chatHistory.addAssistantMessage(responseText);

            this.logger.info(`Generated response in ${duration.toFixed(2)}s`);
        } catch (error) {
            this.logger.error('Error generating response', error);
            responseText = 'I apologize, but I encountered an error processing your request. Please try again.';
        }

        return responseText;
    }

    private formatSearchResults(searchData: any): string {
        const results = searchData.results || [];
        const formatted = results.map((result: any) => 
            `[${result.position}] ${result.title}\nURL: ${result.url}\n${result.snippet}\n`
        );

        return formatted.join('\n');
    }

    public async saveConversation(filePath: string): Promise<void> {
        await this.chatHistory.save(filePath);
        this.logger.info(`Saved conversation to ${filePath}`);
    }

    public async loadConversation(filePath: string): Promise<void> {
        this.chatHistory = await ChatHistory.load(filePath);
        this.logger.info(`Loaded conversation from ${filePath}`);
    }

    public async runInteractive(): Promise<void> {
        console.log('LLM Application Started');
        console.log(`Using model: ${this.config.llm.modelName}`);
        console.log(`Accelerator: ${this.acceleratorInfo?.acceleratorType}`);
        console.log("Type 'quit' to exit, 'save' to save conversation, 'clear' to clear history\n");

        const rl = readline.createInterface({
            input: process.stdin,
            output: process.stdout
        });

        const askQuestion = (query: string): Promise<string> => {
            return new Promise(resolve => rl.question(query, resolve));
        };

        while (true) {
            try {
                const userInput = await askQuestion('You: ');

                if (!userInput.trim()) {
                    continue;
                }

                if (userInput.toLowerCase() === 'quit') {
                    console.log('Goodbye!');
                    rl.close();
                    break;
                }

                if (userInput.toLowerCase() === 'save') {
                    const filename = `conversation_${Date.now()}.json`;
                    await this.saveConversation(filename);
                    console.log(`Conversation saved to ${filename}`);
                    continue;
                }

                if (userInput.toLowerCase() === 'clear') {
                    this.chatHistory.clear();
                    console.log('Conversation history cleared');
                    continue;
                }

                const response = await this.processUserInput(userInput);
                console.log(`\nAssistant: ${response}\n`);
            } catch (error) {
                this.logger.error('Error in interactive loop', error);
                console.log(`Error: ${error instanceof Error ? error.message : String(error)}`);
            }
        }
    }

    public async dispose(): Promise<void> {
        if (this.llm instanceof LocalLlamaLLM) {
            await this.llm.dispose();
        }
        this.logger.info('Application disposed');
    }
}

export async function createDefaultConfig(outputPath: string): Promise<void> {
    const config: ApplicationConfig = {
        llm: {
            modelName: 'gpt-3.5-turbo',
            temperature: 0.7,
            maxTokens: 2048,
            topP: 0.95,
            topK: 50,
            contextWindow: 4096,
            systemMessage: 'You are a helpful AI assistant with access to web search. Provide accurate, helpful responses.'
        },
        apiKeys: {},
        loggingLevel: 'INFO',
        enableStreaming: true,
        maxRetries: 3,
        timeoutSeconds: 60
    };

    const manager = new ConfigurationManager();
    await manager.save(config, outputPath);
    console.log(`Created default configuration at ${outputPath}`);
}

async function main(): Promise<void> {
    const args = process.argv.slice(2);
    const configPath = args.find(arg => arg.startsWith('--config='))?.split('=')[1] || 'config.yaml';
    const createConfig = args.includes('--create-config');

    if (createConfig) {
        await createDefaultConfig(configPath);
        return;
    }

    try {
        await fs.access(configPath);
    } catch {
        console.log(`Configuration file not found: ${configPath}`);
        console.log('Create one with: node app.js --create-config');
        return;
    }

    const app = new LLMApplication(configPath);
    
    try {
        await app.initialize();
        await app.runInteractive();
    } finally {
        await app.dispose();
    }
}

if (require.main === module) {
    main().catch(error => {
        console.error('Fatal error:', error);
        process.exit(1);
    });
}

This complete example demonstrates how all components work together. The application initializes by loading configuration, detecting GPU hardware, setting up the LLM, registering tools, and creating the chat history manager. The processUserInput method orchestrates the entire flow: adding messages to history, potentially invoking tools, generating completions through the resilient client, and logging interactions.

The interactive loop provides a simple command-line interface where users can have conversations, save and load conversation history, and clear the context. The application handles errors gracefully and provides informative logging throughout.

To use this application, users first create a configuration file:

node app.js --create-config

Then edit the configuration to add API keys and adjust parameters. Finally, run the application:

node app.js --config=config.yaml

The application demonstrates production-ready patterns including proper error handling, comprehensive logging, configuration management, resource cleanup, and graceful degradation when services are unavailable.

Conclusion

This article has presented a comprehensive TypeScript library for LLM application development. Each component addresses a specific recurring challenge: GPU detection eliminates platform-specific code, the abstract LLM interface enables model swapping, configuration management externalizes settings, tool calling extends LLM capabilities, MCP integration enables context sharing, message management handles conversation state, and circuit breakers with rate limiting provide resilience.

The components follow clean architecture principles with clear separation of concerns. Each component has a well-defined interface and can be used independently or in combination with others. TypeScript's powerful type system provides compile-time safety and excellent developer experience, catching errors before they reach production.

The running example demonstrates how these components integrate to create a complete, production-ready application. By providing these reusable components, the library eliminates the need for developers to repeatedly solve the same problems. Instead of spending time on infrastructure, developers can focus on the unique aspects of their applications: domain-specific logic, user experience, and business value.

The library is designed to be extensible. New LLM providers can be added by implementing the BaseLLM interface. New tools can be registered with the tool registry. Additional resilience mechanisms can wrap the existing components. This extensibility ensures that the library can evolve with the rapidly changing LLM ecosystem.

Future enhancements could include streaming response support in more components, integration with vector databases for retrieval-augmented generation, support for multi-modal models, enhanced observability with metrics and tracing, and WebSocket support for real-time applications. The foundation provided by these components makes such enhancements straightforward to implement while maintaining backward compatibility.

The goal of this library is to accelerate LLM application development by providing robust, well-tested components that handle common requirements. By building on this foundation, developers can create sophisticated LLM applications more quickly and with greater confidence in their reliability and maintainability.

Hitchhiker's Guide to AI, Software Architecture, and Everything Else

Friday, June 05, 2026

Building a TypeScript Library for LLM Application Development