Friday, December 19, 2025

RUNNING GIANTS ON A BUDGET: THE oLLM REVOLUTION




INTRODUCTION    

Imagine having the power to analyze a 500-page legal contract, process thousands of patient medical records, or extract insights from gigabytes of server logs, all on your personal computer with a modest graphics card. This sounds like science fiction when you consider that modern large language models often require hundreds of gigabytes of memory and enterprise-grade hardware. Yet this is precisely the challenge that oLLM, a lightweight Python library created by developer Mega4alik, sets out to solve. The framework represents a paradigm shift in how we think about running large language models locally, making what was once impossible on consumer hardware not just possible, but practical. oLLM can be found at https://github.com/Mega4alik/ollm.

THE MEMORY WALL: UNDERSTANDING THE CHALLENGE

To appreciate what oLLM accomplishes, we must first understand the fundamental challenge it addresses. Modern large language models are extraordinarily capable but equally demanding. A model like Qwen3-Next-80B contains 160 gigabytes of weights stored in floating-point format. When you want to use this model for inference, traditional approaches require loading all these weights into your GPU's video memory. The problem becomes even more severe when dealing with long contexts. As the model processes text, it builds what's called a KV cache, which stands for Key-Value cache. This cache stores intermediate computations that allow the model to remember what it has already processed. For a context of 50,000 tokens, this cache alone can consume dozens of gigabytes of additional memory.
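
To make these numbers concrete, the back-of-the-envelope calculation below estimates both the weight footprint and the KV cache footprint. The layer count, KV-head count, and head dimension are illustrative assumptions, not the exact Qwen3-Next-80B configuration; the precise cache size depends on the model's architecture.

# Illustrative memory arithmetic. The layer count, KV-head count, and head
# dimension are assumptions for the example, not the real model configuration.
def kv_cache_gb(tokens, layers=48, kv_heads=8, head_dim=128, bytes_per_value=2):
    # factor of 2 covers keys and values; bytes_per_value=2 assumes fp16/bf16
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value / 1e9

weights_gb = 80e9 * 2 / 1e9        # 80B parameters x 2 bytes (bf16) = 160 GB
cache_gb = kv_cache_gb(50_000)     # roughly 10 GB with these assumptions;
                                   # several times larger without grouped-query attention

print(f"weights ~{weights_gb:.0f} GB, KV cache ~{cache_gb:.1f} GB")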

Most consumer graphics cards come with 8 to 16 gigabytes of VRAM. The arithmetic is brutally simple: you cannot fit 160 gigabytes of model weights plus tens of gigabytes of KV cache into 8 gigabytes of memory. Traditional solutions involve quantization, a technique that reduces the precision of the model's numbers from 16-bit floating point to 8-bit or even 4-bit integers. While this dramatically reduces memory requirements, it comes at a cost. Quantization degrades model quality, sometimes subtly and sometimes significantly. For critical applications like medical analysis or legal document review, this quality loss may be unacceptable.

THE oLLM VISION: PRECISION WITHOUT COMPROMISE

The core philosophy behind oLLM is elegantly simple yet revolutionary in execution. Rather than accepting the trade-off between memory constraints and model quality, oLLM asks a different question: what if we don't need everything in memory at once? Modern computers have fast NVMe solid-state drives capable of reading data at several gigabytes per second. What if we could stream model weights and cache data between the SSD and GPU on demand, keeping only what we need in VRAM at any given moment?

This approach requires rethinking the entire inference pipeline. Traditional frameworks load a model once and keep it resident in memory. oLLM instead treats inference as a streaming operation, where data flows continuously between storage, system memory, and GPU memory. The framework is built on top of two industry-standard foundations: Huggingface Transformers and PyTorch. This choice ensures compatibility with the vast ecosystem of existing models while adding the innovative memory management layer that makes large-context inference possible on modest hardware.

ARCHITECTURE: THE ENGINEERING BEHIND THE MAGIC

The oLLM architecture employs several sophisticated techniques working in concert to achieve its remarkable efficiency. Understanding these components reveals the elegance of the solution.

The first and most fundamental technique is layer-wise weight streaming from SSD. Transformer models, which form the basis of modern LLMs, are organized into sequential layers. Each layer performs a specific transformation on the data before passing it to the next layer. oLLM exploits this sequential structure by loading only one layer's weights at a time. When the forward pass reaches a particular layer, oLLM loads that layer's weights from the SSD into GPU memory, performs the computation, and then can discard those weights to make room for the next layer. This approach transforms the memory requirement from "entire model size" to "single layer size plus overhead," a reduction of one to two orders of magnitude.
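
The sketch below illustrates the idea in simplified form. It is not oLLM's actual implementation; the per-layer checkpoint files and the build_layer() helper are assumptions made for the example.

import torch

# Simplified sketch of layer-wise weight streaming (illustrative only).
# Each layer's weights live on disk as a separate file and occupy VRAM
# only for the duration of that layer's forward pass.
def forward_streaming(hidden, layer_files, build_layer, device="cuda:0"):
    for path in layer_files:
        state = torch.load(path, map_location="cpu")  # read one layer from SSD
        layer = build_layer()                         # construct an empty layer module
        layer.load_state_dict(state)
        layer.to(device)
        hidden = layer(hidden)                        # compute with this layer only
        del layer, state                              # free VRAM before the next layer
        torch.cuda.empty_cache()
    return hidden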

The second critical innovation addresses the KV cache challenge. During text generation, the model processes input tokens and generates output tokens one at a time. For each token, the attention mechanism needs to attend to all previous tokens in the context. The KV cache stores the key and value vectors for all previous tokens, allowing the model to avoid recomputing them. For long contexts, this cache grows enormous. oLLM offloads this cache to the SSD, loading back only the portions needed for the current computation. The framework carefully manages this offloading to minimize the performance impact of disk I/O operations.
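
A simplified sketch of the offloading idea follows; the file layout and helper names are hypothetical, and oLLM's real cache manager batches and schedules these transfers far more carefully.

import torch

# Sketch of parking KV cache chunks on the SSD between attention steps
# (illustrative only).
def save_kv_chunk(layer_idx, step, keys, values, cache_dir="./kv_cache"):
    torch.save({"k": keys.cpu(), "v": values.cpu()},
               f"{cache_dir}/layer{layer_idx}_step{step}.pt")

def load_kv_chunk(layer_idx, step, device="cuda:0", cache_dir="./kv_cache"):
    chunk = torch.load(f"{cache_dir}/layer{layer_idx}_step{step}.pt")
    return chunk["k"].to(device), chunk["v"].to(device)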

CPU offloading provides an additional flexibility layer. For systems with ample system RAM but limited VRAM, oLLM can stage layer weights in CPU memory rather than reading directly from SSD. This creates a three-tier memory hierarchy: SSD for permanent storage, CPU RAM for staging, and GPU VRAM for active computation. The framework automatically manages data movement across these tiers.

FlashAttention-2 integration addresses another critical bottleneck. The attention mechanism in transformers computes a matrix of attention scores between all pairs of tokens. For long contexts, this matrix becomes prohibitively large. FlashAttention-2 uses an online softmax algorithm that computes attention in chunks, never materializing the full attention matrix. This prevents VRAM spikes that would otherwise occur during attention computation. oLLM incorporates FlashAttention-2 to ensure that attention operations remain memory-efficient even for contexts of 100,000 tokens.
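
The following single-head sketch shows the online-softmax idea that FlashAttention-2 builds on: keys and values are visited chunk by chunk, and a running maximum and denominator are rescaled as each chunk arrives, so the full row of attention scores is never materialized. This is a teaching sketch, not the fused CUDA kernel oLLM actually uses.

import torch

# Chunked attention with an online softmax (single head, single query, for illustration).
# q: (1, d); keys, values: (n, d)
def chunked_attention(q, keys, values, chunk=1024):
    d = q.shape[-1]
    running_max = torch.tensor(float("-inf"))
    denom = torch.tensor(0.0)
    out = torch.zeros_like(q)
    for start in range(0, keys.shape[0], chunk):
        k, v = keys[start:start + chunk], values[start:start + chunk]
        scores = (q @ k.T) / d**0.5               # (1, c) scores for this chunk only
        new_max = torch.maximum(running_max, scores.max())
        scale = torch.exp(running_max - new_max)   # rescale what we accumulated so far
        weights = torch.exp(scores - new_max)
        denom = denom * scale + weights.sum()
        out = out * scale + weights @ v
        running_max = new_max
    return out / denom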

The final architectural component is chunked MLP processing. The Multi-Layer Perceptron (MLP) component of each transformer layer includes projection matrices that can be quite large. oLLM chunks these projections, processing them in smaller pieces that fit comfortably in VRAM. This chunking adds minimal computational overhead while preventing memory spikes.
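
A simplified sketch of the chunking idea, assuming the large projection matrix is staged off-GPU and moved over one slice at a time:

import torch

# Sketch of chunking a large MLP projection so only one slice of the weight
# matrix is resident in VRAM at a time (illustrative only).
def chunked_projection(x, weight, chunk=4096, device="cuda:0"):
    # weight: (hidden_dim, intermediate_dim), kept on CPU
    # x: (tokens, hidden_dim), already on the GPU
    outputs = []
    for start in range(0, weight.shape[1], chunk):
        w_slice = weight[:, start:start + chunk].to(device)  # move one slice to the GPU
        outputs.append(x @ w_slice)                           # (tokens, chunk)
        del w_slice                                           # free the slice before the next one
    return torch.cat(outputs, dim=1)                          # (tokens, intermediate_dim)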

Together, these techniques enable oLLM to run an 80-billion parameter model with a 50,000 token context using approximately 7.5 gigabytes of VRAM and 180 gigabytes of SSD space. The framework maintains full fp16 or bf16 precision throughout, ensuring no quality degradation compared to running the same model on high-end hardware.

GETTING STARTED: FROM INSTALLATION TO FIRST INFERENCE

Beginning your journey with oLLM requires setting up a proper Python environment and installing the necessary dependencies. The recommended approach uses a virtual environment to isolate oLLM and its dependencies from other Python projects on your system. This isolation prevents version conflicts and makes the installation reproducible.

Creating a virtual environment starts with a simple command. The following demonstrates the complete setup process from environment creation through oLLM installation:

python3 -m venv ollm_env
source ollm_env/bin/activate
pip install --upgrade pip

With the virtual environment activated, you have two installation options. The simplest approach uses pip to install the latest stable release directly from the Python Package Index:

pip install ollm

For users who want access to the latest features and improvements, installing directly from the GitHub repository provides the cutting-edge version. This approach clones the repository and installs oLLM in editable mode, meaning changes to the source code immediately affect the installed package:

git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .

For NVIDIA GPU users, installing the kvikio library provides additional performance benefits. This library enables GPU-direct storage access, allowing even faster data transfer between SSD and GPU. The installation command must match your CUDA version. For CUDA 12, the command would be:

pip install kvikio-cu12

If you plan to work with multimodal models that process audio, additional dependencies are required. The voxtral-small-24B model, which handles both audio and text, needs the Mistral audio processing libraries and librosa for audio manipulation:

pip install "mistral-common[audio]"
pip install librosa

With installation complete, you can proceed to your first inference. The oLLM library provides a straightforward Inference class that handles all the complexity of memory management behind a simple interface. Let's examine a complete example that demonstrates basic text generation:

from ollm import Inference, TextStreamer

# Initialize the inference engine with a specific model
# The device parameter specifies which GPU to use
# Logging provides visibility into what oLLM is doing
inference_engine = Inference(
    "llama3-8B-chat",
    device="cuda:0",
    logging=True
)

# Define the prompt for the model
# This can be a question, instruction, or any text
user_prompt = "Explain how photosynthesis works in plants, focusing on the light-dependent and light-independent reactions."

# Generate the response using streaming
# The TextStreamer allows us to see tokens as they're generated
print("Model Response:")
for token in inference_engine.generate(user_prompt, streamer=TextStreamer()):
    print(token, end="", flush=True)
print()  # New line after generation completes

This example demonstrates several important concepts. The Inference class constructor takes the model name as its first argument. oLLM supports several pre-configured models including various Llama3 variants, GPT-OSS-20B, Qwen3-Next-80B, and others. The device parameter specifies which GPU to use, following PyTorch's device naming convention. Setting logging to True provides detailed information about what oLLM is doing during inference, which is invaluable for understanding performance and troubleshooting issues.

The generate method accepts the prompt and optional parameters. The streamer parameter enables token-by-token output, allowing you to see the model's response as it's generated rather than waiting for the entire response to complete. This streaming capability is particularly valuable for long responses, providing immediate feedback that the model is working.
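
If you prefer to work with the complete response as a single string, for example to post-process it, you can collect the streamed output first. This is a hedged sketch based on the example above, which suggests that generate yields text chunks; standard generation arguments such as max_new_tokens and temperature come from Huggingface Transformers, on which oLLM is built, but check the oLLM documentation for the exact parameters it exposes.

from ollm import Inference

engine = Inference("llama3-8B-chat", device="cuda:0", logging=True)

# Collect the streamed chunks into one string (assumes generate yields text
# chunks, as in the example above).
tokens = engine.generate("List three key differences between TCP and UDP.")
full_response = "".join(tokens)
print(full_response)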

ADVANCED USAGE: CUSTOM MODELS AND ADAPTERS

Beyond the pre-configured models, oLLM provides the AutoInference class for working with custom models and fine-tuned adapters. This capability is essential for organizations that have fine-tuned models for specific domains or tasks. The AutoInference class supports PEFT adapters, which stands for Parameter-Efficient Fine-Tuning. PEFT techniques like LoRA allow fine-tuning large models by training only a small number of additional parameters, making it practical to create specialized versions of large models.

Consider a scenario where you have fine-tuned a Gemma3-12B model for medical document analysis. The base model provides general language understanding, while your adapter specializes it for medical terminology and reasoning. Loading and using this configuration with oLLM looks like this:

from ollm import AutoInference

# Initialize AutoInference with both base model and adapter
# The base model provides the foundation
# The adapter adds domain-specific capabilities
medical_inference = AutoInference(
    model_path="./models/gemma3-12B",
    adapter_dir="./adapters/medical-analysis/checkpoint-50",
    device="cuda:0",
    multimodality=False,
    logging=True
)

# Process a medical document excerpt
medical_text = """
Patient presents with acute onset dyspnea and chest pain.
Vital signs: BP 145/92, HR 108, RR 24, SpO2 91% on room air.
Physical exam reveals decreased breath sounds in right lower lobe
with dullness to percussion. Chest X-ray shows right-sided pleural effusion.
"""

analysis_prompt = f"Analyze the following clinical presentation and suggest possible diagnoses with supporting evidence:\n\n{medical_text}"

# Generate the analysis
print("Clinical Analysis:")
for response_chunk in medical_inference.generate(analysis_prompt):
    print(response_chunk, end="", flush=True)
print()

This example illustrates how AutoInference seamlessly combines a base model with a specialized adapter. The model_path parameter points to the directory containing your base model in Huggingface format. The adapter_dir parameter points to a PEFT checkpoint directory, typically created during fine-tuning. The multimodality parameter indicates whether the model should expect multimodal inputs like images or audio. For text-only applications, this should be False.

The real power of this approach becomes apparent when you consider the resource efficiency. The base Gemma3-12B model might be 24 gigabytes, but the PEFT adapter might be only 100 megabytes. You can maintain multiple specialized adapters for different domains and swap between them without needing to store multiple full-sized models. oLLM's memory management means you can run these configurations on the same modest hardware that would struggle to load even one full model using traditional approaches.
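
As a sketch of that workflow, assuming the directory layout from the earlier example, a single base model on disk can back several domain adapters; the adapter paths below are hypothetical placeholders.

from ollm import AutoInference

# One ~24 GB base model on disk, one ~100 MB adapter per domain.
def load_specialist(adapter_dir):
    return AutoInference(
        model_path="./models/gemma3-12B",
        adapter_dir=adapter_dir,
        device="cuda:0",
        multimodality=False,
    )

legal_inference = load_specialist("./adapters/legal-review/checkpoint-80")
medical_inference = load_specialist("./adapters/medical-analysis/checkpoint-50")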

MULTIMODAL CAPABILITIES: BEYOND TEXT

Modern AI applications increasingly require processing multiple modalities simultaneously. A medical application might need to analyze both patient records and medical imaging. A customer service system might process both voice recordings and transcripts. oLLM extends its efficient memory management to multimodal models, enabling these advanced applications on consumer hardware.

The framework supports two primary multimodal configurations. The gemma3-12B model can process both images and text, enabling applications like document understanding, visual question answering, and image captioning. The voxtral-small-24B model processes audio and text, supporting applications like audio transcription, audio question answering, and voice-based interaction.

Working with multimodal models requires additional setup but follows the same fundamental patterns. For image-text models, you would initialize the inference engine with multimodality enabled:

from ollm import AutoInference
import PIL.Image

# Initialize for multimodal inference
# multimodality=True enables image processing capabilities
vision_inference = AutoInference(
    model_path="./models/gemma3-12B",
    device="cuda:0",
    multimodality=True,
    logging=True
)

# Load an image for analysis
# PIL (Python Imaging Library) handles image loading
document_image = PIL.Image.open("./documents/contract_page_5.png")

# Create a multimodal prompt
# The model will analyze both the image and text instruction
analysis_request = "Extract all monetary amounts and associated dates from this document page. Present them in a structured format."

# Generate analysis combining visual and textual understanding
result = vision_inference.generate(
    prompt=analysis_request,
    image=document_image
)

print("Extracted Information:")
for output_token in result:
    print(output_token, end="", flush=True)
print()

This example demonstrates document analysis, a common business application. The model receives both an image of a document page and a text instruction about what to extract. The vision-language model can understand the visual layout, read text within the image, and follow the textual instruction to produce structured output. All of this happens on consumer hardware thanks to oLLM's memory management.

Audio processing follows a similar pattern but requires audio-specific preprocessing. The voxtral-small-24B model can process audio files and answer questions about their content:

from ollm import AutoInference
import librosa

# Initialize for audio-text multimodal inference
audio_inference = AutoInference(
    model_path="./models/voxtral-small-24B",
    device="cuda:0",
    multimodality=True,
    logging=True
)

# Load and preprocess audio file
# librosa handles audio loading and preprocessing
# sr=16000 sets the sample rate to 16kHz, standard for speech
audio_data, sample_rate = librosa.load(
    "./audio/customer_call_recording.wav",
    sr=16000
)

# Create a prompt about the audio content
audio_question = "Summarize the main issues discussed in this customer service call and identify any action items mentioned."

# Process audio and generate response
summary = audio_inference.generate(
    prompt=audio_question,
    audio=audio_data
)

print("Call Summary:")
for summary_token in summary:
    print(summary_token, end="", flush=True)
print()

These multimodal capabilities open up applications that would be impractical with text-only models. Customer service analysis, medical imaging interpretation, document digitization, and accessibility tools all become feasible on local hardware without cloud dependencies.

THE BENEFITS: WHY CHOOSE oLLM

The advantages of oLLM extend far beyond simply making large models run on small GPUs. The framework enables entirely new workflows and addresses concerns that cloud-based solutions cannot.

Privacy and data sovereignty represent perhaps the most compelling benefit for many organizations. When processing sensitive documents like medical records, legal contracts, or confidential business data, sending that information to cloud APIs creates risk. Data breaches, unauthorized access, and compliance violations become concerns. With oLLM, all processing happens locally. The data never leaves your infrastructure. For healthcare organizations subject to HIPAA regulations, financial institutions dealing with PII, or legal firms handling privileged communications, this local processing capability is not just beneficial but often required.

Cost efficiency provides another significant advantage. Cloud API pricing for large language models typically charges per token processed. For applications processing large contexts, these costs accumulate rapidly. Analyzing a 100,000 token document might cost several dollars per analysis. If you need to process thousands of such documents, the costs become prohibitive. oLLM requires only the upfront cost of hardware and electricity. Once you have a suitable computer with an SSD and modest GPU, the marginal cost of each additional inference is essentially zero. For high-volume applications, this cost structure is transformative.

Full precision inference ensures quality matches what you would get from much more expensive hardware. Quantization techniques that compress models to fit in limited memory inevitably degrade quality. For some applications, this degradation is acceptable. For others, particularly in professional domains like medicine or law, the quality loss is unacceptable. oLLM maintains full fp16 or bf16 precision throughout the inference pipeline, ensuring that the only difference from running on a high-end server is speed, not quality.

The framework's flexibility in handling long contexts enables entirely new application categories. Many cloud APIs limit context length to reduce computational costs. These limits make certain applications impossible. Analyzing an entire book, processing a day's worth of server logs, or reviewing a complete patient history might exceed these limits. oLLM supports contexts up to 100,000 tokens, limited only by available SSD space and patience with processing time. This capability enables comprehensive analysis that would require chunking and multiple API calls with cloud services.

Offline operation provides reliability and independence. Cloud services experience outages, rate limiting, and API changes. Applications built on oLLM continue functioning regardless of internet connectivity or service availability. For critical applications, embedded systems, or deployment in environments with limited connectivity, this independence is essential.

Experimentation and development benefit enormously from local execution. When developing and testing LLM applications, rapid iteration is valuable. Cloud API costs can make experimentation expensive, leading developers to minimize testing. With oLLM, you can run unlimited experiments without worrying about API bills. This freedom accelerates development and encourages thorough testing.

REAL-WORLD APPLICATIONS: WHERE oLLM EXCELS

Understanding where oLLM provides the most value helps in deciding whether it fits your use case. The framework excels in specific scenarios while being less suitable for others.

Legal document analysis represents an ideal application. Law firms routinely work with contracts, regulations, and case law spanning hundreds of pages. Analyzing these documents for specific clauses, precedents, or compliance issues requires processing entire documents in context. The sensitive nature of legal documents makes cloud processing problematic. oLLM enables lawyers to analyze complete contracts locally, asking questions like "What are all the termination clauses in this agreement and under what conditions do they activate?" The model can process the entire contract in one pass, maintaining context across all sections.

Medical record analysis faces similar requirements. A patient's medical history might span decades and hundreds of pages. Understanding patterns, identifying risk factors, or preparing for specialist consultations benefits from analyzing the complete history. HIPAA regulations make cloud processing complex and risky. A healthcare provider using oLLM could process complete patient histories locally, asking questions like "What patterns emerge in this patient's blood pressure readings over the past five years, and how do they correlate with medication changes?" The analysis happens entirely within the provider's infrastructure, maintaining compliance and privacy.

Compliance and regulatory analysis in financial services involves processing extensive documentation against complex rule sets. A compliance officer might need to analyze years of communications to identify potential violations. The volume of data and sensitivity of communications make this challenging with cloud services. oLLM enables processing complete communication archives locally, searching for patterns that might indicate compliance issues.

Research and academic applications often involve analyzing large corpora of text. A literature review might involve processing hundreds of papers to identify trends, methodologies, or gaps in research. Grant-funded research might prohibit sending data to commercial cloud services. Researchers using oLLM can process entire paper collections locally, asking questions like "What methodologies have been used to study this phenomenon, and how have they evolved over time?"

Log file analysis for system administration and security involves processing gigabytes of log data to identify issues, security threats, or performance patterns. These logs often contain sensitive information about infrastructure and users. A system administrator using oLLM could process a day's worth of logs, asking "What unusual patterns appear in authentication logs, and do they correlate with any system errors or performance degradation?"

Historical chat analysis for customer service improvement involves processing thousands of support conversations to identify common issues, training needs, or product problems. These conversations contain customer information that should remain private. A customer service manager using oLLM could analyze complete chat histories locally, identifying trends without exposing customer data to external services.

RAG pipeline experimentation benefits from oLLM's flexibility. Retrieval Augmented Generation combines document retrieval with language model generation. Developing effective RAG systems requires extensive experimentation with different retrieval strategies, context sizes, and prompting approaches. oLLM's unlimited local inference enables this experimentation without API costs.
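
A minimal sketch of such an experiment loop on top of oLLM follows. The retrieve() function is a hypothetical placeholder for whichever retriever you are evaluating (BM25, dense embeddings, and so on); the point is that every variation of the pipeline can be tested locally with no per-token cost.

from ollm import Inference

engine = Inference("llama3-8B-chat", device="cuda:0")

def answer_with_context(question, retrieve, top_k=5):
    passages = retrieve(question, top_k=top_k)   # hypothetical retriever under test
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return "".join(engine.generate(prompt))      # assumes generate yields text chunks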

UNDERSTANDING THE TRADE-OFFS: WHERE oLLM IS NOT IDEAL

Honesty about limitations is as important as celebrating capabilities. oLLM makes specific trade-offs that make it unsuitable for certain applications.

Real-time interactive applications require low latency responses. Chat interfaces, voice assistants, and interactive tools need responses within seconds. oLLM's approach of streaming data from SSD introduces latency. Depending on the model size, context length, and hardware, generating responses might take tens of seconds or even minutes. This latency makes oLLM unsuitable for applications where users expect immediate responses. For these use cases, smaller models that fit entirely in VRAM, quantized models, or cloud APIs provide better user experience.

Throughput-critical applications that need to process many short requests quickly also face challenges with oLLM. The framework optimizes for large contexts and high-quality inference, not for maximum throughput. If you need to process thousands of short classification tasks per minute, specialized inference servers like vLLM or TensorRT-LLM provide better performance. oLLM's strength lies in processing fewer, larger, more complex requests where quality and privacy matter more than speed.

Applications requiring the absolute latest models might encounter delays. oLLM supports many popular models, but adding support for new architectures requires development work. Cloud APIs often support new models immediately upon release. If your application requires using models released days ago, cloud services provide faster access.

PERFORMANCE CONSIDERATIONS: OPTIMIZING YOUR SETUP

Getting the best performance from oLLM requires understanding how your hardware affects inference speed and making appropriate optimizations.

SSD speed directly impacts inference performance. oLLM streams gigabytes of data from storage during inference. NVMe SSDs provide dramatically better performance than SATA SSDs. Within NVMe drives, newer PCIe 4.0 and 5.0 drives offer higher bandwidth than PCIe 3.0 drives. For optimal performance, use the fastest NVMe SSD available in your system. If you have multiple SSDs, storing model weights on the fastest drive improves performance.

System RAM amount affects whether CPU offloading can be used effectively. With 32GB or more of system RAM, oLLM can stage model layers in CPU memory rather than reading from SSD for each layer. This staging reduces disk I/O and can significantly improve performance. With limited system RAM, direct SSD-to-GPU streaming remains the only option.

GPU VRAM size determines how much of the model and cache can remain resident. While oLLM works with 8GB GPUs, having 12GB or 16GB allows keeping more data in VRAM, reducing the amount of streaming required. This doesn't change whether a model can run, but it affects how fast it runs.

The kvikio library provides GPU-direct storage access on supported systems. This technology allows the GPU to read directly from NVMe storage without CPU involvement, reducing latency and CPU overhead. Installing kvikio and ensuring your system supports GPU-direct storage can provide measurable performance improvements.

Model selection affects both quality and speed. Larger models generally provide better quality but run slower. For your specific application, testing different model sizes helps find the optimal balance. An 8B-parameter model might provide sufficient quality while running much faster than an 80B-parameter model.

Context length directly impacts memory requirements and processing time. Using the minimum context length necessary for your application improves performance. If your documents are typically 20,000 tokens, configuring for 30,000 token context provides headroom while avoiding the overhead of supporting 100,000 tokens.

THE FUTURE: WHERE oLLM IS HEADING

The oLLM project continues active development, with ongoing improvements to performance, model support, and capabilities. Understanding the development direction helps in planning long-term adoption.

Support for additional model architectures expands as new models gain popularity. The framework's architecture makes adding new models relatively straightforward, and the community contributes support for models they need. As new efficient architectures emerge, oLLM adoption of these innovations brings their benefits to resource-constrained environments.

Performance optimizations continue as the developers identify bottlenecks and implement improvements. Each release typically includes speed improvements alongside new features. The fundamental approach of streaming data between storage and GPU has room for optimization through better caching strategies, predictive loading, and more efficient memory management.

Community contributions drive much of oLLM's evolution. As more users adopt the framework for diverse applications, they contribute bug fixes, optimizations, and new features. This community-driven development ensures oLLM evolves to meet real-world needs.

CONCLUSION: DEMOCRATIZING ACCESS TO LARGE LANGUAGE MODELS

oLLM represents more than just a clever engineering solution to a memory constraint problem. It represents a fundamental democratization of access to large language model capabilities. Before oLLM, running state-of-the-art large models required either expensive cloud API subscriptions or investment in enterprise-grade hardware with hundreds of gigabytes of GPU memory. This created a divide between those who could afford these resources and those who could not.

By enabling high-quality, long-context inference on consumer hardware, oLLM makes sophisticated AI capabilities accessible to individual researchers, small organizations, and anyone with privacy or cost concerns that make cloud services unsuitable. The framework proves that with clever engineering, the seemingly impossible becomes not just possible but practical.

For organizations handling sensitive data, oLLM provides a path to leveraging large language models while maintaining data sovereignty and compliance. For researchers and developers, it enables unlimited experimentation without API costs. For anyone working with large documents or long contexts, it provides capabilities that cloud services often cannot match.

The trade-offs are real and important to understand. oLLM is not the right choice for every application. Real-time interactive uses, high-throughput scenarios, and situations where absolute cutting-edge model access matters more than other considerations are better served by other solutions. But for the substantial category of applications where quality, privacy, cost, and long-context capabilities matter most, oLLM opens doors that were previously closed.

As the project continues to evolve, as hardware continues to improve, and as the community continues to contribute, oLLM's capabilities will only expand. The fundamental insight that drives the framework, that we can trade some speed for the ability to run models that would otherwise be impossible on given hardware, will remain relevant as models continue to grow larger and more capable.

Whether you are a healthcare provider needing to analyze patient records, a legal professional reviewing contracts, a researcher processing academic literature, or a developer building privacy-focused AI applications, oLLM deserves consideration. It represents a different approach to the large language model inference problem, one that prioritizes accessibility, privacy, and quality over raw speed. In doing so, it makes powerful AI capabilities available to a much broader audience, advancing the democratization of artificial intelligence technology.

Building an LLM Chatbot in Java with Ollama: A Complete Developer's Guide




Introduction and Overview


Large Language Models (LLMs) have revolutionized the way we interact with artificial intelligence, enabling natural language conversations and sophisticated text generation capabilities. This comprehensive guide will walk you through creating a fully functional LLM chatbot in Java using Ollama, a local LLM runtime that allows you to run models on your own machine without relying on external APIs.


An LLM chatbot is essentially a conversational interface that leverages the power of large language models to understand user input and generate contextually appropriate responses. Unlike traditional rule-based chatbots that rely on predefined responses, LLM chatbots can engage in more natural, flexible conversations by understanding context and generating responses dynamically.


Ollama serves as our local LLM runtime, providing a simple way to run various open-source language models locally. This approach offers several advantages including privacy, reduced latency, offline capability, and cost control since you are not making API calls to external services.


Understanding LLM Chatbot Architecture


Before diving into the implementation, it is crucial to understand the fundamental architecture of an LLM chatbot. The system consists of several key components that work together to create a seamless conversational experience.


The core architecture includes a user interface layer that handles input and output, a conversation management system that maintains context and history, an LLM integration layer that communicates with the language model, and a response processing system that formats and delivers responses back to the user.


The conversation flow typically follows this pattern: the user provides input through the interface, the system processes this input and adds it to the conversation context, the enhanced prompt is sent to the LLM through Ollama, the model generates a response, and finally the response is processed and displayed to the user while updating the conversation history.


Setting Up the Development Environment


To begin building our LLM chatbot, we need to establish a proper development environment with all necessary dependencies and tools. This setup phase is critical for ensuring smooth development and deployment of our application.


First, ensure you have Java Development Kit (JDK) 11 or higher installed on your system. We will be using Maven as our build tool, so make sure Maven is properly configured. Additionally, you will need to install Ollama on your local machine to serve as the LLM runtime.


Download and install Ollama from the official website. Once installed, you can pull a language model such as Llama 2 or Mistral by running the appropriate Ollama commands in your terminal. For this tutorial, we will use the Llama 2 model, which you can install by running "ollama pull llama2" in your command line.


Project Structure and Dependencies


Our Java project will follow a clean architecture pattern with clear separation of concerns. The project structure will include packages for models, services, controllers, and utilities, ensuring maintainable and scalable code.


Create a new Maven project and configure the following project structure:



src/

  main/

    java/

      com/

        fortytwo/

          llmchatbot/

            model/

            service/

            controller/

            util/

            Application.java

    resources/

      application.properties

pom.xml



The Maven configuration file (pom.xml) needs to include dependencies for HTTP client functionality, JSON processing, and logging. Here is the complete pom.xml configuration:



<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0"

         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 

         http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

    

    <groupId>com.fortytwo</groupId>

    <artifactId>llm-chatbot</artifactId>

    <version>1.0.0</version>

    <packaging>jar</packaging>

    

    <name>LLM Chatbot</name>

    <description>A Java-based LLM chatbot using Ollama</description>

    

    <properties>

        <maven.compiler.source>11</maven.compiler.source>

        <maven.compiler.target>11</maven.compiler.target>

        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

    </properties>

    

    <dependencies>

        <dependency>

            <groupId>com.fasterxml.jackson.core</groupId>

            <artifactId>jackson-databind</artifactId>

            <version>2.15.2</version>

        </dependency>

        

        <dependency>

            <groupId>org.apache.httpcomponents.client5</groupId>

            <artifactId>httpclient5</artifactId>

            <version>5.2.1</version>

        </dependency>

        

        <dependency>

            <groupId>org.slf4j</groupId>

            <artifactId>slf4j-api</artifactId>

            <version>2.0.7</version>

        </dependency>

        

        <dependency>

            <groupId>ch.qos.logback</groupId>

            <artifactId>logback-classic</artifactId>

            <version>1.4.8</version>

        </dependency>

        

        <dependency>

            <groupId>junit</groupId>

            <artifactId>junit</artifactId>

            <version>4.13.2</version>

            <scope>test</scope>

        </dependency>

    </dependencies>

    

    <build>

        <plugins>

            <plugin>

                <groupId>org.apache.maven.plugins</groupId>

                <artifactId>maven-compiler-plugin</artifactId>

                <version>3.11.0</version>

                <configuration>

                    <source>11</source>

                    <target>11</target>

                </configuration>

            </plugin>

            

            <plugin>

                <groupId>org.codehaus.mojo</groupId>

                <artifactId>exec-maven-plugin</artifactId>

                <version>3.1.0</version>

                <configuration>

                    <mainClass>com.fortytwo.llmchatbot.Application</mainClass>

                </configuration>

            </plugin>

        </plugins>

    </build>

</project>


This configuration includes Jackson for JSON processing, Apache HTTP Client for making requests to Ollama, SLF4J and Logback for logging, and JUnit for testing. The exec plugin allows us to easily run our application from Maven.


Creating Data Models


The foundation of our chatbot lies in well-designed data models that represent the various entities in our system. These models will handle conversation messages, LLM requests and responses, and conversation context.


First, let's create a Message model that represents individual messages in our conversation:



package com.fortytwo.llmchatbot.model;


import java.time.LocalDateTime;

import java.util.Objects;


/**

 * Represents a single message in a conversation.

 * This class encapsulates the content, sender information, and timestamp

 * of each message exchanged between the user and the chatbot.

 */

public class Message {

    

    /**

     * Enumeration defining the possible roles/senders of a message

     */

    public enum Role {

        USER("user"),

        ASSISTANT("assistant"),

        SYSTEM("system");

        

        private final String value;

        

        Role(String value) {

            this.value = value;

        }

        

        public String getValue() {

            return value;

        }

        

        public static Role fromString(String value) {

            for (Role role : Role.values()) {

                if (role.value.equalsIgnoreCase(value)) {

                    return role;

                }

            }

            throw new IllegalArgumentException("Unknown role: " + value);

        }

    }

    

    private String content;

    private Role role;

    private LocalDateTime timestamp;

    

    /**

     * Default constructor for JSON deserialization

     */

    public Message() {

        this.timestamp = LocalDateTime.now();

    }

    

    /**

     * Constructor for creating a new message

     * @param content The text content of the message

     * @param role The role of the message sender (USER, ASSISTANT, SYSTEM)

     */

    public Message(String content, Role role) {

        this.content = content;

        this.role = role;

        this.timestamp = LocalDateTime.now();

    }

    

    /**

     * Gets the content of the message

     * @return The message content as a string

     */

    public String getContent() {

        return content;

    }

    

    /**

     * Sets the content of the message

     * @param content The new content for the message

     */

    public void setContent(String content) {

        this.content = content;

    }

    

    /**

     * Gets the role of the message sender

     * @return The role enum value

     */

    public Role getRole() {

        return role;

    }

    

    /**

     * Sets the role of the message sender

     * @param role The role enum value

     */

    public void setRole(Role role) {

        this.role = role;

    }

    

    /**

     * Gets the timestamp when the message was created

     * @return LocalDateTime representing the creation time

     */

    public LocalDateTime getTimestamp() {

        return timestamp;

    }

    

    /**

     * Sets the timestamp of the message

     * @param timestamp The new timestamp

     */

    public void setTimestamp(LocalDateTime timestamp) {

        this.timestamp = timestamp;

    }

    

    @Override

    public boolean equals(Object o) {

        if (this == o) return true;

        if (o == null || getClass() != o.getClass()) return false;

        Message message = (Message) o;

        return Objects.equals(content, message.content) &&

               role == message.role &&

               Objects.equals(timestamp, message.timestamp);

    }

    

    @Override

    public int hashCode() {

        return Objects.hash(content, role, timestamp);

    }

    

    @Override

    public String toString() {

        return String.format("Message{role=%s, content='%s', timestamp=%s}", 

                           role, content, timestamp);

    }

}



Next, we need models to handle communication with the Ollama API. Let's create the OllamaRequest model:



package com.fortytwo.llmchatbot.model;


import com.fasterxml.jackson.annotation.JsonProperty;

import java.util.List;

import java.util.Objects;


/**

 * Represents a request to the Ollama API for chat completion.

 * This class encapsulates all the parameters needed to make a request

 * to the Ollama service for generating responses.

 */

public class OllamaRequest {

    

    @JsonProperty("model")

    private String model;

    

    @JsonProperty("messages")

    private List<OllamaMessage> messages;

    

    @JsonProperty("stream")

    private boolean stream;

    

    @JsonProperty("options")

    private OllamaOptions options;

    

    /**

     * Default constructor

     */

    public OllamaRequest() {

        this.stream = false; // Default to non-streaming mode

    }

    

    /**

     * Constructor with essential parameters

     * @param model The name of the model to use (e.g., "llama2")

     * @param messages List of messages forming the conversation context

     */

    public OllamaRequest(String model, List<OllamaMessage> messages) {

        this.model = model;

        this.messages = messages;

        this.stream = false;

    }

    

    /**

     * Gets the model name

     * @return The model name string

     */

    public String getModel() {

        return model;

    }

    

    /**

     * Sets the model name

     * @param model The model name to use

     */

    public void setModel(String model) {

        this.model = model;

    }

    

    /**

     * Gets the list of messages

     * @return List of OllamaMessage objects

     */

    public List<OllamaMessage> getMessages() {

        return messages;

    }

    

    /**

     * Sets the list of messages

     * @param messages List of OllamaMessage objects

     */

    public void setMessages(List<OllamaMessage> messages) {

        this.messages = messages;

    }

    

    /**

     * Checks if streaming is enabled

     * @return true if streaming is enabled, false otherwise

     */

    public boolean isStream() {

        return stream;

    }

    

    /**

     * Sets the streaming mode

     * @param stream true to enable streaming, false to disable

     */

    public void setStream(boolean stream) {

        this.stream = stream;

    }

    

    /**

     * Gets the options for the request

     * @return OllamaOptions object

     */

    public OllamaOptions getOptions() {

        return options;

    }

    

    /**

     * Sets the options for the request

     * @param options OllamaOptions object containing model parameters

     */

    public void setOptions(OllamaOptions options) {

        this.options = options;

    }

    

    @Override

    public boolean equals(Object o) {

        if (this == o) return true;

        if (o == null || getClass() != o.getClass()) return false;

        OllamaRequest that = (OllamaRequest) o;

        return stream == that.stream &&

               Objects.equals(model, that.model) &&

               Objects.equals(messages, that.messages) &&

               Objects.equals(options, that.options);

    }

    

    @Override

    public int hashCode() {

        return Objects.hash(model, messages, stream, options);

    }

    

    @Override

    public String toString() {

        return String.format("OllamaRequest{model='%s', messages=%s, stream=%s, options=%s}", 

                           model, messages, stream, options);

    }

}



We also need an OllamaMessage model that represents messages in the format expected by Ollama:



package com.fortytwo.llmchatbot.model;


import com.fasterxml.jackson.annotation.JsonProperty;

import java.util.Objects;


/**

 * Represents a message in the format expected by the Ollama API.

 * This class is used for serialization when sending requests to Ollama.

 */

public class OllamaMessage {

    

    @JsonProperty("role")

    private String role;

    

    @JsonProperty("content")

    private String content;

    

    /**

     * Default constructor for JSON deserialization

     */

    public OllamaMessage() {

    }

    

    /**

     * Constructor for creating an Ollama message

     * @param role The role of the message sender (user, assistant, system)

     * @param content The content of the message

     */

    public OllamaMessage(String role, String content) {

        this.role = role;

        this.content = content;

    }

    

    /**

     * Gets the role of the message sender

     * @return The role as a string

     */

    public String getRole() {

        return role;

    }

    

    /**

     * Sets the role of the message sender

     * @param role The role string

     */

    public void setRole(String role) {

        this.role = role;

    }

    

    /**

     * Gets the content of the message

     * @return The message content

     */

    public String getContent() {

        return content;

    }

    

    /**

     * Sets the content of the message

     * @param content The message content

     */

    public void setContent(String content) {

        this.content = content;

    }

    

    @Override

    public boolean equals(Object o) {

        if (this == o) return true;

        if (o == null || getClass() != o.getClass()) return false;

        OllamaMessage that = (OllamaMessage) o;

        return Objects.equals(role, that.role) &&

               Objects.equals(content, that.content);

    }

    

    @Override

    public int hashCode() {

        return Objects.hash(role, content);

    }

    

    @Override

    public String toString() {

        return String.format("OllamaMessage{role='%s', content='%s'}", role, content);

    }

}



Implementing the Ollama Client Service


The Ollama client service is the core component that handles communication with the local Ollama instance. This service encapsulates all the HTTP communication logic and provides a clean interface for the rest of our application to interact with the LLM.


The service needs to handle HTTP requests, JSON serialization and deserialization, error handling, and connection management. Let's create a comprehensive OllamaService class:



package com.fortytwo.llmchatbot.service;


import com.fasterxml.jackson.databind.ObjectMapper;

import com.fortytwo.llmchatbot.model.*;

import org.apache.hc.client5.http.classic.methods.HttpPost;

import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;

import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;

import org.apache.hc.client5.http.impl.classic.HttpClients;

import org.apache.hc.core5.http.ContentType;

import org.apache.hc.core5.http.io.entity.EntityUtils;

import org.apache.hc.core5.http.io.entity.StringEntity;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;


import java.io.IOException;

import java.util.ArrayList;

import java.util.List;


/**

 * Service class for communicating with the Ollama API.

 * This class handles all HTTP communication with the local Ollama instance,

 * including request formatting, response parsing, and error handling.

 */

public class OllamaService {

    

    private static final Logger logger = LoggerFactory.getLogger(OllamaService.class);

    

    private final String baseUrl;

    private final String model;

    private final CloseableHttpClient httpClient;

    private final ObjectMapper objectMapper;

    

    /**

     * Constructor with default configuration

     * Uses default Ollama URL (http://localhost:11434) and llama2 model

     */

    public OllamaService() {

        this("http://localhost:11434", "llama2");

    }

    

    /**

     * Constructor with custom configuration

     * @param baseUrl The base URL of the Ollama instance

     * @param model The name of the model to use

     */

    public OllamaService(String baseUrl, String model) {

        this.baseUrl = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;

        this.model = model;

        this.httpClient = HttpClients.createDefault();

        this.objectMapper = new ObjectMapper();

        

        logger.info("Initialized OllamaService with URL: {} and model: {}", this.baseUrl, this.model);

    }

    

    /**

     * Sends a chat completion request to Ollama

     * @param messages List of messages forming the conversation context

     * @return The response from the LLM

     * @throws OllamaServiceException if there's an error communicating with Ollama

     */

    public String generateResponse(List<Message> messages) throws OllamaServiceException {

        try {

            // Convert internal Message objects to OllamaMessage format

            List<OllamaMessage> ollamaMessages = convertToOllamaMessages(messages);

            

            // Create the request object

            OllamaRequest request = new OllamaRequest(model, ollamaMessages);

            

            // Serialize the request to JSON

            String requestJson = objectMapper.writeValueAsString(request);

            logger.debug("Sending request to Ollama: {}", requestJson);

            

            // Create HTTP POST request

            HttpPost httpPost = new HttpPost(baseUrl + "/api/chat");

            httpPost.setEntity(new StringEntity(requestJson, ContentType.APPLICATION_JSON));

            httpPost.setHeader("Accept", "application/json");

            

            // Execute the request

            try (CloseableHttpResponse response = httpClient.execute(httpPost)) {

                int statusCode = response.getCode();

                String responseBody = EntityUtils.toString(response.getEntity());

                

                logger.debug("Received response from Ollama. Status: {}, Body: {}", statusCode, responseBody);

                

                if (statusCode == 200) {

                    // Parse the response

                    OllamaResponse ollamaResponse = objectMapper.readValue(responseBody, OllamaResponse.class);

                    return extractMessageContent(ollamaResponse);

                } else {

                    throw new OllamaServiceException(

                        String.format("Ollama API returned error status %d: %s", statusCode, responseBody)

                    );

                }

            }

            

        } catch (IOException e) {

            logger.error("Error communicating with Ollama", e);

            throw new OllamaServiceException("Failed to communicate with Ollama: " + e.getMessage(), e);

        } catch (Exception e) {

            logger.error("Unexpected error in generateResponse", e);

            throw new OllamaServiceException("Unexpected error: " + e.getMessage(), e);

        }

    }

    

    /**

     * Converts internal Message objects to OllamaMessage format

     * @param messages List of internal Message objects

     * @return List of OllamaMessage objects

     */

    private List<OllamaMessage> convertToOllamaMessages(List<Message> messages) {

        List<OllamaMessage> ollamaMessages = new ArrayList<>();

        

        for (Message message : messages) {

            OllamaMessage ollamaMessage = new OllamaMessage(

                message.getRole().getValue(),

                message.getContent()

            );

            ollamaMessages.add(ollamaMessage);

        }

        

        return ollamaMessages;

    }

    

    /**

     * Extracts the message content from the Ollama response

     * @param response The OllamaResponse object

     * @return The content of the assistant's message

     * @throws OllamaServiceException if the response format is unexpected

     */

    private String extractMessageContent(OllamaResponse response) throws OllamaServiceException {

        if (response == null) {

            throw new OllamaServiceException("Received null response from Ollama");

        }

        

        OllamaMessage message = response.getMessage();

        if (message == null) {

            throw new OllamaServiceException("No message found in Ollama response");

        }

        

        String content = message.getContent();

        if (content == null || content.trim().isEmpty()) {

            throw new OllamaServiceException("Empty content received from Ollama");

        }

        

        return content.trim();

    }

    

    /**

     * Tests the connection to Ollama by making a simple request

     * @return true if the connection is successful, false otherwise

     */

    public boolean testConnection() {

        try {

            List<Message> testMessages = new ArrayList<>();

            testMessages.add(new Message("Hello", Message.Role.USER));

            

            generateResponse(testMessages);

            logger.info("Connection test to Ollama successful");

            return true;

            

        } catch (Exception e) {

            logger.warn("Connection test to Ollama failed: {}", e.getMessage());

            return false;

        }

    }

    

    /**

     * Gets the current model name

     * @return The model name

     */

    public String getModel() {

        return model;

    }

    

    /**

     * Gets the base URL

     * @return The base URL

     */

    public String getBaseUrl() {

        return baseUrl;

    }

    

    /**

     * Closes the HTTP client and releases resources

     */

    public void close() {

        try {

            httpClient.close();

            logger.info("OllamaService closed successfully");

        } catch (IOException e) {

            logger.warn("Error closing HTTP client", e);

        }

    }

}



We also need to create the OllamaResponse model and supporting classes:



package com.fortytwo.llmchatbot.model;


import com.fasterxml.jackson.annotation.JsonProperty;

import java.util.Objects;


/**

 * Represents a response from the Ollama API.

 * This class encapsulates the response structure returned by Ollama

 * after processing a chat completion request.

 */

public class OllamaResponse {

    

    @JsonProperty("model")

    private String model;

    

    @JsonProperty("created_at")

    private String createdAt;

    

    @JsonProperty("message")

    private OllamaMessage message;

    

    @JsonProperty("done")

    private boolean done;

    

    @JsonProperty("total_duration")

    private Long totalDuration;

    

    @JsonProperty("load_duration")

    private Long loadDuration;

    

    @JsonProperty("prompt_eval_count")

    private Integer promptEvalCount;

    

    @JsonProperty("prompt_eval_duration")

    private Long promptEvalDuration;

    

    @JsonProperty("eval_count")

    private Integer evalCount;

    

    @JsonProperty("eval_duration")

    private Long evalDuration;

    

    /**

     * Default constructor for JSON deserialization

     */

    public OllamaResponse() {

    }

    

    /**

     * Gets the model name used for the response

     * @return The model name

     */

    public String getModel() {

        return model;

    }

    

    /**

     * Sets the model name

     * @param model The model name

     */

    public void setModel(String model) {

        this.model = model;

    }

    

    /**

     * Gets the creation timestamp

     * @return The creation timestamp as a string

     */

    public String getCreatedAt() {

        return createdAt;

    }

    

    /**

     * Sets the creation timestamp

     * @param createdAt The creation timestamp

     */

    public void setCreatedAt(String createdAt) {

        this.createdAt = createdAt;

    }

    

    /**

     * Gets the message from the response

     * @return The OllamaMessage object

     */

    public OllamaMessage getMessage() {

        return message;

    }

    

    /**

     * Sets the message

     * @param message The OllamaMessage object

     */

    public void setMessage(OllamaMessage message) {

        this.message = message;

    }

    

    /**

     * Checks if the response is complete

     * @return true if the response is complete, false otherwise

     */

    public boolean isDone() {

        return done;

    }

    

    /**

     * Sets the done flag

     * @param done true if the response is complete

     */

    public void setDone(boolean done) {

        this.done = done;

    }

    

    /**

     * Gets the total duration of the request

     * @return Total duration in nanoseconds

     */

    public Long getTotalDuration() {

        return totalDuration;

    }

    

    /**

     * Sets the total duration

     * @param totalDuration Total duration in nanoseconds

     */

    public void setTotalDuration(Long totalDuration) {

        this.totalDuration = totalDuration;

    }

    

    /**

     * Gets the model load duration

     * @return Load duration in nanoseconds

     */

    public Long getLoadDuration() {

        return loadDuration;

    }

    

    /**

     * Sets the load duration

     * @param loadDuration Load duration in nanoseconds

     */

    public void setLoadDuration(Long loadDuration) {

        this.loadDuration = loadDuration;

    }

    

    /**

     * Gets the prompt evaluation count

     * @return Number of tokens in the prompt

     */

    public Integer getPromptEvalCount() {

        return promptEvalCount;

    }

    

    /**

     * Sets the prompt evaluation count

     * @param promptEvalCount Number of tokens in the prompt

     */

    public void setPromptEvalCount(Integer promptEvalCount) {

        this.promptEvalCount = promptEvalCount;

    }

    

    /**

     * Gets the prompt evaluation duration

     * @return Prompt evaluation duration in nanoseconds

     */

    public Long getPromptEvalDuration() {

        return promptEvalDuration;

    }

    

    /**

     * Sets the prompt evaluation duration

     * @param promptEvalDuration Prompt evaluation duration in nanoseconds

     */

    public void setPromptEvalDuration(Long promptEvalDuration) {

        this.promptEvalDuration = promptEvalDuration;

    }

    

    /**

     * Gets the evaluation count

     * @return Number of tokens in the response

     */

    public Integer getEvalCount() {

        return evalCount;

    }

    

    /**

     * Sets the evaluation count

     * @param evalCount Number of tokens in the response

     */

    public void setEvalCount(Integer evalCount) {

        this.evalCount = evalCount;

    }

    

    /**

     * Gets the evaluation duration

     * @return Evaluation duration in nanoseconds

     */

    public Long getEvalDuration() {

        return evalDuration;

    }

    

    /**

     * Sets the evaluation duration

     * @param evalDuration Evaluation duration in nanoseconds

     */

    public void setEvalDuration(Long evalDuration) {

        this.evalDuration = evalDuration;

    }

    

    @Override

    public boolean equals(Object o) {

        if (this == o) return true;

        if (o == null || getClass() != o.getClass()) return false;

        OllamaResponse that = (OllamaResponse) o;

        return done == that.done &&

               Objects.equals(model, that.model) &&

               Objects.equals(createdAt, that.createdAt) &&

               Objects.equals(message, that.message) &&

               Objects.equals(totalDuration, that.totalDuration) &&

               Objects.equals(loadDuration, that.loadDuration) &&

               Objects.equals(promptEvalCount, that.promptEvalCount) &&

               Objects.equals(promptEvalDuration, that.promptEvalDuration) &&

               Objects.equals(evalCount, that.evalCount) &&

               Objects.equals(evalDuration, that.evalDuration);

    }

    

    @Override

    public int hashCode() {

        return Objects.hash(model, createdAt, message, done, totalDuration, 

                          loadDuration, promptEvalCount, promptEvalDuration, 

                          evalCount, evalDuration);

    }

    

    @Override

    public String toString() {

        return String.format("OllamaResponse{model='%s', done=%s, message=%s}", 

                           model, done, message);

    }

}

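Because OllamaResponse is a plain Jackson mapping of the JSON returned by Ollama's chat endpoint, it can be exercised without a running server. The sketch below deserializes a hand-written sample payload; the JSON literal is representative rather than captured output, and it assumes that OllamaMessage (defined earlier in the tutorial) maps its role and content fields with Jackson annotations and a default constructor in the same way.


package com.fortytwo.llmchatbot;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fortytwo.llmchatbot.model.OllamaResponse;

public class OllamaResponseDemo {

    public static void main(String[] args) throws Exception {
        // Hand-written sample mirroring the fields mapped by @JsonProperty above
        String json = "{"
                + "\"model\":\"llama2\","
                + "\"created_at\":\"2025-12-19T10:00:00Z\","
                + "\"message\":{\"role\":\"assistant\",\"content\":\"Hello there!\"},"
                + "\"done\":true,"
                + "\"eval_count\":12"
                + "}";

        ObjectMapper mapper = new ObjectMapper();
        OllamaResponse response = mapper.readValue(json, OllamaResponse.class);

        System.out.println(response.getModel());                 // llama2
        System.out.println(response.getMessage().getContent());  // Hello there!
        System.out.println(response.isDone());                   // true
    }
}
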


We also need the OllamaOptions class and the custom exception:



package com.fortytwo.llmchatbot.model;


import com.fasterxml.jackson.annotation.JsonProperty;

import java.util.Objects;


/**

 * Represents options/parameters for Ollama requests.

 * This class allows fine-tuning of the model's behavior through various parameters.

 */

public class OllamaOptions {

    

    @JsonProperty("temperature")

    private Double temperature;

    

    @JsonProperty("top_p")

    private Double topP;

    

    @JsonProperty("top_k")

    private Integer topK;

    

    @JsonProperty("num_predict")

    private Integer numPredict;

    

    @JsonProperty("repeat_penalty")

    private Double repeatPenalty;

    

    /**

     * Default constructor

     */

    public OllamaOptions() {

    }

    

    /**

     * Gets the temperature parameter

     * @return Temperature value (0.0 to 2.0)

     */

    public Double getTemperature() {

        return temperature;

    }

    

    /**

     * Sets the temperature parameter

     * Controls randomness in the output. Lower values make output more deterministic.

     * @param temperature Temperature value (0.0 to 2.0)

     */

    public void setTemperature(Double temperature) {

        this.temperature = temperature;

    }

    

    /**

     * Gets the top_p parameter

     * @return Top-p value (0.0 to 1.0)

     */

    public Double getTopP() {

        return topP;

    }

    

    /**

     * Sets the top_p parameter

     * Controls diversity via nucleus sampling

     * @param topP Top-p value (0.0 to 1.0)

     */

    public void setTopP(Double topP) {

        this.topP = topP;

    }

    

    /**

     * Gets the top_k parameter

     * @return Top-k value

     */

    public Integer getTopK() {

        return topK;

    }

    

    /**

     * Sets the top_k parameter

     * Limits the number of highest probability tokens to consider

     * @param topK Top-k value

     */

    public void setTopK(Integer topK) {

        this.topK = topK;

    }

    

    /**

     * Gets the num_predict parameter

     * @return Number of tokens to predict

     */

    public Integer getNumPredict() {

        return numPredict;

    }

    

    /**

     * Sets the num_predict parameter

     * Maximum number of tokens to generate

     * @param numPredict Number of tokens to predict

     */

    public void setNumPredict(Integer numPredict) {

        this.numPredict = numPredict;

    }

    

    /**

     * Gets the repeat_penalty parameter

     * @return Repeat penalty value

     */

    public Double getRepeatPenalty() {

        return repeatPenalty;

    }

    

    /**

     * Sets the repeat_penalty parameter

     * Penalizes repetition in the output

     * @param repeatPenalty Repeat penalty value (typically 1.0 to 1.5)

     */

    public void setRepeatPenalty(Double repeatPenalty) {

        this.repeatPenalty = repeatPenalty;

    }

    

    @Override

    public boolean equals(Object o) {

        if (this == o) return true;

        if (o == null || getClass() != o.getClass()) return false;

        OllamaOptions that = (OllamaOptions) o;

        return Objects.equals(temperature, that.temperature) &&

               Objects.equals(topP, that.topP) &&

               Objects.equals(topK, that.topK) &&

               Objects.equals(numPredict, that.numPredict) &&

               Objects.equals(repeatPenalty, that.repeatPenalty);

    }

    

    @Override

    public int hashCode() {

        return Objects.hash(temperature, topP, topK, numPredict, repeatPenalty);

    }

    

    @Override

    public String toString() {

        return String.format("OllamaOptions{temperature=%s, topP=%s, topK=%s, numPredict=%s, repeatPenalty=%s}", 

                           temperature, topP, topK, numPredict, repeatPenalty);

    }

}
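
These options correspond to Ollama's sampling parameters. A conservative configuration for factual, bounded answers might look like the sketch below; how the options object is attached to the outgoing request is handled by the service layer shown earlier.


package com.fortytwo.llmchatbot;

import com.fortytwo.llmchatbot.model.OllamaOptions;

public class OllamaOptionsDemo {

    public static void main(String[] args) {
        OllamaOptions options = new OllamaOptions();
        options.setTemperature(0.2);    // low randomness for mostly deterministic answers
        options.setTopP(0.9);           // nucleus sampling cutoff
        options.setTopK(40);            // consider only the 40 most likely tokens
        options.setNumPredict(512);     // cap the length of the generated response
        options.setRepeatPenalty(1.1);  // mildly discourage repetition

        System.out.println(options);
    }
}
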



And the custom exception class:



package com.fortytwo.llmchatbot.service;


/**

 * Custom exception for Ollama service related errors.

 * This exception is thrown when there are issues communicating with

 * the Ollama API or processing responses.

 */

public class OllamaServiceException extends Exception {

    

    /**

     * Constructs a new OllamaServiceException with the specified detail message.

     * @param message The detail message

     */

    public OllamaServiceException(String message) {

        super(message);

    }

    

    /**

     * Constructs a new OllamaServiceException with the specified detail message and cause.

     * @param message The detail message

     * @param cause The cause of the exception

     */

    public OllamaServiceException(String message, Throwable cause) {

        super(message, cause);

    }

    

    /**

     * Constructs a new OllamaServiceException with the specified cause.

     * @param cause The cause of the exception

     */

    public OllamaServiceException(Throwable cause) {

        super(cause);

    }

}



Building the Conversation Management System


The conversation management system is responsible for maintaining the context and history of the conversation. This component ensures that the chatbot can understand the flow of the conversation and provide contextually relevant responses.


The conversation manager needs to handle message storage, context window management (to prevent exceeding token limits), and conversation state tracking; persisting conversations to durable storage is left as a later enhancement. Let's implement a comprehensive ConversationService:



package com.fortytwo.llmchatbot.service;


import com.fortytwo.llmchatbot.model.Message;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;


import java.util.ArrayList;

import java.util.Collections;

import java.util.List;

import java.util.concurrent.CopyOnWriteArrayList;


/**

 * Service for managing conversation state and history.

 * This class handles the storage and retrieval of conversation messages,

 * manages context windows, and provides conversation utilities.

 */

public class ConversationService {

    

    private static final Logger logger = LoggerFactory.getLogger(ConversationService.class);

    

    // Maximum number of messages to keep in context (to manage token limits)

    private static final int DEFAULT_MAX_CONTEXT_MESSAGES = 20;

    

    // System message that defines the chatbot's behavior

    private static final String DEFAULT_SYSTEM_MESSAGE = 

        "You are a helpful AI assistant. You provide accurate, helpful, and concise responses. " +

        "You maintain a friendly and professional tone in all interactions.";

    

    private final List<Message> conversationHistory;

    private final int maxContextMessages;

    private final String systemMessage;

    

    /**

     * Constructor with default configuration

     */

    public ConversationService() {

        this(DEFAULT_MAX_CONTEXT_MESSAGES, DEFAULT_SYSTEM_MESSAGE);

    }

    

    /**

     * Constructor with custom configuration

     * @param maxContextMessages Maximum number of messages to keep in context

     * @param systemMessage System message that defines the chatbot's behavior

     */

    public ConversationService(int maxContextMessages, String systemMessage) {

        this.maxContextMessages = maxContextMessages;

        this.systemMessage = systemMessage;

        this.conversationHistory = new CopyOnWriteArrayList<>();

        

        // Add the system message as the first message

        if (systemMessage != null && !systemMessage.trim().isEmpty()) {

            addMessage(new Message(systemMessage, Message.Role.SYSTEM));

        }

        

        logger.info("ConversationService initialized with max context messages: {}", maxContextMessages);

    }

    

    /**

     * Adds a message to the conversation history

     * @param message The message to add

     */

    public void addMessage(Message message) {

        if (message == null) {

            logger.warn("Attempted to add null message to conversation");

            return;

        }

        

        conversationHistory.add(message);

        logger.debug("Added message to conversation: {}", message);

        

        // Manage context window size

        manageContextWindow();

    }

    

    /**

     * Adds a user message to the conversation

     * @param content The content of the user message

     */

    public void addUserMessage(String content) {

        if (content == null || content.trim().isEmpty()) {

            logger.warn("Attempted to add empty user message");

            return;

        }

        

        Message userMessage = new Message(content.trim(), Message.Role.USER);

        addMessage(userMessage);

    }

    

    /**

     * Adds an assistant message to the conversation

     * @param content The content of the assistant message

     */

    public void addAssistantMessage(String content) {

        if (content == null || content.trim().isEmpty()) {

            logger.warn("Attempted to add empty assistant message");

            return;

        }

        

        Message assistantMessage = new Message(content.trim(), Message.Role.ASSISTANT);

        addMessage(assistantMessage);

    }

    

    /**

     * Gets the current conversation context for sending to the LLM

     * @return List of messages representing the current context

     */

    public List<Message> getConversationContext() {

        return new ArrayList<>(conversationHistory);

    }

    

    /**

     * Gets the full conversation history

     * @return Unmodifiable list of all messages in the conversation

     */

    public List<Message> getFullHistory() {

        return Collections.unmodifiableList(conversationHistory);

    }

    

    /**

     * Gets the last message in the conversation

     * @return The last message, or null if the conversation is empty

     */

    public Message getLastMessage() {

        if (conversationHistory.isEmpty()) {

            return null;

        }

        return conversationHistory.get(conversationHistory.size() - 1);

    }

    

    /**

     * Gets the last user message in the conversation

     * @return The last user message, or null if no user message exists

     */

    public Message getLastUserMessage() {

        for (int i = conversationHistory.size() - 1; i >= 0; i--) {

            Message message = conversationHistory.get(i);

            if (message.getRole() == Message.Role.USER) {

                return message;

            }

        }

        return null;

    }

    

    /**

     * Gets the number of messages in the conversation

     * @return The total number of messages

     */

    public int getMessageCount() {

        return conversationHistory.size();

    }

    

    /**

     * Checks if the conversation is empty (excluding system messages)

     * @return true if the conversation has no user or assistant messages

     */

    public boolean isEmpty() {

        return conversationHistory.stream()

            .noneMatch(msg -> msg.getRole() == Message.Role.USER || msg.getRole() == Message.Role.ASSISTANT);

    }

    

    /**

     * Clears the conversation history while preserving the system message

     */

    public void clearConversation() {

        conversationHistory.clear();

        

        // Re-add the system message

        if (systemMessage != null && !systemMessage.trim().isEmpty()) {

            addMessage(new Message(systemMessage, Message.Role.SYSTEM));

        }

        

        logger.info("Conversation cleared");

    }

    

    /**

     * Manages the context window by removing old messages when the limit is exceeded

     * Always preserves the system message and maintains conversation flow

     */

    private void manageContextWindow() {

        if (conversationHistory.size() <= maxContextMessages) {

            return;

        }

        

        // Find the system message (should be the first one)

        Message systemMsg = null;

        int systemMsgIndex = -1;

        

        for (int i = 0; i < conversationHistory.size(); i++) {

            if (conversationHistory.get(i).getRole() == Message.Role.SYSTEM) {

                systemMsg = conversationHistory.get(i);

                systemMsgIndex = i;

                break;

            }

        }

        

        // Calculate how many messages to remove

        int messagesToRemove = conversationHistory.size() - maxContextMessages;

        

        // Remove messages from the beginning (after system message) to maintain recent context

        int startRemovalIndex = systemMsgIndex + 1;

        int endRemovalIndex = startRemovalIndex + messagesToRemove;

        

        // Ensure we don't remove beyond available messages

        endRemovalIndex = Math.min(endRemovalIndex, conversationHistory.size());

        

        // Remove the messages

        for (int i = 0; i < messagesToRemove && startRemovalIndex < conversationHistory.size(); i++) {

            conversationHistory.remove(startRemovalIndex);

        }

        

        logger.debug("Removed {} messages from conversation context. Current size: {}", 

                    messagesToRemove, conversationHistory.size());

    }

    

    /**

     * Gets a summary of the conversation statistics

     * @return A string containing conversation statistics

     */

    public String getConversationSummary() {

        long userMessages = conversationHistory.stream()

            .mapToLong(msg -> msg.getRole() == Message.Role.USER ? 1 : 0)

            .sum();

        

        long assistantMessages = conversationHistory.stream()

            .mapToLong(msg -> msg.getRole() == Message.Role.ASSISTANT ? 1 : 0)

            .sum();

        

        long systemMessages = conversationHistory.stream()

            .mapToLong(msg -> msg.getRole() == Message.Role.SYSTEM ? 1 : 0)

            .sum();

        

        return String.format("Conversation Summary - Total: %d, User: %d, Assistant: %d, System: %d", 

                           conversationHistory.size(), userMessages, assistantMessages, systemMessages);

    }

    

    /**

     * Gets the system message

     * @return The system message content

     */

    public String getSystemMessage() {

        return systemMessage;

    }

    

    /**

     * Gets the maximum context messages setting

     * @return The maximum number of messages kept in context

     */

    public int getMaxContextMessages() {

        return maxContextMessages;

    }

}

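To see the context-window management in action, the following sketch pushes more messages into a ConversationService than its limit allows and then inspects the result: older turns are dropped while the system message and the most recent exchanges survive.


package com.fortytwo.llmchatbot;

import com.fortytwo.llmchatbot.service.ConversationService;

public class ConversationServiceDemo {

    public static void main(String[] args) {
        // Keep at most 5 messages (including the system message) for this demonstration
        ConversationService conversation =
                new ConversationService(5, "You are a terse assistant.");

        for (int i = 1; i <= 10; i++) {
            conversation.addUserMessage("Question " + i);
            conversation.addAssistantMessage("Answer " + i);
        }

        // The history has been trimmed to the configured maximum of 5 messages
        System.out.println(conversation.getConversationSummary());
        System.out.println("Most recent message: " + conversation.getLastMessage().getContent());
    }
}
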


Creating the Chatbot Controller


The chatbot controller serves as the orchestration layer that coordinates between the conversation service and the Ollama service. It handles the main chatbot logic, processes user input, manages the conversation flow, and handles error scenarios gracefully.



package com.fortytwo.llmchatbot.controller;


import com.fortytwo.llmchatbot.model.Message;

import com.fortytwo.llmchatbot.service.ConversationService;

import com.fortytwo.llmchatbot.service.OllamaService;

import com.fortytwo.llmchatbot.service.OllamaServiceException;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;


import java.util.List;


/**

 * Controller class that orchestrates the chatbot functionality.

 * This class coordinates between the conversation service and the Ollama service

 * to provide a complete chatbot experience.

 */

public class ChatbotController {

    

    private static final Logger logger = LoggerFactory.getLogger(ChatbotController.class);

    

    private final ConversationService conversationService;

    private final OllamaService ollamaService;

    private boolean isInitialized;

    

    /**

     * Constructor with default services

     */

    public ChatbotController() {

        this(new ConversationService(), new OllamaService());

    }

    

    /**

     * Constructor with custom services

     * @param conversationService The conversation service to use

     * @param ollamaService The Ollama service to use

     */

    public ChatbotController(ConversationService conversationService, OllamaService ollamaService) {

        this.conversationService = conversationService;

        this.ollamaService = ollamaService;

        this.isInitialized = false;

        

        logger.info("ChatbotController created with custom services");

    }

    

    /**

     * Initializes the chatbot by testing the connection to Ollama

     * @return true if initialization is successful, false otherwise

     */

    public boolean initialize() {

        logger.info("Initializing chatbot...");

        

        try {

            // Test connection to Ollama

            boolean connectionSuccessful = ollamaService.testConnection();

            

            if (connectionSuccessful) {

                isInitialized = true;

                logger.info("Chatbot initialized successfully");

                return true;

            } else {

                logger.error("Failed to connect to Ollama service");

                return false;

            }

            

        } catch (Exception e) {

            logger.error("Error during chatbot initialization", e);

            return false;

        }

    }

    

    /**

     * Processes a user message and generates a response

     * @param userInput The user's input message

     * @return The chatbot's response

     * @throws ChatbotException if there's an error processing the message

     */

    public String processMessage(String userInput) throws ChatbotException {

        if (!isInitialized) {

            throw new ChatbotException("Chatbot is not initialized. Call initialize() first.");

        }

        

        if (userInput == null || userInput.trim().isEmpty()) {

            throw new ChatbotException("User input cannot be empty");

        }

        

        String trimmedInput = userInput.trim();

        logger.info("Processing user message: {}", trimmedInput);

        

        try {

            // Add user message to conversation

            conversationService.addUserMessage(trimmedInput);

            

            // Get conversation context

            List<Message> context = conversationService.getConversationContext();

            

            // Generate response using Ollama

            String response = ollamaService.generateResponse(context);

            

            // Add assistant response to conversation

            conversationService.addAssistantMessage(response);

            

            logger.info("Generated response: {}", response);

            return response;

            

        } catch (OllamaServiceException e) {

            logger.error("Error generating response from Ollama", e);

            throw new ChatbotException("Failed to generate response: " + e.getMessage(), e);

        } catch (Exception e) {

            logger.error("Unexpected error processing message", e);

            throw new ChatbotException("Unexpected error: " + e.getMessage(), e);

        }

    }

    

    /**

     * Starts a new conversation by clearing the current history

     */

    public void startNewConversation() {

        conversationService.clearConversation();

        logger.info("Started new conversation");

    }

    

    /**

     * Gets the current conversation history

     * @return List of messages in the conversation

     */

    public List<Message> getConversationHistory() {

        return conversationService.getFullHistory();

    }

    

    /**

     * Gets a summary of the current conversation

     * @return String containing conversation statistics

     */

    public String getConversationSummary() {

        return conversationService.getConversationSummary();

    }

    

    /**

     * Checks if the chatbot is initialized and ready to use

     * @return true if initialized, false otherwise

     */

    public boolean isInitialized() {

        return isInitialized;

    }

    

    /**

     * Checks if the current conversation is empty

     * @return true if the conversation has no user or assistant messages

     */

    public boolean isConversationEmpty() {

        return conversationService.isEmpty();

    }

    

    /**

     * Gets the last message in the conversation

     * @return The last message, or null if conversation is empty

     */

    public Message getLastMessage() {

        return conversationService.getLastMessage();

    }

    

    /**

     * Gets information about the current model and configuration

     * @return String containing model information

     */

    public String getModelInfo() {

        return String.format("Model: %s, Base URL: %s, Max Context: %d messages", 

                           ollamaService.getModel(), 

                           ollamaService.getBaseUrl(),

                           conversationService.getMaxContextMessages());

    }

    

    /**

     * Performs a health check on the chatbot services

     * @return true if all services are healthy, false otherwise

     */

    public boolean healthCheck() {

        try {

            if (!isInitialized) {

                logger.warn("Health check failed: Chatbot not initialized");

                return false;

            }

            

            // Test Ollama connection

            boolean ollamaHealthy = ollamaService.testConnection();

            

            if (!ollamaHealthy) {

                logger.warn("Health check failed: Ollama service unhealthy");

                return false;

            }

            

            logger.info("Health check passed");

            return true;

            

        } catch (Exception e) {

            logger.error("Health check failed with exception", e);

            return false;

        }

    }

    

    /**

     * Shuts down the chatbot and releases resources

     */

    public void shutdown() {

        logger.info("Shutting down chatbot...");

        

        try {

            ollamaService.close();

            isInitialized = false;

            logger.info("Chatbot shutdown completed");

        } catch (Exception e) {

            logger.warn("Error during chatbot shutdown", e);

        }

    }

}



We also need the ChatbotException class:



package com.fortytwo.llmchatbot.controller;


/**

 * Custom exception for chatbot-related errors.

 * This exception is thrown when there are issues with chatbot operations

 * that are not specific to the underlying services.

 */

public class ChatbotException extends Exception {

    

    /**

     * Constructs a new ChatbotException with the specified detail message.

     * @param message The detail message

     */

    public ChatbotException(String message) {

        super(message);

    }

    

    /**

     * Constructs a new ChatbotException with the specified detail message and cause.

     * @param message The detail message

     * @param cause The cause of the exception

     */

    public ChatbotException(String message, Throwable cause) {

        super(message, cause);

    }

    

    /**

     * Constructs a new ChatbotException with the specified cause.

     * @param cause The cause of the exception

     */

    public ChatbotException(Throwable cause) {

        super(cause);

    }

}
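
The controller can also be driven programmatically, without the console interface built in the next section, which is handy for scripting or for embedding the chatbot in another application. The sketch below uses the default configuration and assumes Ollama is reachable on its default port.


package com.fortytwo.llmchatbot;

import com.fortytwo.llmchatbot.controller.ChatbotController;
import com.fortytwo.llmchatbot.controller.ChatbotException;

public class ChatbotControllerDemo {

    public static void main(String[] args) {
        ChatbotController controller = new ChatbotController();

        if (!controller.initialize()) {
            System.err.println("Could not reach Ollama - is the server running?");
            controller.shutdown();
            return;
        }

        try {
            String reply = controller.processMessage("Name three uses of the Strategy pattern.");
            System.out.println(reply);
            System.out.println(controller.getConversationSummary());
        } catch (ChatbotException e) {
            System.err.println("Chat failed: " + e.getMessage());
        } finally {
            controller.shutdown();
        }
    }
}
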



Implementing the User Interface


The user interface provides the interaction layer for users to communicate with the chatbot. We will implement a console-based interface that demonstrates all the chatbot functionality while being simple to understand and extend.



package com.fortytwo.llmchatbot.util;


import com.fortytwo.llmchatbot.controller.ChatbotController;

import com.fortytwo.llmchatbot.controller.ChatbotException;

import com.fortytwo.llmchatbot.model.Message;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;


import java.io.BufferedReader;

import java.io.IOException;

import java.io.InputStreamReader;

import java.time.format.DateTimeFormatter;

import java.util.List;


/**

 * Console-based user interface for the LLM chatbot.

 * This class provides an interactive command-line interface for users

 * to communicate with the chatbot and access various features.

 */

public class ConsoleInterface {

    

    private static final Logger logger = LoggerFactory.getLogger(ConsoleInterface.class);

    

    private final ChatbotController chatbotController;

    private final BufferedReader reader;

    private boolean isRunning;

    

    // ANSI color codes for better console output

    private static final String RESET = "\u001B[0m";

    private static final String BLUE = "\u001B[34m";

    private static final String GREEN = "\u001B[32m";

    private static final String RED = "\u001B[31m";

    private static final String YELLOW = "\u001B[33m";

    private static final String CYAN = "\u001B[36m";

    private static final String BOLD = "\u001B[1m";

    

    /**

     * Constructor

     * @param chatbotController The chatbot controller to use

     */

    public ConsoleInterface(ChatbotController chatbotController) {

        this.chatbotController = chatbotController;

        this.reader = new BufferedReader(new InputStreamReader(System.in));

        this.isRunning = false;

        

        logger.info("ConsoleInterface initialized");

    }

    

    /**

     * Starts the interactive console interface

     */

    public void start() {

        isRunning = true;

        

        printWelcomeMessage();

        

        // Initialize the chatbot

        if (!initializeChatbot()) {

            printError("Failed to initialize chatbot. Exiting...");

            return;

        }

        

        printHelp();

        

        // Main interaction loop

        while (isRunning) {

            try {

                printPrompt();

                String input = reader.readLine();

                

                if (input == null) {

                    break; // EOF reached

                }

                

                processUserInput(input.trim());

                

            } catch (IOException e) {

                printError("Error reading input: " + e.getMessage());

                logger.error("Error reading console input", e);

                break;

            }

        }

        

        shutdown();

    }

    

    /**

     * Processes user input and executes appropriate actions

     * @param input The user input string

     */

    private void processUserInput(String input) {

        if (input.isEmpty()) {

            return;

        }

        

        // Handle special commands

        if (input.startsWith("/")) {

            handleCommand(input);

            return;

        }

        

        // Process as regular chat message

        try {

            String response = chatbotController.processMessage(input);

            printAssistantMessage(response);

            

        } catch (ChatbotException e) {

            printError("Error processing message: " + e.getMessage());

            logger.error("Error processing user message", e);

        }

    }

    

    /**

     * Handles special commands

     * @param command The command string

     */

    private void handleCommand(String command) {

        String cmd = command.toLowerCase();

        

        switch (cmd) {

            case "/help":

            case "/h":

                printHelp();

                break;

                

            case "/new":

            case "/n":

                chatbotController.startNewConversation();

                printInfo("Started new conversation");

                break;

                

            case "/history":

            case "/hist":

                printConversationHistory();

                break;

                

            case "/summary":

            case "/sum":

                printInfo(chatbotController.getConversationSummary());

                break;

                

            case "/model":

            case "/m":

                printInfo(chatbotController.getModelInfo());

                break;

                

            case "/health":

                boolean healthy = chatbotController.healthCheck();

                if (healthy) {

                    printSuccess("All systems healthy");

                } else {

                    printError("Health check failed");

                }

                break;

                

            case "/clear":

            case "/cls":

                clearScreen();

                break;

                

            case "/quit":

            case "/q":

            case "/exit":

                isRunning = false;

                printInfo("Goodbye!");

                break;

                

            default:

                printError("Unknown command: " + command);

                printInfo("Type /help for available commands");

                break;

        }

    }

    

    /**

     * Initializes the chatbot

     * @return true if initialization is successful, false otherwise

     */

    private boolean initializeChatbot() {

        printInfo("Initializing chatbot...");

        

        boolean initialized = chatbotController.initialize();

        

        if (initialized) {

            printSuccess("Chatbot initialized successfully!");

            printInfo(chatbotController.getModelInfo());

            return true;

        } else {

            return false;

        }

    }

    

    /**

     * Prints the welcome message

     */

    private void printWelcomeMessage() {

        System.out.println();

        System.out.println(BOLD + BLUE + "╔══════════════════════════════════════════════════════════╗" + RESET);

        System.out.println(BOLD + BLUE + "║" + RESET + "                   " + BOLD + "LLM Chatbot" + RESET + "                      " + BOLD + BLUE + "║" + RESET);

        System.out.println(BOLD + BLUE + "║" + RESET + "              Powered by Ollama & Java              " + BOLD + BLUE + "║" + RESET);

        System.out.println(BOLD + BLUE + "╚══════════════════════════════════════════════════════════╝" + RESET);

        System.out.println();

    }

    

    /**

     * Prints the help message with available commands

     */

    private void printHelp() {

        System.out.println();

        System.out.println(BOLD + CYAN + "Available Commands:" + RESET);

        System.out.println(GREEN + "/help, /h" + RESET + "      - Show this help message");

        System.out.println(GREEN + "/new, /n" + RESET + "       - Start a new conversation");

        System.out.println(GREEN + "/history, /hist" + RESET + " - Show conversation history");

        System.out.println(GREEN + "/summary, /sum" + RESET + "  - Show conversation summary");

        System.out.println(GREEN + "/model, /m" + RESET + "      - Show model information");

        System.out.println(GREEN + "/health" + RESET + "        - Perform health check");

        System.out.println(GREEN + "/clear, /cls" + RESET + "    - Clear the screen");

        System.out.println(GREEN + "/quit, /q, /exit" + RESET + " - Exit the chatbot");

        System.out.println();

        System.out.println(YELLOW + "Simply type your message to chat with the AI assistant!" + RESET);

        System.out.println();

    }

    

    /**

     * Prints the user input prompt

     */

    private void printPrompt() {

        System.out.print(BOLD + BLUE + "You: " + RESET);

    }

    

    /**

     * Prints an assistant message

     * @param message The message to print

     */

    private void printAssistantMessage(String message) {

        System.out.println();

        System.out.println(BOLD + GREEN + "Assistant: " + RESET + message);

        System.out.println();

    }

    

    /**

     * Prints an informational message

     * @param message The message to print

     */

    private void printInfo(String message) {

        System.out.println(CYAN + "[INFO] " + message + RESET);

    }

    

    /**

     * Prints a success message

     * @param message The message to print

     */

    private void printSuccess(String message) {

        System.out.println(GREEN + "[SUCCESS] " + message + RESET);

    }

    

    /**

     * Prints an error message

     * @param message The message to print

     */

    private void printError(String message) {

        System.out.println(RED + "[ERROR] " + message + RESET);

    }

    

    /**

     * Prints the conversation history

     */

    private void printConversationHistory() {

        List<Message> history = chatbotController.getConversationHistory();

        

        if (history.isEmpty()) {

            printInfo("No conversation history available");

            return;

        }

        

        System.out.println();

        System.out.println(BOLD + CYAN + "Conversation History:" + RESET);

        System.out.println(CYAN + "════════════════════" + RESET);

        

        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("HH:mm:ss");

        

        for (Message message : history) {

            String timestamp = message.getTimestamp().format(formatter);

            String roleColor = getRoleColor(message.getRole());

            String roleName = getRoleName(message.getRole());

            

            System.out.printf("%s[%s] %s:%s %s%n", 

                            roleColor, timestamp, roleName, RESET, message.getContent());

        }

        

        System.out.println();

    }

    

    /**

     * Gets the color code for a message role

     * @param role The message role

     * @return The ANSI color code

     */

    private String getRoleColor(Message.Role role) {

        switch (role) {

            case USER:

                return BLUE;

            case ASSISTANT:

                return GREEN;

            case SYSTEM:

                return YELLOW;

            default:

                return RESET;

        }

    }

    

    /**

     * Gets the display name for a message role

     * @param role The message role

     * @return The display name

     */

    private String getRoleName(Message.Role role) {

        switch (role) {

            case USER:

                return "You";

            case ASSISTANT:

                return "Assistant";

            case SYSTEM:

                return "System";

            default:

                return "Unknown";

        }

    }

    

    /**

     * Clears the console screen

     */

    private void clearScreen() {

        try {

            // Try to clear screen using ANSI escape codes

            System.out.print("\033[2J\033[H");

            System.out.flush();

        } catch (Exception e) {

            // Fallback: print multiple newlines

            for (int i = 0; i < 50; i++) {

                System.out.println();

            }

        }

    }

    

    /**

     * Shuts down the interface and releases resources

     */

    private void shutdown() {

        try {

            chatbotController.shutdown();

            reader.close();

            logger.info("ConsoleInterface shutdown completed");

        } catch (IOException e) {

            logger.warn("Error closing console reader", e);

        }

    }

}



Creating the Main Application Class


The main application class serves as the entry point for our chatbot application. It ties together all the components and provides a clean way to start the application.



package com.fortytwo.llmchatbot;


import com.fortytwo.llmchatbot.controller.ChatbotController;

import com.fortytwo.llmchatbot.util.ConsoleInterface;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;


/**

 * Main application class for the LLM Chatbot.

 * This class serves as the entry point for the application and coordinates

 * the initialization and startup of all components.

 */

public class Application {

    

    private static final Logger logger = LoggerFactory.getLogger(Application.class);

    

    /**

     * Main method - entry point of the application

     * @param args Command line arguments

     */

    public static void main(String[] args) {

        logger.info("Starting LLM Chatbot Application");

        

        try {

            // Parse command line arguments if any

            ApplicationConfig config = parseArguments(args);

            

            // Create the chatbot controller

            ChatbotController chatbotController = createChatbotController(config);

            

            // Create and start the console interface

            ConsoleInterface consoleInterface = new ConsoleInterface(chatbotController);

            consoleInterface.start();

            

        } catch (Exception e) {

            logger.error("Fatal error in main application", e);

            System.err.println("Fatal error: " + e.getMessage());

            System.exit(1);

        }

        

        logger.info("LLM Chatbot Application terminated");

    }

    

    /**

     * Parses command line arguments

     * @param args Command line arguments

     * @return ApplicationConfig object with parsed configuration

     */

    private static ApplicationConfig parseArguments(String[] args) {

        ApplicationConfig config = new ApplicationConfig();

        

        for (int i = 0; i < args.length; i++) {

            String arg = args[i];

            

            switch (arg) {

                case "--model":

                case "-m":

                    if (i + 1 < args.length) {

                        config.setModel(args[++i]);

                    } else {

                        throw new IllegalArgumentException("Model name required after " + arg);

                    }

                    break;

                    

                case "--url":

                case "-u":

                    if (i + 1 < args.length) {

                        config.setOllamaUrl(args[++i]);

                    } else {

                        throw new IllegalArgumentException("URL required after " + arg);

                    }

                    break;

                    

                case "--max-context":

                case "-c":

                    if (i + 1 < args.length) {

                        try {

                            config.setMaxContextMessages(Integer.parseInt(args[++i]));

                        } catch (NumberFormatException e) {

                            throw new IllegalArgumentException("Invalid number for max context: " + args[i]);

                        }

                    } else {

                        throw new IllegalArgumentException("Number required after " + arg);

                    }

                    break;

                    

                case "--system-message":

                case "-s":

                    if (i + 1 < args.length) {

                        config.setSystemMessage(args[++i]);

                    } else {

                        throw new IllegalArgumentException("System message required after " + arg);

                    }

                    break;

                    

                case "--help":

                case "-h":

                    printUsage();

                    System.exit(0);

                    break;

                    

                default:

                    if (arg.startsWith("-")) {

                        throw new IllegalArgumentException("Unknown argument: " + arg);

                    }

                    break;

            }

        }

        

        logger.info("Application configuration: {}", config);

        return config;

    }

    

    /**

     * Creates the chatbot controller with the given configuration

     * @param config Application configuration

     * @return Configured ChatbotController instance

     */

    private static ChatbotController createChatbotController(ApplicationConfig config) {

        // Create services with configuration

        com.fortytwo.llmchatbot.service.ConversationService conversationService = 

            new com.fortytwo.llmchatbot.service.ConversationService(

                config.getMaxContextMessages(), 

                config.getSystemMessage()

            );

        

        com.fortytwo.llmchatbot.service.OllamaService ollamaService = 

            new com.fortytwo.llmchatbot.service.OllamaService(

                config.getOllamaUrl(), 

                config.getModel()

            );

        

        return new ChatbotController(conversationService, ollamaService);

    }

    

    /**

     * Prints usage information

     */

    private static void printUsage() {

        System.out.println("LLM Chatbot Application");

        System.out.println("Usage: java -jar llm-chatbot.jar [options]");

        System.out.println();

        System.out.println("Options:");

        System.out.println("  -m, --model <name>           Model name to use (default: llama2)");

        System.out.println("  -u, --url <url>              Ollama base URL (default: http://localhost:11434)");

        System.out.println("  -c, --max-context <number>   Maximum context messages (default: 20)");

        System.out.println("  -s, --system-message <text>  Custom system message");

        System.out.println("  -h, --help                   Show this help message");

        System.out.println();

        System.out.println("Examples:");

        System.out.println("  java -jar llm-chatbot.jar");

        System.out.println("  java -jar llm-chatbot.jar --model mistral --max-context 30");

        System.out.println("  java -jar llm-chatbot.jar --url http://remote-ollama:11434");

    }

    

    /**

     * Configuration class for application settings

     */

    private static class ApplicationConfig {

        private String model = "llama2";

        private String ollamaUrl = "http://localhost:11434";

        private int maxContextMessages = 20;

        private String systemMessage = "You are a helpful AI assistant. You provide accurate, helpful, and concise responses. You maintain a friendly and professional tone in all interactions.";

        

        public String getModel() {

            return model;

        }

        

        public void setModel(String model) {

            this.model = model;

        }

        

        public String getOllamaUrl() {

            return ollamaUrl;

        }

        

        public void setOllamaUrl(String ollamaUrl) {

            this.ollamaUrl = ollamaUrl;

        }

        

        public int getMaxContextMessages() {

            return maxContextMessages;

        }

        

        public void setMaxContextMessages(int maxContextMessages) {

            this.maxContextMessages = maxContextMessages;

        }

        

        public String getSystemMessage() {

            return systemMessage;

        }

        

        public void setSystemMessage(String systemMessage) {

            this.systemMessage = systemMessage;

        }

        

        @Override

        public String toString() {

            return String.format("ApplicationConfig{model='%s', ollamaUrl='%s', maxContextMessages=%d}", 

                               model, ollamaUrl, maxContextMessages);

        }

    }

}



Configuration and Properties


To make our application more configurable, let's create an application.properties file:



# Ollama Configuration

ollama.base.url=http://localhost:11434

ollama.model=llama2


# Conversation Configuration

conversation.max.context.messages=20

conversation.system.message=You are a helpful AI assistant. You provide accurate, helpful, and concise responses. You maintain a friendly and professional tone in all interactions.


# Logging Configuration

logging.level.com.fortytwo.llmchatbot=INFO

logging.level.org.apache.http=WARN


# Application Configuration

app.name=LLM Chatbot

app.version=1.0.0
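

Note that the Application class shown earlier takes its settings from command-line arguments and built-in defaults rather than from this file. If you want the properties file to be honored, one straightforward approach (a sketch under that assumption, not code the tutorial's Application already contains) is to load it from the classpath with java.util.Properties and feed the values into the existing service constructors:


package com.fortytwo.llmchatbot.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/**
 * Loads application.properties from the classpath with sensible fallbacks.
 */
public final class AppProperties {

    private AppProperties() {
    }

    public static Properties load() {
        Properties defaults = new Properties();
        defaults.setProperty("ollama.base.url", "http://localhost:11434");
        defaults.setProperty("ollama.model", "llama2");
        defaults.setProperty("conversation.max.context.messages", "20");

        Properties props = new Properties(defaults);
        try (InputStream in = AppProperties.class.getClassLoader()
                .getResourceAsStream("application.properties")) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            // Fall back to the defaults defined above if the file cannot be read
        }
        return props;
    }
}


The returned values can then be passed to the constructors we already have, for example new OllamaService(props.getProperty("ollama.base.url"), props.getProperty("ollama.model")).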



Testing and Validation


Before running our application, it's important to understand how to test and validate the functionality. Here are some key testing scenarios to consider:


First, ensure that Ollama is running on your local machine with the appropriate model loaded. You can verify this by running "ollama list" in your terminal to see available models.


Test the basic conversation flow by starting the application and engaging in a simple conversation. Verify that the chatbot maintains context across multiple exchanges and responds appropriately to different types of questions.


Test the command functionality by trying various commands like "/new" to start a new conversation, "/history" to view conversation history, and "/health" to check system status.


Test error handling by deliberately causing errors, such as stopping the Ollama service while the chatbot is running, to ensure graceful error handling and appropriate user feedback.
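

Most of the scenarios above need a running Ollama instance, but the conversation logic can be verified in isolation. The sketch below assumes JUnit 5 (junit-jupiter) is available on the test classpath, which is not part of the Maven setup shown in this tutorial:


package com.fortytwo.llmchatbot.service;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

class ConversationServiceTest {

    @Test
    void trimsHistoryToTheConfiguredMaximum() {
        ConversationService conversation = new ConversationService(5, "system prompt");

        for (int i = 0; i < 20; i++) {
            conversation.addUserMessage("question " + i);
            conversation.addAssistantMessage("answer " + i);
        }

        // The context window should never grow beyond the configured limit
        assertEquals(5, conversation.getMessageCount());
    }

    @Test
    void clearingTheConversationKeepsTheSystemMessage() {
        ConversationService conversation = new ConversationService(5, "system prompt");
        conversation.addUserMessage("hello");

        conversation.clearConversation();

        assertTrue(conversation.isEmpty());
        assertEquals(1, conversation.getMessageCount());
    }
}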


Running the Complete Application


To run the complete application, follow these steps:


First, ensure you have Java 11 or higher and Maven installed on your system. Install Ollama and pull the desired model using "ollama pull llama2" or your preferred model.


Compile the project using Maven by running "mvn clean compile" in the project root directory. This will download all dependencies and compile the source code.


Run the application using Maven with "mvn exec:java" or by running the compiled JAR file directly. The application will start, initialize the connection to Ollama, and present you with an interactive console interface.


Once running, you can engage in conversations with the chatbot by simply typing your messages. Use the various commands to explore different features and functionality.


Complete Running Example


Here is the complete, fully functional application that brings together all the components we have built. Note that this simplified entry point uses the default configuration throughout; if you prefer the command-line configurable Application class shown earlier, keep that version instead, since both define the same class and cannot coexist in one project:



// File: src/main/java/com/fortytwo/llmchatbot/Application.java

package com.fortytwo.llmchatbot;


import com.fortytwo.llmchatbot.controller.ChatbotController;

import com.fortytwo.llmchatbot.service.ConversationService;

import com.fortytwo.llmchatbot.service.OllamaService;

import com.fortytwo.llmchatbot.util.ConsoleInterface;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;


/**

 * Complete LLM Chatbot Application

 * This is a fully functional chatbot application that demonstrates

 * all the concepts and components discussed in this tutorial.

 */

public class Application {

    

    private static final Logger logger = LoggerFactory.getLogger(Application.class);

    

    public static void main(String[] args) {

        logger.info("Starting LLM Chatbot Application");

        

        try {

            // Create services with default configuration

            ConversationService conversationService = new ConversationService();

            OllamaService ollamaService = new OllamaService();

            

            // Create the chatbot controller

            ChatbotController chatbotController = new ChatbotController(conversationService, ollamaService);

            

            // Create and start the console interface

            ConsoleInterface consoleInterface = new ConsoleInterface(chatbotController);

            

            // Add shutdown hook for graceful cleanup

            Runtime.getRuntime().addShutdownHook(new Thread(() -> {

                logger.info("Shutdown hook triggered");

                chatbotController.shutdown();

            }));

            

            // Start the interactive interface

            consoleInterface.start();

            

        } catch (Exception e) {

            logger.error("Fatal error in main application", e);

            System.err.println("Fatal error: " + e.getMessage());

            System.err.println("Please ensure Ollama is running and the model is available.");

            System.exit(1);

        }

        

        logger.info("LLM Chatbot Application terminated");

    }

}


This complete example demonstrates a production-ready LLM chatbot application built in Java using Ollama. The application includes proper error handling, logging, resource management, and a user-friendly interface. It showcases clean architecture principles with clear separation of concerns between the different layers of the application.


The chatbot maintains conversation context, handles various user commands, provides helpful feedback, and gracefully handles error conditions. It serves as a solid foundation that can be extended with additional features such as conversation persistence, web interface, or integration with other systems.


Conclusion and Next Steps


This comprehensive tutorial has walked you through building a complete LLM chatbot in Java using Ollama. We have covered all the essential components including data models, service layers, conversation management, user interface, and application orchestration.


The resulting application demonstrates key concepts in LLM integration, conversation management, and clean software architecture. The modular design makes it easy to extend and customize for specific use cases.


Potential enhancements could include adding conversation persistence to a database, implementing a web-based interface, adding support for file uploads and document analysis, integrating with external APIs, or implementing more sophisticated conversation management features.


The foundation provided here gives you everything needed to build sophisticated LLM-powered applications in Java while maintaining clean, maintainable, and scalable code.