When starting to build applications powered by Large Language Models (LLMs), such as chatbots, developers face several key decisions. A common starting point is choosing a development framework, with LangChain and Hugging Face being popular Software Development Kits (SDKs) in this space.
Another crucial choice is whether to leverage a remote API or run an LLM locally on your machine. Many remote APIs adopt the schema popularized by OpenAI for interacting with their services. Alternatively, hosting a model locally provides more control and privacy.
Python is the predominant language in AI development and is used in the examples below, but TypeScript/JavaScript is also widely adopted, and other languages such as Go, Rust, Java, C++, and C# are viable options as well.
The local LangChain example below uses Ollama, a convenient tool built on top of the `llama.cpp` inference framework. It simplifies downloading and managing models and provides a command-line interface for running and interacting with them.
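To illustrate how interchangeable these options can be, the short sketch below sends the same OpenAI-style chat request either to OpenAI's hosted API or to Ollama's OpenAI-compatible endpoint (served at `http://localhost:11434/v1` by default). The `openai` package (`pip install openai`), the `llama3` model name, and the local URL are assumptions about your setup.

# A minimal sketch: one OpenAI-style chat call, remote or local (assumptions noted above)
from openai import OpenAI

# Remote: client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Local (Ollama's OpenAI-compatible endpoint; the api_key value is ignored locally):
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",  # e.g. "gpt-3.5-turbo" when pointing at OpenAI
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)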
Prerequisites:
1. Python: Ensure you have Python 3.8+ installed.
2. Installation:
- For LangChain (OpenAI API):
- pip install langchain langchain-openai python-dotenv
- For LangChain (Local LLM via Ollama):
- pip install langchain langchain-community python-dotenv
- install and run Ollama: https://ollama.com/
- Then pull a model: ollama pull llama3
- For Hugging Face (Inference API):
- pip install huggingface_hub python-dotenv
- For Hugging Face (Local Model):
- pip install transformers torch # or tensorflow/jax
- API Keys (for remote options):
- Create a `.env` file in your project directory.
- OpenAI: Add `OPENAI_API_KEY="your_openai_api_key"`
- Hugging Face: Add `HUGGINGFACEHUB_API_TOKEN="your_huggingface_api_token"` (Get one from Hugging Face -> Settings -> Access Tokens)
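Before running the examples, an optional sanity check can confirm the setup. The snippet below is a sketch that assumes the `.env` keys above and Ollama's default local port (11434); skip whichever part does not apply to you.

# Optional sanity check for the prerequisites above (a sketch, not required)
import os
import urllib.request
from dotenv import load_dotenv

load_dotenv()

# Report which API keys are visible to the process
for key in ("OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")

# Ollama lists locally pulled models at /api/tags on its default port
try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=3) as resp:
        print("Ollama reachable, HTTP status:", resp.status)
except Exception as exc:
    print("Ollama not reachable (fine if you only use remote APIs):", exc)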
Example 1: LangChain Chatbot
LangChain provides abstractions to easily switch between different LLM providers and manage conversation flow.
1.a) LangChain with Remote API (OpenAI)
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Load environment variables (for the API key)
load_dotenv()

# --- Configuration ---
# Ensure your OPENAI_API_KEY is set in your .env file or environment
llm = ChatOpenAI(model="gpt-3.5-turbo")  # Or "gpt-4", etc.

# --- Simple Prompt Template ---
# You can make this more sophisticated
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant."),
    ("human", "{user_input}"),
])

# --- Create the Chain ---
# This chain takes user input, formats it, sends it to the LLM,
# and parses the output string.
chain = prompt_template | llm | StrOutputParser()

# --- Chat Loop ---
print("Chatbot initialized (using OpenAI API). Type 'quit' to exit.")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    # Invoke the chain
    response = chain.invoke({"user_input": user_input})
    print(f"Bot: {response}")

print("Chatbot session ended.")
1.b) LangChain with Local LLM (using Ollama)
This assumes you have Ollama installed and running, and that you have pulled a model (e.g., `ollama pull llama3`).
import os
from dotenv import load_dotenv
from langchain_community.chat_models import ChatOllama  # Ollama lives in langchain-community
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Load environment variables (optional, not needed for Ollama itself)
load_dotenv()

# --- Configuration ---
# Ensure the Ollama service is running
# Specify the model you pulled with Ollama
llm = ChatOllama(model="llama3")  # Or "mistral", "gemma", etc.

# --- Simple Prompt Template ---
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant running locally."),
    ("human", "{user_input}"),
])

# --- Create the Chain ---
chain = prompt_template | llm | StrOutputParser()

# --- Chat Loop ---
print("Chatbot initialized (using local Ollama model). Type 'quit' to exit.")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    # Invoke the chain
    try:
        response = chain.invoke({"user_input": user_input})
        print(f"Bot: {response}")
    except Exception as e:
        print(f"Error communicating with Ollama: {e}")
        print("Is the Ollama service running and the model available?")

print("Chatbot session ended.")
Example 2: Hugging Face Chatbot
This uses Hugging Face libraries directly.
2.a) Hugging Face with Remote Inference API
This uses the Hugging Face hosted Inference API (requires an API token).
import os
from dotenv import load_dotenv
from huggingface_hub import InferenceClient

# Load environment variables (for the API token)
load_dotenv()

# --- Configuration ---
# Ensure HUGGINGFACEHUB_API_TOKEN is set in .env or environment
hf_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
if not hf_token:
    raise ValueError("Hugging Face API token not found. Set HUGGINGFACEHUB_API_TOKEN.")

# Choose a model suitable for chat from the Hub
# Examples: "mistralai/Mistral-7B-Instruct-v0.1", "HuggingFaceH4/zephyr-7b-beta"
# Check the model card for the recommended prompt format if needed
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
client = InferenceClient(model=model_id, token=hf_token)

# --- Chat Loop ---
print(f"Chatbot initialized (using Hugging Face API: {model_id}). Type 'quit' to exit.")

# Store conversation history for context (optional, basic implementation)
messages = [{"role": "system", "content": "You are a helpful AI assistant."}]

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    # Add the user message to the history
    messages.append({"role": "user", "content": user_input})
    try:
        # Use the chat completion endpoint
        response = client.chat_completion(
            messages=messages,
            max_tokens=500,  # Adjust as needed
            stream=False,    # Set to True for streaming output
        )
        # Extract the response content
        if isinstance(response, dict) and 'choices' in response:
            bot_response = response['choices'][0]['message']['content']
        else:
            # Fallback or handle unexpected format
            bot_response = str(response)  # May need adjustment based on model/client version
        print(f"Bot: {bot_response}")
        # Add the bot response to the history
        messages.append({"role": "assistant", "content": bot_response})
        # Optional: Limit history size to prevent excessive token usage
        # if len(messages) > 10: messages = [messages[0]] + messages[-9:]
    except Exception as e:
        print(f"Error calling Hugging Face API: {e}")
        # Remove the last user message if the API call failed
        messages.pop()

print("Chatbot session ended.")
2.b) Hugging Face with Local Model (`transformers`)
This downloads and runs the model directly on your machine (requires significant RAM/VRAM depending on the model).
import torch  # Or tensorflow/jax
from transformers import pipeline, Conversation

# NOTE: The "conversational" pipeline and the Conversation class are deprecated and have
# been removed in recent transformers releases. Use an older transformers version for this
# path, or switch to the "text-generation" pipeline shown in the comments below.

# --- Configuration ---
# Choose a model suitable for chat/instruction-following
# Smaller models require fewer resources but may be less capable.
# Examples: "microsoft/DialoGPT-medium", "gpt2", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model_id = "microsoft/DialoGPT-medium"  # Example conversational model
# model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Example small instruct model

print(f"Loading local model: {model_id}...")

# Use the GPU if available, otherwise the CPU
device = 0 if torch.cuda.is_available() else -1  # device=0 for CUDA, device=-1 for CPU

# Using a pipeline for simplicity
# For instruct models, you might need specific prompt formatting
chatbot = pipeline("conversational", model=model_id, device=device)
# Or for text generation (might need prompt formatting):
# chatbot = pipeline("text-generation", model=model_id, device=device, max_new_tokens=100)

print("Chatbot initialized (using local Hugging Face model). Type 'quit' to exit.")

# For the conversational pipeline
conversation = Conversation()

# For the text-generation pipeline (example prompt format for TinyLlama Chat)
# chat_history = [{"role": "system", "content": "You are a friendly chatbot."}]
# def format_prompt(history, user_input):
#     prompt = ""
#     for msg in history:
#         prompt += f"<|{msg['role']}|>\n{msg['content']}</s>\n"
#     prompt += f"<|user|>\n{user_input}</s>\n<|assistant|>\n"
#     return prompt

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    try:
        # --- Using the "conversational" pipeline ---
        conversation.add_user_input(user_input)
        conversation = chatbot(conversation)
        # The last generated response is the newest one
        bot_response = conversation.generated_responses[-1]

        # --- Using the "text-generation" pipeline (example) ---
        # prompt = format_prompt(chat_history, user_input)
        # sequences = chatbot(prompt, num_return_sequences=1)
        # generated_text = sequences[0]['generated_text']
        # # Extract only the assistant's response part
        # bot_response = generated_text.split("<|assistant|>")[-1].strip()
        # chat_history.append({"role": "user", "content": user_input})
        # chat_history.append({"role": "assistant", "content": bot_response})
        # # Optional: Limit history size

        print(f"Bot: {bot_response}")
    except Exception as e:
        print(f"Error during local model inference: {e}")
        # If using the conversational pipeline, you may need to reset or manage the
        # conversation object on error

print("Chatbot session ended.")
Key Considerations
- Model Choice: The quality of your chatbot heavily depends on the chosen LLM. Remote APIs often offer state-of-the-art models, while local models require balancing capability with your hardware resources.
- Prompt Engineering: The `system` message and how you structure the prompt significantly influence the bot's behavior and persona.
- Context/Memory: These examples have minimal or no memory. For real conversations, you need to manage the conversation history (pass previous turns back to the LLM). LangChain offers Memory modules, and with Hugging Face you would manage the history list manually or use specific pipeline features; a minimal LangChain sketch follows this list.
- Error Handling: Add more robust error handling for API issues, model loading failures, etc.
- Streaming: For a better user experience, especially with slower models or APIs, implement streaming to show the response word by word. Both LangChain and Hugging Face clients often support this (`stream=True`).
- Resource Usage (Local): Running models locally can consume substantial RAM, VRAM (if using GPU), and disk space. Start with smaller models if your resources are limited.
- Cost (Remote): API calls usually incur costs based on token usage. Monitor your usage.
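As a concrete illustration of the memory point above, here is a minimal sketch that extends the LangChain/Ollama example by keeping the history in a list and injecting it through a `MessagesPlaceholder`; the model name and the history cap are assumptions.

# Minimal conversation-memory sketch for the LangChain/Ollama example
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

llm = ChatOllama(model="llama3")  # assumes the model has been pulled with Ollama
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant."),
    MessagesPlaceholder(variable_name="history"),  # previous turns are injected here
    ("human", "{user_input}"),
])
chain = prompt | llm | StrOutputParser()

history = []
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    response = chain.invoke({"history": history, "user_input": user_input})
    print(f"Bot: {response}")
    # Record both turns so the next call sees the conversation so far
    history.extend([HumanMessage(content=user_input), AIMessage(content=response)])
    history = history[-10:]  # crude cap on context size (an arbitrary choice)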
Choose the example that best fits your needs regarding framework preference, resource availability, and whether you prefer a managed API or local control.
Minor Points & Potential Enhancements (already noted in "Key Considerations" but relevant to correctness):
- HF API Response Parsing: The check `if isinstance(response, dict) and 'choices' in response:` is a reasonable safeguard, but the exact return type can vary with the model and the `huggingface_hub` version, so it might need adjustment; attribute access (`response.choices[0].message.content`) is the documented alternative. The fallback `str(response)` is a safe default but might not always be the desired output.
- Error Handling Granularity: The error handling is basic. More specific exceptions could be caught (e.g., `openai.AuthenticationError`, network errors, `transformers` model loading errors) for more informative feedback.
- Context/Memory: As designed, the memory implementation is minimal (HF API example) or non-existent (others). This is correct for a *small, basic* example, but wouldn't sustain a coherent conversation. LangChain's Memory modules or more sophisticated manual history management would be needed for that.
- Prompt Formatting (Local HF): The commented-out section for `text-generation` correctly highlights that many models (especially instruction-tuned ones) require specific prompt templates. The example format shown (`<|role|>\n...</s>\n`) is typical but needs to be adapted based on the specific model's documentation.
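Rather than hand-writing such templates, recent tokenizers can render the model's own chat template for you. A small sketch, assuming the chosen model ships a chat template in its tokenizer configuration:

# Sketch: build the prompt from chat messages via the model's own chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Hello!"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # ready to pass to a text-generation pipeline or model.generate()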