When starting to build applications powered by Large Language Models (LLMs), such as chatbots, developers face several key decisions. A common starting point is choosing a development framework, with LangChain and Hugging Face being popular Software Development Kits (SDKs) in this space.
Another crucial choice is whether to leverage a remote API or run an LLM locally on your machine. Many remote APIs adopt the schema popularized by OpenAI for interacting with their services. Alternatively, hosting a model locally provides more control and privacy.
Python is the predominant language in AI development and is used in the examples below, but TypeScript/JavaScript is also widely adopted, and other languages such as Go, Rust, Java, C++, and C# are viable options as well.
The local LangChain example below uses Ollama, a convenient tool built on top of the `llama.cpp` inference framework. It simplifies downloading and managing models and provides a command-line interface for running and interacting with them.
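To illustrate how interchangeable these options can be, the short sketch below sends the same OpenAI-style chat request either to OpenAI's hosted API or to Ollama's OpenAI-compatible endpoint (served at `http://localhost:11434/v1` by default). The `openai` package (`pip install openai`), the `llama3` model name, and the local URL are assumptions about your setup.

# A minimal sketch: one OpenAI-style chat call, remote or local (assumptions noted above)
from openai import OpenAI

# Remote: client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Local (Ollama's OpenAI-compatible endpoint; the api_key value is ignored locally):
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",  # e.g. "gpt-3.5-turbo" when pointing at OpenAI
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)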
Prerequisites:
1. Python: Ensure you have Python 3.8+ installed.
2. Installation:
- For LangChain (OpenAI API):
- pip install langchain langchain-openai python-dotenv
- For LangChain (Local LLM via Ollama):
- pip install langchain langchain-community python-dotenv
- install and run Ollama: https://ollama.com/
- Then pull a model: ollama pull llama3
- For Hugging Face (Inference API):
- pip install huggingface_hub python-dotenv
- For Hugging Face (Local Model):
- pip install transformers torch # or tensorflow/jax
- API Keys (for remote options):
- Create a `.env` file in your project directory.
- OpenAI: Add `OPENAI_API_KEY="your_openai_api_key"`
- Hugging Face: Add `HUGGINGFACEHUB_API_TOKEN="your_huggingface_api_token"` (Get one from Hugging Face -> Settings -> Access Tokens)
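Before running the examples, an optional sanity check can confirm the setup. The snippet below is a sketch that assumes the `.env` keys above and Ollama's default local port (11434); skip whichever part does not apply to you.

# Optional sanity check for the prerequisites above (a sketch, not required)
import os
import urllib.request
from dotenv import load_dotenv

load_dotenv()

# Report which API keys are visible to the process
for key in ("OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")

# Ollama lists locally pulled models at /api/tags on its default port
try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=3) as resp:
        print("Ollama reachable, HTTP status:", resp.status)
except Exception as exc:
    print("Ollama not reachable (fine if you only use remote APIs):", exc)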
Example 1: LangChain Chatbot
LangChain provides abstractions to easily switch between different LLM providers and manage conversation flow.
1.a) LangChain with Remote API (OpenAI)
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Load environment variables (for the API key)
load_dotenv()

# --- Configuration ---
# Ensure your OPENAI_API_KEY is set in your .env file or environment
llm = ChatOpenAI(model="gpt-3.5-turbo")  # Or "gpt-4", etc.

# --- Simple Prompt Template ---
# You can make this more sophisticated
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant."),
    ("human", "{user_input}"),
])

# --- Create the Chain ---
# This chain takes user input, formats it, sends it to the LLM,
# and parses the output string.
chain = prompt_template | llm | StrOutputParser()

# --- Chat Loop ---
print("Chatbot initialized (using OpenAI API). Type 'quit' to exit.")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    # Invoke the chain
    response = chain.invoke({"user_input": user_input})
    print(f"Bot: {response}")

print("Chatbot session ended.")
1.b) LangChain with Local LLM (using Ollama)
This assumes you have Ollama installed and running, and that you have pulled a model (e.g., `ollama pull llama3`).
import os
from dotenv import load_dotenv
from langchain_community.chat_models import ChatOllama  # Ollama lives in langchain-community
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Load environment variables (optional, not needed for Ollama itself)
load_dotenv()

# --- Configuration ---
# Ensure the Ollama service is running
# Specify the model you pulled with Ollama
llm = ChatOllama(model="llama3")  # Or "mistral", "gemma", etc.

# --- Simple Prompt Template ---
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant running locally."),
    ("human", "{user_input}"),
])

# --- Create the Chain ---
chain = prompt_template | llm | StrOutputParser()

# --- Chat Loop ---
print("Chatbot initialized (using local Ollama model). Type 'quit' to exit.")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    # Invoke the chain
    try:
        response = chain.invoke({"user_input": user_input})
        print(f"Bot: {response}")
    except Exception as e:
        print(f"Error communicating with Ollama: {e}")
        print("Is the Ollama service running and the model available?")

print("Chatbot session ended.")
Example 2: Hugging Face Chatbot
This uses Hugging Face libraries directly.
2.a) Hugging Face with Remote Inference API
This uses the Hugging Face hosted Inference API (requires an API token).
import os
from dotenv import load_dotenv
from huggingface_hub import InferenceClient

# Load environment variables (for the API token)
load_dotenv()

# --- Configuration ---
# Ensure HUGGINGFACEHUB_API_TOKEN is set in .env or environment
hf_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
if not hf_token:
    raise ValueError("Hugging Face API token not found. Set HUGGINGFACEHUB_API_TOKEN.")

# Choose a model suitable for chat from the Hub
# Examples: "mistralai/Mistral-7B-Instruct-v0.1", "HuggingFaceH4/zephyr-7b-beta"
# Check the model card for the recommended prompt format if needed
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
client = InferenceClient(model=model_id, token=hf_token)

# --- Chat Loop ---
print(f"Chatbot initialized (using Hugging Face API: {model_id}). Type 'quit' to exit.")

# Store conversation history for context (optional, basic implementation)
messages = [{"role": "system", "content": "You are a helpful AI assistant."}]

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    # Add the user message to the history
    messages.append({"role": "user", "content": user_input})
    try:
        # Use the chat completion endpoint
        response = client.chat_completion(
            messages=messages,
            max_tokens=500,  # Adjust as needed
            stream=False,    # Set to True for streaming output
        )
        # Extract the response content
        if isinstance(response, dict) and 'choices' in response:
            bot_response = response['choices'][0]['message']['content']
        else:
            # Fallback or handle unexpected format
            bot_response = str(response)  # May need adjustment based on model/client version
        print(f"Bot: {bot_response}")
        # Add the bot response to the history
        messages.append({"role": "assistant", "content": bot_response})
        # Optional: Limit history size to prevent excessive token usage
        # if len(messages) > 10: messages = [messages[0]] + messages[-9:]
    except Exception as e:
        print(f"Error calling Hugging Face API: {e}")
        # Remove the last user message if the API call failed
        messages.pop()

print("Chatbot session ended.")
2.b) Hugging Face with Local Model (`transformers`)
This downloads and runs the model directly on your machine (requires significant RAM/VRAM depending on the model).
import torch  # Or tensorflow/jax
from transformers import pipeline, Conversation

# NOTE: The "conversational" pipeline and the Conversation class are deprecated and have
# been removed in recent transformers releases. Use an older transformers version for this
# path, or switch to the "text-generation" pipeline shown in the comments below.

# --- Configuration ---
# Choose a model suitable for chat/instruction-following
# Smaller models require fewer resources but may be less capable.
# Examples: "microsoft/DialoGPT-medium", "gpt2", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model_id = "microsoft/DialoGPT-medium"  # Example conversational model
# model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Example small instruct model

print(f"Loading local model: {model_id}...")

# Use the GPU if available, otherwise the CPU
device = 0 if torch.cuda.is_available() else -1  # device=0 for CUDA, device=-1 for CPU

# Using a pipeline for simplicity
# For instruct models, you might need specific prompt formatting
chatbot = pipeline("conversational", model=model_id, device=device)
# Or for text generation (might need prompt formatting):
# chatbot = pipeline("text-generation", model=model_id, device=device, max_new_tokens=100)

print("Chatbot initialized (using local Hugging Face model). Type 'quit' to exit.")

# For the conversational pipeline
conversation = Conversation()

# For the text-generation pipeline (example prompt format for TinyLlama Chat)
# chat_history = [{"role": "system", "content": "You are a friendly chatbot."}]
# def format_prompt(history, user_input):
#     prompt = ""
#     for msg in history:
#         prompt += f"<|{msg['role']}|>\n{msg['content']}</s>\n"
#     prompt += f"<|user|>\n{user_input}</s>\n<|assistant|>\n"
#     return prompt

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    try:
        # --- Using the "conversational" pipeline ---
        conversation.add_user_input(user_input)
        conversation = chatbot(conversation)
        # The last generated response is the newest one
        bot_response = conversation.generated_responses[-1]

        # --- Using the "text-generation" pipeline (example) ---
        # prompt = format_prompt(chat_history, user_input)
        # sequences = chatbot(prompt, num_return_sequences=1)
        # generated_text = sequences[0]['generated_text']
        # # Extract only the assistant's response part
        # bot_response = generated_text.split("<|assistant|>")[-1].strip()
        # chat_history.append({"role": "user", "content": user_input})
        # chat_history.append({"role": "assistant", "content": bot_response})
        # # Optional: Limit history size

        print(f"Bot: {bot_response}")
    except Exception as e:
        print(f"Error during local model inference: {e}")
        # If using the conversational pipeline, you may need to reset or manage the
        # conversation object on error

print("Chatbot session ended.")
Key Considerations
- Model Choice: The quality of your chatbot heavily depends on the chosen LLM. Remote APIs often offer state-of-the-art models, while local models require balancing capability with your hardware resources.
- Prompt Engineering: The `system` message and how you structure the prompt significantly influence the bot's behavior and persona.
- Context/Memory: These examples have minimal or no memory. For real conversations, you need to manage the conversation history (pass previous turns back to the LLM). LangChain offers Memory modules, and with Hugging Face you would manage the history list manually or use specific pipeline features; a minimal LangChain sketch follows this list.
- Error Handling: Add more robust error handling for API issues, model loading failures, etc.
- Streaming: For a better user experience, especially with slower models or APIs, implement streaming to show the response word by word. Both LangChain and Hugging Face clients often support this (`stream=True`).
- Resource Usage (Local): Running models locally can consume substantial RAM, VRAM (if using GPU), and disk space. Start with smaller models if your resources are limited.
- Cost (Remote): API calls usually incur costs based on token usage. Monitor your usage.
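As a concrete illustration of the memory point above, here is a minimal sketch that extends the LangChain/Ollama example by keeping the history in a list and injecting it through a `MessagesPlaceholder`; the model name and the history cap are assumptions.

# Minimal conversation-memory sketch for the LangChain/Ollama example
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

llm = ChatOllama(model="llama3")  # assumes the model has been pulled with Ollama
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant."),
    MessagesPlaceholder(variable_name="history"),  # previous turns are injected here
    ("human", "{user_input}"),
])
chain = prompt | llm | StrOutputParser()

history = []
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    response = chain.invoke({"history": history, "user_input": user_input})
    print(f"Bot: {response}")
    # Record both turns so the next call sees the conversation so far
    history.extend([HumanMessage(content=user_input), AIMessage(content=response)])
    history = history[-10:]  # crude cap on context size (an arbitrary choice)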
Choose the example that best fits your needs regarding framework preference, resource availability, and whether you prefer a managed API or local control.
Minor Points & Potential Enhancements (already noted in "Key Considerations" but relevant to correctness):
- HF API Response Parsing: The check `if isinstance(response, dict) and 'choices' in response:` is a reasonable safeguard, but the exact return type can vary with the model and the `huggingface_hub` version, so it might need adjustment; attribute access (`response.choices[0].message.content`) is the documented alternative. The fallback `str(response)` is a safe default but might not always be the desired output.
- Error Handling Granularity: The error handling is basic. More specific exceptions could be caught (e.g., `openai.AuthenticationError`, network errors, `transformers` model loading errors) for more informative feedback.
- Context/Memory: As designed, the memory implementation is minimal (HF API example) or non-existent (others). This is correct for a *small, basic* example, but wouldn't sustain a coherent conversation. LangChain's Memory modules or more sophisticated manual history management would be needed for that.
- Prompt Formatting (Local HF): The commented-out section for `text-generation` correctly highlights that many models (especially instruction-tuned ones) require specific prompt templates. The example format shown (`<|role|>\n...</s>\n`) is typical but needs to be adapted based on the specific model's documentation.
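Rather than hand-writing such templates, recent tokenizers can render the model's own chat template for you. A small sketch, assuming the chosen model ships a chat template in its tokenizer configuration:

# Sketch: build the prompt from chat messages via the model's own chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Hello!"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # ready to pass to a text-generation pipeline or model.generate()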