INTRODUCTION
Many software engineers find natural language processing intimidating because it often involves large and complex frameworks. The emergence of the Transformer architecture changed that by introducing an attention-based design that can be pretrained on vast text corpora and then applied to a variety of language tasks. HuggingFace became the center of a vibrant community by packaging these pretrained Transformer models into an intuitive Python library. In this article you will learn how to leverage HuggingFace Transformers to build your own chatbot in multiple ways. You will first build and run the chatbot locally on your machine, and then you will see how to invoke remote models hosted by HuggingFace and by OpenAI. Finally you will explore higher-level orchestration and chain frameworks provided by LangGraph and LangChain. By following these examples you will gain a clear understanding of how text is converted to tokens, how models generate continuations, and how to integrate these capabilities into your own applications.
ENVIRONMENT AND DEPENDENCIES
Before writing any code you need a Python environment and several libraries. You will install the HuggingFace Transformers library together with its fast tokenizers component and PyTorch to run models locally. You will also install requests to call HTTP endpoints, the OpenAI client library to call OpenAI’s hosted models, LangGraph for agent orchestration, and LangChain for composable chains. The following command installs all of these packages:
# The following command installs core dependencies for local and remote LLM usage
pip install transformers tokenizers torch requests openai langgraph langchain
After running this command you will have access to modules such as transformers.AutoTokenizer, transformers.AutoModelForCausalLM, requests for HTTP calls, openai for OpenAI’s Python client, langgraph.prebuilt for building agents, and langchain.llms and langchain.chat_models for constructing chains.
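If you want a quick sanity check that everything installed correctly, the short script below simply imports the entry points used in the rest of this article and does nothing else.
# Sanity check: confirm that the libraries used in this article can be imported
from transformers import AutoTokenizer, AutoModelForCausalLM
import requests
import openai
from langgraph.prebuilt import create_react_agent
from langchain.chains import ConversationChain
print("All dependencies imported successfully")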
BUILDING A LOCAL CHATBOT
To run a chatbot locally you need a tokenizer to convert text into a sequence of integer token IDs and a pretrained Transformer model to generate continuations from those tokens. You will then wrap these components in a simple interactive loop that reads user input from the console and prints the model’s responses.
The following code example shows how to load a tokenizer and a causal language model from HuggingFace’s model hub. You can replace the model identifier with any other causal model you prefer. This example uses PyTorch under the hood.
# The following code loads a tokenizer and a causal language model from HuggingFace
from transformers import AutoTokenizer, AutoModelForCausalLM
# The tokenizer converts text into integer token IDs that the model can process
tokenizer = AutoTokenizer.from_pretrained('gpt2')
# The model generates text continuations given input token IDs
model = AutoModelForCausalLM.from_pretrained('gpt2')
# Verify that the tokenizer and the model use the same vocabulary
assert tokenizer.vocab_size == model.config.vocab_size
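To see concretely how text is converted to tokens, you can run a quick check like the one below; the exact token IDs depend on the GPT-2 vocabulary, so treat the printed values as illustrative.
# Inspect how the tokenizer maps text to integer token IDs and back
sample_text = "Hello, how are you today?"
token_ids = tokenizer.encode(sample_text)
# Print the integer IDs, the corresponding subword tokens, and the reconstructed text
print(token_ids)
print(tokenizer.convert_ids_to_tokens(token_ids))
print(tokenizer.decode(token_ids))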
With the tokenizer and model loaded you can implement an interactive chat loop. You will move the model to a GPU if one is available to speed up inference. You will maintain the entire conversation history as a single text string so that the model has full context. Each time the user submits a message, you will append it to the history, tokenize the updated history, generate new tokens as the model’s response, decode only the newly generated tokens back to text, and then display them.
# The following script implements a basic interactive chat loop using PyTorch
import torch
# Select GPU if available for faster inference
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
# Initialize an empty string to hold the conversation history
chat_history = ""
while True:
    # Prompt the user for input
    user_input = input("User: ")
    # Allow the user to exit by typing 'quit'
    if user_input.strip().lower() == 'quit':
        print("Chatbot: Goodbye!")
        break
    # Append the user's message and a placeholder for the model's reply
    chat_history += f"User: {user_input}\nChatbot:"
    # Tokenize the full conversation history and move tensors to the selected device
    inputs = tokenizer(chat_history, return_tensors='pt').to(device)
    # Generate up to 100 new tokens as the chatbot's response
    # Using the end-of-sequence token for padding avoids a warning because GPT-2 has no pad token
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, skipping the original prompt tokens
    generated_text = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[-1]:],
        skip_special_tokens=True
    )
    # Append the generated response to the conversation history
    chat_history += generated_text + "\n"
    # Print the chatbot's reply to the console
    print(f"Chatbot: {generated_text}")
BUILDING A REMOTE CHATBOT VIA HUGGINGFACE INFERENCE API
Running large models locally can require substantial hardware. HuggingFace offers an Inference API that hosts models in the cloud and exposes them via HTTP endpoints. To use this service you must obtain an API token from your HuggingFace account settings page. The example below shows how to authenticate and send a prompt to a remote model.
# The following code shows how to call a remote model using HuggingFace Inference API
import requests
# Replace 'YOUR_HF_TOKEN' with your HuggingFace API token
api_token = "YOUR_HF_TOKEN"
headers = {"Authorization": f"Bearer {api_token}"}
# Specify the model you wish to call, for example 'bigscience/bloom'
model_id = "bigscience/bloom"
# Prepare the JSON payload with the prompt and generation parameters
payload = {
    "inputs": "Hello, how are you today?",
    "parameters": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}
# Send a POST request to the inference endpoint
response = requests.post(
    f"https://api-inference.huggingface.co/models/{model_id}",
    headers=headers,
    json=payload
)
# Parse the JSON response and print the first generated text
result = response.json()
print("Chatbot:", result[0]['generated_text'])
You can wrap this logic in a console loop to provide an interactive experience. You should also include retry logic to handle rate limits or transient network errors. HuggingFace’s API supports streaming responses as well by setting “stream”: true in the payload and reading chunks as they arrive.
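As a rough sketch of that idea, the loop below retries a failed request a few times before giving up; the helper name, retry count, and delay are illustrative choices rather than values required by the API.
# A minimal interactive loop around the Inference API with simple retry logic
import time

def query_with_retry(prompt, retries=3, delay=5):
    payload = {
        "inputs": prompt,
        "parameters": {"temperature": 0.7, "max_new_tokens": 100}
    }
    for attempt in range(retries):
        response = requests.post(
            f"https://api-inference.huggingface.co/models/{model_id}",
            headers=headers,
            json=payload
        )
        if response.status_code == 200:
            return response.json()[0]['generated_text']
        # Wait before retrying on rate limits, model loading, or transient errors
        time.sleep(delay)
    return "Sorry, the model is unavailable right now."

while True:
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        print("Chatbot: Goodbye!")
        break
    print("Chatbot:", query_with_retry(user_input))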
BUILDING A REMOTE CHATBOT WITH OPENAI’S API
Many developers choose to use OpenAI’s hosted models for state-of-the-art performance and convenience. To call OpenAI’s Chat Completion endpoints you must first install their Python client and set your API key in the environment variable OPENAI_API_KEY. The example below shows how to prepare your application for OpenAI calls.
# Run the following command in your shell to install the OpenAI client library
pip install openai
# The following code configures the OpenAI client in Python
import os
from openai import OpenAI
# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
Once configured you can implement a chat loop that maintains a list of message dictionaries, where each dictionary contains a “role” and “content.” By including a system message at the start you set the behavior of the assistant. Each time the user provides input, you append it to the history, call the Chat Completions endpoint, extract the assistant’s reply, append it back to the history, and then display it.
# This script implements an interactive chat loop using OpenAI’s Chat Completions API
import os
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# Initialize the conversation with a system message that defines assistant behavior
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]
while True:
    # Read the user's message from the console
    user_input = input("User: ")
    # Exit if the user types 'quit'
    if user_input.strip().lower() == "quit":
        print("Assistant: Goodbye!")
        break
    # Append the user's message to the conversation history
    chat_history.append({"role": "user", "content": user_input})
    # Call the Chat Completions endpoint with the conversation history
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=chat_history,
        temperature=0.7,
        max_tokens=150
    )
    # Extract the assistant's reply from the API response
    assistant_message = response.choices[0].message.content
    # Append the assistant's message back to the history
    chat_history.append({"role": "assistant", "content": assistant_message})
    # Print the assistant's reply to the console
    print("Assistant:", assistant_message)
If you prefer to display tokens as they are generated rather than waiting for the full response, you can enable streaming by passing stream=True to client.chat.completions.create and iterating over the returned chunks, printing each chunk’s delta content as it arrives.
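As a brief sketch of that approach, assuming the same client and chat_history list as in the loop above:
# Stream the assistant's reply token by token instead of waiting for the full response
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=chat_history,
    temperature=0.7,
    stream=True
)
assistant_message = ""
for chunk in stream:
    # Each chunk carries an incremental piece of the reply in its delta
    delta = chunk.choices[0].delta.content
    if delta:
        assistant_message += delta
        print(delta, end="", flush=True)
print()
chat_history.append({"role": "assistant", "content": assistant_message})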
BUILDING A CHATBOT WITH LANGGRAPH
LangGraph provides a low-level framework for orchestrating stateful agents that can run long-lived workflows, include human-in-the-loop breakpoints, and maintain durable memory. To get started you will install LangGraph and then create a simple react-style agent that uses OpenAI’s GPT-3.5-turbo model. You will then run an interactive loop that sends user messages to the agent and prints its replies.
# The following commands install LangGraph and the LangChain OpenAI integration
pip install -U langgraph
pip install -U "langchain[openai]"
# The following code creates a React-style agent and runs an interactive loop
from langgraph.prebuilt import create_react_agent
import os

# The OpenAI integration reads the API key from the OPENAI_API_KEY environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")
# Create an agent that uses OpenAI's GPT-3.5-turbo with no external tools
agent = create_react_agent(
    model="openai:gpt-3.5-turbo",
    tools=[],
    prompt="You are a helpful assistant."
)
while True:
    # Read the user's message
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        print("Assistant: Goodbye!")
        break
    # Invoke the agent with the user message
    result = agent.invoke({"messages": [{"role": "user", "content": user_input}]})
    # The agent returns LangChain message objects; the last one is the assistant's reply
    messages = result.get("messages", [])
    assistant_message = messages[-1].content if messages else ""
    # Print the assistant's reply
    print("Assistant:", assistant_message)
In this example you install LangGraph and the LangChain OpenAI integration. You then create a react-style agent by specifying the model identifier openai:gpt-3.5-turbo, an empty list of tools, and an initial prompt. The agent.invoke method accepts a dictionary containing a list of messages and returns a result whose messages field holds LangChain message objects; the last entry is the assistant’s reply, whose content you extract and print.
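Note that the loop above sends each user message in isolation, so the agent does not remember earlier turns. One way to add conversation memory, sketched below on the assumption that you are running a recent LangGraph release, is to attach an in-memory checkpointer and reuse the same thread_id for every invocation:
# Attach an in-memory checkpointer so the agent remembers earlier turns
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model="openai:gpt-3.5-turbo",
    tools=[],
    prompt="You are a helpful assistant.",
    checkpointer=MemorySaver()
)
# Invocations that share the same thread_id continue the same conversation
config = {"configurable": {"thread_id": "console-session"}}
result = agent.invoke(
    {"messages": [{"role": "user", "content": "My name is Ada."}]},
    config
)
result = agent.invoke(
    {"messages": [{"role": "user", "content": "What is my name?"}]},
    config
)
print(result["messages"][-1].content)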
BUILDING A CHATBOT WITH LANGCHAIN
LangChain offers composable components for defining chains, agents, memory backends, and integrations with both local and remote LLMs. You will first build a chatbot that runs locally by wrapping a HuggingFace text-generation pipeline in LangChain’s LLM interface. You will then build a chatbot that calls OpenAI’s chat models through LangChain’s ChatOpenAI class. Both examples will use a memory backend so that the chain automatically retains conversation context.
The code example below shows how to create a local text-generation pipeline using HuggingFace, wrap it in LangChain’s HuggingFacePipeline class, and then build a ConversationChain with buffer memory. You will then run an interactive loop that feeds each user message into the chain and prints the assistant’s reply.
# The following code builds a local chatbot using LangChain and HuggingFace pipeline
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
# Create a text-generation pipeline with the GPT-2 model, allowing up to 100 new tokens per reply
hf_pipeline = pipeline("text-generation", model="gpt2", max_new_tokens=100)
# Wrap the pipeline so LangChain can treat it as an LLM
llm = HuggingFacePipeline(pipeline=hf_pipeline)
# Create a conversation chain that stores messages in buffer memory
memory = ConversationBufferMemory()
conversation = ConversationChain(llm=llm, memory=memory)
while True:
    # Read the user's message
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        print("Assistant: Goodbye!")
        break
    # Use the conversation chain to produce a response
    response = conversation.predict(input=user_input)
    # Print the assistant's reply
    print("Assistant:", response)
Next you will build a chatbot that calls OpenAI’s chat models through LangChain. In this example you will instantiate LangChain’s ChatOpenAI class, which automatically reads the OPENAI_API_KEY from the environment. You will then create a ConversationChain with buffer memory and run the interactive loop as before.
# The following code builds a remote chatbot using LangChain and OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
# Create a ChatOpenAI instance that reads the API key from the environment
chat_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
# Create a conversation chain with buffer memory
memory = ConversationBufferMemory()
conversation = ConversationChain(llm=chat_model, memory=memory)
while True:
    # Prompt the user for input
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        print("Assistant: Goodbye!")
        break
    # Get the assistant's reply from the chain
    response = conversation.predict(input=user_input)
    # Print the assistant's reply
    print("Assistant:", response)
CONCLUSION AND NEXT STEPS
You have now seen how to build a functioning chatbot in multiple ways. You learned how to run a Transformer model locally with HuggingFace Transformers, how to call remote models via the HuggingFace Inference API, how to integrate with OpenAI’s Chat Completion endpoints directly, how to orchestrate stateful agents with LangGraph, and how to compose conversational chains with LangChain for both local and remote models. Running models locally gives you full control over your data and avoids API costs, while remote APIs spare you hardware complexity and often deliver more powerful models. LangGraph enables advanced workflows with durable memory and human-in-the-loop capabilities, and LangChain provides flexible abstractions for building chains and agents. As you move forward you might explore fine-tuning models on your own data, deploying your chatbot behind a web framework such as FastAPI or Flask, experimenting with function calling in OpenAI’s API, or constructing multi-agent pipelines in LangGraph.
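As a taste of the deployment idea, here is a minimal sketch of exposing the OpenAI-backed chatbot through FastAPI; the endpoint path and request model are illustrative choices, and you would need to install fastapi and uvicorn separately.
# Minimal FastAPI wrapper around the OpenAI-backed chatbot (illustrative sketch)
# Requires: pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
def chat(request: ChatRequest):
    # Each request is answered independently; add session handling for multi-turn conversations
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": request.message}
        ]
    )
    return {"reply": response.choices[0].message.content}

# Run with: uvicorn app:app --reload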