INTRODUCTION
Many software engineers find natural language processing intimidating because it often involves large and complex frameworks. The emergence of the Transformer architecture changed that by introducing an attention-based design that can be pretrained on vast text corpora and then applied to a variety of language tasks. HuggingFace became the center of a vibrant community by packaging these pretrained Transformer models into an intuitive Python library. In this article you will learn how to leverage HuggingFace Transformers to build your own chatbot in multiple ways. You will first build and run the chatbot locally on your machine, and then you will see how to invoke remote models hosted by HuggingFace and by OpenAI. Finally you will explore higher-level orchestration and chain frameworks provided by LangGraph and LangChain. By following these examples you will gain a clear understanding of how text is converted to tokens, how models generate continuations, and how to integrate these capabilities into your own applications.
ENVIRONMENT AND DEPENDENCIES
Before writing any code you need a Python environment and several libraries. You will install the HuggingFace Transformers library together with its fast tokenizers component and PyTorch to run models locally. You will also install requests to call HTTP endpoints, the OpenAI client library to call OpenAI’s hosted models, LangGraph for agent orchestration, and LangChain for composable chains. The following command installs all of these packages:
# The following command installs core dependencies for local and remote LLM usage
pip install transformers tokenizers torch requests openai langgraph langchain
After running this command you will have access to modules such as transformers.AutoTokenizer, transformers.AutoModelForCausalLM, requests for HTTP calls, openai for OpenAI’s Python client, langgraph.prebuilt for building agents, and langchain.llms and langchain.chat_models for constructing chains.
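If you want a quick sanity check that everything installed correctly, the short script below simply imports the entry points used in the rest of this article and does nothing else.
# Sanity check: confirm that the libraries used in this article can be imported
from transformers import AutoTokenizer, AutoModelForCausalLM
import requests
import openai
from langgraph.prebuilt import create_react_agent
from langchain.chains import ConversationChain
print("All dependencies imported successfully")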
BUILDING A LOCAL CHATBOT
To run a chatbot locally you need a tokenizer to convert text into a sequence of integer token IDs and a pretrained Transformer model to generate continuations from those tokens. You will then wrap these components in a simple interactive loop that reads user input from the console and prints the model’s responses.
The following code example shows how to load a tokenizer and a causal language model from HuggingFace’s model hub. You can replace the model identifier with any other causal model you prefer. This example uses PyTorch under the hood.
# The following code loads a tokenizer and a causal language model from HuggingFace
from transformers import AutoTokenizer, AutoModelForCausalLM
# The tokenizer converts text into integer token IDs that the model can process
tokenizer = AutoTokenizer.from_pretrained('gpt2')
# The model generates text continuations given input token IDs
model = AutoModelForCausalLM.from_pretrained('gpt2')
# Verify that the tokenizer and the model use the same vocabulary
assert tokenizer.vocab_size == model.config.vocab_size
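To see concretely how text is converted to tokens, you can run a quick check like the one below; the exact token IDs depend on the GPT-2 vocabulary, so treat the printed values as illustrative.
# Inspect how the tokenizer maps text to integer token IDs and back
sample_text = "Hello, how are you today?"
token_ids = tokenizer.encode(sample_text)
# Print the integer IDs, the corresponding subword tokens, and the reconstructed text
print(token_ids)
print(tokenizer.convert_ids_to_tokens(token_ids))
print(tokenizer.decode(token_ids))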
With the tokenizer and model loaded you can implement an interactive chat loop. You will move the model to a GPU if one is available to speed up inference. You will maintain the entire conversation history as a single text string so that the model has full context. Each time the user submits a message, you will append it to the history, tokenize the updated history, generate new tokens as the model’s response, decode only the newly generated tokens back to text, and then display them.
# The following script implements a basic interactive chat loop using PyTorch
import torch
# Select GPU if available for faster inference
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
# Initialize an empty string to hold the conversation history
chat_history = ""
while True:
    # Prompt the user for input
    user_input = input("User: ")
    # Allow the user to exit by typing 'quit'
    if user_input.strip().lower() == 'quit':
        print("Chatbot: Goodbye!")
        break
    # Append the user's message and a placeholder for the model's reply
    chat_history += f"User: {user_input}\nChatbot:"
    # Tokenize the full conversation history and move tensors to the selected device
    inputs = tokenizer(chat_history, return_tensors='pt').to(device)
    # Generate up to 100 new tokens as the chatbot's response
    # Using the end-of-sequence token for padding avoids a warning because GPT-2 has no pad token
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, skipping the original prompt tokens
    generated_text = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[-1]:],
        skip_special_tokens=True
    )
    # Append the generated response to the conversation history
    chat_history += generated_text + "\n"
    # Print the chatbot's reply to the console
    print(f"Chatbot: {generated_text}")
BUILDING A REMOTE CHATBOT VIA HUGGINGFACE INFERENCE API
Running large models locally can require substantial hardware. HuggingFace offers an Inference API that hosts models in the cloud and exposes them via HTTP endpoints. To use this service you must obtain an API token from your HuggingFace account settings page. The example below shows how to authenticate and send a prompt to a remote model.
# The following code shows how to call a remote model using HuggingFace Inference API
import requests
# Replace 'YOUR_HF_TOKEN' with your HuggingFace API token
api_token = "YOUR_HF_TOKEN"
headers = {"Authorization": f"Bearer {api_token}"}
# Specify the model you wish to call, for example 'bigscience/bloom'
model_id = "bigscience/bloom"
# Prepare the JSON payload with the prompt and generation parameters
payload = {
    "inputs": "Hello, how are you today?",
    "parameters": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}
# Send a POST request to the inference endpoint
response = requests.post(
    f"https://api-inference.huggingface.co/models/{model_id}",
    headers=headers,
    json=payload
)
# Parse the JSON response and print the first generated text
result = response.json()
print("Chatbot:", result[0]['generated_text'])
You can wrap this logic in a console loop to provide an interactive experience. You should also include retry logic to handle rate limits or transient network errors. HuggingFace’s API supports streaming responses as well by setting “stream”: true in the payload and reading chunks as they arrive.
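As a rough sketch of that idea, the loop below retries a failed request a few times before giving up; the helper name, retry count, and delay are illustrative choices rather than values required by the API.
# A minimal interactive loop around the Inference API with simple retry logic
import time

def query_with_retry(prompt, retries=3, delay=5):
    payload = {
        "inputs": prompt,
        "parameters": {"temperature": 0.7, "max_new_tokens": 100}
    }
    for attempt in range(retries):
        response = requests.post(
            f"https://api-inference.huggingface.co/models/{model_id}",
            headers=headers,
            json=payload
        )
        if response.status_code == 200:
            return response.json()[0]['generated_text']
        # Wait before retrying on rate limits, model loading, or transient errors
        time.sleep(delay)
    return "Sorry, the model is unavailable right now."

while True:
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        print("Chatbot: Goodbye!")
        break
    print("Chatbot:", query_with_retry(user_input))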
BUILDING A REMOTE CHATBOT WITH OPENAI’S API
Many developers choose to use OpenAI’s hosted models for state-of-the-art performance and convenience. To call OpenAI’s Chat Completion endpoints you must first install their Python client and set your API key in the environment variable OPENAI_API_KEY. The example below shows how to prepare your application for OpenAI calls.
# Run the following command in your shell to install the OpenAI client library
pip install openai
# The following code configures the OpenAI client in Python
import os
from openai import OpenAI
# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
Once configured you can implement a chat loop that maintains a list of message dictionaries, where each dictionary contains a “role” and “content.” By including a system message at the start you set the behavior of the assistant. Each time the user provides input, you append it to the history, call the Chat Completions endpoint, extract the assistant’s reply, append it back to the history, and then display it.
# This script implements an interactive chat loop using OpenAI’s Chat Completions API
import os
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# Initialize the conversation with a system message that defines assistant behavior
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]
while True:
    # Read the user's message from the console
    user_input = input("User: ")
    # Exit if the user types 'quit'
    if user_input.strip().lower() == "quit":
        print("Assistant: Goodbye!")
        break
    # Append the user's message to the conversation history
    chat_history.append({"role": "user", "content": user_input})
    # Call the Chat Completions endpoint with the conversation history
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=chat_history,
        temperature=0.7,
        max_tokens=150
    )
    # Extract the assistant's reply from the API response
    assistant_message = response.choices[0].message.content
    # Append the assistant's message back to the history
    chat_history.append({"role": "assistant", "content": assistant_message})
    # Print the assistant's reply to the console
    print("Assistant:", assistant_message)
If you prefer to display tokens as they are generated rather than waiting for the full response, you can enable streaming by passing stream=True to client.chat.completions.create and iterating over the returned chunks, printing each chunk’s delta content as it arrives.
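As a brief sketch of that approach, assuming the same client and chat_history list as in the loop above:
# Stream the assistant's reply token by token instead of waiting for the full response
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=chat_history,
    temperature=0.7,
    stream=True
)
assistant_message = ""
for chunk in stream:
    # Each chunk carries an incremental piece of the reply in its delta
    delta = chunk.choices[0].delta.content
    if delta:
        assistant_message += delta
        print(delta, end="", flush=True)
print()
chat_history.append({"role": "assistant", "content": assistant_message})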
BUILDING A CHATBOT WITH LANGGRAPH
LangGraph provides a low-level framework for orchestrating stateful agents that can run long-lived workflows, include human-in-the-loop breakpoints, and maintain durable memory. To get started you will install LangGraph and then create a simple react-style agent that uses OpenAI’s GPT-3.5-turbo model. You will then run an interactive loop that sends user messages to the agent and prints its replies.
# The following commands install LangGraph and the LangChain OpenAI integration
pip install -U langgraph
pip install -U "langchain[openai]"
# The following code creates a React-style agent and runs an interactive loop
from langgraph.prebuilt import create_react_agent
import os

# The OpenAI integration reads the API key from the OPENAI_API_KEY environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")
# Create an agent that uses OpenAI's GPT-3.5-turbo with no external tools
agent = create_react_agent(
    model="openai:gpt-3.5-turbo",
    tools=[],
    prompt="You are a helpful assistant."
)
while True:
    # Read the user's message
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        print("Assistant: Goodbye!")
        break
    # Invoke the agent with the user message
    result = agent.invoke({"messages": [{"role": "user", "content": user_input}]})
    # The agent returns LangChain message objects; the last one is the assistant's reply
    messages = result.get("messages", [])
    assistant_message = messages[-1].content if messages else ""
    # Print the assistant's reply
    print("Assistant:", assistant_message)
In this example you install LangGraph and the LangChain OpenAI integration. You then create a react-style agent by specifying the model identifier openai:gpt-3.5-turbo, an empty list of tools, and an initial prompt. The agent.invoke method accepts a dictionary containing a list of messages and returns a result whose messages field holds LangChain message objects; the last entry is the assistant’s reply, whose content you extract and print.
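Note that the loop above sends each user message in isolation, so the agent does not remember earlier turns. One way to add conversation memory, sketched below on the assumption that you are running a recent LangGraph release, is to attach an in-memory checkpointer and reuse the same thread_id for every invocation:
# Attach an in-memory checkpointer so the agent remembers earlier turns
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model="openai:gpt-3.5-turbo",
    tools=[],
    prompt="You are a helpful assistant.",
    checkpointer=MemorySaver()
)
# Invocations that share the same thread_id continue the same conversation
config = {"configurable": {"thread_id": "console-session"}}
result = agent.invoke(
    {"messages": [{"role": "user", "content": "My name is Ada."}]},
    config
)
result = agent.invoke(
    {"messages": [{"role": "user", "content": "What is my name?"}]},
    config
)
print(result["messages"][-1].content)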
BUILDING A CHATBOT WITH LANGCHAIN
LangChain offers composable components for defining chains, agents, memory backends, and integrations with both local and remote LLMs. You will first build a chatbot that runs locally by wrapping a HuggingFace text-generation pipeline in LangChain’s LLM interface. You will then build a chatbot that calls OpenAI’s chat models through LangChain’s ChatOpenAI class. Both examples will use a memory backend so that the chain automatically retains conversation context.
The code example below shows how to create a local text-generation pipeline using HuggingFace, wrap it in LangChain’s HuggingFacePipeline class, and then build a ConversationChain with buffer memory. You will then run an interactive loop that feeds each user message into the chain and prints the assistant’s reply.
# The following code builds a local chatbot using LangChain and HuggingFace pipeline
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
# Create a text-generation pipeline with the GPT-2 model, allowing up to 100 new tokens per reply
hf_pipeline = pipeline("text-generation", model="gpt2", max_new_tokens=100)
# Wrap the pipeline so LangChain can treat it as an LLM
llm = HuggingFacePipeline(pipeline=hf_pipeline)
# Create a conversation chain that stores messages in buffer memory
memory = ConversationBufferMemory()
conversation = ConversationChain(llm=llm, memory=memory)
while True:
    # Read the user's message
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        print("Assistant: Goodbye!")
        break
    # Use the conversation chain to produce a response
    response = conversation.predict(input=user_input)
    # Print the assistant's reply
    print("Assistant:", response)
Next you will build a chatbot that calls OpenAI’s chat models through LangChain. In this example you will instantiate LangChain’s ChatOpenAI class, which automatically reads the OPENAI_API_KEY from the environment. You will then create a ConversationChain with buffer memory and run the interactive loop as before.
# The following code builds a remote chatbot using LangChain and OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
# Create a ChatOpenAI instance that reads the API key from the environment
chat_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
# Create a conversation chain with buffer memory
memory = ConversationBufferMemory()
conversation = ConversationChain(llm=chat_model, memory=memory)
while True:
    # Prompt the user for input
    user_input = input("User: ")
    if user_input.strip().lower() == "quit":
        print("Assistant: Goodbye!")
        break
    # Get the assistant's reply from the chain
    response = conversation.predict(input=user_input)
    # Print the assistant's reply
    print("Assistant:", response)
CONCLUSION AND NEXT STEPS
You have now seen how to build a functioning chatbot in multiple ways. You learned how to run a Transformer model locally with HuggingFace Transformers, how to call remote models via the HuggingFace Inference API, how to integrate with OpenAI’s Chat Completion endpoints directly, how to orchestrate stateful agents with LangGraph, and how to compose conversational chains with LangChain for both local and remote models. Running models locally gives you full control over your data and avoids API costs, while remote APIs spare you hardware complexity and often deliver more powerful models. LangGraph enables advanced workflows with durable memory and human-in-the-loop capabilities, and LangChain provides flexible abstractions for building chains and agents. As you move forward you might explore fine-tuning models on your own data, deploying your chatbot behind a web framework such as FastAPI or Flask, experimenting with function calling in OpenAI’s API, or constructing multi-agent pipelines in LangGraph.
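As a taste of the deployment idea, here is a minimal sketch of exposing the OpenAI-backed chatbot through FastAPI; the endpoint path and request model are illustrative choices, and you would need to install fastapi and uvicorn separately.
# Minimal FastAPI wrapper around the OpenAI-backed chatbot (illustrative sketch)
# Requires: pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
def chat(request: ChatRequest):
    # Each request is answered independently; add session handling for multi-turn conversations
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": request.message}
        ]
    )
    return {"reply": response.choices[0].message.content}

# Run with: uvicorn app:app --reload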