Hitchhiker's Guide to AI, Software Architecture, and Everything Else: COMBINING LARGE LANGUAGE MODELS WITH THEOREM PROVERS: A TUTORIAL

INTRODUCTION TO THE CHALLENGE

Mathematics has always been a domain where precision matters absolutely. A single logical error can invalidate an entire proof, no matter how elegant or intuitive it might seem. For centuries, mathematicians have relied on peer review and careful checking to ensure correctness. However, as mathematical proofs grow increasingly complex, sometimes spanning hundreds of pages, the challenge of verification becomes daunting.

Enter two powerful technologies that, when combined, offer a revolutionary approach to mathematical reasoning. On one side, we have Large Language Models, which excel at understanding natural language, recognizing patterns, and generating human-like mathematical intuition. On the other side, we have Theorem Provers, which provide absolute logical rigor and can verify that every step in a proof follows necessarily from the axioms and previous statements.

The magic happens when we combine these two approaches. The LLM acts like a creative mathematician, proposing proof strategies and intermediate steps based on its training on vast amounts of mathematical literature. The Theorem Prover acts like a meticulous checker, ensuring that every proposed step is logically sound and formally correct. Together, they form a system that is both creative and rigorous.

UNDERSTANDING LARGE LANGUAGE MODELS FOR MATHEMATICS

Large Language Models have demonstrated remarkable capabilities in mathematical reasoning. Models like GPT-4, Claude, DeepSeek-Math, and specialized versions of Llama have shown they can solve complex mathematical problems, explain concepts, and even suggest proof strategies. However, they have a critical limitation: they can make mistakes. They might produce steps that seem plausible but are logically flawed.

What LLMs bring to mathematical proof is their ability to work with natural language descriptions of problems, their vast knowledge of mathematical techniques and patterns, and their capacity to generate creative approaches to proofs. They can take a theorem stated in plain English and propose a proof outline that a human mathematician would find reasonable.

For instance, if you ask an LLM to prove that the square root of two is irrational, it will likely suggest a proof by contradiction, proposing to assume that the square root of two can be expressed as a ratio of two integers in lowest terms, then deriving a contradiction. This is exactly the approach a human mathematician would take, because the LLM has learned this pattern from countless examples in its training data.

UNDERSTANDING THEOREM PROVERS

Theorem Provers are software systems that work with formal mathematical logic. Unlike LLMs, they do not guess or approximate. They verify proofs with absolute certainty by checking that each step follows from the axioms and inference rules of a formal logical system. Popular theorem provers include Lean, Coq, Isabelle/HOL, and others.

The power of a theorem prover lies in its guarantee of correctness. If a theorem prover accepts a proof, you can be absolutely certain that the proof is valid within the formal system being used. This is why major mathematical results, like the proof of the Kepler Conjecture, have been formalized in theorem provers to eliminate any possibility of error.

However, theorem provers have their own limitation: they require proofs to be written in a formal language that is quite different from how mathematicians normally communicate. Writing a proof in Lean or Coq requires expertise in the specific syntax and tactics of that system. This creates a barrier to entry and makes the process time-consuming.

THE SYNERGY: WHY COMBINING THEM WORKS

When we combine LLMs with Theorem Provers, we get the best of both worlds. The LLM provides the intuition and creativity, suggesting proof steps in a form that is close to natural mathematical language. The Theorem Prover provides the rigor, checking each suggested step and ensuring logical correctness.

The workflow looks like this: A user states a theorem in natural language. The LLM translates this into the formal language of the theorem prover and proposes a proof strategy. The system attempts to verify each step using the theorem prover. If a step fails verification, the LLM receives feedback and proposes an alternative. This loop continues until a complete, verified proof is constructed.

This approach has several advantages. First, it makes theorem proving more accessible because users can work in natural language rather than learning complex formal syntax. Second, it accelerates the proof process because the LLM can suggest steps that would take a human expert considerable time to formulate. Third, it maintains absolute rigor because every step is verified by the theorem prover.

ARCHITECTURAL DESIGN OF THE COMBINED SYSTEM

Before we dive into code, let us understand the architecture of our system. We will build a system with the following components:

The first component is the LLM Interface, which handles communication with the language model. This component takes natural language input and generates suggestions for proof steps. It can work with both open source models like Llama or commercial APIs like OpenAI's GPT-4.

The second component is the Theorem Prover Interface, which communicates with the formal verification system. For our implementation, we will use Lean 4, which is open source and has excellent tooling. This component translates LLM suggestions into Lean syntax and submits them for verification.

The third component is the Proof State Manager, which maintains the current state of the proof attempt. It tracks what has been proven so far, what remains to be proven, and the history of attempted steps.

The fourth component is the Feedback Loop Controller, which manages the interaction between the LLM and the Theorem Prover. When a proof step fails, it formulates an appropriate error message and sends it back to the LLM for a revised attempt.

The fifth component is the User Interface, which allows users to state theorems, view proof progress, and interact with the system.

SETTING UP THE DEVELOPMENT ENVIRONMENT

Before we can build our system, we need to set up the necessary tools. For the theorem prover, we will use Lean 4, which you can install following the instructions at the Lean community website. For the LLM component, we will write code that can work with multiple backends, including local models via Ollama or commercial APIs.

We will write our integration code in Python, as it has excellent libraries for both API communication and subprocess management. You will need Python 3.8 or later, along with several packages that we will install.

Let us start by creating a project structure. Create a directory for your project and set up a virtual environment to keep dependencies isolated.

IMPLEMENTING THE LLM INTERFACE

The LLM Interface is our bridge to the language model. We want this component to be flexible, supporting both open source and commercial models. Let us implement a clean abstraction that allows us to swap between different LLM backends easily.

import os
import json
from abc import ABC, abstractmethod
from typing import List, Dict, Optional


class LLMInterface(ABC):
    """
    Abstract base class for LLM interfaces.
    This allows us to support multiple LLM backends with a uniform API.
    """
    
    @abstractmethod
    def generate_proof_step(self, 
                           theorem_statement: str, 
                           current_proof_state: str,
                           previous_attempts: List[str]) -> str:
        """
        Generate a suggested proof step based on the current state.
        
        Args:
            theorem_statement: The theorem we are trying to prove
            current_proof_state: The current state in the proof
            previous_attempts: List of previously attempted steps that failed
            
        Returns:
            A suggested proof step in Lean syntax
        """
        pass
    
    @abstractmethod
    def translate_to_lean(self, natural_language_theorem: str) -> str:
        """
        Translate a natural language theorem statement into Lean syntax.
        
        Args:
            natural_language_theorem: Theorem stated in natural language
            
        Returns:
            The theorem in Lean 4 syntax
        """
        pass


class OpenAIInterface(LLMInterface):
    """
    Interface for OpenAI's GPT models.
    This allows us to use commercial models like GPT-4.
    """
    
    def __init__(self, api_key: str, model: str = "gpt-4"):
        """
        Initialize the OpenAI interface.
        
        Args:
            api_key: Your OpenAI API key
            model: The model to use (default: gpt-4)
        """
        self.api_key = api_key
        self.model = model
        self.conversation_history = []
        
        try:
            import openai
            self.client = openai.OpenAI(api_key=api_key)
        except ImportError:
            raise ImportError(
                "OpenAI package not installed. "
                "Install it with: pip install openai"
            )
    
    def generate_proof_step(self, 
                           theorem_statement: str, 
                           current_proof_state: str,
                           previous_attempts: List[str]) -> str:
        """
        Use GPT to generate the next proof step.
        """
        # Construct a detailed prompt that gives context
        prompt = self._construct_proof_step_prompt(
            theorem_statement, 
            current_proof_state, 
            previous_attempts
        )
        
        # Call the OpenAI API
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system", 
                    "content": self._get_system_prompt()
                },
                {
                    "role": "user", 
                    "content": prompt
                }
            ],
            temperature=0.7,
            max_tokens=500
        )
        
        # Extract the suggested step
        suggestion = response.choices[0].message.content.strip()
        return self._extract_lean_code(suggestion)
    
    def translate_to_lean(self, natural_language_theorem: str) -> str:
        """
        Translate natural language to Lean syntax using GPT.
        """
        prompt = f"""
        Translate the following theorem statement into Lean 4 syntax.
        Provide only the Lean code, without explanations.
        
        Theorem: {natural_language_theorem}
        
        Lean 4 code:
        """
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system", 
                    "content": self._get_system_prompt()
                },
                {
                    "role": "user", 
                    "content": prompt
                }
            ],
            temperature=0.3,
            max_tokens=300
        )
        
        return self._extract_lean_code(
            response.choices[0].message.content.strip()
        )
    
    def _get_system_prompt(self) -> str:
        """
        Get the system prompt that instructs the model on its role.
        """
        return """
        You are an expert in mathematical theorem proving using Lean 4.
        Your role is to help prove theorems by suggesting proof steps
        in valid Lean 4 syntax. You should be familiar with Lean's
        tactics, type theory, and common proof patterns.
        
        When suggesting proof steps:
        1. Use only valid Lean 4 syntax
        2. Be precise and formal
        3. Consider the current proof state carefully
        4. Learn from previous failed attempts
        5. Suggest one clear step at a time
        """
    
    def _construct_proof_step_prompt(self,
                                     theorem_statement: str,
                                     current_proof_state: str,
                                     previous_attempts: List[str]) -> str:
        """
        Construct a detailed prompt for generating the next proof step.
        """
        prompt = f"""
        We are proving the following theorem in Lean 4:
        
        {theorem_statement}
        
        Current proof state:
        {current_proof_state}
        """
        
        if previous_attempts:
            prompt += "\n\nPrevious attempts that failed:\n"
            for i, attempt in enumerate(previous_attempts, 1):
                prompt += f"{i}. {attempt}\n"
            prompt += "\nPlease suggest a different approach.\n"
        
        prompt += """
        Suggest the next proof step in Lean 4 syntax.
        Provide only the Lean tactic or code, without explanations.
        """
        
        return prompt
    
    def _extract_lean_code(self, response: str) -> str:
        """
        Extract Lean code from the response, removing markdown formatting.
        """
        # Remove markdown code blocks if present
        if "```lean" in response:
            start = response.find("```lean") + 7
            end = response.find("```", start)
            return response[start:end].strip()
        elif "```" in response:
            start = response.find("```") + 3
            end = response.find("```", start)
            return response[start:end].strip()
        else:
            return response.strip()


class OllamaInterface(LLMInterface):
    """
    Interface for local LLMs running via Ollama.
    This allows us to use open source models locally.
    """
    
    def __init__(self, model: str = "deepseek-math", host: str = "localhost:11434"):
        """
        Initialize the Ollama interface.
        
        Args:
            model: The model to use (must be pulled in Ollama first)
            host: The Ollama server host and port
        """
        self.model = model
        self.host = host
        self.base_url = f"http://{host}"
        
        try:
            import requests
            self.requests = requests
        except ImportError:
            raise ImportError(
                "Requests package not installed. "
                "Install it with: pip install requests"
            )
    
    def generate_proof_step(self, 
                           theorem_statement: str, 
                           current_proof_state: str,
                           previous_attempts: List[str]) -> str:
        """
        Use a local LLM via Ollama to generate the next proof step.
        """
        prompt = self._construct_proof_step_prompt(
            theorem_statement, 
            current_proof_state, 
            previous_attempts
        )
        
        # Call Ollama API
        response = self.requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "system": self._get_system_prompt(),
                "stream": False
            }
        )
        
        if response.status_code == 200:
            result = response.json()
            return self._extract_lean_code(result["response"])
        else:
            raise RuntimeError(
                f"Ollama API error: {response.status_code} - {response.text}"
            )
    
    def translate_to_lean(self, natural_language_theorem: str) -> str:
        """
        Translate natural language to Lean syntax using local LLM.
        """
        prompt = f"""
        Translate the following theorem statement into Lean 4 syntax.
        Provide only the Lean code, without explanations.
        
        Theorem: {natural_language_theorem}
        
        Lean 4 code:
        """
        
        response = self.requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "system": self._get_system_prompt(),
                "stream": False
            }
        )
        
        if response.status_code == 200:
            result = response.json()
            return self._extract_lean_code(result["response"])
        else:
            raise RuntimeError(
                f"Ollama API error: {response.status_code} - {response.text}"
            )
    
    def _get_system_prompt(self) -> str:
        """
        Get the system prompt for the local LLM.
        """
        return """
        You are an expert in mathematical theorem proving using Lean 4.
        Your role is to help prove theorems by suggesting proof steps
        in valid Lean 4 syntax. You should be familiar with Lean's
        tactics, type theory, and common proof patterns.
        
        When suggesting proof steps:
        1. Use only valid Lean 4 syntax
        2. Be precise and formal
        3. Consider the current proof state carefully
        4. Learn from previous failed attempts
        5. Suggest one clear step at a time
        """
    
    def _construct_proof_step_prompt(self,
                                     theorem_statement: str,
                                     current_proof_state: str,
                                     previous_attempts: List[str]) -> str:
        """
        Construct a detailed prompt for generating the next proof step.
        """
        prompt = f"""
        We are proving the following theorem in Lean 4:
        
        {theorem_statement}
        
        Current proof state:
        {current_proof_state}
        """
        
        if previous_attempts:
            prompt += "\n\nPrevious attempts that failed:\n"
            for i, attempt in enumerate(previous_attempts, 1):
                prompt += f"{i}. {attempt}\n"
            prompt += "\nPlease suggest a different approach.\n"
        
        prompt += """
        Suggest the next proof step in Lean 4 syntax.
        Provide only the Lean tactic or code, without explanations.
        """
        
        return prompt
    
    def _extract_lean_code(self, response: str) -> str:
        """
        Extract Lean code from the response.
        """
        # Remove markdown code blocks if present
        if "```lean" in response:
            start = response.find("```lean") + 7
            end = response.find("```", start)
            return response[start:end].strip()
        elif "```" in response:
            start = response.find("```") + 3
            end = response.find("```", start)
            return response[start:end].strip()
        else:
            return response.strip()

This LLM Interface code provides a clean abstraction over different language models. The abstract base class defines the contract that all LLM interfaces must follow, while the concrete implementations handle the specifics of communicating with OpenAI's API or a local Ollama server.

The key design principle here is separation of concerns. Each class has a single, well-defined responsibility. The OpenAIInterface knows how to talk to OpenAI's servers, while the OllamaInterface knows how to communicate with a local model. Both present the same interface to the rest of our system, making it easy to swap between them.

Notice how we construct detailed prompts that give the LLM context about what we are trying to prove, what the current state is, and what has already been tried. This context is crucial for getting good suggestions from the model.

IMPLEMENTING THE THEOREM PROVER INTERFACE

Now we need to build the interface to Lean 4. This component will take proof steps suggested by the LLM and verify them using Lean's type checker. It will also extract the current proof state so we can feed it back to the LLM.

import subprocess
import tempfile
import os
import re
from typing import Tuple, Optional
from dataclasses import dataclass


@dataclass
class ProofState:
    """
    Represents the current state of a proof attempt.
    """
    goals: List[str]
    hypotheses: List[str]
    is_complete: bool
    error_message: Optional[str] = None


class LeanInterface:
    """
    Interface for interacting with the Lean 4 theorem prover.
    This class handles compilation, verification, and state extraction.
    """
    
    def __init__(self, lean_executable: str = "lean"):
        """
        Initialize the Lean interface.
        
        Args:
            lean_executable: Path to the Lean executable
        """
        self.lean_executable = lean_executable
        self.verify_lean_installation()
    
    def verify_lean_installation(self):
        """
        Verify that Lean is properly installed and accessible.
        """
        try:
            result = subprocess.run(
                [self.lean_executable, "--version"],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.returncode != 0:
                raise RuntimeError(
                    "Lean is installed but returned an error. "
                    f"Error: {result.stderr}"
                )
        except FileNotFoundError:
            raise RuntimeError(
                f"Lean executable not found at {self.lean_executable}. "
                "Please install Lean 4 and ensure it is in your PATH."
            )
        except subprocess.TimeoutExpired:
            raise RuntimeError(
                "Lean verification timed out. "
                "There may be an issue with your Lean installation."
            )
    
    def verify_proof_step(self, 
                         theorem_code: str, 
                         proof_step: str) -> Tuple[bool, ProofState]:
        """
        Verify a single proof step in Lean.
        
        Args:
            theorem_code: The complete theorem statement in Lean
            proof_step: The proof step to verify
            
        Returns:
            A tuple of (success: bool, proof_state: ProofState)
        """
        # Create a complete Lean file with the theorem and proof step
        lean_code = self._construct_lean_file(theorem_code, proof_step)
        
        # Write to a temporary file and verify
        with tempfile.NamedTemporaryFile(
            mode='w', 
            suffix='.lean', 
            delete=False
        ) as f:
            f.write(lean_code)
            temp_file = f.name
        
        try:
            # Run Lean on the file
            result = subprocess.run(
                [self.lean_executable, temp_file],
                capture_output=True,
                text=True,
                timeout=30
            )
            
            # Parse the result
            if result.returncode == 0:
                # Proof step succeeded
                proof_state = self._extract_proof_state(result.stdout)
                return True, proof_state
            else:
                # Proof step failed
                error_msg = self._parse_error_message(result.stderr)
                proof_state = ProofState(
                    goals=[],
                    hypotheses=[],
                    is_complete=False,
                    error_message=error_msg
                )
                return False, proof_state
                
        except subprocess.TimeoutExpired:
            proof_state = ProofState(
                goals=[],
                hypotheses=[],
                is_complete=False,
                error_message="Verification timed out after 30 seconds"
            )
            return False, proof_state
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_file)
            except:
                pass
    
    def _construct_lean_file(self, theorem_code: str, proof_step: str) -> str:
        """
        Construct a complete Lean file with necessary imports.
        """
        return f"""
import Mathlib.Tactic

{theorem_code}
  {proof_step}
"""
    
    def _extract_proof_state(self, lean_output: str) -> ProofState:
        """
        Extract the current proof state from Lean's output.
        
        This parses Lean's output to determine what goals remain
        and what hypotheses are available.
        """
        # Check if proof is complete
        if "no goals" in lean_output.lower():
            return ProofState(
                goals=[],
                hypotheses=[],
                is_complete=True
            )
        
        # Extract goals
        goals = self._parse_goals(lean_output)
        
        # Extract hypotheses
        hypotheses = self._parse_hypotheses(lean_output)
        
        return ProofState(
            goals=goals,
            hypotheses=hypotheses,
            is_complete=False
        )
    
    def _parse_goals(self, output: str) -> List[str]:
        """
        Parse the goals from Lean's output.
        """
        goals = []
        
        # Look for goal markers in the output
        # Lean 4 typically shows goals after a turnstile symbol
        goal_pattern = r'⊢\s*(.+?)(?=\n\n|\n⊢|$)'
        matches = re.findall(goal_pattern, output, re.DOTALL)
        
        for match in matches:
            goals.append(match.strip())
        
        return goals
    
    def _parse_hypotheses(self, output: str) -> List[str]:
        """
        Parse the hypotheses from Lean's output.
        """
        hypotheses = []
        
        # Hypotheses typically appear before the turnstile
        # Format is usually: name : type
        hyp_pattern = r'(\w+)\s*:\s*([^\n]+)'
        matches = re.findall(hyp_pattern, output)
        
        for name, type_expr in matches:
            hypotheses.append(f"{name} : {type_expr}")
        
        return hypotheses
    
    def _parse_error_message(self, error_output: str) -> str:
        """
        Parse and clean up error messages from Lean.
        """
        # Remove file path information
        cleaned = re.sub(r'/tmp/tmp\w+\.lean:\d+:\d+:', '', error_output)
        
        # Extract the main error message
        lines = cleaned.split('\n')
        relevant_lines = [
            line for line in lines 
            if line.strip() and not line.startswith('---')
        ]
        
        return '\n'.join(relevant_lines[:5])  # Take first 5 relevant lines
    
    def check_theorem_syntax(self, theorem_code: str) -> Tuple[bool, str]:
        """
        Check if a theorem statement has valid Lean syntax.
        
        Args:
            theorem_code: The theorem code to check
            
        Returns:
            A tuple of (is_valid: bool, message: str)
        """
        # Create a minimal Lean file with just the theorem statement
        lean_code = f"""
import Mathlib.Tactic

{theorem_code}
  sorry  -- Placeholder proof
"""
        
        with tempfile.NamedTemporaryFile(
            mode='w', 
            suffix='.lean', 
            delete=False
        ) as f:
            f.write(lean_code)
            temp_file = f.name
        
        try:
            result = subprocess.run(
                [self.lean_executable, temp_file],
                capture_output=True,
                text=True,
                timeout=10
            )
            
            if result.returncode == 0:
                return True, "Theorem syntax is valid"
            else:
                error_msg = self._parse_error_message(result.stderr)
                return False, f"Syntax error: {error_msg}"
                
        except subprocess.TimeoutExpired:
            return False, "Syntax check timed out"
        finally:
            try:
                os.unlink(temp_file)
            except:
                pass

The Lean Interface handles all communication with the Lean theorem prover. It creates temporary files containing the Lean code, runs the Lean compiler on them, and parses the output to extract proof states and error messages.

The most important method here is verify_proof_step, which takes a theorem statement and a proposed proof step, combines them into a valid Lean file, and checks whether Lean accepts the proof. If the proof step is valid, we extract the resulting proof state, which tells us what goals remain to be proven. If the step is invalid, we extract the error message to help the LLM understand what went wrong.

Notice how we use temporary files for verification. This is necessary because Lean works with files rather than accepting code directly through standard input. We create the file, run Lean on it, parse the results, and then clean up the temporary file.

IMPLEMENTING THE PROOF STATE MANAGER

The Proof State Manager keeps track of the proof as it develops. It maintains a history of attempted steps, the current state of the proof, and provides methods for updating and querying this information.

from typing import List, Optional
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ProofStep:
    """
    Represents a single step in a proof attempt.
    """
    step_number: int
    lean_code: str
    was_successful: bool
    proof_state_after: Optional[ProofState]
    timestamp: datetime = field(default_factory=datetime.now)
    error_message: Optional[str] = None


class ProofStateManager:
    """
    Manages the state of an ongoing proof attempt.
    
    This class tracks the history of proof steps, maintains the current
    proof state, and provides methods for querying and updating the proof.
    """
    
    def __init__(self, theorem_statement: str, theorem_code: str):
        """
        Initialize the proof state manager.
        
        Args:
            theorem_statement: Natural language statement of the theorem
            theorem_code: Lean code for the theorem
        """
        self.theorem_statement = theorem_statement
        self.theorem_code = theorem_code
        self.proof_steps: List[ProofStep] = []
        self.current_state: Optional[ProofState] = None
        self.is_complete = False
        self.failed_attempts: List[str] = []
    
    def add_successful_step(self, lean_code: str, resulting_state: ProofState):
        """
        Record a successful proof step.
        
        Args:
            lean_code: The Lean code for this step
            resulting_state: The proof state after this step
        """
        step = ProofStep(
            step_number=len(self.proof_steps) + 1,
            lean_code=lean_code,
            was_successful=True,
            proof_state_after=resulting_state
        )
        
        self.proof_steps.append(step)
        self.current_state = resulting_state
        
        # Check if proof is complete
        if resulting_state.is_complete:
            self.is_complete = True
        
        # Clear failed attempts since we made progress
        self.failed_attempts = []
    
    def add_failed_attempt(self, lean_code: str, error_message: str):
        """
        Record a failed proof attempt.
        
        Args:
            lean_code: The Lean code that failed
            error_message: The error message from Lean
        """
        step = ProofStep(
            step_number=len(self.proof_steps) + 1,
            lean_code=lean_code,
            was_successful=False,
            proof_state_after=None,
            error_message=error_message
        )
        
        # We don't add failed steps to the main proof steps list
        # but we track them for feedback to the LLM
        self.failed_attempts.append(lean_code)
    
    def get_current_proof_code(self) -> str:
        """
        Get the complete Lean code for the proof so far.
        
        Returns:
            A string containing the theorem and all successful proof steps
        """
        if not self.proof_steps:
            return self.theorem_code
        
        proof_lines = [step.lean_code for step in self.proof_steps]
        proof_body = "\n  ".join(proof_lines)
        
        return f"{self.theorem_code}\n  {proof_body}"
    
    def get_proof_summary(self) -> str:
        """
        Get a human-readable summary of the proof progress.
        
        Returns:
            A formatted string describing the proof state
        """
        summary = f"Theorem: {self.theorem_statement}\n\n"
        summary += f"Total steps: {len(self.proof_steps)}\n"
        summary += f"Status: {'Complete' if self.is_complete else 'In progress'}\n\n"
        
        if self.current_state and not self.is_complete:
            summary += "Current goals:\n"
            for i, goal in enumerate(self.current_state.goals, 1):
                summary += f"  {i}. {goal}\n"
            
            if self.current_state.hypotheses:
                summary += "\nAvailable hypotheses:\n"
                for hyp in self.current_state.hypotheses:
                    summary += f"  {hyp}\n"
        
        return summary
    
    def get_recent_failed_attempts(self, count: int = 3) -> List[str]:
        """
        Get the most recent failed attempts.
        
        Args:
            count: Number of recent attempts to return
            
        Returns:
            List of Lean code strings that failed
        """
        return self.failed_attempts[-count:]
    
    def reset(self):
        """
        Reset the proof state to start over.
        """
        self.proof_steps = []
        self.current_state = None
        self.is_complete = False
        self.failed_attempts = []

The Proof State Manager is the memory of our system. It remembers every step we have taken, both successful and failed. This is crucial for two reasons. First, it allows us to build up the complete proof incrementally. Second, it allows us to give the LLM feedback about what has already been tried, preventing it from suggesting the same failed approach repeatedly.

The class maintains two separate lists: one for successful proof steps that form part of the actual proof, and another for failed attempts that we use only for feedback. This separation keeps the proof itself clean while still learning from mistakes.

IMPLEMENTING THE FEEDBACK LOOP CONTROLLER

The Feedback Loop Controller orchestrates the interaction between the LLM and the Theorem Prover. It implements the core logic of our system: propose a step, verify it, and either move forward or try again with feedback.

from typing import Optional, Tuple
import time


class FeedbackLoopController:
    """
    Controls the feedback loop between the LLM and the theorem prover.
    
    This is the brain of our system, coordinating the interaction between
    the creative LLM and the rigorous theorem prover.
    """
    
    def __init__(self, 
                 llm_interface: LLMInterface,
                 lean_interface: LeanInterface,
                 max_attempts_per_step: int = 5,
                 max_total_steps: int = 50):
        """
        Initialize the feedback loop controller.
        
        Args:
            llm_interface: The LLM interface to use
            lean_interface: The Lean interface to use
            max_attempts_per_step: Maximum attempts for each proof step
            max_total_steps: Maximum total steps before giving up
        """
        self.llm = llm_interface
        self.lean = lean_interface
        self.max_attempts_per_step = max_attempts_per_step
        self.max_total_steps = max_total_steps
    
    def prove_theorem(self, 
                     theorem_statement: str) -> Tuple[bool, ProofStateManager]:
        """
        Attempt to prove a theorem using the LLM-Lean combination.
        
        Args:
            theorem_statement: Natural language statement of the theorem
            
        Returns:
            A tuple of (success: bool, proof_manager: ProofStateManager)
        """
        print(f"Starting proof attempt for: {theorem_statement}")
        print("=" * 70)
        
        # Step 1: Translate the theorem to Lean
        print("\nStep 1: Translating theorem to Lean syntax...")
        theorem_code = self.llm.translate_to_lean(theorem_statement)
        print(f"Generated Lean code:\n{theorem_code}\n")
        
        # Step 2: Verify the theorem syntax
        print("Step 2: Verifying theorem syntax...")
        is_valid, message = self.lean.check_theorem_syntax(theorem_code)
        
        if not is_valid:
            print(f"ERROR: Invalid theorem syntax: {message}")
            # Try to fix the syntax
            print("Attempting to fix syntax...")
            theorem_code = self._fix_theorem_syntax(
                theorem_statement, 
                theorem_code, 
                message
            )
            is_valid, message = self.lean.check_theorem_syntax(theorem_code)
            
            if not is_valid:
                print(f"ERROR: Could not fix syntax: {message}")
                proof_manager = ProofStateManager(
                    theorem_statement, 
                    theorem_code
                )
                return False, proof_manager
        
        print("Theorem syntax is valid.\n")
        
        # Step 3: Initialize proof state manager
        proof_manager = ProofStateManager(theorem_statement, theorem_code)
        
        # Step 4: Iteratively build the proof
        print("Step 3: Building proof step by step...\n")
        total_steps = 0
        
        while not proof_manager.is_complete and total_steps < self.max_total_steps:
            total_steps += 1
            print(f"--- Proof Step {total_steps} ---")
            
            success = self._attempt_next_step(proof_manager)
            
            if not success:
                print(f"Failed to find valid step after "
                      f"{self.max_attempts_per_step} attempts.")
                break
            
            print(f"Step {total_steps} successful!")
            print(f"Current state: {len(proof_manager.current_state.goals)} "
                  f"goal(s) remaining\n")
            
            # Small delay to avoid overwhelming APIs
            time.sleep(0.5)
        
        # Step 5: Report results
        print("\n" + "=" * 70)
        if proof_manager.is_complete:
            print("SUCCESS! Proof completed.")
            print(f"Total steps: {len(proof_manager.proof_steps)}")
        else:
            print("INCOMPLETE: Could not complete the proof.")
            print(f"Attempted {total_steps} steps.")
        
        print("=" * 70)
        
        return proof_manager.is_complete, proof_manager
    
    def _attempt_next_step(self, proof_manager: ProofStateManager) -> bool:
        """
        Attempt to find and verify the next proof step.
        
        Args:
            proof_manager: The proof state manager
            
        Returns:
            True if a valid step was found, False otherwise
        """
        current_proof = proof_manager.get_current_proof_code()
        current_state_str = self._format_proof_state(
            proof_manager.current_state
        )
        
        for attempt in range(1, self.max_attempts_per_step + 1):
            print(f"  Attempt {attempt}/{self.max_attempts_per_step}...")
            
            # Get suggestion from LLM
            recent_failures = proof_manager.get_recent_failed_attempts()
            suggested_step = self.llm.generate_proof_step(
                proof_manager.theorem_code,
                current_state_str,
                recent_failures
            )
            
            print(f"  LLM suggests: {suggested_step}")
            
            # Verify with Lean
            success, new_state = self.lean.verify_proof_step(
                proof_manager.theorem_code,
                suggested_step
            )
            
            if success:
                # Step verified successfully
                proof_manager.add_successful_step(suggested_step, new_state)
                return True
            else:
                # Step failed verification
                error_msg = new_state.error_message or "Unknown error"
                print(f"  Verification failed: {error_msg}")
                proof_manager.add_failed_attempt(suggested_step, error_msg)
        
        # All attempts failed
        return False
    
    def _format_proof_state(self, state: Optional[ProofState]) -> str:
        """
        Format the proof state as a string for the LLM.
        
        Args:
            state: The current proof state
            
        Returns:
            A formatted string describing the state
        """
        if state is None:
            return "Initial state - no proof steps yet"
        
        if state.is_complete:
            return "Proof complete - no goals remaining"
        
        formatted = "Goals to prove:\n"
        for i, goal in enumerate(state.goals, 1):
            formatted += f"  {i}. {goal}\n"
        
        if state.hypotheses:
            formatted += "\nAvailable hypotheses:\n"
            for hyp in state.hypotheses:
                formatted += f"  {hyp}\n"
        
        return formatted
    
    def _fix_theorem_syntax(self, 
                           theorem_statement: str,
                           broken_code: str,
                           error_message: str) -> str:
        """
        Attempt to fix syntax errors in the theorem code.
        
        Args:
            theorem_statement: Original natural language statement
            broken_code: The Lean code with syntax errors
            error_message: The error message from Lean
            
        Returns:
            Corrected Lean code
        """
        # This is a simplified version - in practice, you might want
        # to iterate with the LLM to fix the syntax
        print(f"Asking LLM to fix syntax error: {error_message}")
        
        # For now, just ask the LLM to translate again with the error context
        return self.llm.translate_to_lean(theorem_statement)

The Feedback Loop Controller is where the magic happens. It implements the core algorithm of our system. First, it translates the natural language theorem into Lean syntax. Then it enters a loop where it repeatedly asks the LLM for the next proof step, verifies that step with Lean, and either adds it to the proof or tries again with error feedback.

The key insight here is the feedback mechanism. When a proof step fails, we do not just try again blindly. We tell the LLM what went wrong and what we have already tried. This allows the LLM to learn from its mistakes and try different approaches.

Notice the rate limiting with the small delay between steps. This is important when using commercial APIs to avoid hitting rate limits and to be respectful of the service.

CREATING THE USER INTERFACE

Now we need a way for users to interact with our system. We will create a simple command-line interface that allows users to input theorems and see the proof process unfold.

import sys
from typing import Optional


class CommandLineInterface:
    """
    Simple command-line interface for the theorem proving system.
    
    This provides an interactive way for users to prove theorems
    and see the results.
    """
    
    def __init__(self, controller: FeedbackLoopController):
        """
        Initialize the CLI.
        
        Args:
            controller: The feedback loop controller to use
        """
        self.controller = controller
    
    def run(self):
        """
        Run the interactive command-line interface.
        """
        self._print_welcome()
        
        while True:
            print("\n" + "=" * 70)
            print("Options:")
            print("  1. Prove a theorem")
            print("  2. View example theorems")
            print("  3. Exit")
            print("=" * 70)
            
            choice = input("\nEnter your choice (1-3): ").strip()
            
            if choice == "1":
                self._prove_theorem_interactive()
            elif choice == "2":
                self._show_examples()
            elif choice == "3":
                print("\nThank you for using the theorem proving system!")
                break
            else:
                print("\nInvalid choice. Please enter 1, 2, or 3.")
    
    def _print_welcome(self):
        """
        Print the welcome message.
        """
        print("\n" + "=" * 70)
        print("LLM + Theorem Prover: Interactive Proof System")
        print("=" * 70)
        print("\nThis system combines Large Language Models with the Lean 4")
        print("theorem prover to help you prove mathematical theorems.")
        print("\nThe LLM suggests proof steps, and Lean verifies them for")
        print("absolute correctness.")
    
    def _prove_theorem_interactive(self):
        """
        Interactive theorem proving session.
        """
        print("\n" + "-" * 70)
        print("Prove a Theorem")
        print("-" * 70)
        print("\nEnter your theorem in natural language.")
        print("Example: The square root of 2 is irrational")
        print("\nTheorem: ", end="")
        
        theorem = input().strip()
        
        if not theorem:
            print("\nNo theorem entered. Returning to main menu.")
            return
        
        print("\nStarting proof attempt...")
        print("This may take a few minutes depending on the complexity.\n")
        
        try:
            success, proof_manager = self.controller.prove_theorem(theorem)
            
            self._display_results(success, proof_manager)
            
        except Exception as e:
            print(f"\nERROR: An unexpected error occurred: {str(e)}")
            print("Please try again or report this issue.")
    
    def _display_results(self, success: bool, proof_manager: ProofStateManager):
        """
        Display the results of a proof attempt.
        
        Args:
            success: Whether the proof was successful
            proof_manager: The proof state manager with the results
        """
        print("\n" + "=" * 70)
        print("PROOF RESULTS")
        print("=" * 70)
        
        print(f"\nTheorem: {proof_manager.theorem_statement}")
        print(f"Status: {'PROVEN' if success else 'INCOMPLETE'}")
        print(f"Steps: {len(proof_manager.proof_steps)}")
        
        if success:
            print("\nComplete Lean proof:")
            print("-" * 70)
            print(proof_manager.get_current_proof_code())
            print("-" * 70)
            
            # Offer to save the proof
            save = input("\nWould you like to save this proof? (y/n): ").strip().lower()
            if save == 'y':
                self._save_proof(proof_manager)
        else:
            print("\nThe system was unable to complete the proof.")
            print(f"\nProgress made:")
            print(proof_manager.get_proof_summary())
    
    def _save_proof(self, proof_manager: ProofStateManager):
        """
        Save a completed proof to a file.
        
        Args:
            proof_manager: The proof state manager with the completed proof
        """
        filename = input("Enter filename (without extension): ").strip()
        if not filename:
            filename = "proof"
        
        filename = f"{filename}.lean"
        
        try:
            with open(filename, 'w') as f:
                f.write("-- Automatically generated proof\n")
                f.write(f"-- Theorem: {proof_manager.theorem_statement}\n\n")
                f.write(proof_manager.get_current_proof_code())
            
            print(f"\nProof saved to {filename}")
        except Exception as e:
            print(f"\nError saving proof: {str(e)}")
    
    def _show_examples(self):
        """
        Show example theorems that can be proven.
        """
        print("\n" + "-" * 70)
        print("Example Theorems")
        print("-" * 70)
        print("\nHere are some example theorems you can try:")
        print("\n1. Simple arithmetic:")
        print("   'For all natural numbers n, n + 0 = n'")
        print("\n2. Basic algebra:")
        print("   'For all natural numbers a and b, a + b = b + a'")
        print("\n3. Number theory:")
        print("   'There are infinitely many prime numbers'")
        print("\n4. Set theory:")
        print("   'The union of a set with the empty set is the set itself'")
        print("\nNote: More complex theorems may require more sophisticated")
        print("proof strategies and may not always succeed.")
        print("-" * 70)


def create_system(llm_type: str = "openai", 
                 api_key: Optional[str] = None,
                 model: Optional[str] = None) -> CommandLineInterface:
    """
    Factory function to create the complete system.
    
    Args:
        llm_type: Type of LLM to use ("openai" or "ollama")
        api_key: API key for commercial LLMs (required for OpenAI)
        model: Specific model to use
        
    Returns:
        A configured CommandLineInterface instance
    """
    # Create LLM interface
    if llm_type.lower() == "openai":
        if not api_key:
            api_key = os.environ.get("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key required. Set OPENAI_API_KEY environment "
                    "variable or pass api_key parameter."
                )
        llm = OpenAIInterface(api_key, model or "gpt-4")
    elif llm_type.lower() == "ollama":
        llm = OllamaInterface(model or "deepseek-math")
    else:
        raise ValueError(f"Unknown LLM type: {llm_type}")
    
    # Create Lean interface
    lean = LeanInterface()
    
    # Create controller
    controller = FeedbackLoopController(llm, lean)
    
    # Create CLI
    return CommandLineInterface(controller)

The Command Line Interface provides a user-friendly way to interact with our system. Users can enter theorems in natural language, watch as the system attempts to prove them, and save successful proofs to files.

The interface is designed to be informative, showing the user what is happening at each step of the proof process. This transparency is important because theorem proving can take time, and users should understand what the system is doing.

PUTTING IT ALL TOGETHER: THE MAIN PROGRAM

Now we can create the main program that ties everything together and allows users to start proving theorems.

#!/usr/bin/env python3
"""
LLM + Theorem Prover Integration System

This program combines Large Language Models with the Lean 4 theorem prover
to assist in proving mathematical theorems. The LLM provides creative
proof strategies while Lean ensures absolute logical correctness.

Usage:
    python main.py --llm openai --api-key YOUR_KEY
    python main.py --llm ollama --model deepseek-math
"""

import argparse
import sys
import os


def parse_arguments():
    """
    Parse command-line arguments.
    
    Returns:
        Parsed arguments
    """
    parser = argparse.ArgumentParser(
        description="LLM + Theorem Prover Integration System",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  Using OpenAI GPT-4:
    python main.py --llm openai --api-key sk-...
    
  Using local Ollama with DeepSeek-Math:
    python main.py --llm ollama --model deepseek-math
    
  Using environment variable for API key:
    export OPENAI_API_KEY=sk-...
    python main.py --llm openai
        """
    )
    
    parser.add_argument(
        "--llm",
        type=str,
        choices=["openai", "ollama"],
        default="ollama",
        help="Type of LLM to use (default: ollama)"
    )
    
    parser.add_argument(
        "--api-key",
        type=str,
        help="API key for commercial LLMs (or set OPENAI_API_KEY env var)"
    )
    
    parser.add_argument(
        "--model",
        type=str,
        help="Specific model to use (e.g., gpt-4, deepseek-math)"
    )
    
    parser.add_argument(
        "--non-interactive",
        action="store_true",
        help="Run in non-interactive mode (for testing)"
    )
    
    parser.add_argument(
        "--theorem",
        type=str,
        help="Theorem to prove (for non-interactive mode)"
    )
    
    return parser.parse_args()


def main():
    """
    Main entry point for the program.
    """
    args = parse_arguments()
    
    try:
        # Create the system
        print("Initializing LLM + Theorem Prover system...")
        print(f"LLM type: {args.llm}")
        if args.model:
            print(f"Model: {args.model}")
        
        cli = create_system(
            llm_type=args.llm,
            api_key=args.api_key,
            model=args.model
        )
        
        print("System initialized successfully!\n")
        
        if args.non_interactive:
            # Non-interactive mode for testing
            if not args.theorem:
                print("ERROR: --theorem required in non-interactive mode")
                sys.exit(1)
            
            success, proof_manager = cli.controller.prove_theorem(args.theorem)
            cli._display_results(success, proof_manager)
        else:
            # Interactive mode
            cli.run()
            
    except KeyboardInterrupt:
        print("\n\nInterrupted by user. Exiting...")
        sys.exit(0)
    except Exception as e:
        print(f"\nFATAL ERROR: {str(e)}")
        print("\nPlease check your configuration and try again.")
        sys.exit(1)


if __name__ == "__main__":
    main()

This main program provides a clean command-line interface for starting the system. Users can choose between different LLM backends and configure the system according to their needs.

EXAMPLE: PROVING A SIMPLE THEOREM

Let us walk through a concrete example to see how the system works in practice. Suppose we want to prove that for all natural numbers n, n plus zero equals n. This is a simple theorem, but it illustrates the complete workflow.

First, the user enters the theorem in natural language. The LLM translates this into Lean syntax, producing something like this:

theorem add_zero (n : Nat) : n + 0 = n := by

The system verifies that this syntax is correct. Then it enters the proof loop. The LLM looks at the goal, which is to prove that n plus zero equals n for an arbitrary natural number n. Based on its training, the LLM knows that this is a fundamental property of natural numbers in Lean, and it suggests using the rfl tactic, which proves goals by reflexivity.

theorem add_zero (n : Nat) : n + 0 = n := by
  rfl

The system sends this to Lean for verification. Lean checks whether reflexivity is sufficient to prove the goal. In this case, it is not quite that simple because the definition of addition in Lean is recursive on the second argument, so n plus zero does not immediately simplify to n.

The LLM receives feedback that rfl failed. It then tries a different approach, perhaps using the simp tactic to simplify the goal:

theorem add_zero (n : Nat) : n + 0 = n := by
  simp

Lean verifies this step and confirms that it completes the proof. The system records this as a successful proof and presents it to the user.

This example shows the key aspects of the system: natural language input, automatic translation to formal syntax, iterative proof search with feedback, and rigorous verification.

ADVANCED FEATURES AND EXTENSIONS

The system we have built is a solid foundation, but there are many ways to extend and improve it. Here are some advanced features you might consider implementing.

One important extension is proof caching. When the system successfully proves a lemma, it can save that proof and reuse it in future proofs. This is particularly valuable for complex proofs that build on many intermediate results. You could implement this by maintaining a database of proven lemmas and their proofs, indexed by their statements.

Another valuable feature is proof search strategies. Instead of trying one step at a time, the system could explore multiple proof paths in parallel, using techniques like beam search or Monte Carlo tree search. This would make the system more robust and able to handle more complex proofs.

You could also add support for multiple theorem provers. While we have focused on Lean, the same architecture could work with Coq, Isabelle, or other provers. The key is to implement the appropriate interface for each prover while maintaining the same high-level API.

Interactive proof refinement is another interesting direction. Instead of fully automating the proof, the system could work collaboratively with a human mathematician, with the human providing high-level guidance and the system filling in the details.

HANDLING ERRORS AND EDGE CASES

A robust system must handle errors gracefully. Our implementation includes several error handling mechanisms that are worth discussing in detail.

When the LLM suggests invalid Lean syntax, the system catches the error, extracts the error message from Lean, and feeds it back to the LLM. This allows the LLM to learn from its syntax mistakes and correct them. However, we limit the number of retry attempts to prevent infinite loops.

When the system cannot find a proof within the maximum number of steps, it reports partial progress to the user. This is important because even an incomplete proof can be valuable, showing which parts of the theorem are straightforward and which are challenging.

Network errors when communicating with commercial LLM APIs are handled with appropriate error messages. In a production system, you might want to add retry logic with exponential backoff.

Timeout handling is crucial because both LLM inference and theorem proving can occasionally hang. We set reasonable timeouts for both operations and handle timeout exceptions gracefully.

PERFORMANCE CONSIDERATIONS

The performance of this system depends on several factors. The speed of the LLM is one bottleneck, particularly when using commercial APIs over the network. Local models via Ollama can be faster but may produce lower quality suggestions.

Lean verification is generally fast for simple steps but can be slow for complex tactics or large proof states. Caching verified steps can help avoid redundant verification.

The number of retry attempts per step significantly affects total runtime. Setting this too low may cause the system to give up prematurely, while setting it too high wastes time on unproductive search paths.

For better performance, you could implement parallel proof search, where multiple proof strategies are explored simultaneously. This requires careful management of Lean processes and LLM requests.

TESTING AND VALIDATION

Testing a system that combines LLMs with formal verification requires a multi-faceted approach. Unit tests should verify that each component works correctly in isolation. For example, test that the Lean interface correctly parses proof states and error messages.

Integration tests should verify that the components work together correctly. Create a suite of simple theorems with known proofs and verify that the system can prove them.

Regression tests are important to ensure that changes to the system do not break previously working functionality. Maintain a collection of theorems that the system has successfully proven and regularly verify that it can still prove them.

Performance benchmarks help track how the system performs over time. Measure metrics like average time to prove a theorem, success rate on a test suite, and number of LLM calls required per proof.

ETHICAL CONSIDERATIONS AND LIMITATIONS

It is important to understand the limitations of this system and use it responsibly. The system is a tool to assist mathematicians, not to replace them. It can help automate tedious parts of proofs and catch errors, but human insight and creativity remain essential for mathematical research.

The LLM component can make mistakes, and while the theorem prover catches logical errors, it cannot catch mistakes in problem formulation. If you formalize the wrong theorem, the system might prove it correctly even though it does not capture what you intended.

The system works best on theorems that are similar to those in the LLM's training data. For truly novel mathematical results, the LLM may not have good intuition about proof strategies.

There are also questions about credit and authorship. If the system helps prove a theorem, how should that be acknowledged? These are evolving questions in the field of automated theorem proving.

CONCLUSION AND FUTURE DIRECTIONS

We have built a complete system that combines the creative power of Large Language Models with the rigorous verification of theorem provers. This system demonstrates how AI can augment human mathematical reasoning while maintaining absolute correctness through formal verification.

The key insight is that LLMs and theorem provers have complementary strengths. LLMs excel at pattern recognition and generating plausible proof strategies based on vast training data. Theorem provers excel at rigorous verification and catching logical errors. Together, they form a powerful tool for mathematical reasoning.

The field of automated theorem proving is advancing rapidly. Future developments might include better integration between natural language and formal mathematics, more sophisticated proof search strategies, and systems that can learn from successful proofs to improve over time.

As LLMs continue to improve and theorem provers become more user-friendly, we can expect these combined systems to become increasingly powerful and accessible. They have the potential to democratize formal mathematics, making rigorous proof accessible to a broader audience.

The code we have developed here is a starting point. I encourage you to experiment with it, extend it, and adapt it to your needs. Try proving different theorems, experiment with different LLMs and theorem provers, and explore new ways to combine machine learning with formal verification.

Mathematics has always been a collaborative endeavor, with mathematicians building on each other's work across generations. These AI-assisted tools are a new kind of collaborator, one that can work tirelessly to verify our reasoning and suggest new approaches. Used wisely, they can help us push the boundaries of mathematical knowledge while maintaining the rigor that makes mathematics special.

Hitchhiker's Guide to AI, Software Architecture, and Everything Else

Wednesday, May 13, 2026

COMBINING LARGE LANGUAGE MODELS WITH THEOREM PROVERS: A TUTORIAL