Introduction

Modern software development relies heavily on version control systems, and Git is by far the most widely used. Navigating and understanding a complex Git repository, especially an unfamiliar one, can be daunting and time-consuming for developers, project managers, and new team members alike. The sheer volume of code, commit history, branching strategies, and associated documentation often stands in the way of rapid comprehension. This article introduces the concept and detailed architecture of an LLM-based Git analysis agent designed to automate this process and produce a comprehensive summary of any given repository.

The core challenge this agent addresses is bridging the gap between raw Git repository data and human-understandable, high-level summaries. A further technical hurdle is the limited context window of any Large Language Model (LLM). A typical repository can contain hundreds or thousands of files, which can easily exceed the token capacity of an LLM if processed all at once. Our agent is engineered to overcome this with a progressive summarization strategy that breaks the analysis into manageable, context-aware chunks.
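
To make the scale of the problem concrete, the following sketch estimates whether a set of files could fit in a single prompt. The ratio of four characters per token is only a rough rule of thumb, and the 128,000-token window is an assumed example value rather than a property of any particular model; `estimate_tokens` and `fits_in_context` are hypothetical helper names that are not part of the agent.

```python
from typing import Dict

CHARS_PER_TOKEN = 4        # rough heuristic (assumption), not a real tokenizer
CONTEXT_WINDOW = 128_000   # assumed example window size, in tokens


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(files: Dict[str, str], reserved_for_output: int = 4_000) -> bool:
    """Return True if all file contents could plausibly fit in one prompt."""
    total = sum(estimate_tokens(content) for content in files.values())
    return total + reserved_for_output <= CONTEXT_WINDOW


# 2,000 files of about 8 KB each come to roughly 4 million tokens under this heuristic
repo = {f"src/module_{i}.py": "x" * 8_000 for i in range(2_000)}
print(fits_in_context(repo))
```

A check like this only motivates the design; an exact count would require the tokenizer of the model actually in use.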

Agent Architecture Overview

The Git analysis agent operates through a series of interconnected modules, each responsible for a specific aspect of repository understanding and information synthesis. This modular design ensures maintainability, scalability, and adherence to clean architecture principles. The overall flow begins with user input, proceeds through repository acquisition and detailed analysis, leverages LLMs for summarization, and culminates in a structured, comprehensive report.

Here is a conceptual ASCII diagram illustrating the agent's architecture:

+---------------------+     +--------------------------+
| User Configuration  |---->| Repository Acquisition   |
| (LLM, Repo Path)    |     | (Local/Remote)           |
+---------------------+     +--------------------------+
           |                             |
           v                             v
+---------------------+     +--------------------------+
| Orchestration Engine|---->| Git Interaction Module   |
| (Main Control Flow) |     | (Log, Diff, Files, Tags) |
+---------------------+     +--------------------------+
           |                             |
           v                             v
+---------------------+     +--------------------------+
| File Analysis Module|---->| LLM Integration Layer    |
| (Read, Chunk Files) |     | (Prompting, API Calls)   |
+---------------------+     +--------------------------+
           |                             |
           v                             v
+---------------------+     +--------------------------+
| Progressive         |     | Output Generation        |
| Summarization &     |---->| (Structured Report)      |
| Memory Module       |     +--------------------------+
+---------------------+

The agent's journey starts with the user providing configuration details, including the target repository's location and the LLM settings. The Repository Acquisition module then handles fetching the repository, whether it is a local directory or a remote URL. The Orchestration Engine acts as the central coordinator, directing the flow of analysis. It first delegates to the Git Interaction Module to extract metadata such as commit history, branches, and tags. The File Analysis Module then reads individual files and prepares their content for LLM processing. The LLM Integration Layer manages all communication with the chosen Large Language Model, crafting prompts and parsing responses. Crucially, the Progressive Summarization and Memory Module aggregates file-level summaries into higher-level insights, which keeps each prompt within the context window. Finally, the Output Generation module compiles all gathered and summarized information into a coherent and detailed report for the user.

Detailed Constituent Descriptions

Let us delve deeper into each critical component of our LLM-based Git analysis agent, providing code examples to illustrate their functionality.

Configuration Management

Effective configuration management is paramount for flexibility and ease of use. The user must be able to specify the repository path (local or remote) and the details for the LLM, including whether it is a local model (e.g., via Ollama or a local server) or a remote API (e.g., OpenAI, Azure OpenAI). This module centralizes these settings, making them accessible throughout the agent.

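We define two classes to encapsulate these settings: `LLMConfig` for the model-specific parameters and `AgentConfig` for the repository path and output location. Both validate their inputs, ensuring that all necessary parameters are available before the analysis begins.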

# config.py
import os
from typing import Optional


class LLMConfig:
    """
    Encapsulates configuration settings for the Large Language Model.
    Supports both remote API-based LLMs and local server-based LLMs.
    """

    def __init__(self,
                 llm_type: str,  # 'openai', 'local'
                 api_key: Optional[str] = None,
                 model_name: str = "gpt-4o-mini",
                 base_url: Optional[str] = None):
        """
        Initializes the LLM configuration.

        Args:
            llm_type: Specifies the type of LLM ('openai' for remote API, 'local' for a local server).
            api_key: The API key for remote LLM services (e.g., OpenAI API key).
                This should ideally be loaded from environment variables for security.
            model_name: The specific model identifier to use (e.g., "gpt-4o-mini", "llama3").
            base_url: The base URL for local LLM servers (e.g., "http://localhost:11434/v1").
        """
        if llm_type not in ['openai', 'local']:
            raise ValueError("llm_type must be 'openai' or 'local'")
        self.llm_type = llm_type
        self.api_key = api_key if api_key else os.getenv("OPENAI_API_KEY")
        self.model_name = model_name
        self.base_url = base_url

        if self.llm_type == 'openai' and not self.api_key:
            raise ValueError("OPENAI_API_KEY environment variable or api_key must be set for OpenAI LLM type.")
        if self.llm_type == 'local' and not self.base_url:
            raise ValueError("base_url must be set for local LLM type.")

    def __repr__(self) -> str:
        """Provides a string representation of the LLMConfig object (the API key is omitted)."""
        return (f"LLMConfig(llm_type='{self.llm_type}', model_name='{self.model_name}', "
                f"base_url='{self.base_url if self.base_url else 'N/A'}')")


class AgentConfig:
    """
    Main configuration class for the Git analysis agent.
    Holds repository path and LLM configuration.
    """

    def __init__(self,
                 repo_path: str,
                 llm_config: LLMConfig,
                 output_dir: str = "analysis_results"):
        """
        Initializes the agent configuration.

        Args:
            repo_path: The path to the local Git repository or its remote URL.
            llm_config: An instance of LLMConfig containing LLM-specific settings.
            output_dir: The directory where analysis results and summaries will be stored.
        """
        self.repo_path = repo_path
        self.llm_config = llm_config
        self.output_dir = output_dir
        # Ensure output directory exists
        os.makedirs(self.output_dir, exist_ok=True)

    def __repr__(self) -> str:
        """Provides a string representation of the AgentConfig object."""
        return (f"AgentConfig(repo_path='{self.repo_path}', llm_config={self.llm_config}, "
                f"output_dir='{self.output_dir}')")
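
Since the API key is best kept out of source code, the remaining LLM settings could be read from the environment as well. The sketch below shows one way this might look. The `GIT_AGENT_*` variable names and the `llm_kwargs_from_env` helper are hypothetical; only `OPENAI_API_KEY` and the example local URL come from the listing above.

```python
import os
from typing import Dict, Mapping, Optional


def llm_kwargs_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Build keyword arguments for LLMConfig from environment variables.

    The GIT_AGENT_* names are hypothetical; adapt them to your own conventions.
    """
    llm_type = env.get("GIT_AGENT_LLM_TYPE", "openai")
    kwargs: Dict[str, Optional[str]] = {
        "llm_type": llm_type,
        "model_name": env.get("GIT_AGENT_MODEL", "gpt-4o-mini"),
        "api_key": env.get("OPENAI_API_KEY"),
        "base_url": env.get("GIT_AGENT_BASE_URL"),
    }
    if llm_type == "local" and not kwargs["base_url"]:
        # Fall back to the example URL mentioned in the LLMConfig docstring
        kwargs["base_url"] = "http://localhost:11434/v1"
    return kwargs


# A plain dict stands in for os.environ so the example is reproducible
settings = llm_kwargs_from_env({"GIT_AGENT_LLM_TYPE": "local", "GIT_AGENT_MODEL": "llama3"})
print(settings)
```

The resulting dictionary could then be passed on as `LLMConfig(**settings)`, which still performs its own validation.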

Repository Acquisition

This module is responsible for obtaining the Git repository. It must handle two primary scenarios: a local file path already present on the user's machine or a remote URL pointing to a repository on platforms like GitHub or GitLab. For remote repositories, it performs a clone operation.

The `GitRepositoryManager` class encapsulates the logic for cloning remote repositories and validating local paths. It ensures that the agent always operates on a valid, accessible Git repository.

# git_operations.py
import os
import shutil
from typing import Optional

import git  # type: ignore  # gitpython library


class GitRepositoryManager:
    """
    Manages the acquisition and cleanup of Git repositories.
    Handles cloning remote repositories and validating local paths.
    """

    def __init__(self, repo_source_path: str, clone_dir: str = "cloned_repos"):
        """
        Initializes the GitRepositoryManager.

        Args:
            repo_source_path: The path to the local Git repository or its remote URL.
            clone_dir: The directory where remote repositories will be cloned.
        """
        self.repo_source_path = repo_source_path
        self.clone_dir = clone_dir
        self.local_repo_path: Optional[str] = None
        self.is_cloned = False
        os.makedirs(self.clone_dir, exist_ok=True)

    def acquire_repository(self) -> str:
        """
        Acquires the Git repository, either by using a local path or cloning a remote one.

        Returns:
            The absolute path to the local Git repository directory.

        Raises:
            ValueError: If the source is neither a local Git repository nor a remote URL,
                or if the resulting path is not a valid Git repository.
            git.GitCommandError: If a git command fails during cloning or pulling.
        """
        if os.path.isdir(self.repo_source_path) and \
                os.path.exists(os.path.join(self.repo_source_path, '.git')):
            # It's already a local Git repository
            self.local_repo_path = os.path.abspath(self.repo_source_path)
            print(f"Using local repository at: {self.local_repo_path}")
        elif self.repo_source_path.startswith(('http://', 'https://', 'git@')):
            # It's a remote URL, clone it
            repo_name = self.repo_source_path.rstrip('/').split('/')[-1]
            if repo_name.endswith('.git'):
                repo_name = repo_name[:-len('.git')]  # Strip only a trailing ".git"
            target_path = os.path.join(self.clone_dir, repo_name)

            if os.path.exists(target_path):
                print(f"Repository already cloned to {target_path}. Pulling latest changes...")
                repo = git.Repo(target_path)
                origin = repo.remotes.origin
                origin.pull()
            else:
                print(f"Cloning remote repository {self.repo_source_path} to {target_path}...")
                git.Repo.clone_from(self.repo_source_path, target_path)

            self.local_repo_path = os.path.abspath(target_path)
            self.is_cloned = True
            print(f"Repository successfully cloned/updated at: {self.local_repo_path}")
        else:
            raise ValueError(f"Invalid repository source: {self.repo_source_path}. "
                             "Must be a local path to a Git repo or a remote URL.")

        # Final check to ensure it's a valid Git repository
        try:
            _ = git.Repo(self.local_repo_path)
        except git.InvalidGitRepositoryError as e:
            raise ValueError(f"The path '{self.local_repo_path}' is not a valid Git repository.") from e

        return self.local_repo_path

    def cleanup(self) -> None:
        """
        Removes the cloned repository directory if it was cloned by this manager.
        """
        if self.is_cloned and self.local_repo_path and os.path.exists(self.local_repo_path):
            print(f"Cleaning up cloned repository: {self.local_repo_path}")
            shutil.rmtree(self.local_repo_path)
            self.local_repo_path = None
            self.is_cloned = False
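
Deriving the clone directory name from the source URL is the most fragile step in the manager, because URLs may end in a slash, carry a `.git` suffix, or use the scp-style `git@host:owner/repo.git` form. The following standalone sketch isolates that step so it can be tested on its own. `repo_name_from_source` is a hypothetical helper, and the URLs in the example are placeholders.

```python
def repo_name_from_source(source: str) -> str:
    """Derive a directory name from a Git URL or scp-style address."""
    cleaned = source.strip().rstrip("/")
    # scp-style addresses (git@host:owner/repo.git) separate the path with ':'
    tail = cleaned.replace(":", "/").split("/")[-1]
    if tail.endswith(".git"):
        tail = tail[: -len(".git")]  # Remove only a trailing ".git"
    if not tail:
        raise ValueError(f"Cannot derive a repository name from {source!r}")
    return tail


for url in ("https://github.com/user/project.git",
            "https://github.com/user/project/",
            "git@github.com:user/my.gitops.git"):
    print(url, "->", repo_name_from_source(url))
```

Removing only a trailing `.git` matters for names such as `my.gitops`, where a blanket string replacement would corrupt the name.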

Repository Traversal and Git Metadata Extraction

Once the repository is acquired, the `GitAnalyzer` module takes over to extract crucial metadata from the Git history. This includes information about contributors, commit patterns, branches, tags (representing releases), and general repository statistics. This data provides foundational context for the LLM's subsequent analysis.

# git_operations.py (continued)
from collections import defaultdict
from datetime import datetime


class GitAnalyzer:
    """
    Analyzes a local Git repository to extract metadata such as contributors,
    commit history, branches, and tags.
    """

    def __init__(self, repo_path: str):
        """
        Initializes the GitAnalyzer with the path to the local repository.

        Args:
            repo_path: The absolute path to the local Git repository.
        """
        try:
            self.repo = git.Repo(repo_path)
            self.repo_path = repo_path
        except git.InvalidGitRepositoryError as e:
            raise ValueError(f"'{repo_path}' is not a valid Git repository.") from e

    def get_contributors(self) -> dict:
        """
        Analyzes commit history to identify contributors and their commit counts.

        Returns:
            A dictionary where keys are contributor names (author name <email>)
            and values are their respective commit counts.
        """
        contributors = defaultdict(int)
        for commit in self.repo.iter_commits():
            author_info = f"{commit.author.name} <{commit.author.email}>"
            contributors[author_info] += 1
        return dict(contributors)

    def get_commit_summary(self, max_commits: int = 50) -> list[dict]:
        """
        Retrieves a summary of recent commits.

        Args:
            max_commits: The maximum number of commits to retrieve.

        Returns:
            A list of dictionaries, each representing a commit with its hash, author,
            date, and message.
        """
        commit_list = []
        for i, commit in enumerate(self.repo.iter_commits()):
            if i >= max_commits:
                break
            commit_list.append({
                "hash": commit.hexsha,
                "author": f"{commit.author.name} <{commit.author.email}>",
                "date": datetime.fromtimestamp(commit.committed_date).strftime('%Y-%m-%d %H:%M:%S'),
                "message": commit.message.strip()
            })
        return commit_list

    def get_branches(self) -> list[str]:
        """
        Lists all local and remote-tracking branches in the repository.

        Returns:
            A list of branch names (remote branches appear as e.g. 'origin/main').
        """
        local_branches = [head.name for head in self.repo.heads]
        # Iterate over each remote's refs; remote.name alone would only yield 'origin'
        remote_branches = [ref.name for remote in self.repo.remotes for ref in remote.refs]
        return local_branches + remote_branches

    def get_tags(self) -> list[str]:
        """
        Lists all tags (often representing releases) in the repository.

        Returns:
            A list of tag names.
        """
        return [tag.name for tag in self.repo.tags]

    def get_repo_structure(self) -> str:
        """
        Generates a simplified tree-like representation of the repository's file structure.
        Excludes typical Git-related directories and common build artifacts.

        Returns:
            A string representing the directory tree.
        """
        structure_lines = []
        ignore_patterns = ['.git', '__pycache__', 'venv', '.venv', 'node_modules',
                           'target', 'build', 'dist', '.idea', '.vscode']

        for root, dirs, files in os.walk(self.repo_path):
            # Filter out ignored directories
            dirs[:] = [d for d in dirs if d not in ignore_patterns]

            level = root.replace(self.repo_path, '').count(os.sep)
            indent = '    ' * level
            relative_path = os.path.relpath(root, self.repo_path)
            if relative_path == '.':  # Don't print '.' for the root itself
                structure_lines.append(f"{os.path.basename(self.repo_path)}/")
            else:
                structure_lines.append(f"{indent}|-- {os.path.basename(root)}/")

            subindent = '    ' * (level + 1)
            for f in files:
                structure_lines.append(f"{subindent}|-- {f}")
        return "\n".join(structure_lines)
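
In a long-lived repository the contributor dictionary returned by `get_contributors` can contain hundreds of entries, which would inflate the final prompt. One possible mitigation, sketched here under that assumption, is to keep only the most active authors before passing the data on. `top_contributors` is a hypothetical helper and the sample data is invented for illustration.

```python
from typing import Dict, List, Tuple


def top_contributors(contributors: Dict[str, int], limit: int = 3) -> List[Tuple[str, int, float]]:
    """Return (author, commits, share_percent) for the most active authors."""
    total = sum(contributors.values())
    if total == 0:
        return []
    # Sort by commit count (descending), then by name for a stable order
    ranked = sorted(contributors.items(), key=lambda item: (-item[1], item[0]))
    return [(author, count, round(100 * count / total, 1)) for author, count in ranked[:limit]]


# Invented sample data in the same shape that get_contributors() returns
sample = {
    "Ada <ada@example.com>": 60,
    "Linus <linus@example.com>": 30,
    "Grace <grace@example.com>": 10,
}
for author, count, share in top_contributors(sample, limit=2):
    print(f"{author}: {count} commits ({share}%)")
```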

File-Level Analysis and Summarization Strategy

This is the core module addressing the LLM context window limitation. Instead of feeding the entire repository to the LLM, the agent processes files individually. The `FileProcessor` reads file contents, and then the `LLMSummarizer` uses the LLM to generate a concise summary for each file. This approach ensures that the LLM receives manageable chunks of information.

The `LLMClient` acts as an abstraction layer for interacting with different LLM providers, making the system flexible.

# llm_interface.py
from abc import ABC, abstractmethod

from openai import OpenAI  # type: ignore

from config import LLMConfig


class LLMClient(ABC):
    """
    Abstract base class for LLM clients, defining the common interface.
    """

    @abstractmethod
    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        """
        Sends a prompt to the LLM and returns its completion.

        Args:
            prompt: The text prompt to send to the LLM.
            temperature: Controls the randomness of the output. Higher values mean more random.

        Returns:
            The generated text completion from the LLM.
        """
        pass


class OpenAILLMClient(LLMClient):
    """
    Concrete implementation of LLMClient for OpenAI API.
    """

    def __init__(self, config: LLMConfig):
        """
        Initializes the OpenAI LLM client.

        Args:
            config: An LLMConfig instance containing OpenAI-specific settings.
        """
        if config.llm_type != 'openai':
            raise ValueError("LLMConfig must be of type 'openai' for OpenAILLMClient.")
        if not config.api_key:
            raise ValueError("OpenAI API key is missing in configuration.")

        self.client = OpenAI(api_key=config.api_key)
        self.model_name = config.model_name
        print(f"Initialized OpenAI LLM Client with model: {self.model_name}")

    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        """
        Sends a prompt to the OpenAI API and returns its completion.
        """
        try:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
                temperature=temperature,
            )
            return response.choices[0].message.content or ""
        except Exception as e:
            # Errors are reported in-band as text rather than raised
            print(f"Error calling OpenAI API: {e}")
            return f"Error: Could not get completion from OpenAI API - {e}"


class LocalLLMClient(LLMClient):
    """
    Concrete implementation of LLMClient for local LLM servers (e.g., Ollama).
    Assumes a compatible OpenAI-like API endpoint.
    """

    def __init__(self, config: LLMConfig):
        """
        Initializes the Local LLM client.

        Args:
            config: An LLMConfig instance containing local LLM-specific settings.
        """
        if config.llm_type != 'local':
            raise ValueError("LLMConfig must be of type 'local' for LocalLLMClient.")
        if not config.base_url:
            raise ValueError("Base URL is missing for local LLM configuration.")

        self.client = OpenAI(base_url=config.base_url, api_key="ollama")  # API key is often dummy for local
        self.model_name = config.model_name
        print(f"Initialized Local LLM Client with model: {self.model_name} at {config.base_url}")

    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        """
        Sends a prompt to the local LLM server and returns its completion.
        """
        try:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
                temperature=temperature,
            )
            return response.choices[0].message.content or ""
        except Exception as e:
            # Errors are reported in-band as text rather than raised
            print(f"Error calling Local LLM API: {e}")
            return f"Error: Could not get completion from Local LLM API - {e}"
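
Because an analysis run can issue hundreds of requests, transient failures such as timeouts or rate limits are likely. The clients above catch every exception and return an error string, so a failed call ends up stored as if it were a summary. One alternative, sketched below under the assumption that a client variant lets exceptions propagate, is to wrap the call in a retry loop with exponential backoff. `complete_with_retry` and the `flaky` stand-in are hypothetical names used only for this illustration.

```python
import time
from typing import Callable


def complete_with_retry(get_completion: Callable[[str], str],
                        prompt: str,
                        max_attempts: int = 3,
                        base_delay: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Call get_completion, retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return get_completion(prompt)
        except Exception:
            if attempt == max_attempts:
                raise  # Give up and surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # Waits 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")


# Stand-in for an LLM call that fails twice before succeeding
calls = {"count": 0}


def flaky(prompt: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return f"summary of: {prompt}"


# The sleep function is replaced so the example runs instantly
result = complete_with_retry(flaky, "README.md", sleep=lambda seconds: None)
print(result)
```

Injecting the `sleep` function keeps the helper easy to test; production code would simply use the default.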

The `FileProcessor` is responsible for reading file content, while the `LLMSummarizer` orchestrates the prompt creation and interaction with the `LLMClient`.

# summarization.py
import mimetypes
import os
from typing import Any, Dict, Optional

from llm_interface import LLMClient


class FileProcessor:
    """
    Handles reading and processing of individual files within the repository.
    """

    def __init__(self, repo_root: str):
        """
        Initializes the FileProcessor.

        Args:
            repo_root: The root directory of the Git repository.
        """
        self.repo_root = repo_root

    def read_file_content(self, file_path: str) -> Optional[str]:
        """
        Reads the content of a specified file.
        Handles common encoding issues and skips binary files.

        Args:
            file_path: The absolute path to the file.

        Returns:
            The content of the file as a string, or None if it's a binary file
            or cannot be read.
        """
        if not os.path.exists(file_path) or not os.path.isfile(file_path):
            print(f"Warning: File not found or is not a file: {file_path}")
            return None

        # Heuristic to skip obviously binary files. Types such as application/json
        # are text, so only clearly non-text media types are rejected here; the
        # UTF-8 decode below catches the remaining binary formats.
        mime_type_guess, _ = mimetypes.guess_type(file_path)
        if mime_type_guess and mime_type_guess.startswith(('image/', 'audio/', 'video/', 'font/')):
            print(f"Skipping binary file: {file_path} (MIME type: {mime_type_guess})")
            return None

        # Attempt to read as text
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                return f.read()
        except UnicodeDecodeError:
            print(f"Skipping non-UTF-8 or binary file: {file_path}")
            return None
        except Exception as e:
            print(f"Error reading file {file_path}: {e}")
            return None


class LLMSummarizer:
    """
    Uses an LLM to generate summaries for file contents and aggregated information.
    """

    def __init__(self, llm_client: LLMClient):
        """
        Initializes the LLMSummarizer with an LLM client.

        Args:
            llm_client: An instance of a concrete LLMClient implementation.
        """
        self.llm_client = llm_client

    def summarize_file(self, file_path: str, file_content: str) -> str:
        """
        Generates a concise summary for a single file's content.

        Args:
            file_path: The relative path of the file being summarized.
            file_content: The full content of the file.

        Returns:
            A summary string generated by the LLM.
        """
        prompt = (
            f"You are an expert software engineer tasked with summarizing code and configuration files. "
            f"Provide a concise summary of the purpose, key functionalities, and important configurations "
            f"or dependencies found in the following file. Focus on what this file *does* and its role "
            f"within a larger project. Keep the summary under 150 words.\n\n"
            f"File: {file_path}\n"
            f"Content:\n```\n{file_content}\n```\n\n"
            f"Concise Summary:"
        )
        return self.llm_client.get_completion(prompt)

    def summarize_directory(self, directory_path: str, file_summaries: Dict[str, str]) -> str:
        """
        Generates a summary for a directory based on the summaries of its contained files.

        Args:
            directory_path: The relative path of the directory.
            file_summaries: A dictionary mapping file paths to their summaries within this directory.

        Returns:
            A summary string for the directory.
        """
        if not file_summaries:
            return f"Directory '{directory_path}' contains no relevant files or summaries."

        summaries_text = "\n".join([f"- {path}: {summary}" for path, summary in file_summaries.items()])
        prompt = (
            f"You are an expert software architect analyzing a project structure. "
            f"Based on the following file summaries, provide a concise overview of the purpose "
            f"and primary functionalities of the directory '{directory_path}'. "
            f"Identify any common themes, dependencies, or architectural patterns. "
            f"Keep the summary under 200 words.\n\n"
            f"Directory: {directory_path}\n"
            f"File Summaries:\n{summaries_text}\n\n"
            f"Concise Directory Summary:"
        )
        return self.llm_client.get_completion(prompt)

    def summarize_repository(self,
                             repo_name: str,
                             repo_structure: str,
                             directory_summaries: Dict[str, str],
                             git_metadata: Dict[str, Any]) -> str:
        """
        Generates a comprehensive summary of the entire repository.

        Args:
            repo_name: The name of the repository.
            repo_structure: A string representation of the repository's file structure.
            directory_summaries: A dictionary mapping directory paths to their summaries.
            git_metadata: A dictionary containing aggregated Git metadata (contributors, commits, etc.).

        Returns:
            A comprehensive summary string for the entire repository.
        """
        dir_summaries_text = "\n".join([f"- {path}: {summary}" for path, summary in directory_summaries.items()])
        contributors_text = "\n".join([f" - {author} ({count} commits)" for author, count in git_metadata.get('contributors', {}).items()])
        recent_commits_text = "\n".join([f" - {c['date']} by {c['author']}: {c['message']}" for c in git_metadata.get('recent_commits', [])[:5]])
        branches_text = ", ".join(git_metadata.get('branches', []))
        tags_text = ", ".join(git_metadata.get('tags', []))

        prompt = (
            f"You are a highly intelligent AI assistant specializing in software project analysis. "
            f"Your task is to provide a comprehensive and detailed summary of the Git repository named '{repo_name}'. "
            f"Synthesize information from the repository's structure, directory-level summaries, and Git metadata. "
            f"Cover the following aspects:\n"
            f"1. **Overall Purpose and Key Functionalities:** What is the project about? What problems does it solve?\n"
            f"2. **Architectural Overview/Structure:** Describe the main components and how they are organized.\n"
            f"3. **Core Technologies/Dependencies:** Identify programming languages, frameworks, and key libraries.\n"
            f"4. **Development Environment/Setup:** How would one set up and run this project? (e.g., Docker, `requirements.txt`)\n"
            f"5. **Key Contributors and Activity:** Who are the main developers and what is the recent activity?\n"
            f"6. **Release Strategy/Versioning:** How are releases managed (tags, branches)?\n"
            f"7. **Known Issues/Limitations:** Any explicit mentions of problems or areas for improvement (from README/comments).\n"
            f"8. **Evolution/Changes:** High-level overview of recent significant changes.\n\n"
            f"Repository Name: {repo_name}\n"
            f"Repository Structure:\n{repo_structure}\n\n"
            f"Directory Summaries:\n{dir_summaries_text}\n\n"
            f"Git Metadata:\n"
            f" Contributors:\n{contributors_text}\n"
            f" Recent Commits:\n{recent_commits_text}\n"
            f" Branches: {branches_text}\n"
            f" Tags (Releases): {tags_text}\n\n"
            f"Comprehensive Repository Summary:"
        )
        return self.llm_client.get_completion(prompt, temperature=0.2)  # Lower temperature for factual summary
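
The architecture diagram mentions chunking, but `summarize_file` as listed sends the whole file in one prompt, so a single very large file could still exceed the context window. A minimal sketch of one way to split file content on line boundaries follows. `chunk_text` is a hypothetical helper, and the 12,000-character default is an arbitrary example budget, not a recommended value.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 12_000) -> List[str]:
    """Split text into chunks of at most max_chars, breaking on line boundaries."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # Hard-split any single line that is longer than the budget
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the budget
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


source = "".join(f"line {i}\n" for i in range(1_000))
parts = chunk_text(source, max_chars=2_000)
print(len(parts), "chunks; largest is", max(len(p) for p in parts), "characters")
```

Each chunk could then be summarized separately, for example by labelling it as part i of n in the file path passed to `summarize_file`, and the partial summaries merged with one further LLM call.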

Progressive Summarization and Memory

This module is crucial for managing the context window. It stores file-level summaries and then aggregates them into directory-level summaries, and finally into an overall repository summary. This hierarchical summarization ensures that the LLM never receives an overwhelming amount of raw data at once, but rather progressively distilled information. The `SummaryAggregator` orchestrates this process, storing intermediate results.

# summarization.py (continued)
import json


class SummaryAggregator:
    """
    Manages the storage and aggregation of file and directory summaries.
    """

    def __init__(self, output_dir: str):
        """
        Initializes the SummaryAggregator.

        Args:
            output_dir: The directory where summaries will be saved.
        """
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        self.file_summaries: Dict[str, str] = {}
        self.directory_summaries: Dict[str, str] = {}
        self.repo_summary: Optional[str] = None
        self.git_metadata: Dict[str, Any] = {}

    def add_file_summary(self, relative_path: str, summary: str) -> None:
        """
        Adds a summary for a specific file.

        Args:
            relative_path: The path of the file relative to the repository root.
            summary: The LLM-generated summary for the file.
        """
        self.file_summaries[relative_path] = summary
        self._save_summary(f"file_summary_{relative_path.replace(os.sep, '_').replace('.', '_')}.txt", summary)

    def add_directory_summary(self, relative_path: str, summary: str) -> None:
        """
        Adds a summary for a specific directory.

        Args:
            relative_path: The path of the directory relative to the repository root.
            summary: The LLM-generated summary for the directory.
        """
        self.directory_summaries[relative_path] = summary
        self._save_summary(f"dir_summary_{relative_path.replace(os.sep, '_')}.txt", summary)

    def set_repo_summary(self, summary: str) -> None:
        """
        Sets the final comprehensive repository summary.

        Args:
            summary: The LLM-generated summary for the entire repository.
        """
        self.repo_summary = summary
        self._save_summary("repository_summary.txt", summary)

    def set_git_metadata(self, metadata: Dict[str, Any]) -> None:
        """
        Stores the extracted Git metadata.

        Args:
            metadata: A dictionary containing Git metadata.
        """
        self.git_metadata = metadata
        self._save_summary("git_metadata.json", json.dumps(metadata, indent=2))

    def get_file_summaries_for_directory(self, relative_dir_path: str) -> Dict[str, str]:
        """
        Retrieves file summaries belonging to a specific directory.

        Args:
            relative_dir_path: The relative path of the directory.

        Returns:
            A dictionary of file paths to summaries within that directory.
        """
        if relative_dir_path == ".":  # Root directory: top-level files except README.md
            return {p: s for p, s in self.file_summaries.items() if os.sep not in p and p != "README.md"}

        if relative_dir_path == "":  # Special case for root: all top-level files, README.md included
            return {p: s for p, s in self.file_summaries.items() if not os.path.dirname(p)}

        # For subdirectories, keep only files located directly in that directory
        prefix = relative_dir_path + os.sep
        return {p: s for p, s in self.file_summaries.items()
                if p.startswith(prefix) and os.path.dirname(p) == relative_dir_path}

    def _save_summary(self, filename: str, content: str) -> None:
        """
        Helper method to save a summary to a file.
        """
        file_path = os.path.join(self.output_dir, filename)
        try:
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(content)
            print(f"Saved summary to {file_path}")
        except Exception as e:
            print(f"Error saving summary to {file_path}: {e}")
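
The hierarchical aggregation described above depends on two small steps: grouping file summaries by their parent directory and visiting directories in a sensible order. The sketch below illustrates both in isolation. It assumes forward-slash relative paths, whereas the aggregator itself uses `os.sep`; `group_by_directory` and `deepest_first` are hypothetical helper names.

```python
import posixpath
from collections import defaultdict
from typing import Dict, List


def group_by_directory(file_summaries: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group file summaries by their parent directory ('.' for the root)."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for path, summary in file_summaries.items():
        directory = posixpath.dirname(path) or "."
        grouped[directory][path] = summary
    return dict(grouped)


def deepest_first(directories: List[str]) -> List[str]:
    """Order directories so nested ones are summarized before their parents."""
    def depth(directory: str) -> int:
        return 0 if directory == "." else directory.count("/") + 1
    return sorted(directories, key=lambda d: (-depth(d), d))


summaries = {
    "README.md": "Project overview.",
    "src/app.py": "Application entry point.",
    "src/utils/io.py": "File helpers.",
}
grouped = group_by_directory(summaries)
for directory in deepest_first(list(grouped)):
    print(directory, "->", sorted(grouped[directory]))
```

Processing the deepest directories first means a parent's summary could, if desired, also draw on the summaries of its subdirectories.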

Output Generation

The final stage involves compiling all the gathered and summarized information into a coherent, human-readable report. This report should present the repository's structure, purpose, key features, development environment, contributors, and any identified issues or release information in an organized manner. The `GitAnalysisAgent` itself will handle the final report generation by orchestrating the collection of all summaries.
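
As an illustration of this stage, the following sketch assembles a report from summaries that have already been generated. The Markdown layout, the section titles, and the `render_report` function are assumptions made for this example; the agent described in this article stores its summaries as separate files.

```python
from typing import Dict


def render_report(repo_name: str,
                  repo_summary: str,
                  directory_summaries: Dict[str, str]) -> str:
    """Assemble a plain Markdown report from already-generated summaries."""
    lines = [f"# Analysis Report: {repo_name}", "", "## Overview", "", repo_summary.strip(), ""]
    if directory_summaries:
        lines += ["## Directories", ""]
        # Sort paths so the report is stable between runs
        for path in sorted(directory_summaries):
            lines += [f"### {path}", "", directory_summaries[path].strip(), ""]
    return "\n".join(lines).rstrip() + "\n"


report = render_report(
    "demo",
    "A demo project.",
    {"src": "Source code.", "docs": "Documentation."},
)
print(report)
```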

The Git Analysis Agent (Orchestrator)

The `GitAnalysisAgent` class serves as the main orchestrator, tying all the modules together. It manages the entire workflow, from repository acquisition to final report generation, ensuring that each step is executed logically and efficiently.

# agent.py

import os

from typing import Any, Dict

from config import AgentConfig, LLMConfig

from git_operations import GitRepositoryManager, GitAnalyzer

from llm_interface import LLMClient, OpenAILLMClient, LocalLLMClient

from summarization import FileProcessor, LLMSummarizer, SummaryAggregator

class GitAnalysisAgent:

"""

The main orchestrator for the LLM-based Git analysis agent.

Coordinates repository acquisition, Git metadata extraction, file processing,

LLM summarization, and report generation.

"""

def __init__(self, config: AgentConfig):

"""

Initializes the GitAnalysisAgent with the provided configuration.

Args:

config: An instance of AgentConfig containing all necessary settings.

"""

self.config = config

self.repo_manager = GitRepositoryManager(config.repo_path, config.output_dir)

self.llm_client: LLMClient

if config.llm_config.llm_type == 'openai':

self.llm_client = OpenAILLMClient(config.llm_config)

elif config.llm_config.llm_type == 'local':

self.llm_client = LocalLLMClient(config.llm_config)

else:

raise ValueError(f"Unsupported LLM type: {config.llm_config.llm_type}")

self.llm_summarizer = LLMSummarizer(self.llm_client)

self.summary_aggregator = SummaryAggregator(config.output_dir)

self.local_repo_path: Optional[str] = None

self.git_analyzer: Optional[GitAnalyzer] = None

self.file_processor: Optional[FileProcessor] = None

def analyze_repository(self) -> str:

"""

Executes the full repository analysis workflow.

Returns:

The final comprehensive repository summary as a string.

"""

print("\n--- Starting Repository Analysis ---")

try:

# 1. Acquire Repository

self.local_repo_path = self.repo_manager.acquire_repository()

self.git_analyzer = GitAnalyzer(self.local_repo_path)

self.file_processor = FileProcessor(self.local_repo_path)

# 2. Extract Git Metadata

print("\n--- Extracting Git Metadata ---")

git_metadata = self._extract_git_metadata()

self.summary_aggregator.set_git_metadata(git_metadata)

# 3. Analyze and Summarize Files

print("\n--- Analyzing and Summarizing Files ---")

self._analyze_and_summarize_files()

# 4. Summarize Directories

print("\n--- Summarizing Directories ---")

self._summarize_directories()

# 5. Generate Final Repository Summary

print("\n--- Generating Final Repository Summary ---")

repo_name = os.path.basename(self.local_repo_path)

repo_structure = self.git_analyzer.get_repo_structure() if self.git_analyzer else "Could not generate structure."

final_repo_summary = self.llm_summarizer.summarize_repository(

repo_name=repo_name,

repo_structure=repo_structure,

directory_summaries=self.summary_aggregator.directory_summaries,

git_metadata=git_metadata

)

self.summary_aggregator.set_repo_summary(final_repo_summary)

print("\n--- Repository Analysis Complete ---")

return final_repo_summary

except Exception as e:

print(f"An error occurred during analysis: {e}")

return f"Analysis failed due to an error: {e}"

finally:

self.repo_manager.cleanup() # Ensure cloned repos are removed

def _extract_git_metadata(self) -> Dict[str, Any]:

"""Helper to extract and return Git metadata."""

if not self.git_analyzer:

raise RuntimeError("GitAnalyzer not initialized.")

metadata = {

"contributors": self.git_analyzer.get_contributors(),

"recent_commits": self.git_analyzer.get_commit_summary(max_commits=10),

"branches": self.git_analyzer.get_branches(),

"tags": self.git_analyzer.get_tags(),

"repo_structure_preview": self.git_analyzer.get_repo_structure() # Store a preview for context

}

print("Git metadata extracted.")

return metadata

def _analyze_and_summarize_files(self) -> None:

"""

Traverses the repository, reads files, and generates LLM summaries for each.

"""

if not self.local_repo_path or not self.file_processor:

raise RuntimeError("Repository path or file processor not initialized.")

# Walk through the repository, excluding common ignored directories

ignore_dirs = ['.git', '__pycache__', 'venv', '.venv', 'node_modules',

'target', 'build', 'dist', '.idea', '.vscode']

# Process key documentation and build files first, as they often describe the project's purpose

priority_files = ['README.md', 'Dockerfile', 'requirements.txt', 'package.json', 'pom.xml']

processed_files = set()

# Process priority files first if they exist at the root

for p_file in priority_files:

abs_path = os.path.join(self.local_repo_path, p_file)

if os.path.exists(abs_path) and os.path.isfile(abs_path):

relative_path = os.path.relpath(abs_path, self.local_repo_path)

print(f"Processing priority file: {relative_path}")

content = self.file_processor.read_file_content(abs_path)

if content:

summary = self.llm_summarizer.summarize_file(relative_path, content)

self.summary_aggregator.add_file_summary(relative_path, summary)

processed_files.add(relative_path)

for root, dirs, files in os.walk(self.local_repo_path):

# Modify dirs in-place to prune traversal

dirs[:] = [d for d in dirs if d not in ignore_dirs]

for file_name in files:

abs_file_path = os.path.join(root, file_name)

relative_file_path = os.path.relpath(abs_file_path, self.local_repo_path)

if relative_file_path in processed_files:

continue # Skip files already processed as priority

# Skip common non-source files or very large files

if any(relative_file_path.endswith(ext) for ext in ['.png', '.jpg', '.jpeg', '.gif', '.bin', '.zip', '.tar.gz', '.log']) or \

os.path.getsize(abs_file_path) > 1024 * 1024: # e.g., 1MB limit for text files

print(f"Skipping large or non-text file: {relative_file_path}")

continue

print(f"Processing file: {relative_file_path}")

content = self.file_processor.read_file_content(abs_file_path)

if content:

summary = self.llm_summarizer.summarize_file(relative_file_path, content)

self.summary_aggregator.add_file_summary(relative_file_path, summary)

processed_files.add(relative_file_path)

def _summarize_directories(self) -> None:

"""

Generates summaries for directories based on their contained file summaries.

Processes directories from deepest to shallowest so that each directory's summary can build on the summaries of its sub-directories.

"""

if not self.local_repo_path:

raise RuntimeError("Repository path not initialized.")

# Get all unique directory paths that have files summarized

all_file_paths = self.summary_aggregator.file_summaries.keys()

all_dirs = set()

for f_path in all_file_paths:

current_dir = os.path.dirname(f_path)

while current_dir and current_dir != '.':

all_dirs.add(current_dir)

current_dir = os.path.dirname(current_dir)

# Ensure root directory is included if there are any files

if all_file_paths:

all_dirs.add(".") # Represents the root directory

# Sort directories by depth (deepest first) to summarize from bottom-up.

# The root ('.') gets depth -1 so that it is always summarized last.

sorted_dirs = sorted(all_dirs, key=lambda d: -1 if d == '.' else d.count(os.sep), reverse=True)

for dir_path in sorted_dirs:

print(f"Summarizing directory: {dir_path if dir_path != '.' else 'root'}")

file_summaries_in_dir = self.summary_aggregator.get_file_summaries_for_directory(dir_path)

# Include sub-directory summaries in the current directory's context

# This is key for progressive summarization

sub_dir_summaries_for_context = {}

for existing_dir, existing_summary in self.summary_aggregator.directory_summaries.items():

# Every previously summarized directory is a descendant of the root ('.')

if dir_path == '.' or existing_dir.startswith(dir_path + os.sep):

sub_dir_summaries_for_context[existing_dir] = existing_summary

combined_context = {**file_summaries_in_dir, **sub_dir_summaries_for_context}

if combined_context:

dir_summary = self.llm_summarizer.summarize_directory(dir_path, combined_context)

self.summary_aggregator.add_directory_summary(dir_path, dir_summary)

else:

print(f"No relevant file or sub-directory summaries found for {dir_path}. Skipping directory summary.")
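
The bottom-up ordering used by `_summarize_directories` can be illustrated in isolation. The following is a minimal sketch, not the agent's actual code: `collect_dirs`, `depth`, `bottom_up_order`, and `descendants` are hypothetical helper names, and paths are assumed to be POSIX-style relative paths. It treats the root (`.`) as shallower than every other directory, so the root is always summarized last and can draw on all previously summarized sub-directories.

```python
import posixpath
from typing import Iterable, List, Set


def collect_dirs(file_paths: Iterable[str]) -> Set[str]:
    """Return every directory that contains a file, plus its ancestors and the root '.'."""
    dirs: Set[str] = set()
    for path in file_paths:
        current = posixpath.dirname(path)
        while current and current != ".":
            dirs.add(current)
            current = posixpath.dirname(current)
        dirs.add(".")  # any file implies that the root exists
    return dirs


def depth(dir_path: str) -> int:
    """Depth of a relative directory; the root '.' is defined as depth 0."""
    return 0 if dir_path == "." else dir_path.count("/") + 1


def bottom_up_order(dirs: Iterable[str]) -> List[str]:
    """Deepest directories first, root last; ties are broken alphabetically."""
    return sorted(dirs, key=lambda d: (-depth(d), d))


def descendants(parent: str, summarized: Iterable[str]) -> List[str]:
    """Already-summarized directories that live below `parent`."""
    if parent == ".":
        return [d for d in summarized if d != "."]
    return [d for d in summarized if d.startswith(parent + "/")]


files = ["README.md", "src/main.py", "src/core/engine.py", "docs/index.md"]
order = bottom_up_order(collect_dirs(files))
print(order)  # ['src/core', 'docs', 'src', '.']
```

Sorting on an explicit depth function, rather than on a raw separator count, avoids the tie between the root and top-level directories such as `src`, which would otherwise both count zero separators.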

Running Example and Usage

To demonstrate the agent's capabilities, we will use a small, self-contained Python project. This project includes a `README.md`, `requirements.txt`, `Dockerfile`, and `.gitignore`, plus a `src` package containing `main.py` and `utils.py`.

First, let us define the structure and content of our example repository. You would typically create these files in a directory, initialize a Git repository, and make a few commits.

my_simple_project/
├── .gitignore
├── Dockerfile
├── README.md
├── requirements.txt
└── src/
    ├── __init__.py
    ├── main.py
    └── utils.py

Content for `my_simple_project` files:

`README.md`:

# My Simple Project

This is a basic Python project demonstrating a simple utility.

It includes a main script and a utility module.

## Features

- Greets a user.

- Performs a simple arithmetic operation.

## Setup

1. Clone the repository.

2. Install dependencies: `pip install -r requirements.txt`

3. Run: `python src/main.py`

## Known Issues

- The arithmetic operation currently only supports integers.

`requirements.txt`:

# No external dependencies for this simple example

# But in a real project, this would list packages like:

# requests==2.28.1

# numpy==1.23.5

`Dockerfile`:

# Use an official Python runtime as a parent image

FROM python:3.9-slim-buster

# Set the working directory in the container

WORKDIR /app

# Copy the current directory contents into the container at /app

COPY . /app

# Install any needed packages specified in requirements.txt

RUN pip install --no-cache-dir -r requirements.txt

# Make port 80 available to the world outside this container

# EXPOSE 80

# Run main.py when the container launches

CMD ["python", "src/main.py"]

`src/__init__.py`: (This file can be empty; its purpose is to mark `src` as a Python package.)

`src/main.py`:

# src/main.py

from src.utils import add_numbers, greet

def run_application():

"""

Main function to run the simple application logic.

"""

print("Starting My Simple Project application...")

name = "Alice"

greet(name)

num1 = 10

num2 = 5

result = add_numbers(num1, num2)

print(f"The sum of {num1} and {num2} is: {result}")

print("Application finished.")

if __name__ == "__main__":

run_application()

`src/utils.py`:

# src/utils.py

def greet(name: str) -> None:

"""

Prints a greeting message to the console.

Args:

name: The name of the person to greet.

"""

print(f"Hello, {name}! Welcome to the utility module.")

def add_numbers(a: int, b: int) -> int:

"""

Adds two integer numbers and returns their sum.

Args:

a: The first integer.

b: The second integer.

Returns:

The sum of a and b.

"""

return a + b

`.gitignore`:

# Byte-compiled / optimized / DLL files

__pycache__/

*.pyc

*.pyd

*.pyo

# Virtual environment

venv/

.venv/

# Editor backup files

To run the analysis, you would typically have a `main.py` script that initializes the agent with the desired configuration. Make sure the `GitPython` and `openai` packages are installed (`pip install GitPython openai`). For a local LLM, you also need a running Ollama server with the chosen model already pulled.

# main.py

import os

from config import AgentConfig, LLMConfig

from agent import GitAnalysisAgent

def setup_example_repo(repo_name: str = "my_simple_project") -> str:

"""

Creates a dummy Git repository for demonstration purposes.

"""

repo_path = os.path.join(os.getcwd(), repo_name)

if os.path.exists(repo_path):

import shutil

shutil.rmtree(repo_path) # Clean up previous run

os.makedirs(repo_path, exist_ok=True)

# Create files

with open(os.path.join(repo_path, "README.md"), "w") as f:

f.write("# My Simple Project\n\nThis is a basic Python project demonstrating a simple utility.\nIt includes a main script and a utility module.\n\n## Features\n- Greets a user.\n- Performs a simple arithmetic operation.\n\n## Setup\n1. Clone the repository.\n2. Install dependencies: `pip install -r requirements.txt`\n3. Run: `python src/main.py`\n\n## Known Issues\n- The arithmetic operation currently only supports integers.\n")

with open(os.path.join(repo_path, "requirements.txt"), "w") as f:

f.write("# No external dependencies for this simple example\n")

with open(os.path.join(repo_path, "Dockerfile"), "w") as f:

f.write("FROM python:3.9-slim-buster\nWORKDIR /app\nCOPY . /app\nRUN pip install --no-cache-dir -r requirements.txt\nCMD [\"python\", \"src/main.py\"]\n")

with open(os.path.join(repo_path, ".gitignore"), "w") as f:

f.write("__pycache__/\n*.pyc\nvenv/\n")

src_dir = os.path.join(repo_path, "src")

os.makedirs(src_dir, exist_ok=True)

with open(os.path.join(src_dir, "__init__.py"), "w") as f:

f.write("")

with open(os.path.join(src_dir, "main.py"), "w") as f:

f.write("from src.utils import add_numbers, greet\n\ndef run_application():\n print(\"Starting My Simple Project application...\")\n name = \"Alice\"\n greet(name)\n num1 = 10\n num2 = 5\n result = add_numbers(num1, num2)\n print(f\"The sum of {num1} and {num2} is: {result}\")\n print(\"Application finished.\")\n\nif __name__ == \"__main__\":\n run_application()\n")

with open(os.path.join(src_dir, "utils.py"), "w") as f:

f.write("def greet(name: str) -> None:\n print(f\"Hello, {name}! Welcome to the utility module.\")\n\ndef add_numbers(a: int, b: int) -> int:\n return a + b\n")

# Initialize Git repository and make an initial commit

import git # type: ignore

repo = git.Repo.init(repo_path)

# Stage all files through the git CLI, which honors .gitignore and never stages .git/ itself

repo.git.add(A=True)

repo.index.commit("Initial commit: Set up basic project structure and files")

# Simulate another commit

with open(os.path.join(src_dir, "main.py"), "a") as f:

f.write("\n# Added a comment to simulate a change\n")

repo.index.add([os.path.join(src_dir, "main.py")])

repo.index.commit("Feature: Added a comment to main.py")

print(f"Example repository '{repo_name}' created and initialized at {repo_path}")

return repo_path

def main():

"""

Main function to configure and run the Git analysis agent.

"""

# --- IMPORTANT: Configure your LLM here ---

# For OpenAI: Ensure OPENAI_API_KEY environment variable is set

# llm_config = LLMConfig(llm_type='openai', model_name='gpt-4o-mini')

# For Local LLM (e.g., Ollama running 'llama3' model at default port)

# Make sure the Ollama server is running and the 'llama3' model has been pulled:

# ollama pull llama3

llm_config = LLMConfig(llm_type='local', model_name='llama3', base_url='http://localhost:11434/v1')

# --- Setup example local repository ---

local_repo_path = setup_example_repo("my_simple_project_to_analyze")

# Alternatively, use a remote repository:

# remote_repo_url = "https://github.com/git/git.git" # Example remote repo (will be cloned)

# agent_config = AgentConfig(repo_path=remote_repo_url, llm_config=llm_config)

agent_config = AgentConfig(repo_path=local_repo_path, llm_config=llm_config)

agent = GitAnalysisAgent(agent_config)

final_summary = agent.analyze_repository()

print("\n==============================================================================")

print("FINAL REPOSITORY ANALYSIS REPORT")

print("==============================================================================")

print(final_summary)

print("==============================================================================")

print(f"Detailed summaries are saved in: {agent_config.output_dir}")

if __name__ == "__main__":

main()

When `main.py` is executed, it first sets up the example Git repository locally. Then, it initializes the `AgentConfig` with the path to this local repository and the chosen LLM configuration. The `GitAnalysisAgent` is instantiated and its `analyze_repository` method is called. This method orchestrates the entire process: cloning (if remote), extracting Git metadata, iterating through files to generate individual summaries, aggregating these into directory summaries, and finally synthesizing all this information into a comprehensive repository-level summary using the LLM. All intermediate and final summaries are saved to the `analysis_results` directory.
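Before running `main.py`, the prerequisites can be verified programmatically. The sketch below is an assumption about how such a check might look and is not part of the agent: `missing_packages` and `port_is_open` are hypothetical helpers. It relies only on the standard library, checks that the `git` and `openai` modules are importable, and tests whether anything is listening on the host and port of the local LLM endpoint. A successful TCP connection does not prove that the server is Ollama or that the model has been pulled.

```python
import importlib.util
import socket
from typing import Iterable, List
from urllib.parse import urlparse


def missing_packages(module_names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be found on the import path."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def port_is_open(base_url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(base_url)
    host = parsed.hostname or "localhost"
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    missing = missing_packages(["git", "openai"])  # GitPython is imported as "git"
    if missing:
        print(f"Missing modules: {', '.join(missing)}")
    if not port_is_open("http://localhost:11434/v1"):
        print("Nothing is listening on localhost:11434; is the local LLM server running?")
```

Running such a check first turns a failure that would otherwise surface deep inside the analysis into an immediate, readable message.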

The agent offers a quick way to build a structured understanding of an unfamiliar Git repository, reducing the manual effort needed to read through a complex codebase and its development history.

ADDENDUM: Full Running Example Code

To make the running example fully self-contained and executable, here are all the Python files that constitute the agent and the `main.py` script to run it.

1. `config.py`

# config.py

import os

from typing import Optional

class LLMConfig:

"""

Encapsulates configuration settings for the Large Language Model.

Supports both remote API-based LLMs and local server-based LLMs.

"""

def __init__(self,

llm_type: str, # 'openai', 'local'

api_key: Optional[str] = None,

model_name: str = "gpt-4o-mini",

base_url: Optional[str] = None):

"""

Initializes the LLM configuration.

Args:

llm_type: Specifies the type of LLM ('openai' for remote API, 'local' for a local server).

api_key: The API key for remote LLM services (e.g., OpenAI API key).

This should ideally be loaded from environment variables for security.

model_name: The specific model identifier to use (e.g., "gpt-4o-mini", "llama3").

base_url: The base URL for local LLM servers (e.g., "http://localhost:11434/v1").

"""

if llm_type not in ['openai', 'local']:

raise ValueError("llm_type must be 'openai' or 'local'")

self.llm_type = llm_type

self.api_key = api_key if api_key else os.getenv("OPENAI_API_KEY")

self.model_name = model_name

self.base_url = base_url

if self.llm_type == 'openai' and not self.api_key:

raise ValueError("OPENAI_API_KEY environment variable or api_key must be set for OpenAI LLM type.")

if self.llm_type == 'local' and not self.base_url:

raise ValueError("base_url must be set for local LLM type.")

def __repr__(self) -> str:

"""Provides a string representation of the LLMConfig object."""

return (f"LLMConfig(llm_type='{self.llm_type}', model_name='{self.model_name}', "

f"base_url='{self.base_url if self.base_url else 'N/A'}')")

class AgentConfig:

"""

Main configuration class for the Git analysis agent.

Holds repository path and LLM configuration.

"""

def __init__(self,

repo_path: str,

llm_config: LLMConfig,

output_dir: str = "analysis_results"):

"""

Initializes the agent configuration.

Args:

repo_path: The path to the local Git repository or its remote URL.

llm_config: An instance of LLMConfig containing LLM-specific settings.

output_dir: The directory where analysis results and summaries will be stored.

"""

self.repo_path = repo_path

self.llm_config = llm_config

self.output_dir = output_dir

# Ensure output directory exists

os.makedirs(self.output_dir, exist_ok=True)

def __repr__(self) -> str:

"""Provides a string representation of the AgentConfig object."""

return (f"AgentConfig(repo_path='{self.repo_path}', llm_config={self.llm_config}, "

f"output_dir='{self.output_dir}')")
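
Because `LLMConfig` takes plain keyword arguments, the choice between the OpenAI and local back ends can be driven by environment variables. The following is a minimal sketch under stated assumptions: the variable names `GIT_AGENT_LLM_TYPE`, `GIT_AGENT_MODEL`, and `GIT_AGENT_BASE_URL` and the helper `llm_kwargs_from_env` are hypothetical and not part of the agent. The helper returns a dictionary so that the block runs on its own; in the project the result would be passed on as `LLMConfig(**kwargs)`.

```python
import os
from typing import Dict, Mapping, Optional


def llm_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Build keyword arguments for LLMConfig from (hypothetical) environment variables."""
    env = os.environ if env is None else env
    llm_type = env.get("GIT_AGENT_LLM_TYPE", "local")
    if llm_type not in ("openai", "local"):
        raise ValueError("GIT_AGENT_LLM_TYPE must be 'openai' or 'local'")

    if llm_type == "openai":
        # LLMConfig itself falls back to OPENAI_API_KEY when api_key is None.
        return {
            "llm_type": "openai",
            "model_name": env.get("GIT_AGENT_MODEL", "gpt-4o-mini"),
            "base_url": None,
        }
    return {
        "llm_type": "local",
        "model_name": env.get("GIT_AGENT_MODEL", "llama3"),
        "base_url": env.get("GIT_AGENT_BASE_URL", "http://localhost:11434/v1"),
    }


kwargs = llm_kwargs_from_env({"GIT_AGENT_LLM_TYPE": "local"})
print(kwargs)
```

Passing the environment in as a mapping keeps the helper easy to test, since no real environment variables need to be set.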

2. `git_operations.py`

# git_operations.py

import os

import shutil

import git # type: ignore # gitpython library

from typing import Optional, Any, Dict

from collections import defaultdict

from datetime import datetime

class GitRepositoryManager:

"""

Manages the acquisition and cleanup of Git repositories.

Handles cloning remote repositories and validating local paths.

"""

def __init__(self, repo_source_path: str, clone_dir: str = "cloned_repos"):

"""

Initializes the GitRepositoryManager.

Args:

repo_source_path: The path to the local Git repository or its remote URL.

clone_dir: The directory where remote repositories will be cloned.

"""

self.repo_source_path = repo_source_path

self.clone_dir = clone_dir

self.local_repo_path: Optional[str] = None

self.is_cloned = False

os.makedirs(self.clone_dir, exist_ok=True)

def acquire_repository(self) -> str:

"""

Acquires the Git repository, either by using a local path or cloning a remote one.

Returns:

The absolute path to the local Git repository directory.

Raises:

ValueError: If the source is neither a local Git repository nor a remote URL,

or if the resulting path is not a valid Git repository.

git.GitCommandError: If a git command fails while cloning or pulling.

"""

if os.path.isdir(self.repo_source_path) and \

os.path.exists(os.path.join(self.repo_source_path, '.git')):

# It's already a local Git repository

self.local_repo_path = os.path.abspath(self.repo_source_path)

print(f"Using local repository at: {self.local_repo_path}")

elif self.repo_source_path.startswith(('http://', 'https://', 'git@')):

# It's a remote URL, clone it

repo_name = self.repo_source_path.rstrip('/').split('/')[-1].removesuffix('.git') # strip only a trailing '.git' (Python 3.9+)

target_path = os.path.join(self.clone_dir, repo_name)

if os.path.exists(target_path):

print(f"Repository already cloned to {target_path}. Pulling latest changes...")

repo = git.Repo(target_path)

origin = repo.remotes.origin

origin.pull()

else:

print(f"Cloning remote repository {self.repo_source_path} to {target_path}...")

git.Repo.clone_from(self.repo_source_path, target_path)

self.local_repo_path = os.path.abspath(target_path)

self.is_cloned = True

print(f"Repository successfully cloned/updated at: {self.local_repo_path}")

else:

raise ValueError(f"Invalid repository source: {self.repo_source_path}. "

"Must be a local path to a Git repo or a remote URL.")

# Final check to ensure it's a valid Git repository

try:

_ = git.Repo(self.local_repo_path)

except git.InvalidGitRepositoryError as e:

raise ValueError(f"The path '{self.local_repo_path}' is not a valid Git repository.") from e

return self.local_repo_path

def cleanup(self) -> None:

"""

Removes the cloned repository directory if it was cloned by this manager.

"""

if self.is_cloned and self.local_repo_path and os.path.exists(self.local_repo_path):

print(f"Cleaning up cloned repository: {self.local_repo_path}")

shutil.rmtree(self.local_repo_path)

self.local_repo_path = None

self.is_cloned = False

class GitAnalyzer:

"""

Analyzes a local Git repository to extract metadata such as contributors,

commit history, branches, and tags.

"""

def __init__(self, repo_path: str):

"""

Initializes the GitAnalyzer with the path to the local repository.

Args:

repo_path: The absolute path to the local Git repository.

"""

try:

self.repo = git.Repo(repo_path)

self.repo_path = repo_path

except git.InvalidGitRepositoryError as e:

raise ValueError(f"'{repo_path}' is not a valid Git repository.") from e

def get_contributors(self) -> dict:

"""

Analyzes commit history to identify contributors and their commit counts.

Returns:

A dictionary where keys are contributor names (author name <email>)

and values are their respective commit counts.

"""

contributors = defaultdict(int)

for commit in self.repo.iter_commits():

author_info = f"{commit.author.name} <{commit.author.email}>"

contributors[author_info] += 1

return dict(contributors)

def get_commit_summary(self, max_commits: int = 50) -> list[dict]:

"""

Retrieves a summary of recent commits.

Args:

max_commits: The maximum number of commits to retrieve.

Returns:

A list of dictionaries, each representing a commit with its hash, author,

date, and message.

"""

commit_list = []

for i, commit in enumerate(self.repo.iter_commits()):

if i >= max_commits:

break

commit_list.append({

"hash": commit.hexsha,

"author": f"{commit.author.name} <{commit.author.email}>",

"date": datetime.fromtimestamp(commit.committed_date).strftime('%Y-%m-%d %H:%M:%S'),

"message": commit.message.strip()

})

return commit_list

def get_branches(self) -> list[str]:

"""

Lists all local and remote branches in the repository.

Returns:

A list of branch names.

"""

return [head.name for head in self.repo.heads] + \

[ref.name for ref in self.repo.refs if isinstance(ref, git.RemoteReference)]

def get_tags(self) -> list[str]:

"""

Lists all tags (often representing releases) in the repository.

Returns:

A list of tag names.

"""

return [tag.name for tag in self.repo.tags]

def get_repo_structure(self) -> str:

"""

Generates a simplified tree-like representation of the repository's file structure.

Excludes typical Git-related directories and common build artifacts.

Returns:

A string representing the directory tree.

"""

structure_lines = []

ignore_patterns = ['.git', '__pycache__', 'venv', '.venv', 'node_modules',

'target', 'build', 'dist', '.idea', '.vscode']

for root, dirs, files in os.walk(self.repo_path):

# Filter out ignored directories

dirs[:] = [d for d in dirs if d not in ignore_patterns]

level = root.replace(self.repo_path, '').count(os.sep)

indent = ' ' * level

relative_path = os.path.relpath(root, self.repo_path)

if relative_path == '.': # Don't print '.' for the root itself

structure_lines.append(f"{os.path.basename(self.repo_path)}/")

else:

structure_lines.append(f"{indent}|-- {os.path.basename(root)}/")

subindent = ' ' * (level + 1)

for f in files:

structure_lines.append(f"{subindent}|-- {f}")

return "\n".join(structure_lines)
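
The tree rendering in `get_repo_structure` can be exercised on a throwaway directory. The sketch below is a standalone variant, not the method above: `render_tree` and `DEFAULT_IGNORES` are hypothetical names. It derives the nesting level from the relative path instead of from string replacement and sorts entries so that the output is deterministic.

```python
import os
import tempfile
from typing import Iterable, List

DEFAULT_IGNORES = (".git", "__pycache__", "venv", ".venv", "node_modules")


def render_tree(repo_path: str, ignore: Iterable[str] = DEFAULT_IGNORES) -> str:
    """Render a simple indented tree of repo_path, skipping ignored directories."""
    ignored = set(ignore)
    lines: List[str] = []
    for root, dirs, files in os.walk(repo_path):
        # Prune ignored directories in place and fix the traversal order.
        dirs[:] = sorted(d for d in dirs if d not in ignored)
        rel = os.path.relpath(root, repo_path)
        level = 0 if rel == "." else rel.count(os.sep) + 1
        name = os.path.basename(os.path.abspath(root))
        prefix = "" if level == 0 else "    " * (level - 1) + "|-- "
        lines.append(f"{prefix}{name}/")
        for file_name in sorted(files):
            lines.append("    " * level + f"|-- {file_name}")
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    project = os.path.join(tmp, "demo")
    os.makedirs(os.path.join(project, "src"))
    os.makedirs(os.path.join(project, ".git"))
    for rel_path in ("README.md", os.path.join("src", "main.py")):
        with open(os.path.join(project, rel_path), "w", encoding="utf-8") as handle:
            handle.write("")
    tree = render_tree(project)
    print(tree)
```

Deterministic output matters here because the rendered tree is later embedded in an LLM prompt, and a stable prompt makes runs easier to compare.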

3. `llm_interface.py`

# llm_interface.py

import os

from abc import ABC, abstractmethod

from typing import Any, Dict, List, Optional

from openai import OpenAI # type: ignore

from config import LLMConfig

class LLMClient(ABC):

"""

Abstract base class for LLM clients, defining the common interface.

"""

@abstractmethod

def get_completion(self, prompt: str, temperature: float = 0.7) -> str:

"""

Sends a prompt to the LLM and returns its completion.

Args:

prompt: The text prompt to send to the LLM.

temperature: Controls the randomness of the output. Higher values mean more random.

Returns:

The generated text completion from the LLM.

"""

pass

class OpenAILLMClient(LLMClient):

"""

Concrete implementation of LLMClient for OpenAI API.

"""

def __init__(self, config: LLMConfig):

"""

Initializes the OpenAI LLM client.

Args:

config: An LLMConfig instance containing OpenAI-specific settings.

"""

if config.llm_type != 'openai':

raise ValueError("LLMConfig must be of type 'openai' for OpenAILLMClient.")

if not config.api_key:

raise ValueError("OpenAI API key is missing in configuration.")

self.client = OpenAI(api_key=config.api_key)

self.model_name = config.model_name

print(f"Initialized OpenAI LLM Client with model: {self.model_name}")

def get_completion(self, prompt: str, temperature: float = 0.7) -> str:

"""

Sends a prompt to the OpenAI API and returns its completion.

"""

try:

response = self.client.chat.completions.create(

model=self.model_name,

messages=[

{"role": "system", "content": "You are a helpful assistant."},

{"role": "user", "content": prompt}

],

temperature=temperature,

)

return response.choices[0].message.content if response.choices[0].message.content else ""

except Exception as e:

print(f"Error calling OpenAI API: {e}")

return f"Error: Could not get completion from OpenAI API - {e}"

class LocalLLMClient(LLMClient):

"""

Concrete implementation of LLMClient for local LLM servers (e.g., Ollama).

Assumes a compatible OpenAI-like API endpoint.

"""

def __init__(self, config: LLMConfig):

"""

Initializes the Local LLM client.

Args:

config: An LLMConfig instance containing local LLM-specific settings.

"""

if config.llm_type != 'local':

raise ValueError("LLMConfig must be of type 'local' for LocalLLMClient.")

if not config.base_url:

raise ValueError("Base URL is missing for local LLM configuration.")

self.client = OpenAI(base_url=config.base_url, api_key="ollama") # API key is often dummy for local

self.model_name = config.model_name

print(f"Initialized Local LLM Client with model: {self.model_name} at {config.base_url}")

def get_completion(self, prompt: str, temperature: float = 0.7) -> str:

"""

Sends a prompt to the local LLM server and returns its completion.

"""

try:

response = self.client.chat.completions.create(

model=self.model_name,

messages=[

{"role": "system", "content": "You are a helpful assistant."},

{"role": "user", "content": prompt}

],

temperature=temperature,

)

return response.choices[0].message.content if response.choices[0].message.content else ""

except Exception as e:

print(f"Error calling Local LLM API: {e}")

return f"Error: Could not get completion from Local LLM API - {e}"
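
Since both clients implement the same `get_completion` method, the rest of the pipeline can be exercised without any LLM by substituting a stub. The sketch below is an illustration only: `CompletionClient`, `FakeLLMClient`, and `summarize` are hypothetical names, and the protocol merely mirrors the `LLMClient` interface so that the block runs on its own. In the project, the stub would subclass `LLMClient` instead.

```python
from typing import List, Protocol


class CompletionClient(Protocol):
    """Structural stand-in for the LLMClient interface."""

    def get_completion(self, prompt: str, temperature: float = 0.7) -> str: ...


class FakeLLMClient:
    """Deterministic stub that records prompts and returns a canned reply."""

    def __init__(self, reply_prefix: str = "SUMMARY") -> None:
        self.reply_prefix = reply_prefix
        self.prompts: List[str] = []

    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        self.prompts.append(prompt)
        first_line = prompt.strip().splitlines()[0] if prompt.strip() else ""
        return f"{self.reply_prefix}: {first_line[:40]}"


def summarize(client: CompletionClient, file_path: str, content: str) -> str:
    """Tiny stand-in for a summarizer that depends only on the interface."""
    return client.get_completion(f"File: {file_path}\nContent:\n{content}")


fake = FakeLLMClient()
result = summarize(fake, "src/utils.py", "def greet(name): ...")
print(result)  # SUMMARY: File: src/utils.py
```

Recording the prompts in the stub also makes it possible to assert on what would have been sent to the model, which is useful when tuning prompt wording.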

4. `summarization.py`

```python

# summarization.py

import os

import json

import mimetypes # Used for file type guessing

from typing import Dict, List, Tuple, Any, Optional

from llm_interface import LLMClient

class FileProcessor:

"""

Handles reading and processing of individual files within the repository.

"""

def __init__(self, repo_root: str):

"""

Initializes the FileProcessor.

Args:

repo_root: The root directory of the Git repository.

"""

self.repo_root = repo_root

def read_file_content(self, file_path: str) -> Optional[str]:

"""

Reads the content of a specified file.

Handles common encoding issues and skips binary files.

Args:

file_path: The absolute path to the file.

Returns:

The content of the file as a string, or None if it's a binary file

or cannot be read.

"""

if not os.path.exists(file_path) or not os.path.isfile(file_path):

print(f"Warning: File not found or is not a file: {file_path}")

return None

# Heuristic to skip binary files. Several text formats (for example JSON)

# are reported under application/*, so those types are allowed explicitly.

text_like_types = {'application/json', 'application/xml', 'application/javascript', 'application/x-sh'}

mime_type_guess, _ = mimetypes.guess_type(file_path)

if mime_type_guess and not (mime_type_guess.startswith('text/') or mime_type_guess in text_like_types):

print(f"Skipping binary file: {file_path} (MIME type: {mime_type_guess})")

return None

# Attempt to read as text

try:

with open(file_path, 'r', encoding='utf-8') as f:

return f.read()

except UnicodeDecodeError:

print(f"Skipping non-UTF-8 or binary file: {file_path}")

return None

except Exception as e:

print(f"Error reading file {file_path}: {e}")

return None

class LLMSummarizer:

"""

Uses an LLM to generate summaries for file contents and aggregated information.

"""

def __init__(self, llm_client: LLMClient):

"""

Initializes the LLMSummarizer with an LLM client.

Args:

llm_client: An instance of a concrete LLMClient implementation.

"""

self.llm_client = llm_client

def summarize_file(self, file_path: str, file_content: str) -> str:

"""

Generates a concise summary for a single file's content.

Args:

file_path: The relative path of the file being summarized.

file_content: The full content of the file.

Returns:

A summary string generated by the LLM.

"""

prompt = (

f"You are an expert software engineer tasked with summarizing code and configuration files. "

f"Provide a concise summary of the purpose, key functionalities, and important configurations "

f"or dependencies found in the following file. Focus on what this file *does* and its role "

f"within a larger project. Keep the summary under 150 words.\n\n"

f"File: {file_path}\n"

f"Content:\n```\n{file_content}\n```\n\n"

f"Concise Summary:"

)

return self.llm_client.get_completion(prompt)

def summarize_directory(self, directory_path: str, combined_context: Dict[str, str]) -> str:

"""

Generates a summary for a directory based on the summaries of its contained files and sub-directories.

Args:

directory_path: The relative path of the directory.

combined_context: A dictionary mapping file/sub-directory paths to their summaries within this directory.

Returns:

A summary string for the directory.

"""

if not combined_context:

return f"Directory '{directory_path}' contains no relevant files or summaries."

summaries_text = "\n".join([f"- {path}: {summary}" for path, summary in combined_context.items()])

dir_name_display = directory_path if directory_path != "." else "the root directory"

prompt = (

f"You are an expert software architect analyzing a project structure. "

f"Based on the following file and sub-directory summaries, provide a concise overview of the purpose "

f"and primary functionalities of {dir_name_display}. "

f"Identify any common themes, dependencies, or architectural patterns. "

f"Keep the summary under 200 words.\n\n"

f"Directory: {dir_name_display}\n"

f"Contextual Summaries:\n{summaries_text}\n\n"