Introduction
Modern software development relies heavily on version control systems, and Git is by far the most widely used. Understanding a complex Git repository, especially an unfamiliar one, can be daunting and time-consuming for developers, project managers, and new team members alike. The volume of code, commit history, branching strategies, and documentation is a significant barrier to rapid comprehension. This article describes the architecture of an LLM-based Git analysis agent that automates this process and produces a detailed summary of a given repository.
The core challenge is bridging the gap between raw repository data and human-readable, high-level summaries. A second, technical hurdle is the context window limit of any Large Language Model (LLM): a typical repository can contain hundreds or thousands of files, whose combined content can far exceed the token capacity of current models if submitted in a single prompt. The agent addresses this with a progressive summarization strategy that breaks the analysis into manageable chunks: files are summarized first, then directories, then the repository as a whole.
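To make the context-window constraint concrete, the sketch below estimates whether a set of files could fit into a single prompt. The ratio of four characters per token is only a rule of thumb, not the behavior of any particular tokenizer, and the 128,000-token window is an assumed figure. All names here are illustrative and not part of the agent.

```python
# Rough illustration of why whole-repository prompts do not fit.
# ASSUMPTIONS: ~4 characters per token (a rule of thumb, not a real tokenizer)
# and a 128,000-token context window (an illustrative figure).
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW_TOKENS = 128_000


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its character length."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN  # ceiling division


def fits_in_context(texts: list[str], reserved_for_output: int = 4_000) -> bool:
    """Return True if all texts together fit, leaving room for the reply."""
    budget = CONTEXT_WINDOW_TOKENS - reserved_for_output
    return sum(estimate_tokens(t) for t in texts) <= budget


if __name__ == "__main__":
    # 1,000 files of about 8 KB each: far more than one prompt can hold.
    files = ["x" * 8_000] * 1_000
    print(sum(estimate_tokens(f) for f in files), fits_in_context(files))
```

Under these assumptions a single 8 KB file fits comfortably while a thousand of them do not, which is the motivation for summarizing file by file.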
Agent Architecture Overview
The Git analysis agent consists of a series of interconnected modules, each responsible for one aspect of repository understanding or information synthesis. This modular design is intended to keep the agent maintainable and easy to extend. The overall flow begins with user input, proceeds through repository acquisition and detailed analysis, uses LLMs for summarization, and ends with a structured report.
Here is a conceptual ASCII diagram illustrating the agent's architecture:
+---------------------+     +--------------------------+
| User Configuration  |---->| Repository Acquisition   |
| (LLM, Repo Path)    |     | (Local/Remote)           |
+---------------------+     +--------------------------+
           |                             |
           v                             v
+---------------------+     +--------------------------+
| Orchestration Engine|---->| Git Interaction Module   |
| (Main Control Flow) |     | (Log, Diff, Files, Tags) |
+---------------------+     +--------------------------+
           |                             |
           v                             v
+---------------------+     +--------------------------+
| File Analysis Module|---->| LLM Integration Layer    |
| (Read, Chunk Files) |     | (Prompting, API Calls)   |
+---------------------+     +--------------------------+
           |                             |
           v                             v
+---------------------+     +--------------------------+
| Progressive         |     | Output Generation        |
| Summarization &     |---->| (Structured Report)      |
| Memory Module       |     +--------------------------+
+---------------------+
The flow starts with the user supplying configuration details: the target repository's location and the LLM settings. The Repository Acquisition module then obtains the repository, whether it is a local directory or a remote URL. The Orchestration Engine is the central coordinator. It delegates to the Git Interaction Module to extract metadata such as commit history, branches, and tags, and then to the File Analysis Module, which reads individual files and prepares their content for LLM processing. The LLM Integration Layer manages all communication with the chosen model, building prompts and parsing responses. The Progressive Summarization and Memory Module aggregates file-level summaries into higher-level insights, which keeps each prompt within the context window. Finally, the Output Generation module compiles the gathered and summarized information into a coherent report for the user.
Detailed Constituent Descriptions
The following sections describe each component of the agent in turn, with code examples that illustrate how it works.
Configuration Management
Effective configuration management is paramount for flexibility and ease of use. The user must be able to specify the repository path (local or remote) and the details for the LLM, including whether it is a local model (e.g., via Ollama or a local server) or a remote API (e.g., OpenAI, Azure OpenAI). This module centralizes these settings, making them accessible throughout the agent.
We define two classes to encapsulate these settings: `LLMConfig` for the model and `AgentConfig` for the agent as a whole. Both validate their inputs so that all necessary parameters are available before the analysis begins.
# config.py
import os
from typing import Optional


class LLMConfig:
    """
    Encapsulates configuration settings for the Large Language Model.
    Supports both remote API-based LLMs and local server-based LLMs.
    """

    def __init__(self,
                 llm_type: str,  # 'openai', 'local'
                 api_key: Optional[str] = None,
                 model_name: str = "gpt-4o-mini",
                 base_url: Optional[str] = None):
        """
        Initializes the LLM configuration.

        Args:
            llm_type: Specifies the type of LLM ('openai' for remote API, 'local' for a local server).
            api_key: The API key for remote LLM services (e.g., OpenAI API key).
                     This should ideally be loaded from environment variables for security.
            model_name: The specific model identifier to use (e.g., "gpt-4o-mini", "llama3").
            base_url: The base URL for local LLM servers (e.g., "http://localhost:11434/v1").
        """
        if llm_type not in ['openai', 'local']:
            raise ValueError("llm_type must be 'openai' or 'local'")
        self.llm_type = llm_type
        self.api_key = api_key if api_key else os.getenv("OPENAI_API_KEY")
        self.model_name = model_name
        self.base_url = base_url
        if self.llm_type == 'openai' and not self.api_key:
            raise ValueError("OPENAI_API_KEY environment variable or api_key must be set for OpenAI LLM type.")
        if self.llm_type == 'local' and not self.base_url:
            raise ValueError("base_url must be set for local LLM type.")

    def __repr__(self) -> str:
        """Provides a string representation of the LLMConfig object (the API key is omitted)."""
        return (f"LLMConfig(llm_type='{self.llm_type}', model_name='{self.model_name}', "
                f"base_url='{self.base_url if self.base_url else 'N/A'}')")


class AgentConfig:
    """
    Main configuration class for the Git analysis agent.
    Holds repository path and LLM configuration.
    """

    def __init__(self,
                 repo_path: str,
                 llm_config: LLMConfig,
                 output_dir: str = "analysis_results"):
        """
        Initializes the agent configuration.

        Args:
            repo_path: The path to the local Git repository or its remote URL.
            llm_config: An instance of LLMConfig containing LLM-specific settings.
            output_dir: The directory where analysis results and summaries will be stored.
        """
        self.repo_path = repo_path
        self.llm_config = llm_config
        self.output_dir = output_dir
        # Ensure output directory exists
        os.makedirs(self.output_dir, exist_ok=True)

    def __repr__(self) -> str:
        """Provides a string representation of the AgentConfig object."""
        return (f"AgentConfig(repo_path='{self.repo_path}', llm_config={self.llm_config}, "
                f"output_dir='{self.output_dir}')")
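As a usage illustration, the settings could be collected from environment variables before constructing the configuration objects. The sketch below is one possible approach, not part of the agent: `GIT_AGENT_LLM_TYPE`, `GIT_AGENT_MODEL`, and `GIT_AGENT_BASE_URL` are hypothetical variable names, while `OPENAI_API_KEY` is the variable that `LLMConfig` already reads.

```python
import os
from typing import Mapping, Optional


def llm_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for an LLMConfig-style constructor.

    GIT_AGENT_LLM_TYPE, GIT_AGENT_MODEL and GIT_AGENT_BASE_URL are hypothetical
    variable names chosen for this sketch.
    """
    env = os.environ if env is None else env
    llm_type = env.get("GIT_AGENT_LLM_TYPE", "openai")
    settings = {
        "llm_type": llm_type,
        "model_name": env.get("GIT_AGENT_MODEL", "gpt-4o-mini"),
    }
    if llm_type == "local":
        # The default URL mirrors the example given in the LLMConfig docstring.
        settings["base_url"] = env.get("GIT_AGENT_BASE_URL", "http://localhost:11434/v1")
    else:
        settings["api_key"] = env.get("OPENAI_API_KEY")
    return settings
```

The result could then be passed on as `LLMConfig(**llm_settings_from_env())`, which keeps secrets out of source files and leaves validation to the constructor.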
Repository Acquisition
This module is responsible for obtaining the Git repository. It must handle two primary scenarios: a local file path already present on the user's machine or a remote URL pointing to a repository on platforms like GitHub or GitLab. For remote repositories, it performs a clone operation.
The `GitRepositoryManager` class encapsulates the logic for cloning remote repositories and validating local paths. It ensures that the agent always operates on a valid, accessible Git repository.
# git_operations.py
import os
import shutil
from typing import Optional

import git  # type: ignore  # gitpython library


class GitRepositoryManager:
    """
    Manages the acquisition and cleanup of Git repositories.
    Handles cloning remote repositories and validating local paths.
    """

    def __init__(self, repo_source_path: str, clone_dir: str = "cloned_repos"):
        """
        Initializes the GitRepositoryManager.

        Args:
            repo_source_path: The path to the local Git repository or its remote URL.
            clone_dir: The directory where remote repositories will be cloned.
        """
        self.repo_source_path = repo_source_path
        self.clone_dir = clone_dir
        self.local_repo_path: Optional[str] = None
        self.is_cloned = False
        os.makedirs(self.clone_dir, exist_ok=True)

    def acquire_repository(self) -> str:
        """
        Acquires the Git repository, either by using a local path or cloning a remote one.

        Returns:
            The absolute path to the local Git repository directory.

        Raises:
            ValueError: If the provided path is not a valid Git repository.
            git.InvalidGitRepositoryError: If an existing clone directory is not a Git repo.
            git.GitCommandError: If a git command fails during cloning or pulling.
        """
        if os.path.isdir(self.repo_source_path) and \
                os.path.exists(os.path.join(self.repo_source_path, '.git')):
            # It's already a local Git repository
            self.local_repo_path = os.path.abspath(self.repo_source_path)
            print(f"Using local repository at: {self.local_repo_path}")
        elif self.repo_source_path.startswith(('http://', 'https://', 'git@')):
            # It's a remote URL, clone it. Strip a trailing slash and only a
            # trailing '.git', so names such as 'user.github.io' stay intact.
            repo_name = self.repo_source_path.rstrip('/').split('/')[-1]
            if repo_name.endswith('.git'):
                repo_name = repo_name[:-len('.git')]
            target_path = os.path.join(self.clone_dir, repo_name)
            if os.path.exists(target_path):
                print(f"Repository already cloned to {target_path}. Pulling latest changes...")
                repo = git.Repo(target_path)
                origin = repo.remotes.origin
                origin.pull()
            else:
                print(f"Cloning remote repository {self.repo_source_path} to {target_path}...")
                git.Repo.clone_from(self.repo_source_path, target_path)
            self.local_repo_path = os.path.abspath(target_path)
            self.is_cloned = True
            print(f"Repository successfully cloned/updated at: {self.local_repo_path}")
        else:
            raise ValueError(f"Invalid repository source: {self.repo_source_path}. "
                             "Must be a local path to a Git repo or a remote URL.")
        # Final check to ensure it's a valid Git repository
        try:
            _ = git.Repo(self.local_repo_path)
        except git.InvalidGitRepositoryError as e:
            raise ValueError(f"The path '{self.local_repo_path}' is not a valid Git repository.") from e
        return self.local_repo_path

    def cleanup(self) -> None:
        """
        Removes the cloned repository directory if it was cloned by this manager.
        """
        if self.is_cloned and self.local_repo_path and os.path.exists(self.local_repo_path):
            print(f"Cleaning up cloned repository: {self.local_repo_path}")
            shutil.rmtree(self.local_repo_path)
            self.local_repo_path = None
            self.is_cloned = False
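Deriving a clone directory name from a remote address has a few edge cases: trailing slashes, scp-style addresses such as `git@host:owner/repo.git`, and repository names that contain `.git` in the middle. The helper below is a minimal, standalone sketch of one way to handle them. The function name is hypothetical, and the addresses in the usage note are made-up examples.

```python
def repo_name_from_source(source: str) -> str:
    """Derive a directory name from a remote URL or scp-style address."""
    trimmed = source.rstrip("/")
    # scp-style addresses (git@host:owner/repo.git) separate host and path
    # with ':', so treat ':' like '/' before taking the last component.
    last = trimmed.replace(":", "/").split("/")[-1]
    # Remove only a trailing '.git', never an occurrence inside the name.
    if last.endswith(".git"):
        last = last[: -len(".git")]
    if not last:
        raise ValueError(f"Cannot derive a repository name from {source!r}")
    return last
```

For instance, `repo_name_from_source("https://example.com/user/my.github.io/")` yields `my.github.io`, whereas a blanket `replace('.git', '')` would mangle that name.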
Repository Traversal and Git Metadata Extraction
Once the repository is acquired, the `GitAnalyzer` module takes over to extract crucial metadata from the Git history. This includes information about contributors, commit patterns, branches, tags (representing releases), and general repository statistics. This data provides foundational context for the LLM's subsequent analysis.
# git_operations.py (continued)
from collections import defaultdict
from datetime import datetime


class GitAnalyzer:
    """
    Analyzes a local Git repository to extract metadata such as contributors,
    commit history, branches, and tags.
    """

    def __init__(self, repo_path: str):
        """
        Initializes the GitAnalyzer with the path to the local repository.

        Args:
            repo_path: The absolute path to the local Git repository.
        """
        try:
            self.repo = git.Repo(repo_path)
            self.repo_path = repo_path
        except git.InvalidGitRepositoryError as e:
            raise ValueError(f"'{repo_path}' is not a valid Git repository.") from e

    def get_contributors(self) -> dict:
        """
        Analyzes commit history to identify contributors and their commit counts.

        Returns:
            A dictionary where keys are contributor names (author name <email>)
            and values are their respective commit counts.
        """
        contributors = defaultdict(int)
        for commit in self.repo.iter_commits():
            author_info = f"{commit.author.name} <{commit.author.email}>"
            contributors[author_info] += 1
        return dict(contributors)

    def get_commit_summary(self, max_commits: int = 50) -> list[dict]:
        """
        Retrieves a summary of recent commits.

        Args:
            max_commits: The maximum number of commits to retrieve.

        Returns:
            A list of dictionaries, each representing a commit with its hash, author,
            date (in the local time zone), and message.
        """
        commit_list = []
        for i, commit in enumerate(self.repo.iter_commits()):
            if i >= max_commits:
                break
            commit_list.append({
                "hash": commit.hexsha,
                "author": f"{commit.author.name} <{commit.author.email}>",
                "date": datetime.fromtimestamp(commit.committed_date).strftime('%Y-%m-%d %H:%M:%S'),
                "message": commit.message.strip()
            })
        return commit_list

    def get_branches(self) -> list[str]:
        """
        Lists all local and remote-tracking branches in the repository.

        Returns:
            A list of branch names (remote branches appear as e.g. 'origin/main').
        """
        local_branches = [head.name for head in self.repo.heads]
        # repo.remotes yields remote names such as 'origin', not branches, so
        # collect the remote-tracking references instead.
        remote_branches = [ref.name for ref in self.repo.refs
                           if isinstance(ref, git.RemoteReference)]
        return local_branches + remote_branches

    def get_tags(self) -> list[str]:
        """
        Lists all tags (often representing releases) in the repository.

        Returns:
            A list of tag names.
        """
        return [tag.name for tag in self.repo.tags]

    def get_repo_structure(self) -> str:
        """
        Generates a simplified tree-like representation of the repository's file structure.
        Excludes typical Git-related directories and common build artifacts.

        Returns:
            A string representing the directory tree.
        """
        structure_lines = []
        ignore_patterns = ['.git', '__pycache__', 'venv', '.venv', 'node_modules',
                           'target', 'build', 'dist', '.idea', '.vscode']
        for root, dirs, files in os.walk(self.repo_path):
            # Filter out ignored directories
            dirs[:] = [d for d in dirs if d not in ignore_patterns]
            level = root.replace(self.repo_path, '').count(os.sep)
            indent = ' ' * level
            relative_path = os.path.relpath(root, self.repo_path)
            if relative_path == '.':  # Don't print '.' for the root itself
                structure_lines.append(f"{os.path.basename(self.repo_path)}/")
            else:
                structure_lines.append(f"{indent}|-- {os.path.basename(root)}/")
            subindent = ' ' * (level + 1)
            for f in files:
                structure_lines.append(f"{subindent}|-- {f}")
        return "\n".join(structure_lines)
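In repositories with many contributors, the full contributor dictionary can itself become a large part of a prompt. One option, sketched below under that assumption, is to condense the output of `get_contributors` to the most active authors before it reaches the LLM. The function name and the default limit are hypothetical.

```python
from collections import Counter


def top_contributors(commit_counts: dict[str, int], limit: int = 3) -> list[str]:
    """Format the `limit` most active contributors with their share of commits."""
    total = sum(commit_counts.values())
    if total == 0:
        return []
    # Counter.most_common sorts by count, highest first.
    ranked = Counter(commit_counts).most_common(limit)
    return [f"{author}: {count} commits ({count / total:.0%})" for author, count in ranked]
```

The percentages give the LLM a sense of how concentrated the work is without listing every author.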
File-Level Analysis and Summarization Strategy
This is the core module addressing the LLM context window limitation. Instead of feeding the entire repository to the LLM, the agent processes files individually. The `FileProcessor` reads file contents, and the `LLMSummarizer` then uses the LLM to generate a concise summary for each file. This keeps each request small, although a single very large file can still exceed the context window; the orchestrator shown later skips files larger than 1 MB.
The `LLMClient` acts as an abstraction layer for interacting with different LLM providers, making the system flexible.
# llm_interface.py
from abc import ABC, abstractmethod

from openai import OpenAI  # type: ignore

from config import LLMConfig


class LLMClient(ABC):
    """
    Abstract base class for LLM clients, defining the common interface.
    """

    @abstractmethod
    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        """
        Sends a prompt to the LLM and returns its completion.

        Args:
            prompt: The text prompt to send to the LLM.
            temperature: Controls the randomness of the output. Higher values mean more random.

        Returns:
            The generated text completion from the LLM.
        """
        pass


class OpenAILLMClient(LLMClient):
    """
    Concrete implementation of LLMClient for OpenAI API.
    """

    def __init__(self, config: LLMConfig):
        """
        Initializes the OpenAI LLM client.

        Args:
            config: An LLMConfig instance containing OpenAI-specific settings.
        """
        if config.llm_type != 'openai':
            raise ValueError("LLMConfig must be of type 'openai' for OpenAILLMClient.")
        if not config.api_key:
            raise ValueError("OpenAI API key is missing in configuration.")
        self.client = OpenAI(api_key=config.api_key)
        self.model_name = config.model_name
        print(f"Initialized OpenAI LLM Client with model: {self.model_name}")

    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        """
        Sends a prompt to the OpenAI API and returns its completion.
        On failure, the error is returned as text rather than raised.
        """
        try:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
                temperature=temperature,
            )
            return response.choices[0].message.content if response.choices[0].message.content else ""
        except Exception as e:
            print(f"Error calling OpenAI API: {e}")
            return f"Error: Could not get completion from OpenAI API - {e}"


class LocalLLMClient(LLMClient):
    """
    Concrete implementation of LLMClient for local LLM servers (e.g., Ollama).
    Assumes a compatible OpenAI-like API endpoint.
    """

    def __init__(self, config: LLMConfig):
        """
        Initializes the Local LLM client.

        Args:
            config: An LLMConfig instance containing local LLM-specific settings.
        """
        if config.llm_type != 'local':
            raise ValueError("LLMConfig must be of type 'local' for LocalLLMClient.")
        if not config.base_url:
            raise ValueError("Base URL is missing for local LLM configuration.")
        self.client = OpenAI(base_url=config.base_url, api_key="ollama")  # API key is often dummy for local
        self.model_name = config.model_name
        print(f"Initialized Local LLM Client with model: {self.model_name} at {config.base_url}")

    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        """
        Sends a prompt to the local LLM server and returns its completion.
        On failure, the error is returned as text rather than raised.
        """
        try:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
                temperature=temperature,
            )
            return response.choices[0].message.content if response.choices[0].message.content else ""
        except Exception as e:
            print(f"Error calling Local LLM API: {e}")
            return f"Error: Could not get completion from Local LLM API - {e}"
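One practical benefit of the `LLMClient` abstraction is that the rest of the agent can be exercised without any network access. The sketch below repeats the interface so that it runs on its own and adds a `CannedLLMClient`, a hypothetical test double that is not part of the agent. It returns a fixed reply and records the prompts it receives.

```python
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """Same shape as the interface above, repeated so the sketch is standalone."""

    @abstractmethod
    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        ...


class CannedLLMClient(LLMClient):
    """Hypothetical test double: fixed reply, prompts recorded for inspection."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts: list[str] = []

    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        self.prompts.append(prompt)
        return self.reply


# Any component that accepts an LLMClient can be handed the stub instead.
client = CannedLLMClient("stub summary")
result = client.get_completion("Summarize config.py")
```

Because the summarizer depends only on `get_completion`, a stub like this makes it possible to check prompt construction and aggregation logic deterministically.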
The `FileProcessor` is responsible for reading file content, while the `LLMSummarizer` orchestrates the prompt creation and interaction with the `LLMClient`.
# summarization.py
import mimetypes
import os
from typing import Any, Dict, Optional

from llm_interface import LLMClient


class FileProcessor:
    """
    Handles reading and processing of individual files within the repository.
    """

    def __init__(self, repo_root: str):
        """
        Initializes the FileProcessor.

        Args:
            repo_root: The root directory of the Git repository.
        """
        self.repo_root = repo_root

    def read_file_content(self, file_path: str) -> Optional[str]:
        """
        Reads the content of a specified file.
        Handles common encoding issues and skips binary files.

        Args:
            file_path: The absolute path to the file.

        Returns:
            The content of the file as a string, or None if it's a binary file
            or cannot be read.
        """
        if not os.path.exists(file_path) or not os.path.isfile(file_path):
            print(f"Warning: File not found or is not a file: {file_path}")
            return None
        # Heuristic to skip binary files. Only media types that are reliably binary
        # are rejected here; text formats reported as application/* (for example
        # package.json as application/json) must still be read.
        mime_type_guess, _ = mimetypes.guess_type(file_path)
        if mime_type_guess and mime_type_guess.startswith(('image/', 'audio/', 'video/', 'font/')):
            print(f"Skipping binary file: {file_path} (MIME type: {mime_type_guess})")
            return None
        # Attempt to read as text; other binary formats fail to decode and are skipped
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                return f.read()
        except UnicodeDecodeError:
            print(f"Skipping non-UTF-8 or binary file: {file_path}")
            return None
        except Exception as e:
            print(f"Error reading file {file_path}: {e}")
            return None
class LLMSummarizer:
    """
    Uses an LLM to generate summaries for file contents and aggregated information.
    """

    def __init__(self, llm_client: LLMClient):
        """
        Initializes the LLMSummarizer with an LLM client.

        Args:
            llm_client: An instance of a concrete LLMClient implementation.
        """
        self.llm_client = llm_client

    def summarize_file(self, file_path: str, file_content: str) -> str:
        """
        Generates a concise summary for a single file's content.

        Args:
            file_path: The relative path of the file being summarized.
            file_content: The full content of the file.

        Returns:
            A summary string generated by the LLM.
        """
        prompt = (
            f"You are an expert software engineer tasked with summarizing code and configuration files. "
            f"Provide a concise summary of the purpose, key functionalities, and important configurations "
            f"or dependencies found in the following file. Focus on what this file *does* and its role "
            f"within a larger project. Keep the summary under 150 words.\n\n"
            f"File: {file_path}\n"
            f"Content:\n```\n{file_content}\n```\n\n"
            f"Concise Summary:"
        )
        return self.llm_client.get_completion(prompt)

    def summarize_directory(self, directory_path: str, file_summaries: Dict[str, str]) -> str:
        """
        Generates a summary for a directory based on the summaries of its contained files.

        Args:
            directory_path: The relative path of the directory.
            file_summaries: A dictionary mapping file paths to their summaries within this directory.

        Returns:
            A summary string for the directory.
        """
        if not file_summaries:
            return f"Directory '{directory_path}' contains no relevant files or summaries."
        summaries_text = "\n".join([f"- {path}: {summary}" for path, summary in file_summaries.items()])
        prompt = (
            f"You are an expert software architect analyzing a project structure. "
            f"Based on the following file summaries, provide a concise overview of the purpose "
            f"and primary functionalities of the directory '{directory_path}'. "
            f"Identify any common themes, dependencies, or architectural patterns. "
            f"Keep the summary under 200 words.\n\n"
            f"Directory: {directory_path}\n"
            f"File Summaries:\n{summaries_text}\n\n"
            f"Concise Directory Summary:"
        )
        return self.llm_client.get_completion(prompt)

    def summarize_repository(self,
                             repo_name: str,
                             repo_structure: str,
                             directory_summaries: Dict[str, str],
                             git_metadata: Dict[str, Any]) -> str:
        """
        Generates a comprehensive summary of the entire repository.

        Args:
            repo_name: The name of the repository.
            repo_structure: A string representation of the repository's file structure.
            directory_summaries: A dictionary mapping directory paths to their summaries.
            git_metadata: A dictionary containing aggregated Git metadata (contributors, commits, etc.).

        Returns:
            A comprehensive summary string for the entire repository.
        """
        dir_summaries_text = "\n".join([f"- {path}: {summary}" for path, summary in directory_summaries.items()])
        contributors_text = "\n".join([f" - {author} ({count} commits)" for author, count in git_metadata.get('contributors', {}).items()])
        recent_commits_text = "\n".join([f" - {c['date']} by {c['author']}: {c['message']}" for c in git_metadata.get('recent_commits', [])[:5]])
        branches_text = ", ".join(git_metadata.get('branches', []))
        tags_text = ", ".join(git_metadata.get('tags', []))
        prompt = (
            f"You are a highly intelligent AI assistant specializing in software project analysis. "
            f"Your task is to provide a comprehensive and detailed summary of the Git repository named '{repo_name}'. "
            f"Synthesize information from the repository's structure, directory-level summaries, and Git metadata. "
            f"Cover the following aspects:\n"
            f"1. **Overall Purpose and Key Functionalities:** What is the project about? What problems does it solve?\n"
            f"2. **Architectural Overview/Structure:** Describe the main components and how they are organized.\n"
            f"3. **Core Technologies/Dependencies:** Identify programming languages, frameworks, and key libraries.\n"
            f"4. **Development Environment/Setup:** How would one set up and run this project? (e.g., Docker, `requirements.txt`)\n"
            f"5. **Key Contributors and Activity:** Who are the main developers and what is the recent activity?\n"
            f"6. **Release Strategy/Versioning:** How are releases managed (tags, branches)?\n"
            f"7. **Known Issues/Limitations:** Any explicit mentions of problems or areas for improvement (from README/comments).\n"
            f"8. **Evolution/Changes:** High-level overview of recent significant changes.\n\n"
            f"Repository Name: {repo_name}\n"
            f"Repository Structure:\n{repo_structure}\n\n"
            f"Directory Summaries:\n{dir_summaries_text}\n\n"
            f"Git Metadata:\n"
            f" Contributors:\n{contributors_text}\n"
            f" Recent Commits:\n{recent_commits_text}\n"
            f" Branches: {branches_text}\n"
            f" Tags (Releases): {tags_text}\n\n"
            f"Comprehensive Repository Summary:"
        )
        return self.llm_client.get_completion(prompt, temperature=0.2)  # Lower temperature for factual summary
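Because `summarize_file` sends a file's full content in one prompt, a single large file can still exceed the context window. One common remedy, sketched here as an assumption rather than as part of the agent, is to split the content into overlapping chunks, summarize each chunk, and then combine the partial summaries in one more call. Only the splitting step is shown. The function name and default sizes are hypothetical, and sizes are measured in characters rather than tokens for simplicity.

```python
def chunk_text(text: str, max_chars: int = 12_000, overlap: int = 200) -> list[str]:
    """Split `text` into pieces of at most `max_chars`, overlapping by `overlap`."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the chunk just added reached the end of the text
        # Step forward, keeping `overlap` characters of shared context.
        start += max_chars - overlap
    return chunks
```

The overlap keeps a little shared context across chunk boundaries, so a function that straddles a boundary is at least partly visible in both neighboring chunks.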
Progressive Summarization and Memory
This module is crucial for managing the context window. It stores file-level summaries, aggregates them into directory-level summaries, and finally combines those into an overall repository summary. With this hierarchy, each prompt carries progressively distilled information rather than raw repository content, which keeps prompts far smaller than the repository itself. The `SummaryAggregator` stores the intermediate results and writes each one to the output directory.
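The aggregation order can be illustrated independently of the agent's classes. The sketch below groups file summaries by parent directory and orders the directories so that the deepest come first. It assumes forward-slash relative paths, and both function names are hypothetical.

```python
import posixpath
from collections import defaultdict


def group_by_directory(file_summaries: dict[str, str]) -> dict[str, dict[str, str]]:
    """Group file summaries under their parent directory ('.' for the root)."""
    grouped: dict[str, dict[str, str]] = defaultdict(dict)
    for path, summary in file_summaries.items():
        parent = posixpath.dirname(path) or "."
        grouped[parent][path] = summary
    return dict(grouped)


def deepest_first(directories: list[str]) -> list[str]:
    """Order directories so that children come before their parents."""
    def depth(directory: str) -> int:
        return 0 if directory == "." else directory.count("/") + 1
    # Sort by descending depth, then alphabetically for a stable order.
    return sorted(directories, key=lambda d: (-depth(d), d))
```

Processing directories in this order means that by the time a parent directory is summarized, summaries for everything beneath it already exist.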
# summarization.py (continued)
import json


class SummaryAggregator:
    """
    Manages the storage and aggregation of file and directory summaries.
    """

    def __init__(self, output_dir: str):
        """
        Initializes the SummaryAggregator.

        Args:
            output_dir: The directory where summaries will be saved.
        """
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        self.file_summaries: Dict[str, str] = {}
        self.directory_summaries: Dict[str, str] = {}
        self.repo_summary: Optional[str] = None
        self.git_metadata: Dict[str, Any] = {}

    def add_file_summary(self, relative_path: str, summary: str) -> None:
        """
        Adds a summary for a specific file.

        Args:
            relative_path: The path of the file relative to the repository root.
            summary: The LLM-generated summary for the file.
        """
        self.file_summaries[relative_path] = summary
        self._save_summary(f"file_summary_{relative_path.replace(os.sep, '_').replace('.', '_')}.txt", summary)

    def add_directory_summary(self, relative_path: str, summary: str) -> None:
        """
        Adds a summary for a specific directory.

        Args:
            relative_path: The path of the directory relative to the repository root.
            summary: The LLM-generated summary for the directory.
        """
        self.directory_summaries[relative_path] = summary
        self._save_summary(f"dir_summary_{relative_path.replace(os.sep, '_')}.txt", summary)

    def set_repo_summary(self, summary: str) -> None:
        """
        Sets the final comprehensive repository summary.

        Args:
            summary: The LLM-generated summary for the entire repository.
        """
        self.repo_summary = summary
        self._save_summary("repository_summary.txt", summary)

    def set_git_metadata(self, metadata: Dict[str, Any]) -> None:
        """
        Stores the extracted Git metadata.

        Args:
            metadata: A dictionary containing Git metadata.
        """
        self.git_metadata = metadata
        self._save_summary("git_metadata.json", json.dumps(metadata, indent=2))

    def get_file_summaries_for_directory(self, relative_dir_path: str) -> Dict[str, str]:
        """
        Retrieves file summaries belonging to a specific directory.

        Args:
            relative_dir_path: The relative path of the directory ('.' or '' for the root).

        Returns:
            A dictionary of file paths to summaries for files directly inside that directory.
        """
        if relative_dir_path in (".", ""):  # Root directory
            # Root-level files, including README.md, have no directory component
            return {p: s for p, s in self.file_summaries.items() if not os.path.dirname(p)}
        # For subdirectories, keep only the files that sit directly inside the directory
        return {p: s for p, s in self.file_summaries.items()
                if os.path.dirname(p) == relative_dir_path}

    def _save_summary(self, filename: str, content: str) -> None:
        """
        Helper method to save a summary to a file.
        """
        file_path = os.path.join(self.output_dir, filename)
        try:
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(content)
            print(f"Saved summary to {file_path}")
        except Exception as e:
            print(f"Error saving summary to {file_path}: {e}")
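Flattening a path by replacing separators and dots with underscores maps distinct paths such as `a/b.py` and `a_b.py` to the same filename, so one saved summary could overwrite another. A possible mitigation, shown here as a sketch with a hypothetical function name, is to append a short hash of the original path.

```python
import hashlib
import re


def summary_filename(relative_path: str, prefix: str = "file_summary") -> str:
    """Return a flat, filesystem-safe filename that stays unique per path."""
    # Replace anything outside a conservative character set with '_'.
    readable = re.sub(r"[^A-Za-z0-9]+", "_", relative_path).strip("_")[:60]
    # A short digest of the original path keeps 'a/b.py' and 'a_b.py' apart.
    digest = hashlib.sha256(relative_path.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{readable}_{digest}.txt"
```

The readable part keeps files easy to find by eye, while the digest makes accidental collisions very unlikely.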
Output Generation
The final stage involves compiling all the gathered and summarized information into a coherent, human-readable report. This report should present the repository's structure, purpose, key features, development environment, contributors, and any identified issues or release information in an organized manner. The `GitAnalysisAgent` itself will handle the final report generation by orchestrating the collection of all summaries.
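The article does not prescribe a report format, so the following is only a sketch of how the collected pieces might be assembled into plain text. The function name, section titles, and layout are assumptions; the inputs mirror the data held by the `SummaryAggregator`.

```python
def render_report(repo_name: str,
                  repo_summary: str,
                  directory_summaries: dict[str, str],
                  git_metadata: dict) -> str:
    """Assemble a plain-text report from already-generated summaries."""
    lines = [f"Repository Report: {repo_name}", "", "Summary", repo_summary, ""]
    lines.append("Directories")
    for path in sorted(directory_summaries):
        lines.append(f"- {path}: {directory_summaries[path]}")
    lines += ["", "Git Metadata"]
    lines.append(f"- Contributors: {len(git_metadata.get('contributors', {}))}")
    # Fall back to 'none' when a list is empty or the key is missing.
    lines.append(f"- Branches: {', '.join(git_metadata.get('branches', [])) or 'none'}")
    lines.append(f"- Tags: {', '.join(git_metadata.get('tags', [])) or 'none'}")
    return "\n".join(lines)
```

Keeping the rendering step free of LLM calls means the report can be regenerated from saved summaries without spending further tokens.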
The Git Analysis Agent (Orchestrator)
The `GitAnalysisAgent` class is the main orchestrator that ties all the modules together. It manages the entire workflow, from repository acquisition to final report generation, and removes any repository it cloned once the analysis finishes.
# agent.py
import os
from typing import Any, Dict
from config import AgentConfig, LLMConfig
from git_operations import GitRepositoryManager, GitAnalyzer
from llm_interface import LLMClient, OpenAILLMClient, LocalLLMClient
from summarization import FileProcessor, LLMSummarizer, SummaryAggregator
class GitAnalysisAgent:
"""
The main orchestrator for the LLM-based Git analysis agent.
Coordinates repository acquisition, Git metadata extraction, file processing,
LLM summarization, and report generation.
"""
def __init__(self, config: AgentConfig):
"""
Initializes the GitAnalysisAgent with the provided configuration.
Args:
config: An instance of AgentConfig containing all necessary settings.
"""
self.config = config
self.repo_manager = GitRepositoryManager(config.repo_path, config.output_dir)
self.llm_client: LLMClient
if config.llm_config.llm_type == 'openai':
self.llm_client = OpenAILLMClient(config.llm_config)
elif config.llm_config.llm_type == 'local':
self.llm_client = LocalLLMClient(config.llm_config)
else:
raise ValueError(f"Unsupported LLM type: {config.llm_config.llm_type}")
self.llm_summarizer = LLMSummarizer(self.llm_client)
self.summary_aggregator = SummaryAggregator(config.output_dir)
self.local_repo_path: Optional[str] = None
self.git_analyzer: Optional[GitAnalyzer] = None
self.file_processor: Optional[FileProcessor] = None
def analyze_repository(self) -> str:
"""
Executes the full repository analysis workflow.
Returns:
The final comprehensive repository summary as a string.
"""
print("\n--- Starting Repository Analysis ---")
try:
# 1. Acquire Repository
self.local_repo_path = self.repo_manager.acquire_repository()
self.git_analyzer = GitAnalyzer(self.local_repo_path)
self.file_processor = FileProcessor(self.local_repo_path)
# 2. Extract Git Metadata
print("\n--- Extracting Git Metadata ---")
git_metadata = self._extract_git_metadata()
self.summary_aggregator.set_git_metadata(git_metadata)
# 3. Analyze and Summarize Files
print("\n--- Analyzing and Summarizing Files ---")
self._analyze_and_summarize_files()
# 4. Summarize Directories
print("\n--- Summarizing Directories ---")
self._summarize_directories()
# 5. Generate Final Repository Summary
print("\n--- Generating Final Repository Summary ---")
repo_name = os.path.basename(self.local_repo_path)
repo_structure = self.git_analyzer.get_repo_structure() if self.git_analyzer else "Could not generate structure."
final_repo_summary = self.llm_summarizer.summarize_repository(
repo_name=repo_name,
repo_structure=repo_structure,
directory_summaries=self.summary_aggregator.directory_summaries,
git_metadata=git_metadata
)
self.summary_aggregator.set_repo_summary(final_repo_summary)
print("\n--- Repository Analysis Complete ---")
return final_repo_summary
except Exception as e:
print(f"An error occurred during analysis: {e}")
return f"Analysis failed due to an error: {e}"
finally:
self.repo_manager.cleanup() # Ensure cloned repos are removed
def _extract_git_metadata(self) -> Dict[str, Any]:
"""Helper to extract and return Git metadata."""
if not self.git_analyzer:
raise RuntimeError("GitAnalyzer not initialized.")
metadata = {
"contributors": self.git_analyzer.get_contributors(),
"recent_commits": self.git_analyzer.get_commit_summary(max_commits=10),
"branches": self.git_analyzer.get_branches(),
"tags": self.git_analyzer.get_tags(),
"repo_structure_preview": self.git_analyzer.get_repo_structure() # Store a preview for context
}
print("Git metadata extracted.")
return metadata
def _analyze_and_summarize_files(self) -> None:
"""
Traverses the repository, reads files, and generates LLM summaries for each.
"""
if not self.local_repo_path or not self.file_processor:
raise RuntimeError("Repository path or file processor not initialized.")
# Walk through the repository, excluding common ignored directories
ignore_dirs = ['.git', '__pycache__', 'venv', '.venv', 'node_modules',
'target', 'build', 'dist', '.idea', '.vscode']
# Add common documentation files to process first, as they often contain purpose
priority_files = ['README.md', 'Dockerfile', 'requirements.txt', 'package.json', 'pom.xml']
processed_files = set()
# Process priority files first if they exist at the root
for p_file in priority_files:
abs_path = os.path.join(self.local_repo_path, p_file)
if os.path.exists(abs_path) and os.path.isfile(abs_path):
relative_path = os.path.relpath(abs_path, self.local_repo_path)
print(f"Processing priority file: {relative_path}")
content = self.file_processor.read_file_content(abs_path)
if content:
summary = self.llm_summarizer.summarize_file(relative_path, content)
self.summary_aggregator.add_file_summary(relative_path, summary)
processed_files.add(relative_path)
for root, dirs, files in os.walk(self.local_repo_path):
# Modify dirs in-place to prune traversal
dirs[:] = [d for d in dirs if d not in ignore_dirs]
for file_name in files:
abs_file_path = os.path.join(root, file_name)
relative_file_path = os.path.relpath(abs_file_path, self.local_repo_path)
if relative_file_path in processed_files:
continue # Skip files already processed as priority
# Skip common non-source files or very large files
if any(relative_file_path.endswith(ext) for ext in ['.png', '.jpg', '.jpeg', '.gif', '.bin', '.zip', '.tar.gz', '.log']) or \
os.path.getsize(abs_file_path) > 1024 * 1024: # e.g., 1MB limit for text files
print(f"Skipping large or non-text file: {relative_file_path}")
continue
print(f"Processing file: {relative_file_path}")
content = self.file_processor.read_file_content(abs_file_path)
if content:
summary = self.llm_summarizer.summarize_file(relative_file_path, content)
self.summary_aggregator.add_file_summary(relative_file_path, summary)
processed_files.add(relative_file_path)
    def _summarize_directories(self) -> None:
        """
        Generates summaries for directories based on their contained file summaries.
        Processes directories from deepest to shallowest so that every sub-directory
        summary exists before its parent directory is summarized.
        """
        if not self.local_repo_path:
            raise RuntimeError("Repository path not initialized.")
        # Get all unique directory paths that have files summarized
        all_file_paths = self.summary_aggregator.file_summaries.keys()
        all_dirs = set()
        for f_path in all_file_paths:
            current_dir = os.path.dirname(f_path)
            while current_dir and current_dir != '.':
                all_dirs.add(current_dir)
                current_dir = os.path.dirname(current_dir)
        # Ensure root directory is included if there are any files
        if all_file_paths:
            all_dirs.add(".")  # Represents the root directory
        # Sort directories by depth (deepest first) to summarize from bottom-up.
        # The root ('.') is given depth -1 so it always comes after top-level
        # directories such as 'src', which also contain no path separator.
        sorted_dirs = sorted(all_dirs,
                             key=lambda d: -1 if d == '.' else d.count(os.sep),
                             reverse=True)
        for dir_path in sorted_dirs:
            print(f"Summarizing directory: {dir_path if dir_path != '.' else 'root'}")
            file_summaries_in_dir = self.summary_aggregator.get_file_summaries_for_directory(dir_path)
            # Include the summaries of direct sub-directories in the current directory's context.
            # This is key for progressive summarization: each child summary already
            # condenses everything beneath it. Top-level directories have an empty
            # dirname, which is how they are matched to the root ('.').
            parent_key = '' if dir_path == '.' else dir_path
            sub_dir_summaries_for_context = {
                existing_dir: existing_summary
                for existing_dir, existing_summary in self.summary_aggregator.directory_summaries.items()
                if existing_dir != '.' and os.path.dirname(existing_dir) == parent_key
            }
            combined_context = {**file_summaries_in_dir, **sub_dir_summaries_for_context}
            if combined_context:
                dir_summary = self.llm_summarizer.summarize_directory(dir_path, combined_context)
                self.summary_aggregator.add_directory_summary(dir_path, dir_summary)
            else:
                print(f"No relevant file or sub-directory summaries found for {dir_path}. Skipping directory summary.")
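The bottom-up traversal can be illustrated in isolation. The following is a minimal sketch, not the agent's code: `bottom_up_dirs`, `summarize_tree`, and `fake_summarize` are hypothetical names, paths are assumed to use `/` as the separator, and the stand-in summarizer simply lists the context keys it receives instead of calling an LLM. It shows that each directory sees only its own files and its direct sub-directories, and that the root is summarized last.

```python
import posixpath
from typing import Callable, Dict, List


def depth(dir_path: str) -> int:
    """Root ('.') has depth 0; 'src' has depth 1; 'src/core' has depth 2."""
    return 0 if dir_path == "." else dir_path.count("/") + 1


def parent_of(path: str) -> str:
    """Return the containing directory of a relative path, using '.' for the root."""
    return posixpath.dirname(path) or "."


def bottom_up_dirs(file_paths: List[str]) -> List[str]:
    """List every directory that contains a file (directly or not), deepest first."""
    dirs = {"."} if file_paths else set()
    for path in file_paths:
        current = parent_of(path)
        while current != ".":
            dirs.add(current)
            current = parent_of(current)
    # Sort by descending depth; the path itself breaks ties deterministically.
    return sorted(dirs, key=lambda d: (-depth(d), d))


def summarize_tree(file_summaries: Dict[str, str],
                   summarize: Callable[[str, Dict[str, str]], str]) -> Dict[str, str]:
    """Build directory summaries bottom-up from existing file summaries."""
    dir_summaries: Dict[str, str] = {}
    for dir_path in bottom_up_dirs(list(file_summaries)):
        # Context = files directly in this directory + already summarized direct children.
        context = {p: s for p, s in file_summaries.items() if parent_of(p) == dir_path}
        context.update({d: s for d, s in dir_summaries.items() if parent_of(d) == dir_path})
        dir_summaries[dir_path] = summarize(dir_path, context)
    return dir_summaries


def fake_summarize(dir_path: str, context: Dict[str, str]) -> str:
    """Stand-in for the LLM call: report which context entries were provided."""
    return f"{dir_path} <- {sorted(context)}"


file_summaries = {
    "README.md": "project overview",
    "src/main.py": "entry point",
    "src/core/engine.py": "core logic",
}
for summary in summarize_tree(file_summaries, fake_summarize).values():
    print(summary)
# src/core <- ['src/core/engine.py']
# src <- ['src/core', 'src/main.py']
# . <- ['README.md', 'src']
```

Because `src` is summarized from `src/core`'s summary rather than from `engine.py` itself, the prompt for any one directory grows with its number of direct entries, not with the size of the whole subtree.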
Running Example and Usage
To demonstrate the agent's capabilities, we will use a small, self-contained Python project. This project includes a `README.md`, `requirements.txt`, `Dockerfile`, and a `src` directory with a `main.py` and `utils.py`.
First, let us define the structure and content of our example repository. You would typically create these files in a directory, initialize a Git repository, and make a few commits.
my_simple_project/
├── .gitignore
├── Dockerfile
├── README.md
├── requirements.txt
└── src/
├── __init__.py
├── main.py
└── utils.py
Content for `my_simple_project` files:
`README.md`:
# My Simple Project
This is a basic Python project demonstrating a simple utility.
It includes a main script and a utility module.
## Features
- Greets a user.
- Performs a simple arithmetic operation.
## Setup
1. Clone the repository.
2. Install dependencies: `pip install -r requirements.txt`
3. Run: `python src/main.py`
## Known Issues
- The arithmetic operation currently only supports integers.
`requirements.txt`:
# No external dependencies for this simple example
# But in a real project, this would list packages like:
# requests==2.28.1
# numpy==1.23.5
`Dockerfile`:
# Use an official Python runtime as a parent image
FROM python:3.9-slim-buster
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# Make port 80 available to the world outside this container
# EXPOSE 80
# Run main.py when the container launches
CMD ["python", "src/main.py"]
`src/__init__.py`: (This file can be empty; its purpose is to mark `src` as a Python package.)
`src/main.py`:
# src/main.py
# Imported as a sibling module so that `python src/main.py` works from the project root
from utils import add_numbers, greet
def run_application():
"""
Main function to run the simple application logic.
"""
print("Starting My Simple Project application...")
name = "Alice"
greet(name)
num1 = 10
num2 = 5
result = add_numbers(num1, num2)
print(f"The sum of {num1} and {num2} is: {result}")
print("Application finished.")
if __name__ == "__main__":
run_application()
`src/utils.py`:
# src/utils.py
def greet(name: str) -> None:
"""
Prints a greeting message to the console.
Args:
name: The name of the person to greet.
"""
print(f"Hello, {name}! Welcome to the utility module.")
def add_numbers(a: int, b: int) -> int:
"""
Adds two integer numbers and returns their sum.
Args:
a: The first integer.
b: The second integer.
Returns:
The sum of a and b.
"""
return a + b
`.gitignore`:
# Byte-compiled / optimized / DLL files
__pycache__/
*.pyc
*.pyd
*.pyo
# Virtual environment
venv/
.venv/
# Editor backup files
*~
To run the analysis, you would typically have a `main.py` script that initializes the agent with the desired configuration. Ensure that the `GitPython` and `openai` libraries are installed (`pip install GitPython openai`). For a local LLM, you also need a running Ollama server with the chosen model already pulled.
# main.py
import os
from config import AgentConfig, LLMConfig
from agent import GitAnalysisAgent
def setup_example_repo(repo_name: str = "my_simple_project") -> str:
"""
Creates a dummy Git repository for demonstration purposes.
"""
repo_path = os.path.join(os.getcwd(), repo_name)
if os.path.exists(repo_path):
import shutil
shutil.rmtree(repo_path) # Clean up previous run
os.makedirs(repo_path, exist_ok=True)
# Create files
with open(os.path.join(repo_path, "README.md"), "w") as f:
f.write("# My Simple Project\n\nThis is a basic Python project demonstrating a simple utility.\nIt includes a main script and a utility module.\n\n## Features\n- Greets a user.\n- Performs a simple arithmetic operation.\n\n## Setup\n1. Clone the repository.\n2. Install dependencies: `pip install -r requirements.txt`\n3. Run: `python src/main.py`\n\n## Known Issues\n- The arithmetic operation currently only supports integers.\n")
with open(os.path.join(repo_path, "requirements.txt"), "w") as f:
f.write("# No external dependencies for this simple example\n")
with open(os.path.join(repo_path, "Dockerfile"), "w") as f:
f.write("FROM python:3.9-slim-buster\nWORKDIR /app\nCOPY . /app\nRUN pip install --no-cache-dir -r requirements.txt\nCMD [\"python\", \"src/main.py\"]\n")
with open(os.path.join(repo_path, ".gitignore"), "w") as f:
f.write("__pycache__/\n*.pyc\nvenv/\n")
src_dir = os.path.join(repo_path, "src")
os.makedirs(src_dir, exist_ok=True)
with open(os.path.join(src_dir, "__init__.py"), "w") as f:
f.write("")
with open(os.path.join(src_dir, "main.py"), "w") as f:
f.write("from utils import add_numbers, greet\n\ndef run_application():\n    print(\"Starting My Simple Project application...\")\n    name = \"Alice\"\n    greet(name)\n    num1 = 10\n    num2 = 5\n    result = add_numbers(num1, num2)\n    print(f\"The sum of {num1} and {num2} is: {result}\")\n    print(\"Application finished.\")\n\nif __name__ == \"__main__\":\n    run_application()\n")
with open(os.path.join(src_dir, "utils.py"), "w") as f:
f.write("def greet(name: str) -> None:\n print(f\"Hello, {name}! Welcome to the utility module.\")\n\ndef add_numbers(a: int, b: int) -> int:\n return a + b\n")
# Initialize Git repository and make an initial commit
import git # type: ignore
repo = git.Repo.init(repo_path)
repo.git.add(A=True)  # Runs `git add -A`, which stages all files and honors .gitignore
repo.index.commit("Initial commit: Set up basic project structure and files")
# Simulate another commit
with open(os.path.join(src_dir, "main.py"), "a") as f:
f.write("\n# Added a comment to simulate a change\n")
repo.index.add([os.path.join(src_dir, "main.py")])
repo.index.commit("Feature: Added a comment to main.py")
print(f"Example repository '{repo_name}' created and initialized at {repo_path}")
return repo_path
def main():
"""
Main function to configure and run the Git analysis agent.
"""
# --- IMPORTANT: Configure your LLM here ---
# For OpenAI: Ensure OPENAI_API_KEY environment variable is set
# llm_config = LLMConfig(llm_type='openai', model_name='gpt-4o-mini')
# For Local LLM (e.g., Ollama running 'llama3' model at default port)
# Make sure Ollama is running and you have 'llama3' model pulled:
# ollama pull llama3
llm_config = LLMConfig(llm_type='local', model_name='llama3', base_url='http://localhost:11434/v1')
# --- Setup example local repository ---
local_repo_path = setup_example_repo("my_simple_project_to_analyze")
# Alternatively, use a remote repository:
# remote_repo_url = "https://github.com/git/git.git" # Example remote repo (will be cloned)
# agent_config = AgentConfig(repo_path=remote_repo_url, llm_config=llm_config)
agent_config = AgentConfig(repo_path=local_repo_path, llm_config=llm_config)
agent = GitAnalysisAgent(agent_config)
final_summary = agent.analyze_repository()
print("\n==============================================================================")
print("FINAL REPOSITORY ANALYSIS REPORT")
print("==============================================================================")
print(final_summary)
print("==============================================================================")
print(f"Detailed summaries are saved in: {agent_config.output_dir}")
if __name__ == "__main__":
main()
When `main.py` is executed, it first sets up the example Git repository locally. Then, it initializes the `AgentConfig` with the path to this local repository and the chosen LLM configuration. The `GitAnalysisAgent` is instantiated and its `analyze_repository` method is called. This method orchestrates the entire process: cloning (if remote), extracting Git metadata, iterating through files to generate individual summaries, aggregating these into directory summaries, and finally synthesizing all this information into a comprehensive repository-level summary using the LLM. All intermediate and final summaries are saved to the `analysis_results` directory.
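After a run, the output directory can be inspected programmatically. The sketch below is an illustration rather than part of the agent: `group_outputs` is a hypothetical helper, and it relies only on the filename prefixes that `SummaryAggregator` uses when saving (`file_summary_`, `dir_summary_`, and `repository_summary.txt`). Anything else, such as a metadata file, is grouped under `other`.

```python
import os
from typing import Dict, List


def group_outputs(output_dir: str) -> Dict[str, List[str]]:
    """Group the files in output_dir by the filename prefixes the agent uses."""
    groups: Dict[str, List[str]] = {"file": [], "directory": [], "repository": [], "other": []}
    for name in sorted(os.listdir(output_dir)):
        if name.startswith("file_summary_"):
            groups["file"].append(name)
        elif name.startswith("dir_summary_"):
            groups["directory"].append(name)
        elif name == "repository_summary.txt":
            groups["repository"].append(name)
        else:
            groups["other"].append(name)
    return groups


output_dir = "analysis_results"  # Default value of AgentConfig.output_dir
if os.path.isdir(output_dir):
    for kind, names in group_outputs(output_dir).items():
        print(f"{kind}: {len(names)} file(s)")
```

Comparing the number of file summaries with the number of files in the repository is a quick way to spot files that were skipped as binary, oversized, or unreadable.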
The result is a repository-level report, backed by per-file and per-directory summaries, that reduces the manual effort of understanding an unfamiliar codebase and its development history.
ADDENDUM: Full Running Example Code
To make the running example fully self-contained and executable, here are all the Python files that constitute the agent and the `main.py` script to run it.
1. `config.py`
# config.py
import os
from typing import Optional
class LLMConfig:
"""
Encapsulates configuration settings for the Large Language Model.
Supports both remote API-based LLMs and local server-based LLMs.
"""
def __init__(self,
llm_type: str, # 'openai', 'local'
api_key: Optional[str] = None,
model_name: str = "gpt-4o-mini",
base_url: Optional[str] = None):
"""
Initializes the LLM configuration.
Args:
llm_type: Specifies the type of LLM ('openai' for remote API, 'local' for a local server).
api_key: The API key for remote LLM services (e.g., OpenAI API key).
This should ideally be loaded from environment variables for security.
model_name: The specific model identifier to use (e.g., "gpt-4o-mini", "llama3").
base_url: The base URL for local LLM servers (e.g., "http://localhost:11434/v1").
"""
if llm_type not in ['openai', 'local']:
raise ValueError("llm_type must be 'openai' or 'local'")
self.llm_type = llm_type
self.api_key = api_key if api_key else os.getenv("OPENAI_API_KEY")
self.model_name = model_name
self.base_url = base_url
if self.llm_type == 'openai' and not self.api_key:
raise ValueError("OPENAI_API_KEY environment variable or api_key must be set for OpenAI LLM type.")
if self.llm_type == 'local' and not self.base_url:
raise ValueError("base_url must be set for local LLM type.")
def __repr__(self) -> str:
"""Provides a string representation of the LLMConfig object."""
return (f"LLMConfig(llm_type='{self.llm_type}', model_name='{self.model_name}', "
f"base_url='{self.base_url if self.base_url else 'N/A'}')")
class AgentConfig:
"""
Main configuration class for the Git analysis agent.
Holds repository path and LLM configuration.
"""
def __init__(self,
repo_path: str,
llm_config: LLMConfig,
output_dir: str = "analysis_results"):
"""
Initializes the agent configuration.
Args:
repo_path: The path to the local Git repository or its remote URL.
llm_config: An instance of LLMConfig containing LLM-specific settings.
output_dir: The directory where analysis results and summaries will be stored.
"""
self.repo_path = repo_path
self.llm_config = llm_config
self.output_dir = output_dir
# Ensure output directory exists
os.makedirs(self.output_dir, exist_ok=True)
def __repr__(self) -> str:
"""Provides a string representation of the AgentConfig object."""
return (f"AgentConfig(repo_path='{self.repo_path}', llm_config={self.llm_config}, "
f"output_dir='{self.output_dir}')")
2. `git_operations.py`
# git_operations.py
import os
import shutil
import git # type: ignore # gitpython library
from typing import Optional, Any, Dict
from collections import defaultdict
from datetime import datetime
class GitRepositoryManager:
"""
Manages the acquisition and cleanup of Git repositories.
Handles cloning remote repositories and validating local paths.
"""
def __init__(self, repo_source_path: str, clone_dir: str = "cloned_repos"):
"""
Initializes the GitRepositoryManager.
Args:
repo_source_path: The path to the local Git repository or its remote URL.
clone_dir: The directory where remote repositories will be cloned.
"""
self.repo_source_path = repo_source_path
self.clone_dir = clone_dir
self.local_repo_path: Optional[str] = None
self.is_cloned = False
os.makedirs(self.clone_dir, exist_ok=True)
def acquire_repository(self) -> str:
"""
Acquires the Git repository, either by using a local path or cloning a remote one.
Returns:
The absolute path to the local Git repository directory.
Raises:
ValueError: If the provided path is not a valid Git repository.
git.InvalidGitRepositoryError: If cloning fails or the local path is not a Git repo.
git.GitCommandError: If a git command fails during cloning.
"""
if os.path.isdir(self.repo_source_path) and \
os.path.exists(os.path.join(self.repo_source_path, '.git')):
# It's already a local Git repository
self.local_repo_path = os.path.abspath(self.repo_source_path)
print(f"Using local repository at: {self.local_repo_path}")
elif self.repo_source_path.startswith(('http://', 'https://', 'git@')):
# It's a remote URL, clone it
repo_name = self.repo_source_path.rstrip('/').split('/')[-1].removesuffix('.git')  # Requires Python 3.9+
target_path = os.path.join(self.clone_dir, repo_name)
if os.path.exists(target_path):
print(f"Repository already cloned to {target_path}. Pulling latest changes...")
repo = git.Repo(target_path)
origin = repo.remotes.origin
origin.pull()
else:
print(f"Cloning remote repository {self.repo_source_path} to {target_path}...")
git.Repo.clone_from(self.repo_source_path, target_path)
self.local_repo_path = os.path.abspath(target_path)
self.is_cloned = True
print(f"Repository successfully cloned/updated at: {self.local_repo_path}")
else:
raise ValueError(f"Invalid repository source: {self.repo_source_path}. "
"Must be a local path to a Git repo or a remote URL.")
# Final check to ensure it's a valid Git repository
try:
_ = git.Repo(self.local_repo_path)
except git.InvalidGitRepositoryError as e:
raise ValueError(f"The path '{self.local_repo_path}' is not a valid Git repository.") from e
return self.local_repo_path
def cleanup(self) -> None:
"""
Removes the cloned repository directory if it was cloned by this manager.
"""
if self.is_cloned and self.local_repo_path and os.path.exists(self.local_repo_path):
print(f"Cleaning up cloned repository: {self.local_repo_path}")
shutil.rmtree(self.local_repo_path)
self.local_repo_path = None
self.is_cloned = False
class GitAnalyzer:
"""
Analyzes a local Git repository to extract metadata such as contributors,
commit history, branches, and tags.
"""
def __init__(self, repo_path: str):
"""
Initializes the GitAnalyzer with the path to the local repository.
Args:
repo_path: The absolute path to the local Git repository.
"""
try:
self.repo = git.Repo(repo_path)
self.repo_path = repo_path
except git.InvalidGitRepositoryError as e:
raise ValueError(f"'{repo_path}' is not a valid Git repository.") from e
def get_contributors(self) -> dict:
"""
Analyzes commit history to identify contributors and their commit counts.
Returns:
A dictionary where keys are contributor names (author name <email>)
and values are their respective commit counts.
"""
contributors = defaultdict(int)
for commit in self.repo.iter_commits():
author_info = f"{commit.author.name} <{commit.author.email}>"
contributors[author_info] += 1
return dict(contributors)
def get_commit_summary(self, max_commits: int = 50) -> list[dict]:
"""
Retrieves a summary of recent commits.
Args:
max_commits: The maximum number of commits to retrieve.
Returns:
A list of dictionaries, each representing a commit with its hash, author,
date, and message.
"""
commit_list = []
for i, commit in enumerate(self.repo.iter_commits()):
if i >= max_commits:
break
commit_list.append({
"hash": commit.hexsha,
"author": f"{commit.author.name} <{commit.author.email}>",
"date": datetime.fromtimestamp(commit.committed_date).strftime('%Y-%m-%d %H:%M:%S'),
"message": commit.message.strip()
})
return commit_list
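    def get_branches(self) -> list[str]:
        """
        Lists all local and remote-tracking branches in the repository.
        Returns:
            A list of branch names. Remote-tracking branches appear as
            "<remote>/<branch>", for example "origin/main".
        """
        local_branches = [head.name for head in self.repo.heads]
        # remote.name would only yield the remote's own name (e.g. "origin");
        # remote.refs yields the branches tracked from that remote.
        remote_branches = [ref.name for remote in self.repo.remotes for ref in remote.refs]
        return local_branches + remote_branches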
def get_tags(self) -> list[str]:
"""
Lists all tags (often representing releases) in the repository.
Returns:
A list of tag names.
"""
return [tag.name for tag in self.repo.tags]
def get_repo_structure(self) -> str:
"""
Generates a simplified tree-like representation of the repository's file structure.
Excludes typical Git-related directories and common build artifacts.
Returns:
A string representing the directory tree.
"""
structure_lines = []
ignore_patterns = ['.git', '__pycache__', 'venv', '.venv', 'node_modules',
'target', 'build', 'dist', '.idea', '.vscode']
for root, dirs, files in os.walk(self.repo_path):
# Filter out ignored directories
dirs[:] = [d for d in dirs if d not in ignore_patterns]
level = root.replace(self.repo_path, '').count(os.sep)
indent = ' ' * level
relative_path = os.path.relpath(root, self.repo_path)
if relative_path == '.': # Don't print '.' for the root itself
structure_lines.append(f"{os.path.basename(self.repo_path)}/")
else:
structure_lines.append(f"{indent}|-- {os.path.basename(root)}/")
subindent = ' ' * (level + 1)
for f in files:
structure_lines.append(f"{subindent}|-- {f}")
return "\n".join(structure_lines)
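In `acquire_repository`, the clone directory is named after the last segment of the remote URL. As a hedged illustration of that step, the hypothetical helper `repo_name_from_url` below (not part of the agent) also covers two cases that are easy to miss: a trailing slash, and scp-style addresses such as `git@host:repo.git` that contain no slash before the repository name. Only a trailing `.git` is stripped, so names that merely contain `.git` stay intact.

```python
def repo_name_from_url(url: str) -> str:
    """Derive a directory name from an HTTP(S) or scp-style Git URL."""
    # Treat ':' like '/' so that 'git@host:repo.git' splits at the colon.
    tail = url.rstrip("/").replace(":", "/").split("/")[-1]
    # Strip only a trailing '.git'; slicing keeps this compatible with older Python 3 versions.
    return tail[:-4] if tail.endswith(".git") else tail


for url in (
    "https://github.com/git/git.git",
    "https://github.com/user/repo/",
    "git@github.com:user/my.github.io.git",
):
    print(f"{url} -> {repo_name_from_url(url)}")
# https://github.com/git/git.git -> git
# https://github.com/user/repo/ -> repo
# git@github.com:user/my.github.io.git -> my.github.io
```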
3. `llm_interface.py`
# llm_interface.py
import os
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional
from openai import OpenAI # type: ignore
from config import LLMConfig
class LLMClient(ABC):
"""
Abstract base class for LLM clients, defining the common interface.
"""
@abstractmethod
def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
"""
Sends a prompt to the LLM and returns its completion.
Args:
prompt: The text prompt to send to the LLM.
temperature: Controls the randomness of the output. Higher values mean more random.
Returns:
The generated text completion from the LLM.
"""
pass
class OpenAILLMClient(LLMClient):
"""
Concrete implementation of LLMClient for OpenAI API.
"""
def __init__(self, config: LLMConfig):
"""
Initializes the OpenAI LLM client.
Args:
config: An LLMConfig instance containing OpenAI-specific settings.
"""
if config.llm_type != 'openai':
raise ValueError("LLMConfig must be of type 'openai' for OpenAILLMClient.")
if not config.api_key:
raise ValueError("OpenAI API key is missing in configuration.")
self.client = OpenAI(api_key=config.api_key)
self.model_name = config.model_name
print(f"Initialized OpenAI LLM Client with model: {self.model_name}")
def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
"""
Sends a prompt to the OpenAI API and returns its completion.
"""
try:
response = self.client.chat.completions.create(
model=self.model_name,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
],
temperature=temperature,
)
return response.choices[0].message.content if response.choices[0].message.content else ""
except Exception as e:
print(f"Error calling OpenAI API: {e}")
return f"Error: Could not get completion from OpenAI API - {e}"
class LocalLLMClient(LLMClient):
"""
Concrete implementation of LLMClient for local LLM servers (e.g., Ollama).
Assumes a compatible OpenAI-like API endpoint.
"""
def __init__(self, config: LLMConfig):
"""
Initializes the Local LLM client.
Args:
config: An LLMConfig instance containing local LLM-specific settings.
"""
if config.llm_type != 'local':
raise ValueError("LLMConfig must be of type 'local' for LocalLLMClient.")
if not config.base_url:
raise ValueError("Base URL is missing for local LLM configuration.")
self.client = OpenAI(base_url=config.base_url, api_key="ollama") # API key is often dummy for local
self.model_name = config.model_name
print(f"Initialized Local LLM Client with model: {self.model_name} at {config.base_url}")
def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
"""
Sends a prompt to the local LLM server and returns its completion.
"""
try:
response = self.client.chat.completions.create(
model=self.model_name,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
],
temperature=temperature,
)
return response.choices[0].message.content if response.choices[0].message.content else ""
except Exception as e:
print(f"Error calling Local LLM API: {e}")
return f"Error: Could not get completion from Local LLM API - {e}"
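Because both clients implement the same `get_completion` interface, a stub can stand in for a real model when testing the rest of the pipeline offline. The sketch below repeats the abstract base class so that it runs on its own; `EchoLLMClient` is a hypothetical name and not part of the agent. It records every prompt and returns a deterministic reply, which makes prompt sizes and call counts easy to check without network access.

```python
from abc import ABC, abstractmethod
from typing import List


class LLMClient(ABC):
    """Same interface as the abstract base class above, repeated to keep the sketch standalone."""

    @abstractmethod
    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        ...


class EchoLLMClient(LLMClient):
    """Hypothetical offline stub: records prompts and returns a deterministic reply."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def get_completion(self, prompt: str, temperature: float = 0.7) -> str:
        self.prompts.append(prompt)
        first_line = prompt.splitlines()[0] if prompt else ""
        return f"[stub summary of {len(prompt)} chars] {first_line[:40]}"


client = EchoLLMClient()
reply = client.get_completion("File: src/utils.py\nContent: ...")
print(reply)
print(f"Prompts recorded: {len(client.prompts)}")
```

An instance of such a stub could be passed wherever the agent expects an `LLMClient`, for example to `LLMSummarizer`, under the assumption that only `get_completion` is called on it.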
4. `summarization.py`
```python
# summarization.py
import os
import json
import mimetypes # Used for file type guessing
from typing import Dict, List, Tuple, Any, Optional
from llm_interface import LLMClient
class FileProcessor:
"""
Handles reading and processing of individual files within the repository.
"""
def __init__(self, repo_root: str):
"""
Initializes the FileProcessor.
Args:
repo_root: The root directory of the Git repository.
"""
self.repo_root = repo_root
    def read_file_content(self, file_path: str) -> Optional[str]:
        """
        Reads the content of a specified file.
        Handles common encoding issues and skips binary files.
        Args:
            file_path: The absolute path to the file.
        Returns:
            The content of the file as a string, or None if it's a binary file
            or cannot be read.
        """
        if not os.path.exists(file_path) or not os.path.isfile(file_path):
            print(f"Warning: File not found or is not a file: {file_path}")
            return None
        # Heuristic to skip binary files. Several text formats are registered under
        # application/* (JSON, XML, JavaScript), so they are allowed explicitly;
        # otherwise priority files such as package.json would be skipped.
        text_like_types = {'application/json', 'application/xml', 'application/javascript'}
        mime_type_guess, _ = mimetypes.guess_type(file_path)
        if mime_type_guess and not mime_type_guess.startswith('text') \
                and mime_type_guess not in text_like_types:
            print(f"Skipping binary file: {file_path} (MIME type: {mime_type_guess})")
            return None
        # Attempt to read as text
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                return f.read()
        except UnicodeDecodeError:
            print(f"Skipping non-UTF-8 or binary file: {file_path}")
            return None
        except Exception as e:
            print(f"Error reading file {file_path}: {e}")
            return None
class LLMSummarizer:
"""
Uses an LLM to generate summaries for file contents and aggregated information.
"""
def __init__(self, llm_client: LLMClient):
"""
Initializes the LLMSummarizer with an LLM client.
Args:
llm_client: An instance of a concrete LLMClient implementation.
"""
self.llm_client = llm_client
def summarize_file(self, file_path: str, file_content: str) -> str:
"""
Generates a concise summary for a single file's content.
Args:
file_path: The relative path of the file being summarized.
file_content: The full content of the file.
Returns:
A summary string generated by the LLM.
"""
prompt = (
f"You are an expert software engineer tasked with summarizing code and configuration files. "
f"Provide a concise summary of the purpose, key functionalities, and important configurations "
f"or dependencies found in the following file. Focus on what this file *does* and its role "
f"within a larger project. Keep the summary under 150 words.\n\n"
f"File: {file_path}\n"
f"Content:\n```\n{file_content}\n```\n\n"
f"Concise Summary:"
)
return self.llm_client.get_completion(prompt)
def summarize_directory(self, directory_path: str, combined_context: Dict[str, str]) -> str:
"""
Generates a summary for a directory based on the summaries of its contained files and sub-directories.
Args:
directory_path: The relative path of the directory.
combined_context: A dictionary mapping file/sub-directory paths to their summaries within this directory.
Returns:
A summary string for the directory.
"""
if not combined_context:
return f"Directory '{directory_path}' contains no relevant files or summaries."
summaries_text = "\n".join([f"- {path}: {summary}" for path, summary in combined_context.items()])
dir_name_display = directory_path if directory_path != "." else "the root directory"
prompt = (
f"You are an expert software architect analyzing a project structure. "
f"Based on the following file and sub-directory summaries, provide a concise overview of the purpose "
f"and primary functionalities of {dir_name_display}. "
f"Identify any common themes, dependencies, or architectural patterns. "
f"Keep the summary under 200 words.\n\n"
f"Directory: {dir_name_display}\n"
f"Contextual Summaries:\n{summaries_text}\n\n"
f"Concise Directory Summary:"
)
return self.llm_client.get_completion(prompt)
def summarize_repository(self,
repo_name: str,
repo_structure: str,
directory_summaries: Dict[str, str],
git_metadata: Dict[str, Any]) -> str:
"""
Generates a comprehensive summary of the entire repository.
Args:
repo_name: The name of the repository.
repo_structure: A string representation of the repository's file structure.
directory_summaries: A dictionary mapping directory paths to their summaries.
git_metadata: A dictionary containing aggregated Git metadata (contributors, commits, etc.).
Returns:
A comprehensive summary string for the entire repository.
"""
dir_summaries_text = "\n".join([f"- {path}: {summary}" for path, summary in directory_summaries.items()])
contributors_text = "\n".join([f" - {author} ({count} commits)" for author, count in git_metadata.get('contributors', {}).items()])
recent_commits_text = "\n".join([f" - {c['date']} by {c['author']}: {c['message']}" for c in git_metadata.get('recent_commits', [])[:5]])
branches_text = ", ".join(git_metadata.get('branches', []))
tags_text = ", ".join(git_metadata.get('tags', []))
prompt = (
f"You are a highly intelligent AI assistant specializing in software project analysis. "
f"Your task is to provide a comprehensive and detailed summary of the Git repository named '{repo_name}'. "
f"Synthesize information from the repository's structure, directory-level summaries, and Git metadata. "
f"Cover the following aspects:\n"
f"1. **Overall Purpose and Key Functionalities:** What is the project about? What problems does it solve?\n"
f"2. **Architectural Overview/Structure:** Describe the main components and how they are organized.\n"
f"3. **Core Technologies/Dependencies:** Identify programming languages, frameworks, and key libraries.\n"
f"4. **Development Environment/Setup:** How would one set up and run this project? (e.g., Docker, `requirements.txt`)\n"
f"5. **Key Contributors and Activity:** Who are the main developers and what is the recent activity?\n"
f"6. **Release Strategy/Versioning:** How are releases managed (tags, branches)?\n"
f"7. **Known Issues/Limitations:** Any explicit mentions of problems or areas for improvement (from README/comments).\n"
f"8. **Evolution/Changes:** High-level overview of recent significant changes.\n\n"
f"Repository Name: {repo_name}\n"
f"Repository Structure:\n{repo_structure}\n\n"
f"Directory Summaries:\n{dir_summaries_text}\n\n"
f"Git Metadata:\n"
f" Contributors:\n{contributors_text}\n"
f" Recent Commits:\n{recent_commits_text}\n"
f" Branches: {branches_text}\n"
f" Tags (Releases): {tags_text}\n\n"
f"Comprehensive Repository Summary:"
)
return self.llm_client.get_completion(prompt, temperature=0.2) # Lower temperature for factual summary
class SummaryAggregator:
"""
Manages the storage and aggregation of file and directory summaries.
"""
def __init__(self, output_dir: str):
"""
Initializes the SummaryAggregator.
Args:
output_dir: The directory where summaries will be saved.
"""
self.output_dir = output_dir
os.makedirs(output_dir, exist_ok=True)
self.file_summaries: Dict[str, str] = {}
self.directory_summaries: Dict[str, str] = {}
self.repo_summary: Optional[str] = None
self.git_metadata: Dict[str, Any] = {}
def add_file_summary(self, relative_path: str, summary: str) -> None:
"""
Adds a summary for a specific file.
Args:
relative_path: The path of the file relative to the repository root.
summary: The LLM-generated summary for the file.
"""
self.file_summaries[relative_path] = summary
# Sanitize path for filename
safe_filename = relative_path.replace(os.sep, '_').replace('.', '_')
self._save_summary(f"file_summary_{safe_filename}.txt", summary)
def add_directory_summary(self, relative_path: str, summary: str) -> None:
"""
Adds a summary for a specific directory.
Args:
relative_path: The path of the directory relative to the repository root.
summary: The LLM-generated summary for the directory.
"""
self.directory_summaries[relative_path] = summary
# Sanitize path for filename
safe_filename = 'root' if relative_path == '.' else relative_path.replace(os.sep, '_')  # Avoids "dir_summary_..txt" for the root
self._save_summary(f"dir_summary_{safe_filename}.txt", summary)
def set_repo_summary(self, summary: str) -> None:
"""
Sets the final comprehensive repository summary.
Args:
summary: The LLM-generated summary for the entire repository.
"""
self.repo_summary = summary
self._save_summary("repository_summary.txt", summary)
def set_git_metadata(self, metadata: Dict[str, Any]) -> None:
"""
Stores the extracted Git metadata.
Args:
metadata: A dictionary containing Git metadata.
"""
self.git_metadata = metadata
self._save_summary("git_metadata.json", json.dumps(metadata, indent=2))

    def get_file_summaries_for_directory(self, relative_dir_path: str) -> Dict[str, str]:
        """
        Retrieves file summaries belonging directly to a specific directory (not subdirectories).

        Args:
            relative_dir_path: The relative path of the directory (e.g., "src", "." for root).

        Returns:
            A dictionary of file paths to summaries within that directory.
        """
        # os.path.dirname() returns "" for files that sit directly in the repository root
        target_dir = "" if relative_dir_path == "." else relative_dir_path
        return {p: s for p, s in self.file_summaries.items() if os.path.dirname(p) == target_dir}

    def _save_summary(self, filename: str, content: str) -> None:
        """
        Helper method to save a summary to a file.
        """
        file_path = os.path.join(self.output_dir, filename)
        try:
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(content)
            print(f"Saved summary to {file_path}")
        except Exception as e:
            print(f"Error saving summary to {file_path}: {e}")

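The path sanitization above flattens a relative path into a filename by replacing separators and dots with underscores, so distinct paths such as `src/a.py` and `src_a.py` end up with the same summary file. The following is a minimal sketch of one possible alternative, not part of the original agent: the helper name `summary_filename` is hypothetical, and appending a short hash of the path is an assumption about how one might keep the names unique.

```python
import hashlib
import re


def summary_filename(relative_path: str, prefix: str = "file_summary") -> str:
    """Build a flat, filesystem-safe filename for a summary (hypothetical helper)."""
    # Normalize Windows separators so the same path hashes identically everywhere
    normalized = relative_path.replace("\\", "/")
    # Readable part: collapse every run of non-alphanumeric characters to "_"
    readable = re.sub(r"[^A-Za-z0-9]+", "_", normalized).strip("_") or "root"
    # Short digest of the full path keeps names unique when readable parts collide
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{readable}_{digest}.txt"


if __name__ == "__main__":
    for path in ("src/a.py", "src_a.py", "."):
        print(path, "->", summary_filename(path))
```

The readable part keeps the output directory easy to browse, while the digest guarantees that two different paths never overwrite each other's summaries.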
5. `agent.py`
# agent.py
import os
from typing import Any, Dict, Optional
from config import AgentConfig, LLMConfig
from git_operations import GitRepositoryManager, GitAnalyzer
from llm_interface import LLMClient, OpenAILLMClient, LocalLLMClient
from summarization import FileProcessor, LLMSummarizer, SummaryAggregator


class GitAnalysisAgent:
    """
    The main orchestrator for the LLM-based Git analysis agent.
    Coordinates repository acquisition, Git metadata extraction, file processing,
    LLM summarization, and report generation.
    """

    def __init__(self, config: AgentConfig):
        """
        Initializes the GitAnalysisAgent with the provided configuration.

        Args:
            config: An instance of AgentConfig containing all necessary settings.
        """
        self.config = config
        self.repo_manager = GitRepositoryManager(config.repo_path, config.output_dir)
        self.llm_client: LLMClient
        if config.llm_config.llm_type == 'openai':
            self.llm_client = OpenAILLMClient(config.llm_config)
        elif config.llm_config.llm_type == 'local':
            self.llm_client = LocalLLMClient(config.llm_config)
        else:
            raise ValueError(f"Unsupported LLM type: {config.llm_config.llm_type}")
        self.llm_summarizer = LLMSummarizer(self.llm_client)
        self.summary_aggregator = SummaryAggregator(config.output_dir)
        # Populated once the repository has been acquired
        self.local_repo_path: Optional[str] = None
        self.git_analyzer: Optional[GitAnalyzer] = None
        self.file_processor: Optional[FileProcessor] = None

    def analyze_repository(self) -> str:
        """
        Executes the full repository analysis workflow.

        Returns:
            The final comprehensive repository summary as a string.
        """
        print("\n--- Starting Repository Analysis ---")
        try:
            # 1. Acquire Repository
            self.local_repo_path = self.repo_manager.acquire_repository()
            self.git_analyzer = GitAnalyzer(self.local_repo_path)
            self.file_processor = FileProcessor(self.local_repo_path)

            # 2. Extract Git Metadata
            print("\n--- Extracting Git Metadata ---")
            git_metadata = self._extract_git_metadata()
            self.summary_aggregator.set_git_metadata(git_metadata)

            # 3. Analyze and Summarize Files
            print("\n--- Analyzing and Summarizing Files ---")
            self._analyze_and_summarize_files()

            # 4. Summarize Directories
            print("\n--- Summarizing Directories ---")
            self._summarize_directories()

            # 5. Generate Final Repository Summary
            print("\n--- Generating Final Repository Summary ---")
            repo_name = os.path.basename(self.local_repo_path)
            repo_structure = self.git_analyzer.get_repo_structure() if self.git_analyzer else "Could not generate structure."
            final_repo_summary = self.llm_summarizer.summarize_repository(
                repo_name=repo_name,
                repo_structure=repo_structure,
                directory_summaries=self.summary_aggregator.directory_summaries,
                git_metadata=git_metadata
            )
            self.summary_aggregator.set_repo_summary(final_repo_summary)
            print("\n--- Repository Analysis Complete ---")
            return final_repo_summary
        except Exception as e:
            print(f"An error occurred during analysis: {e}")
            return f"Analysis failed due to an error: {e}"
        finally:
            self.repo_manager.cleanup()  # Ensure cloned repos are removed

    def _extract_git_metadata(self) -> Dict[str, Any]:
        """Helper to extract and return Git metadata."""
        if not self.git_analyzer:
            raise RuntimeError("GitAnalyzer not initialized.")
        metadata = {
            "contributors": self.git_analyzer.get_contributors(),
            "recent_commits": self.git_analyzer.get_commit_summary(max_commits=10),
            "branches": self.git_analyzer.get_branches(),
            "tags": self.git_analyzer.get_tags(),
            "repo_structure_preview": self.git_analyzer.get_repo_structure()  # Store a preview for context
        }
        print("Git metadata extracted.")
        return metadata

    def _analyze_and_summarize_files(self) -> None:
        """
        Traverses the repository, reads files, and generates LLM summaries for each.
        """
        if not self.local_repo_path or not self.file_processor:
            raise RuntimeError("Repository path or file processor not initialized.")
        # Directories that are pruned from the walk (VCS data, caches, build output, IDE settings)
        ignore_dirs = ['.git', '__pycache__', 'venv', '.venv', 'node_modules',
                       'target', 'build', 'dist', '.idea', '.vscode']
        # Documentation and build files that are processed first,
        # because they often state the project's purpose
        priority_files = ['README.md', 'Dockerfile', 'requirements.txt', 'package.json', 'pom.xml']
        # Extensions treated as non-source content, and the size limit for text files
        skip_extensions = ('.png', '.jpg', '.jpeg', '.gif', '.bin', '.zip', '.tar.gz', '.log')
        max_file_size = 1024 * 1024  # 1 MB
        processed_files = set()

        # Process priority files first if they exist at the root
        for p_file in priority_files:
            abs_path = os.path.join(self.local_repo_path, p_file)
            if os.path.exists(abs_path) and os.path.isfile(abs_path):
                relative_path = os.path.relpath(abs_path, self.local_repo_path)
                print(f"Processing priority file: {relative_path}")
                content = self.file_processor.read_file_content(abs_path)
                if content:
                    summary = self.llm_summarizer.summarize_file(relative_path, content)
                    self.summary_aggregator.add_file_summary(relative_path, summary)
                    processed_files.add(relative_path)

        for root, dirs, files in os.walk(self.local_repo_path):
            # Modify dirs in-place so os.walk does not descend into ignored directories
            dirs[:] = [d for d in dirs if d not in ignore_dirs]
            for file_name in files:
                abs_file_path = os.path.join(root, file_name)
                relative_file_path = os.path.relpath(abs_file_path, self.local_repo_path)
                if relative_file_path in processed_files:
                    continue  # Skip files already processed as priority
                # Skip common non-source files and very large files
                if relative_file_path.endswith(skip_extensions) or \
                        os.path.getsize(abs_file_path) > max_file_size:
                    print(f"Skipping large or non-text file: {relative_file_path}")
                    continue
                print(f"Processing file: {relative_file_path}")
                content = self.file_processor.read_file_content(abs_file_path)
                if content:
                    summary = self.llm_summarizer.summarize_file(relative_file_path, content)
                    self.summary_aggregator.add_file_summary(relative_file_path, summary)
                    processed_files.add(relative_file_path)

    def _summarize_directories(self) -> None:
        """
        Generates summaries for directories based on their contained file summaries.
        Processes directories from deepest to shallowest, so that each directory's
        sub-directory summaries already exist when the directory itself is summarized.
        """
        if not self.local_repo_path:
            raise RuntimeError("Repository path not initialized.")
        # Collect every directory that contains a summarized file, directly or indirectly
        all_file_paths = self.summary_aggregator.file_summaries.keys()
        all_dirs = set()
        for f_path in all_file_paths:
            current_dir = os.path.dirname(f_path)
            while current_dir and current_dir != '.':
                all_dirs.add(current_dir)
                current_dir = os.path.dirname(current_dir)
        # Ensure root directory is included if there are any files
        if all_file_paths:
            all_dirs.add(".")  # Represents the root directory

        # Sort directories by depth (deepest first) to summarize from the bottom up.
        # The root "." is given depth -1: it contains no separator, so counting
        # separators alone would tie it with top-level directories such as "src"
        # and it could be summarized before its children.
        def depth(path: str) -> int:
            return -1 if path == "." else path.count(os.sep)

        sorted_dirs = sorted(all_dirs, key=depth, reverse=True)
        for dir_path in sorted_dirs:
            print(f"Summarizing directory: {dir_path if dir_path != '.' else 'root'}")
            file_summaries_in_dir = self.summary_aggregator.get_file_summaries_for_directory(dir_path)
            # Include sub-directory summaries in the current directory's context.
            # This is key for progressive summarization.
            # We look for summaries of directories that are direct children of dir_path.
            sub_dir_summaries_for_context = {}
            for existing_dir, existing_summary in self.summary_aggregator.directory_summaries.items():
                # e.g., if dir_path is "src", existing_dir could be "src/utils".
                # os.path.dirname("src") is "", so it is normalized to "." to match the root.
                parent_dir = os.path.dirname(existing_dir) or "."
                if existing_dir != dir_path and parent_dir == dir_path:
                    sub_dir_summaries_for_context[existing_dir] = existing_summary
            combined_context = {**file_summaries_in_dir, **sub_dir_summaries_for_context}
            if combined_context:
                dir_summary = self.llm_summarizer.summarize_directory(dir_path, combined_context)
                self.summary_aggregator.add_directory_summary(dir_path, dir_summary)
            else:
                print(f"No relevant file or sub-directory summaries found for {dir_path}. Skipping directory summary.")

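The bottom-up directory pass in `_summarize_directories` can be checked in isolation, without an LLM or a real repository. The sketch below is an illustration under stated assumptions rather than the agent's code: the function names `bottom_up_directories` and `direct_children` are hypothetical, paths are assumed to be relative and to use forward slashes (hence `posixpath`), and ties in depth are broken alphabetically only to make the output deterministic.

```python
import posixpath
from typing import Dict, List


def bottom_up_directories(file_paths: List[str]) -> List[str]:
    """Return every directory holding a file (directly or not), deepest first, root last."""
    dirs = set()
    for path in file_paths:
        # Walk up from the file's directory; dirname() yields "" at the root
        current = posixpath.dirname(path)
        while current:
            dirs.add(current)
            current = posixpath.dirname(current)
    if file_paths:
        dirs.add(".")  # "." stands for the repository root

    def depth(directory: str) -> int:
        # The root sorts below every top-level directory
        return -1 if directory == "." else directory.count("/")

    # Deepest directories first; alphabetical order within the same depth
    return sorted(dirs, key=lambda d: (-depth(d), d))


def direct_children(parent: str, summarized: Dict[str, str]) -> Dict[str, str]:
    """Select summaries of directories whose immediate parent is `parent`."""
    return {
        d: s for d, s in summarized.items()
        if d != parent and (posixpath.dirname(d) or ".") == parent
    }


if __name__ == "__main__":
    files = ["README.md", "src/main.py", "src/utils/io.py", "docs/guide.md"]
    print(bottom_up_directories(files))
    # ['src/utils', 'docs', 'src', '.']
    done = {"src/utils": "helpers", "docs": "guides", "src": "application code"}
    print(direct_children(".", done))
    # {'docs': 'guides', 'src': 'application code'}
```

Because the root is always last in this ordering, the summaries of `src` and `docs` are available by the time the root is summarized, which is what lets the final repository summary build on directory-level context.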
6. `main.py`
# main.py
import os
import shutil
import git # type: ignore
from config import AgentConfig, LLMConfig
from agent import GitAnalysisAgent


def setup_example_repo(repo_name: str = "my_simple_project") -> str:
    """
    Creates a dummy Git repository for demonstration purposes.
    """
    repo_path = os.path.join(os.getcwd(), repo_name)
    if os.path.exists(repo_path):
        print(f"Cleaning up existing example repository at {repo_path}")
        shutil.rmtree(repo_path)  # Clean up previous run
    os.makedirs(repo_path, exist_ok=True)

    # Create files. The example is run as a module (python -m src.main) because
    # src/main.py imports from the "src" package.
    with open(os.path.join(repo_path, "README.md"), "w", encoding="utf-8") as f:
        f.write("# My Simple Project\n\nThis is a basic Python project demonstrating a simple utility.\nIt includes a main script and a utility module.\n\n## Features\n- Greets a user.\n- Performs a simple arithmetic operation.\n\n## Setup\n1. Clone the repository.\n2. Install dependencies: `pip install -r requirements.txt`\n3. Run: `python -m src.main`\n\n## Known Issues\n- The arithmetic operation currently only supports integers.\n")
    with open(os.path.join(repo_path, "requirements.txt"), "w", encoding="utf-8") as f:
        f.write("# No external dependencies for this simple example\n")
    with open(os.path.join(repo_path, "Dockerfile"), "w", encoding="utf-8") as f:
        f.write("FROM python:3.9-slim-buster\nWORKDIR /app\nCOPY . /app\nRUN pip install --no-cache-dir -r requirements.txt\nCMD [\"python\", \"-m\", \"src.main\"]\n")
    with open(os.path.join(repo_path, ".gitignore"), "w", encoding="utf-8") as f:
        f.write("__pycache__/\n*.pyc\nvenv/\n")
    src_dir = os.path.join(repo_path, "src")
    os.makedirs(src_dir, exist_ok=True)
    with open(os.path.join(src_dir, "__init__.py"), "w", encoding="utf-8") as f:
        f.write("")
    with open(os.path.join(src_dir, "main.py"), "w", encoding="utf-8") as f:
        f.write("from src.utils import add_numbers, greet\n\ndef run_application():\n    \"\"\"\n    Main function to run the simple application logic.\n    \"\"\"\n    print(\"Starting My Simple Project application...\")\n    name = \"Alice\"\n    greet(name)\n\n    num1 = 10\n    num2 = 5\n    result = add_numbers(num1, num2)\n    print(f\"The sum of {num1} and {num2} is: {result}\")\n    print(\"Application finished.\")\n\nif __name__ == \"__main__\":\n    run_application()\n")
    with open(os.path.join(src_dir, "utils.py"), "w", encoding="utf-8") as f:
        f.write("def greet(name: str) -> None:\n    \"\"\"\n    Prints a greeting message to the console.\n\n    Args:\n        name: The name of the person to greet.\n    \"\"\"\n    print(f\"Hello, {name}! Welcome to the utility module.\")\n\ndef add_numbers(a: int, b: int) -> int:\n    \"\"\"\n    Adds two integer numbers and returns their sum.\n\n    Args:\n        a: The first integer.\n        b: The second integer.\n\n    Returns:\n        The sum of a and b.\n    \"\"\"\n    return a + b\n")

    # Initialize Git repository and make an initial commit.
    # Staging goes through the git CLI ("git add -A") so that .gitignore is
    # respected and nothing under .git/ is picked up.
    repo = git.Repo.init(repo_path)
    repo.git.add(A=True)
    repo.index.commit("Initial commit: Set up basic project structure and files")

    # Simulate another commit
    with open(os.path.join(src_dir, "main.py"), "a", encoding="utf-8") as f:
        f.write("\n# Added a comment to simulate a change\n")
    repo.index.add([os.path.join(src_dir, "main.py")])
    repo.index.commit("Feature: Added a comment to main.py")
    print(f"Example repository '{repo_name}' created and initialized at {repo_path}")
    return repo_path


def main():
    """
    Main function to configure and run the Git analysis agent.
    """
    # --- IMPORTANT: Configure your LLM here ---
    # For OpenAI: Ensure OPENAI_API_KEY environment variable is set
    # llm_config = LLMConfig(llm_type='openai', model_name='gpt-4o-mini')
    # For a local LLM (e.g., Ollama serving the 'llama3' model on its default port).
    # Make sure Ollama is running and the 'llama3' model has been pulled:
    #   ollama run llama3
    llm_config = LLMConfig(llm_type='local', model_name='llama3', base_url='http://localhost:11434/v1')

    # --- Setup example local repository ---
    local_repo_path = setup_example_repo("my_simple_project_to_analyze")
    # Alternatively, use a remote repository:
    # remote_repo_url = "https://github.com/git/git.git"  # Example remote repo (will be cloned)
    # agent_config = AgentConfig(repo_path=remote_repo_url, llm_config=llm_config)
    agent_config = AgentConfig(repo_path=local_repo_path, llm_config=llm_config)
    agent = GitAnalysisAgent(agent_config)
    final_summary = agent.analyze_repository()

    print("\n==============================================================================")
    print("FINAL REPOSITORY ANALYSIS REPORT")
    print("==============================================================================")
    print(final_summary)
    print("==============================================================================")
    print(f"Detailed summaries are saved in: {agent_config.output_dir}")


if __name__ == "__main__":
    main()