Sunday, May 11, 2025

Analyzing GitHub Repositories with LLMs Using Knowledge Graphs and Embeddings

Part 1: Analysis of Code Bases


When using Large Language Models (LLMs) like GPT-4, Claude, or local models such as Mistral to analyze extensive GitHub repositories, developers often face significant challenges due to the limited context length (typically 8K, 32K, or 128K tokens). Directly feeding the entire repository into an LLM is impractical. To tackle this issue, Embedding-based Retrieval-Augmented Generation (RAG) can be combined with Knowledge Graphs to provide a robust and intelligent approach to repository analysis. This article describes the methodology, detailing how to overcome LLM context limitations and enhance repository analysis through structured knowledge representation.


Challenges of Repository Analysis with LLMs


A typical GitHub repository can contain thousands of code files, documentation, configurations, and dependencies. Modern LLMs, despite their impressive capabilities, have strict token limits that cannot accommodate entire repositories at once. This constraint demands smarter strategies for feeding meaningful subsets of data into the LLM in response to user queries.


Two complementary methods address this limitation:

Embedding-based semantic retrieval (RAG)

Integration of Knowledge Graphs for structured context


Solution Overview


The proposed solution combines three essential components:

1. Chunking and Semantic Embeddings: Segmenting the repository content into meaningful chunks and generating semantic embeddings for efficient retrieval.

2. Knowledge Graph Construction: Representing relationships among repository entities (code components, classes, methods, dependencies) to capture structural knowledge.

3. Hybrid Retrieval (RAG + Graph): Dynamically retrieving both semantic chunks and graph-structured context based on user queries.


Step-by-Step Implementation


Step 1: Repository Ingestion


Clone the GitHub repository and parse files:


import git, os

def clone_and_parse_repo(url, repo_dir):
    # Clone the repository, then walk it and collect source files for common languages
    git.Repo.clone_from(url, repo_dir)
    code_files = []
    for root, dirs, files in os.walk(repo_dir):
        for file in files:
            if file.endswith(('.py', '.js', '.java', '.cpp', '.cs', '.go')):
                # errors='ignore' avoids crashes on files with unexpected encodings
                with open(os.path.join(root, file), 'r', encoding='utf-8', errors='ignore') as f:
                    code_files.append((file, f.read()))
    return code_files


Step 2: Chunking and Embedding


Split files into semantic units (functions, methods, classes) and compute embeddings using SentenceTransformers:


from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def chunk_and_embed(code_files):
    embeddings = []
    for filename, content in code_files:
        # Naive chunking: split on top-level function definitions. The first chunk
        # is the module preamble (imports, constants) and is kept as-is; the 'def '
        # prefix removed by split() is restored on the remaining chunks.
        chunks = content.split('\ndef ')
        for i, chunk in enumerate(chunks):
            chunk_text = chunk if i == 0 else 'def ' + chunk
            emb = model.encode(chunk_text)
            embeddings.append((filename, chunk_text, emb))
    return embeddings


These embeddings are stored in vector databases like FAISS or Chroma for rapid semantic retrieval.
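

As a minimal sketch, an in-memory FAISS index over these embeddings can be built as follows (assuming the faiss package is installed and embeddings is the output of chunk_and_embed above; build_faiss_index is an illustrative helper, not part of any particular library):


import numpy as np
import faiss

def build_faiss_index(embeddings):
    # Stack the chunk vectors into the float32 matrix FAISS expects
    vectors = np.array([emb for _, _, emb in embeddings], dtype='float32')
    index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search; 384 dimensions for all-MiniLM-L6-v2
    index.add(vectors)
    return index


This index is what the hybrid retrieval function in Step 4 searches against.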


Step 3: Constructing a Knowledge Graph


Knowledge graphs explicitly represent entities (functions, classes, modules) and their relationships (dependencies, calls, inheritance), thus capturing structural repository knowledge. Use a graph database such as Neo4j for graph storage and querying:


from neo4j import GraphDatabase
import re

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

def build_graph(code_files):
    with driver.session() as session:
        session.run("MATCH (n) DETACH DELETE n")  # Clear existing graph
        for filename, content in code_files:
            # Create a node for every function defined in the file
            functions = re.findall(r'def (\w+)\(', content)
            for func in functions:
                session.run("MERGE (:Function {name: $name, file: $file})",
                            name=func, file=filename)
            # Attribute calls to their enclosing function by scanning each function
            # body separately (a simple regex heuristic, not a full parser)
            for body in content.split('\ndef ')[1:]:
                match = re.match(r'(\w+)\(', body)
                if not match:
                    continue
                caller = match.group(1)
                for call in re.findall(r'(\w+)\(', body):
                    if call in functions and call != caller:
                        session.run("""
                            MATCH (f1:Function {name: $caller, file: $file})
                            MATCH (f2:Function {name: $callee})
                            MERGE (f1)-[:CALLS]->(f2)
                        """, caller=caller, callee=call, file=filename)


Step 4: Hybrid Retrieval Strategy


Combine semantic retrieval (vector similarity search) and graph-based retrieval (structured relationships):


import numpy as np
import faiss

def hybrid_retrieve(query, embeddings, index, top_k=5):
    # Semantic retrieval: nearest chunks by embedding similarity
    query_emb = model.encode(query)
    _, indices = index.search(np.array([query_emb], dtype='float32'), top_k)
    semantic_chunks = [embeddings[i][1] for i in indices[0]]

    # Graph-based retrieval: functions whose names match the query
    with driver.session() as session:
        result = session.run("""
            MATCH (f:Function)
            WHERE toLower(f.name) CONTAINS toLower($query)
            RETURN f.name, f.file LIMIT $limit
        """, query=query, limit=top_k)
        graph_chunks = [f"Function {record['f.name']} in {record['f.file']}"
                        for record in result]

    return semantic_chunks + graph_chunks


Step 5: Querying the LLM


Pass the combined retrieval results to the LLM, keeping the prompt within its context limits:


from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY')

def ask_llm(query, context_chunks):
    # Join the retrieved chunks into a single context block for the prompt
    context = '\n'.join(context_chunks)
    prompt = f"Analyze these repository insights:\n{context}\n\nQuestion: {query}"
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1  # low temperature for more deterministic analysis
    )
    return response.choices[0].message.content
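

Wired together, the pieces above form a simple end-to-end pipeline. A minimal usage sketch, assuming the helper functions from Steps 1-5 plus the illustrative build_faiss_index from Step 2 (the repository URL is only a placeholder):


repo_files = clone_and_parse_repo("https://github.com/example/project.git", "repo_clone")
embeddings = chunk_and_embed(repo_files)
index = build_faiss_index(embeddings)
build_graph(repo_files)

chunks = hybrid_retrieve("authentication", embeddings, index, top_k=5)
print(ask_llm("How is authentication implemented in this repository?", chunks))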


Benefits of Knowledge Graph Integration


Knowledge Graphs significantly enhance repository analysis by:

Providing structural insights: Capturing explicit dependencies, calls, and architectural patterns, which embeddings alone might miss.

Improving retrieval accuracy: Combining semantic context with precise, structured relationships improves query responses.

Enabling complex queries: Supporting complex analytical questions (“Which modules depend on authentication logic?”) that embeddings alone struggle to answer accurately; see the query sketch after this list.
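

For instance, a dependency question like the one above can be answered with a graph query before the results are handed to the LLM. A minimal sketch against the schema from Step 3 (modules_depending_on is a hypothetical helper, and the matching is a simple name-based heuristic):


def modules_depending_on(session, target_name):
    # Find functions (and their files) that call any function whose name mentions
    # the target, e.g. "authenticate"
    result = session.run("""
        MATCH (caller:Function)-[:CALLS]->(callee:Function)
        WHERE toLower(callee.name) CONTAINS toLower($target)
        RETURN DISTINCT caller.file AS module, caller.name AS function
    """, target=target_name)
    return [(record["module"], record["function"]) for record in result]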


Best Practices and Recommendations

Graph Granularity: Model meaningful repository entities and their relationships clearly to ensure efficient querying.

Hybrid Queries: Always combine semantic and graph-based retrieval to enrich context fed into the LLM.

LLM Choice:

For robust analysis with deeper reasoning, use GPT-4-Turbo or Claude 3.

For privacy-sensitive environments or lower-cost deployments, choose local models like Mistral or Mixtral quantized with llama.cpp.

Continuous Updates: Regularly update the knowledge graph and embeddings to reflect repository changes accurately; a minimal refresh sketch follows this list.
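

One lightweight way to keep both up to date is to re-process only the files that changed since the last run. A sketch using GitPython, as in Step 1 (the diff range HEAD~1 is only an example; changed_source_files is a hypothetical helper):


import git

def changed_source_files(repo_dir, since_rev="HEAD~1"):
    # List source files touched since the given revision
    repo = git.Repo(repo_dir)
    changed = repo.git.diff(since_rev, "--name-only").splitlines()
    return [path for path in changed
            if path.endswith(('.py', '.js', '.java', '.cpp', '.cs', '.go'))]


The returned paths can then be fed back through the chunking, embedding, and graph-construction steps so that the index and graph stay in sync with the repository.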


Conclusion


By integrating Knowledge Graphs with semantic embeddings in a hybrid retrieval strategy, developers can effectively overcome context length limitations of LLMs when analyzing extensive GitHub repositories. This combined approach leverages both unstructured semantic knowledge and structured relationships, delivering deeper, more accurate insights into software projects.


This method transforms repository analysis from a limited token-constrained exercise into a scalable, intelligent process, greatly enhancing productivity and insight for development teams.


Part 2: How Does an LLM Understand Knowledge Graphs?


A Large Language Model (LLM) such as GPT-4, Claude, or Mistral does not inherently “understand” a knowledge graph (KG) in the human cognitive sense. Instead, the LLM accesses information stored in knowledge graphs through carefully designed prompts and structured context provided by software systems that mediate between the KG and the LLM. Let’s clarify exactly how this process works.


Step 1: Extracting Information from the Knowledge Graph


A knowledge graph organizes information as interconnected entities and relationships. For instance, a software project’s KG might have entities like “functions,” “classes,” or “modules,” and relationships like “calls,” “depends on,” or “inherits from.”


The graph itself typically exists externally (in databases such as Neo4j or Amazon Neptune, or in in-memory graph libraries) and is accessed via structured queries, commonly in a graph query language such as Cypher:


Example (Neo4j Cypher query):


MATCH (f:Function)-[:CALLS]->(g:Function)
WHERE f.name = 'authenticateUser'
RETURN g.name


The output of such a query is structured data, which can be rendered as a list or in narrative form, for example:


authenticateUser calls functions:
- validateCredentials
- logAuthenticationAttempt
- checkAccountStatus


Step 2: Converting Graph Information into Textual Context


LLMs operate exclusively on text (tokens). Therefore, the structured information extracted from the KG needs to be converted into a readable textual format. A typical textual prompt derived from the above query might look like:


The function authenticateUser calls the following functions:
- validateCredentials
- logAuthenticationAttempt
- checkAccountStatus
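

In practice, this conversion is a small piece of glue code. A minimal sketch, reusing the Neo4j driver from Part 1 (calls_to_text is a hypothetical helper name):


def calls_to_text(function_name):
    # Query the knowledge graph and render the result as plain prompt text
    with driver.session() as session:
        result = session.run(
            "MATCH (f:Function {name: $name})-[:CALLS]->(g:Function) RETURN g.name AS callee",
            name=function_name)
        callees = [record["callee"] for record in result]
    lines = [f"The function {function_name} calls the following functions:"]
    lines += [f"- {callee}" for callee in callees]
    return "\n".join(lines)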


Step 3: Prompt Engineering for Structured Understanding


The LLM’s comprehension of a knowledge graph relies on proper prompt engineering: the prompt should define entities and relationships explicitly. Good prompts typically include clear contextual cues:


Here are the dependencies and functions involved in authentication:

authenticateUser -> validateCredentials
authenticateUser -> logAuthenticationAttempt
authenticateUser -> checkAccountStatus

Explain potential risks or optimizations in this structure.


Step 4: Leveraging Chain-of-Thought Reasoning


Advanced LLM techniques, such as Chain-of-Thought (CoT) prompting, further enhance the model’s ability to reason over provided KG information. By explicitly asking the LLM to think through intermediate steps, developers significantly improve its ability to interpret and analyze graph-derived contexts.


For example, the prompt might be structured as:


Here is the call structure extracted from the knowledge graph:

authenticateUser calls:
- validateCredentials (validates username and password)
- logAuthenticationAttempt (records success or failure)
- checkAccountStatus (ensures account is active)

Analyze step-by-step whether this call structure introduces security or performance risks.


The LLM responds step-by-step, first examining each node and relationship in isolation, and then reasoning about potential risks based on this structured understanding.


Step 5: Using Few-shot Examples to Improve Graph Interpretation


LLMs can also be given “few-shot” examples—explicit demonstrations of how to interpret and reason about the knowledge graph structures provided. By providing examples, the LLM learns the intended reasoning style:


Example:


Given graph relationships:
- login calls validateInput
- validateInput calls sanitizeUserData

Q: Could the structure introduce security vulnerabilities?
A: Yes. If sanitizeUserData does not adequately handle input sanitization, it could introduce vulnerabilities such as SQL injection.

Now, using similar reasoning, analyze the provided authentication graph structure.


Such few-shot prompting greatly improves the LLM’s ability to correctly interpret and reason about relationships in graphs.
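

In code, such a few-shot example can be supplied as prior turns in the chat messages. A sketch reusing the OpenAI client from Part 1, Step 5 (the example content simply mirrors the prompt above; ask_with_few_shot is a hypothetical helper):


few_shot_messages = [
    {"role": "user", "content": "Given graph relationships:\n- login calls validateInput\n- validateInput calls sanitizeUserData\n\nQ: Could the structure introduce security vulnerabilities?"},
    {"role": "assistant", "content": "Yes. If sanitizeUserData does not adequately handle input sanitization, it could introduce vulnerabilities such as SQL injection."},
]

def ask_with_few_shot(question):
    # Prepend the worked example so the model imitates its reasoning style
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=few_shot_messages + [{"role": "user", "content": question}],
        temperature=0.1,
    )
    return response.choices[0].message.content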


Step 6: Combining Semantic and Graph Contexts


Often, applications provide hybrid retrieval by combining KG-structured contexts with semantic embeddings. The KG gives precise relational context, while embeddings provide broader semantic similarity context:


For example:

Knowledge Graph context clearly defines explicit dependencies and relationships.

Embedding-based context provides additional information from code snippets or documentation, capturing more implicit meaning.


Thus, the LLM has both precise relational details (from the KG) and contextual nuance (from embeddings), significantly enhancing its understanding and reasoning capabilities.


Summary: How LLMs “Understand” Knowledge Graphs


To summarize clearly, an LLM does not have intrinsic graph-processing capability. Instead, it understands knowledge graphs by receiving carefully constructed textual contexts derived from those graphs. The process can be described as follows:

1. Query the Graph: Extract structured information.

2. Convert to Text: Format relationships explicitly in readable prompts.

3. Structured Prompts: Clearly represent entities and relations.

4. Chain-of-Thought: Ask explicitly for step-by-step reasoning.

5. Few-shot Prompting: Provide illustrative examples for reasoning guidance.

6. Hybrid Context: Combine semantic embeddings and graph context for richer interpretation.


This carefully engineered approach enables LLMs to effectively reason over structured knowledge, greatly enhancing their ability to analyze complex data sets like large GitHub repositories.
