Saturday, April 25, 2026

DETECTING LLM-GENERATED CONTENT IN ACADEMIC AND PROFESSIONAL WORK



SECTION 1: WHY DETECTION OF LLM-GENERATED CONTENT MATTERS

The emergence of large language models such as GPT-4, Claude, Gemini, LLaMA, Mistral, and their many derivatives has fundamentally changed the landscape of written and coded work in universities, companies, and research institutions. These models can produce fluent, coherent, and superficially convincing text and code with astonishing speed and at virtually zero marginal cost. A student who once needed hours to write a ten-page essay can now obtain a polished draft in seconds. A developer who once needed days to implement a module can receive a working skeleton in minutes. This shift is not inherently bad, but it creates a profound challenge for anyone who needs to evaluate whether a piece of work reflects genuine human understanding, effort, and creativity.

The core problem is one of authenticity and fairness. When a university assigns an essay or a coding project, the purpose is not merely to obtain a deliverable. The purpose is to assess whether the student has internalized the material, can reason about it independently, and can communicate that reasoning clearly. If the deliverable is produced entirely or substantially by an LLM, the assessment loses its validity. The grade awarded no longer reflects the student's knowledge but rather the student's ability to prompt a machine, which is a very different skill. This creates a fairness problem for students who do the work honestly, because they compete against peers who have effectively outsourced the cognitive labor.

Beyond fairness, there is a deeper epistemic concern. LLMs are trained on vast corpora of text and code, and they are extraordinarily good at producing plausible-sounding content. However, they do not understand the material in any meaningful sense. They predict tokens based on statistical patterns. This means that LLM-generated text can contain subtle errors, hallucinated facts, and logically inconsistent arguments that are difficult to spot without deep domain knowledge. In a medical, legal, or engineering context, such errors can have serious consequences. A student who submits LLM-generated work and receives a passing grade may enter professional practice without the knowledge they were supposed to have acquired. The detection of LLM usage is therefore not just an academic integrity issue but a safety issue.

In the corporate world, the stakes are different but equally real. When a consultant submits a report, a lawyer drafts a brief, or an engineer writes a specification, the client or employer expects that the work reflects genuine expertise and judgment. If the work is largely LLM-generated, the professional is misrepresenting their contribution. Intellectual property questions also arise, because LLM outputs may incorporate patterns from copyrighted training data in ways that create legal liability. Companies that use LLM-generated code in production systems without proper review may introduce security vulnerabilities, licensing violations, or subtle logic errors that are hard to trace.

There is also a pedagogical dimension. The process of struggling with a problem, making mistakes, and refining one's thinking is how humans develop expertise. If LLMs short-circuit this process, students may graduate with credentials but without the underlying competence those credentials are supposed to certify. Detection tools are therefore not just punitive instruments. They are also diagnostic tools that help educators understand where students are struggling and where additional support is needed.

Finally, there is the question of intellectual honesty and attribution. Academic and professional norms require that authors take responsibility for the claims they make. If a significant portion of a document was generated by a machine, the human author cannot fully stand behind every claim, because they may not have verified or even understood all of it. Proper attribution of LLM assistance is therefore an ethical obligation, and tools that detect undisclosed LLM usage help enforce this norm.

SECTION 2: THE LANDSCAPE OF LLM DETECTION METHODS

Detecting whether text or code was generated by an LLM is a genuinely difficult problem, and no single method is perfectly reliable. The field has evolved rapidly, and practitioners typically combine multiple signals to arrive at a confident assessment. Understanding each method deeply is essential before building a detection tool, because the quality of the tool depends entirely on the quality of the signals it uses.

2.1 Perplexity and Burstiness Analysis

Perplexity is a measure of how surprised a language model is by a given piece of text. Formally, if a language model assigns probability P to a sequence of n tokens, the perplexity is defined as:

PP = exp( - (1/n) * sum( log P(token_i | context) ) )

A low perplexity means the model found the text highly predictable, which is exactly what you would expect if the text was generated by a similar model. Human-written text tends to have higher perplexity because humans make idiosyncratic word choices, use unusual constructions, and occasionally violate statistical norms in ways that reflect genuine thought rather than pattern completion.

The burstiness concept captures a related but distinct phenomenon. Human writers tend to alternate between complex, long sentences and short, punchy ones. Their perplexity scores vary considerably from sentence to sentence. LLMs, by contrast, tend to produce text with more uniform perplexity across sentences. The variance of perplexity scores is therefore a useful signal. Low variance suggests LLM generation, while high variance suggests human authorship.
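As a sketch, both quantities can be computed directly from per-token log-probabilities, however those are obtained; the helper names below are illustrative and assume natural-log probabilities supplied by some scoring model:

```python
import math
from statistics import mean, pstdev

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-probability, per the formula above."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def burstiness(per_sentence_logprobs: list[list[float]]) -> tuple[float, float]:
    """Mean and population standard deviation of per-sentence perplexity.
    Low variation relative to the mean suggests LLM generation."""
    ppls = [perplexity(lp) for lp in per_sentence_logprobs]
    return mean(ppls), pstdev(ppls)
```

For example, a sequence whose tokens all have probability 0.5 has perplexity exactly 2, matching the intuition of a uniform choice between two outcomes at each step.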

The practical challenge with perplexity-based detection is that it requires access to a language model to compute the scores, and the model used for detection may not be the same model that generated the text. Different models have different statistical fingerprints, so a text generated by GPT-4 may not appear low-perplexity to a LLaMA-based detector. Despite this limitation, perplexity analysis remains one of the most principled and theoretically grounded approaches to LLM detection.

2.2 Stylometric Analysis

Stylometry is the study of linguistic style as a fingerprint of authorship. It has a long history in literary scholarship, where it has been used to attribute anonymous texts to known authors. In the context of LLM detection, stylometric features serve as signals that distinguish machine-generated from human-generated text.

The most informative stylometric features for LLM detection include the following. Vocabulary richness, measured by the type-token ratio or more sophisticated measures like the Yule K characteristic, captures how diverse the vocabulary is relative to the total number of words. LLMs tend to use a moderately rich vocabulary that is neither too repetitive nor too exotic, whereas human writers show more extreme patterns in both directions. Sentence length distribution captures how varied the sentence lengths are. LLMs tend to produce sentences of moderate and relatively uniform length, while human writers show more dramatic variation. Function word usage captures the distribution of words like "the", "a", "of", "in", and "that", which are highly stable within a given author's style but differ between LLMs and humans. Punctuation patterns capture how commas, semicolons, dashes, and other marks are used. LLMs tend to use punctuation correctly but somewhat mechanically, while human writers show more idiosyncratic patterns. Hedging and uncertainty language captures how often phrases like "it seems", "arguably", "one might say", or "it is worth noting" appear. LLMs are notorious for overusing such phrases, particularly at the beginnings and ends of paragraphs.

Transition phrase overuse is one of the most reliable single indicators of LLM-generated text. Phrases like "Furthermore,", "Moreover,", "In conclusion,", "It is important to note that", "In summary,", "Delving into", "It is worth mentioning", and "In the realm of" appear with dramatically higher frequency in LLM outputs than in human writing. This is because LLMs learn from text that includes many such connective phrases, and they tend to reproduce them as a default structural scaffolding.

Another powerful stylometric signal is the absence of personal voice. Human writers, even when writing formally, tend to leave traces of their personality, their particular way of framing problems, their preferred analogies, and their characteristic rhetorical moves. LLM-generated text tends to be more generic, more balanced, and more neutral than human text, because the model is optimizing for broad acceptability rather than expressing a genuine point of view.

2.3 Semantic Coherence and Logical Structure Analysis

LLMs are very good at producing locally coherent text, meaning that each sentence follows plausibly from the previous one. However, they sometimes struggle with global coherence, meaning that the overall argument of a long document may be less tightly organized than a human expert would produce. This is because LLMs generate text token by token without a global plan, relying instead on the local context window to maintain coherence.

A detection tool can exploit this by analyzing the semantic similarity between adjacent paragraphs and between distant paragraphs. In human-written text, there is typically a clear argumentative thread that connects the introduction to the conclusion, with each section building on the previous one in a purposeful way. In LLM-generated text, the connections between sections are often more superficial, consisting of transition phrases rather than genuine logical dependencies.
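A minimal, self-contained sketch of this check can use bag-of-words cosine similarity as a stand-in for neural sentence embeddings (a real implementation would swap in an embedding model); it compares the average similarity of adjacent paragraph pairs against that of distant pairs:

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Crude bag-of-words vector; a placeholder for real embeddings."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def coherence_profile(paragraphs: list[str]) -> dict:
    """Mean similarity of adjacent paragraph pairs vs. all distant pairs.
    Text that is only locally coherent shows a large gap between the two."""
    vecs = [bow_vector(p) for p in paragraphs]
    adjacent = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    distant = [cosine(vecs[i], vecs[j])
               for i in range(len(vecs)) for j in range(i + 2, len(vecs))]
    return {
        "adjacent_mean": sum(adjacent) / len(adjacent) if adjacent else 0.0,
        "distant_mean": sum(distant) / len(distant) if distant else 0.0,
    }
```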

Another related signal is the presence of contradictions or inconsistencies. Because LLMs rely on a finite context window rather than a persistent global model of what they have already said, they sometimes make claims in one section that are subtly inconsistent with claims in another section. A detection tool that uses an LLM to check for internal consistency can therefore use the model's own capabilities against itself.

2.4 Code-Specific Detection Signals

When the work being analyzed contains code rather than prose, a different set of signals becomes relevant. LLM-generated code has several characteristic patterns that distinguish it from code written by experienced human developers.

The most prominent pattern is the overuse of comments. LLMs tend to add a comment to almost every line or block of code, explaining what the code does in a way that is often redundant and sometimes slightly inaccurate. Experienced human developers follow the principle that good code is self-documenting and that comments should explain why, not what. LLM-generated code typically violates this principle systematically.

Variable and function naming in LLM-generated code tends to be very descriptive and verbose, following a pattern like "calculate_total_price_including_tax" rather than the more varied and sometimes abbreviated naming that human developers use. This is because LLMs are trained on code from tutorials and documentation, which tends to use very explicit naming for pedagogical clarity.

Error handling in LLM-generated code is often either absent or overly generic. LLMs frequently produce try-except blocks that catch all exceptions with a broad "except Exception as e" clause and print a generic error message, rather than handling specific error conditions in a principled way. This pattern is a strong indicator of LLM generation.

The structure of LLM-generated code also tends to be very regular and modular in a somewhat mechanical way. Functions are typically short, well-separated, and follow a predictable template. Human-written code, especially in production systems, tends to show more organic growth, with some functions that are longer and more complex than others, and with structural choices that reflect the specific constraints and history of the project.

Import organization in LLM-generated Python code typically follows the PEP 8 standard perfectly, with standard library imports first, then third-party imports, then local imports. While this is good practice, the mechanical perfection of the organization is itself a signal, because human developers often have slightly messier import sections that reflect the iterative nature of their development process.

2.5 Watermark Detection

Some LLM providers embed statistical watermarks in their outputs by subtly biasing the token selection process during generation. The idea is to partition the vocabulary into "green" and "red" tokens and to preferentially select green tokens during generation. A detector that knows the partitioning scheme can then check whether a piece of text contains an unusually high proportion of green tokens, which would indicate that it was generated by the watermarked model.
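A sketch of this style of detector, with a hash-based partition standing in for a provider's actual (secret) scheme, might look like the following. The z-score z = (g - γT) / sqrt(T·γ·(1-γ)) tests whether the observed green-token count g exceeds what the green fraction γ would produce by chance over T token transitions:

```python
import hashlib

def is_green(prev_token: str, token: str, green_fraction: float = 0.5) -> bool:
    """Deterministically assign a token to the 'green' list via a hash
    seeded by the previous token. A real watermark uses the provider's
    secret key and tokenizer; this partition is purely illustrative."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < green_fraction

def green_z_score(tokens: list[str], green_fraction: float = 0.5) -> float:
    """z-score of the green-token count against the no-watermark baseline:
    z = (g - gamma*T) / sqrt(T * gamma * (1 - gamma))."""
    t = len(tokens) - 1  # number of token transitions
    if t <= 0:
        return 0.0
    g = sum(is_green(tokens[i], tokens[i + 1], green_fraction)
            for i in range(t))
    expected = green_fraction * t
    var = t * green_fraction * (1 - green_fraction)
    return (g - expected) / var ** 0.5
```

A large positive z-score would suggest watermarked output; near zero is consistent with unwatermarked text.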

The main limitation of watermark detection is that it requires knowledge of the specific watermarking scheme used by the model, and most commercial LLM providers do not publish this information. Furthermore, watermarks can be removed by paraphrasing the text, so a determined user can evade watermark detection with relatively little effort. Watermark detection is therefore most useful in controlled environments where the specific model used is known and the watermark scheme is accessible.

2.6 LLM-as-Judge Detection

A particularly powerful approach is to use a capable LLM as a judge to evaluate whether a piece of text or code was generated by another LLM. This approach leverages the fact that LLMs have internalized the statistical patterns of their own outputs and can often recognize those patterns in new text. The judge is given the text to be analyzed along with a detailed prompt that describes the specific signals to look for, and it returns a structured assessment.

The LLM-as-judge approach has several advantages. It can handle nuanced cases that rule-based methods would miss. It can provide natural language explanations for its assessments, which are useful for communicating findings to students or employees. It can be updated simply by changing the prompt, without retraining any models. And it can combine multiple signals in a flexible, context-sensitive way.
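A minimal sketch of such a judge, with the model call abstracted behind an injected `complete` function, might look as follows; the prompt wording and JSON schema here are illustrative choices, not a fixed API:

```python
import json
from typing import Callable

JUDGE_PROMPT = (
    "You are an expert reviewer assessing whether a passage was written by an LLM.\n"
    "Look for: uniform sentence rhythm, stock transition phrases, absence of\n"
    "personal voice, and generic balanced framing.\n"
    'Respond with JSON only: {{"score": 0.0-1.0, "indicators": [...], '
    '"explanation": "..."}}\n\n'
    "PASSAGE:\n{passage}\n"
)

def judge_passage(passage: str, complete: Callable[[str], str]) -> dict:
    """Run one passage through a judge model. `complete` is any function
    that takes a prompt and returns the model's raw text, whether it
    wraps a remote API or a local inference backend."""
    raw = complete(JUDGE_PROMPT.format(passage=passage))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Malformed judge output: fall back to a neutral, flagged result
        return {"score": 0.5, "indicators": [],
                "explanation": "unparseable judge output: " + raw[:200]}
```

Injecting `complete` keeps the judge logic testable without any model, and lets the same code drive local and remote backends.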

The main disadvantage is that the judge LLM can be wrong, and its errors are not always predictable. It may also be biased toward flagging text that is simply well-written or that follows academic conventions, even if it was written by a human. Calibration and validation against known human-written and LLM-generated texts is therefore essential.

2.7 Hybrid Approaches

The most reliable detection systems combine multiple methods. A typical hybrid approach might work as follows. First, the document is segmented into passages of manageable size. Second, each passage is analyzed using statistical methods to compute perplexity, burstiness, and stylometric features. Third, the passage is evaluated by an LLM judge that provides a likelihood score and a natural language explanation. Fourth, the scores from all methods are combined using a weighted ensemble to produce a final likelihood estimate. Fifth, the results are aggregated across passages to produce a document-level assessment.

The weighting of different methods in the ensemble should ideally be calibrated on a labeled dataset of known human-written and LLM-generated texts. In the absence of such a dataset, reasonable default weights can be assigned based on the theoretical reliability of each method.

SECTION 3: ARCHITECTURE OF THE LLM DETECTION APPLICATION

The application we are going to build is called LLMSleuth. It takes a file as input, which can be a plain text file, a Python source file, a PDF document, or a Microsoft Word document. It segments the file into passages, analyzes each passage using a combination of statistical methods and an LLM judge, and produces an Excel report with detailed findings.

The architecture of LLMSleuth consists of the following major components.

The file ingestion layer is responsible for reading the input file and converting it into a uniform internal representation. It handles different file formats using appropriate parsing libraries and extracts both the textual content and the structural metadata, such as page numbers, section headings, and line numbers.

The segmentation engine divides the document into passages that are small enough to be analyzed individually but large enough to contain meaningful stylistic signals. For prose text, passages are typically paragraphs or groups of sentences. For code, passages are typically functions, classes, or logical blocks.

The statistical analysis engine computes perplexity scores using a local language model, calculates stylometric features, and identifies specific linguistic patterns that are associated with LLM generation. It produces a numerical feature vector for each passage.

The LLM judge engine sends each passage to a language model with a carefully crafted prompt and receives a structured response containing a likelihood score, a list of specific indicators, and a natural language explanation. It supports both local models (via llama.cpp, Ollama, MLX, or HuggingFace Transformers with CUDA/ROCm support) and remote models (via the OpenAI or Anthropic APIs).
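One way to keep these backends interchangeable is a small protocol plus a factory; the sketch below is an assumed design, with a stub backend standing in for real llama.cpp, Ollama, or API client wrappers:

```python
import json
from typing import Protocol

class JudgeBackend(Protocol):
    """Anything that can turn a prompt into model text."""
    def complete(self, prompt: str) -> str: ...

class EchoBackend:
    """Stand-in backend for tests and dry runs; a real backend would
    wrap a local inference engine or a remote API client."""
    def complete(self, prompt: str) -> str:
        return '{"score": 0.5, "indicators": [], "explanation": "stub"}'

def make_backend(name: str) -> JudgeBackend:
    """Hypothetical factory; real entries would map names like 'ollama'
    or 'openai' to their wrapper classes."""
    backends = {"echo": EchoBackend}
    try:
        return backends[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name}")
```

Because JudgeBackend is a structural Protocol, backend classes need not inherit from anything; they only need a matching complete method.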

The ensemble scoring engine combines the outputs of the statistical analysis engine and the LLM judge engine to produce a final likelihood score for each passage. It also computes the percentage contribution of each passage to the overall document.

The report generation engine takes the scored passages and produces an Excel workbook with the required columns, applying color coding and formatting to make the report easy to read and interpret.

The command-line interface ties all the components together and provides a user-friendly way to invoke the tool with the appropriate options.

SECTION 4: BUILDING THE APPLICATION STEP BY STEP

We will now build LLMSleuth step by step, explaining each component in detail and illustrating the key concepts with code examples. The full production-ready implementation is provided in the Addendum at the end of this article.

4.1 Setting Up the Project Structure

A clean project structure is essential for maintainability. LLMSleuth follows the src-layout convention, which separates source code from tests and configuration files.

llmsleuth/
    src/
        llmsleuth/
            __init__.py
            cli.py
            ingestion/
                __init__.py
                file_reader.py
                segmenter.py
            analysis/
                __init__.py
                statistical.py
                llm_judge.py
                ensemble.py
            reporting/
                __init__.py
                excel_reporter.py
            backends/
                __init__.py
                local_backend.py
                remote_backend.py
                backend_factory.py
    tests/
    pyproject.toml
    README.txt

This structure separates concerns cleanly. The ingestion package handles everything related to reading and segmenting input files. The analysis package contains the three analysis engines. The reporting package handles Excel generation. The backends package abstracts over the different LLM inference backends.

4.2 File Ingestion and Segmentation

The first challenge is reading the input file. We need to handle plain text, Python source files, PDF documents, and Word documents. Each format requires a different parsing approach, but the output should always be a list of Segment objects, where each Segment has a type (text or code), a content string, a location descriptor, and a character offset within the document.

A Segment is defined as a simple dataclass:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Segment:
    # The type of content: "text" or "code"
    segment_type: str

    # The actual content of the segment
    content: str

    # A human-readable description of where this segment appears
    location: str

    # The character offset of the start of this segment in the document
    char_offset: int

    # The length of this segment in characters
    char_length: int

    # Optional: the name of the function or class if this is a code segment
    code_unit_name: Optional[str] = None

    # Analysis results, populated later by the analysis engines
    statistical_features: dict = field(default_factory=dict)
    llm_judge_result: dict = field(default_factory=dict)
    final_score: float = 0.0

This dataclass serves as the central data structure that flows through the entire pipeline. The statistical_features and llm_judge_result fields are initially empty and are populated by the respective analysis engines. The final_score field is populated by the ensemble scoring engine.

For reading PDF files, we use the PyMuPDF library, which provides fast and accurate text extraction with layout information. The key insight is that we want to preserve the structure of the document, including page numbers and paragraph boundaries, so that the location field of each Segment is meaningful.

import fitz  # PyMuPDF

def read_pdf(file_path: str) -> list[dict]:
    """
    Reads a PDF file and returns a list of page dictionaries,
    each containing the page number and the extracted text blocks.
    Blocks are separated by PyMuPDF based on layout analysis.
    """
    doc = fitz.open(file_path)
    pages = []
    for page_num, page in enumerate(doc, start=1):
        # Extract text as a dictionary with block information
        blocks = page.get_text("blocks")
        text_blocks = []
        for block in blocks:
            # block is (x0, y0, x1, y1, text, block_no, block_type)
            # block_type 0 is text, 1 is image
            if block[6] == 0:  # text block
                text_blocks.append({
                    "text": block[4].strip(),
                    "page": page_num,
                    "block_no": block[5]
                })
        pages.append({"page": page_num, "blocks": text_blocks})
    doc.close()
    return pages

This function returns a structured representation of the PDF that preserves page and block boundaries. The block_no field allows us to construct a precise location descriptor like "Page 3, Block 7".

For Python source files, we use the ast module to parse the code into an abstract syntax tree and then extract individual functions and classes as separate segments. This is much more informative than treating the entire file as a single block of text.

import ast
import textwrap

def extract_code_units(source_code: str, file_path: str) -> list[dict]:
    """
    Parses a Python source file and extracts all top-level and nested
    functions and classes as individual code units, along with their
    line numbers and source text.
    """
    tree = ast.parse(source_code)
    units = []
    source_lines = source_code.splitlines()

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                              ast.ClassDef)):
            start_line = node.lineno - 1  # 0-indexed
            end_line = node.end_lineno    # 1-indexed, inclusive

            # Extract the source lines for this unit
            unit_lines = source_lines[start_line:end_line]
            unit_source = "\n".join(unit_lines)

            # Dedent to remove common leading whitespace
            unit_source = textwrap.dedent(unit_source)

            unit_type = "class" if isinstance(node, ast.ClassDef) \
                        else "function"

            units.append({
                "name": node.name,
                "type": unit_type,
                "source": unit_source,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "file": file_path
            })

    return units

The ast.walk function traverses the entire AST, including nested functions and classes. This means that a method inside a class will be extracted as a separate unit in addition to the class itself. This is intentional, because a student might have written some methods themselves while using an LLM to generate others.
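The breadth-first behavior of ast.walk, with top-level definitions yielded before nested ones, is easy to verify directly:

```python
import ast

sample = '''
class Greeter:
    def greet(self, name):
        return "Hello, " + name

def main():
    return Greeter().greet("world")
'''

tree = ast.parse(sample)
names = [node.name for node in ast.walk(tree)
         if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                              ast.ClassDef))]
# Breadth-first traversal: top-level definitions come before nested ones,
# and the method is extracted separately from its class
print(names)  # → ['Greeter', 'main', 'greet']
```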

The segmentation of prose text into paragraphs is more straightforward. We split on double newlines, which is the standard paragraph separator in plain text documents, and we also handle the case where paragraphs are separated by single newlines with indentation.

import re

def segment_prose(text: str, base_location: str) -> list[dict]:
    """
    Splits a block of prose text into paragraphs, filtering out
    very short paragraphs that do not contain enough content for
    reliable analysis.
    """
    # Normalize line endings and split on blank lines
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    raw_paragraphs = re.split(r"\n\s*\n", normalized)

    segments = []
    search_start = 0
    para_index = 0

    for para in raw_paragraphs:
        # Locate each paragraph in the normalized text so that char_offset
        # is exact even when separators or surrounding whitespace vary,
        # instead of accumulating approximate lengths.
        raw_offset = normalized.find(para, search_start)
        search_start = raw_offset + len(para)

        stripped = para.strip()
        if len(stripped) < 50:
            # Skip very short paragraphs (headings, page numbers, etc.)
            continue

        para_index += 1
        segments.append({
            "text": stripped,
            "location": f"{base_location}, Paragraph {para_index}",
            "char_offset": raw_offset + (len(para) - len(para.lstrip()))
        })

    return segments

The minimum length threshold of 50 characters filters out headings, page numbers, and other short fragments that would produce unreliable analysis results. This threshold is a reasonable default but could be made configurable.

4.3 Statistical Analysis Engine

The statistical analysis engine computes a set of numerical features for each segment. These features capture the stylometric and structural properties that distinguish LLM-generated from human-generated content. We compute them without any LLM inference, which makes this engine fast and deterministic.

The first feature we compute is the average sentence length and its standard deviation. We use the NLTK sentence tokenizer for this, which handles abbreviations and other edge cases better than a simple period-based split.

import nltk
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize

# Ensure the required NLTK data is available
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)

def compute_sentence_stats(text: str) -> dict:
    """
    Computes sentence length statistics for a text passage.
    Returns the mean and standard deviation of sentence lengths
    in words, and the coefficient of variation (std/mean).
    A low coefficient of variation suggests LLM generation.
    """
    sentences = sent_tokenize(text)
    if len(sentences) < 2:
        return {
            "mean_sentence_length": 0.0,
            "std_sentence_length": 0.0,
            "sentence_length_cv": 0.0,
            "sentence_count": len(sentences)
        }

    lengths = [len(word_tokenize(s)) for s in sentences]
    mean_len = float(np.mean(lengths))
    std_len = float(np.std(lengths))
    cv = std_len / mean_len if mean_len > 0 else 0.0

    return {
        "mean_sentence_length": mean_len,
        "std_sentence_length": std_len,
        "sentence_length_cv": cv,
        "sentence_count": len(sentences)
    }

The coefficient of variation (CV) is a normalized measure of variability. A CV close to zero means all sentences have nearly the same length, which is characteristic of LLM output. A high CV means sentence lengths vary widely, which is more characteristic of human writing. In practice, a CV below 0.3 is a weak indicator of LLM generation, while a CV below 0.15 is a strong indicator.
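Turning the CV into a numeric indicator is a judgment call; one simple piecewise-linear mapping based on the thresholds just mentioned:

```python
def cv_to_llm_indicator(cv: float) -> float:
    """Map sentence-length CV to a rough LLM-likelihood signal in [0, 1].
    Uses the thresholds above: CV <= 0.15 is a strong indicator (1.0),
    CV >= 0.3 is no indicator (0.0), linear in between. The exact shape
    of the mapping is an illustrative choice, not a calibrated curve."""
    if cv <= 0.15:
        return 1.0
    if cv >= 0.3:
        return 0.0
    return (0.3 - cv) / 0.15
```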

The next feature is the density of LLM-characteristic phrases. We maintain a curated list of phrases that are strongly associated with LLM outputs and count how many of them appear in the text.

LLM_MARKER_PHRASES = [
    "it is important to note",
    "it is worth noting",
    "it is worth mentioning",
    "furthermore",
    "moreover",
    "in conclusion",
    "in summary",
    "to summarize",
    "delving into",
    "in the realm of",
    "it is crucial to",
    "one must consider",
    "it goes without saying",
    "needless to say",
    "as previously mentioned",
    "as mentioned earlier",
    "in other words",
    "that being said",
    "with that said",
    "at the end of the day",
    "when it comes to",
    "in terms of",
    "it should be noted",
    "it is essential to",
    "plays a crucial role",
    "plays a vital role",
    "it is imperative",
    "a comprehensive",
    "a holistic",
    "leveraging",
    "utilize",
    "utilize the power of",
    "in today's world",
    "in today's fast-paced",
    "the landscape of",
    "a testament to",
    "foster",
    "facilitate",
    "streamline",
    "robust",
    "cutting-edge",
    "state-of-the-art",
    "paradigm shift",
    "synergy",
    "actionable insights",
    "deep dive",
    "game changer",
    "transformative"
]

def compute_llm_phrase_density(text: str) -> dict:
    """
    Counts the number of LLM-characteristic phrases in the text
    and returns the density (phrases per 100 words) and the list
    of matched phrases.
    """
    text_lower = text.lower()
    word_count = len(text_lower.split())
    matched = []

    for phrase in LLM_MARKER_PHRASES:
        if phrase in text_lower:
            matched.append(phrase)

    density = (len(matched) / word_count * 100) if word_count > 0 else 0.0

    return {
        "llm_phrase_count": len(matched),
        "llm_phrase_density": density,
        "matched_phrases": matched
    }

The phrase density metric normalizes the count by the length of the text, which allows fair comparison between short and long passages. A density above 1.0 phrases per 100 words is a moderate indicator of LLM generation, while a density above 3.0 is a strong indicator.

For vocabulary richness, we compute the type-token ratio (TTR), which is the ratio of unique words to total words. We also compute the moving average TTR (MATTR), which is more robust for texts of different lengths because it computes the TTR over a sliding window of fixed size.

def compute_vocabulary_richness(text: str, window_size: int = 50) -> dict:
    """
    Computes vocabulary richness metrics for a text passage.
    The MATTR (Moving Average Type-Token Ratio) is computed over
    a sliding window to normalize for text length.
    A very high or very low MATTR can indicate LLM generation,
    but the signal is weaker than other features.
    """
    words = word_tokenize(text.lower())
    # Remove punctuation tokens
    words = [w for w in words if w.isalpha()]

    if len(words) == 0:
        return {"ttr": 0.0, "mattr": 0.0, "word_count": 0}

    ttr = len(set(words)) / len(words)

    # Compute MATTR
    if len(words) < window_size:
        mattr = ttr
    else:
        ttrs = []
        for i in range(len(words) - window_size + 1):
            window = words[i:i + window_size]
            ttrs.append(len(set(window)) / window_size)
        mattr = float(np.mean(ttrs))

    return {
        "ttr": ttr,
        "mattr": mattr,
        "word_count": len(words)
    }

For code segments, we compute a different set of features that are specific to programming style. The comment-to-code ratio is one of the most reliable indicators of LLM-generated code.

def compute_code_features(source_code: str, language: str = "python") -> dict:
    """
    Computes code-specific features that are indicative of LLM generation.
    Currently supports Python. The comment ratio, generic exception handling,
    and naming verbosity are the most reliable indicators.
    """
    lines = source_code.splitlines()
    total_lines = len(lines)

    if total_lines == 0:
        return {}

    # Count comment lines (lines starting with # after stripping)
    comment_lines = sum(
        1 for line in lines
        if line.strip().startswith("#")
    )
    comment_ratio = comment_lines / total_lines

    # Count docstring lines (lines between triple quotes)
    in_docstring = False
    docstring_lines = 0
    for line in lines:
        stripped = line.strip()
        if in_docstring:
            docstring_lines += 1
            if stripped.endswith('"""') or stripped.endswith("'''"):
                in_docstring = False
        elif stripped.startswith('"""') or stripped.startswith("'''"):
            docstring_lines += 1
            # A one-line docstring opens and closes on the same line
            closes_itself = (
                len(stripped) >= 6
                and (stripped.endswith('"""') or stripped.endswith("'''"))
            )
            if not closes_itself:
                in_docstring = True

    # Detect generic exception handling ("except Exception" with or
    # without an "as" clause, plus bare "except:")
    generic_except_count = sum(
        1 for line in lines
        if re.search(r"except\s+(Exception|BaseException)\b", line)
        or line.strip() == "except:"
    )

    # Detect verbose variable names (names longer than 20 chars)
    identifiers = re.findall(r"\b([a-zA-Z_][a-zA-Z0-9_]*)\b", source_code)
    long_identifiers = [i for i in identifiers if len(i) > 20]
    verbosity_ratio = len(long_identifiers) / len(identifiers) \
                      if identifiers else 0.0

    # Detect print-based debugging (common in LLM-generated code)
    print_debug_count = sum(
        1 for line in lines
        if re.search(r'\bprint\s*\(', line)
        and "test" not in line.lower()
    )

    return {
        "comment_ratio": comment_ratio,
        "docstring_line_ratio": docstring_lines / total_lines,
        "generic_except_count": generic_except_count,
        "verbosity_ratio": verbosity_ratio,
        "print_debug_count": print_debug_count,
        "total_lines": total_lines
    }

The comment_ratio is particularly telling. In LLM-generated Python code, it is common to see 30-50% of lines being comments or docstrings, whereas in production human-written code, the ratio is typically 10-20%. A ratio above 40% is a strong indicator of LLM generation.
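To see the threshold in action, here is a toy check (a simplified stand-in for compute_code_features that counts only # comment lines and ignores blanks):

```python
def comment_ratio(source: str) -> float:
    """Fraction of non-blank lines that are '#' comments."""
    lines = [l for l in source.splitlines() if l.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for l in lines if l.strip().startswith("#"))
    return comments / len(lines)

llm_style = (
    "# Initialize the accumulator to zero\n"
    "total = 0\n"
    "# Iterate over every item in the list\n"
    "for item in items:\n"
    "    # Add the current item to the total\n"
    "    total += item\n"
)
print(f"{comment_ratio(llm_style):.0%}")  # → 50%
```

Half the lines are comments explaining trivial operations, which is well past the 40% mark that flags likely LLM generation.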

4.4 Perplexity Scoring with a Local Language Model

Computing perplexity requires a language model. We use a small but capable model that can run locally on the user's hardware. The model is loaded once at startup and then used to score all passages. We support multiple inference backends to accommodate different hardware configurations.

The perplexity computation works by feeding the text to the model and asking it to compute the log-probability of each token given its context. The perplexity is then the exponential of the average negative log-probability.
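The arithmetic can be checked on a toy set of per-token probabilities (values invented for illustration):

```python
import math

# Hypothetical model probabilities for four observed tokens
token_probs = [0.5, 0.25, 0.125, 0.5]

# Mean negative log-probability across the tokens
mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the mean NLL
perplexity = math.exp(mean_nll)
print(round(perplexity, 3))  # → 3.364
```

Equivalently, perplexity is the reciprocal of the geometric mean of the token probabilities: here the product is 2^-7 over four tokens, giving 2^1.75 ≈ 3.364.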

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class PerplexityScorer:
    """
    Computes perplexity scores for text passages using a local
    causal language model. Supports CUDA (NVIDIA), MPS (Apple Silicon),
    and CPU backends automatically.
    """

    def __init__(self, model_name: str = "gpt2"):
        # Detect the best available device
        if torch.cuda.is_available():
            self.device = torch.device("cuda")
        elif hasattr(torch.backends, "mps") \
             and torch.backends.mps.is_available():
            self.device = torch.device("mps")
        else:
            self.device = torch.device("cpu")

        print(f"PerplexityScorer: using device {self.device}")

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16
                if self.device.type in ("cuda", "mps") else torch.float32
        ).to(self.device)
        self.model.eval()

    def score(self, text: str, max_length: int = 512) -> dict:
        """
        Computes the perplexity of the given text and returns it
        along with the per-token log-probabilities, which can be
        used to identify the most surprising (and therefore most
        likely human-written) parts of the text.
        """
        encodings = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=max_length
        ).to(self.device)

        input_ids = encodings["input_ids"]

        with torch.no_grad():
            outputs = self.model(input_ids, labels=input_ids)
            # outputs.loss is the mean negative log-likelihood per token
            loss = outputs.loss.item()

        perplexity = float(torch.exp(torch.tensor(loss)).item())

        return {
            "perplexity": perplexity,
            "mean_nll": loss
        }

The device detection logic handles NVIDIA GPUs via CUDA, Apple Silicon via MPS (Metal Performance Shaders), and falls back to CPU if neither is available. The float16 data type is used on GPU and MPS devices to reduce memory usage and increase throughput, while float32 is used on CPU for numerical stability.

The model_name parameter defaults to "gpt2", which is a small model that can run on almost any hardware. For better accuracy, a larger model like "EleutherAI/gpt-neo-1.3B" or "microsoft/phi-2" can be used, but these require more memory and compute time.

For Apple Silicon users who want to use MLX for maximum performance, we provide an alternative backend:

# This class is used when the user explicitly requests the MLX backend,
# which provides the best performance on Apple Silicon Macs.
# It requires the mlx and mlx-lm packages to be installed.

class MLXPerplexityScorer:
    """
    Computes perplexity scores using Apple MLX for maximum performance
    on Apple Silicon hardware. Falls back gracefully if MLX is not
    available.
    """

    def __init__(self, model_name: str = "mlx-community/gpt2"):
        try:
            import mlx.core as mx
            from mlx_lm import load
            self.mx = mx
            self.model, self.tokenizer = load(model_name)
            self.available = True
            print(f"MLXPerplexityScorer: loaded {model_name} via MLX")
        except ImportError:
            self.available = False
            print("MLX not available, falling back to PyTorch backend")

    def score(self, text: str) -> dict:
        """
        Computes perplexity using MLX. If MLX is not available,
        returns a placeholder that signals the fallback should be used.
        """
        if not self.available:
            return {"perplexity": -1.0, "mean_nll": -1.0}

        mx = self.mx
        tokens = self.tokenizer.encode(text)
        # MLX perplexity computation
        input_ids = mx.array(tokens[:-1])[None]
        target_ids = mx.array(tokens[1:])

        logits = self.model(input_ids)
        # Compute cross-entropy loss manually
        log_probs = mx.log(mx.softmax(logits[0], axis=-1))
        token_log_probs = log_probs[
            mx.arange(len(target_ids)), target_ids
        ]
        mean_nll = float(-mx.mean(token_log_probs).item())
        perplexity = float(mx.exp(mx.array(mean_nll)).item())

        return {"perplexity": perplexity, "mean_nll": mean_nll}

The MLX backend uses Apple's Metal GPU acceleration to compute perplexity significantly faster than the CPU backend on Apple Silicon hardware. The computation is mathematically equivalent to the PyTorch version: we compute the log-softmax of the logits, select the log-probability of each actual token, and compute the mean negative log-probability.
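The equivalence is easy to verify in isolation with NumPy (toy logits, invented for illustration): selecting the log-softmax value of each target token gives the same mean NLL as the textbook cross-entropy formula, logsumexp(logits) minus the target logit.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))          # 5 positions, vocabulary of 8
targets = np.array([3, 1, 7, 0, 4])       # the "actual" next tokens

# Numerically stable log-softmax
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Mean NLL via log-softmax selection (the MLX-style computation)
mean_nll = -log_probs[np.arange(len(targets)), targets].mean()

# Same value via logsumexp(logits) - target logit (cross-entropy)
lse = np.log(np.exp(shifted).sum(axis=-1)) + logits.max(axis=-1)
reference = (lse - logits[np.arange(len(targets)), targets]).mean()

print(np.isclose(mean_nll, reference))  # the two paths agree
```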

4.5 The LLM Judge Engine

The LLM judge engine is the most powerful component of LLMSleuth. It sends each passage to a capable language model with a carefully crafted prompt and receives a structured JSON response containing a likelihood score and detailed reasoning. The prompt is designed to elicit specific, actionable indicators rather than vague impressions.

The prompt engineering for the judge is critical. A poorly designed prompt will produce unreliable scores and unhelpful explanations. The prompt must instruct the model to look for specific signals, to provide concrete examples from the text, and to express its confidence appropriately.

The judge prompt for text passages is structured as follows:

JUDGE_PROMPT_TEXT = """You are an expert in detecting AI-generated text.
Your task is to analyze the following text passage and determine whether
it was written by a human or generated by a large language model (LLM)
such as GPT-4, Claude, Gemini, or LLaMA.

Analyze the passage for the following specific indicators:

1. TRANSITION PHRASES: Does the text use phrases like "Furthermore,",
   "Moreover,", "In conclusion,", "It is important to note", "It is worth
   mentioning", or similar LLM-characteristic connectives?

2. SENTENCE LENGTH UNIFORMITY: Are the sentences of similar length and
   structure, suggesting mechanical generation rather than natural writing?

3. GENERIC BALANCE: Does the text present a suspiciously balanced view of
   all sides of an issue without taking a genuine position, as LLMs tend to
   do when trying to be neutral?

4. HEDGING LANGUAGE: Does the text overuse hedging phrases like "arguably",
   "one might say", "it could be argued", or "it seems"?

5. ABSENCE OF PERSONAL VOICE: Does the text lack the idiosyncratic word
   choices, personal anecdotes, or distinctive rhetorical style that
   characterizes human writing?

6. FACTUAL PRECISION: Does the text make specific factual claims without
   citing sources, or does it use vague quantifiers like "many studies show"
   or "research suggests" without specifics?

7. STRUCTURAL PERFECTION: Is the text organized with suspicious regularity,
   such as exactly three points per argument or perfectly parallel sentence
   structures?

8. VOCABULARY: Does the text use certain LLM-favored words like "robust",
   "leverage", "utilize", "comprehensive", "holistic", "paradigm", or
   "synergy" with unusual frequency?

Respond ONLY with a valid JSON object in the following format:
{
    "likelihood_llm": <integer from 0 to 100>,
    "confidence": <"low", "medium", or "high">,
    "indicators": [
        {
            "type": <indicator type from the list above>,
            "evidence": <exact quote or description from the text>,
            "severity": <"weak", "moderate", or "strong">
        }
    ],
    "human_indicators": [
        {
            "type": <description of human-like feature>,
            "evidence": <exact quote or description>
        }
    ],
    "reasoning": <one paragraph explaining the overall assessment>
}

The text to analyze:
---
{text}
---
"""

This prompt is carefully structured to guide the model toward specific, evidence-based reasoning rather than impressionistic judgments. The requirement to provide exact quotes as evidence is particularly important, because it forces the model to ground its assessment in the actual text rather than making unsupported claims.
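One practical caveat: because the template embeds a literal JSON schema, every brace in that schema looks like a placeholder to Python's str.format, which makes naive formatting blow up. A targeted replacement of the single placeholder sidesteps this (minimal sketch with a shortened template):

```python
TEMPLATE = """Respond ONLY with JSON like {"score": <int>}.

The text to analyze:
---
{text}
---
"""

# str.format tries to resolve '"score"' as a field name and fails
format_failed = False
try:
    TEMPLATE.format(text="Hello")
except (KeyError, ValueError):
    format_failed = True

# Direct replacement leaves the schema braces untouched
prompt = TEMPLATE.replace("{text}", "Hello")
print(format_failed, "Hello" in prompt, '{"score"' in prompt)
```

Doubling the braces ({{ and }}) in the template would also work, but it makes the prompt harder to read and maintain.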

The judge prompt for code passages is different:

JUDGE_PROMPT_CODE = """You are an expert software engineer and AI researcher
specializing in detecting AI-generated code. Analyze the following code
passage and determine whether it was written by a human developer or
generated by an LLM.

Look for these specific indicators of LLM-generated code:

1. COMMENT DENSITY: Is every line or block commented, even for obvious
   operations? LLMs over-comment code.

2. GENERIC EXCEPTION HANDLING: Does the code use broad "except Exception"
   clauses or bare "except:" statements instead of specific exception types?

3. VERBOSE NAMING: Are variable and function names excessively long and
   descriptive, like "calculate_total_price_including_tax" rather than
   shorter, more natural names?

4. STRUCTURAL REGULARITY: Are all functions of similar length and structure,
   suggesting template-based generation?

5. PRINT-BASED DEBUGGING: Does the code use print statements for logging
   instead of a proper logging framework?

6. PERFECT PEP8 COMPLIANCE: Is the code suspiciously well-formatted with
   perfect adherence to style guidelines that human developers often relax?

7. BOILERPLATE PATTERNS: Does the code follow a very standard template
   (e.g., argparse setup, main guard, standard docstring format) that
   suggests it was generated from a common pattern?

8. MISSING EDGE CASES: Does the code handle the happy path well but miss
   subtle edge cases that an experienced developer would anticipate?

Respond ONLY with a valid JSON object in the following format:
{
    "likelihood_llm": <integer from 0 to 100>,
    "confidence": <"low", "medium", or "high">,
    "indicators": [
        {
            "type": <indicator type from the list above>,
            "evidence": <exact quote or line from the code>,
            "severity": <"weak", "moderate", or "strong">
        }
    ],
    "human_indicators": [
        {
            "type": <description of human-like feature>,
            "evidence": <exact quote or line>
        }
    ],
    "reasoning": <one paragraph explaining the overall assessment>
}

The code to analyze:
---
{code}
---
"""

The LLM judge engine supports multiple backends. For local inference, it supports Ollama (which provides a simple REST API for local models), llama.cpp via the llama-cpp-python bindings, and HuggingFace Transformers with automatic device detection. For remote inference, it supports the OpenAI API and the Anthropic API.

The backend abstraction is implemented using a simple interface:

from abc import ABC, abstractmethod

class LLMBackend(ABC):
    """
    Abstract base class for LLM inference backends.
    All backends must implement the generate method, which takes
    a prompt string and returns the model's response as a string.
    """

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        """
        Generates a response to the given prompt.
        Returns the raw text of the model's response.
        """
        pass

    @abstractmethod
    def is_available(self) -> bool:
        """
        Returns True if this backend is available and properly configured.
        """
        pass
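The interface is easy to exercise with a stub, which is also handy for testing the judge engine without any model. The class below is a hypothetical example, not part of LLMSleuth; it is written duck-typed here so the snippet is self-contained, but in the real tool it would subclass LLMBackend:

```python
import json

class EchoBackend:
    """Canned-response stand-in satisfying the LLMBackend interface."""

    def __init__(self, canned: dict):
        self.canned = canned

    def is_available(self) -> bool:
        # A stub is always "available"
        return True

    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        # Ignore the prompt and return the canned JSON response
        return json.dumps(self.canned)

backend = EchoBackend({"likelihood_llm": 80, "confidence": "high"})
result = json.loads(backend.generate("any prompt"))
print(result["likelihood_llm"])  # → 80
```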

The Ollama backend is the simplest to implement because Ollama provides a standard REST API that works the same way regardless of which model is loaded:

import requests
import json

class OllamaBackend(LLMBackend):
    """
    LLM backend that uses a locally running Ollama server.
    Ollama supports many models including LLaMA, Mistral, Gemma,
    and others, and handles GPU acceleration automatically.
    """

    def __init__(self, model: str = "llama3.2",
                 host: str = "http://localhost:11434"):
        self.model = model
        self.host = host
        self.api_url = f"{host}/api/generate"

    def is_available(self) -> bool:
        """Checks if the Ollama server is running and the model is loaded."""
        try:
            response = requests.get(
                f"{self.host}/api/tags", timeout=5
            )
            if response.status_code != 200:
                return False
            tags = response.json()
            model_names = [m["name"] for m in tags.get("models", [])]
            return any(self.model in name for name in model_names)
        except requests.exceptions.RequestException:
            return False

    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        """
        Sends the prompt to the Ollama API and returns the response.
        Uses streaming to handle long responses efficiently.
        """
        payload = {
            "model": self.model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "num_predict": max_tokens,
                "temperature": 0.1  # Low temperature for consistent output
            }
        }

        response = requests.post(
            self.api_url,
            json=payload,
            timeout=300  # 5 minute timeout for slow hardware
        )
        response.raise_for_status()
        return response.json()["response"]

The temperature is set to 0.1 rather than 0 because a small amount of randomness helps the model explore its reasoning space while still producing consistent results. A temperature of exactly 0 can sometimes cause the model to get stuck in repetitive patterns.

The OpenAI backend uses the official openai Python library:

from openai import OpenAI

class OpenAIBackend(LLMBackend):
    """
    LLM backend that uses the OpenAI API for remote inference.
    Requires an OPENAI_API_KEY environment variable to be set.
    Supports GPT-4o, GPT-4-turbo, and other OpenAI models.
    """

    def __init__(self, model: str = "gpt-4o",
                 api_key: str = None):
        import os
        self.model = model
        self.client = OpenAI(
            api_key=api_key or os.environ.get("OPENAI_API_KEY")
        )

    def is_available(self) -> bool:
        """Checks if the API key is set and the API is reachable."""
        try:
            # Make a minimal API call to verify connectivity
            self.client.models.list()
            return True
        except Exception:
            return False

    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        """
        Sends the prompt to the OpenAI Chat Completions API.
        Uses the system message to set the context and the user
        message to provide the actual prompt.
        """
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert AI detection system. "
                               "Always respond with valid JSON only."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            max_tokens=max_tokens,
            temperature=0.1,
            response_format={"type": "json_object"}
        )
        return response.choices[0].message.content

The response_format parameter with "json_object" constrains the model to emit syntactically valid JSON (the API requires the word "JSON" to appear in the messages, which our system message satisfies). This is very useful for our application because we need to parse the response as a structured object.

The LLM judge engine uses the backend to analyze each passage and parses the JSON response:

import json
import re

class LLMJudgeEngine:
    """
    Uses an LLM backend to analyze passages for signs of LLM generation.
    Handles prompt construction, response parsing, and error recovery.
    """

    def __init__(self, backend: LLMBackend):
        self.backend = backend

    def _extract_json(self, text: str) -> dict:
        """
        Extracts a JSON object from the model's response, handling
        cases where the model includes extra text before or after
        the JSON object.
        """
        # Try direct parsing first
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            pass

        # Try to find a JSON object in the text
        json_match = re.search(r'\{.*\}', text, re.DOTALL)
        if json_match:
            try:
                return json.loads(json_match.group())
            except json.JSONDecodeError:
                pass

        # Return a default response if parsing fails
        return {
            "likelihood_llm": 50,
            "confidence": "low",
            "indicators": [],
            "human_indicators": [],
            "reasoning": "Could not parse model response."
        }

    def analyze_text(self, text: str) -> dict:
        """
        Analyzes a text passage using the LLM judge.
        Returns a structured dictionary with the likelihood score,
        confidence level, indicators, and reasoning.
        """
        # str.format would choke on the literal JSON braces in the
        # prompt template, so substitute the placeholder directly.
        prompt = JUDGE_PROMPT_TEXT.replace("{text}", text)
        response = self.backend.generate(prompt)
        return self._extract_json(response)

    def analyze_code(self, code: str) -> dict:
        """
        Analyzes a code passage using the LLM judge.
        Returns a structured dictionary with the likelihood score,
        confidence level, indicators, and reasoning.
        """
        prompt = JUDGE_PROMPT_CODE.replace("{code}", code)
        response = self.backend.generate(prompt)
        return self._extract_json(response)

The _extract_json method is an important piece of defensive programming. Even with a carefully crafted prompt, LLMs sometimes include extra text before or after the JSON object, or they produce slightly malformed JSON. The method handles these cases gracefully by trying multiple parsing strategies before falling back to a default response.
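The same strategy can be demonstrated in isolation (a standalone copy of the parsing logic, with the default response trimmed for brevity):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from model output."""
    # Strategy 1: the whole response is valid JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy 2: find a brace-delimited object inside the response
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    # Strategy 3: give up and return a neutral placeholder
    return {"likelihood_llm": 50, "confidence": "low"}

messy = 'Sure! Here is my analysis:\n{"likelihood_llm": 85}\nHope that helps.'
print(extract_json(messy))           # → {'likelihood_llm': 85}
print(extract_json("no json here"))  # → neutral fallback, score 50
```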

4.6 The Ensemble Scoring Engine

The ensemble scoring engine combines the outputs of the statistical analysis engine and the LLM judge engine to produce a final likelihood score for each passage. The combination is done using a weighted average, where the weights reflect the relative reliability of each signal.

The weighting scheme is based on the following reasoning. The LLM judge score is the most informative single signal because it combines many features in a flexible, context-sensitive way. We therefore give it the highest weight. The perplexity score is theoretically well-grounded but depends on the quality of the local model, so it gets a moderate weight. The stylometric features are individually weak but collectively informative, so their combined contribution gets a moderate weight.
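The combination itself is plain arithmetic. With a judge score of 80, a perplexity score of 90, and a stylometric score of 60 (invented values for illustration):

```python
WEIGHT_LLM_JUDGE = 0.55
WEIGHT_PERPLEXITY = 0.20
WEIGHT_STYLOMETRIC = 0.25

final = (WEIGHT_LLM_JUDGE * 80      # 44.0
         + WEIGHT_PERPLEXITY * 90   # 18.0
         + WEIGHT_STYLOMETRIC * 60) # 15.0
print(round(final, 1))  # → 77.0
```

Note that the final score is pulled toward the judge score but moderated by the weaker signals, which is exactly the intended behavior.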

import numpy as np

class EnsembleScoringEngine:
    """
    Combines statistical features and LLM judge scores to produce
    a final likelihood estimate for each passage.
    The weights are calibrated based on empirical reliability estimates.
    """

    # Weights for the ensemble (must sum to 1.0)
    WEIGHT_LLM_JUDGE = 0.55
    WEIGHT_PERPLEXITY = 0.20
    WEIGHT_STYLOMETRIC = 0.25

    # Perplexity thresholds for normalization
    # Text with perplexity below LOW_PPL is very likely LLM-generated
    # Text with perplexity above HIGH_PPL is very likely human-written
    LOW_PPL = 20.0
    HIGH_PPL = 200.0

    def compute_perplexity_score(self, perplexity: float) -> float:
        """
        Converts a raw perplexity value to a likelihood-of-LLM score
        in the range [0, 100]. Lower perplexity -> higher LLM likelihood.
        Uses a sigmoid-like mapping for smooth interpolation.
        """
        if perplexity <= 0:
            return 50.0  # Unknown, return neutral score

        # Clamp to the expected range
        ppl = max(self.LOW_PPL, min(self.HIGH_PPL, perplexity))

        # Linear interpolation: LOW_PPL -> 90, HIGH_PPL -> 10
        score = 90.0 - 80.0 * (ppl - self.LOW_PPL) / \
                (self.HIGH_PPL - self.LOW_PPL)
        return float(score)

    def compute_stylometric_score(self, features: dict) -> float:
        """
        Converts the stylometric feature vector to a likelihood-of-LLM
        score in the range [0, 100]. Each feature contributes a partial
        score, and the contributions are averaged.
        """
        scores = []

        # Sentence length coefficient of variation
        # Low CV (< 0.2) -> high LLM likelihood
        cv = features.get("sentence_length_cv", 0.5)
        cv_score = max(0.0, min(100.0, (0.5 - cv) / 0.5 * 100))
        scores.append(cv_score)

        # LLM phrase density
        # High density (> 2.0 per 100 words) -> high LLM likelihood
        density = features.get("llm_phrase_density", 0.0)
        density_score = min(100.0, density / 2.0 * 100)
        scores.append(density_score)

        # Comment ratio for code
        comment_ratio = features.get("comment_ratio", -1)
        if comment_ratio >= 0:
            # High ratio (> 0.4) -> high LLM likelihood
            comment_score = min(100.0, comment_ratio / 0.4 * 100)
            scores.append(comment_score)

        # Generic exception handling
        generic_except = features.get("generic_except_count", 0)
        if generic_except > 0:
            scores.append(min(100.0, generic_except * 30.0))

        if not scores:
            return 50.0

        return float(np.mean(scores))

    def compute_final_score(self, segment) -> float:
        """
        Computes the final ensemble score for a segment by combining
        the LLM judge score, perplexity score, and stylometric score
        using the predefined weights.
        """
        # Get the LLM judge score
        judge_score = float(
            segment.llm_judge_result.get("likelihood_llm", 50)
        )

        # Get the perplexity score
        perplexity = segment.statistical_features.get("perplexity", -1)
        ppl_score = self.compute_perplexity_score(perplexity)

        # Get the stylometric score
        stylo_score = self.compute_stylometric_score(
            segment.statistical_features
        )

        # Weighted combination
        final = (
            self.WEIGHT_LLM_JUDGE * judge_score +
            self.WEIGHT_PERPLEXITY * ppl_score +
            self.WEIGHT_STYLOMETRIC * stylo_score
        )

        return round(final, 1)

The perplexity normalization maps the raw perplexity value to a 0-100 scale using linear interpolation between the LOW_PPL and HIGH_PPL thresholds. A perplexity of 20 (very low, very predictable) maps to a score of 90 (very likely LLM-generated), while a perplexity of 200 (high, unpredictable) maps to a score of 10 (very likely human-written). Values outside this range are clamped to the endpoints.
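For instance, a passage with perplexity 65 lands a quarter of the way through the [20, 200] range (a standalone copy of the mapping above):

```python
LOW_PPL, HIGH_PPL = 20.0, 200.0

def ppl_to_score(perplexity: float) -> float:
    """Maps raw perplexity to a 0-100 likelihood-of-LLM score."""
    if perplexity <= 0:
        return 50.0  # Unknown, neutral
    ppl = max(LOW_PPL, min(HIGH_PPL, perplexity))
    return 90.0 - 80.0 * (ppl - LOW_PPL) / (HIGH_PPL - LOW_PPL)

print(ppl_to_score(65.0))   # → 70.0
print(ppl_to_score(20.0))   # → 90.0
print(ppl_to_score(500.0))  # clamped to HIGH_PPL → 10.0
```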

4.7 Excel Report Generation

The final component is the Excel report generator. We use the openpyxl library to create a formatted workbook with the required columns. The report uses color coding to make it easy to identify high-likelihood LLM passages at a glance.

import numpy as np
import openpyxl
from openpyxl.styles import (
    PatternFill, Font, Alignment, Border, Side
)
from openpyxl.utils import get_column_letter

class ExcelReporter:
    """
    Generates a formatted Excel report from the analyzed segments.
    The report includes color coding, frozen headers, and auto-sized
    columns for easy reading.
    """

    # Color thresholds for the likelihood column
    COLOR_HIGH = "FF4444"    # Red: >= 70% likely LLM
    COLOR_MEDIUM = "FFAA00"  # Orange: 40-69% likely LLM
    COLOR_LOW = "44BB44"     # Green: < 40% likely LLM

    HEADER_COLOR = "1F4E79"  # Dark blue for the header row
    HEADER_FONT_COLOR = "FFFFFF"  # White text for the header

    def _get_likelihood_color(self, score: float) -> str:
        """Returns the hex color code for a given likelihood score."""
        if score >= 70:
            return self.COLOR_HIGH
        elif score >= 40:
            return self.COLOR_MEDIUM
        else:
            return self.COLOR_LOW

    def _format_indicators(self, judge_result: dict) -> str:
        """
        Formats the LLM judge indicators as a readable string
        for the Excel cell.
        """
        indicators = judge_result.get("indicators", [])
        if not indicators:
            return judge_result.get("reasoning", "No indicators found.")

        lines = []
        for ind in indicators:
            severity = ind.get("severity", "unknown").upper()
            ind_type = ind.get("type", "unknown")
            evidence = ind.get("evidence", "")
            lines.append(
                f"[{severity}] {ind_type}: \"{evidence}\""
            )

        reasoning = judge_result.get("reasoning", "")
        if reasoning:
            lines.append(f"\nSummary: {reasoning}")

        return "\n".join(lines)

    def generate(self, segments: list, output_path: str,
                 document_total_chars: int) -> None:
        """
        Generates the Excel report and saves it to the given path.
        Each row corresponds to one analyzed segment.
        """
        wb = openpyxl.Workbook()
        ws = wb.active
        ws.title = "LLM Detection Report"

        # Define column headers
        headers = [
            "Text / Code Excerpt",
            "Type",
            "Location",
            "LLM Likelihood (%)",
            "Proofs & Indicators",
            "Contribution to Document (%)"
        ]

        # Style the header row
        header_fill = PatternFill(
            start_color=self.HEADER_COLOR,
            end_color=self.HEADER_COLOR,
            fill_type="solid"
        )
        header_font = Font(
            color=self.HEADER_FONT_COLOR,
            bold=True,
            size=11
        )

        for col_idx, header in enumerate(headers, start=1):
            cell = ws.cell(row=1, column=col_idx, value=header)
            cell.fill = header_fill
            cell.font = header_font
            cell.alignment = Alignment(
                horizontal="center",
                vertical="center",
                wrap_text=True
            )

        # Freeze the header row
        ws.freeze_panes = "A2"

        # Write data rows
        for row_idx, segment in enumerate(segments, start=2):
            # Truncate long content for the excerpt column
            content = segment.content
            excerpt = content[:300] + "..." \
                      if len(content) > 300 else content

            # Compute contribution percentage
            contribution = round(
                segment.char_length / document_total_chars * 100, 1
            ) if document_total_chars > 0 else 0.0

            # Format the indicators column
            indicators_text = self._format_indicators(
                segment.llm_judge_result
            )

            # Add statistical feature summary to indicators
            stat_summary = self._format_statistical_summary(
                segment.statistical_features
            )
            if stat_summary:
                indicators_text = stat_summary + "\n\n" + indicators_text

            row_data = [
                excerpt,
                segment.segment_type,
                segment.location,
                segment.final_score,
                indicators_text,
                contribution
            ]

            for col_idx, value in enumerate(row_data, start=1):
                cell = ws.cell(row=row_idx, column=col_idx, value=value)
                cell.alignment = Alignment(
                    vertical="top",
                    wrap_text=True
                )

            # Color-code the likelihood cell
            likelihood_cell = ws.cell(row=row_idx, column=4)
            color = self._get_likelihood_color(segment.final_score)
            likelihood_cell.fill = PatternFill(
                start_color=color,
                end_color=color,
                fill_type="solid"
            )
            likelihood_cell.font = Font(bold=True, color="FFFFFF")
            likelihood_cell.alignment = Alignment(
                horizontal="center",
                vertical="center"
            )

        # Set column widths
        column_widths = [60, 10, 30, 20, 80, 20]
        for col_idx, width in enumerate(column_widths, start=1):
            ws.column_dimensions[
                get_column_letter(col_idx)
            ].width = width

        # Set row heights for data rows
        for row_idx in range(2, len(segments) + 2):
            ws.row_dimensions[row_idx].height = 80

        # Add a summary sheet
        self._add_summary_sheet(wb, segments, document_total_chars)

        wb.save(output_path)
        print(f"Report saved to: {output_path}")

    def _format_statistical_summary(self, features: dict) -> str:
        """
        Formats the statistical features as a concise summary string
        for inclusion in the indicators column.
        """
        lines = ["--- Statistical Analysis ---"]

        perplexity = features.get("perplexity", -1)
        if perplexity > 0:
            lines.append(f"Perplexity: {perplexity:.1f} "
                         f"(lower = more LLM-like)")

        cv = features.get("sentence_length_cv", -1)
        if cv >= 0:
            lines.append(f"Sentence length CV: {cv:.2f} "
                         f"(lower = more uniform = more LLM-like)")

        density = features.get("llm_phrase_density", -1)
        if density >= 0:
            matched = features.get("matched_phrases", [])
            lines.append(f"LLM phrase density: {density:.2f}/100 words")
            if matched:
                lines.append(f"Matched phrases: {', '.join(matched[:5])}")

        comment_ratio = features.get("comment_ratio", -1)
        if comment_ratio >= 0:
            lines.append(f"Comment ratio: {comment_ratio:.1%} "
                         f"(>40% is LLM-like)")

        if len(lines) == 1:
            return ""

        return "\n".join(lines)

    def _add_summary_sheet(self, wb: openpyxl.Workbook,
                           segments: list,
                           document_total_chars: int) -> None:
        """
        Adds a summary sheet to the workbook with overall statistics
        about the document's LLM likelihood.
        """
        scores = [s.final_score for s in segments]
        if not scores:
            # Nothing to summarize; avoid leaving an empty sheet behind
            return

        ws = wb.create_sheet("Summary")

        avg_score = np.mean(scores)
        max_score = max(scores)
        high_likelihood = [s for s in segments if s.final_score >= 70]
        medium_likelihood = [s for s in segments
                             if 40 <= s.final_score < 70]
        low_likelihood = [s for s in segments if s.final_score < 40]

        # Compute weighted average by contribution
        weighted_score = sum(
            s.final_score * s.char_length
            for s in segments
        ) / document_total_chars if document_total_chars > 0 else avg_score

        summary_data = [
            ("Overall Assessment", ""),
            ("Total segments analyzed", len(segments)),
            ("Weighted LLM likelihood (%)", round(weighted_score, 1)),
            ("Average LLM likelihood (%)", round(avg_score, 1)),
            ("Maximum LLM likelihood (%)", round(max_score, 1)),
            ("", ""),
            ("Segment Breakdown", ""),
            ("High likelihood (>= 70%)", len(high_likelihood)),
            ("Medium likelihood (40-69%)", len(medium_likelihood)),
            ("Low likelihood (< 40%)", len(low_likelihood)),
            ("", ""),
            ("Interpretation", ""),
            (
                "Overall verdict",
                "LIKELY LLM-GENERATED" if weighted_score >= 60
                else "MIXED (partial LLM use possible)" if weighted_score >= 35
                else "LIKELY HUMAN-WRITTEN"
            )
        ]

        for row_idx, (label, value) in enumerate(summary_data, start=1):
            ws.cell(row=row_idx, column=1, value=label).font = Font(
                bold=(label != "" and value == "")
            )
            ws.cell(row=row_idx, column=2, value=value)

        ws.column_dimensions["A"].width = 40
        ws.column_dimensions["B"].width = 30

4.8 The Command-Line Interface

The CLI ties all components together and provides a user-friendly interface. It uses the argparse module to handle command-line arguments and provides helpful error messages and progress indicators.

import argparse
import sys
from pathlib import Path

def create_parser() -> argparse.ArgumentParser:
    """
    Creates the argument parser for the LLMSleuth CLI.
    """
    parser = argparse.ArgumentParser(
        prog="llmsleuth",
        description=(
            "LLMSleuth: Detect AI-generated content in documents and code.\n"
            "Analyzes text and code passages for signs of LLM generation\n"
            "and produces a detailed Excel report."
        ),
        formatter_class=argparse.RawDescriptionHelpFormatter
    )

    parser.add_argument(
        "input_file",
        type=str,
        help="Path to the file to analyze (.txt, .py, .pdf, .docx)"
    )

    parser.add_argument(
        "-o", "--output",
        type=str,
        default=None,
        help="Path for the output Excel file (default: <input>_report.xlsx)"
    )

    parser.add_argument(
        "--backend",
        type=str,
        choices=["ollama", "openai", "anthropic", "transformers"],
        default="ollama",
        help="LLM backend to use for the judge engine (default: ollama)"
    )

    parser.add_argument(
        "--model",
        type=str,
        default=None,
        help=(
            "Model name for the backend. "
            "For ollama: e.g. 'llama3.2', 'mistral', 'gemma2'. "
            "For openai: e.g. 'gpt-4o', 'gpt-4-turbo'. "
            "For anthropic: e.g. 'claude-3-5-sonnet-20241022'. "
            "For transformers: e.g. 'microsoft/phi-2'."
        )
    )

    parser.add_argument(
        "--perplexity-model",
        type=str,
        default="gpt2",
        help=(
            "HuggingFace model for perplexity scoring "
            "(default: gpt2). Use a larger model for better accuracy."
        )
    )

    parser.add_argument(
        "--no-perplexity",
        action="store_true",
        help="Skip perplexity scoring (faster but less accurate)"
    )

    parser.add_argument(
        "--api-key",
        type=str,
        default=None,
        help=(
            "API key for remote backends. "
            "Can also be set via OPENAI_API_KEY or ANTHROPIC_API_KEY "
            "environment variables."
        )
    )

    parser.add_argument(
        "--ollama-host",
        type=str,
        default="http://localhost:11434",
        help="Ollama server URL (default: http://localhost:11434)"
    )

    parser.add_argument(
        "--min-segment-length",
        type=int,
        default=50,
        help=(
            "Minimum character length for a segment to be analyzed "
            "(default: 50)"
        )
    )

    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Print detailed progress information"
    )

    return parser

def main():
    """
    Main entry point for the LLMSleuth CLI.
    Orchestrates the entire analysis pipeline.
    """
    parser = create_parser()
    args = parser.parse_args()

    input_path = Path(args.input_file)
    if not input_path.exists():
        print(f"Error: File not found: {input_path}", file=sys.stderr)
        sys.exit(1)

    # Default to <input>_report.xlsx alongside the input file
    output_path = args.output or str(
        input_path.with_suffix("")
    ) + "_report.xlsx"

    print(f"LLMSleuth: Analyzing {input_path}")
    print(f"Backend: {args.backend}")
    print(f"Output: {output_path}")
    print()

    # Initialize the backend
    backend = create_backend(args)
    if not backend.is_available():
        print(
            f"Error: Backend '{args.backend}' is not available. "
            f"Please check your configuration.",
            file=sys.stderr
        )
        sys.exit(1)

    # Run the full analysis pipeline
    pipeline = AnalysisPipeline(
        backend=backend,
        perplexity_model=args.perplexity_model
            if not args.no_perplexity else None,
        min_segment_length=args.min_segment_length,
        verbose=args.verbose
    )

    segments, total_chars = pipeline.run(str(input_path))

    # Generate the report
    reporter = ExcelReporter()
    reporter.generate(segments, output_path, total_chars)

    print(f"\nAnalysis complete. Report saved to: {output_path}")

if __name__ == "__main__":
    main()

The AnalysisPipeline class orchestrates the entire analysis process. It reads the file, segments it, runs the statistical analysis, runs the LLM judge, and computes the ensemble scores. This class is the heart of the application.
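The full AnalysisPipeline appears in the addendum, but its control flow can be sketched in miniature. The class and stage implementations below are illustrative stand-ins, not the production code: the segmenter, statistical scorer, and judge are injected as plain callables so the orchestration logic stands on its own, and the 0.4/0.6 weighting is an arbitrary example, not the tool's actual ensemble weights.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MiniSegment:
    content: str
    statistical_score: float = 0.0
    judge_score: float = 0.0
    final_score: float = 0.0

class MiniPipeline:
    """Illustrative orchestration: segment -> score -> judge -> combine."""

    def __init__(self, segmenter: Callable, stat_scorer: Callable,
                 judge: Callable, stat_weight: float = 0.4):
        self.segmenter = segmenter
        self.stat_scorer = stat_scorer
        self.judge = judge
        self.stat_weight = stat_weight

    def run(self, document: str) -> list:
        segments = [MiniSegment(chunk) for chunk in self.segmenter(document)]
        for seg in segments:
            # Each stage fills in one signal; the final score blends them
            seg.statistical_score = self.stat_scorer(seg.content)
            seg.judge_score = self.judge(seg.content)
            seg.final_score = (self.stat_weight * seg.statistical_score
                               + (1 - self.stat_weight) * seg.judge_score)
        return segments

# Toy stage implementations, for demonstration only
pipeline = MiniPipeline(
    segmenter=lambda doc: [p for p in doc.split("\n\n") if p.strip()],
    stat_scorer=lambda text: 80.0 if "furthermore" in text.lower() else 20.0,
    judge=lambda text: 50.0,
)
results = pipeline.run("First paragraph.\n\nFurthermore, a second one.")
```

The dependency injection shown here mirrors a design choice worth noting: because each stage is a separate component, any one of them can be swapped (a different judge backend, a different statistical model) without touching the orchestration.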

SECTION 5: PROS AND CONS OF THE TOOL

Like any analytical tool, LLMSleuth has both strengths and limitations that users must understand to interpret its results correctly.

5.1 Advantages

The most significant advantage of LLMSleuth is that it combines multiple independent signals into a single composite score. No single indicator of LLM generation is reliable on its own, but the combination of perplexity analysis, stylometric features, and LLM judge reasoning provides a much more robust assessment than any individual method. This multi-signal approach reduces both false positives (incorrectly flagging human-written text as LLM-generated) and false negatives (missing LLM-generated text).
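A toy fusion makes the benefit concrete. Each detector is noisy on its own, but a weighted average dampens any single misfire; the weights below are illustrative, not the tool's actual ensemble values.

```python
def fuse_scores(perplexity_score: float, stylometry_score: float,
                judge_score: float,
                weights: tuple = (0.25, 0.25, 0.5)) -> float:
    """Weighted average of three 0-100 LLM-likelihood signals."""
    signals = (perplexity_score, stylometry_score, judge_score)
    return sum(w * s for w, s in zip(weights, signals))

# A human-written passage where stylometry misfires at 75,
# but the other two signals stay low:
score = fuse_scores(perplexity_score=20, stylometry_score=75, judge_score=15)
# 0.25*20 + 0.25*75 + 0.5*15 = 31.25 -- still in the "low" band
```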

The tool provides concrete, evidence-based explanations for its assessments. Rather than simply outputting a score, it identifies specific phrases, patterns, and structural features that support its conclusion. This makes the tool useful not just as a detection mechanism but as an educational resource that helps students and employees understand what distinguishes LLM-generated from human-generated work.

The support for both local and remote LLM backends makes the tool flexible and accessible. Users who have privacy concerns or who work in air-gapped environments can use a local Ollama model. Users who want the highest accuracy can use a commercial API. The tool handles both cases with the same interface.

The GPU support across multiple hardware platforms (NVIDIA CUDA, Apple Silicon MPS, AMD ROCm via PyTorch) ensures that the tool can take advantage of available hardware acceleration, making it practical for analyzing large documents.

The Excel output format is universally accessible and easy to share with colleagues, students, or administrators. The color coding and summary sheet make it easy to identify the most suspicious passages at a glance.
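The `_get_likelihood_color` helper invoked in the reporting code is not shown in the listing above; a plausible sketch is given here. The specific bands and hex colors are assumptions, chosen to mirror the summary sheet's thresholds (>= 70 high, >= 40 medium); openpyxl's PatternFill accepts these six-digit RGB hex strings.

```python
def get_likelihood_color(score: float) -> str:
    """Map a 0-100 LLM-likelihood score to an RGB fill color for openpyxl.
    Bands mirror the summary sheet: >= 70 high, >= 40 medium, else low."""
    if score >= 70:
        return "C0392B"   # red: high likelihood
    if score >= 40:
        return "E67E22"   # orange: medium likelihood
    return "27AE60"       # green: low likelihood
```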

5.2 Limitations and Disadvantages

The most fundamental limitation of LLMSleuth is that it cannot achieve perfect accuracy. LLM detection is an inherently adversarial problem: as detection tools improve, users can adapt their prompting strategies to evade detection. A student who knows that LLMs overuse transition phrases can ask the LLM to avoid them. A developer who knows that LLMs over-comment code can ask the LLM to minimize comments. The tool is therefore most effective against unsophisticated use of LLMs and less effective against sophisticated, targeted use.

The perplexity-based component of the tool depends on the quality of the local model used for scoring. A small model like GPT-2 may not accurately capture the statistical patterns of text generated by much larger models like GPT-4 or Claude. This can lead to inaccurate perplexity scores, particularly for highly technical or domain-specific text.
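The limitation is about where the per-token probabilities come from, not the formula itself: perplexity is simply the exponential of the mean negative log-probability the scoring model assigns to each token. Given per-token natural-log probabilities (from GPT-2 or any causal language model), the computation is:

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity = exp of the mean negative log-probability per token.
    token_logprobs: natural-log probabilities from the scoring model."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns every token probability 0.25 yields perplexity
# of approximately 4.0: the effective branching factor per token.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

A weak scorer like GPT-2 may assign low probability (hence high perplexity) to fluent text that a larger model would find entirely predictable, which is exactly the mismatch the paragraph above warns about.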

The LLM judge component is subject to the same limitations as any LLM-based system. It can be wrong, and its errors are not always predictable. It may flag well-written human text as LLM-generated, particularly if the human writer happens to use phrases that are also common in LLM outputs. This is a particularly serious concern for non-native speakers of English, who may use more formal and structured language that superficially resembles LLM output.

The tool has no ground truth to calibrate against for any specific user or context. The thresholds and weights in the ensemble scoring engine are based on general empirical observations, not on data from the specific institution or organization using the tool. Calibration on a local dataset of known human-written and LLM-generated texts would significantly improve accuracy.
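Such local calibration can be as simple as sweeping the decision threshold over a labeled corpus and keeping the value that maximizes accuracy. This is a minimal sketch under that assumption; a real calibration would also weigh the asymmetric cost of false accusations.

```python
def calibrate_threshold(scores: list, labels: list,
                        candidates=range(0, 101, 5)) -> int:
    """Pick the score threshold maximizing accuracy on labeled examples.
    labels: True = known LLM-generated, False = known human-written."""
    def accuracy(t):
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        return correct / len(scores)
    return max(candidates, key=accuracy)

# Known-human texts score 10-35 here; known-LLM texts score 60-90.
scores = [10, 25, 35, 60, 75, 90]
labels = [False, False, False, True, True, True]
print(calibrate_threshold(scores, labels))  # 40
```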

The analysis of very short passages is unreliable. A single sentence or a three-line function does not contain enough information for any of the detection methods to work well. The tool's minimum segment length threshold mitigates this problem, but it means that very short documents or very granular code may not be well-analyzed.

The tool cannot detect LLM use that has been heavily edited by a human. If a student uses an LLM to generate a first draft and then substantially rewrites it, the final product may show few or no signs of LLM generation. The tool detects the statistical fingerprint of LLM outputs, not the fact that an LLM was involved in the process.

Finally, there are ethical concerns about using such a tool. A false positive accusation of LLM use can be seriously damaging to a student's or employee's reputation. The tool should always be used as one input into a human judgment process, never as the sole basis for an accusation or a disciplinary action. The output of the tool should be treated as probabilistic evidence, not as proof.

ADDENDUM: FULL PRODUCTION-READY IMPLEMENTATION

The following is the complete, production-ready implementation of LLMSleuth. All functionality described in the article is implemented here. The code is organized into modules that correspond to the architecture described in Part 3. No mocks, simulations, or placeholders are used. The implementation supports all file formats, all backends, and all hardware configurations described in the article.

To install the required dependencies, create a virtual environment and run:

pip install PyMuPDF python-docx nltk numpy openpyxl transformers torch
pip install openai anthropic requests llama-cpp-python

For Ollama support, install Ollama from https://ollama.ai and pull a model:

ollama pull llama3.2

For Apple MLX support, additionally install:

pip install mlx mlx-lm

Usage examples:

# Analyze a PDF using local Ollama with llama3.2
python llmsleuth.py essay.pdf --backend ollama --model llama3.2

# Analyze a Python file using OpenAI GPT-4o
python llmsleuth.py solution.py --backend openai --model gpt-4o

# Analyze a Word document using Anthropic Claude
python llmsleuth.py report.docx --backend anthropic \
    --model claude-3-5-sonnet-20241022

# Analyze without perplexity scoring (faster)
python llmsleuth.py essay.txt --backend ollama --no-perplexity

FILE: llmsleuth.py (single-file production implementation)

#!/usr/bin/env python3
"""
LLMSleuth: AI-Generated Content Detection Tool
================================================
Analyzes text and code files for signs of LLM generation and produces
a detailed Excel report with per-passage likelihood scores, indicators,
and evidence.

Supported input formats: .txt, .py, .pdf, .docx
Supported LLM backends: Ollama (local), OpenAI, Anthropic,
                        HuggingFace Transformers (local)
Supported hardware: NVIDIA CUDA, Apple MPS, AMD ROCm, CPU

Author: LLMSleuth Project
License: MIT
"""

from __future__ import annotations

import argparse
import ast
import json
import logging
import os
import re
import sys
import textwrap
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Optional

import nltk
import numpy as np
import openpyxl
import requests
from nltk.tokenize import sent_tokenize, word_tokenize
from openpyxl.styles import Alignment, Border, Font, PatternFill, Side
from openpyxl.utils import get_column_letter

# ---------------------------------------------------------------------------
# NLTK Data Downloads
# ---------------------------------------------------------------------------

for resource in ["punkt", "punkt_tab", "stopwords"]:
    nltk.download(resource, quiet=True)

# ---------------------------------------------------------------------------
# Logging Configuration
# ---------------------------------------------------------------------------

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%H:%M:%S"
)
logger = logging.getLogger("llmsleuth")

# ---------------------------------------------------------------------------
# Data Structures
# ---------------------------------------------------------------------------

@dataclass
class Segment:
    """
    Represents a single analyzable unit of the input document.
    This is the central data structure that flows through the pipeline.
    Each segment is either a prose text passage or a code block.
    """

    # "text" or "code"
    segment_type: str

    # The full content of the segment
    content: str

    # Human-readable location descriptor (e.g., "Page 2, Paragraph 3")
    location: str

    # Character offset of the start of this segment in the document
    char_offset: int

    # Length of this segment in characters
    char_length: int

    # For code segments: the name of the function, class, or block
    code_unit_name: Optional[str] = None

    # Populated by the statistical analysis engine
    statistical_features: dict = field(default_factory=dict)

    # Populated by the LLM judge engine
    llm_judge_result: dict = field(default_factory=dict)

    # Final ensemble score (0-100, higher = more likely LLM-generated)
    final_score: float = 0.0

# ---------------------------------------------------------------------------
# LLM-Characteristic Phrase List
# ---------------------------------------------------------------------------

LLM_MARKER_PHRASES = [
    "it is important to note",
    "it is worth noting",
    "it is worth mentioning",
    "furthermore",
    "moreover",
    "in conclusion",
    "in summary",
    "to summarize",
    "delving into",
    "in the realm of",
    "it is crucial to",
    "one must consider",
    "it goes without saying",
    "needless to say",
    "as previously mentioned",
    "as mentioned earlier",
    "in other words",
    "that being said",
    "with that said",
    "at the end of the day",
    "when it comes to",
    "in terms of",
    "it should be noted",
    "it is essential to",
    "plays a crucial role",
    "plays a vital role",
    "it is imperative",
    "a comprehensive",
    "a holistic",
    "leveraging",
    "utilize",
    "in today's world",
    "in today's fast-paced",
    "the landscape of",
    "a testament to",
    "foster",
    "facilitate",
    "streamline",
    "robust solution",
    "cutting-edge",
    "state-of-the-art",
    "paradigm shift",
    "synergy",
    "actionable insights",
    "deep dive",
    "game changer",
    "transformative",
    "it is undeniable",
    "it cannot be overstated",
    "in the context of",
    "with respect to",
    "it is clear that",
    "it is evident that",
    "one can observe",
    "it is noteworthy",
    "a myriad of",
    "multifaceted",
    "nuanced",
    "intricate",
    "pivotal",
    "paramount",
    "quintessential",
    "embark on",
    "navigate the complexities",
    "unlock the potential",
    "harness the power"
]

# ---------------------------------------------------------------------------
# Judge Prompts
# ---------------------------------------------------------------------------

JUDGE_PROMPT_TEXT = """You are an expert forensic linguist specializing in
detecting AI-generated text. Your task is to analyze the following text
passage and determine whether it was written by a human or generated by a
large language model (LLM) such as GPT-4, Claude, Gemini, or LLaMA.

Analyze the passage carefully for these specific indicators:

1. TRANSITION_PHRASES: Does the text use phrases like "Furthermore,",
   "Moreover,", "In conclusion,", "It is important to note", "It is worth
   mentioning", or similar LLM-characteristic connectives?

2. SENTENCE_UNIFORMITY: Are the sentences of similar length and structure,
   suggesting mechanical generation rather than natural writing?

3. GENERIC_BALANCE: Does the text present a suspiciously balanced view
   without taking a genuine position, as LLMs do when trying to be neutral?

4. HEDGING_LANGUAGE: Does the text overuse hedging phrases like "arguably",
   "one might say", "it could be argued", or "it seems"?

5. ABSENT_PERSONAL_VOICE: Does the text lack idiosyncratic word choices,
   personal anecdotes, or distinctive rhetorical style?

6. VAGUE_CITATIONS: Does the text make specific factual claims without
   citing sources, or use vague quantifiers like "many studies show"?

7. STRUCTURAL_PERFECTION: Is the text organized with suspicious regularity,
   such as exactly three points per argument or perfectly parallel structures?

8. LLM_VOCABULARY: Does the text use LLM-favored words like "robust",
   "leverage", "utilize", "comprehensive", "holistic", "paradigm",
   "synergy", "multifaceted", or "nuanced" with unusual frequency?

9. OVER_EXPLANATION: Does the text explain obvious things in detail,
   as if written for a general audience even when the context is specialized?

10. PERFECT_GRAMMAR: Is the grammar suspiciously perfect, with no
    contractions, colloquialisms, or informal constructions that human
    writers naturally use?

Respond ONLY with a valid JSON object. Do not include any text outside
the JSON object. Use this exact format:
{{
    "likelihood_llm": <integer from 0 to 100>,
    "confidence": "<low, medium, or high>",
    "indicators": [
        {{
            "type": "<indicator type from the numbered list above>",
            "evidence": "<exact quote or specific description from the text>",
            "severity": "<weak, moderate, or strong>"
        }}
    ],
    "human_indicators": [
        {{
            "type": "<description of human-like feature found>",
            "evidence": "<exact quote or specific description>"
        }}
    ],
    "reasoning": "<one concise paragraph explaining the overall assessment>"
}}

Text to analyze:
---
{text}
---
"""

JUDGE_PROMPT_CODE = """You are an expert software engineer and AI researcher
specializing in detecting AI-generated code. Analyze the following code
passage and determine whether it was written by a human developer or
generated by a large language model.

Look for these specific indicators of LLM-generated code:

1. COMMENT_DENSITY: Is every line or block commented, even for obvious
   operations? LLMs systematically over-comment code.

2. GENERIC_EXCEPTIONS: Does the code use broad "except Exception" clauses
   or bare "except:" statements instead of specific exception types?

3. VERBOSE_NAMING: Are variable and function names excessively long and
   descriptive in a pedagogical way?

4. STRUCTURAL_REGULARITY: Are all functions of similar length and structure,
   suggesting template-based generation?

5. PRINT_DEBUGGING: Does the code use print statements for logging instead
   of a proper logging framework?

6. PERFECT_STYLE: Is the code suspiciously well-formatted with perfect
   adherence to style guidelines that human developers often relax?

7. BOILERPLATE_PATTERNS: Does the code follow a very standard template
   that suggests it was generated from a common pattern?

8. MISSING_EDGE_CASES: Does the code handle the happy path well but miss
   subtle edge cases that an experienced developer would anticipate?

9. GENERIC_DOCSTRINGS: Are docstrings present but generic, describing
   the obvious rather than providing useful context?

10. UNIFORM_COMPLEXITY: Are all functions of similar complexity, suggesting
    they were generated in one session rather than evolved over time?

Respond ONLY with a valid JSON object. Do not include any text outside
the JSON object. Use this exact format:
{{
    "likelihood_llm": <integer from 0 to 100>,
    "confidence": "<low, medium, or high>",
    "indicators": [
        {{
            "type": "<indicator type from the numbered list above>",
            "evidence": "<exact quote or line from the code>",
            "severity": "<weak, moderate, or strong>"
        }}
    ],
    "human_indicators": [
        {{
            "type": "<description of human-like feature found>",
            "evidence": "<exact quote or line>"
        }}
    ],
    "reasoning": "<one concise paragraph explaining the overall assessment>"
}}

Code to analyze:
---
{code}
---
"""

# ---------------------------------------------------------------------------
# LLM Backend Abstraction
# ---------------------------------------------------------------------------

class LLMBackend(ABC):
    """
    Abstract base class for all LLM inference backends.
    Concrete implementations handle the specifics of each API or
    local inference engine.
    """

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        """Generates a response to the given prompt."""
        pass

    @abstractmethod
    def is_available(self) -> bool:
        """Returns True if this backend is configured and reachable."""
        pass

    @abstractmethod
    def backend_name(self) -> str:
        """Returns a human-readable name for this backend."""
        pass


class OllamaBackend(LLMBackend):
    """
    Backend for locally running Ollama server.
    Supports all models available in the Ollama model library,
    including LLaMA 3, Mistral, Gemma, Phi, and many others.
    Ollama handles GPU acceleration (CUDA, ROCm, Metal) automatically.
    """

    DEFAULT_MODEL = "llama3.2"

    def __init__(self, model: str = DEFAULT_MODEL,
                 host: str = "http://localhost:11434"):
        self.model = model
        self.host = host.rstrip("/")
        self.api_url = f"{self.host}/api/generate"
        self._log = logging.getLogger("llmsleuth.backend.ollama")

    def backend_name(self) -> str:
        return f"Ollama ({self.model})"

    def is_available(self) -> bool:
        try:
            resp = requests.get(f"{self.host}/api/tags", timeout=5)
            if resp.status_code != 200:
                return False
            tags = resp.json()
            model_names = [m["name"] for m in tags.get("models", [])]
            # Check if any loaded model name starts with our model name
            return any(
                name.startswith(self.model.split(":")[0])
                for name in model_names
            )
        except Exception as exc:
            self._log.debug("Ollama availability check failed: %s", exc)
            return False

    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        payload = {
            "model": self.model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "num_predict": max_tokens,
                "temperature": 0.1,
                "top_p": 0.9,
                "repeat_penalty": 1.1
            }
        }
        try:
            resp = requests.post(
                self.api_url,
                json=payload,
                timeout=600
            )
            resp.raise_for_status()
            return resp.json().get("response", "")
        except requests.exceptions.Timeout:
            self._log.error("Ollama request timed out")
            return "{}"
        except Exception as exc:
            self._log.error("Ollama request failed: %s", exc)
            return "{}"


class OpenAIBackend(LLMBackend):
    """
    Backend for the OpenAI API (GPT-4o, GPT-4-turbo, etc.).
    Requires the OPENAI_API_KEY environment variable or the api_key
    parameter to be set.
    """

    DEFAULT_MODEL = "gpt-4o"

    def __init__(self, model: str = DEFAULT_MODEL,
                 api_key: Optional[str] = None):
        self.model = model
        self._api_key = api_key or os.environ.get("OPENAI_API_KEY", "")
        self._log = logging.getLogger("llmsleuth.backend.openai")
        self._client = None

    def _get_client(self):
        if self._client is None:
            try:
                from openai import OpenAI
                self._client = OpenAI(api_key=self._api_key)
            except ImportError:
                raise RuntimeError(
                    "openai package not installed. "
                    "Run: pip install openai"
                )
        return self._client

    def backend_name(self) -> str:
        return f"OpenAI ({self.model})"

    def is_available(self) -> bool:
        if not self._api_key:
            self._log.error(
                "OPENAI_API_KEY not set. "
                "Set the environment variable or use --api-key."
            )
            return False
        try:
            client = self._get_client()
            client.models.list()
            return True
        except Exception as exc:
            self._log.error("OpenAI availability check failed: %s", exc)
            return False

    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        try:
            client = self._get_client()
            response = client.chat.completions.create(
                model=self.model,
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "You are an expert AI detection system. "
                            "Always respond with valid JSON only, "
                            "with no additional text."
                        )
                    },
                    {"role": "user", "content": prompt}
                ],
                max_tokens=max_tokens,
                temperature=0.1,
                response_format={"type": "json_object"}
            )
            return response.choices[0].message.content
        except Exception as exc:
            self._log.error("OpenAI generation failed: %s", exc)
            return "{}"


class AnthropicBackend(LLMBackend):
    """
    Backend for the Anthropic API (Claude 3.5 Sonnet, Claude 3 Opus, etc.).
    Requires the ANTHROPIC_API_KEY environment variable or the api_key
    parameter to be set.
    """

    DEFAULT_MODEL = "claude-3-5-sonnet-20241022"

    def __init__(self, model: str = DEFAULT_MODEL,
                 api_key: Optional[str] = None):
        self.model = model
        self._api_key = api_key or os.environ.get("ANTHROPIC_API_KEY", "")
        self._log = logging.getLogger("llmsleuth.backend.anthropic")
        self._client = None

    def _get_client(self):
        if self._client is None:
            try:
                import anthropic
                self._client = anthropic.Anthropic(api_key=self._api_key)
            except ImportError:
                raise RuntimeError(
                    "anthropic package not installed. "
                    "Run: pip install anthropic"
                )
        return self._client

    def backend_name(self) -> str:
        return f"Anthropic ({self.model})"

    def is_available(self) -> bool:
        if not self._api_key:
            self._log.error(
                "ANTHROPIC_API_KEY not set. "
                "Set the environment variable or use --api-key."
            )
            return False
        try:
            client = self._get_client()
            # Make a minimal test call
            client.messages.create(
                model=self.model,
                max_tokens=10,
                messages=[{"role": "user", "content": "ping"}]
            )
            return True
        except Exception as exc:
            self._log.error(
                "Anthropic availability check failed: %s", exc
            )
            return False

    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        try:
            client = self._get_client()
            message = client.messages.create(
                model=self.model,
                max_tokens=max_tokens,
                system=(
                    "You are an expert AI detection system. "
                    "Always respond with valid JSON only, "
                    "with no additional text before or after the JSON."
                ),
                messages=[{"role": "user", "content": prompt}]
            )
            return message.content[0].text
        except Exception as exc:
            self._log.error("Anthropic generation failed: %s", exc)
            return "{}"


class TransformersBackend(LLMBackend):
    """
    Backend for local inference using HuggingFace Transformers.
    Automatically selects the best available device:
    NVIDIA CUDA, Apple MPS, AMD ROCm (via PyTorch), or CPU.
    """

    DEFAULT_MODEL = "microsoft/phi-2"

    def __init__(self, model: str = DEFAULT_MODEL):
        self.model_name = model
        self._log = logging.getLogger("llmsleuth.backend.transformers")
        self._model = None
        self._tokenizer = None
        self._device = None
        self._pipeline = None

    def _load_model(self):
        if self._pipeline is not None:
            return

        try:
            import torch
            from transformers import pipeline as hf_pipeline

            if torch.cuda.is_available():
                device = 0  # First CUDA device
                self._log.info("Using NVIDIA CUDA for inference")
            elif (hasattr(torch.backends, "mps")
                  and torch.backends.mps.is_available()):
                device = "mps"
                self._log.info("Using Apple MPS for inference")
            else:
                device = -1  # CPU
                self._log.info("Using CPU for inference")

            self._pipeline = hf_pipeline(
                "text-generation",
                model=self.model_name,
                device=device,
                torch_dtype="auto",
                trust_remote_code=True
            )
            self._log.info(
                "Loaded model %s on device %s",
                self.model_name, device
            )
        except Exception as exc:
            self._log.error("Failed to load model: %s", exc)
            raise

    def backend_name(self) -> str:
        return f"Transformers ({self.model_name})"

    def is_available(self) -> bool:
        try:
            import torch
            import transformers
            return True
        except ImportError:
            return False

    def generate(self, prompt: str, max_tokens: int = 2048) -> str:
        try:
            self._load_model()
            outputs = self._pipeline(
                prompt,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.1,
                top_p=0.9,
                return_full_text=False,
                pad_token_id=self._pipeline.tokenizer.eos_token_id
            )
            return outputs[0]["generated_text"]
        except Exception as exc:
            self._log.error("Transformers generation failed: %s", exc)
            return "{}"


def create_backend(args: argparse.Namespace) -> LLMBackend:
    """
    Factory function that creates the appropriate LLM backend
    based on the command-line arguments.
    """
    default_models = {
        "ollama": OllamaBackend.DEFAULT_MODEL,
        "openai": OpenAIBackend.DEFAULT_MODEL,
        "anthropic": AnthropicBackend.DEFAULT_MODEL,
        "transformers": TransformersBackend.DEFAULT_MODEL,
    }

    model = args.model or default_models.get(args.backend)

    if args.backend == "ollama":
        return OllamaBackend(
            model=model,
            host=args.ollama_host
        )
    elif args.backend == "openai":
        return OpenAIBackend(
            model=model,
            api_key=args.api_key
        )
    elif args.backend == "anthropic":
        return AnthropicBackend(
            model=model,
            api_key=args.api_key
        )
    elif args.backend == "transformers":
        return TransformersBackend(model=model)
    else:
        raise ValueError(f"Unknown backend: {args.backend}")

# ---------------------------------------------------------------------------
# Perplexity Scorer
# ---------------------------------------------------------------------------

class PerplexityScorer:
    """
    Computes perplexity scores for text passages using a local causal
    language model via HuggingFace Transformers.
    Automatically selects CUDA, MPS, or CPU based on availability.
    """

    def __init__(self, model_name: str = "gpt2"):
        self._log = logging.getLogger("llmsleuth.perplexity")
        self.model_name = model_name
        self._model = None
        self._tokenizer = None
        self._device = None
        self._available = False

    def _load(self):
        if self._model is not None:
            return

        try:
            import torch
            from transformers import (
                AutoModelForCausalLM,
                AutoTokenizer
            )

            if torch.cuda.is_available():
                self._device = torch.device("cuda")
                dtype = torch.float16
                self._log.info(
                    "PerplexityScorer: using CUDA (GPU: %s)",
                    torch.cuda.get_device_name(0)
                )
            elif (hasattr(torch.backends, "mps")
                  and torch.backends.mps.is_available()):
                self._device = torch.device("mps")
                dtype = torch.float16
                self._log.info(
                    "PerplexityScorer: using Apple MPS"
                )
            else:
                self._device = torch.device("cpu")
                dtype = torch.float32
                self._log.info(
                    "PerplexityScorer: using CPU"
                )

            self._tokenizer = AutoTokenizer.from_pretrained(
                self.model_name
            )
            self._model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                torch_dtype=dtype
            ).to(self._device)
            self._model.eval()
            self._available = True
            self._log.info(
                "PerplexityScorer: loaded %s", self.model_name
            )

        except Exception as exc:
            self._log.error(
                "Failed to load perplexity model: %s", exc
            )
            self._available = False

    def is_available(self) -> bool:
        try:
            import torch
            import transformers
            return True
        except ImportError:
            return False

    def score(self, text: str, max_length: int = 512) -> dict:
        """
        Computes the perplexity of the given text.
        Returns a dict with 'perplexity' and 'mean_nll' keys.
        Returns {'perplexity': -1.0, 'mean_nll': -1.0} on failure.
        """
        self._load()

        if not self._available:
            return {"perplexity": -1.0, "mean_nll": -1.0}

        try:
            import torch

            encodings = self._tokenizer(
                text,
                return_tensors="pt",
                truncation=True,
                max_length=max_length
            ).to(self._device)

            input_ids = encodings["input_ids"]

            if input_ids.shape[1] < 2:
                return {"perplexity": -1.0, "mean_nll": -1.0}

            with torch.no_grad():
                outputs = self._model(
                    input_ids, labels=input_ids
                )
                loss = outputs.loss.item()

            perplexity = float(
                torch.exp(torch.tensor(loss)).item()
            )

            return {
                "perplexity": perplexity,
                "mean_nll": float(loss)
            }

        except Exception as exc:
            self._log.error("Perplexity scoring failed: %s", exc)
            return {"perplexity": -1.0, "mean_nll": -1.0}
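For intuition, `score` derives perplexity as the exponential of the model's mean token negative log-likelihood (NLL). The following standalone sketch reproduces that relation on hypothetical per-token probabilities; it is illustrative only and does not call the class or any model.

```python
# Illustrative only: perplexity = exp(mean NLL) over per-token
# probabilities, mirroring how PerplexityScorer.score converts the
# model's loss into a perplexity value.
import math

def toy_perplexity(token_probs):
    """Compute exp(mean negative log-likelihood) for toy probabilities."""
    nlls = [-math.log(p) for p in token_probs]
    mean_nll = sum(nlls) / len(nlls)
    return math.exp(mean_nll)

# A uniform probability of 0.25 per token yields a perplexity of 4.0:
# the model is "choosing among 4 equally likely tokens" at each step.
print(toy_perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```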

# ---------------------------------------------------------------------------
# Statistical Analysis Engine
# ---------------------------------------------------------------------------

class StatisticalAnalyzer:
    """
    Computes statistical and stylometric features for text and code
    segments. All computations are deterministic and require no LLM
    inference, making this engine fast and reproducible.
    """

    def __init__(self, perplexity_scorer: Optional[PerplexityScorer] = None):
        self._ppl_scorer = perplexity_scorer
        self._log = logging.getLogger("llmsleuth.statistical")

    def analyze_text(self, text: str) -> dict:
        """
        Computes all statistical features for a prose text segment.
        Returns a feature dictionary suitable for ensemble scoring.
        """
        features = {}

        # Sentence statistics
        features.update(self._compute_sentence_stats(text))

        # LLM phrase density
        features.update(self._compute_llm_phrase_density(text))

        # Vocabulary richness
        features.update(self._compute_vocabulary_richness(text))

        # Perplexity (if scorer is available)
        if self._ppl_scorer is not None:
            features.update(self._ppl_scorer.score(text))
        else:
            features["perplexity"] = -1.0
            features["mean_nll"] = -1.0

        # Punctuation analysis
        features.update(self._compute_punctuation_features(text))

        # Formality score
        features["formality_score"] = self._compute_formality(text)

        return features

    def analyze_code(self, code: str, language: str = "python") -> dict:
        """
        Computes all statistical features for a code segment.
        Returns a feature dictionary suitable for ensemble scoring.
        """
        features = {}

        # Code-specific features
        features.update(self._compute_code_features(code, language))

        # Also compute some text features on the comments and docstrings.
        # Default both perplexity keys first so the feature dict has a
        # consistent shape regardless of how much comment text exists.
        comments_text = self._extract_comments(code)
        features["perplexity"] = -1.0
        features["comment_perplexity"] = -1.0
        if len(comments_text) > 30:
            features.update(
                self._compute_llm_phrase_density(comments_text)
            )
            if self._ppl_scorer is not None:
                ppl = self._ppl_scorer.score(comments_text)
                features["comment_perplexity"] = ppl.get(
                    "perplexity", -1.0
                )

        return features

    def _compute_sentence_stats(self, text: str) -> dict:
        sentences = sent_tokenize(text)
        if len(sentences) < 2:
            return {
                "mean_sentence_length": 0.0,
                "std_sentence_length": 0.0,
                "sentence_length_cv": 0.5,
                "sentence_count": len(sentences)
            }

        lengths = [len(word_tokenize(s)) for s in sentences]
        mean_len = float(np.mean(lengths))
        std_len = float(np.std(lengths))
        cv = std_len / mean_len if mean_len > 0 else 0.5

        return {
            "mean_sentence_length": round(mean_len, 2),
            "std_sentence_length": round(std_len, 2),
            "sentence_length_cv": round(cv, 3),
            "sentence_count": len(sentences)
        }
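The coefficient of variation computed above (std / mean of sentence lengths) is the "burstiness" signal: LLM prose tends toward uniform sentence lengths, human prose varies more. A minimal sketch with made-up token counts, using only the standard library:

```python
# Sentence-length coefficient of variation, mirroring
# _compute_sentence_stats on hypothetical per-sentence token counts.
import statistics

def length_cv(lengths):
    mean_len = statistics.mean(lengths)
    std_len = statistics.pstdev(lengths)  # population std, like np.std
    return std_len / mean_len if mean_len > 0 else 0.5

uniform = [20, 21, 19, 20, 20]  # LLM-like: very regular lengths
bursty = [5, 32, 11, 27, 8]     # human-like: high variation
print(length_cv(uniform) < length_cv(bursty))  # True
```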

    def _compute_llm_phrase_density(self, text: str) -> dict:
        text_lower = text.lower()
        words = text_lower.split()
        word_count = len(words)
        matched = [
            phrase for phrase in LLM_MARKER_PHRASES
            if phrase in text_lower
        ]
        density = (
            len(matched) / word_count * 100
        ) if word_count > 0 else 0.0

        return {
            "llm_phrase_count": len(matched),
            "llm_phrase_density": round(density, 3),
            "matched_phrases": matched
        }
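The density above is "distinct marker phrases per 100 words" (each phrase counts at most once). A self-contained sketch with a tiny hypothetical marker list, independent of the real LLM_MARKER_PHRASES:

```python
# Phrase density per 100 words, mirroring _compute_llm_phrase_density.
# The two markers here are hypothetical stand-ins for LLM_MARKER_PHRASES.
markers = ["delve into", "it is important to note"]
text = "We delve into the data. It is important to note the trend."
text_lower = text.lower()
matched = [m for m in markers if m in text_lower]
density = len(matched) / len(text_lower.split()) * 100
print(len(matched), round(density, 3))  # 2 matches in 12 words
```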

    def _compute_vocabulary_richness(self, text: str,
                                      window_size: int = 50) -> dict:
        words = word_tokenize(text.lower())
        words = [w for w in words if w.isalpha()]

        if not words:
            return {"ttr": 0.0, "mattr": 0.0, "word_count": 0}

        ttr = len(set(words)) / len(words)

        if len(words) < window_size:
            mattr = ttr
        else:
            ttrs = [
                len(set(words[i:i + window_size])) / window_size
                for i in range(len(words) - window_size + 1)
            ]
            mattr = float(np.mean(ttrs))

        return {
            "ttr": round(ttr, 3),
            "mattr": round(mattr, 3),
            "word_count": len(words)
        }
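MATTR (moving-average type-token ratio) averages the TTR over a sliding window, which removes plain TTR's dependence on text length. A standalone sketch with a deliberately small window for illustration:

```python
# Moving-average TTR over a sliding window, mirroring
# _compute_vocabulary_richness (window shrunk to 3 for illustration).
def mattr(words, window=3):
    if len(words) < window:
        return len(set(words)) / len(words)
    ttrs = [
        len(set(words[i:i + window])) / window
        for i in range(len(words) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)

# Windows: {the,cat,the} -> 2/3, {cat,the,dog} -> 1, {the,dog,the} -> 2/3
print(mattr(["the", "cat", "the", "dog", "the"]))
```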

    def _compute_punctuation_features(self, text: str) -> dict:
        word_count = len(text.split())
        if word_count == 0:
            return {
                "comma_density": 0.0,
                "semicolon_density": 0.0,
                "exclamation_density": 0.0,
                "question_density": 0.0
            }

        return {
            "comma_density": round(
                text.count(",") / word_count * 100, 2
            ),
            "semicolon_density": round(
                text.count(";") / word_count * 100, 2
            ),
            "exclamation_density": round(
                text.count("!") / word_count * 100, 2
            ),
            "question_density": round(
                text.count("?") / word_count * 100, 2
            )
        }

    def _compute_formality(self, text: str) -> float:
        """
        Estimates text formality as a score from 0 (informal) to 1
        (very formal). High formality in informal contexts is an
        LLM indicator.
        """
        informal_markers = {
            "gonna", "wanna", "gotta", "kinda", "sorta", "yeah",
            "nope", "yep", "ok", "okay", "lol", "btw", "tbh",
            "imo", "imho", "fyi", "asap", "etc", "vs", "i'm",
            "it's", "don't", "can't", "won't", "isn't", "aren't"
        }
        words = text.lower().split()
        if not words:
            return 0.5

        # Match whole words (after stripping surrounding punctuation)
        # so that e.g. "ok" does not fire inside "token" or "look".
        informal_count = sum(
            1 for w in words
            if w.strip(".,;:!?\"'()") in informal_markers
        )
        formality = 1.0 - min(1.0, informal_count / len(words) * 20)
        return round(formality, 3)

    def _compute_code_features(self, source_code: str,
                                language: str) -> dict:
        lines = source_code.splitlines()
        total_lines = len(lines)

        if total_lines == 0:
            return {"total_lines": 0}

        # Comment lines
        comment_lines = sum(
            1 for line in lines
            if line.strip().startswith("#")
        )
        comment_ratio = comment_lines / total_lines

        # Docstring detection (simplified but robust)
        in_docstring = False
        docstring_lines = 0
        triple_quote_count = 0
        for line in lines:
            stripped = line.strip()
            occurrences = stripped.count('"""') + stripped.count("'''")
            triple_quote_count += occurrences
            if occurrences % 2 == 1:
                in_docstring = not in_docstring
            if in_docstring or occurrences > 0:
                docstring_lines += 1

        # Generic exception handling
        generic_except_count = sum(
            1 for line in lines
            if re.search(
                r"except\s+(Exception|BaseException)\s+as", line
            )
            or line.strip() in ("except:", "except Exception:")
        )

        # Verbose identifiers
        identifiers = re.findall(
            r"\b([a-zA-Z_][a-zA-Z0-9_]*)\b", source_code
        )
        long_identifiers = [i for i in identifiers if len(i) > 20]
        verbosity_ratio = (
            len(long_identifiers) / len(identifiers)
            if identifiers else 0.0
        )

        # Print-based debugging
        print_debug_count = sum(
            1 for line in lines
            if re.search(r'\bprint\s*\(', line)
        )

        # Blank lines ratio (LLMs often add many blank lines)
        blank_lines = sum(1 for line in lines if not line.strip())
        blank_ratio = blank_lines / total_lines

        # Average function length (if parseable)
        avg_func_length = self._compute_avg_function_length(source_code)

        # Import count
        import_count = sum(
            1 for line in lines
            if line.strip().startswith(("import ", "from "))
        )

        return {
            "comment_ratio": round(comment_ratio, 3),
            "docstring_line_ratio": round(
                docstring_lines / total_lines, 3
            ),
            "generic_except_count": generic_except_count,
            "verbosity_ratio": round(verbosity_ratio, 3),
            "print_debug_count": print_debug_count,
            "blank_ratio": round(blank_ratio, 3),
            "avg_function_length": avg_func_length,
            "import_count": import_count,
            "total_lines": total_lines
        }

    def _compute_avg_function_length(self, source_code: str) -> float:
        """
        Computes the average length of functions in the source code
        using AST parsing. Returns -1.0 if parsing fails.
        """
        try:
            tree = ast.parse(source_code)
            lengths = []
            for node in ast.walk(tree):
                if isinstance(
                    node,
                    (ast.FunctionDef, ast.AsyncFunctionDef)
                ):
                    length = node.end_lineno - node.lineno + 1
                    lengths.append(length)
            return round(float(np.mean(lengths)), 1) if lengths else -1.0
        except SyntaxError:
            return -1.0
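The `lineno`/`end_lineno` arithmetic used above can be seen on a small inline source string; this sketch measures two toy functions exactly the way `_compute_avg_function_length` does:

```python
# Standalone sketch of AST-based function-length measurement,
# as used in _compute_avg_function_length.
import ast

sample = (
    "def short():\n"
    "    return 1\n"
    "\n"
    "def longer(x):\n"
    "    y = x + 1\n"
    "    z = y * 2\n"
    "    return z\n"
)
tree = ast.parse(sample)
lengths = [
    node.end_lineno - node.lineno + 1
    for node in ast.walk(tree)
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
]
print(lengths)  # short() spans 2 lines, longer() spans 4
```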

    def _extract_comments(self, source_code: str) -> str:
        """
        Extracts all comment and docstring text from source code
        for separate analysis.
        """
        lines = source_code.splitlines()
        comment_lines = [
            line.strip().lstrip("#").strip()
            for line in lines
            if line.strip().startswith("#")
        ]

        # Extract docstring content (simplified)
        docstring_content = re.findall(
            r'"""(.*?)"""', source_code, re.DOTALL
        )
        docstring_content += re.findall(
            r"'''(.*?)'''", source_code, re.DOTALL
        )

        all_text = " ".join(comment_lines)
        all_text += " " + " ".join(
            d.strip() for d in docstring_content
        )
        return all_text.strip()

# ---------------------------------------------------------------------------
# LLM Judge Engine
# ---------------------------------------------------------------------------

class LLMJudgeEngine:
    """
    Uses an LLM backend to analyze passages for signs of LLM generation.
    Handles prompt construction, response parsing, and error recovery.
    Implements retry logic for transient failures.
    """

    MAX_RETRIES = 3
    RETRY_DELAY = 2.0  # seconds

    def __init__(self, backend: LLMBackend):
        self.backend = backend
        self._log = logging.getLogger("llmsleuth.judge")

    def _extract_json(self, text: str) -> dict:
        """
        Robustly extracts a JSON object from the model's response.
        Handles cases where the model includes extra text, markdown
        code fences, or other formatting around the JSON.
        """
        if not text or not text.strip():
            return self._default_response("Empty response from model")

        # Remove markdown code fences if present
        text = re.sub(r"```(?:json)?\s*", "", text)
        text = re.sub(r"```\s*$", "", text, flags=re.MULTILINE)

        # Try direct parsing
        try:
            return json.loads(text.strip())
        except json.JSONDecodeError:
            pass

        # Try to find the outermost JSON object
        brace_start = text.find("{")
        brace_end = text.rfind("}")
        if brace_start != -1 and brace_end > brace_start:
            candidate = text[brace_start:brace_end + 1]
            try:
                return json.loads(candidate)
            except json.JSONDecodeError:
                pass

        self._log.warning(
            "Could not parse JSON from model response: %s...",
            text[:200]
        )
        return self._default_response(
            "Could not parse model response as JSON"
        )
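The fence-stripping-then-brace-scanning fallback in `_extract_json` can be demonstrated on a typical chatty model response (the response text below is invented for illustration):

```python
# Minimal sketch of the JSON-recovery strategy in _extract_json:
# strip markdown fences, then take the outermost {...} span.
import json
import re

raw = ('Sure! Here is the analysis:\n'
       '```json\n{"likelihood_llm": 72}\n```\n'
       'Hope that helps.')
cleaned = re.sub(r"```(?:json)?\s*", "", raw)
start, end = cleaned.find("{"), cleaned.rfind("}")
parsed = json.loads(cleaned[start:end + 1])
print(parsed["likelihood_llm"])  # 72
```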

    def _default_response(self, reason: str) -> dict:
        """Returns a neutral default response when parsing fails."""
        return {
            "likelihood_llm": 50,
            "confidence": "low",
            "indicators": [],
            "human_indicators": [],
            "reasoning": reason
        }

    def _call_with_retry(self, prompt: str) -> str:
        """
        Calls the backend with retry logic for transient failures.
        """
        for attempt in range(self.MAX_RETRIES):
            try:
                result = self.backend.generate(prompt)
                if result and result.strip():
                    return result
            except Exception as exc:
                self._log.warning(
                    "Backend call failed (attempt %d/%d): %s",
                    attempt + 1, self.MAX_RETRIES, exc
                )

            if attempt < self.MAX_RETRIES - 1:
                time.sleep(self.RETRY_DELAY)

        return "{}"

    def analyze_text(self, text: str) -> dict:
        """
        Analyzes a text passage using the LLM judge.
        Truncates very long passages to stay within context limits.
        """
        # Truncate to ~3000 characters to stay within context limits
        truncated = text[:3000] + "..." if len(text) > 3000 else text
        prompt = JUDGE_PROMPT_TEXT.format(text=truncated)
        response = self._call_with_retry(prompt)
        result = self._extract_json(response)

        # Validate and clamp the likelihood score; fall back to a
        # neutral 50 if the model returned a non-numeric value.
        try:
            likelihood = int(result.get("likelihood_llm", 50))
        except (TypeError, ValueError):
            likelihood = 50
        result["likelihood_llm"] = max(0, min(100, likelihood))
        return result

    def analyze_code(self, code: str) -> dict:
        """
        Analyzes a code passage using the LLM judge.
        Truncates very long code blocks to stay within context limits.
        """
        truncated = code[:3000] + "..." if len(code) > 3000 else code
        prompt = JUDGE_PROMPT_CODE.format(code=truncated)
        response = self._call_with_retry(prompt)
        result = self._extract_json(response)

        # Validate and clamp, tolerating non-numeric model output.
        try:
            likelihood = int(result.get("likelihood_llm", 50))
        except (TypeError, ValueError):
            likelihood = 50
        result["likelihood_llm"] = max(0, min(100, likelihood))
        return result

# ---------------------------------------------------------------------------
# Ensemble Scoring Engine
# ---------------------------------------------------------------------------

class EnsembleScoringEngine:
    """
    Combines statistical features and LLM judge scores to produce
    a calibrated final likelihood estimate for each segment.
    """

    # Ensemble weights (must sum to 1.0)
    WEIGHT_LLM_JUDGE = 0.55
    WEIGHT_PERPLEXITY = 0.20
    WEIGHT_STYLOMETRIC = 0.25

    # Perplexity normalization thresholds
    LOW_PPL = 20.0    # Very likely LLM-generated
    HIGH_PPL = 200.0  # Very likely human-written

    def compute_perplexity_score(self, perplexity: float) -> float:
        """
        Maps raw perplexity to a 0-100 LLM-likelihood score.
        Lower perplexity = higher LLM likelihood.
        """
        if perplexity <= 0:
            return 50.0  # Unknown

        ppl = max(self.LOW_PPL, min(self.HIGH_PPL, perplexity))
        score = 90.0 - 80.0 * (
            (ppl - self.LOW_PPL) / (self.HIGH_PPL - self.LOW_PPL)
        )
        return round(float(score), 1)
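The linear mapping above sends LOW_PPL (20) to a score of 90 and HIGH_PPL (200) to 10, with clamping outside that range. A standalone reproduction with the worked endpoint and midpoint values:

```python
# The perplexity-to-score mapping from compute_perplexity_score,
# reproduced standalone: 20 -> 90, midpoint 110 -> 50, 200 -> 10.
LOW_PPL, HIGH_PPL = 20.0, 200.0

def ppl_to_score(perplexity):
    if perplexity <= 0:
        return 50.0  # unknown / scoring failed
    ppl = max(LOW_PPL, min(HIGH_PPL, perplexity))
    return round(90.0 - 80.0 * (ppl - LOW_PPL) / (HIGH_PPL - LOW_PPL), 1)

print(ppl_to_score(20), ppl_to_score(110), ppl_to_score(200))
# 90.0 50.0 10.0
```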

    def compute_stylometric_score(self, features: dict,
                                   segment_type: str) -> float:
        """
        Converts the stylometric feature vector to a 0-100
        LLM-likelihood score.
        """
        scores = []

        if segment_type == "text":
            # Sentence length coefficient of variation
            cv = features.get("sentence_length_cv", 0.5)
            # CV < 0.2 -> score near 100; CV > 0.5 -> score near 0
            cv_score = max(0.0, min(100.0, (0.5 - cv) / 0.5 * 100))
            scores.append(cv_score)

            # LLM phrase density
            density = features.get("llm_phrase_density", 0.0)
            density_score = min(100.0, density / 2.0 * 100)
            scores.append(density_score)

            # Formality (high formality in informal contexts)
            formality = features.get("formality_score", 0.5)
            formality_score = formality * 60  # Max contribution: 60
            scores.append(formality_score)

            # Low vocabulary richness variation is LLM-like
            mattr = features.get("mattr", 0.7)
            # MATTR around 0.7-0.8 is typical for LLMs
            mattr_score = max(
                0.0,
                min(100.0, (1.0 - abs(mattr - 0.75) / 0.25) * 50)
            )
            scores.append(mattr_score)

        elif segment_type == "code":
            # Comment ratio
            comment_ratio = features.get("comment_ratio", 0.0)
            comment_score = min(100.0, comment_ratio / 0.4 * 100)
            scores.append(comment_score)

            # Generic exception handling
            generic_except = features.get("generic_except_count", 0)
            scores.append(min(100.0, generic_except * 25.0))

            # Verbose naming
            verbosity = features.get("verbosity_ratio", 0.0)
            scores.append(min(100.0, verbosity * 500.0))

            # LLM phrase density in comments
            density = features.get("llm_phrase_density", 0.0)
            scores.append(min(100.0, density / 2.0 * 100))

        if not scores:
            return 50.0

        return round(float(np.mean(scores)), 1)

    def compute_final_score(self, segment: Segment) -> float:
        """
        Computes the final ensemble score for a segment.
        """
        judge_score = float(
            segment.llm_judge_result.get("likelihood_llm", 50)
        )

        perplexity = segment.statistical_features.get(
            "perplexity", -1.0
        )
        ppl_score = self.compute_perplexity_score(perplexity)

        stylo_score = self.compute_stylometric_score(
            segment.statistical_features,
            segment.segment_type
        )

        final = (
            self.WEIGHT_LLM_JUDGE * judge_score
            + self.WEIGHT_PERPLEXITY * ppl_score
            + self.WEIGHT_STYLOMETRIC * stylo_score
        )

        return round(final, 1)
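A worked example of the weighted combination above, using the class's 0.55 / 0.20 / 0.25 weights and illustrative (not real) component scores:

```python
# Worked example of the ensemble combination in compute_final_score.
# The component scores are hypothetical inputs for illustration.
W_JUDGE, W_PPL, W_STYLO = 0.55, 0.20, 0.25

judge, ppl, stylo = 80.0, 60.0, 40.0
final = W_JUDGE * judge + W_PPL * ppl + W_STYLO * stylo
print(round(final, 1))  # 0.55*80 + 0.20*60 + 0.25*40 = 66.0
```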

# ---------------------------------------------------------------------------
# File Ingestion Layer
# ---------------------------------------------------------------------------

class FileIngester:
    """
    Reads input files of various formats and converts them to a list
    of Segment objects for analysis. Supports .txt, .py, .pdf, .docx.
    """

    def __init__(self, min_segment_length: int = 50):
        self.min_length = min_segment_length
        self._log = logging.getLogger("llmsleuth.ingester")

    def ingest(self, file_path: str) -> tuple[list[Segment], int]:
        """
        Reads the file and returns (segments, total_char_count).
        The total_char_count is used to compute contribution percentages.
        """
        path = Path(file_path)
        suffix = path.suffix.lower()

        if suffix == ".txt":
            return self._ingest_text(file_path)
        elif suffix == ".py":
            return self._ingest_python(file_path)
        elif suffix == ".pdf":
            return self._ingest_pdf(file_path)
        elif suffix in (".docx", ".doc"):
            # Note: python-docx reads only .docx; a legacy binary .doc
            # will raise inside _ingest_docx.
            return self._ingest_docx(file_path)
        else:
            # Try to read as plain text
            self._log.warning(
                "Unknown file type %s, treating as plain text", suffix
            )
            return self._ingest_text(file_path)

    def _make_text_segment(self, text: str, location: str,
                            char_offset: int) -> Optional[Segment]:
        """Creates a text Segment if the content meets the minimum length."""
        text = text.strip()
        if len(text) < self.min_length:
            return None
        return Segment(
            segment_type="text",
            content=text,
            location=location,
            char_offset=char_offset,
            char_length=len(text)
        )

    def _make_code_segment(self, code: str, location: str,
                            char_offset: int,
                            name: Optional[str] = None) -> Optional[Segment]:
        """Creates a code Segment if the content meets the minimum length."""
        code = code.strip()
        if len(code) < self.min_length:
            return None
        return Segment(
            segment_type="code",
            content=code,
            location=location,
            char_offset=char_offset,
            char_length=len(code),
            code_unit_name=name
        )

    def _ingest_text(self, file_path: str) -> tuple[list[Segment], int]:
        """Reads a plain text file and segments it into paragraphs."""
        with open(file_path, "r", encoding="utf-8", errors="replace") as f:
            content = f.read()

        total_chars = len(content)
        normalized = content.replace("\r\n", "\n").replace("\r", "\n")
        raw_paragraphs = re.split(r"\n\s*\n", normalized)

        segments = []
        char_offset = 0
        para_index = 0

        for para in raw_paragraphs:
            stripped = para.strip()
            para_index += 1
            location = f"Paragraph {para_index}"
            seg = self._make_text_segment(stripped, location, char_offset)
            if seg:
                segments.append(seg)
            # The split consumed a variable-length blank-line gap, so
            # "+ 2" is an approximation of the separator width.
            char_offset += len(para) + 2

        return segments, total_chars
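The newline normalization and blank-line paragraph split used above can be shown standalone; note that a single `\n` does not break a paragraph, while one or more blank lines do:

```python
# The blank-line paragraph split used by _ingest_text, standalone.
import re

content = ("First paragraph.\r\n\r\n"
           "Second one,\nstill same paragraph.\n\n\n"
           "Third.")
normalized = content.replace("\r\n", "\n").replace("\r", "\n")
paragraphs = [
    p.strip() for p in re.split(r"\n\s*\n", normalized) if p.strip()
]
print(len(paragraphs))  # 3
```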

    def _ingest_python(self, file_path: str) -> tuple[list[Segment], int]:
        """
        Reads a Python source file and extracts functions, classes,
        and module-level code as separate segments.
        """
        with open(file_path, "r", encoding="utf-8", errors="replace") as f:
            source = f.read()

        total_chars = len(source)
        segments = []

        try:
            tree = ast.parse(source)
        except SyntaxError as exc:
            self._log.error("Failed to parse Python file: %s", exc)
            # Fall back to treating the whole file as one code segment
            seg = self._make_code_segment(
                source, f"File: {Path(file_path).name}", 0
            )
            if seg:
                segments.append(seg)
            return segments, total_chars

        source_lines = source.splitlines()

        # Track which lines are covered by top-level nodes
        covered_lines = set()

        # Extract all top-level and nested functions and classes
        for node in ast.walk(tree):
            if isinstance(
                node,
                (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
            ):
                start = node.lineno - 1
                end = node.end_lineno
                unit_lines = source_lines[start:end]
                unit_source = textwrap.dedent("\n".join(unit_lines))

                unit_type = (
                    "class" if isinstance(node, ast.ClassDef)
                    else "function"
                )
                location = (
                    f"{Path(file_path).name}, "
                    f"Lines {node.lineno}-{node.end_lineno} "
                    f"({unit_type}: {node.name})"
                )

                # Compute char offset
                char_offset = sum(
                    len(line) + 1 for line in source_lines[:start]
                )

                seg = self._make_code_segment(
                    unit_source, location, char_offset, node.name
                )
                if seg:
                    segments.append(seg)

                for line_no in range(start, end):
                    covered_lines.add(line_no)

        # Extract module-level code that is not inside any function/class
        module_level_lines = []
        module_level_start = None
        for i, line in enumerate(source_lines):
            if i not in covered_lines and line.strip():
                if module_level_start is None:
                    module_level_start = i
                module_level_lines.append(line)

        if module_level_lines:
            module_code = "\n".join(module_level_lines)
            # "is not None" matters: line 0 is a valid (falsy) index.
            char_offset = sum(
                len(line) + 1
                for line in source_lines[:module_level_start]
            ) if module_level_start is not None else 0
            location = (
                f"{Path(file_path).name}, Module-level code"
            )
            seg = self._make_code_segment(
                module_code, location, char_offset
            )
            if seg:
                segments.append(seg)

        # Sort segments by char_offset for consistent ordering
        segments.sort(key=lambda s: s.char_offset)

        return segments, total_chars

    def _ingest_pdf(self, file_path: str) -> tuple[list[Segment], int]:
        """
        Reads a PDF file using PyMuPDF and extracts text blocks
        as separate segments.
        """
        try:
            import fitz
        except ImportError:
            raise RuntimeError(
                "PyMuPDF not installed. Run: pip install PyMuPDF"
            )

        doc = fitz.open(file_path)
        segments = []
        total_chars = 0
        global_char_offset = 0

        for page_num in range(len(doc)):
            page = doc[page_num]
            blocks = page.get_text("blocks")

            para_index = 0
            for block in blocks:
                # block: (x0, y0, x1, y1, text, block_no, block_type)
                if block[6] != 0:  # Skip non-text blocks
                    continue

                block_text = block[4].strip()
                if not block_text:
                    continue

                total_chars += len(block_text)
                para_index += 1
                location = (
                    f"Page {page_num + 1}, Block {para_index}"
                )

                seg = self._make_text_segment(
                    block_text, location, global_char_offset
                )
                if seg:
                    segments.append(seg)

                global_char_offset += len(block_text) + 1

        doc.close()
        return segments, total_chars

    def _ingest_docx(self, file_path: str) -> tuple[list[Segment], int]:
        """
        Reads a Word document using python-docx and extracts
        paragraphs and code blocks (if formatted with a code style).
        """
        try:
            from docx import Document
        except ImportError:
            raise RuntimeError(
                "python-docx not installed. Run: pip install python-docx"
            )

        doc = Document(file_path)
        segments = []
        total_chars = 0
        global_char_offset = 0

        # Code-style paragraph names in Word documents
        code_styles = {
            "code", "code block", "preformatted text",
            "html preformatted", "macro text"
        }

        para_index = 0
        code_buffer = []
        code_start_para = None

        for para in doc.paragraphs:
            text = para.text.strip()
            if not text:
                # Flush code buffer if we hit a blank paragraph
                if code_buffer:
                    code_text = "\n".join(code_buffer)
                    total_chars += len(code_text)
                    location = (
                        f"Paragraph {code_start_para} "
                        f"(code block)"
                    )
                    seg = self._make_code_segment(
                        code_text, location, global_char_offset
                    )
                    if seg:
                        segments.append(seg)
                    global_char_offset += len(code_text) + 1
                    code_buffer = []
                    code_start_para = None
                continue

            style_name = (
                para.style.name.lower() if para.style else ""
            )
            is_code = any(
                cs in style_name for cs in code_styles
            )

            para_index += 1

            if is_code:
                if code_start_para is None:
                    code_start_para = para_index
                code_buffer.append(text)
            else:
                # Flush any pending code buffer
                if code_buffer:
                    code_text = "\n".join(code_buffer)
                    total_chars += len(code_text)
                    location = (
                        f"Paragraph {code_start_para} (code block)"
                    )
                    seg = self._make_code_segment(
                        code_text, location, global_char_offset
                    )
                    if seg:
                        segments.append(seg)
                    global_char_offset += len(code_text) + 1
                    code_buffer = []
                    code_start_para = None

                total_chars += len(text)
                location = f"Paragraph {para_index}"
                seg = self._make_text_segment(
                    text, location, global_char_offset
                )
                if seg:
                    segments.append(seg)
                global_char_offset += len(text) + 1

        # Flush any remaining code buffer
        if code_buffer:
            code_text = "\n".join(code_buffer)
            total_chars += len(code_text)
            location = f"Paragraph {code_start_para} (code block)"
            seg = self._make_code_segment(
                code_text, location, global_char_offset
            )
            if seg:
                segments.append(seg)

        return segments, total_chars

# ---------------------------------------------------------------------------
# Excel Reporter
# ---------------------------------------------------------------------------

class ExcelReporter:
    """
    Generates a formatted Excel workbook from the analyzed segments.
    Includes color coding, frozen headers, a summary sheet, and
    auto-sized columns.
    """

    COLOR_HIGH_LLM = "FF4444"    # Red: >= 70%
    COLOR_MEDIUM_LLM = "FFAA00"  # Orange: 40-69%
    COLOR_LOW_LLM = "44BB44"     # Green: < 40%
    COLOR_HEADER = "1F4E79"      # Dark blue
    COLOR_HEADER_FONT = "FFFFFF" # White
    COLOR_ROW_ALT = "F2F7FF"     # Light blue for alternating rows

    def _likelihood_color(self, score: float) -> str:
        if score >= 70:
            return self.COLOR_HIGH_LLM
        elif score >= 40:
            return self.COLOR_MEDIUM_LLM
        return self.COLOR_LOW_LLM

    def _format_indicators(self, judge_result: dict,
                            stat_features: dict) -> str:
        """
        Builds the full indicators text for the Excel cell,
        combining statistical summary and LLM judge output.
        """
        parts = []

        # Statistical summary
        stat_lines = ["=== STATISTICAL ANALYSIS ==="]
        ppl = stat_features.get("perplexity", -1)
        if ppl > 0:
            stat_lines.append(
                f"Perplexity: {ppl:.1f} "
                f"(lower = more predictable = more LLM-like)"
            )
        cv = stat_features.get("sentence_length_cv", -1)
        if cv >= 0:
            stat_lines.append(
                f"Sentence length CV: {cv:.3f} "
                f"(< 0.2 is LLM-like, > 0.4 is human-like)"
            )
        density = stat_features.get("llm_phrase_density", -1)
        if density >= 0:
            stat_lines.append(
                f"LLM phrase density: {density:.2f} per 100 words"
            )
            matched = stat_features.get("matched_phrases", [])
            if matched:
                stat_lines.append(
                    f"Matched phrases: {', '.join(matched[:8])}"
                )
        comment_ratio = stat_features.get("comment_ratio", -1)
        if comment_ratio >= 0:
            stat_lines.append(
                f"Comment ratio: {comment_ratio:.1%} "
                f"(> 40% is LLM-like)"
            )
        generic_except = stat_features.get("generic_except_count", -1)
        if generic_except >= 0:
            stat_lines.append(
                f"Generic exception handlers: {generic_except}"
            )
        mattr = stat_features.get("mattr", -1)
        if mattr >= 0:
            stat_lines.append(f"Vocabulary richness (MATTR): {mattr:.3f}")
        formality = stat_features.get("formality_score", -1)
        if formality >= 0:
            stat_lines.append(f"Formality score: {formality:.3f}")

        parts.append("\n".join(stat_lines))

        # LLM judge output
        judge_lines = ["=== LLM JUDGE ANALYSIS ==="]
        confidence = judge_result.get("confidence", "unknown")
        judge_lines.append(f"Judge confidence: {confidence}")

        indicators = judge_result.get("indicators", [])
        if indicators:
            judge_lines.append("LLM Indicators:")
            for ind in indicators:
                sev = ind.get("severity", "?").upper()
                typ = ind.get("type", "?")
                ev = ind.get("evidence", "")
                judge_lines.append(f"  [{sev}] {typ}")
                if ev:
                    judge_lines.append(f"    Evidence: \"{ev[:150]}\"")

        human_inds = judge_result.get("human_indicators", [])
        if human_inds:
            judge_lines.append("Human-like Indicators:")
            for ind in human_inds:
                typ = ind.get("type", "?")
                ev = ind.get("evidence", "")
                judge_lines.append(f"  [HUMAN] {typ}")
                if ev:
                    judge_lines.append(f"    Evidence: \"{ev[:150]}\"")

        reasoning = judge_result.get("reasoning", "")
        if reasoning:
            judge_lines.append(f"Reasoning: {reasoning}")

        parts.append("\n".join(judge_lines))

        return "\n\n".join(parts)

    def generate(self, segments: list[Segment], output_path: str,
                 total_chars: int) -> None:
        """
        Generates and saves the Excel report.
        """
        wb = openpyxl.Workbook()
        ws = wb.active
        ws.title = "Detection Report"

        headers = [
            "Text / Code Excerpt",
            "Type",
            "Location",
            "LLM Likelihood (%)",
            "Proofs & Indicators",
            "Contribution to Document (%)"
        ]

        # Header row styling
        header_fill = PatternFill(
            start_color=self.COLOR_HEADER,
            end_color=self.COLOR_HEADER,
            fill_type="solid"
        )
        header_font = Font(
            color=self.COLOR_HEADER_FONT,
            bold=True,
            size=11,
            name="Calibri"
        )
        thin_border = Border(
            left=Side(style="thin"),
            right=Side(style="thin"),
            top=Side(style="thin"),
            bottom=Side(style="thin")
        )

        for col_idx, header in enumerate(headers, start=1):
            cell = ws.cell(row=1, column=col_idx, value=header)
            cell.fill = header_fill
            cell.font = header_font
            cell.border = thin_border
            cell.alignment = Alignment(
                horizontal="center",
                vertical="center",
                wrap_text=True
            )

        ws.freeze_panes = "A2"
        ws.row_dimensions[1].height = 30

        alt_fill = PatternFill(
            start_color=self.COLOR_ROW_ALT,
            end_color=self.COLOR_ROW_ALT,
            fill_type="solid"
        )

        for row_idx, segment in enumerate(segments, start=2):
            # Prepare cell values
            excerpt = segment.content
            if len(excerpt) > 400:
                excerpt = excerpt[:400] + "..."

            contribution = round(
                segment.char_length / total_chars * 100, 2
            ) if total_chars > 0 else 0.0

            indicators_text = self._format_indicators(
                segment.llm_judge_result,
                segment.statistical_features
            )

            row_values = [
                excerpt,
                segment.segment_type,
                segment.location,
                segment.final_score,
                indicators_text,
                contribution
            ]

            is_alt_row = (row_idx % 2 == 0)

            for col_idx, value in enumerate(row_values, start=1):
                cell = ws.cell(row=row_idx, column=col_idx, value=value)
                cell.border = thin_border

                if col_idx == 4:
                    # Likelihood column: color-coded
                    color = self._likelihood_color(segment.final_score)
                    cell.fill = PatternFill(
                        start_color=color,
                        end_color=color,
                        fill_type="solid"
                    )
                    cell.font = Font(bold=True, color="FFFFFF", size=12)
                    cell.alignment = Alignment(
                        horizontal="center",
                        vertical="center"
                    )
                elif col_idx == 5:
                    # Indicators column: left-aligned, wrapped
                    if is_alt_row:
                        cell.fill = alt_fill
                    cell.alignment = Alignment(
                        vertical="top",
                        wrap_text=True,
                        horizontal="left"
                    )
                    cell.font = Font(size=9, name="Courier New")
                else:
                    if is_alt_row:
                        cell.fill = alt_fill
                    cell.alignment = Alignment(
                        vertical="top",
                        wrap_text=True
                    )

            ws.row_dimensions[row_idx].height = 120

        # Set column widths
        col_widths = [55, 10, 35, 18, 90, 18]
        for col_idx, width in enumerate(col_widths, start=1):
            ws.column_dimensions[
                get_column_letter(col_idx)
            ].width = width

        # Add summary sheet
        self._add_summary_sheet(wb, segments, total_chars)

        wb.save(output_path)
        logger.info("Report saved to: %s", output_path)

    def _add_summary_sheet(self, wb: openpyxl.Workbook,
                            segments: list[Segment],
                            total_chars: int) -> None:
        """Creates the summary sheet with overall statistics."""
        ws = wb.create_sheet("Summary", 0)
        wb.active = ws

        if not segments:
            ws["A1"] = "No segments analyzed."
            return

        scores = [s.final_score for s in segments]
        weighted_score = (
            sum(s.final_score * s.char_length for s in segments)
            / total_chars
        ) if total_chars > 0 else float(np.mean(scores))

        high = [s for s in segments if s.final_score >= 70]
        medium = [s for s in segments if 40 <= s.final_score < 70]
        low = [s for s in segments if s.final_score < 40]

        if weighted_score >= 65:
            verdict = "LIKELY LLM-GENERATED"
            verdict_color = "FF4444"
        elif weighted_score >= 35:
            verdict = "MIXED (partial LLM use likely)"
            verdict_color = "FFAA00"
        else:
            verdict = "LIKELY HUMAN-WRITTEN"
            verdict_color = "44BB44"

        header_fill = PatternFill(
            start_color=self.COLOR_HEADER,
            end_color=self.COLOR_HEADER,
            fill_type="solid"
        )
        header_font = Font(
            color=self.COLOR_HEADER_FONT,
            bold=True,
            size=14
        )

        title_cell = ws.cell(row=1, column=1, value="LLMSleuth Analysis Summary")
        title_cell.fill = header_fill
        title_cell.font = header_font
        title_cell.alignment = Alignment(horizontal="center")
        ws.merge_cells("A1:B1")

        summary_rows = [
            ("", ""),
            ("OVERALL RESULTS", ""),
            ("Weighted LLM Likelihood", f"{weighted_score:.1f}%"),
            ("Average LLM Likelihood", f"{float(np.mean(scores)):.1f}%"),
            ("Maximum LLM Likelihood", f"{max(scores):.1f}%"),
            ("Minimum LLM Likelihood", f"{min(scores):.1f}%"),
            ("", ""),
            ("SEGMENT BREAKDOWN", ""),
            ("Total segments analyzed", len(segments)),
            ("High likelihood (>= 70%)", len(high)),
            ("Medium likelihood (40-69%)", len(medium)),
            ("Low likelihood (< 40%)", len(low)),
            ("", ""),
            ("VERDICT", verdict),
        ]

        for row_offset, (label, value) in enumerate(
            summary_rows, start=2
        ):
            label_cell = ws.cell(
                row=row_offset, column=1, value=label
            )
            value_cell = ws.cell(
                row=row_offset, column=2, value=value
            )

            if label in (
                "OVERALL RESULTS", "SEGMENT BREAKDOWN", "VERDICT"
            ):
                label_cell.font = Font(bold=True, size=11)

            if label == "VERDICT":
                value_cell.fill = PatternFill(
                    start_color=verdict_color,
                    end_color=verdict_color,
                    fill_type="solid"
                )
                value_cell.font = Font(
                    bold=True, color="FFFFFF", size=12
                )

        ws.column_dimensions["A"].width = 40
        ws.column_dimensions["B"].width = 35

# ---------------------------------------------------------------------------
# Analysis Pipeline
# ---------------------------------------------------------------------------

class AnalysisPipeline:
    """
    Orchestrates the full analysis pipeline:
    1. File ingestion and segmentation
    2. Statistical analysis
    3. LLM judge analysis
    4. Ensemble scoring
    """

    def __init__(self, backend: LLMBackend,
                 perplexity_model: Optional[str] = "gpt2",
                 min_segment_length: int = 50,
                 verbose: bool = False):
        self.backend = backend
        self.verbose = verbose
        self._log = logging.getLogger("llmsleuth.pipeline")

        if verbose:
            logging.getLogger("llmsleuth").setLevel(logging.DEBUG)

        # Initialize components
        self.ingester = FileIngester(
            min_segment_length=min_segment_length
        )

        if perplexity_model:
            self.ppl_scorer = PerplexityScorer(perplexity_model)
        else:
            self.ppl_scorer = None
            self._log.info(
                "Perplexity scoring disabled (--no-perplexity)"
            )

        self.stat_analyzer = StatisticalAnalyzer(self.ppl_scorer)
        self.judge = LLMJudgeEngine(backend)
        self.ensemble = EnsembleScoringEngine()

    def run(self, file_path: str) -> tuple[list[Segment], int]:
        """
        Runs the full analysis pipeline on the given file.
        Returns (analyzed_segments, total_char_count).
        """
        # Step 1: Ingest the file
        self._log.info("Step 1: Ingesting file: %s", file_path)
        segments, total_chars = self.ingester.ingest(file_path)
        self._log.info(
            "Ingested %d segments (%d total characters)",
            len(segments), total_chars
        )

        if not segments:
            self._log.warning("No segments found in the file.")
            return [], total_chars

        # Step 2: Statistical analysis
        self._log.info(
            "Step 2: Running statistical analysis on %d segments...",
            len(segments)
        )
        for i, segment in enumerate(segments):
            if self.verbose:
                self._log.debug(
                    "  Statistical analysis: segment %d/%d (%s)",
                    i + 1, len(segments), segment.location
                )
            if segment.segment_type == "text":
                segment.statistical_features = (
                    self.stat_analyzer.analyze_text(segment.content)
                )
            else:
                segment.statistical_features = (
                    self.stat_analyzer.analyze_code(segment.content)
                )

        # Step 3: LLM judge analysis
        self._log.info(
            "Step 3: Running LLM judge analysis "
            "(backend: %s)...",
            self.backend.backend_name()
        )
        for i, segment in enumerate(segments):
            self._log.info(
                "  Judging segment %d/%d: %s",
                i + 1, len(segments), segment.location
            )
            if segment.segment_type == "text":
                segment.llm_judge_result = (
                    self.judge.analyze_text(segment.content)
                )
            else:
                segment.llm_judge_result = (
                    self.judge.analyze_code(segment.content)
                )

            if self.verbose:
                score = segment.llm_judge_result.get(
                    "likelihood_llm", "?"
                )
                self._log.debug(
                    "    Judge score: %s%%", score
                )

        # Step 4: Ensemble scoring
        self._log.info("Step 4: Computing ensemble scores...")
        for segment in segments:
            segment.final_score = self.ensemble.compute_final_score(
                segment
            )

        # Sort by final score descending for the report
        segments.sort(key=lambda s: s.final_score, reverse=True)

        self._log.info("Pipeline complete.")
        return segments, total_chars

# ---------------------------------------------------------------------------
# Command-Line Interface
# ---------------------------------------------------------------------------

def create_argument_parser() -> argparse.ArgumentParser:
    """Creates and returns the CLI argument parser."""
    parser = argparse.ArgumentParser(
        prog="llmsleuth",
        description=(
            "LLMSleuth: Detect AI-generated content in documents and code.\n\n"
            "Analyzes text and code files for signs of LLM generation using\n"
            "a combination of statistical analysis, stylometry, perplexity\n"
            "scoring, and LLM-as-judge evaluation. Produces a detailed Excel\n"
            "report with per-passage likelihood scores and evidence.\n\n"
            "Supported input formats: .txt, .py, .pdf, .docx\n"
            "Supported backends: ollama, openai, anthropic, transformers"
        ),
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=(
            "Examples:\n"
            "  python llmsleuth.py essay.pdf --backend ollama\n"
            "  python llmsleuth.py solution.py --backend openai "
            "--model gpt-4o\n"
            "  python llmsleuth.py report.docx --backend anthropic\n"
            "  python llmsleuth.py notes.txt --backend transformers "
            "--model microsoft/phi-2\n"
            "  python llmsleuth.py code.py --backend ollama "
            "--model mistral --no-perplexity"
        )
    )

    parser.add_argument(
        "input_file",
        help="Path to the file to analyze"
    )

    parser.add_argument(
        "-o", "--output",
        default=None,
        help=(
            "Output Excel file path "
            "(default: <input_filename>_llmsleuth_report.xlsx)"
        )
    )

    parser.add_argument(
        "--backend",
        choices=["ollama", "openai", "anthropic", "transformers"],
        default="ollama",
        help="LLM backend for the judge engine (default: ollama)"
    )

    parser.add_argument(
        "--model",
        default=None,
        help=(
            "Model name for the selected backend. "
            "Defaults: ollama=llama3.2, openai=gpt-4o, "
            "anthropic=claude-3-5-sonnet-20241022, "
            "transformers=microsoft/phi-2"
        )
    )

    parser.add_argument(
        "--perplexity-model",
        default="gpt2",
        metavar="MODEL",
        help=(
            "HuggingFace model for perplexity scoring "
            "(default: gpt2). "
            "Larger models give better accuracy but are slower."
        )
    )

    parser.add_argument(
        "--no-perplexity",
        action="store_true",
        help="Disable perplexity scoring for faster analysis"
    )

    parser.add_argument(
        "--api-key",
        default=None,
        metavar="KEY",
        help=(
            "API key for remote backends. "
            "Alternatively, set OPENAI_API_KEY or ANTHROPIC_API_KEY."
        )
    )

    parser.add_argument(
        "--ollama-host",
        default="http://localhost:11434",
        metavar="URL",
        help="Ollama server URL (default: http://localhost:11434)"
    )

    parser.add_argument(
        "--min-segment-length",
        type=int,
        default=50,
        metavar="N",
        help=(
            "Minimum character length for a segment to be analyzed "
            "(default: 50)"
        )
    )

    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Enable verbose logging output"
    )

    return parser


def main() -> int:
    """
    Main entry point. Returns 0 on success, 1 on error.
    """
    parser = create_argument_parser()
    args = parser.parse_args()

    if args.verbose:
        logging.getLogger().setLevel(logging.DEBUG)

    # Validate input file
    input_path = Path(args.input_file)
    if not input_path.exists():
        logger.error("Input file not found: %s", input_path)
        return 1

    if not input_path.is_file():
        logger.error("Input path is not a file: %s", input_path)
        return 1

    # Determine output path
    if args.output:
        output_path = args.output
    else:
        output_path = str(
            input_path.parent
            / f"{input_path.stem}_llmsleuth_report.xlsx"
        )

    # Print startup banner
    print("=" * 60)
    print("  LLMSleuth: AI Content Detection Tool")
    print("=" * 60)
    print(f"  Input:    {input_path}")
    print(f"  Output:   {output_path}")
    print(f"  Backend:  {args.backend}")
    if args.model:
        print(f"  Model:    {args.model}")
    if not args.no_perplexity:
        print(f"  PPL Model: {args.perplexity_model}")
    print("=" * 60)
    print()

    # Create and validate the backend
    try:
        backend = create_backend(args)
    except ValueError as exc:
        logger.error("Backend creation failed: %s", exc)
        return 1

    print(f"Checking backend availability ({backend.backend_name()})...")
    if not backend.is_available():
        logger.error(
            "Backend '%s' is not available. "
            "Check your configuration and try again.",
            args.backend
        )
        if args.backend == "ollama":
            print(
                "\nTip: Make sure Ollama is running and the model "
                "is downloaded.\n"
                "     Run: ollama serve\n"
                "     Run: ollama pull llama3.2"
            )
        elif args.backend in ("openai", "anthropic"):
            print(
                "\nTip: Make sure your API key is set correctly.\n"
                "     Use --api-key or set the environment variable."
            )
        return 1

    print(f"Backend OK: {backend.backend_name()}")
    print()

    # Run the analysis pipeline
    try:
        perplexity_model = (
            None if args.no_perplexity else args.perplexity_model
        )

        pipeline = AnalysisPipeline(
            backend=backend,
            perplexity_model=perplexity_model,
            min_segment_length=args.min_segment_length,
            verbose=args.verbose
        )

        start_time = time.time()
        segments, total_chars = pipeline.run(str(input_path))
        elapsed = time.time() - start_time

        print(f"\nAnalysis complete in {elapsed:.1f} seconds.")
        print(f"Analyzed {len(segments)} segments "
              f"({total_chars} total characters).")

    except Exception as exc:
        logger.error("Analysis pipeline failed: %s", exc, exc_info=True)
        return 1

    if not segments:
        print("No analyzable segments found in the file.")
        return 0

    # Generate the report
    try:
        reporter = ExcelReporter()
        reporter.generate(segments, output_path, total_chars)
    except Exception as exc:
        logger.error("Report generation failed: %s", exc, exc_info=True)
        return 1

    # Print a brief summary to the console
    scores = [s.final_score for s in segments]
    weighted = (
        sum(s.final_score * s.char_length for s in segments) / total_chars
    ) if total_chars > 0 else float(np.mean(scores))

    print()
    print("=" * 60)
    print("  SUMMARY")
    print("=" * 60)
    print(f"  Weighted LLM Likelihood: {weighted:.1f}%")
    print(f"  Average LLM Likelihood:  {float(np.mean(scores)):.1f}%")
    print(f"  Highest segment score:   {max(scores):.1f}%")
    high_count = sum(1 for s in scores if s >= 70)
    print(f"  High-likelihood segments: {high_count}/{len(segments)}")

    if weighted >= 65:
        verdict = "LIKELY LLM-GENERATED"
    elif weighted >= 35:
        verdict = "MIXED (partial LLM use likely)"
    else:
        verdict = "LIKELY HUMAN-WRITTEN"

    print(f"  Verdict: {verdict}")
    print("=" * 60)
    print(f"\nFull report: {output_path}")

    return 0


if __name__ == "__main__":
    sys.exit(main())

This completes the full production-ready implementation of LLMSleuth. The tool can be invoked directly from the command line and supports all four file formats, all four LLM backends, and all hardware acceleration options described in the article. The Excel report provides a comprehensive, evidence-based assessment of each passage in the document, with color coding, statistical summaries, and LLM judge reasoning.

To run the tool on a PDF essay using a local Ollama model, the command is:

python llmsleuth.py student_essay.pdf --backend ollama --model llama3.2

To run it on a Python assignment using the OpenAI API with verbose output:

python llmsleuth.py assignment.py --backend openai --model gpt-4o \
    --verbose

The resulting Excel file will open in Microsoft Excel, LibreOffice Calc, or any compatible spreadsheet application, and will immediately show which parts of the document are most likely to have been generated by an LLM, along with the specific evidence that supports that assessment.
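The weighted likelihood that drives the verdict, both on the summary sheet and in the console output, is simply a character-weighted mean of the per-segment scores, so long segments count for more than short ones. As a standalone sketch of that calculation (the helper name is illustrative, not part of the tool):

```python
def weighted_likelihood(segments):
    """Character-weighted mean of per-segment LLM likelihood scores.

    segments: list of (score_percent, char_length) tuples.
    Returns 0.0 for an empty document, mirroring the tool's fallback.
    """
    total = sum(n for _, n in segments)
    if total == 0:
        return 0.0
    return sum(score * n for score, n in segments) / total

# One long high-scoring segment dominates two short low-scoring ones:
print(weighted_likelihood([(80.0, 900), (20.0, 50), (10.0, 50)]))  # prints 73.5
```

This is why a document with a single large LLM-like block can cross the 65% "LIKELY LLM-GENERATED" threshold even when most of its segments score low.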
