INTRODUCTION
Creating an intelligent agent that can understand statistical problems, recommend appropriate tests, and generate executable code represents a significant advancement in automated data analysis. This article explores the design and implementation of such a system, focusing on the technical architecture and practical considerations that software engineers need to understand when building these sophisticated tools.
The fundamental challenge lies in bridging the gap between natural language problem descriptions and rigorous statistical implementations. Users often describe their analytical needs in informal terms, such as "I want to know if these two groups are different" or "Is there a relationship between these variables?" The agent must translate these requests into precise statistical frameworks, select appropriate methodologies, and generate robust, executable code.
Note: the full code of an LLM-generated solution appears at the end of the article.
UNDERSTANDING THE CORE ARCHITECTURE
The architecture of an LLM-based statistical agent consists of several interconnected components that work together to process user requests and generate appropriate responses. The primary components include a natural language processing module for understanding user intent, a statistical knowledge base that contains information about various tests and their applications, a decision engine that selects appropriate statistical methods, and a code generation system that produces executable implementations.
The natural language processing component serves as the entry point for user interactions. This module must parse user descriptions to extract key information such as the type of data being analyzed, the research question being asked, the number of groups or variables involved, and any specific constraints or assumptions mentioned by the user. The extraction process requires sophisticated understanding of statistical terminology and the ability to infer implicit information from context.
The statistical knowledge base contains structured information about various statistical tests, including their assumptions, appropriate use cases, required data types, and implementation details. This knowledge base must be comprehensive enough to cover common statistical scenarios while remaining organized in a way that allows efficient retrieval based on problem characteristics.
The decision engine uses the extracted problem characteristics to query the knowledge base and identify suitable statistical tests. This component must consider multiple factors simultaneously, including data type compatibility, sample size requirements, distributional assumptions, and the specific research question being addressed.
Finally, the code generation system translates the selected statistical method into executable code. This component must produce not only the core statistical calculations but also appropriate data validation, assumption checking, and result interpretation.
IMPLEMENTING PROBLEM ANALYSIS
The problem analysis module represents the critical first step in the agent's workflow. This component must extract structured information from unstructured natural language descriptions. The extraction process involves identifying several key elements that determine the appropriate statistical approach.
Data type identification forms a fundamental part of problem analysis. The agent must determine whether the user is working with continuous numerical data, categorical data, ordinal data, or mixed types. This determination often requires understanding context clues and domain-specific terminology. For example, when a user mentions "survey responses on a 5-point scale," the agent should recognize this as ordinal data, while "reaction times" clearly indicates continuous numerical data.
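To make this step concrete, a minimal sketch of keyword-based data type inference might look like the following; the keyword lists and the infer_data_type name are illustrative assumptions rather than the agent's actual rules.

# Minimal sketch of rule-based data type inference; the keyword lists are
# illustrative assumptions, not an exhaustive taxonomy.
def infer_data_type(description: str) -> str:
    text = description.lower()
    if any(kw in text for kw in ("5-point scale", "likert", "rating", "rank")):
        return "ordinal"
    if any(kw in text for kw in ("yes/no", "success", "failure", "passed")):
        return "binary"
    if any(kw in text for kw in ("category", "group label", "type of")):
        return "categorical"
    # Default to continuous for measurements such as reaction times or durations
    return "continuous"

print(infer_data_type("survey responses on a 5-point scale"))  # ordinal
print(infer_data_type("reaction times in milliseconds"))       # continuous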
Sample structure analysis involves understanding how the data is organized and whether observations are independent or related. The agent must distinguish between independent samples, paired samples, repeated measures, and nested or hierarchical data structures. This distinction is crucial because it directly impacts the choice of statistical method.
Research question classification requires the agent to understand what type of relationship or difference the user wants to investigate. Common categories include comparing means between groups, testing for associations between variables, examining trends over time, or assessing the strength of relationships.
Let me demonstrate this with a concrete example. Consider a user request: "I have reaction time measurements from 30 participants who completed a task under two different lighting conditions. I want to know if lighting affects performance."
The problem analysis module would extract the following information: The data type is continuous (reaction times), the sample structure involves paired observations (same participants under different conditions), the research question involves comparing means between two related groups, and the expected statistical approach would be a paired t-test.
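A natural way to carry this extracted information forward is a small structured record. The sketch below shows what that might look like for the lighting example; the field names are illustrative, and the full implementation at the end of the article uses a richer ProblemCharacteristics dataclass.

from dataclasses import dataclass

# Sketch of the structured output of problem analysis for the lighting example;
# the field names are illustrative.
@dataclass
class ExtractedProblem:
    data_type: str          # e.g. "continuous"
    sample_structure: str   # e.g. "paired"
    research_question: str  # e.g. "compare_means"
    num_groups: int
    sample_size: int

lighting_problem = ExtractedProblem(
    data_type="continuous",
    sample_structure="paired",
    research_question="compare_means",
    num_groups=2,
    sample_size=30,
)
print(lighting_problem)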
DESIGNING THE TEST SELECTION LOGIC
The test selection logic represents the core intelligence of the statistical agent. This component must navigate the complex landscape of statistical methods to identify the most appropriate test for a given problem. The selection process involves multiple decision points and considerations that must be evaluated systematically.
The primary decision tree begins with the research question type. For comparing groups, the agent must consider the number of groups, whether observations are independent or paired, and the distributional properties of the data. For examining relationships, the agent must determine the types of variables involved and the nature of the expected relationship.
Assumption checking plays a critical role in test selection. Different statistical tests have different requirements regarding data distribution, sample size, homogeneity of variance, and independence of observations. The agent must not only select tests based on these assumptions but also generate code to verify that the assumptions are met.
Consider the decision process for comparing two groups. The agent must first determine whether the groups are independent or paired. For independent groups, it must then assess whether the data meets the assumptions for a t-test, including normality and equal variances. If these assumptions are violated, the agent should recommend alternative approaches such as the Mann-Whitney U test for non-normal data or Welch's t-test for unequal variances.
The following code example illustrates how the agent might implement assumption checking for a two-sample t-test:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def check_normality(data, group_name):
    """Check whether data follows a normal distribution using the Shapiro-Wilk test."""
    statistic, p_value = stats.shapiro(data)
    print(f"Normality test for {group_name}:")
    print(f"  Shapiro-Wilk statistic: {statistic:.4f}")
    print(f"  p-value: {p_value:.4f}")
    if p_value > 0.05:
        print("  Result: Data appears normally distributed (p > 0.05)")
        return True
    else:
        print("  Result: Data may not be normally distributed (p <= 0.05)")
        return False


def check_equal_variances(group1, group2):
    """Check whether two groups have equal variances using Levene's test."""
    statistic, p_value = stats.levene(group1, group2)
    print("Equal variances test:")
    print(f"  Levene's statistic: {statistic:.4f}")
    print(f"  p-value: {p_value:.4f}")
    if p_value > 0.05:
        print("  Result: Variances appear equal (p > 0.05)")
        return True
    else:
        print("  Result: Variances may not be equal (p <= 0.05)")
        return False
This code demonstrates how the agent can systematically verify the assumptions underlying statistical tests. The normality check uses the Shapiro-Wilk test, which is appropriate for small to moderate sample sizes. The equal variance check employs Levene's test, which is robust to departures from normality.
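Building on these helpers, the selection step can be sketched as a small dispatcher that falls back to Welch's t-test or the Mann-Whitney U test when a check fails. This is a simplified illustration of the idea, not the exact logic of the full agent.

# Simplified sketch: choose a two-sample test from the assumption checks above.
def select_two_sample_test(group1, group2):
    normal = check_normality(group1, "Group 1") and check_normality(group2, "Group 2")
    equal_vars = check_equal_variances(group1, group2)
    if not normal:
        stat, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
        return "Mann-Whitney U test", stat, p
    if not equal_vars:
        stat, p = stats.ttest_ind(group1, group2, equal_var=False)
        return "Welch's t-test", stat, p
    stat, p = stats.ttest_ind(group1, group2, equal_var=True)
    return "Student's t-test", stat, p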
BUILDING THE CODE GENERATION ENGINE
The code generation engine transforms the selected statistical method into executable code that implements the complete analysis workflow. This component must produce code that is not only statistically correct but also robust, well-documented, and interpretable.
The generated code typically follows a standard structure that includes data import and preprocessing, assumption checking, test execution, and result interpretation. Each section must be implemented with appropriate error handling and user feedback.
Data preprocessing often requires handling missing values, outliers, and data type conversions. The agent must generate code that addresses these common data quality issues while providing transparency about the preprocessing steps taken.
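As a minimal sketch of such a preprocessing step, the helper below (its name and reporting format are illustrative assumptions) coerces the input to floats, drops missing values, and reports what was removed.

import numpy as np

# Sketch of a transparent preprocessing step: coerce to float, drop NaNs,
# and report how many observations were removed.
def preprocess_numeric(values, label="sample"):
    arr = np.asarray(values, dtype=float)
    clean = arr[~np.isnan(arr)]
    dropped = len(arr) - len(clean)
    if dropped:
        print(f"{label}: removed {dropped} missing value(s) out of {len(arr)}")
    return clean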
The actual test implementation must include not only the core statistical calculation but also confidence intervals, effect sizes, and other relevant metrics that aid in interpretation. The agent should generate code that provides comprehensive output rather than just a p-value.
Result interpretation represents a crucial component of the generated code. The agent must produce code that translates statistical output into meaningful conclusions, taking into account the original research question and the practical significance of the findings.
Here's an example of how the agent might generate a complete t-test implementation:
def perform_independent_ttest(group1, group2, group1_name="Group 1", group2_name="Group 2", alpha=0.05):
    """
    Perform an independent samples t-test with comprehensive output.

    Parameters:
        group1, group2: array-like, the two groups to compare
        group1_name, group2_name: str, names for the groups
        alpha: float, significance level

    Returns:
        dict: comprehensive results including test statistic, p-value,
              confidence interval, and effect size
    """
    # Convert to numpy arrays for consistency
    group1 = np.array(group1)
    group2 = np.array(group2)

    # Remove missing values
    group1_clean = group1[~np.isnan(group1)]
    group2_clean = group2[~np.isnan(group2)]

    print("Independent Samples T-Test")
    print(f"Comparing {group1_name} (n={len(group1_clean)}) vs {group2_name} (n={len(group2_clean)})")
    print("-" * 60)

    # Descriptive statistics
    mean1, std1 = np.mean(group1_clean), np.std(group1_clean, ddof=1)
    mean2, std2 = np.mean(group2_clean), np.std(group2_clean, ddof=1)
    print("Descriptive Statistics:")
    print(f"  {group1_name}: Mean = {mean1:.4f}, SD = {std1:.4f}")
    print(f"  {group2_name}: Mean = {mean2:.4f}, SD = {std2:.4f}")
    print(f"  Mean difference = {mean1 - mean2:.4f}")
    print()

    # Check assumptions
    normal1 = check_normality(group1_clean, group1_name)
    normal2 = check_normality(group2_clean, group2_name)
    equal_vars = check_equal_variances(group1_clean, group2_clean)
    print()

    # Perform the appropriate t-test based on the equal-variance check
    if equal_vars:
        t_stat, p_value = stats.ttest_ind(group1_clean, group2_clean, equal_var=True)
        test_type = "Student's t-test (equal variances)"
    else:
        t_stat, p_value = stats.ttest_ind(group1_clean, group2_clean, equal_var=False)
        test_type = "Welch's t-test (unequal variances)"

    # Calculate degrees of freedom
    if equal_vars:
        df = len(group1_clean) + len(group2_clean) - 2
    else:
        # Welch-Satterthwaite equation
        s1_sq = np.var(group1_clean, ddof=1)
        s2_sq = np.var(group2_clean, ddof=1)
        n1, n2 = len(group1_clean), len(group2_clean)
        df = (s1_sq/n1 + s2_sq/n2)**2 / ((s1_sq/n1)**2/(n1-1) + (s2_sq/n2)**2/(n2-1))

    # Confidence interval for the difference in means (unpooled standard error)
    se_diff = np.sqrt(np.var(group1_clean, ddof=1)/len(group1_clean) +
                      np.var(group2_clean, ddof=1)/len(group2_clean))
    t_critical = stats.t.ppf(1 - alpha/2, df)
    ci_lower = (mean1 - mean2) - t_critical * se_diff
    ci_upper = (mean1 - mean2) + t_critical * se_diff

    # Calculate Cohen's d (effect size)
    if equal_vars:
        pooled_std = np.sqrt(((len(group1_clean)-1)*np.var(group1_clean, ddof=1) +
                              (len(group2_clean)-1)*np.var(group2_clean, ddof=1)) /
                             (len(group1_clean) + len(group2_clean) - 2))
    else:
        pooled_std = np.sqrt((np.var(group1_clean, ddof=1) + np.var(group2_clean, ddof=1)) / 2)
    cohens_d = (mean1 - mean2) / pooled_std

    # Print results
    print(f"Test Results ({test_type}):")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  degrees of freedom: {df:.2f}")
    print(f"  p-value: {p_value:.6f}")
    print(f"  {100*(1-alpha):.0f}% Confidence Interval: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print(f"  Cohen's d (effect size): {cohens_d:.4f}")
    print()

    # Interpret results
    print("Interpretation:")
    if p_value < alpha:
        print(f"  The difference between groups is statistically significant (p < {alpha})")
    else:
        print(f"  The difference between groups is not statistically significant (p >= {alpha})")

    # Effect size interpretation
    abs_d = abs(cohens_d)
    if abs_d < 0.2:
        effect_size_desc = "negligible"
    elif abs_d < 0.5:
        effect_size_desc = "small"
    elif abs_d < 0.8:
        effect_size_desc = "medium"
    else:
        effect_size_desc = "large"
    print(f"  The effect size is {effect_size_desc} (|d| = {abs_d:.4f})")

    return {
        'test_type': test_type,
        't_statistic': t_stat,
        'p_value': p_value,
        'degrees_of_freedom': df,
        'mean_difference': mean1 - mean2,
        'confidence_interval': (ci_lower, ci_upper),
        'cohens_d': cohens_d,
        'significant': p_value < alpha
    }
This comprehensive implementation demonstrates how the agent generates code that goes beyond basic statistical calculations. The function includes assumption checking, appropriate test selection, comprehensive output, and practical interpretation of results.
WORKING THROUGH A COMPLETE EXAMPLE
To illustrate how all components work together, let's walk through a complete example from problem description to final implementation. Consider a user who submits the following request: "I collected sleep duration data from 25 people before and after implementing a new sleep hygiene program. I want to know if the program was effective."
The problem analysis module would extract the following key information: The data involves continuous measurements (sleep duration), the design uses paired observations (same people measured twice), the research question asks about the effectiveness of an intervention (comparing before and after), and the appropriate statistical approach would be a paired t-test.
The test selection logic would proceed as follows: Since we have paired observations comparing two time points, a paired t-test is the primary candidate. However, the agent must also consider the assumptions of normality for the difference scores and the possibility of using non-parametric alternatives if assumptions are violated.
The code generation engine would produce a complete implementation that handles data input, assumption checking, test execution, and result interpretation. Here's how the agent might generate the complete analysis:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def analyze_sleep_intervention(before_data, after_data):
    """
    Analyze the effectiveness of a sleep hygiene intervention using a paired t-test.

    Parameters:
        before_data: array-like, sleep duration before the intervention
        after_data: array-like, sleep duration after the intervention
    """
    # Convert to numpy arrays and handle missing data
    before = np.array(before_data)
    after = np.array(after_data)

    # Check for equal length
    if len(before) != len(after):
        raise ValueError("Before and after data must have the same length")

    # Remove pairs with missing values
    valid_pairs = ~(np.isnan(before) | np.isnan(after))
    before_clean = before[valid_pairs]
    after_clean = after[valid_pairs]

    print("Sleep Hygiene Intervention Analysis")
    print("=" * 50)
    print(f"Sample size: {len(before_clean)} participants")
    print()

    # Calculate difference scores
    differences = after_clean - before_clean

    # Descriptive statistics
    mean_before = np.mean(before_clean)
    mean_after = np.mean(after_clean)
    mean_diff = np.mean(differences)
    std_diff = np.std(differences, ddof=1)

    print("Descriptive Statistics:")
    print(f"  Before intervention: Mean = {mean_before:.2f} hours, SD = {np.std(before_clean, ddof=1):.2f}")
    print(f"  After intervention:  Mean = {mean_after:.2f} hours, SD = {np.std(after_clean, ddof=1):.2f}")
    print(f"  Mean change: {mean_diff:.2f} hours, SD = {std_diff:.2f}")
    print()

    # Check normality of the difference scores
    shapiro_stat, shapiro_p = stats.shapiro(differences)
    print("Assumption Checking:")
    print(f"  Normality of differences (Shapiro-Wilk): W = {shapiro_stat:.4f}, p = {shapiro_p:.4f}")
    if shapiro_p > 0.05:
        print("  Assumption met: Differences appear normally distributed")
        use_parametric = True
    else:
        print("  Assumption violated: Differences may not be normally distributed")
        print("  Will perform both parametric and non-parametric tests")
        use_parametric = False
    print()

    # Perform paired t-test
    t_stat, t_p = stats.ttest_rel(after_clean, before_clean)
    df = len(differences) - 1

    # Confidence interval for the mean difference
    se_diff = std_diff / np.sqrt(len(differences))
    t_critical = stats.t.ppf(0.975, df)  # for a 95% CI
    ci_lower = mean_diff - t_critical * se_diff
    ci_upper = mean_diff + t_critical * se_diff

    # Cohen's d for paired samples
    cohens_d = mean_diff / std_diff

    print("Paired T-Test Results:")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  degrees of freedom: {df}")
    print(f"  p-value: {t_p:.6f}")
    print(f"  95% Confidence Interval for mean difference: [{ci_lower:.3f}, {ci_upper:.3f}] hours")
    print(f"  Cohen's d (effect size): {cohens_d:.4f}")
    print()

    # If the normality assumption is violated, also run the Wilcoxon signed-rank test
    if not use_parametric:
        wilcoxon_stat, wilcoxon_p = stats.wilcoxon(after_clean, before_clean)
        print("Wilcoxon Signed-Rank Test (non-parametric alternative):")
        print(f"  Test statistic: {wilcoxon_stat:.4f}")
        print(f"  p-value: {wilcoxon_p:.6f}")
        print()

    # Interpretation
    print("Interpretation:")
    alpha = 0.05
    if t_p < alpha:
        direction = "increased" if mean_diff > 0 else "decreased"
        print(f"  The sleep hygiene intervention significantly {direction} sleep duration")
        print(f"  (p = {t_p:.6f} < {alpha})")
    else:
        print("  The sleep hygiene intervention did not significantly change sleep duration")
        print(f"  (p = {t_p:.6f} >= {alpha})")

    # Effect size interpretation
    abs_d = abs(cohens_d)
    if abs_d < 0.2:
        effect_desc = "negligible"
    elif abs_d < 0.5:
        effect_desc = "small"
    elif abs_d < 0.8:
        effect_desc = "medium"
    else:
        effect_desc = "large"
    print(f"  The effect size is {effect_desc} (Cohen's d = {cohens_d:.4f})")

    if mean_diff > 0:
        print(f"  On average, participants slept {mean_diff:.2f} hours longer after the intervention")
    else:
        print(f"  On average, participants slept {abs(mean_diff):.2f} hours less after the intervention")

    return {
        'mean_difference': mean_diff,
        't_statistic': t_stat,
        'p_value': t_p,
        'confidence_interval': (ci_lower, ci_upper),
        'cohens_d': cohens_d,
        'significant': t_p < alpha,
        'sample_size': len(differences)
    }


# Example usage with simulated data
np.random.seed(42)  # for reproducible results
before_sleep = np.random.normal(7.0, 1.2, 25)                 # mean 7 hours, SD 1.2
after_sleep = before_sleep + np.random.normal(0.5, 0.8, 25)   # average increase of 0.5 hours
results = analyze_sleep_intervention(before_sleep, after_sleep)
This complete example demonstrates how the agent integrates all components to provide a comprehensive analysis. The generated code includes data validation, assumption checking, appropriate test selection, comprehensive output, and practical interpretation of results.
HANDLING ADVANCED SCENARIOS AND ERROR CONDITIONS
Real-world statistical analysis often involves complications that a robust agent must handle gracefully. These include missing data, outliers, assumption violations, and ambiguous problem descriptions. The agent must be designed to detect these issues and provide appropriate guidance or alternative approaches.
Missing data handling requires the agent to determine whether missing values are random or systematic and to choose appropriate strategies for dealing with incomplete observations. For paired tests, the agent might remove pairs with missing values, while for independent samples, it might use all available data for each group.
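The two strategies described above can be sketched as small helpers; the function names are illustrative.

import numpy as np

# Sketch of the two deletion strategies described above.
def listwise_pairs(before, after):
    """Paired designs: keep only complete pairs."""
    before, after = np.asarray(before, float), np.asarray(after, float)
    keep = ~(np.isnan(before) | np.isnan(after))
    return before[keep], after[keep]

def available_cases(group):
    """Independent samples: keep all non-missing values within each group."""
    group = np.asarray(group, float)
    return group[~np.isnan(group)]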
Outlier detection and handling represents another critical consideration. The agent should generate code that identifies potential outliers and provides options for sensitivity analysis, such as running the analysis both with and without extreme values.
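A simple sensitivity analysis of this kind might, for example, compare the test result with and without observations outside the conventional 1.5 × IQR fences. The sketch below is illustrative; a real agent could expose the trimming rule as a parameter.

import numpy as np
from scipy import stats

# Sketch of an IQR-based sensitivity analysis: run the test with and without
# extreme values and report both p-values.
def ttest_with_outlier_check(group1, group2):
    def trim_iqr(x):
        q1, q3 = np.percentile(x, [25, 75])
        lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        return x[(x >= lo) & (x <= hi)]

    _, p_full = stats.ttest_ind(group1, group2)
    _, p_trim = stats.ttest_ind(trim_iqr(np.asarray(group1)), trim_iqr(np.asarray(group2)))
    print(f"p-value with all data: {p_full:.4f}; without IQR outliers: {p_trim:.4f}")
    return p_full, p_trim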
Assumption violations require the agent to have fallback strategies. When parametric test assumptions are not met, the agent should automatically suggest and implement non-parametric alternatives or robust statistical methods.
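One lightweight way to encode these fallbacks is a lookup table from parametric tests to their non-parametric alternatives, as in this sketch.

# Sketch of a fallback table mapping parametric tests to non-parametric
# alternatives; the keys mirror the tests discussed in this article.
FALLBACK_TESTS = {
    "independent t-test": "Mann-Whitney U test",
    "paired t-test": "Wilcoxon signed-rank test",
    "one-way ANOVA": "Kruskal-Wallis H test",
    "Pearson correlation": "Spearman rank correlation",
}

def fallback_for(test_name, assumptions_met):
    return test_name if assumptions_met else FALLBACK_TESTS.get(test_name, test_name)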
The agent must also handle ambiguous requests by asking clarifying questions or providing multiple analysis options. For example, if a user mentions "comparing groups" without specifying whether the groups are independent or related, the agent should request clarification or provide analyses for both scenarios.
VALIDATION AND QUALITY ASSURANCE
Ensuring the correctness and reliability of generated statistical code requires comprehensive validation strategies. The agent should include multiple layers of quality assurance, from syntax checking to statistical validity verification.
Code validation involves ensuring that generated code is syntactically correct, follows best practices, and handles edge cases appropriately. This includes checking for proper error handling, input validation, and output formatting.
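For the syntax-level part of this validation, one inexpensive check is to compile the generated source before showing it to the user; this sketch assumes the generated code arrives as a plain string.

# Sketch: reject generated code that does not even compile.
def is_syntactically_valid(code_str: str) -> bool:
    try:
        compile(code_str, "<generated>", "exec")
        return True
    except SyntaxError as err:
        print(f"Generated code failed syntax check: {err}")
        return False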
Statistical validation requires verifying that the implemented methods are mathematically correct and produce results consistent with established statistical software packages. This validation should include testing against known datasets with verified results.
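One convenient form of statistical validation is a check built on a known identity rather than hard-coded reference numbers. For example, a paired t-test must agree exactly with a one-sample t-test on the difference scores, which the sketch below verifies.

import numpy as np
from scipy import stats

# Sketch of a statistical validation check: a paired t-test must agree with a
# one-sample t-test on the difference scores, a known mathematical identity.
def test_paired_ttest_identity():
    rng = np.random.default_rng(0)
    before = rng.normal(7.0, 1.0, 40)
    after = before + rng.normal(0.3, 0.5, 40)
    t_rel, p_rel = stats.ttest_rel(after, before)
    t_one, p_one = stats.ttest_1samp(after - before, 0.0)
    assert np.isclose(t_rel, t_one) and np.isclose(p_rel, p_one)

test_paired_ttest_identity()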
The agent should also implement runtime validation that checks data quality, verifies assumptions, and provides warnings when results might be unreliable due to small sample sizes, extreme outliers, or severe assumption violations.
EXTENSIBILITY AND CUSTOMIZATION
A well-designed statistical agent should be extensible to accommodate new statistical methods and customizable to meet specific organizational needs. The architecture should support adding new tests, modifying existing implementations, and integrating with different data sources and output formats.
The knowledge base should be designed as a modular system that allows easy addition of new statistical methods without requiring changes to the core decision logic. Each statistical method should be defined with its assumptions, use cases, and implementation details in a standardized format.
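A sketch of such a modular registry is shown below; the schema loosely mirrors the knowledge base in the source code at the end of the article, and register_test is an illustrative helper.

# Sketch of a modular test registry: new methods are added as data, not code.
TEST_REGISTRY = {
    "paired_ttest": {
        "name": "Paired Samples T-Test",
        "data_type": ["continuous"],
        "sample_structure": ["paired"],
        "num_groups": 2,
        "assumptions": ["differences approximately normal", "pairs independent"],
        "alternatives": ["wilcoxon_signed_rank"],
    },
}

def register_test(key, definition):
    """Add a new statistical method without touching the decision logic."""
    TEST_REGISTRY[key] = definition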
The code generation system should use templates that can be customized for different programming languages, statistical packages, or organizational coding standards. This flexibility allows the agent to generate code that integrates seamlessly with existing workflows and tools.
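As a minimal illustration, templating can be as simple as string substitution keyed on the selected test; the placeholder names below are assumptions, not the templates used by the full implementation.

from string import Template

# Sketch of template-based code generation with illustrative placeholders.
TTEST_TEMPLATE = Template(
    "result = stats.ttest_ind($group1, $group2, equal_var=$equal_var)\n"
    "print(result)"
)

snippet = TTEST_TEMPLATE.substitute(group1="control", group2="treatment", equal_var="False")
print(snippet)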
PERFORMANCE CONSIDERATIONS AND SCALABILITY
As the agent handles more complex requests and larger datasets, performance considerations become increasingly important. The system must be designed to handle multiple concurrent requests efficiently while maintaining response quality.
Caching strategies can significantly improve response times for common statistical scenarios. The agent can cache code templates, assumption checking procedures, and interpretation guidelines to reduce generation time for similar requests.
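A sketch of this idea using memoization keyed on the problem signature is shown below; in the real agent the cached function would wrap the code generation engine or the LLM call.

from functools import lru_cache

# Sketch: memoize template generation keyed on the problem signature, so that
# repeated requests for the same kind of analysis skip regeneration.
@lru_cache(maxsize=128)
def cached_template(test_name: str, data_type: str, sample_structure: str) -> str:
    # In the real agent this would call the code generation engine / LLM.
    return f"# template for {test_name} ({data_type}, {sample_structure})"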
For large datasets, the agent should generate code that uses efficient algorithms and appropriate data structures. This might involve recommending sampling strategies for extremely large datasets or suggesting distributed computing approaches when appropriate.
The agent should also provide progress indicators and time estimates for long-running analyses, helping users understand when to expect results and whether alternative approaches might be more efficient.
INTEGRATION WITH EXISTING WORKFLOWS
Successful deployment of an LLM-based statistical agent requires careful consideration of how it integrates with existing data analysis workflows. The agent should be designed to work with common data formats, statistical software packages, and reporting systems.
Data import capabilities should support various formats including CSV files, database connections, and integration with popular data analysis platforms. The generated code should include appropriate data loading and preprocessing steps that work with the user's existing data infrastructure.
Output formatting should be flexible enough to support different reporting requirements, from simple text summaries to formatted reports that can be integrated into presentations or publications. The agent might generate code that produces publication-ready tables and figures alongside the statistical analysis.
Version control integration ensures that generated analyses can be tracked, reproduced, and modified over time. The agent should generate well-documented code that includes metadata about the analysis parameters and assumptions.
CONCLUSION AND FUTURE DIRECTIONS
Building an LLM-based agent for statistical test generation represents a significant technical challenge that requires expertise in natural language processing, statistical methodology, and software engineering. The successful implementation of such a system can dramatically improve productivity for data analysts and researchers while ensuring statistical rigor and reproducibility.
The key to success lies in creating a robust architecture that separates concerns appropriately, implements comprehensive validation strategies, and provides clear, interpretable output. The agent must balance automation with transparency, providing users with enough information to understand and validate the generated analyses.
Future developments in this area are likely to focus on expanding the range of supported statistical methods, improving the natural language understanding capabilities, and integrating with emerging data analysis platforms and tools. Machine learning approaches might be used to improve test selection based on historical success patterns, while advances in code generation could enable more sophisticated and optimized implementations.
The integration of visual analytics capabilities could further enhance the agent's utility by generating appropriate plots and visualizations alongside statistical tests. This would provide users with a more complete analytical toolkit that addresses both statistical inference and data exploration needs.
As these systems become more sophisticated, they have the potential to democratize access to advanced statistical methods while maintaining the rigor and precision that statistical analysis requires. However, their development must be guided by careful attention to statistical validity, user needs, and the broader goals of reproducible and transparent data analysis.
SOURCE CODE OF AGENT
I used Claude 4 Opus to create an LLM-based AI agent for statistical tests in Python (see below).
This complete production implementation includes:
1. LLM Integration: Support for both OpenAI API and local Hugging Face models
2. Natural Language Processing: LLM-enhanced understanding of statistical problems
3. Intelligent Test Selection: LLM-assisted reasoning for choosing appropriate tests
4. Code Generation: LLM-powered creation of complete statistical test implementations
5. Comprehensive Analysis Pipeline: End-to-end workflow from problem description to executable code
6. Flexible Architecture: Easy to extend with new LLM providers or statistical tests
7. Error Handling: Robust fallback mechanisms when LLM responses are unclear
8. Production Features: Logging, validation, and structured output formats
The agent can work with either remote LLM APIs (like OpenAI) or local models, making it suitable for various deployment scenarios including environments with data privacy requirements.
Here is the source code:
"""
LLM-Based Statistical Test Generation Agent
A comprehensive system that uses LLM for understanding statistical problems and generating code.
"""
import re
import json
import numpy as np
import pandas as pd
from scipy import stats
from typing import Dict, List, Tuple, Any, Optional, Union
from dataclasses import dataclass, asdict
from abc import ABC, abstractmethod
import logging
from enum import Enum
import warnings
import requests
import openai
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class DataType(Enum):
"""Enumeration of supported data types"""
CONTINUOUS = "continuous"
CATEGORICAL = "categorical"
ORDINAL = "ordinal"
BINARY = "binary"
class SampleStructure(Enum):
"""Enumeration of sample structure types"""
INDEPENDENT = "independent"
PAIRED = "paired"
REPEATED_MEASURES = "repeated_measures"
NESTED = "nested"
class ResearchQuestion(Enum):
"""Enumeration of research question types"""
COMPARE_MEANS = "compare_means"
COMPARE_PROPORTIONS = "compare_proportions"
TEST_ASSOCIATION = "test_association"
TEST_CORRELATION = "test_correlation"
TEST_NORMALITY = "test_normality"
TEST_VARIANCE = "test_variance"
@dataclass
class ProblemCharacteristics:
"""Data structure to hold extracted problem characteristics"""
data_type: DataType
sample_structure: SampleStructure
research_question: ResearchQuestion
num_groups: int
sample_size: Optional[int] = None
num_variables: int = 1
has_covariates: bool = False
alpha_level: float = 0.05
effect_size_interest: Optional[float] = None
raw_description: str = ""
@dataclass
class TestRecommendation:
"""Data structure for test recommendations"""
test_name: str
test_type: str
assumptions: List[str]
alternatives: List[str]
confidence: float
rationale: str
@dataclass
class StatisticalResult:
"""Data structure for statistical test results"""
test_name: str
test_statistic: float
p_value: float
degrees_of_freedom: Optional[float]
confidence_interval: Optional[Tuple[float, float]]
effect_size: Optional[float]
effect_size_name: Optional[str]
significant: bool
interpretation: str
assumptions_met: Dict[str, bool]
warnings: List[str]
class LLMInterface(ABC):
"""Abstract base class for LLM interfaces"""
@abstractmethod
def generate_response(self, prompt: str, max_tokens: int = 500) -> str:
"""Generate response from LLM"""
pass
@abstractmethod
def extract_structured_info(self, text: str, schema: Dict) -> Dict:
"""Extract structured information using LLM"""
pass
class OpenAIInterface(LLMInterface):
"""Interface for OpenAI GPT models"""
def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"):
self.client = openai.OpenAI(api_key=api_key)
self.model = model
def generate_response(self, prompt: str, max_tokens: int = 500) -> str:
"""Generate response using OpenAI API"""
try:
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=0.1
)
return response.choices[0].message.content.strip()
except Exception as e:
logger.error(f"OpenAI API error: {e}")
return ""
def extract_structured_info(self, text: str, schema: Dict) -> Dict:
"""Extract structured information using OpenAI"""
prompt = f"""
Extract the following information from the statistical problem description:
Text: "{text}"
Please extract and return a JSON object with the following structure:
{json.dumps(schema, indent=2)}
Guidelines:
- data_type: Choose from "continuous", "categorical", "ordinal", "binary"
- sample_structure: Choose from "independent", "paired", "repeated_measures", "nested"
- research_question: Choose from "compare_means", "compare_proportions", "test_association", "test_correlation", "test_normality", "test_variance"
- num_groups: Number of groups being compared (integer)
- sample_size: Total sample size if mentioned (integer or null)
- alpha_level: Significance level if mentioned (default 0.05)
Return only the JSON object, no additional text.
"""
response = self.generate_response(prompt, max_tokens=300)
try:
# Extract JSON from response
json_start = response.find('{')
json_end = response.rfind('}') + 1
if json_start != -1 and json_end != -1:
json_str = response[json_start:json_end]
return json.loads(json_str)
except json.JSONDecodeError:
logger.error("Failed to parse JSON from LLM response")
return {}
class HuggingFaceInterface(LLMInterface):
"""Interface for local Hugging Face models"""
def __init__(self, model_name: str = "microsoft/DialoGPT-medium"):
self.model_name = model_name
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
# Add padding token if not present
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
def generate_response(self, prompt: str, max_tokens: int = 500) -> str:
"""Generate response using local Hugging Face model"""
try:
inputs = self.tokenizer.encode(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
outputs = self.model.generate(
inputs,
max_length=inputs.shape[1] + max_tokens,
num_return_sequences=1,
temperature=0.7,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id
)
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
# Remove the original prompt from response
response = response[len(prompt):].strip()
return response
except Exception as e:
logger.error(f"Hugging Face model error: {e}")
return ""
def extract_structured_info(self, text: str, schema: Dict) -> Dict:
"""Extract structured information using local model"""
prompt = f"""
Extract information from this statistical problem:
"{text}"
Return JSON with: data_type, sample_structure, research_question, num_groups, sample_size, alpha_level
JSON:
"""
response = self.generate_response(prompt, max_tokens=200)
try:
# Try to extract JSON from response
json_start = response.find('{')
json_end = response.rfind('}') + 1
if json_start != -1 and json_end != -1:
json_str = response[json_start:json_end]
return json.loads(json_str)
except:
pass
# Fallback to rule-based extraction if LLM fails
return self._fallback_extraction(text)
def _fallback_extraction(self, text: str) -> Dict:
"""Fallback rule-based extraction"""
text_lower = text.lower()
# Simple pattern matching as fallback
data_type = "continuous"
if any(word in text_lower for word in ["category", "categorical", "group"]):
data_type = "categorical"
elif any(word in text_lower for word in ["scale", "rating", "ordinal"]):
data_type = "ordinal"
sample_structure = "independent"
if any(phrase in text_lower for phrase in ["paired", "before and after", "same participants"]):
sample_structure = "paired"
research_question = "compare_means"
if any(word in text_lower for word in ["correlation", "relationship"]):
research_question = "test_correlation"
elif any(word in text_lower for word in ["association", "chi"]):
research_question = "test_association"
# Extract sample size
sample_size = None
size_match = re.search(r'(\d+)\s+(?:participants|subjects|people)', text_lower)
if size_match:
sample_size = int(size_match.group(1))
# Extract number of groups
num_groups = 2
if "three" in text_lower or "3" in text:
num_groups = 3
return {
"data_type": data_type,
"sample_structure": sample_structure,
"research_question": research_question,
"num_groups": num_groups,
"sample_size": sample_size,
"alpha_level": 0.05
}
class LLMBasedNaturalLanguageProcessor:
"""LLM-enhanced natural language processor for statistical problems"""
def __init__(self, llm_interface: LLMInterface):
self.llm = llm_interface
self.extraction_schema = {
"data_type": "string",
"sample_structure": "string",
"research_question": "string",
"num_groups": "integer",
"sample_size": "integer or null",
"alpha_level": "float"
}
def extract_characteristics(self, problem_description: str) -> ProblemCharacteristics:
"""Extract structured characteristics using LLM"""
# First, let the LLM understand and clarify the problem
clarification_prompt = f"""
Analyze this statistical problem description and provide a clear summary:
"{problem_description}"
Please identify:
1. What type of data is being analyzed?
2. How are the samples structured?
3. What is the main research question?
4. How many groups are being compared?
5. What is the sample size?
Provide a brief, structured analysis.
"""
clarification = self.llm.generate_response(clarification_prompt, max_tokens=300)
logger.info(f"LLM clarification: {clarification}")
# Extract structured information
extracted_info = self.llm.extract_structured_info(problem_description, self.extraction_schema)
# Convert to enum types with validation
try:
data_type = DataType(extracted_info.get("data_type", "continuous"))
except ValueError:
data_type = DataType.CONTINUOUS
try:
sample_structure = SampleStructure(extracted_info.get("sample_structure", "independent"))
except ValueError:
sample_structure = SampleStructure.INDEPENDENT
try:
research_question = ResearchQuestion(extracted_info.get("research_question", "compare_means"))
except ValueError:
research_question = ResearchQuestion.COMPARE_MEANS
return ProblemCharacteristics(
data_type=data_type,
sample_structure=sample_structure,
research_question=research_question,
num_groups=extracted_info.get("num_groups", 2),
sample_size=extracted_info.get("sample_size"),
alpha_level=extracted_info.get("alpha_level", 0.05),
raw_description=problem_description
)
def generate_problem_summary(self, characteristics: ProblemCharacteristics) -> str:
"""Generate a human-readable summary of the problem"""
summary_prompt = f"""
Create a clear summary of this statistical analysis problem:
Original description: "{characteristics.raw_description}"
Extracted characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
- Sample size: {characteristics.sample_size or "Not specified"}
- Significance level: {characteristics.alpha_level}
Provide a concise, professional summary of what statistical analysis is needed.
"""
return self.llm.generate_response(summary_prompt, max_tokens=200)
class StatisticalKnowledgeBase:
"""Enhanced knowledge base with LLM integration"""
def __init__(self, llm_interface: LLMInterface):
self.llm = llm_interface
self.test_database = self._initialize_test_database()
def _initialize_test_database(self) -> Dict[str, Dict]:
"""Initialize the statistical test knowledge base"""
return {
"independent_ttest": {
"name": "Independent Samples T-Test",
"type": "parametric",
"data_type": [DataType.CONTINUOUS],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 2,
"assumptions": [
"Data in both groups is approximately normally distributed",
"Observations are independent",
"Variances are approximately equal (homoscedasticity)",
"Data is measured at interval or ratio level"
],
"alternatives": ["mann_whitney_u", "welch_ttest"],
"min_sample_size": 10,
"description": "Compares means between two independent groups"
},
"paired_ttest": {
"name": "Paired Samples T-Test",
"type": "parametric",
"data_type": [DataType.CONTINUOUS],
"sample_structure": [SampleStructure.PAIRED],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 2,
"assumptions": [
"Difference scores are approximately normally distributed",
"Pairs are independent",
"Data is measured at interval or ratio level"
],
"alternatives": ["wilcoxon_signed_rank"],
"min_sample_size": 5,
"description": "Compares means between two related groups"
},
"one_way_anova": {
"name": "One-Way ANOVA",
"type": "parametric",
"data_type": [DataType.CONTINUOUS],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 3,
"assumptions": [
"Data in all groups is approximately normally distributed",
"Observations are independent",
"Variances are approximately equal across groups",
"Data is measured at interval or ratio level"
],
"alternatives": ["kruskal_wallis"],
"min_sample_size": 15,
"description": "Compares means across three or more independent groups"
},
"mann_whitney_u": {
"name": "Mann-Whitney U Test",
"type": "non_parametric",
"data_type": [DataType.CONTINUOUS, DataType.ORDINAL],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 2,
"assumptions": [
"Observations are independent",
"Data is at least ordinal"
],
"alternatives": ["independent_ttest"],
"min_sample_size": 5,
"description": "Non-parametric test comparing distributions between two independent groups"
},
"wilcoxon_signed_rank": {
"name": "Wilcoxon Signed-Rank Test",
"type": "non_parametric",
"data_type": [DataType.CONTINUOUS, DataType.ORDINAL],
"sample_structure": [SampleStructure.PAIRED],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 2,
"assumptions": [
"Pairs are independent",
"Data is at least ordinal",
"Distribution of differences is symmetric"
],
"alternatives": ["paired_ttest"],
"min_sample_size": 5,
"description": "Non-parametric test comparing distributions between two related groups"
},
"chi_square_test": {
"name": "Chi-Square Test of Independence",
"type": "non_parametric",
"data_type": [DataType.CATEGORICAL],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.TEST_ASSOCIATION],
"num_groups": 2,
"assumptions": [
"Observations are independent",
"Expected frequency in each cell >= 5",
"Data is categorical"
],
"alternatives": ["fisher_exact"],
"min_sample_size": 20,
"description": "Tests for association between two categorical variables"
},
"pearson_correlation": {
"name": "Pearson Correlation",
"type": "parametric",
"data_type": [DataType.CONTINUOUS],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.TEST_CORRELATION],
"num_groups": 1,
"assumptions": [
"Both variables are approximately normally distributed",
"Relationship is linear",
"Observations are independent",
"Data is measured at interval or ratio level"
],
"alternatives": ["spearman_correlation"],
"min_sample_size": 10,
"description": "Measures linear correlation between two continuous variables"
}
}
def get_test_explanation(self, test_key: str) -> str:
"""Get LLM-generated explanation of a statistical test"""
test_info = self.test_database.get(test_key, {})
if not test_info:
return "Test not found in knowledge base."
explanation_prompt = f"""
Explain the {test_info['name']} in clear, accessible language for software engineers:
Test type: {test_info['type']}
Description: {test_info['description']}
Assumptions: {', '.join(test_info['assumptions'])}
Please explain:
1. When to use this test
2. What the test does
3. How to interpret the results
4. What the assumptions mean in practical terms
Keep the explanation concise but comprehensive.
"""
return self.llm.generate_response(explanation_prompt, max_tokens=400)
def find_suitable_tests(self, characteristics: ProblemCharacteristics) -> List[str]:
"""Find tests that match the problem characteristics"""
suitable_tests = []
for test_key, test_info in self.test_database.items():
if self._test_matches_characteristics(test_info, characteristics):
suitable_tests.append(test_key)
return suitable_tests
def _test_matches_characteristics(self, test_info: Dict, characteristics: ProblemCharacteristics) -> bool:
"""Check if a test matches the problem characteristics"""
# Check data type compatibility
if characteristics.data_type not in test_info["data_type"]:
return False
# Check sample structure compatibility
if characteristics.sample_structure not in test_info["sample_structure"]:
return False
# Check research question compatibility
if characteristics.research_question not in test_info["research_question"]:
return False
# Check number of groups (allow flexibility for ANOVA)
if test_info["num_groups"] == 3 and characteristics.num_groups < 3:
return False
elif test_info["num_groups"] == 2 and characteristics.num_groups != 2:
return False
elif test_info["num_groups"] == 1: # Tests like correlation
pass
# Check minimum sample size if available
if (characteristics.sample_size is not None and
characteristics.sample_size < test_info["min_sample_size"]):
return False
return True
class LLMEnhancedTestSelectionEngine:
"""Test selection engine enhanced with LLM reasoning"""
def __init__(self, knowledge_base: StatisticalKnowledgeBase, llm_interface: LLMInterface):
self.knowledge_base = knowledge_base
self.llm = llm_interface
def recommend_test(self, characteristics: ProblemCharacteristics) -> TestRecommendation:
"""Recommend the most appropriate statistical test using LLM reasoning"""
suitable_tests = self.knowledge_base.find_suitable_tests(characteristics)
if not suitable_tests:
return self._handle_no_suitable_tests(characteristics)
# Use LLM to select the best test among suitable options
best_test_key = self._llm_select_best_test(suitable_tests, characteristics)
test_info = self.knowledge_base.test_database[best_test_key]
# Generate LLM-based rationale
rationale = self._generate_llm_rationale(test_info, characteristics)
# Calculate confidence
confidence = self._calculate_confidence(test_info, characteristics)
return TestRecommendation(
test_name=test_info["name"],
test_type=test_info["type"],
assumptions=test_info["assumptions"],
alternatives=[self.knowledge_base.test_database[alt]["name"]
for alt in test_info["alternatives"]
if alt in self.knowledge_base.test_database],
confidence=confidence,
rationale=rationale
)
def _llm_select_best_test(self, suitable_tests: List[str], characteristics: ProblemCharacteristics) -> str:
"""Use LLM to select the best test from suitable options"""
if len(suitable_tests) == 1:
return suitable_tests[0]
test_descriptions = []
for test_key in suitable_tests:
test_info = self.knowledge_base.test_database[test_key]
test_descriptions.append(f"- {test_info['name']}: {test_info['description']}")
selection_prompt = f"""
Given this statistical problem:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
- Sample size: {characteristics.sample_size or "Not specified"}
Choose the MOST appropriate test from these options:
{chr(10).join(test_descriptions)}
Consider:
1. Which test best matches the research question
2. Which test is most robust for the given sample size
3. Which test has the most appropriate assumptions
Return only the exact name of the chosen test.
"""
response = self.llm.generate_response(selection_prompt, max_tokens=100)
# Find the test that best matches the LLM response
for test_key in suitable_tests:
test_name = self.knowledge_base.test_database[test_key]["name"]
if test_name.lower() in response.lower():
return test_key
# Fallback to first suitable test
return suitable_tests[0]
def _generate_llm_rationale(self, test_info: Dict, characteristics: ProblemCharacteristics) -> str:
"""Generate detailed rationale using LLM"""
rationale_prompt = f"""
Explain why the {test_info['name']} is the best choice for this statistical problem:
Problem characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
- Sample size: {characteristics.sample_size or "Not specified"}
Test information:
- Type: {test_info['type']}
- Description: {test_info['description']}
- Key assumptions: {', '.join(test_info['assumptions'][:3])}
Provide a clear, concise explanation of why this test is appropriate.
"""
return self.llm.generate_response(rationale_prompt, max_tokens=300)
def _handle_no_suitable_tests(self, characteristics: ProblemCharacteristics) -> TestRecommendation:
"""Handle cases where no suitable tests are found"""
suggestion_prompt = f"""
No standard statistical test matches these characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
Suggest alternative approaches or modifications to the problem that would allow for statistical analysis.
"""
suggestion = self.llm.generate_response(suggestion_prompt, max_tokens=200)
return TestRecommendation(
test_name="No suitable test found",
test_type="unknown",
assumptions=[],
alternatives=[],
confidence=0.0,
rationale=f"No standard test matches the problem characteristics. Suggestion: {suggestion}"
)
def _calculate_confidence(self, test_info: Dict, characteristics: ProblemCharacteristics) -> float:
"""Calculate confidence in the test recommendation"""
confidence = 0.8 # Base confidence
# Adjust based on sample size
if characteristics.sample_size is not None:
if characteristics.sample_size >= test_info["min_sample_size"] * 2:
confidence += 0.1
elif characteristics.sample_size < test_info["min_sample_size"]:
confidence -= 0.3
# Adjust based on data type match
if characteristics.data_type in test_info["data_type"]:
confidence += 0.05
return min(1.0, max(0.0, confidence))
class LLMCodeGenerator:
"""LLM-enhanced code generator for statistical tests"""
def __init__(self, llm_interface: LLMInterface):
self.llm = llm_interface
self.code_templates = self._initialize_code_templates()
def _initialize_code_templates(self) -> Dict[str, str]:
"""Initialize code templates for different tests"""
return {
"independent_ttest": """
def perform_independent_ttest(group1, group2, group1_name="Group 1", group2_name="Group 2", alpha=0.05):
import numpy as np
from scipy import stats
# Data preprocessing
group1 = np.array(group1)
group2 = np.array(group2)
group1_clean = group1[~np.isnan(group1)]
group2_clean = group2[~np.isnan(group2)]
print("INDEPENDENT SAMPLES T-TEST ANALYSIS")
print("=" * 50)
print(f"Comparing {group1_name} (n={len(group1_clean)}) vs {group2_name} (n={len(group2_clean)})")
# Descriptive statistics
mean1, std1 = np.mean(group1_clean), np.std(group1_clean, ddof=1)
mean2, std2 = np.mean(group2_clean), np.std(group2_clean, ddof=1)
print(f"\\nDESCRIPTIVE STATISTICS:")
print(f" {group1_name}: Mean = {mean1:.4f}, SD = {std1:.4f}")
print(f" {group2_name}: Mean = {mean2:.4f}, SD = {std2:.4f}")
# Assumption checking
print(f"\\nASSUMPTION CHECKING:")
# Normality tests
_, p_norm1 = stats.shapiro(group1_clean) if len(group1_clean) <= 5000 else stats.kstest(group1_clean, 'norm')
_, p_norm2 = stats.shapiro(group2_clean) if len(group2_clean) <= 5000 else stats.kstest(group2_clean, 'norm')
print(f" Normality {group1_name}: p = {p_norm1:.4f}")
print(f" Normality {group2_name}: p = {p_norm2:.4f}")
# Equal variances test
_, p_levene = stats.levene(group1_clean, group2_clean)
equal_vars = p_levene > 0.05
print(f" Equal variances (Levene): p = {p_levene:.4f}")
# Perform appropriate test
if equal_vars:
t_stat, p_value = stats.ttest_ind(group1_clean, group2_clean, equal_var=True)
test_type = "Student's t-test"
else:
t_stat, p_value = stats.ttest_ind(group1_clean, group2_clean, equal_var=False)
test_type = "Welch's t-test"
# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(group1_clean)-1)*np.var(group1_clean, ddof=1) +
(len(group2_clean)-1)*np.var(group2_clean, ddof=1)) /
(len(group1_clean) + len(group2_clean) - 2))
cohens_d = (mean1 - mean2) / pooled_std
print(f"\\nTEST RESULTS ({test_type}):")
print(f" t-statistic: {t_stat:.4f}")
print(f" p-value: {p_value:.6f}")
print(f" Cohen's d: {cohens_d:.4f}")
# Interpretation
print(f"\\nINTERPRETATION:")
if p_value < alpha:
print(f" Statistically significant difference (p < {alpha})")
else:
print(f" No statistically significant difference (p >= {alpha})")
return {
'test_type': test_type,
't_statistic': t_stat,
'p_value': p_value,
'cohens_d': cohens_d,
'significant': p_value < alpha
}
""",
"paired_ttest": """
def perform_paired_ttest(before_data, after_data, condition1_name="Before", condition2_name="After", alpha=0.05):
import numpy as np
from scipy import stats
# Data preprocessing
before = np.array(before_data)
after = np.array(after_data)
if len(before) != len(after):
raise ValueError("Before and after data must have the same length")
# Remove pairs with missing values
valid_pairs = ~(np.isnan(before) | np.isnan(after))
before_clean = before[valid_pairs]
after_clean = after[valid_pairs]
differences = after_clean - before_clean
print("PAIRED SAMPLES T-TEST ANALYSIS")
print("=" * 50)
print(f"Comparing {condition1_name} vs {condition2_name} (n={len(before_clean)} pairs)")
# Descriptive statistics
mean_before = np.mean(before_clean)
mean_after = np.mean(after_clean)
mean_diff = np.mean(differences)
print(f"\\nDESCRIPTIVE STATISTICS:")
print(f" {condition1_name}: Mean = {mean_before:.4f}")
print(f" {condition2_name}: Mean = {mean_after:.4f}")
print(f" Mean difference: {mean_diff:.4f}")
# Check normality of differences
print(f"\\nASSUMPTION CHECKING:")
_, p_norm = stats.shapiro(differences) if len(differences) <= 5000 else stats.kstest(differences, 'norm')
print(f" Normality of differences: p = {p_norm:.4f}")
# Perform paired t-test
t_stat, p_value = stats.ttest_rel(after_clean, before_clean)
# Calculate effect size
std_diff = np.std(differences, ddof=1)
cohens_d = mean_diff / std_diff
print(f"\\nTEST RESULTS:")
print(f" t-statistic: {t_stat:.4f}")
print(f" p-value: {p_value:.6f}")
print(f" Cohen's d: {cohens_d:.4f}")
# Interpretation
print(f"\\nINTERPRETATION:")
if p_value < alpha:
direction = "increased" if mean_diff > 0 else "decreased"
print(f" Statistically significant {direction} (p < {alpha})")
else:
print(f" No statistically significant change (p >= {alpha})")
return {
't_statistic': t_stat,
'p_value': p_value,
'mean_difference': mean_diff,
'cohens_d': cohens_d,
'significant': p_value < alpha
}
"""
}
def generate_test_code(self, test_name: str, characteristics: ProblemCharacteristics,
custom_requirements: str = "") -> str:
"""Generate complete test implementation code using LLM"""
# Check if we have a template
test_key = self._get_test_key_from_name(test_name)
if test_key in self.code_templates:
base_code = self.code_templates[test_key]
else:
base_code = ""
# Use LLM to enhance or generate code
code_prompt = f"""
Generate a complete Python function for performing a {test_name} with the following requirements:
Problem characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
- Sample size: {characteristics.sample_size or "Variable"}
- Significance level: {characteristics.alpha_level}
Additional requirements: {custom_requirements}
The function should include:
1. Comprehensive data validation and preprocessing
2. Assumption checking with appropriate tests
3. The main statistical test implementation
4. Effect size calculation
5. Confidence intervals where appropriate
6. Clear interpretation of results
7. Proper error handling
8. Detailed output formatting
{"Use this as a starting template and enhance it:" + base_code if base_code else "Create a complete implementation from scratch."}
Return only the Python code with proper formatting and documentation.
"""
generated_code = self.llm.generate_response(code_prompt, max_tokens=1500)
# Clean up the generated code
return self._clean_generated_code(generated_code)
def generate_usage_example(self, test_name: str, characteristics: ProblemCharacteristics) -> str:
"""Generate usage example for the test"""
example_prompt = f"""
Create a realistic usage example for the {test_name} function based on these characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
Include:
1. Sample data generation or realistic data examples
2. Function call with appropriate parameters
3. Brief explanation of the example scenario
Make it practical and educational for software engineers.
"""
return self.llm.generate_response(example_prompt, max_tokens=400)
def _get_test_key_from_name(self, test_name: str) -> str:
"""Convert test name to internal key"""
name_mapping = {
"Independent Samples T-Test": "independent_ttest",
"Paired Samples T-Test": "paired_ttest",
"One-Way ANOVA": "one_way_anova",
"Mann-Whitney U Test": "mann_whitney_u",
"Wilcoxon Signed-Rank Test": "wilcoxon_signed_rank",
"Chi-Square Test of Independence": "chi_square_test",
"Pearson Correlation": "pearson_correlation"
}
return name_mapping.get(test_name, "")
def _clean_generated_code(self, code: str) -> str:
"""Clean and format generated code"""
# Remove markdown code blocks if present
if "```python" in code:
start = code.find("```python") + 9
end = code.rfind("```")
if end > start:
code = code[start:end]
elif "```" in code:
start = code.find("```") + 3
end = code.rfind("```")
if end > start:
code = code[start:end]
# Clean up extra whitespace
lines = code.split('\n')
cleaned_lines = [line.rstrip() for line in lines]
return '\n'.join(cleaned_lines).strip()
class StatisticalAgent:
"""Main LLM-based statistical test generation agent"""
def __init__(self, llm_interface: LLMInterface):
self.llm = llm_interface
self.nlp = LLMBasedNaturalLanguageProcessor(llm_interface)
self.knowledge_base = StatisticalKnowledgeBase(llm_interface)
self.test_selector = LLMEnhancedTestSelectionEngine(self.knowledge_base, llm_interface)
self.code_generator = LLMCodeGenerator(llm_interface)
def analyze_problem(self, problem_description: str,
custom_requirements: str = "",
generate_code: bool = True) -> Dict[str, Any]:
"""Complete analysis pipeline from problem description to code generation"""
logger.info(f"Analyzing problem: {problem_description[:100]}...")
# Step 1: Extract problem characteristics
characteristics = self.nlp.extract_characteristics(problem_description)
# Step 2: Generate problem summary
problem_summary = self.nlp.generate_problem_summary(characteristics)
# Step 3: Get test recommendation
recommendation = self.test_selector.recommend_test(characteristics)
# Step 4: Get detailed test explanation
test_explanation = self.knowledge_base.get_test_explanation(
self._get_test_key_from_name(recommendation.test_name)
)
# Step 5: Generate code if requested
generated_code = ""
usage_example = ""
if generate_code and recommendation.test_name != "No suitable test found":
generated_code = self.code_generator.generate_test_code(
recommendation.test_name, characteristics, custom_requirements
)
usage_example = self.code_generator.generate_usage_example(
recommendation.test_name, characteristics
)
return {
"problem_characteristics": asdict(characteristics),
"problem_summary": problem_summary,
"test_recommendation": asdict(recommendation),
"test_explanation": test_explanation,
"generated_code": generated_code,
"usage_example": usage_example,
"timestamp": pd.Timestamp.now().isoformat()
}
def explain_test_method(self, test_name: str) -> str:
"""Get detailed explanation of a specific test method"""
test_key = self._get_test_key_from_name(test_name)
if test_key:
return self.knowledge_base.get_test_explanation(test_key)
else:
# Use LLM to explain unknown test
explanation_prompt = f"""
Explain the {test_name} statistical test in detail for software engineers:
Include:
1. When and why to use this test
2. Key assumptions
3. How to interpret results
4. Common pitfalls and considerations
Make it practical and accessible.
"""
return self.llm.generate_response(explanation_prompt, max_tokens=500)
def suggest_alternative_tests(self, problem_description: str) -> List[Dict[str, Any]]:
"""Suggest multiple alternative tests for a problem"""
characteristics = self.nlp.extract_characteristics(problem_description)
suitable_tests = self.knowledge_base.find_suitable_tests(characteristics)
alternatives = []
for test_key in suitable_tests:
test_info = self.knowledge_base.test_database[test_key]
explanation = self.knowledge_base.get_test_explanation(test_key)
alternatives.append({
"test_name": test_info["name"],
"test_type": test_info["type"],
"description": test_info["description"],
"explanation": explanation,
"assumptions": test_info["assumptions"]
})
return alternatives
def _get_test_key_from_name(self, test_name: str) -> str:
"""Convert test name to internal key"""
for key, info in self.knowledge_base.test_database.items():
if info["name"] == test_name:
return key
return ""
# Example usage and configuration
def create_agent_with_openai(api_key: str) -> StatisticalAgent:
"""Create agent with OpenAI interface"""
llm_interface = OpenAIInterface(api_key)
return StatisticalAgent(llm_interface)
def create_agent_with_local_model(model_name: str = "microsoft/DialoGPT-medium") -> StatisticalAgent:
"""Create agent with local Hugging Face model"""
llm_interface = HuggingFaceInterface(model_name)
return StatisticalAgent(llm_interface)
# Example usage
if __name__ == "__main__":
# Example with OpenAI (requires API key)
# agent = create_agent_with_openai("your-openai-api-key")
# Example with local model
agent = create_agent_with_local_model()
# Test the agent
problem = """
I collected sleep duration data from 25 people before and after implementing
a new sleep hygiene program. I want to know if the program was effective.
The before measurements averaged 6.5 hours with some variation, and after
measurements seemed to show improvement.
"""
result = agent.analyze_problem(problem)
print("STATISTICAL ANALYSIS AGENT RESULTS")
print("=" * 50)
print(f"Problem Summary: {result['problem_summary']}")
print(f"\nRecommended Test: {result['test_recommendation']['test_name']}")
print(f"Confidence: {result['test_recommendation']['confidence']:.2f}")
print(f"\nRationale: {result['test_recommendation']['rationale']}")
print(f"\nTest Explanation: {result['test_explanation']}")
if result['generated_code']:
print(f"\nGenerated Code:")
print(result['generated_code'])
if result['usage_example']:
print(f"\nUsage Example:")
print(result['usage_example'])