INTRODUCTION
Creating an intelligent agent that can understand statistical problems, recommend appropriate tests, and generate executable code represents a significant advancement in automated data analysis. This article explores the design and implementation of such a system, focusing on the technical architecture and practical considerations that software engineers need to understand when building these sophisticated tools.
The fundamental challenge lies in bridging the gap between natural language problem descriptions and rigorous statistical implementations. Users often describe their analytical needs in informal terms, such as "I want to know if these two groups are different" or "Is there a relationship between these variables?" The agent must translate these requests into precise statistical frameworks, select appropriate methodologies, and generate robust, executable code.
Note: the full code of an LLM-generated solution appears at the end of the article.
UNDERSTANDING THE CORE ARCHITECTURE
The architecture of an LLM-based statistical agent consists of several interconnected components that work together to process user requests and generate appropriate responses. The primary components include a natural language processing module for understanding user intent, a statistical knowledge base that contains information about various tests and their applications, a decision engine that selects appropriate statistical methods, and a code generation system that produces executable implementations.
The natural language processing component serves as the entry point for user interactions. This module must parse user descriptions to extract key information such as the type of data being analyzed, the research question being asked, the number of groups or variables involved, and any specific constraints or assumptions mentioned by the user. The extraction process requires sophisticated understanding of statistical terminology and the ability to infer implicit information from context.
The statistical knowledge base contains structured information about various statistical tests, including their assumptions, appropriate use cases, required data types, and implementation details. This knowledge base must be comprehensive enough to cover common statistical scenarios while remaining organized in a way that allows efficient retrieval based on problem characteristics.
The decision engine uses the extracted problem characteristics to query the knowledge base and identify suitable statistical tests. This component must consider multiple factors simultaneously, including data type compatibility, sample size requirements, distributional assumptions, and the specific research question being addressed.
Finally, the code generation system translates the selected statistical method into executable code. This component must produce not only the core statistical calculations but also appropriate data validation, assumption checking, and result interpretation.
IMPLEMENTING PROBLEM ANALYSIS
The problem analysis module represents the critical first step in the agent's workflow. This component must extract structured information from unstructured natural language descriptions. The extraction process involves identifying several key elements that determine the appropriate statistical approach.
Data type identification forms a fundamental part of problem analysis. The agent must determine whether the user is working with continuous numerical data, categorical data, ordinal data, or mixed types. This determination often requires understanding context clues and domain-specific terminology. For example, when a user mentions "survey responses on a 5-point scale," the agent should recognize this as ordinal data, while "reaction times" clearly indicates continuous numerical data.
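To make this step concrete, a minimal sketch of keyword-based data type inference might look like the following; the keyword lists and the infer_data_type name are illustrative assumptions rather than the agent's actual rules.

# Minimal sketch of rule-based data type inference; the keyword lists are
# illustrative assumptions, not an exhaustive taxonomy.
def infer_data_type(description: str) -> str:
    text = description.lower()
    if any(kw in text for kw in ("5-point scale", "likert", "rating", "rank")):
        return "ordinal"
    if any(kw in text for kw in ("yes/no", "success", "failure", "passed")):
        return "binary"
    if any(kw in text for kw in ("category", "group label", "type of")):
        return "categorical"
    # Default to continuous for measurements such as reaction times or durations
    return "continuous"

print(infer_data_type("survey responses on a 5-point scale"))  # ordinal
print(infer_data_type("reaction times in milliseconds"))       # continuous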
Sample structure analysis involves understanding how the data is organized and whether observations are independent or related. The agent must distinguish between independent samples, paired samples, repeated measures, and nested or hierarchical data structures. This distinction is crucial because it directly impacts the choice of statistical method.
Research question classification requires the agent to understand what type of relationship or difference the user wants to investigate. Common categories include comparing means between groups, testing for associations between variables, examining trends over time, or assessing the strength of relationships.
Let me demonstrate this with a concrete example. Consider a user request: "I have reaction time measurements from 30 participants who completed a task under two different lighting conditions. I want to know if lighting affects performance."
The problem analysis module would extract the following information: The data type is continuous (reaction times), the sample structure involves paired observations (same participants under different conditions), the research question involves comparing means between two related groups, and the expected statistical approach would be a paired t-test.
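A natural way to carry this extracted information forward is a small structured record. The sketch below shows what that might look like for the lighting example; the field names are illustrative, and the full implementation at the end of the article uses a richer ProblemCharacteristics dataclass.

from dataclasses import dataclass

# Sketch of the structured output of problem analysis for the lighting example;
# the field names are illustrative.
@dataclass
class ExtractedProblem:
    data_type: str          # e.g. "continuous"
    sample_structure: str   # e.g. "paired"
    research_question: str  # e.g. "compare_means"
    num_groups: int
    sample_size: int

lighting_problem = ExtractedProblem(
    data_type="continuous",
    sample_structure="paired",
    research_question="compare_means",
    num_groups=2,
    sample_size=30,
)
print(lighting_problem)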
DESIGNING THE TEST SELECTION LOGIC
The test selection logic represents the core intelligence of the statistical agent. This component must navigate the complex landscape of statistical methods to identify the most appropriate test for a given problem. The selection process involves multiple decision points and considerations that must be evaluated systematically.
The primary decision tree begins with the research question type. For comparing groups, the agent must consider the number of groups, whether observations are independent or paired, and the distributional properties of the data. For examining relationships, the agent must determine the types of variables involved and the nature of the expected relationship.
Assumption checking plays a critical role in test selection. Different statistical tests have different requirements regarding data distribution, sample size, homogeneity of variance, and independence of observations. The agent must not only select tests based on these assumptions but also generate code to verify that the assumptions are met.
Consider the decision process for comparing two groups. The agent must first determine whether the groups are independent or paired. For independent groups, it must then assess whether the data meets the assumptions for a t-test, including normality and equal variances. If these assumptions are violated, the agent should recommend alternative approaches such as the Mann-Whitney U test for non-normal data or Welch's t-test for unequal variances.
The following code example illustrates how the agent might implement assumption checking for a two-sample t-test:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def check_normality(data, group_name):
    """Check whether data follows a normal distribution using the Shapiro-Wilk test."""
    statistic, p_value = stats.shapiro(data)
    print(f"Normality test for {group_name}:")
    print(f"  Shapiro-Wilk statistic: {statistic:.4f}")
    print(f"  p-value: {p_value:.4f}")
    if p_value > 0.05:
        print("  Result: Data appears normally distributed (p > 0.05)")
        return True
    else:
        print("  Result: Data may not be normally distributed (p <= 0.05)")
        return False


def check_equal_variances(group1, group2):
    """Check whether two groups have equal variances using Levene's test."""
    statistic, p_value = stats.levene(group1, group2)
    print("Equal variances test:")
    print(f"  Levene's statistic: {statistic:.4f}")
    print(f"  p-value: {p_value:.4f}")
    if p_value > 0.05:
        print("  Result: Variances appear equal (p > 0.05)")
        return True
    else:
        print("  Result: Variances may not be equal (p <= 0.05)")
        return False
This code demonstrates how the agent can systematically verify the assumptions underlying statistical tests. The normality check uses the Shapiro-Wilk test, which is appropriate for small to moderate sample sizes. The equal variance check employs Levene's test, which is robust to departures from normality.
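Building on these helpers, the selection step can be sketched as a small dispatcher that falls back to Welch's t-test or the Mann-Whitney U test when a check fails. This is a simplified illustration of the idea, not the exact logic of the full agent.

# Simplified sketch: choose a two-sample test from the assumption checks above.
def select_two_sample_test(group1, group2):
    normal = check_normality(group1, "Group 1") and check_normality(group2, "Group 2")
    equal_vars = check_equal_variances(group1, group2)
    if not normal:
        stat, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
        return "Mann-Whitney U test", stat, p
    if not equal_vars:
        stat, p = stats.ttest_ind(group1, group2, equal_var=False)
        return "Welch's t-test", stat, p
    stat, p = stats.ttest_ind(group1, group2, equal_var=True)
    return "Student's t-test", stat, p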
BUILDING THE CODE GENERATION ENGINE
The code generation engine transforms the selected statistical method into executable code that implements the complete analysis workflow. This component must produce code that is not only statistically correct but also robust, well-documented, and interpretable.
The generated code typically follows a standard structure that includes data import and preprocessing, assumption checking, test execution, and result interpretation. Each section must be implemented with appropriate error handling and user feedback.
Data preprocessing often requires handling missing values, outliers, and data type conversions. The agent must generate code that addresses these common data quality issues while providing transparency about the preprocessing steps taken.
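As a minimal sketch of such a preprocessing step, the helper below (its name and reporting format are illustrative assumptions) coerces the input to floats, drops missing values, and reports what was removed.

import numpy as np

# Sketch of a transparent preprocessing step: coerce to float, drop NaNs,
# and report how many observations were removed.
def preprocess_numeric(values, label="sample"):
    arr = np.asarray(values, dtype=float)
    clean = arr[~np.isnan(arr)]
    dropped = len(arr) - len(clean)
    if dropped:
        print(f"{label}: removed {dropped} missing value(s) out of {len(arr)}")
    return clean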
The actual test implementation must include not only the core statistical calculation but also confidence intervals, effect sizes, and other relevant metrics that aid in interpretation. The agent should generate code that provides comprehensive output rather than just a p-value.
Result interpretation represents a crucial component of the generated code. The agent must produce code that translates statistical output into meaningful conclusions, taking into account the original research question and the practical significance of the findings.
Here's an example of how the agent might generate a complete t-test implementation:
def perform_independent_ttest(group1, group2, group1_name="Group 1", group2_name="Group 2", alpha=0.05):
    """
    Perform an independent samples t-test with comprehensive output.

    Parameters:
        group1, group2: array-like, the two groups to compare
        group1_name, group2_name: str, names for the groups
        alpha: float, significance level

    Returns:
        dict: comprehensive results including test statistic, p-value,
              confidence interval, and effect size
    """
    # Convert to numpy arrays for consistency
    group1 = np.array(group1)
    group2 = np.array(group2)

    # Remove missing values
    group1_clean = group1[~np.isnan(group1)]
    group2_clean = group2[~np.isnan(group2)]

    print("Independent Samples T-Test")
    print(f"Comparing {group1_name} (n={len(group1_clean)}) vs {group2_name} (n={len(group2_clean)})")
    print("-" * 60)

    # Descriptive statistics
    mean1, std1 = np.mean(group1_clean), np.std(group1_clean, ddof=1)
    mean2, std2 = np.mean(group2_clean), np.std(group2_clean, ddof=1)
    print("Descriptive Statistics:")
    print(f"  {group1_name}: Mean = {mean1:.4f}, SD = {std1:.4f}")
    print(f"  {group2_name}: Mean = {mean2:.4f}, SD = {std2:.4f}")
    print(f"  Mean difference = {mean1 - mean2:.4f}")
    print()

    # Check assumptions
    normal1 = check_normality(group1_clean, group1_name)
    normal2 = check_normality(group2_clean, group2_name)
    equal_vars = check_equal_variances(group1_clean, group2_clean)
    print()

    # Perform the appropriate t-test based on the equal-variance check
    if equal_vars:
        t_stat, p_value = stats.ttest_ind(group1_clean, group2_clean, equal_var=True)
        test_type = "Student's t-test (equal variances)"
    else:
        t_stat, p_value = stats.ttest_ind(group1_clean, group2_clean, equal_var=False)
        test_type = "Welch's t-test (unequal variances)"

    # Calculate degrees of freedom
    if equal_vars:
        df = len(group1_clean) + len(group2_clean) - 2
    else:
        # Welch-Satterthwaite equation
        s1_sq = np.var(group1_clean, ddof=1)
        s2_sq = np.var(group2_clean, ddof=1)
        n1, n2 = len(group1_clean), len(group2_clean)
        df = (s1_sq/n1 + s2_sq/n2)**2 / ((s1_sq/n1)**2/(n1-1) + (s2_sq/n2)**2/(n2-1))

    # Confidence interval for the difference in means (unpooled standard error)
    se_diff = np.sqrt(np.var(group1_clean, ddof=1)/len(group1_clean) +
                      np.var(group2_clean, ddof=1)/len(group2_clean))
    t_critical = stats.t.ppf(1 - alpha/2, df)
    ci_lower = (mean1 - mean2) - t_critical * se_diff
    ci_upper = (mean1 - mean2) + t_critical * se_diff

    # Calculate Cohen's d (effect size)
    if equal_vars:
        pooled_std = np.sqrt(((len(group1_clean)-1)*np.var(group1_clean, ddof=1) +
                              (len(group2_clean)-1)*np.var(group2_clean, ddof=1)) /
                             (len(group1_clean) + len(group2_clean) - 2))
    else:
        pooled_std = np.sqrt((np.var(group1_clean, ddof=1) + np.var(group2_clean, ddof=1)) / 2)
    cohens_d = (mean1 - mean2) / pooled_std

    # Print results
    print(f"Test Results ({test_type}):")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  degrees of freedom: {df:.2f}")
    print(f"  p-value: {p_value:.6f}")
    print(f"  {100*(1-alpha):.0f}% Confidence Interval: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print(f"  Cohen's d (effect size): {cohens_d:.4f}")
    print()

    # Interpret results
    print("Interpretation:")
    if p_value < alpha:
        print(f"  The difference between groups is statistically significant (p < {alpha})")
    else:
        print(f"  The difference between groups is not statistically significant (p >= {alpha})")

    # Effect size interpretation
    abs_d = abs(cohens_d)
    if abs_d < 0.2:
        effect_size_desc = "negligible"
    elif abs_d < 0.5:
        effect_size_desc = "small"
    elif abs_d < 0.8:
        effect_size_desc = "medium"
    else:
        effect_size_desc = "large"
    print(f"  The effect size is {effect_size_desc} (|d| = {abs_d:.4f})")

    return {
        'test_type': test_type,
        't_statistic': t_stat,
        'p_value': p_value,
        'degrees_of_freedom': df,
        'mean_difference': mean1 - mean2,
        'confidence_interval': (ci_lower, ci_upper),
        'cohens_d': cohens_d,
        'significant': p_value < alpha
    }
This comprehensive implementation demonstrates how the agent generates code that goes beyond basic statistical calculations. The function includes assumption checking, appropriate test selection, comprehensive output, and practical interpretation of results.
WORKING THROUGH A COMPLETE EXAMPLE
To illustrate how all components work together, let's walk through a complete example from problem description to final implementation. Consider a user who submits the following request: "I collected sleep duration data from 25 people before and after implementing a new sleep hygiene program. I want to know if the program was effective."
The problem analysis module would extract the following key information: The data involves continuous measurements (sleep duration), the design uses paired observations (same people measured twice), the research question asks about the effectiveness of an intervention (comparing before and after), and the appropriate statistical approach would be a paired t-test.
The test selection logic would proceed as follows: Since we have paired observations comparing two time points, a paired t-test is the primary candidate. However, the agent must also consider the assumptions of normality for the difference scores and the possibility of using non-parametric alternatives if assumptions are violated.
The code generation engine would produce a complete implementation that handles data input, assumption checking, test execution, and result interpretation. Here's how the agent might generate the complete analysis:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def analyze_sleep_intervention(before_data, after_data):
    """
    Analyze the effectiveness of a sleep hygiene intervention using a paired t-test.

    Parameters:
        before_data: array-like, sleep duration before the intervention
        after_data: array-like, sleep duration after the intervention
    """
    # Convert to numpy arrays and handle missing data
    before = np.array(before_data)
    after = np.array(after_data)

    # Check for equal length
    if len(before) != len(after):
        raise ValueError("Before and after data must have the same length")

    # Remove pairs with missing values
    valid_pairs = ~(np.isnan(before) | np.isnan(after))
    before_clean = before[valid_pairs]
    after_clean = after[valid_pairs]

    print("Sleep Hygiene Intervention Analysis")
    print("=" * 50)
    print(f"Sample size: {len(before_clean)} participants")
    print()

    # Calculate difference scores
    differences = after_clean - before_clean

    # Descriptive statistics
    mean_before = np.mean(before_clean)
    mean_after = np.mean(after_clean)
    mean_diff = np.mean(differences)
    std_diff = np.std(differences, ddof=1)

    print("Descriptive Statistics:")
    print(f"  Before intervention: Mean = {mean_before:.2f} hours, SD = {np.std(before_clean, ddof=1):.2f}")
    print(f"  After intervention:  Mean = {mean_after:.2f} hours, SD = {np.std(after_clean, ddof=1):.2f}")
    print(f"  Mean change: {mean_diff:.2f} hours, SD = {std_diff:.2f}")
    print()

    # Check normality of the difference scores
    shapiro_stat, shapiro_p = stats.shapiro(differences)
    print("Assumption Checking:")
    print(f"  Normality of differences (Shapiro-Wilk): W = {shapiro_stat:.4f}, p = {shapiro_p:.4f}")
    if shapiro_p > 0.05:
        print("  Assumption met: Differences appear normally distributed")
        use_parametric = True
    else:
        print("  Assumption violated: Differences may not be normally distributed")
        print("  Will perform both parametric and non-parametric tests")
        use_parametric = False
    print()

    # Perform paired t-test
    t_stat, t_p = stats.ttest_rel(after_clean, before_clean)
    df = len(differences) - 1

    # Confidence interval for the mean difference
    se_diff = std_diff / np.sqrt(len(differences))
    t_critical = stats.t.ppf(0.975, df)  # for a 95% CI
    ci_lower = mean_diff - t_critical * se_diff
    ci_upper = mean_diff + t_critical * se_diff

    # Cohen's d for paired samples
    cohens_d = mean_diff / std_diff

    print("Paired T-Test Results:")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  degrees of freedom: {df}")
    print(f"  p-value: {t_p:.6f}")
    print(f"  95% Confidence Interval for mean difference: [{ci_lower:.3f}, {ci_upper:.3f}] hours")
    print(f"  Cohen's d (effect size): {cohens_d:.4f}")
    print()

    # If the normality assumption is violated, also run the Wilcoxon signed-rank test
    if not use_parametric:
        wilcoxon_stat, wilcoxon_p = stats.wilcoxon(after_clean, before_clean)
        print("Wilcoxon Signed-Rank Test (non-parametric alternative):")
        print(f"  Test statistic: {wilcoxon_stat:.4f}")
        print(f"  p-value: {wilcoxon_p:.6f}")
        print()

    # Interpretation
    print("Interpretation:")
    alpha = 0.05
    if t_p < alpha:
        direction = "increased" if mean_diff > 0 else "decreased"
        print(f"  The sleep hygiene intervention significantly {direction} sleep duration")
        print(f"  (p = {t_p:.6f} < {alpha})")
    else:
        print("  The sleep hygiene intervention did not significantly change sleep duration")
        print(f"  (p = {t_p:.6f} >= {alpha})")

    # Effect size interpretation
    abs_d = abs(cohens_d)
    if abs_d < 0.2:
        effect_desc = "negligible"
    elif abs_d < 0.5:
        effect_desc = "small"
    elif abs_d < 0.8:
        effect_desc = "medium"
    else:
        effect_desc = "large"
    print(f"  The effect size is {effect_desc} (Cohen's d = {cohens_d:.4f})")

    if mean_diff > 0:
        print(f"  On average, participants slept {mean_diff:.2f} hours longer after the intervention")
    else:
        print(f"  On average, participants slept {abs(mean_diff):.2f} hours less after the intervention")

    return {
        'mean_difference': mean_diff,
        't_statistic': t_stat,
        'p_value': t_p,
        'confidence_interval': (ci_lower, ci_upper),
        'cohens_d': cohens_d,
        'significant': t_p < alpha,
        'sample_size': len(differences)
    }


# Example usage with simulated data
np.random.seed(42)  # for reproducible results
before_sleep = np.random.normal(7.0, 1.2, 25)                 # mean 7 hours, SD 1.2
after_sleep = before_sleep + np.random.normal(0.5, 0.8, 25)   # average increase of 0.5 hours
results = analyze_sleep_intervention(before_sleep, after_sleep)
This complete example demonstrates how the agent integrates all components to provide a comprehensive analysis. The generated code includes data validation, assumption checking, appropriate test selection, comprehensive output, and practical interpretation of results.
HANDLING ADVANCED SCENARIOS AND ERROR CONDITIONS
Real-world statistical analysis often involves complications that a robust agent must handle gracefully. These include missing data, outliers, assumption violations, and ambiguous problem descriptions. The agent must be designed to detect these issues and provide appropriate guidance or alternative approaches.
Missing data handling requires the agent to determine whether missing values are random or systematic and to choose appropriate strategies for dealing with incomplete observations. For paired tests, the agent might remove pairs with missing values, while for independent samples, it might use all available data for each group.
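The two strategies described above can be sketched as small helpers; the function names are illustrative.

import numpy as np

# Sketch of the two deletion strategies described above.
def listwise_pairs(before, after):
    """Paired designs: keep only complete pairs."""
    before, after = np.asarray(before, float), np.asarray(after, float)
    keep = ~(np.isnan(before) | np.isnan(after))
    return before[keep], after[keep]

def available_cases(group):
    """Independent samples: keep all non-missing values within each group."""
    group = np.asarray(group, float)
    return group[~np.isnan(group)]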
Outlier detection and handling represents another critical consideration. The agent should generate code that identifies potential outliers and provides options for sensitivity analysis, such as running the analysis both with and without extreme values.
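A simple sensitivity analysis of this kind might, for example, compare the test result with and without observations outside the conventional 1.5 × IQR fences. The sketch below is illustrative; a real agent could expose the trimming rule as a parameter.

import numpy as np
from scipy import stats

# Sketch of an IQR-based sensitivity analysis: run the test with and without
# extreme values and report both p-values.
def ttest_with_outlier_check(group1, group2):
    def trim_iqr(x):
        q1, q3 = np.percentile(x, [25, 75])
        lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        return x[(x >= lo) & (x <= hi)]

    _, p_full = stats.ttest_ind(group1, group2)
    _, p_trim = stats.ttest_ind(trim_iqr(np.asarray(group1)), trim_iqr(np.asarray(group2)))
    print(f"p-value with all data: {p_full:.4f}; without IQR outliers: {p_trim:.4f}")
    return p_full, p_trim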
Assumption violations require the agent to have fallback strategies. When parametric test assumptions are not met, the agent should automatically suggest and implement non-parametric alternatives or robust statistical methods.
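One lightweight way to encode these fallbacks is a lookup table from parametric tests to their non-parametric alternatives, as in this sketch.

# Sketch of a fallback table mapping parametric tests to non-parametric
# alternatives; the keys mirror the tests discussed in this article.
FALLBACK_TESTS = {
    "independent t-test": "Mann-Whitney U test",
    "paired t-test": "Wilcoxon signed-rank test",
    "one-way ANOVA": "Kruskal-Wallis H test",
    "Pearson correlation": "Spearman rank correlation",
}

def fallback_for(test_name, assumptions_met):
    return test_name if assumptions_met else FALLBACK_TESTS.get(test_name, test_name)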
The agent must also handle ambiguous requests by asking clarifying questions or providing multiple analysis options. For example, if a user mentions "comparing groups" without specifying whether the groups are independent or related, the agent should request clarification or provide analyses for both scenarios.
VALIDATION AND QUALITY ASSURANCE
Ensuring the correctness and reliability of generated statistical code requires comprehensive validation strategies. The agent should include multiple layers of quality assurance, from syntax checking to statistical validity verification.
Code validation involves ensuring that generated code is syntactically correct, follows best practices, and handles edge cases appropriately. This includes checking for proper error handling, input validation, and output formatting.
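For the syntax-level part of this validation, one inexpensive check is to compile the generated source before showing it to the user; this sketch assumes the generated code arrives as a plain string.

# Sketch: reject generated code that does not even compile.
def is_syntactically_valid(code_str: str) -> bool:
    try:
        compile(code_str, "<generated>", "exec")
        return True
    except SyntaxError as err:
        print(f"Generated code failed syntax check: {err}")
        return False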
Statistical validation requires verifying that the implemented methods are mathematically correct and produce results consistent with established statistical software packages. This validation should include testing against known datasets with verified results.
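One convenient form of statistical validation is a check built on a known identity rather than hard-coded reference numbers. For example, a paired t-test must agree exactly with a one-sample t-test on the difference scores, which the sketch below verifies.

import numpy as np
from scipy import stats

# Sketch of a statistical validation check: a paired t-test must agree with a
# one-sample t-test on the difference scores, a known mathematical identity.
def test_paired_ttest_identity():
    rng = np.random.default_rng(0)
    before = rng.normal(7.0, 1.0, 40)
    after = before + rng.normal(0.3, 0.5, 40)
    t_rel, p_rel = stats.ttest_rel(after, before)
    t_one, p_one = stats.ttest_1samp(after - before, 0.0)
    assert np.isclose(t_rel, t_one) and np.isclose(p_rel, p_one)

test_paired_ttest_identity()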
The agent should also implement runtime validation that checks data quality, verifies assumptions, and provides warnings when results might be unreliable due to small sample sizes, extreme outliers, or severe assumption violations.
EXTENSIBILITY AND CUSTOMIZATION
A well-designed statistical agent should be extensible to accommodate new statistical methods and customizable to meet specific organizational needs. The architecture should support adding new tests, modifying existing implementations, and integrating with different data sources and output formats.
The knowledge base should be designed as a modular system that allows easy addition of new statistical methods without requiring changes to the core decision logic. Each statistical method should be defined with its assumptions, use cases, and implementation details in a standardized format.
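A sketch of such a modular registry is shown below; the schema loosely mirrors the knowledge base in the source code at the end of the article, and register_test is an illustrative helper.

# Sketch of a modular test registry: new methods are added as data, not code.
TEST_REGISTRY = {
    "paired_ttest": {
        "name": "Paired Samples T-Test",
        "data_type": ["continuous"],
        "sample_structure": ["paired"],
        "num_groups": 2,
        "assumptions": ["differences approximately normal", "pairs independent"],
        "alternatives": ["wilcoxon_signed_rank"],
    },
}

def register_test(key, definition):
    """Add a new statistical method without touching the decision logic."""
    TEST_REGISTRY[key] = definition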
The code generation system should use templates that can be customized for different programming languages, statistical packages, or organizational coding standards. This flexibility allows the agent to generate code that integrates seamlessly with existing workflows and tools.
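As a minimal illustration, templating can be as simple as string substitution keyed on the selected test; the placeholder names below are assumptions, not the templates used by the full implementation.

from string import Template

# Sketch of template-based code generation with illustrative placeholders.
TTEST_TEMPLATE = Template(
    "result = stats.ttest_ind($group1, $group2, equal_var=$equal_var)\n"
    "print(result)"
)

snippet = TTEST_TEMPLATE.substitute(group1="control", group2="treatment", equal_var="False")
print(snippet)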
PERFORMANCE CONSIDERATIONS AND SCALABILITY
As the agent handles more complex requests and larger datasets, performance considerations become increasingly important. The system must be designed to handle multiple concurrent requests efficiently while maintaining response quality.
Caching strategies can significantly improve response times for common statistical scenarios. The agent can cache code templates, assumption checking procedures, and interpretation guidelines to reduce generation time for similar requests.
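A sketch of this idea using memoization keyed on the problem signature is shown below; in the real agent the cached function would wrap the code generation engine or the LLM call.

from functools import lru_cache

# Sketch: memoize template generation keyed on the problem signature, so that
# repeated requests for the same kind of analysis skip regeneration.
@lru_cache(maxsize=128)
def cached_template(test_name: str, data_type: str, sample_structure: str) -> str:
    # In the real agent this would call the code generation engine / LLM.
    return f"# template for {test_name} ({data_type}, {sample_structure})"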
For large datasets, the agent should generate code that uses efficient algorithms and appropriate data structures. This might involve recommending sampling strategies for extremely large datasets or suggesting distributed computing approaches when appropriate.
The agent should also provide progress indicators and time estimates for long-running analyses, helping users understand when to expect results and whether alternative approaches might be more efficient.
INTEGRATION WITH EXISTING WORKFLOWS
Successful deployment of an LLM-based statistical agent requires careful consideration of how it integrates with existing data analysis workflows. The agent should be designed to work with common data formats, statistical software packages, and reporting systems.
Data import capabilities should support various formats including CSV files, database connections, and integration with popular data analysis platforms. The generated code should include appropriate data loading and preprocessing steps that work with the user's existing data infrastructure.
Output formatting should be flexible enough to support different reporting requirements, from simple text summaries to formatted reports that can be integrated into presentations or publications. The agent might generate code that produces publication-ready tables and figures alongside the statistical analysis.
Version control integration ensures that generated analyses can be tracked, reproduced, and modified over time. The agent should generate well-documented code that includes metadata about the analysis parameters and assumptions.
CONCLUSION AND FUTURE DIRECTIONS
Building an LLM-based agent for statistical test generation represents a significant technical challenge that requires expertise in natural language processing, statistical methodology, and software engineering. The successful implementation of such a system can dramatically improve productivity for data analysts and researchers while ensuring statistical rigor and reproducibility.
The key to success lies in creating a robust architecture that separates concerns appropriately, implements comprehensive validation strategies, and provides clear, interpretable output. The agent must balance automation with transparency, providing users with enough information to understand and validate the generated analyses.
Future developments in this area are likely to focus on expanding the range of supported statistical methods, improving the natural language understanding capabilities, and integrating with emerging data analysis platforms and tools. Machine learning approaches might be used to improve test selection based on historical success patterns, while advances in code generation could enable more sophisticated and optimized implementations.
The integration of visual analytics capabilities could further enhance the agent's utility by generating appropriate plots and visualizations alongside statistical tests. This would provide users with a more complete analytical toolkit that addresses both statistical inference and data exploration needs.
As these systems become more sophisticated, they have the potential to democratize access to advanced statistical methods while maintaining the rigor and precision that statistical analysis requires. However, their development must be guided by careful attention to statistical validity, user needs, and the broader goals of reproducible and transparent data analysis.
SOURCE CODE OF AGENT
I used Claude 4 Opus to create an LLM-based AI agent for statistical tests in Python (see below).
This complete production implementation includes:
1. LLM Integration: Support for both OpenAI API and local Hugging Face models
2. Natural Language Processing: LLM-enhanced understanding of statistical problems
3. Intelligent Test Selection: LLM-assisted reasoning for choosing appropriate tests
4. Code Generation: LLM-powered creation of complete statistical test implementations
5. Comprehensive Analysis Pipeline: End-to-end workflow from problem description to executable code
6. Flexible Architecture: Easy to extend with new LLM providers or statistical tests
7. Error Handling: Robust fallback mechanisms when LLM responses are unclear
8. Production Features: Logging, validation, and structured output formats
The agent can work with either remote LLM APIs (like OpenAI) or local models, making it suitable for various deployment scenarios including environments with data privacy requirements.
Here is the source code:
"""
LLM-Based Statistical Test Generation Agent
A comprehensive system that uses LLM for understanding statistical problems and generating code.
"""
import re
import json
import numpy as np
import pandas as pd
from scipy import stats
from typing import Dict, List, Tuple, Any, Optional, Union
from dataclasses import dataclass, asdict
from abc import ABC, abstractmethod
import logging
from enum import Enum
import warnings
import requests
import openai
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class DataType(Enum):
"""Enumeration of supported data types"""
CONTINUOUS = "continuous"
CATEGORICAL = "categorical"
ORDINAL = "ordinal"
BINARY = "binary"
class SampleStructure(Enum):
"""Enumeration of sample structure types"""
INDEPENDENT = "independent"
PAIRED = "paired"
REPEATED_MEASURES = "repeated_measures"
NESTED = "nested"
class ResearchQuestion(Enum):
"""Enumeration of research question types"""
COMPARE_MEANS = "compare_means"
COMPARE_PROPORTIONS = "compare_proportions"
TEST_ASSOCIATION = "test_association"
TEST_CORRELATION = "test_correlation"
TEST_NORMALITY = "test_normality"
TEST_VARIANCE = "test_variance"
@dataclass
class ProblemCharacteristics:
"""Data structure to hold extracted problem characteristics"""
data_type: DataType
sample_structure: SampleStructure
research_question: ResearchQuestion
num_groups: int
sample_size: Optional[int] = None
num_variables: int = 1
has_covariates: bool = False
alpha_level: float = 0.05
effect_size_interest: Optional[float] = None
raw_description: str = ""
@dataclass
class TestRecommendation:
"""Data structure for test recommendations"""
test_name: str
test_type: str
assumptions: List[str]
alternatives: List[str]
confidence: float
rationale: str
@dataclass
class StatisticalResult:
"""Data structure for statistical test results"""
test_name: str
test_statistic: float
p_value: float
degrees_of_freedom: Optional[float]
confidence_interval: Optional[Tuple[float, float]]
effect_size: Optional[float]
effect_size_name: Optional[str]
significant: bool
interpretation: str
assumptions_met: Dict[str, bool]
warnings: List[str]
class LLMInterface(ABC):
"""Abstract base class for LLM interfaces"""
@abstractmethod
def generate_response(self, prompt: str, max_tokens: int = 500) -> str:
"""Generate response from LLM"""
pass
@abstractmethod
def extract_structured_info(self, text: str, schema: Dict) -> Dict:
"""Extract structured information using LLM"""
pass
class OpenAIInterface(LLMInterface):
"""Interface for OpenAI GPT models"""
def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"):
self.client = openai.OpenAI(api_key=api_key)
self.model = model
def generate_response(self, prompt: str, max_tokens: int = 500) -> str:
"""Generate response using OpenAI API"""
try:
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=0.1
)
return response.choices[0].message.content.strip()
except Exception as e:
logger.error(f"OpenAI API error: {e}")
return ""
def extract_structured_info(self, text: str, schema: Dict) -> Dict:
"""Extract structured information using OpenAI"""
prompt = f"""
Extract the following information from the statistical problem description:
Text: "{text}"
Please extract and return a JSON object with the following structure:
{json.dumps(schema, indent=2)}
Guidelines:
- data_type: Choose from "continuous", "categorical", "ordinal", "binary"
- sample_structure: Choose from "independent", "paired", "repeated_measures", "nested"
- research_question: Choose from "compare_means", "compare_proportions", "test_association", "test_correlation", "test_normality", "test_variance"
- num_groups: Number of groups being compared (integer)
- sample_size: Total sample size if mentioned (integer or null)
- alpha_level: Significance level if mentioned (default 0.05)
Return only the JSON object, no additional text.
"""
response = self.generate_response(prompt, max_tokens=300)
try:
# Extract JSON from response
json_start = response.find('{')
json_end = response.rfind('}') + 1
if json_start != -1 and json_end != -1:
json_str = response[json_start:json_end]
return json.loads(json_str)
except json.JSONDecodeError:
logger.error("Failed to parse JSON from LLM response")
return {}
class HuggingFaceInterface(LLMInterface):
"""Interface for local Hugging Face models"""
def __init__(self, model_name: str = "microsoft/DialoGPT-medium"):
self.model_name = model_name
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
# Add padding token if not present
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
def generate_response(self, prompt: str, max_tokens: int = 500) -> str:
"""Generate response using local Hugging Face model"""
try:
inputs = self.tokenizer.encode(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
outputs = self.model.generate(
inputs,
max_length=inputs.shape[1] + max_tokens,
num_return_sequences=1,
temperature=0.7,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id
)
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
# Remove the original prompt from response
response = response[len(prompt):].strip()
return response
except Exception as e:
logger.error(f"Hugging Face model error: {e}")
return ""
def extract_structured_info(self, text: str, schema: Dict) -> Dict:
"""Extract structured information using local model"""
prompt = f"""
Extract information from this statistical problem:
"{text}"
Return JSON with: data_type, sample_structure, research_question, num_groups, sample_size, alpha_level
JSON:
"""
response = self.generate_response(prompt, max_tokens=200)
try:
# Try to extract JSON from response
json_start = response.find('{')
json_end = response.rfind('}') + 1
if json_start != -1 and json_end != -1:
json_str = response[json_start:json_end]
return json.loads(json_str)
except:
pass
# Fallback to rule-based extraction if LLM fails
return self._fallback_extraction(text)
def _fallback_extraction(self, text: str) -> Dict:
"""Fallback rule-based extraction"""
text_lower = text.lower()
# Simple pattern matching as fallback
data_type = "continuous"
if any(word in text_lower for word in ["category", "categorical", "group"]):
data_type = "categorical"
elif any(word in text_lower for word in ["scale", "rating", "ordinal"]):
data_type = "ordinal"
sample_structure = "independent"
if any(phrase in text_lower for phrase in ["paired", "before and after", "same participants"]):
sample_structure = "paired"
research_question = "compare_means"
if any(word in text_lower for word in ["correlation", "relationship"]):
research_question = "test_correlation"
elif any(word in text_lower for word in ["association", "chi"]):
research_question = "test_association"
# Extract sample size
sample_size = None
size_match = re.search(r'(\d+)\s+(?:participants|subjects|people)', text_lower)
if size_match:
sample_size = int(size_match.group(1))
# Extract number of groups
num_groups = 2
if "three" in text_lower or "3" in text:
num_groups = 3
return {
"data_type": data_type,
"sample_structure": sample_structure,
"research_question": research_question,
"num_groups": num_groups,
"sample_size": sample_size,
"alpha_level": 0.05
}
class LLMBasedNaturalLanguageProcessor:
"""LLM-enhanced natural language processor for statistical problems"""
def __init__(self, llm_interface: LLMInterface):
self.llm = llm_interface
self.extraction_schema = {
"data_type": "string",
"sample_structure": "string",
"research_question": "string",
"num_groups": "integer",
"sample_size": "integer or null",
"alpha_level": "float"
}
def extract_characteristics(self, problem_description: str) -> ProblemCharacteristics:
"""Extract structured characteristics using LLM"""
# First, let the LLM understand and clarify the problem
clarification_prompt = f"""
Analyze this statistical problem description and provide a clear summary:
"{problem_description}"
Please identify:
1. What type of data is being analyzed?
2. How are the samples structured?
3. What is the main research question?
4. How many groups are being compared?
5. What is the sample size?
Provide a brief, structured analysis.
"""
clarification = self.llm.generate_response(clarification_prompt, max_tokens=300)
logger.info(f"LLM clarification: {clarification}")
# Extract structured information
extracted_info = self.llm.extract_structured_info(problem_description, self.extraction_schema)
# Convert to enum types with validation
try:
data_type = DataType(extracted_info.get("data_type", "continuous"))
except ValueError:
data_type = DataType.CONTINUOUS
try:
sample_structure = SampleStructure(extracted_info.get("sample_structure", "independent"))
except ValueError:
sample_structure = SampleStructure.INDEPENDENT
try:
research_question = ResearchQuestion(extracted_info.get("research_question", "compare_means"))
except ValueError:
research_question = ResearchQuestion.COMPARE_MEANS
return ProblemCharacteristics(
data_type=data_type,
sample_structure=sample_structure,
research_question=research_question,
num_groups=extracted_info.get("num_groups", 2),
sample_size=extracted_info.get("sample_size"),
alpha_level=extracted_info.get("alpha_level", 0.05),
raw_description=problem_description
)
def generate_problem_summary(self, characteristics: ProblemCharacteristics) -> str:
"""Generate a human-readable summary of the problem"""
summary_prompt = f"""
Create a clear summary of this statistical analysis problem:
Original description: "{characteristics.raw_description}"
Extracted characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
- Sample size: {characteristics.sample_size or "Not specified"}
- Significance level: {characteristics.alpha_level}
Provide a concise, professional summary of what statistical analysis is needed.
"""
return self.llm.generate_response(summary_prompt, max_tokens=200)
class StatisticalKnowledgeBase:
"""Enhanced knowledge base with LLM integration"""
def __init__(self, llm_interface: LLMInterface):
self.llm = llm_interface
self.test_database = self._initialize_test_database()
def _initialize_test_database(self) -> Dict[str, Dict]:
"""Initialize the statistical test knowledge base"""
return {
"independent_ttest": {
"name": "Independent Samples T-Test",
"type": "parametric",
"data_type": [DataType.CONTINUOUS],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 2,
"assumptions": [
"Data in both groups is approximately normally distributed",
"Observations are independent",
"Variances are approximately equal (homoscedasticity)",
"Data is measured at interval or ratio level"
],
"alternatives": ["mann_whitney_u", "welch_ttest"],
"min_sample_size": 10,
"description": "Compares means between two independent groups"
},
"paired_ttest": {
"name": "Paired Samples T-Test",
"type": "parametric",
"data_type": [DataType.CONTINUOUS],
"sample_structure": [SampleStructure.PAIRED],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 2,
"assumptions": [
"Difference scores are approximately normally distributed",
"Pairs are independent",
"Data is measured at interval or ratio level"
],
"alternatives": ["wilcoxon_signed_rank"],
"min_sample_size": 5,
"description": "Compares means between two related groups"
},
"one_way_anova": {
"name": "One-Way ANOVA",
"type": "parametric",
"data_type": [DataType.CONTINUOUS],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 3,
"assumptions": [
"Data in all groups is approximately normally distributed",
"Observations are independent",
"Variances are approximately equal across groups",
"Data is measured at interval or ratio level"
],
"alternatives": ["kruskal_wallis"],
"min_sample_size": 15,
"description": "Compares means across three or more independent groups"
},
"mann_whitney_u": {
"name": "Mann-Whitney U Test",
"type": "non_parametric",
"data_type": [DataType.CONTINUOUS, DataType.ORDINAL],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 2,
"assumptions": [
"Observations are independent",
"Data is at least ordinal"
],
"alternatives": ["independent_ttest"],
"min_sample_size": 5,
"description": "Non-parametric test comparing distributions between two independent groups"
},
"wilcoxon_signed_rank": {
"name": "Wilcoxon Signed-Rank Test",
"type": "non_parametric",
"data_type": [DataType.CONTINUOUS, DataType.ORDINAL],
"sample_structure": [SampleStructure.PAIRED],
"research_question": [ResearchQuestion.COMPARE_MEANS],
"num_groups": 2,
"assumptions": [
"Pairs are independent",
"Data is at least ordinal",
"Distribution of differences is symmetric"
],
"alternatives": ["paired_ttest"],
"min_sample_size": 5,
"description": "Non-parametric test comparing distributions between two related groups"
},
"chi_square_test": {
"name": "Chi-Square Test of Independence",
"type": "non_parametric",
"data_type": [DataType.CATEGORICAL],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.TEST_ASSOCIATION],
"num_groups": 2,
"assumptions": [
"Observations are independent",
"Expected frequency in each cell >= 5",
"Data is categorical"
],
"alternatives": ["fisher_exact"],
"min_sample_size": 20,
"description": "Tests for association between two categorical variables"
},
"pearson_correlation": {
"name": "Pearson Correlation",
"type": "parametric",
"data_type": [DataType.CONTINUOUS],
"sample_structure": [SampleStructure.INDEPENDENT],
"research_question": [ResearchQuestion.TEST_CORRELATION],
"num_groups": 1,
"assumptions": [
"Both variables are approximately normally distributed",
"Relationship is linear",
"Observations are independent",
"Data is measured at interval or ratio level"
],
"alternatives": ["spearman_correlation"],
"min_sample_size": 10,
"description": "Measures linear correlation between two continuous variables"
}
}
def get_test_explanation(self, test_key: str) -> str:
"""Get LLM-generated explanation of a statistical test"""
test_info = self.test_database.get(test_key, {})
if not test_info:
return "Test not found in knowledge base."
explanation_prompt = f"""
Explain the {test_info['name']} in clear, accessible language for software engineers:
Test type: {test_info['type']}
Description: {test_info['description']}
Assumptions: {', '.join(test_info['assumptions'])}
Please explain:
1. When to use this test
2. What the test does
3. How to interpret the results
4. What the assumptions mean in practical terms
Keep the explanation concise but comprehensive.
"""
return self.llm.generate_response(explanation_prompt, max_tokens=400)
def find_suitable_tests(self, characteristics: ProblemCharacteristics) -> List[str]:
"""Find tests that match the problem characteristics"""
suitable_tests = []
for test_key, test_info in self.test_database.items():
if self._test_matches_characteristics(test_info, characteristics):
suitable_tests.append(test_key)
return suitable_tests
def _test_matches_characteristics(self, test_info: Dict, characteristics: ProblemCharacteristics) -> bool:
"""Check if a test matches the problem characteristics"""
# Check data type compatibility
if characteristics.data_type not in test_info["data_type"]:
return False
# Check sample structure compatibility
if characteristics.sample_structure not in test_info["sample_structure"]:
return False
# Check research question compatibility
if characteristics.research_question not in test_info["research_question"]:
return False
# Check number of groups (allow flexibility for ANOVA)
if test_info["num_groups"] == 3 and characteristics.num_groups < 3:
return False
elif test_info["num_groups"] == 2 and characteristics.num_groups != 2:
return False
elif test_info["num_groups"] == 1: # Tests like correlation
pass
# Check minimum sample size if available
if (characteristics.sample_size is not None and
characteristics.sample_size < test_info["min_sample_size"]):
return False
return True
class LLMEnhancedTestSelectionEngine:
"""Test selection engine enhanced with LLM reasoning"""
def __init__(self, knowledge_base: StatisticalKnowledgeBase, llm_interface: LLMInterface):
self.knowledge_base = knowledge_base
self.llm = llm_interface
def recommend_test(self, characteristics: ProblemCharacteristics) -> TestRecommendation:
"""Recommend the most appropriate statistical test using LLM reasoning"""
suitable_tests = self.knowledge_base.find_suitable_tests(characteristics)
if not suitable_tests:
return self._handle_no_suitable_tests(characteristics)
# Use LLM to select the best test among suitable options
best_test_key = self._llm_select_best_test(suitable_tests, characteristics)
test_info = self.knowledge_base.test_database[best_test_key]
# Generate LLM-based rationale
rationale = self._generate_llm_rationale(test_info, characteristics)
# Calculate confidence
confidence = self._calculate_confidence(test_info, characteristics)
return TestRecommendation(
test_name=test_info["name"],
test_type=test_info["type"],
assumptions=test_info["assumptions"],
alternatives=[self.knowledge_base.test_database[alt]["name"]
for alt in test_info["alternatives"]
if alt in self.knowledge_base.test_database],
confidence=confidence,
rationale=rationale
)
def _llm_select_best_test(self, suitable_tests: List[str], characteristics: ProblemCharacteristics) -> str:
"""Use LLM to select the best test from suitable options"""
if len(suitable_tests) == 1:
return suitable_tests[0]
test_descriptions = []
for test_key in suitable_tests:
test_info = self.knowledge_base.test_database[test_key]
test_descriptions.append(f"- {test_info['name']}: {test_info['description']}")
selection_prompt = f"""
Given this statistical problem:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
- Sample size: {characteristics.sample_size or "Not specified"}
Choose the MOST appropriate test from these options:
{chr(10).join(test_descriptions)}
Consider:
1. Which test best matches the research question
2. Which test is most robust for the given sample size
3. Which test has the most appropriate assumptions
Return only the exact name of the chosen test.
"""
response = self.llm.generate_response(selection_prompt, max_tokens=100)
# Find the test that best matches the LLM response
for test_key in suitable_tests:
test_name = self.knowledge_base.test_database[test_key]["name"]
if test_name.lower() in response.lower():
return test_key
# Fallback to first suitable test
return suitable_tests[0]
def _generate_llm_rationale(self, test_info: Dict, characteristics: ProblemCharacteristics) -> str:
"""Generate detailed rationale using LLM"""
rationale_prompt = f"""
Explain why the {test_info['name']} is the best choice for this statistical problem:
Problem characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
- Sample size: {characteristics.sample_size or "Not specified"}
Test information:
- Type: {test_info['type']}
- Description: {test_info['description']}
- Key assumptions: {', '.join(test_info['assumptions'][:3])}
Provide a clear, concise explanation of why this test is appropriate.
"""
return self.llm.generate_response(rationale_prompt, max_tokens=300)
def _handle_no_suitable_tests(self, characteristics: ProblemCharacteristics) -> TestRecommendation:
"""Handle cases where no suitable tests are found"""
suggestion_prompt = f"""
No standard statistical test matches these characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
Suggest alternative approaches or modifications to the problem that would allow for statistical analysis.
"""
suggestion = self.llm.generate_response(suggestion_prompt, max_tokens=200)
return TestRecommendation(
test_name="No suitable test found",
test_type="unknown",
assumptions=[],
alternatives=[],
confidence=0.0,
rationale=f"No standard test matches the problem characteristics. Suggestion: {suggestion}"
)
def _calculate_confidence(self, test_info: Dict, characteristics: ProblemCharacteristics) -> float:
"""Calculate confidence in the test recommendation"""
confidence = 0.8 # Base confidence
# Adjust based on sample size
if characteristics.sample_size is not None:
if characteristics.sample_size >= test_info["min_sample_size"] * 2:
confidence += 0.1
elif characteristics.sample_size < test_info["min_sample_size"]:
confidence -= 0.3
# Adjust based on data type match
if characteristics.data_type in test_info["data_type"]:
confidence += 0.05
return min(1.0, max(0.0, confidence))
class LLMCodeGenerator:
"""LLM-enhanced code generator for statistical tests"""
def __init__(self, llm_interface: LLMInterface):
self.llm = llm_interface
self.code_templates = self._initialize_code_templates()
def _initialize_code_templates(self) -> Dict[str, str]:
"""Initialize code templates for different tests"""
return {
"independent_ttest": """
def perform_independent_ttest(group1, group2, group1_name="Group 1", group2_name="Group 2", alpha=0.05):
import numpy as np
from scipy import stats
# Data preprocessing
group1 = np.array(group1)
group2 = np.array(group2)
group1_clean = group1[~np.isnan(group1)]
group2_clean = group2[~np.isnan(group2)]
print("INDEPENDENT SAMPLES T-TEST ANALYSIS")
print("=" * 50)
print(f"Comparing {group1_name} (n={len(group1_clean)}) vs {group2_name} (n={len(group2_clean)})")
# Descriptive statistics
mean1, std1 = np.mean(group1_clean), np.std(group1_clean, ddof=1)
mean2, std2 = np.mean(group2_clean), np.std(group2_clean, ddof=1)
print(f"\\nDESCRIPTIVE STATISTICS:")
print(f" {group1_name}: Mean = {mean1:.4f}, SD = {std1:.4f}")
print(f" {group2_name}: Mean = {mean2:.4f}, SD = {std2:.4f}")
# Assumption checking
print(f"\\nASSUMPTION CHECKING:")
# Normality tests
_, p_norm1 = stats.shapiro(group1_clean) if len(group1_clean) <= 5000 else stats.kstest(group1_clean, 'norm')
_, p_norm2 = stats.shapiro(group2_clean) if len(group2_clean) <= 5000 else stats.kstest(group2_clean, 'norm')
print(f" Normality {group1_name}: p = {p_norm1:.4f}")
print(f" Normality {group2_name}: p = {p_norm2:.4f}")
# Equal variances test
_, p_levene = stats.levene(group1_clean, group2_clean)
equal_vars = p_levene > 0.05
print(f" Equal variances (Levene): p = {p_levene:.4f}")
# Perform appropriate test
if equal_vars:
t_stat, p_value = stats.ttest_ind(group1_clean, group2_clean, equal_var=True)
test_type = "Student's t-test"
else:
t_stat, p_value = stats.ttest_ind(group1_clean, group2_clean, equal_var=False)
test_type = "Welch's t-test"
# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(group1_clean)-1)*np.var(group1_clean, ddof=1) +
(len(group2_clean)-1)*np.var(group2_clean, ddof=1)) /
(len(group1_clean) + len(group2_clean) - 2))
cohens_d = (mean1 - mean2) / pooled_std
print(f"\\nTEST RESULTS ({test_type}):")
print(f" t-statistic: {t_stat:.4f}")
print(f" p-value: {p_value:.6f}")
print(f" Cohen's d: {cohens_d:.4f}")
# Interpretation
print(f"\\nINTERPRETATION:")
if p_value < alpha:
print(f" Statistically significant difference (p < {alpha})")
else:
print(f" No statistically significant difference (p >= {alpha})")
return {
'test_type': test_type,
't_statistic': t_stat,
'p_value': p_value,
'cohens_d': cohens_d,
'significant': p_value < alpha
}
""",
"paired_ttest": """
def perform_paired_ttest(before_data, after_data, condition1_name="Before", condition2_name="After", alpha=0.05):
import numpy as np
from scipy import stats
# Data preprocessing
before = np.array(before_data)
after = np.array(after_data)
if len(before) != len(after):
raise ValueError("Before and after data must have the same length")
# Remove pairs with missing values
valid_pairs = ~(np.isnan(before) | np.isnan(after))
before_clean = before[valid_pairs]
after_clean = after[valid_pairs]
differences = after_clean - before_clean
print("PAIRED SAMPLES T-TEST ANALYSIS")
print("=" * 50)
print(f"Comparing {condition1_name} vs {condition2_name} (n={len(before_clean)} pairs)")
# Descriptive statistics
mean_before = np.mean(before_clean)
mean_after = np.mean(after_clean)
mean_diff = np.mean(differences)
print(f"\\nDESCRIPTIVE STATISTICS:")
print(f" {condition1_name}: Mean = {mean_before:.4f}")
print(f" {condition2_name}: Mean = {mean_after:.4f}")
print(f" Mean difference: {mean_diff:.4f}")
# Check normality of differences
print(f"\\nASSUMPTION CHECKING:")
_, p_norm = stats.shapiro(differences) if len(differences) <= 5000 else stats.kstest(differences, 'norm')
print(f" Normality of differences: p = {p_norm:.4f}")
# Perform paired t-test
t_stat, p_value = stats.ttest_rel(after_clean, before_clean)
# Calculate effect size
std_diff = np.std(differences, ddof=1)
cohens_d = mean_diff / std_diff
print(f"\\nTEST RESULTS:")
print(f" t-statistic: {t_stat:.4f}")
print(f" p-value: {p_value:.6f}")
print(f" Cohen's d: {cohens_d:.4f}")
# Interpretation
print(f"\\nINTERPRETATION:")
if p_value < alpha:
direction = "increased" if mean_diff > 0 else "decreased"
print(f" Statistically significant {direction} (p < {alpha})")
else:
print(f" No statistically significant change (p >= {alpha})")
return {
't_statistic': t_stat,
'p_value': p_value,
'mean_difference': mean_diff,
'cohens_d': cohens_d,
'significant': p_value < alpha
}
"""
}
def generate_test_code(self, test_name: str, characteristics: ProblemCharacteristics,
custom_requirements: str = "") -> str:
"""Generate complete test implementation code using LLM"""
# Check if we have a template
test_key = self._get_test_key_from_name(test_name)
if test_key in self.code_templates:
base_code = self.code_templates[test_key]
else:
base_code = ""
# Use LLM to enhance or generate code
code_prompt = f"""
Generate a complete Python function for performing a {test_name} with the following requirements:
Problem characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
- Sample size: {characteristics.sample_size or "Variable"}
- Significance level: {characteristics.alpha_level}
Additional requirements: {custom_requirements}
The function should include:
1. Comprehensive data validation and preprocessing
2. Assumption checking with appropriate tests
3. The main statistical test implementation
4. Effect size calculation
5. Confidence intervals where appropriate
6. Clear interpretation of results
7. Proper error handling
8. Detailed output formatting
{"Use this as a starting template and enhance it:" + base_code if base_code else "Create a complete implementation from scratch."}
Return only the Python code with proper formatting and documentation.
"""
generated_code = self.llm.generate_response(code_prompt, max_tokens=1500)
# Clean up the generated code
return self._clean_generated_code(generated_code)
def generate_usage_example(self, test_name: str, characteristics: ProblemCharacteristics) -> str:
"""Generate usage example for the test"""
example_prompt = f"""
Create a realistic usage example for the {test_name} function based on these characteristics:
- Data type: {characteristics.data_type.value}
- Sample structure: {characteristics.sample_structure.value}
- Research question: {characteristics.research_question.value}
- Number of groups: {characteristics.num_groups}
Include:
1. Sample data generation or realistic data examples
2. Function call with appropriate parameters
3. Brief explanation of the example scenario
Make it practical and educational for software engineers.
"""
return self.llm.generate_response(example_prompt, max_tokens=400)
def _get_test_key_from_name(self, test_name: str) -> str:
"""Convert test name to internal key"""
name_mapping = {
"Independent Samples T-Test": "independent_ttest",
"Paired Samples T-Test": "paired_ttest",
"One-Way ANOVA": "one_way_anova",
"Mann-Whitney U Test": "mann_whitney_u",
"Wilcoxon Signed-Rank Test": "wilcoxon_signed_rank",
"Chi-Square Test of Independence": "chi_square_test",
"Pearson Correlation": "pearson_correlation"
}
return name_mapping.get(test_name, "")
def _clean_generated_code(self, code: str) -> str:
"""Clean and format generated code"""
# Remove markdown code blocks if present
if "```python" in code:
start = code.find("```python") + 9
end = code.rfind("```")
if end > start:
code = code[start:end]
elif "```" in code:
start = code.find("```") + 3
end = code.rfind("```")
if end > start:
code = code[start:end]
# Clean up extra whitespace
lines = code.split('\n')
cleaned_lines = [line.rstrip() for line in lines]
return '\n'.join(cleaned_lines).strip()
class StatisticalAgent:
"""Main LLM-based statistical test generation agent"""
def __init__(self, llm_interface: LLMInterface):
self.llm = llm_interface
self.nlp = LLMBasedNaturalLanguageProcessor(llm_interface)
self.knowledge_base = StatisticalKnowledgeBase(llm_interface)
self.test_selector = LLMEnhancedTestSelectionEngine(self.knowledge_base, llm_interface)
self.code_generator = LLMCodeGenerator(llm_interface)
def analyze_problem(self, problem_description: str,
custom_requirements: str = "",
generate_code: bool = True) -> Dict[str, Any]:
"""Complete analysis pipeline from problem description to code generation"""
logger.info(f"Analyzing problem: {problem_description[:100]}...")
# Step 1: Extract problem characteristics
characteristics = self.nlp.extract_characteristics(problem_description)
# Step 2: Generate problem summary
problem_summary = self.nlp.generate_problem_summary(characteristics)
# Step 3: Get test recommendation
recommendation = self.test_selector.recommend_test(characteristics)
# Step 4: Get detailed test explanation
test_explanation = self.knowledge_base.get_test_explanation(
self._get_test_key_from_name(recommendation.test_name)
)
# Step 5: Generate code if requested
generated_code = ""
usage_example = ""
if generate_code and recommendation.test_name != "No suitable test found":
generated_code = self.code_generator.generate_test_code(
recommendation.test_name, characteristics, custom_requirements
)
usage_example = self.code_generator.generate_usage_example(
recommendation.test_name, characteristics
)
return {
"problem_characteristics": asdict(characteristics),
"problem_summary": problem_summary,
"test_recommendation": asdict(recommendation),
"test_explanation": test_explanation,
"generated_code": generated_code,
"usage_example": usage_example,
"timestamp": pd.Timestamp.now().isoformat()
}
def explain_test_method(self, test_name: str) -> str:
"""Get detailed explanation of a specific test method"""
test_key = self._get_test_key_from_name(test_name)
if test_key:
return self.knowledge_base.get_test_explanation(test_key)
else:
# Use LLM to explain unknown test
explanation_prompt = f"""
Explain the {test_name} statistical test in detail for software engineers:
Include:
1. When and why to use this test
2. Key assumptions
3. How to interpret results
4. Common pitfalls and considerations
Make it practical and accessible.
"""
return self.llm.generate_response(explanation_prompt, max_tokens=500)
def suggest_alternative_tests(self, problem_description: str) -> List[Dict[str, Any]]:
"""Suggest multiple alternative tests for a problem"""
characteristics = self.nlp.extract_characteristics(problem_description)
suitable_tests = self.knowledge_base.find_suitable_tests(characteristics)
alternatives = []
for test_key in suitable_tests:
test_info = self.knowledge_base.test_database[test_key]
explanation = self.knowledge_base.get_test_explanation(test_key)
alternatives.append({
"test_name": test_info["name"],
"test_type": test_info["type"],
"description": test_info["description"],
"explanation": explanation,
"assumptions": test_info["assumptions"]
})
return alternatives
def _get_test_key_from_name(self, test_name: str) -> str:
"""Convert test name to internal key"""
for key, info in self.knowledge_base.test_database.items():
if info["name"] == test_name:
return key
return ""
# Example usage and configuration
def create_agent_with_openai(api_key: str) -> StatisticalAgent:
"""Create agent with OpenAI interface"""
llm_interface = OpenAIInterface(api_key)
return StatisticalAgent(llm_interface)
def create_agent_with_local_model(model_name: str = "microsoft/DialoGPT-medium") -> StatisticalAgent:
"""Create agent with local Hugging Face model"""
llm_interface = HuggingFaceInterface(model_name)
return StatisticalAgent(llm_interface)
# Example usage
if __name__ == "__main__":
# Example with OpenAI (requires API key)
# agent = create_agent_with_openai("your-openai-api-key")
# Example with local model
agent = create_agent_with_local_model()
# Test the agent
problem = """
I collected sleep duration data from 25 people before and after implementing
a new sleep hygiene program. I want to know if the program was effective.
The before measurements averaged 6.5 hours with some variation, and after
measurements seemed to show improvement.
"""
result = agent.analyze_problem(problem)
print("STATISTICAL ANALYSIS AGENT RESULTS")
print("=" * 50)
print(f"Problem Summary: {result['problem_summary']}")
print(f"\nRecommended Test: {result['test_recommendation']['test_name']}")
print(f"Confidence: {result['test_recommendation']['confidence']:.2f}")
print(f"\nRationale: {result['test_recommendation']['rationale']}")
print(f"\nTest Explanation: {result['test_explanation']}")
if result['generated_code']:
print(f"\nGenerated Code:")
print(result['generated_code'])
if result['usage_example']:
print(f"\nUsage Example:")
print(result['usage_example'])