Introduction
The development of domain-specific languages and parsers has traditionally required deep expertise in compiler construction and formal language theory. This article presents a comprehensive approach to building an intelligent chatbot system that leverages Large Language Models to automatically generate ANTLR v4 parsers and lexers based on user specifications. The system can process both concrete language requests and Backus-Naur Form grammar definitions, using available GPU resources to accelerate inference.
The proposed architecture combines the power of modern LLMs with established parsing technologies to democratize parser generation. Users can simply describe their parsing needs in natural language or provide formal grammar specifications, and the system will automatically generate complete parser implementations with detailed guidance for refinement and deployment.
System Architecture Overview
The LLM-powered ANTLR generator consists of several interconnected components that work together to transform user requests into functional parsers. The core architecture follows clean architecture principles with clear separation of concerns and dependency inversion.
The primary components are the LLM Interface Layer, which handles communication with both local and remote language models; the Grammar Search Engine, which discovers existing ANTLR grammars; the BNF Conversion Module, which transforms Backus-Naur Form specifications into ANTLR syntax; the ANTLR Generation Engine, which produces parser code; and the Result Summarization Component, which provides user guidance.
The system leverages GPU acceleration through a unified GPU abstraction layer that supports NVIDIA CUDA, AMD ROCm, and Apple Metal Performance Shaders. This enables efficient processing of large language models while maintaining compatibility across different hardware platforms.
Core Component Design
The LLM Interface Layer serves as the central communication hub between user requests and the language model. This component abstracts the differences between local and remote LLM deployments, providing a consistent interface for natural language processing tasks.
class LLMInterface:
    def __init__(self, model_config, gpu_config):
        self.model_config = model_config
        self.gpu_accelerator = GPUAccelerator(gpu_config)
        self.tokenizer = self._initialize_tokenizer()
        self.model = self._load_model()

    def _initialize_tokenizer(self):
        # Initialize the tokenizer based on the model configuration
        if self.model_config.model_type == "local":
            return AutoTokenizer.from_pretrained(self.model_config.model_path)
        else:
            return RemoteTokenizer(self.model_config.api_endpoint)

    def process_request(self, user_input, context):
        # Encode the input, generate on the configured device, and decode
        tokens = self.tokenizer.encode(user_input)
        with self.gpu_accelerator.context():
            response = self.model.generate(tokens, context)
        return self.tokenizer.decode(response)
The LLM Interface Layer handles the complexity of model loading and GPU memory management. When processing requests, it ensures optimal utilization of available hardware resources while maintaining consistent response quality across different deployment scenarios.
The Grammar Search Engine implements intelligent web search capabilities specifically designed for discovering ANTLR grammar files. This component uses sophisticated search strategies to locate high-quality grammar definitions for requested programming languages.
class GrammarSearchEngine:
    def __init__(self, search_config):
        self.search_providers = self._initialize_providers(search_config)
        self.grammar_validator = ANTLRGrammarValidator()
        self.cache = GrammarCache()

    def search_grammar(self, language_name):
        # Return a cached grammar if one exists for this language
        cached_result = self.cache.get(language_name)
        if cached_result:
            return cached_result
        search_terms = self._generate_search_terms(language_name)
        results = []
        for provider in self.search_providers:
            provider_results = provider.search(search_terms)
            validated_results = self._validate_grammars(provider_results)
            results.extend(validated_results)
        best_grammar = self._rank_and_select(results)
        self.cache.store(language_name, best_grammar)
        return best_grammar
The search engine employs multiple search strategies including GitHub repository searches, academic paper repositories, and specialized ANTLR grammar collections. Each discovered grammar undergoes validation to ensure syntactic correctness and completeness before being considered for use.
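Before a discovered grammar reaches the ranking step, a lightweight structural check can weed out obviously broken candidates. The sketch below is illustrative only (the real validator would invoke the ANTLR tool itself); `looks_like_antlr_grammar` is a hypothetical helper, not part of the system above:

```python
import re

def looks_like_antlr_grammar(text: str) -> bool:
    """Cheap structural checks on a candidate grammar before a full ANTLR parse."""
    has_header = re.search(r"^\s*(lexer\s+|parser\s+)?grammar\s+\w+\s*;", text, re.M)
    has_parser_rule = re.search(r"^\s*[a-z]\w*\s*:", text, re.M)  # parser rules start lowercase
    balanced = text.count("{") == text.count("}")                 # embedded actions must close
    return bool(has_header and has_parser_rule and balanced)

sample = "grammar Calculator;\nexpr : NUMBER ('+' NUMBER)* ;\nNUMBER : [0-9]+ ;"
print(looks_like_antlr_grammar(sample))  # True
```

A candidate that passes this pre-filter would still be compiled by ANTLR during full validation; the check only saves the cost of invoking the toolchain on junk search results.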
The BNF Conversion Module represents one of the most sophisticated components in the system. It transforms Backus-Naur Form specifications into valid ANTLR v4 grammar syntax while preserving the semantic meaning of the original specification.
class BNFConverter:
    def __init__(self):
        self.bnf_parser = BNFParser()
        self.antlr_generator = ANTLRGrammarGenerator()
        self.semantic_analyzer = SemanticAnalyzer()

    def convert_bnf_to_antlr(self, bnf_specification):
        # Parse the BNF specification and emit an equivalent ANTLR grammar
        bnf_ast = self.bnf_parser.parse(bnf_specification)
        semantic_model = self.semantic_analyzer.analyze(bnf_ast)
        antlr_grammar = self.antlr_generator.generate(semantic_model)
        return antlr_grammar

    def _handle_bnf_constructs(self, bnf_node):
        # Convert specific BNF constructs to their ANTLR equivalents
        if bnf_node.type == "ALTERNATIVE":
            return self._convert_alternatives(bnf_node)
        elif bnf_node.type == "SEQUENCE":
            return self._convert_sequence(bnf_node)
        elif bnf_node.type == "OPTIONAL":
            return self._convert_optional(bnf_node)
        elif bnf_node.type == "REPETITION":
            return self._convert_repetition(bnf_node)
The BNF Converter handles the nuanced differences between BNF notation and ANTLR syntax. It recognizes common BNF patterns and transforms them into idiomatic ANTLR constructs while maintaining the original language semantics.
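To make that mapping concrete, the sketch below shows how common BNF/EBNF shorthand can be rewritten into ANTLR operators: `[x]` becomes `(x)?`, `{x}` becomes `(x)*`, angle-bracketed nonterminals lose their brackets, and `::=` becomes `:`. It is a deliberately minimal, regex-based illustration, not the module's actual AST-based implementation:

```python
import re

def convert_bnf_shorthand(rule: str) -> str:
    """Rewrite common BNF/EBNF shorthand into ANTLR operators (illustrative only)."""
    rule = re.sub(r"\[([^\]]+)\]", r"(\1)?", rule)   # [x] optional  -> (x)?
    rule = re.sub(r"\{([^}]+)\}", r"(\1)*", rule)    # {x} repeated  -> (x)*
    rule = re.sub(r"<([^>]+)>", r"\1", rule)         # <x> nonterminal -> x
    rule = rule.replace("::=", ":")                  # definition operator
    return rule.strip() + (";" if not rule.rstrip().endswith(";") else "")

print(convert_bnf_shorthand("<args> ::= <arg> { ',' <arg> }"))
# -> args : arg ( ',' arg )*;
```

A real converter must also rename rules that collide with ANTLR keywords and decide, per symbol, whether it should become a parser rule or a lexer token.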
GPU Acceleration Framework
The GPU acceleration framework provides a unified interface for leveraging different GPU architectures. This abstraction layer enables the system to automatically detect and utilize available GPU resources regardless of the underlying hardware platform.
class GPUAccelerator:
    def __init__(self, gpu_config):
        self.gpu_type = self._detect_gpu_type()
        self.device_manager = self._create_device_manager()
        self.memory_manager = GPUMemoryManager(self.gpu_type)

    def _detect_gpu_type(self):
        # Automatically detect the available GPU hardware
        if torch.cuda.is_available():
            return "NVIDIA_CUDA"
        elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
            return "APPLE_MPS"
        elif self._check_rocm_availability():
            return "AMD_ROCM"
        else:
            return "CPU_FALLBACK"

    def context(self):
        # Provide a GPU context for model operations
        return self.device_manager.get_context()
The GPU acceleration framework automatically optimizes memory allocation and computation scheduling based on the detected hardware. For NVIDIA GPUs, it utilizes CUDA cores and Tensor cores when available. For AMD hardware, it leverages ROCm for compute acceleration. Apple Silicon devices benefit from Metal Performance Shaders integration for efficient neural network operations.
The framework also implements intelligent memory management to handle large language models efficiently. It employs techniques such as gradient checkpointing, model sharding, and dynamic batching to maximize throughput while preventing out-of-memory errors.
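The dynamic-batching idea can be sketched as a retry loop that halves the batch size whenever a batch exhausts memory. The helper below is a simplification under stated assumptions: real implementations hook the framework's out-of-memory signal rather than Python's `MemoryError`, and `process_batch` stands in for an actual inference call:

```python
def run_with_adaptive_batch(process_batch, items, max_batch=32):
    """Process items, halving the batch size whenever a batch exhausts memory."""
    batch_size, i, outputs = max_batch, 0, []
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            outputs.extend(process_batch(batch))
            i += len(batch)
        except MemoryError:
            if batch_size == 1:
                raise  # a single item does not fit; nothing left to shrink
            batch_size //= 2
    return outputs

def fake_inference(batch):
    if len(batch) > 8:  # pretend batches larger than 8 overflow GPU memory
        raise MemoryError
    return [x * 2 for x in batch]

print(run_with_adaptive_batch(fake_inference, list(range(20))))
```

The same shrink-and-retry shape applies whether the unit of work is a token batch, a prompt batch, or a model shard.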
ANTLR Generation Pipeline
The ANTLR Generation Pipeline orchestrates the entire process from user input to final parser generation. This component coordinates between all other modules to ensure smooth execution and proper error handling throughout the generation process.
class ANTLRGenerationPipeline:
    def __init__(self, config):
        self.llm_interface = LLMInterface(config.llm_config, config.gpu_config)
        self.grammar_search = GrammarSearchEngine(config.search_config)
        self.bnf_converter = BNFConverter()
        self.antlr_compiler = ANTLRCompiler()
        self.result_summarizer = ResultSummarizer()

    def generate_parser(self, user_request):
        # Main pipeline: analyze, obtain a grammar, compile, summarize
        request_analysis = self.llm_interface.analyze_request(user_request)
        if request_analysis.input_type == "LANGUAGE_NAME":
            grammar = self.grammar_search.search_grammar(request_analysis.language)
        elif request_analysis.input_type == "BNF_SPECIFICATION":
            grammar = self.bnf_converter.convert_bnf_to_antlr(request_analysis.bnf)
        else:
            raise UnsupportedRequestTypeError("Unknown request type")
        parser_code = self.antlr_compiler.compile_grammar(
            grammar,
            request_analysis.target_language
        )
        summary = self.result_summarizer.create_summary(
            grammar,
            parser_code,
            request_analysis
        )
        return GenerationResult(grammar, parser_code, summary)
The pipeline implements comprehensive error handling and recovery mechanisms. When grammar search fails, it can fall back to LLM-generated grammars. If BNF conversion encounters ambiguities, it requests clarification from the user through natural language interaction.
The ANTLR Compiler component wraps the standard ANTLR tool chain and provides additional functionality for multi-language code generation. It supports Java, Python, C#, JavaScript, Go, and C++ target languages with appropriate runtime library integration.
class ANTLRCompiler:
    def __init__(self):
        self.antlr_tool = ANTLRTool()
        self.code_generators = self._initialize_generators()

    def compile_grammar(self, grammar, target_language):
        # Compile the ANTLR grammar for the requested target language
        grammar_file = self._write_grammar_file(grammar)
        compilation_result = self.antlr_tool.compile(
            grammar_file,
            target_language
        )
        if compilation_result.has_errors():
            return self._handle_compilation_errors(compilation_result)
        generated_code = self._collect_generated_files(compilation_result)
        return self._package_result(generated_code, target_language)
The compiler automatically handles ANTLR tool invocation, manages temporary files, and collects all generated artifacts. It also performs post-processing to integrate runtime libraries and generate example usage code.
Natural Language Processing Integration
The natural language processing capabilities enable the system to understand complex user requests and provide intelligent responses. The LLM integration goes beyond simple text generation to include semantic understanding of grammar specifications and parser requirements.
class RequestAnalyzer:
    def __init__(self, llm_interface):
        self.llm_interface = llm_interface
        self.intent_classifier = IntentClassifier()
        self.entity_extractor = EntityExtractor()

    def analyze_request(self, user_input):
        # Classify the request and extract structured parameters
        intent = self.intent_classifier.classify(user_input)
        entities = self.entity_extractor.extract(user_input)
        if intent == "LANGUAGE_PARSER_REQUEST":
            return LanguageRequest(
                language=entities.get("language_name"),
                target_language=entities.get("target_language", "Java"),
                features=entities.get("language_features", [])
            )
        elif intent == "BNF_CONVERSION_REQUEST":
            return BNFRequest(
                bnf_specification=entities.get("bnf_text"),
                target_language=entities.get("target_language", "Java"),
                grammar_name=entities.get("grammar_name", "CustomGrammar")
            )
        # Other intents trigger a clarifying dialogue with the user (not shown)
The request analyzer employs sophisticated natural language understanding techniques to extract structured information from user queries. It can handle ambiguous requests by asking clarifying questions and maintains conversation context across multiple interactions.
The system also implements advanced prompt engineering techniques to ensure consistent and accurate responses from the underlying language model. These prompts are carefully crafted to elicit specific types of information while maintaining natural conversation flow.
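A prompt in this style might pin the model to a fixed intent taxonomy and demand a machine-readable reply, so downstream code can parse the answer deterministically. The template below is a hypothetical example of the approach, not the system's actual prompt:

```python
def build_analysis_prompt(user_input: str) -> str:
    """Assemble a constrained prompt so the LLM answers in machine-readable form."""
    return (
        "You are a parser-generation assistant.\n"
        "Classify the request below as LANGUAGE_PARSER_REQUEST or BNF_CONVERSION_REQUEST.\n"
        'Then answer with JSON only, using the keys "intent", "language_name", '
        '"target_language", and "bnf_text" (null when a field does not apply).\n\n'
        "Request:\n" + user_input
    )

print(build_analysis_prompt("Build me a JSON parser targeting Python"))
```

Constraining the output format this way makes the LLM's reply parseable with an ordinary JSON decoder, which is what lets the rest of the pipeline stay deterministic.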
Result Summarization and User Guidance
The Result Summarization Component generates comprehensive reports that help users understand what was created and how to proceed with their generated parsers. This component leverages the LLM's natural language generation capabilities to produce clear, actionable guidance.
class ResultSummarizer:
    def __init__(self, llm_interface):
        self.llm_interface = llm_interface
        self.template_engine = SummaryTemplateEngine()

    def create_summary(self, grammar, parser_code, request_analysis):
        # Generate a comprehensive summary of the generation results
        grammar_analysis = self._analyze_grammar_structure(grammar)
        code_analysis = self._analyze_generated_code(parser_code)
        summary_context = {
            "grammar_structure": grammar_analysis,
            "code_structure": code_analysis,
            "target_language": request_analysis.target_language,
            "original_request": request_analysis.original_text
        }
        summary_text = self.llm_interface.generate_summary(summary_context)
        refinement_suggestions = self._generate_refinement_suggestions(
            grammar_analysis,
            code_analysis
        )
        return ParserSummary(summary_text, refinement_suggestions)
The summarization component analyzes the generated grammar and code to identify potential areas for improvement. It provides specific suggestions for enhancing parser performance, adding error handling, and extending functionality.
The component also generates example usage code and integration instructions tailored to the target programming language. This includes dependency management, build configuration, and testing strategies appropriate for the generated parser.
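For a Java target, the generated integration instructions could include a usage snippet such as the one rendered below. The renderer itself is a hypothetical helper, though the Java calls it emits (`CharStreams.fromString`, `CommonTokenStream`) are the standard ANTLR 4 runtime API:

```python
def render_java_usage(grammar_name: str, start_rule: str) -> str:
    """Render a minimal Java snippet showing how to drive the generated parser."""
    return f"""\
import org.antlr.v4.runtime.*;

public class Demo {{
    public static void main(String[] args) {{
        CharStream input = CharStreams.fromString(args[0]);
        {grammar_name}Lexer lexer = new {grammar_name}Lexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        {grammar_name}Parser parser = new {grammar_name}Parser(tokens);
        System.out.println(parser.{start_rule}().toStringTree(parser));
    }}
}}
"""

print(render_java_usage("Calculator", "expr"))
```

Parameterizing on the grammar name and start rule lets the same template serve every generated parser for that target language.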
Error Handling and Recovery Strategies
Robust error handling is essential for a production-ready parser generation system. The architecture implements multiple layers of error detection and recovery to ensure graceful handling of various failure scenarios.
class ErrorHandler:
    def __init__(self):
        self.error_strategies = {
            "GRAMMAR_SEARCH_FAILED": self._handle_search_failure,
            "BNF_CONVERSION_ERROR": self._handle_conversion_error,
            "ANTLR_COMPILATION_ERROR": self._handle_compilation_error,
            "GPU_MEMORY_ERROR": self._handle_gpu_error
        }

    def handle_error(self, error_type, error_context):
        # Route errors to the appropriate handling strategy
        if error_type in self.error_strategies:
            return self.error_strategies[error_type](error_context)
        else:
            return self._handle_unknown_error(error_type, error_context)

    def _handle_search_failure(self, context):
        # Fall back to an LLM-generated grammar when search fails
        fallback_grammar = self._generate_grammar_from_llm(context.language)
        return RecoveryResult("GRAMMAR_GENERATED", fallback_grammar)
The error handling system implements progressive fallback strategies. When automated grammar search fails, the system can generate grammars using the LLM's knowledge of programming languages. If BNF conversion encounters ambiguities, it engages in clarifying dialogue with the user.
For GPU-related errors, the system automatically falls back to CPU processing while notifying the user of reduced performance. Memory management errors trigger automatic model optimization and batch size adjustment.
Performance Optimization Techniques
The system employs various optimization techniques to ensure responsive performance even when processing complex grammar specifications or large language models. These optimizations span multiple system layers from GPU utilization to caching strategies.
class PerformanceOptimizer:
    def __init__(self, system_config):
        self.gpu_optimizer = GPUOptimizer()
        self.cache_manager = CacheManager()
        self.model_optimizer = ModelOptimizer()

    def optimize_inference(self, model, input_data):
        # Apply model-level and GPU-level optimizations before inference
        optimized_model = self.model_optimizer.optimize(model)
        if self.gpu_optimizer.supports_mixed_precision():
            optimized_model = self.gpu_optimizer.enable_mixed_precision(optimized_model)
        batch_size = self.gpu_optimizer.calculate_optimal_batch_size(
            optimized_model,
            input_data
        )
        return self._run_optimized_inference(optimized_model, input_data, batch_size)
The performance optimization framework implements dynamic batching to maximize GPU utilization, mixed-precision training for supported hardware, and intelligent caching of frequently requested grammars and model outputs.
The system also employs model quantization techniques when appropriate to reduce memory usage while maintaining output quality. For local model deployments, it supports model sharding across multiple GPUs when available.
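A simple version of the quantization decision is a capacity check across candidate precisions. The heuristic below is illustrative only; the 1.2 overhead factor is an assumed allowance for activations and the KV cache, not a measured constant:

```python
def choose_precision(num_params: int, gpu_mem_bytes: int, overhead: float = 1.2) -> str:
    """Pick the widest weight precision whose parameters fit in GPU memory."""
    for name, bytes_per_param in (("fp32", 4), ("fp16", 2), ("int8", 1)):
        # Pad raw weight size by `overhead` for activations and KV cache
        if num_params * bytes_per_param * overhead <= gpu_mem_bytes:
            return name
    return "cpu_offload"  # nothing fits: shard across GPUs or offload instead

print(choose_precision(7_000_000_000, 24 * 10**9))  # fp16 on a 24 GB card
```

The `cpu_offload` branch is where model sharding across multiple GPUs would kick in for local deployments.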
Security and Privacy Considerations
Security and privacy are paramount when building systems that process user code and grammar specifications. The architecture implements multiple security layers to protect user data and prevent malicious code execution.
class SecurityManager:
    def __init__(self):
        self.input_sanitizer = InputSanitizer()
        self.code_analyzer = CodeSecurityAnalyzer()
        self.sandbox_manager = SandboxManager()

    def validate_user_input(self, user_input):
        # Sanitize user input and reject known attack patterns
        sanitized_input = self.input_sanitizer.sanitize(user_input)
        if self.input_sanitizer.contains_malicious_patterns(sanitized_input):
            raise SecurityViolationError("Potentially malicious input detected")
        return sanitized_input

    def execute_antlr_compilation(self, grammar_file):
        # Run the ANTLR toolchain inside a sandboxed environment
        with self.sandbox_manager.create_sandbox() as sandbox:
            return sandbox.execute_antlr(grammar_file)
The security framework implements input sanitization to prevent injection attacks, sandboxed execution environments for ANTLR compilation, and comprehensive logging for security auditing. All generated code undergoes static analysis to identify potential security vulnerabilities.
For remote LLM deployments, the system implements secure communication protocols and ensures that sensitive grammar specifications are not inadvertently stored or logged by external services.
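One concrete way to keep grammar content out of logs is to record only a fingerprint of it. A minimal sketch, assuming standard-library logging; `log_grammar_event` is a hypothetical helper:

```python
import hashlib
import logging

def log_grammar_event(logger: logging.Logger, grammar_text: str) -> str:
    """Log a fingerprint of the user's grammar instead of its content."""
    digest = hashlib.sha256(grammar_text.encode("utf-8")).hexdigest()[:12]
    # The digest lets operators correlate repeated requests without seeing the text
    logger.info("processed grammar sha256=%s length=%d", digest, len(grammar_text))
    return digest

digest = log_grammar_event(logging.getLogger("antlr-gen"), "grammar Calculator; ...")
```

The same fingerprint can serve as a cache key, so caching and audit logging share one identifier that never exposes the specification itself.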
Testing and Quality Assurance
Comprehensive testing ensures the reliability and correctness of generated parsers. The system implements automated testing frameworks that validate both the generation process and the resulting parser implementations.
class QualityAssuranceFramework:
    def __init__(self):
        self.grammar_tester = GrammarTester()
        self.parser_validator = ParserValidator()
        self.performance_profiler = PerformanceProfiler()

    def validate_generated_parser(self, grammar, parser_code, test_cases):
        # Validate the grammar first, then the parser it produced
        grammar_validation = self.grammar_tester.validate_grammar(grammar)
        if not grammar_validation.is_valid():
            return ValidationResult(False, grammar_validation.errors)
        parser_validation = self.parser_validator.validate_parser(
            parser_code,
            test_cases
        )
        performance_metrics = self.performance_profiler.profile_parser(
            parser_code,
            test_cases
        )
        return ValidationResult(
            parser_validation.is_valid(),
            parser_validation.errors,
            performance_metrics
        )
The quality assurance framework automatically generates test cases based on grammar specifications, validates parser correctness against known language samples, and profiles performance characteristics to identify potential bottlenecks.
The testing system also includes regression testing capabilities to ensure that system updates do not break existing functionality. It maintains a comprehensive test suite covering various programming languages and grammar patterns.
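Test inputs can be derived automatically by randomly expanding the grammar's productions. The sketch below operates on a toy grammar encoded as a dict of alternatives and is meant only to illustrate the idea, not the framework's actual generator:

```python
import random

def generate_sample(grammar, symbol, rng, depth=0, max_depth=8):
    """Expand a nonterminal into a random sample string; terminals pass through."""
    if symbol not in grammar:
        return symbol  # terminal: emit as-is
    alternatives = grammar[symbol]
    # Past max_depth, always pick the first (shortest) alternative so expansion terminates
    alt = rng.choice(alternatives) if depth < max_depth else alternatives[0]
    return " ".join(generate_sample(grammar, s, rng, depth + 1) for s in alt)

toy_grammar = {
    "expr": [["NUMBER"], ["expr", "+", "expr"], ["(", "expr", ")"]],
    "NUMBER": [["1"], ["42"]],
}
print(generate_sample(toy_grammar, "expr", random.Random(7)))
```

Every string produced this way is in the language by construction, so feeding it back through the generated parser gives a cheap round-trip check; rejection cases still need hand-written negative samples.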
Deployment and Scaling Considerations
The system architecture supports various deployment scenarios from single-user desktop applications to large-scale cloud services. The modular design enables flexible scaling strategies based on usage patterns and resource requirements.
class DeploymentManager:
    def __init__(self, deployment_config):
        self.config = deployment_config
        self.resource_manager = ResourceManager()
        self.load_balancer = LoadBalancer()

    def deploy_system(self):
        # Deploy system components according to the configured topology
        if self.config.deployment_type == "SINGLE_USER":
            return self._deploy_standalone()
        elif self.config.deployment_type == "MULTI_USER":
            return self._deploy_distributed()
        elif self.config.deployment_type == "CLOUD_SERVICE":
            return self._deploy_cloud_native()

    def _deploy_distributed(self):
        # Deploy the distributed topology behind a load balancer
        llm_cluster = self.resource_manager.create_llm_cluster()
        grammar_service = self.resource_manager.create_grammar_service()
        compilation_service = self.resource_manager.create_compilation_service()
        self.load_balancer.configure_routing(
            llm_cluster,
            grammar_service,
            compilation_service
        )
The deployment framework supports horizontal scaling of individual components based on demand. LLM inference can be distributed across multiple GPU nodes, while grammar search and compilation services can scale independently.
For cloud deployments, the system integrates with container orchestration platforms and implements auto-scaling policies based on request volume and resource utilization metrics.
Future Enhancement Opportunities
The current architecture provides a solid foundation for future enhancements and feature additions. Several areas present opportunities for significant capability improvements and user experience enhancements.
Advanced grammar optimization techniques could automatically refine generated grammars for better performance and maintainability. Machine learning models could learn from user feedback to improve grammar quality over time.
Integration with version control systems would enable collaborative grammar development and change tracking. Advanced IDE plugins could provide real-time grammar assistance and parser debugging capabilities.
Multi-modal input support could allow users to provide grammar specifications through diagrams, flowcharts, or other visual representations. This would make the system accessible to users who prefer visual specification methods.
Running Example Implementation
The following complete implementation demonstrates a calculator language parser generator that showcases all the key concepts discussed in this article. This example processes a user request for a simple arithmetic expression parser and generates a complete ANTLR grammar with Java parser code.
The source code of the full, general-purpose LLM agent follows below.
"""
Complete LLM-Powered ANTLR Parser Generator
Calculator Language Example Implementation
"""
import torch
import requests
import subprocess
import tempfile
import os
import json
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass
from abc import ABC, abstractmethod
# Configuration classes for system setup
@dataclass
class LLMConfig:
    model_type: str  # "local" or "remote"
    model_path: str
    api_endpoint: Optional[str] = None
    api_key: Optional[str] = None

@dataclass
class GPUConfig:
    enable_gpu: bool = True
    memory_limit: Optional[int] = None
    mixed_precision: bool = True

@dataclass
class SystemConfig:
    llm_config: LLMConfig
    gpu_config: GPUConfig
    antlr_jar_path: str
    temp_directory: str
# Core domain models
@dataclass
class UserRequest:
    original_text: str
    request_type: str  # "LANGUAGE" or "BNF"
    language_name: Optional[str] = None
    bnf_specification: Optional[str] = None
    target_language: str = "Java"

@dataclass
class GenerationResult:
    grammar_content: str
    parser_code: Dict[str, str]  # filename -> content mapping
    summary: str
    refinement_suggestions: List[str]
# GPU Acceleration Framework
class GPUAccelerator:
    def __init__(self, config: GPUConfig):
        self.config = config
        self.device = self._detect_and_configure_device()

    def _detect_and_configure_device(self):
        """Detect and configure the best available GPU device"""
        if not self.config.enable_gpu:
            return torch.device("cpu")
        if torch.cuda.is_available():
            device = torch.device("cuda")
            if self.config.memory_limit:
                torch.cuda.set_per_process_memory_fraction(
                    self.config.memory_limit / torch.cuda.get_device_properties(0).total_memory
                )
            return device
        elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
            return torch.device("mps")
        else:
            print("No GPU available, falling back to CPU")
            return torch.device("cpu")

    def get_device(self):
        """Get the configured device for tensor operations"""
        return self.device
# LLM Interface Implementation
class LLMInterface:
    def __init__(self, config: LLMConfig, gpu_accelerator: GPUAccelerator):
        self.config = config
        self.gpu_accelerator = gpu_accelerator
        self.device = gpu_accelerator.get_device()

    def analyze_request(self, user_input: str) -> UserRequest:
        """Analyze user input to determine request type and extract parameters"""
        # Simplified analysis for demonstration
        user_input_lower = user_input.lower()
        if "calculator" in user_input_lower or "arithmetic" in user_input_lower:
            return UserRequest(
                original_text=user_input,
                request_type="LANGUAGE",
                language_name="calculator",
                target_language="Java"
            )
        elif "bnf" in user_input_lower or "::=" in user_input:
            return UserRequest(
                original_text=user_input,
                request_type="BNF",
                bnf_specification=self._extract_bnf_from_input(user_input),
                target_language="Java"
            )
        else:
            # Default to a language request
            return UserRequest(
                original_text=user_input,
                request_type="LANGUAGE",
                language_name=self._extract_language_name(user_input),
                target_language="Java"
            )

    def _extract_bnf_from_input(self, user_input: str) -> str:
        """Extract the BNF specification from user input"""
        # Simple extraction - in production this would be more sophisticated
        lines = user_input.split('\n')
        bnf_lines = [line for line in lines if '::=' in line or line.strip().startswith('<')]
        return '\n'.join(bnf_lines)

    def _extract_language_name(self, user_input: str) -> str:
        """Extract the language name from user input"""
        # Simple keyword extraction - in production this would use NLP
        common_languages = ["java", "python", "c++", "javascript", "calculator", "json", "xml"]
        user_input_lower = user_input.lower()
        for lang in common_languages:
            if lang in user_input_lower:
                return lang
        return "unknown"

    def generate_summary(self, generation_result: GenerationResult) -> str:
        """Generate a comprehensive summary of the generation process"""
        summary = f"""
ANTLR Parser Generation Summary
==============================
Generated Grammar: {len(generation_result.grammar_content)} characters
Target Language: Java
Generated Files: {len(generation_result.parser_code)} files
Grammar Structure Analysis:
- The grammar defines a complete parser for the requested language
- Lexer rules handle tokenization of input text
- Parser rules define the syntactic structure
Generated Files:
"""
        for filename in generation_result.parser_code.keys():
            summary += f"- {filename}\n"
        summary += """
Next Steps:
1. Compile the generated Java files with the ANTLR runtime dependency
2. Create a main class to instantiate and use the parser
3. Add error handling and semantic actions as needed
4. Test with sample input files
Integration Instructions:
- Add the ANTLR runtime JAR to your classpath
- Import the generated parser classes
- Create parser instances and call parse methods
"""
        return summary
# Grammar Search Engine
class GrammarSearchEngine:
    def __init__(self):
        self.known_grammars = {
            "calculator": self._get_calculator_grammar(),
            "json": self._get_json_grammar(),
            "arithmetic": self._get_calculator_grammar()
        }

    def search_grammar(self, language_name: str) -> Optional[str]:
        """Search for an existing ANTLR grammar for the specified language"""
        if language_name.lower() in self.known_grammars:
            return self.known_grammars[language_name.lower()]
        # In production, this would search GitHub, ANTLR grammar repositories, etc.
        print(f"No known grammar found for {language_name}, generating basic template")
        return None

    def _get_calculator_grammar(self) -> str:
        """Return a complete calculator grammar"""
        return """
grammar Calculator;

// Parser rules
expr: expr ('*'|'/') expr
    | expr ('+'|'-') expr
    | '(' expr ')'
    | NUMBER
    ;

// Lexer rules
NUMBER: [0-9]+ ('.' [0-9]+)?;
WS: [ \\t\\r\\n]+ -> skip;
"""

    def _get_json_grammar(self) -> str:
        """Return a basic JSON grammar"""
        return """
grammar JSON;

json: value;
value: STRING
     | NUMBER
     | 'true'
     | 'false'
     | 'null'
     | object
     | array
     ;
object: '{' pair (',' pair)* '}'
      | '{' '}'
      ;
pair: STRING ':' value;
array: '[' value (',' value)* ']'
     | '[' ']'
     ;

STRING: '"' (~[\\r\\n"] | '\\\\' .)* '"';
NUMBER: '-'? [0-9]+ ('.' [0-9]+)?;
WS: [ \\t\\r\\n]+ -> skip;
"""
# BNF to ANTLR Converter
class BNFConverter:
    def __init__(self):
        self.conversion_rules = {
            "::=": ":",
            "<": "",
            ">": "",
            "|": "|"
        }

    def convert_bnf_to_antlr(self, bnf_specification: str) -> str:
        """Convert a BNF specification to an ANTLR grammar"""
        lines = bnf_specification.strip().split('\n')
        antlr_lines = []
        # Add the grammar header
        antlr_lines.append("grammar GeneratedGrammar;")
        antlr_lines.append("")
        # Convert each BNF rule
        for line in lines:
            if '::=' in line:
                antlr_line = self._convert_bnf_rule(line)
                antlr_lines.append(antlr_line)
        # Add basic lexer rules
        antlr_lines.extend([
            "",
            "// Basic lexer rules",
            "ID: [a-zA-Z][a-zA-Z0-9]*;",
            "NUMBER: [0-9]+;",
            "WS: [ \\t\\r\\n]+ -> skip;"
        ])
        return '\n'.join(antlr_lines)

    def _convert_bnf_rule(self, bnf_rule: str) -> str:
        """Convert a single BNF rule to ANTLR syntax"""
        # Remove angle brackets and convert the assignment operator
        converted = bnf_rule.replace('<', '').replace('>', '').replace('::=', ':')
        # Add a trailing semicolon if not present
        if not converted.strip().endswith(';'):
            converted += ';'
        return converted
# ANTLR Compiler Wrapper
class ANTLRCompiler:
    def __init__(self, antlr_jar_path: str, temp_directory: str):
        self.antlr_jar_path = antlr_jar_path
        self.temp_directory = temp_directory

    def compile_grammar(self, grammar_content: str, target_language: str = "Java") -> Dict[str, str]:
        """Compile an ANTLR grammar and return the generated code"""
        # ANTLR requires the file name to match the declared grammar name
        grammar_name = "Grammar"
        for line in grammar_content.splitlines():
            stripped = line.strip()
            if stripped.startswith("grammar ") and stripped.endswith(";"):
                grammar_name = stripped[len("grammar "):-1].strip()
                break
        grammar_file = os.path.join(self.temp_directory, f"{grammar_name}.g4")
        with open(grammar_file, 'w') as f:
            f.write(grammar_content)
        # Run the ANTLR tool
        cmd = [
            "java", "-jar", self.antlr_jar_path,
            "-Dlanguage=" + target_language,
            "-o", self.temp_directory,
            grammar_file
        ]
        try:
            subprocess.run(cmd, capture_output=True, text=True, check=True)
            print("ANTLR compilation successful")
        except subprocess.CalledProcessError as e:
            print(f"ANTLR compilation failed: {e.stderr}")
            return {}
        # Collect the generated source files
        generated_files = {}
        for filename in os.listdir(self.temp_directory):
            if filename.endswith(('.java', '.py', '.cpp')):
                filepath = os.path.join(self.temp_directory, filename)
                with open(filepath, 'r') as f:
                    generated_files[filename] = f.read()
        return generated_files
# Main Pipeline Orchestrator
class ANTLRGenerationPipeline:
    def __init__(self, config: SystemConfig):
        self.config = config
        self.gpu_accelerator = GPUAccelerator(config.gpu_config)
        self.llm_interface = LLMInterface(config.llm_config, self.gpu_accelerator)
        self.grammar_search = GrammarSearchEngine()
        self.bnf_converter = BNFConverter()
        self.antlr_compiler = ANTLRCompiler(config.antlr_jar_path, config.temp_directory)

    def generate_parser(self, user_input: str) -> GenerationResult:
        """Main pipeline for generating ANTLR parsers from user input"""
        print(f"Processing request: {user_input}")
        # Analyze the user request
        request = self.llm_interface.analyze_request(user_input)
        print(f"Request type: {request.request_type}")
        # Find or generate a grammar
        if request.request_type == "LANGUAGE":
            grammar_content = self.grammar_search.search_grammar(request.language_name)
            if not grammar_content:
                grammar_content = self._generate_default_grammar(request.language_name)
        elif request.request_type == "BNF":
            grammar_content = self.bnf_converter.convert_bnf_to_antlr(request.bnf_specification)
        else:
            raise ValueError(f"Unsupported request type: {request.request_type}")
        print("Grammar generated successfully")
        # Compile the grammar to the target language
        parser_code = self.antlr_compiler.compile_grammar(grammar_content, request.target_language)
        # Generate the result summary
        result = GenerationResult(
            grammar_content=grammar_content,
            parser_code=parser_code,
            summary="",
            refinement_suggestions=[]
        )
        result.summary = self.llm_interface.generate_summary(result)
        result.refinement_suggestions = self._generate_refinement_suggestions(result)
        return result

    def _generate_default_grammar(self, language_name: str) -> str:
        """Generate a basic grammar template for unknown languages"""
        return f"""
grammar {language_name.capitalize()};

// Main entry point
start: statement+;
statement: expression ';';
expression: ID | NUMBER | STRING;

// Lexer rules
ID: [a-zA-Z][a-zA-Z0-9]*;
NUMBER: [0-9]+;
STRING: '"' (~[\\r\\n"] | '\\\\' .)* '"';
WS: [ \\t\\r\\n]+ -> skip;
"""

    def _generate_refinement_suggestions(self, result: GenerationResult) -> List[str]:
        """Generate suggestions for improving the generated parser"""
        suggestions = [
            "Add semantic actions to build an Abstract Syntax Tree (AST)",
            "Implement error recovery strategies for better error messages",
            "Add support for comments in the language specification",
            "Consider adding operator precedence rules for mathematical expressions",
            "Implement visitor or listener patterns for tree traversal",
            "Add comprehensive unit tests for the parser"
        ]
        return suggestions
# Example usage and demonstration
def main():
    """Demonstrate the complete ANTLR generation pipeline"""
    # Configuration setup
    config = SystemConfig(
        llm_config=LLMConfig(
            model_type="local",
            model_path="gpt2"  # Placeholder for actual model
        ),
        gpu_config=GPUConfig(
            enable_gpu=True,
            mixed_precision=True
        ),
        antlr_jar_path="/path/to/antlr-4.9.2-complete.jar",  # Update with actual path
        temp_directory=tempfile.mkdtemp()
    )

    # Create pipeline instance
    pipeline = ANTLRGenerationPipeline(config)

    # Example 1: Generate calculator parser
    print("Example 1: Calculator Language Parser")
    print("=" * 50)
    calculator_request = "Generate a parser for a simple calculator language that supports arithmetic expressions with numbers, parentheses, and basic operators"

    try:
        result = pipeline.generate_parser(calculator_request)
        print("Generated Grammar:")
        print("-" * 20)
        print(result.grammar_content)
        print()
        print("Generated Files:")
        print("-" * 20)
        for filename, content in result.parser_code.items():
            print(f"File: {filename}")
            print(f"Size: {len(content)} characters")
            print()
        print("Summary:")
        print("-" * 20)
        print(result.summary)
        print()
        print("Refinement Suggestions:")
        print("-" * 20)
        for i, suggestion in enumerate(result.refinement_suggestions, 1):
            print(f"{i}. {suggestion}")
    except Exception as e:
        print(f"Error generating parser: {e}")

    # Example 2: BNF conversion
    print("\n\nExample 2: BNF to ANTLR Conversion")
    print("=" * 50)
    bnf_request = """
    Convert this BNF to ANTLR:
    <expr> ::= <term> | <expr> '+' <term> | <expr> '-' <term>
    <term> ::= <factor> | <term> '*' <factor> | <term> '/' <factor>
    <factor> ::= <number> | '(' <expr> ')'
    """

    try:
        result = pipeline.generate_parser(bnf_request)
        print("Converted Grammar:")
        print("-" * 20)
        print(result.grammar_content)
    except Exception as e:
        print(f"Error converting BNF: {e}")

    # Cleanup
    import shutil
    shutil.rmtree(config.temp_directory)

if __name__ == "__main__":
    main()
This complete implementation demonstrates all the key concepts discussed in the article. The system can process natural language requests for parser generation, search for existing grammars, convert BNF specifications to ANTLR syntax, compile grammars using the ANTLR tool, and provide comprehensive summaries with refinement suggestions.
The example showcases GPU acceleration support, modular architecture with clean separation of concerns, comprehensive error handling, and extensible design patterns. The calculator language example provides a concrete demonstration of the entire pipeline from user request to generated parser code.
The implementation follows clean architecture principles with dependency injection, abstract interfaces, and clear separation between domain logic and infrastructure concerns. Each component can be independently tested and extended without affecting other parts of the system.
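To make that testability claim concrete, the sketch below shows the dependency-injection pattern in miniature: a pipeline-like object receives its LLM collaborator through the constructor, so a stub can replace the real model in tests. `FakeLLM` and `MiniPipeline` are hypothetical illustrations, not classes from the listing.

```python
from dataclasses import dataclass

class FakeLLM:
    """Stub LLM that returns a canned grammar instead of calling a model."""
    def generate_grammar(self, description: str) -> str:
        # A fixed response lets tests run offline and deterministically.
        return "grammar Stub;\nstart: ID+;\nID: [a-z]+;\nWS: [ \\t\\r\\n]+ -> skip;"

@dataclass
class MiniPipeline:
    llm: object  # any object exposing generate_grammar(description)

    def generate(self, description: str) -> str:
        grammar = self.llm.generate_grammar(description)
        # Downstream steps (compilation, summarization) would follow here.
        return grammar

pipeline = MiniPipeline(llm=FakeLLM())
grammar = pipeline.generate("a toy word language")
assert grammar.startswith("grammar Stub;")
```

The same seam is what lets the real `ANTLRGenerationPipeline` swap local and remote model backends without touching the orchestration logic.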
System Overview
This implementation provides a complete, general-purpose LLM Agent that processes arbitrary user prompts to generate ANTLR v4 parsers. The agent intelligently analyzes user requests, searches for existing grammars when appropriate, converts BNF specifications, generates custom grammars, and executes the complete ANTLR toolchain to produce working parsers.
The system is designed to handle any language specification or parsing requirement without being limited to predefined examples or templates.
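The BNF-to-ANTLR rewrite the agent delegates to the LLM is, for the common cases, a mechanical transformation: `::=` becomes `:`, angle brackets around non-terminals drop, `[...]` becomes `(...)?`, and `{...}` becomes `(...)*`. A minimal regex sketch of just those rules, for intuition only (the LLM-based conversion handles nesting and many edge cases this sketch does not):

```python
import re

def bnf_rule_to_antlr(rule: str) -> str:
    """Apply the mechanical BNF -> ANTLR v4 rewrites to a single, non-nested rule."""
    rule = rule.replace("::=", ":")
    rule = re.sub(r"<([^>]+)>", r"\1", rule)        # drop angle brackets from non-terminals
    rule = re.sub(r"\[([^\]]+)\]", r"(\1)?", rule)  # optional [...] -> (...)?
    rule = re.sub(r"\{([^}]+)\}", r"(\1)*", rule)   # repetition {...} -> (...)*
    rule = rule.strip()
    return rule if rule.endswith(";") else rule + " ;"

print(bnf_rule_to_antlr("<expr> ::= <term> { '+' <term> }"))
# -> expr : term ( '+' term )* ;
```

Note that rewriting `{...}` as `(...)*` also removes the left recursion that EBNF repetition usually encodes, which is one reason the converted grammars tend to compile cleanly under ANTLR v4.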
Complete Implementation
import os
import sys
import json
import subprocess
import tempfile
import shutil
import requests
import re
import logging
import asyncio
import aiohttp
from typing import Dict, List, Optional, Tuple, Union, Any
from dataclasses import dataclass, asdict, field
from abc import ABC, abstractmethod
from pathlib import Path
from datetime import datetime, timedelta
import hashlib
import yaml
from urllib.parse import quote_plus, urljoin
import xml.etree.ElementTree as ET
# Configure comprehensive logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('antlr_agent.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)
# Core Configuration Classes
@dataclass
class LLMConfig:
    """Configuration for Language Model integration"""
    provider: str  # "openai", "anthropic", "huggingface", "local", "ollama"
    model_name: str
    api_key: Optional[str] = None
    api_endpoint: Optional[str] = None
    max_tokens: int = 8000
    temperature: float = 0.3
    local_model_path: Optional[str] = None
    timeout: int = 120

@dataclass
class GPUConfig:
    """GPU acceleration configuration"""
    enable_gpu: bool = True
    gpu_type: str = "auto"  # "nvidia", "amd", "apple", "auto"
    memory_limit_gb: Optional[float] = None
    mixed_precision: bool = True
    device_id: int = 0

@dataclass
class SearchConfig:
    """Web search configuration for grammar discovery"""
    enable_web_search: bool = True
    github_token: Optional[str] = None
    search_engines: List[str] = field(default_factory=lambda: ["github", "antlr-grammars"])
    max_results_per_source: int = 5
    timeout: int = 30
    cache_duration_hours: int = 24

@dataclass
class ANTLRConfig:
    """ANTLR tool configuration"""
    jar_path: str
    version: str = "4.13.1"
    java_path: str = "java"
    target_languages: List[str] = field(default_factory=lambda: ["Java", "Python3", "Cpp", "CSharp", "JavaScript", "Go"])
    generate_visitor: bool = True
    generate_listener: bool = True

@dataclass
class SystemConfig:
    """Main system configuration"""
    llm_config: LLMConfig
    gpu_config: GPUConfig
    search_config: SearchConfig
    antlr_config: ANTLRConfig
    output_base_directory: str
    temp_directory: str
    enable_caching: bool = True
    max_concurrent_requests: int = 3
# Data Models
@dataclass
class ParsedRequest:
    """Structured representation of user request"""
    original_prompt: str
    intent: str  # "generate_parser", "convert_bnf", "find_grammar", "custom_language"
    language_name: Optional[str] = None
    language_description: Optional[str] = None
    bnf_specification: Optional[str] = None
    ebnf_specification: Optional[str] = None
    target_language: str = "Java"
    grammar_name: Optional[str] = None
    special_requirements: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)
    confidence: float = 0.0

@dataclass
class GrammarSource:
    """Information about a grammar source"""
    content: str
    source_type: str  # "web", "built-in", "generated"
    url: Optional[str] = None
    quality_score: float = 0.0
    language: str = ""
    description: str = ""
    license: Optional[str] = None

@dataclass
class GenerationResult:
    """Complete result of parser generation process"""
    request: ParsedRequest
    grammar_file_path: str
    generated_files: Dict[str, str]  # relative_path -> absolute_path
    output_directory: str
    compilation_success: bool
    compilation_log: str
    antlr_version: str
    target_language: str
    generation_time: float
    summary: str
    next_steps: List[str]
    performance_notes: List[str]
# GPU Detection and Acceleration
class GPUManager:
    """Manages GPU detection and optimization across different vendors"""

    def __init__(self, config: GPUConfig):
        self.config = config
        self.device_info = self._detect_hardware()
        self.device = self._configure_device()

    def _detect_hardware(self) -> Dict[str, Any]:
        """Comprehensive GPU hardware detection"""
        info = {
            "type": "cpu",
            "name": "CPU",
            "memory_gb": 0,
            "compute_capability": None,
            "driver_version": None
        }
        if not self.config.enable_gpu:
            return info

        # NVIDIA CUDA Detection
        if self._check_nvidia():
            try:
                import torch
                if torch.cuda.is_available():
                    device_props = torch.cuda.get_device_properties(self.config.device_id)
                    info.update({
                        "type": "nvidia",
                        "name": device_props.name,
                        "memory_gb": device_props.total_memory / (1024**3),
                        "compute_capability": f"{device_props.major}.{device_props.minor}",
                        "driver_version": torch.version.cuda
                    })
                    logger.info(f"NVIDIA GPU detected: {info['name']}")
            except Exception as e:
                logger.warning(f"NVIDIA detection failed: {e}")
        # Apple Metal Detection
        elif self._check_apple_metal():
            try:
                import torch
                if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
                    info.update({
                        "type": "apple",
                        "name": "Apple Silicon GPU",
                        "memory_gb": 16,  # Unified memory estimation
                        "compute_capability": "Metal Performance Shaders"
                    })
                    logger.info("Apple Silicon GPU with Metal detected")
            except Exception as e:
                logger.warning(f"Apple Metal detection failed: {e}")
        # AMD ROCm Detection
        elif self._check_amd_rocm():
            info.update({
                "type": "amd",
                "name": "AMD GPU",
                "memory_gb": 8,  # Default estimation
                "compute_capability": "ROCm"
            })
            logger.info("AMD GPU with ROCm detected")
        return info

    def _check_nvidia(self) -> bool:
        """Check for NVIDIA GPU availability"""
        try:
            result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
            return result.returncode == 0
        except FileNotFoundError:
            return False

    def _check_apple_metal(self) -> bool:
        """Check for Apple Metal support"""
        try:
            import platform
            return platform.system() == "Darwin" and platform.machine() in ["arm64", "aarch64"]
        except Exception:
            return False

    def _check_amd_rocm(self) -> bool:
        """Check for AMD ROCm support"""
        try:
            result = subprocess.run(['rocm-smi'], capture_output=True, text=True)
            return result.returncode == 0
        except FileNotFoundError:
            return False

    def _configure_device(self):
        """Configure optimal device for computation"""
        if self.device_info["type"] == "cpu":
            return "cpu"
        try:
            import torch
            if self.device_info["type"] == "nvidia":
                device = torch.device(f"cuda:{self.config.device_id}")
                if self.config.memory_limit_gb:
                    fraction = self.config.memory_limit_gb / self.device_info["memory_gb"]
                    torch.cuda.set_per_process_memory_fraction(fraction, self.config.device_id)
                return device
            elif self.device_info["type"] == "apple":
                return torch.device("mps")
            elif self.device_info["type"] == "amd":
                return torch.device("cuda")  # ROCm uses CUDA-like interface
        except Exception as e:
            logger.warning(f"Device configuration failed: {e}")
        return "cpu"

    def get_device_info(self) -> Dict[str, Any]:
        """Get comprehensive device information"""
        return self.device_info.copy()
# Abstract LLM Interface
class LLMProvider(ABC):
    """Abstract base class for LLM providers"""

    @abstractmethod
    async def analyze_request(self, prompt: str) -> ParsedRequest:
        """Analyze user prompt and extract structured information"""
        pass

    @abstractmethod
    async def generate_grammar(self, request: ParsedRequest) -> str:
        """Generate ANTLR grammar based on request"""
        pass

    @abstractmethod
    async def convert_bnf_to_antlr(self, bnf_content: str, grammar_name: str) -> str:
        """Convert BNF/EBNF to ANTLR grammar"""
        pass

    @abstractmethod
    async def enhance_grammar(self, grammar: str, requirements: List[str]) -> str:
        """Enhance existing grammar with additional requirements"""
        pass

    @abstractmethod
    async def generate_summary(self, result: GenerationResult) -> str:
        """Generate comprehensive summary and documentation"""
        pass
# OpenAI Implementation
class OpenAIProvider(LLMProvider):
    """OpenAI GPT implementation"""

    def __init__(self, config: LLMConfig, gpu_manager: GPUManager):
        self.config = config
        self.gpu_manager = gpu_manager
        if not config.api_key:
            raise ValueError("OpenAI API key required")

    async def _make_request(self, messages: List[Dict], temperature: Optional[float] = None) -> str:
        """Make async request to OpenAI API"""
        headers = {
            "Authorization": f"Bearer {self.config.api_key}",
            "Content-Type": "application/json"
        }
        data = {
            "model": self.config.model_name,
            "messages": messages,
            "max_tokens": self.config.max_tokens,
            "temperature": temperature or self.config.temperature
        }
        async with aiohttp.ClientSession() as session:
            try:
                async with session.post(
                    "https://api.openai.com/v1/chat/completions",
                    headers=headers,
                    json=data,
                    timeout=aiohttp.ClientTimeout(total=self.config.timeout)
                ) as response:
                    response.raise_for_status()
                    result = await response.json()
                    return result["choices"][0]["message"]["content"]
            except Exception as e:
                logger.error(f"OpenAI API request failed: {e}")
                raise

    async def analyze_request(self, prompt: str) -> ParsedRequest:
        """Analyze user prompt using OpenAI"""
        system_message = {
            "role": "system",
            "content": """You are an expert in formal languages, parsing, and ANTLR grammar development.
Analyze user requests for parser generation and extract structured information.
Respond with JSON containing:
- intent: "generate_parser", "convert_bnf", "find_grammar", or "custom_language"
- language_name: if requesting parser for existing language (null if custom)
- language_description: detailed description of the language to parse
- bnf_specification: if BNF/EBNF is provided in the request
- target_language: programming language for generated parser (default "Java")
- grammar_name: suggested name for the grammar
- special_requirements: array of special features or requirements
- examples: array of example inputs if provided
- confidence: confidence score 0.0-1.0 for the analysis
Be thorough in extracting language_description even for known languages."""
        }
        user_message = {
            "role": "user",
            "content": f"Analyze this parser generation request:\n\n{prompt}"
        }
        response = await self._make_request([system_message, user_message], temperature=0.1)
        try:
            # Extract JSON from response
            json_match = re.search(r'\{.*\}', response, re.DOTALL)
            if json_match:
                data = json.loads(json_match.group())
                return ParsedRequest(
                    original_prompt=prompt,
                    intent=data.get("intent", "generate_parser"),
                    language_name=data.get("language_name"),
                    language_description=data.get("language_description", ""),
                    bnf_specification=data.get("bnf_specification"),
                    target_language=data.get("target_language", "Java"),
                    grammar_name=data.get("grammar_name"),
                    special_requirements=data.get("special_requirements", []),
                    examples=data.get("examples", []),
                    confidence=data.get("confidence", 0.5)
                )
        except Exception as e:
            logger.warning(f"Failed to parse LLM analysis: {e}")
        # Fallback analysis
        return self._fallback_analysis(prompt)

    def _fallback_analysis(self, prompt: str) -> ParsedRequest:
        """Fallback analysis when JSON parsing fails"""
        prompt_lower = prompt.lower()
        # Detect BNF/EBNF
        if "::=" in prompt or ("=" in prompt and "<" in prompt and ">" in prompt):
            return ParsedRequest(
                original_prompt=prompt,
                intent="convert_bnf",
                bnf_specification=prompt,
                language_description="BNF specification conversion",
                target_language="Java",
                confidence=0.8
            )
        # Detect known languages
        known_languages = {
            "json": "JSON data format",
            "xml": "XML markup language",
            "sql": "SQL database query language",
            "calculator": "arithmetic expression calculator",
            "java": "Java programming language",
            "python": "Python programming language",
            "javascript": "JavaScript programming language",
            "c++": "C++ programming language"
        }
        for lang, desc in known_languages.items():
            if lang in prompt_lower:
                return ParsedRequest(
                    original_prompt=prompt,
                    intent="find_grammar",
                    language_name=lang,
                    language_description=desc,
                    target_language="Java",
                    confidence=0.7
                )
        return ParsedRequest(
            original_prompt=prompt,
            intent="custom_language",
            language_description=prompt,
            target_language="Java",
            confidence=0.5
        )
    async def generate_grammar(self, request: ParsedRequest) -> str:
        """Generate ANTLR grammar from request"""
        system_message = {
            "role": "system",
            "content": """You are an expert ANTLR v4 grammar developer. Generate complete, production-ready ANTLR grammars.
Requirements:
- Use ANTLR v4 syntax exactly
- Include grammar declaration with appropriate name
- Define comprehensive lexer rules for all tokens
- Create well-structured parser rules with proper precedence
- Handle whitespace and comments appropriately
- Follow ANTLR naming conventions (parser rules lowercase, lexer rules uppercase)
- Include error handling considerations
- Make grammar unambiguous and efficient
Respond with only the grammar content, no explanations."""
        }
        prompt_parts = [f"Generate ANTLR v4 grammar for: {request.language_description}"]
        if request.language_name:
            prompt_parts.append(f"Language name: {request.language_name}")
        if request.grammar_name:
            prompt_parts.append(f"Grammar name: {request.grammar_name}")
        if request.special_requirements:
            prompt_parts.append(f"Special requirements: {', '.join(request.special_requirements)}")
        if request.examples:
            prompt_parts.append(f"Example inputs:\n{chr(10).join(request.examples)}")
        prompt_parts.append(f"Target language: {request.target_language}")
        user_message = {
            "role": "user",
            "content": "\n\n".join(prompt_parts)
        }
        return await self._make_request([system_message, user_message])

    async def convert_bnf_to_antlr(self, bnf_content: str, grammar_name: str) -> str:
        """Convert BNF/EBNF to ANTLR grammar"""
        system_message = {
            "role": "system",
            "content": """Convert BNF/EBNF specifications to ANTLR v4 grammar syntax.
Conversion rules:
- Replace ::= with :
- Remove angle brackets from non-terminals
- Convert | to ANTLR alternatives
- Handle optional elements [...] as (...)?
- Handle repetition {...} as (...)*
- Add appropriate lexer rules
- Ensure ANTLR v4 compatibility
- Add grammar declaration
- Include whitespace handling
Respond with only the converted grammar."""
        }
        user_message = {
            "role": "user",
            "content": f"Convert this BNF/EBNF to ANTLR v4 grammar named '{grammar_name}':\n\n{bnf_content}"
        }
        return await self._make_request([system_message, user_message])

    async def enhance_grammar(self, grammar: str, requirements: List[str]) -> str:
        """Enhance existing grammar with additional requirements"""
        system_message = {
            "role": "system",
            "content": "Enhance the given ANTLR grammar to meet additional requirements. Maintain compatibility and add features as requested."
        }
        user_message = {
            "role": "user",
            "content": f"Enhance this ANTLR grammar:\n\n{grammar}\n\nAdditional requirements:\n{chr(10).join(f'- {req}' for req in requirements)}"
        }
        return await self._make_request([system_message, user_message])

    async def generate_summary(self, result: GenerationResult) -> str:
        """Generate comprehensive summary"""
        system_message = {
            "role": "system",
            "content": "Generate clear, comprehensive documentation for ANTLR parser generation results. Include usage instructions and next steps."
        }
        user_message = {
            "role": "user",
            "content": f"""Generate summary for this ANTLR parser generation:
Original Request: {result.request.original_prompt}
Grammar File: {result.grammar_file_path}
Target Language: {result.target_language}
Compilation: {'Success' if result.compilation_success else 'Failed'}
Generated Files: {len(result.generated_files)}
Generation Time: {result.generation_time:.2f}s
Include:
- Overview of what was generated
- File structure and contents
- Integration instructions for {result.target_language}
- Usage examples
- Next development steps
- Performance considerations"""
        }
        return await self._make_request([system_message, user_message])
# Grammar Search Engine
class GrammarSearchEngine:
    """Comprehensive grammar search across multiple sources"""

    def __init__(self, config: SearchConfig):
        self.config = config
        self.cache = {}
        self.session = None

    async def search_grammar(self, language_name: str) -> Optional[GrammarSource]:
        """Search for existing grammar across all sources"""
        cache_key = language_name.lower()
        # Check cache
        if cache_key in self.cache:
            cached_time, result = self.cache[cache_key]
            if datetime.now() - cached_time < timedelta(hours=self.config.cache_duration_hours):
                return result
        # Search all configured sources
        results = []
        if self.config.enable_web_search:
            if "github" in self.config.search_engines:
                github_results = await self._search_github(language_name)
                results.extend(github_results)
            if "antlr-grammars" in self.config.search_engines:
                antlr_results = await self._search_antlr_grammars(language_name)
                results.extend(antlr_results)
        # Select best result
        if results:
            best_result = max(results, key=lambda x: x.quality_score)
            self.cache[cache_key] = (datetime.now(), best_result)
            return best_result
        return None

    async def _search_github(self, language_name: str) -> List[GrammarSource]:
        """Search GitHub for ANTLR grammars"""
        results = []
        if not self.session:
            self.session = aiohttp.ClientSession()
        try:
            # Search GitHub API
            query = f"{language_name} antlr grammar filetype:g4"
            url = f"https://api.github.com/search/code?q={quote_plus(query)}"
            headers = {}
            if self.config.github_token:
                headers["Authorization"] = f"token {self.config.github_token}"
            async with self.session.get(url, headers=headers, timeout=self.config.timeout) as response:
                if response.status == 200:
                    data = await response.json()
                    for item in data.get("items", [])[:self.config.max_results_per_source]:
                        # Fetch grammar content
                        content = await self._fetch_github_file(item["download_url"])
                        if content:
                            quality_score = self._calculate_quality_score(content, item)
                            results.append(GrammarSource(
                                content=content,
                                source_type="web",
                                url=item["html_url"],
                                quality_score=quality_score,
                                language=language_name,
                                description=f"GitHub: {item['repository']['full_name']}"
                            ))
        except Exception as e:
            logger.warning(f"GitHub search failed: {e}")
        return results

    async def _fetch_github_file(self, download_url: str) -> Optional[str]:
        """Fetch file content from GitHub"""
        try:
            async with self.session.get(download_url, timeout=self.config.timeout) as response:
                if response.status == 200:
                    return await response.text()
        except Exception as e:
            logger.warning(f"Failed to fetch GitHub file: {e}")
        return None

    async def _search_antlr_grammars(self, language_name: str) -> List[GrammarSource]:
        """Search official ANTLR grammars repository"""
        results = []
        if not self.session:
            self.session = aiohttp.ClientSession()
        try:
            # Search the official ANTLR grammars-v4 repository
            base_url = "https://api.github.com/repos/antlr/grammars-v4/contents"
            async with self.session.get(base_url, timeout=self.config.timeout) as response:
                if response.status == 200:
                    contents = await response.json()
                    # Look for matching directories
                    for item in contents:
                        if (item["type"] == "dir" and
                                language_name.lower() in item["name"].lower()):
                            grammar_content = await self._fetch_antlr_grammar_dir(item["url"])
                            if grammar_content:
                                results.append(GrammarSource(
                                    content=grammar_content,
                                    source_type="web",
                                    url=f"https://github.com/antlr/grammars-v4/tree/master/{item['name']}",
                                    quality_score=0.9,  # High quality for official grammars
                                    language=language_name,
                                    description=f"Official ANTLR grammar: {item['name']}"
                                ))
        except Exception as e:
            logger.warning(f"ANTLR grammars search failed: {e}")
        return results

    async def _fetch_antlr_grammar_dir(self, dir_url: str) -> Optional[str]:
        """Fetch grammar from ANTLR grammars directory"""
        try:
            async with self.session.get(dir_url, timeout=self.config.timeout) as response:
                if response.status == 200:
                    files = await response.json()
                    # Look for .g4 files
                    for file_info in files:
                        if file_info["name"].endswith(".g4"):
                            content = await self._fetch_github_file(file_info["download_url"])
                            if content and "grammar" in content:
                                return content
        except Exception as e:
            logger.warning(f"Failed to fetch ANTLR grammar directory: {e}")
        return None

    def _calculate_quality_score(self, content: str, metadata: Dict) -> float:
        """Calculate quality score for grammar"""
        score = 0.0
        # Basic grammar structure
        if "grammar" in content and ":" in content:
            score += 0.3
        # Lexer rules present
        if re.search(r'[A-Z_]+\s*:', content):
            score += 0.2
        # Parser rules present
        if re.search(r'[a-z_]+\s*:', content):
            score += 0.2
        # Repository stars (if available)
        if "stargazers_count" in metadata.get("repository", {}):
            stars = metadata["repository"]["stargazers_count"]
            score += min(0.2, stars / 1000)
        # Recent activity
        if "updated_at" in metadata.get("repository", {}):
            score += 0.1
        return min(1.0, score)

    async def close(self):
        """Close HTTP session"""
        if self.session:
            await self.session.close()
# ANTLR Compiler and File Manager
class ANTLRCompiler:
    """Manages ANTLR compilation and file generation"""

    def __init__(self, config: ANTLRConfig):
        self.config = config
        self._verify_antlr_installation()

    def _verify_antlr_installation(self):
        """Verify ANTLR installation and Java availability"""
        if not os.path.exists(self.config.jar_path):
            raise FileNotFoundError(f"ANTLR JAR not found: {self.config.jar_path}")
        try:
            result = subprocess.run(
                [self.config.java_path, "-version"],
                capture_output=True, text=True
            )
            if result.returncode != 0:
                raise RuntimeError("Java not available")
        except FileNotFoundError:
            raise RuntimeError("Java not found in PATH")
        logger.info(f"ANTLR {self.config.version} verified at {self.config.jar_path}")

    def create_project_directory(self, base_dir: str, grammar_name: str) -> str:
        """Create organized project directory structure"""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        project_name = f"{grammar_name}_{timestamp}"
        project_dir = os.path.join(base_dir, project_name)
        # Create directory structure
        os.makedirs(project_dir, exist_ok=True)
        os.makedirs(os.path.join(project_dir, "grammar"), exist_ok=True)
        os.makedirs(os.path.join(project_dir, "generated"), exist_ok=True)
        os.makedirs(os.path.join(project_dir, "examples"), exist_ok=True)
        os.makedirs(os.path.join(project_dir, "docs"), exist_ok=True)
        logger.info(f"Created project directory: {project_dir}")
        return project_dir

    def save_grammar(self, grammar_content: str, project_dir: str, grammar_name: str) -> str:
        """Save grammar to file with proper naming"""
        # Ensure grammar has proper name declaration
        if not grammar_content.strip().startswith("grammar"):
            grammar_content = f"grammar {grammar_name};\n\n{grammar_content}"
        grammar_file = os.path.join(project_dir, "grammar", f"{grammar_name}.g4")
        with open(grammar_file, 'w', encoding='utf-8') as f:
            f.write(grammar_content)
        logger.info(f"Grammar saved: {grammar_file}")
        return grammar_file

    async def compile_grammar(self, grammar_file: str, target_language: str, project_dir: str) -> Tuple[bool, str, Dict[str, str]]:
        """Compile ANTLR grammar and return results"""
        output_dir = os.path.join(project_dir, "generated")
        # Build ANTLR command
        cmd = [
            self.config.java_path,
            "-jar", self.config.jar_path,
            "-Dlanguage=" + target_language,
            "-o", output_dir
        ]
        if self.config.generate_visitor:
            cmd.append("-visitor")
        if self.config.generate_listener:
            cmd.append("-listener")
        cmd.append(grammar_file)
        logger.info(f"Compiling grammar with command: {' '.join(cmd)}")
        try:
            # Run ANTLR compilation
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=120,
                cwd=project_dir
            )
            compilation_log = f"STDOUT:\n{result.stdout}\n\nSTDERR:\n{result.stderr}"
            success = result.returncode == 0
            # Collect generated files
            generated_files = {}
            if success:
                for root, dirs, files in os.walk(output_dir):
                    for file in files:
                        if file.endswith(('.java', '.py', '.cpp', '.cs', '.js', '.go')):
                            full_path = os.path.join(root, file)
                            rel_path = os.path.relpath(full_path, project_dir)
                            generated_files[rel_path] = full_path
            logger.info(f"Compilation {'succeeded' if success else 'failed'}")
            return success, compilation_log, generated_files
        except subprocess.TimeoutExpired:
            error_msg = "ANTLR compilation timed out"
            logger.error(error_msg)
            return False, error_msg, {}
        except Exception as e:
            error_msg = f"ANTLR compilation failed: {e}"
            logger.error(error_msg)
            return False, error_msg, {}

    def generate_build_files(self, project_dir: str, target_language: str, grammar_name: str):
        """Generate build files and integration examples"""
        if target_language == "Java":
            self._generate_java_build_files(project_dir, grammar_name)
        elif target_language == "Python3":
            self._generate_python_build_files(project_dir, grammar_name)
        elif target_language == "Cpp":
            self._generate_cpp_build_files(project_dir, grammar_name)
    def _generate_java_build_files(self, project_dir: str, grammar_name: str):
        """Generate Java build files and examples"""
        # Maven pom.xml
        pom_content = f"""<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>{grammar_name.lower()}-parser</artifactId>
    <version>1.0.0</version>
    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <antlr.version>{self.config.version}</antlr.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.antlr</groupId>
            <artifactId>antlr4-runtime</artifactId>
            <version>${{antlr.version}}</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.antlr</groupId>
                <artifactId>antlr4-maven-plugin</artifactId>
                <version>${{antlr.version}}</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>antlr4</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>"""
        with open(os.path.join(project_dir, "pom.xml"), 'w') as f:
            f.write(pom_content)
        # Example Java usage
        example_content = f"""import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class {grammar_name}Example {{
    public static void main(String[] args) throws Exception {{
        // Create input stream (from string, file, or stdin)
        String input = "your input here";
        CharStream inputStream = CharStreams.fromString(input);
        // Create lexer
        {grammar_name}Lexer lexer = new {grammar_name}Lexer(inputStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        // Create parser
        {grammar_name}Parser parser = new {grammar_name}Parser(tokens);
        // Parse starting from the root rule (adjust as needed)
        ParseTree tree = parser.startRule(); // Replace 'startRule' with your actual start rule
        // Print parse tree
        System.out.println(tree.toStringTree(parser));
        // Use visitor or listener for tree processing
        // {grammar_name}BaseVisitor visitor = new {grammar_name}BaseVisitor();
        // visitor.visit(tree);
    }}
}}"""
        with open(os.path.join(project_dir, "examples", f"{grammar_name}Example.java"), 'w') as f:
            f.write(example_content)
    def _generate_python_build_files(self, project_dir: str, grammar_name: str):
        """Generate Python build files and examples"""
        # requirements.txt
        requirements = f"antlr4-python3-runtime=={self.config.version}\n"
        with open(os.path.join(project_dir, "requirements.txt"), 'w') as f:
            f.write(requirements)
        # Example Python usage
        example_content = f"""from antlr4 import *
from generated.{grammar_name}Lexer import {grammar_name}Lexer
from generated.{grammar_name}Parser import {grammar_name}Parser

def main():
    # Create input stream
    input_text = "your input here"
    input_stream = InputStream(input_text)
    # Create lexer
    lexer = {grammar_name}Lexer(input_stream)
    token_stream = CommonTokenStream(lexer)
    # Create parser
    parser = {grammar_name}Parser(token_stream)
    # Parse starting from root rule (adjust as needed)
    tree = parser.startRule()  # Replace 'startRule' with your actual start rule
    # Print parse tree
    print(tree.toStringTree(recog=parser))
    # Use visitor or listener for tree processing
    # visitor = {grammar_name}Visitor()
    # visitor.visit(tree)

if __name__ == '__main__':
    main()
"""
        with open(os.path.join(project_dir, "examples", f"{grammar_name.lower()}_example.py"), 'w') as f:
            f.write(example_content)
    def _generate_cpp_build_files(self, project_dir: str, grammar_name: str):
        """Generate C++ build files and examples"""
        # CMakeLists.txt
        cmake_content = f"""cmake_minimum_required(VERSION 3.10)
project({grammar_name}Parser)

set(CMAKE_CXX_STANDARD 17)

# Find ANTLR runtime
find_package(PkgConfig REQUIRED)
pkg_check_modules(ANTLR4 REQUIRED antlr4-runtime)

# Include directories
include_directories(${{ANTLR4_INCLUDE_DIRS}})
include_directories(generated)

# Source files
file(GLOB GENERATED_SOURCES "generated/*.cpp")
set(SOURCES
    examples/{grammar_name.lower()}_example.cpp
    ${{GENERATED_SOURCES}}
)

# Executable
add_executable({grammar_name.lower()}_parser ${{SOURCES}})

# Link libraries
target_link_libraries({grammar_name.lower()}_parser ${{ANTLR4_LIBRARIES}})
"""
        with open(os.path.join(project_dir, "CMakeLists.txt"), 'w') as f:
            f.write(cmake_content)
        # Example C++ usage
        example_content = f"""#include <iostream>
#include <fstream>
#include "antlr4-runtime.h"
#include "{grammar_name}Lexer.h"
#include "{grammar_name}Parser.h"

using namespace antlr4;

int main(int argc, char* argv[]) {{
    // Create input stream
    std::string input = "your input here";
    ANTLRInputStream inputStream(input);
    // Create lexer
    {grammar_name}Lexer lexer(&inputStream);
    CommonTokenStream tokens(&lexer);
    // Create parser
    {grammar_name}Parser parser(&tokens);
    // Parse starting from root rule (adjust as needed)
    tree::ParseTree* tree = parser.startRule(); // Replace 'startRule' with your actual start rule
    // Print parse tree
    std::cout << tree->toStringTree(&parser) << std::endl;
    return 0;
}}
"""
        with open(os.path.join(project_dir, "examples", f"{grammar_name.lower()}_example.cpp"), 'w') as f:
            f.write(example_content)
# Main LLM Agent
class ANTLRGeneratorAgent:
    """Main LLM Agent for ANTLR parser generation"""

    def __init__(self, config: SystemConfig):
        self.config = config
        self.gpu_manager = GPUManager(config.gpu_config)
        self.llm_provider = self._create_llm_provider()
        self.search_engine = GrammarSearchEngine(config.search_config)
        self.compiler = ANTLRCompiler(config.antlr_config)

        # Ensure output directory exists
        os.makedirs(config.output_base_directory, exist_ok=True)
        logger.info("ANTLR Generator Agent initialized")

    def _create_llm_provider(self) -> LLMProvider:
        """Create appropriate LLM provider based on configuration"""
        if self.config.llm_config.provider == "openai":
            return OpenAIProvider(self.config.llm_config, self.gpu_manager)
        else:
            raise ValueError(f"Unsupported LLM provider: {self.config.llm_config.provider}")
    async def generate_parser(self, user_prompt: str) -> GenerationResult:
        """Main method to generate parser from user prompt"""
        start_time = datetime.now()
        logger.info(f"Processing user prompt: {user_prompt[:100]}...")

        try:
            # Step 1: Analyze user request
            request = await self.llm_provider.analyze_request(user_prompt)
            logger.info(f"Request analysis: {request.intent} (confidence: {request.confidence})")

            # Step 2: Determine grammar source strategy
            grammar_content = await self._obtain_grammar(request)

            # Step 3: Create project directory
            grammar_name = request.grammar_name or self._generate_grammar_name(request)
            project_dir = self.compiler.create_project_directory(
                self.config.output_base_directory,
                grammar_name
            )

            # Step 4: Save grammar file
            grammar_file = self.compiler.save_grammar(grammar_content, project_dir, grammar_name)

            # Step 5: Compile grammar
            success, compilation_log, generated_files = await self.compiler.compile_grammar(
                grammar_file,
                request.target_language,
                project_dir
            )

            # Step 6: Generate build files and examples
            if success:
                self.compiler.generate_build_files(project_dir, request.target_language, grammar_name)

            # Step 7: Create result object
            generation_time = (datetime.now() - start_time).total_seconds()
            result = GenerationResult(
                request=request,
                grammar_file_path=grammar_file,
                generated_files=generated_files,
                output_directory=project_dir,
                compilation_success=success,
                compilation_log=compilation_log,
                antlr_version=self.config.antlr_config.version,
                target_language=request.target_language,
                generation_time=generation_time,
                summary="",
                next_steps=[],
                performance_notes=[]
            )

            # Step 8: Generate summary and documentation
            result.summary = await self.llm_provider.generate_summary(result)
            result.next_steps = self._generate_next_steps(result)
            result.performance_notes = self._generate_performance_notes(result)

            # Step 9: Save documentation
            await self._save_documentation(result)

            logger.info(f"Parser generation completed in {generation_time:.2f}s")
            return result

        except Exception as e:
            logger.error(f"Parser generation failed: {e}")
            raise
    async def _obtain_grammar(self, request: ParsedRequest) -> str:
        """Obtain grammar content based on request type"""
        if request.intent == "convert_bnf" and request.bnf_specification:
            logger.info("Converting BNF specification to ANTLR")
            grammar_name = request.grammar_name or "GeneratedGrammar"
            return await self.llm_provider.convert_bnf_to_antlr(
                request.bnf_specification,
                grammar_name
            )
        elif request.intent == "find_grammar" and request.language_name:
            logger.info(f"Searching for existing grammar: {request.language_name}")

            # Try to find existing grammar
            existing_grammar = await self.search_engine.search_grammar(request.language_name)
            if existing_grammar and existing_grammar.quality_score > 0.7:
                logger.info(f"Found high-quality grammar from {existing_grammar.source_type}")

                # Enhance if special requirements exist
                if request.special_requirements:
                    return await self.llm_provider.enhance_grammar(
                        existing_grammar.content,
                        request.special_requirements
                    )
                return existing_grammar.content
            else:
                logger.info("No suitable existing grammar found, generating new one")
                return await self.llm_provider.generate_grammar(request)
        else:
            logger.info("Generating custom grammar from description")
            return await self.llm_provider.generate_grammar(request)

    def _generate_grammar_name(self, request: ParsedRequest) -> str:
        """Generate appropriate grammar name"""
        if request.language_name:
            return request.language_name.capitalize()

        # Extract name from description
        words = re.findall(r'\b[a-zA-Z]+\b', request.language_description)
        if words:
            return ''.join(word.capitalize() for word in words[:2])
        return "CustomGrammar"
    def _generate_next_steps(self, result: GenerationResult) -> List[str]:
        """Generate next steps for the user"""
        steps = []
        if result.compilation_success:
            steps.extend([
                f"Navigate to the project directory: {result.output_directory}",
                f"Review the generated grammar file: {os.path.basename(result.grammar_file_path)}",
                "Examine the generated parser files in the 'generated' directory",
                "Run the example code in the 'examples' directory",
                "Customize the grammar for your specific needs",
                "Add semantic actions or tree processing logic",
                "Create comprehensive test cases",
                "Integrate the parser into your application"
            ])

            if result.target_language == "Java":
                steps.append("Build the project using Maven: mvn compile")
            elif result.target_language == "Python3":
                steps.append("Install dependencies: pip install -r requirements.txt")
            elif result.target_language == "Cpp":
                steps.append("Build using CMake: mkdir build && cd build && cmake .. && make")
        else:
            steps.extend([
                "Review the compilation errors in the log",
                "Fix grammar syntax issues",
                "Re-run the ANTLR compilation",
                "Consider simplifying the grammar structure"
            ])
        return steps

    def _generate_performance_notes(self, result: GenerationResult) -> List[str]:
        """Generate performance optimization notes"""
        notes = []
        if result.compilation_success:
            notes.extend([
                "Consider left-factoring rules to reduce ambiguity",
                "Use lexer modes for context-sensitive tokenization",
                "Implement error recovery strategies for production use",
                "Profile parser performance with large inputs",
                "Consider using SLL prediction mode for better performance"
            ])

        if result.generation_time > 30:
            notes.append("Consider using a more powerful GPU for faster LLM processing")
        return notes
    async def _save_documentation(self, result: GenerationResult):
        """Save comprehensive documentation"""
        docs_dir = os.path.join(result.output_directory, "docs")
        os.makedirs(docs_dir, exist_ok=True)  # Ensure the docs directory exists before writing

        # Save summary
        with open(os.path.join(docs_dir, "README.md"), 'w') as f:
            f.write(f"# {os.path.basename(result.output_directory)}\n\n")
            f.write(result.summary)
            f.write("\n\n## Next Steps\n\n")
            for i, step in enumerate(result.next_steps, 1):
                f.write(f"{i}. {step}\n")
            f.write("\n\n## Performance Notes\n\n")
            for note in result.performance_notes:
                f.write(f"- {note}\n")

        # Save generation metadata
        metadata = {
            "generation_time": result.generation_time,
            "antlr_version": result.antlr_version,
            "target_language": result.target_language,
            "compilation_success": result.compilation_success,
            "original_prompt": result.request.original_prompt,
            "gpu_info": self.gpu_manager.get_device_info()
        }
        with open(os.path.join(docs_dir, "metadata.json"), 'w') as f:
            json.dump(metadata, f, indent=2, default=str)

    async def close(self):
        """Clean up resources"""
        await self.search_engine.close()
# CLI Interface
async def main():
    """Main CLI interface for the ANTLR Generator Agent"""
    import argparse

    parser = argparse.ArgumentParser(description="LLM-Powered ANTLR Parser Generator")
    parser.add_argument("prompt", help="User prompt for parser generation")
    parser.add_argument("--config", help="Configuration file path")
    parser.add_argument("--output-dir", help="Output directory", default="./generated_parsers")
    parser.add_argument("--target-lang", help="Target language", default="Java")
    parser.add_argument("--antlr-jar", help="Path to ANTLR JAR file", required=True)
    parser.add_argument("--openai-key", help="OpenAI API key")
    args = parser.parse_args()

    # Create configuration
    config = SystemConfig(
        llm_config=LLMConfig(
            provider="openai",
            model_name="gpt-4",
            api_key=args.openai_key or os.getenv("OPENAI_API_KEY")
        ),
        gpu_config=GPUConfig(enable_gpu=True),
        search_config=SearchConfig(enable_web_search=True),
        antlr_config=ANTLRConfig(jar_path=args.antlr_jar),
        output_base_directory=args.output_dir,
        temp_directory=tempfile.mkdtemp()
    )

    # Create and run agent
    agent = ANTLRGeneratorAgent(config)
    try:
        result = await agent.generate_parser(args.prompt)

        print(f"\n{'='*60}")
        print("ANTLR PARSER GENERATION COMPLETED")
        print(f"{'='*60}")
        print(f"Project Directory: {result.output_directory}")
        print(f"Compilation: {'SUCCESS' if result.compilation_success else 'FAILED'}")
        print(f"Generation Time: {result.generation_time:.2f} seconds")
        print(f"Generated Files: {len(result.generated_files)}")

        if result.compilation_success:
            print("\nGenerated Files:")
            for rel_path in result.generated_files.keys():
                print(f"  - {rel_path}")
            print("\nNext Steps:")
            for i, step in enumerate(result.next_steps[:5], 1):
                print(f"  {i}. {step}")
            print(f"\nFull documentation available in: {result.output_directory}/docs/")

    except Exception as e:
        logger.error(f"Generation failed: {e}")
        sys.exit(1)
    finally:
        await agent.close()


if __name__ == "__main__":
    asyncio.run(main())
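Two of the agent's pure helpers can be sanity-checked in isolation. The sketch below uses hypothetical standalone mirrors (`derive_grammar_name` and `route_grammar_strategy` are not part of the system; they reproduce the logic of `_generate_grammar_name` and `_obtain_grammar`'s routing so it can run without the full agent):

```python
import re

def derive_grammar_name(language_name, description):
    # Mirrors _generate_grammar_name: prefer the explicit language name,
    # otherwise CamelCase the first two words of the description.
    if language_name:
        return language_name.capitalize()
    words = re.findall(r'\b[a-zA-Z]+\b', description)
    if words:
        return ''.join(word.capitalize() for word in words[:2])
    return "CustomGrammar"

def route_grammar_strategy(intent, bnf_specification=None, language_name=None):
    # Mirrors _obtain_grammar's three-way routing: BNF conversion,
    # search for an existing grammar, or fresh generation.
    if intent == "convert_bnf" and bnf_specification:
        return "convert_bnf_to_antlr"
    if intent == "find_grammar" and language_name:
        return "search_then_generate"
    return "generate_grammar"

print(derive_grammar_name("json", ""))                       # Json
print(derive_grammar_name(None, "config file language"))     # ConfigFile
print(route_grammar_strategy("convert_bnf", "<e> ::= <t>"))  # convert_bnf_to_antlr
print(route_grammar_strategy("create_parser"))               # generate_grammar
```

Testing the helpers this way before wiring them into the async pipeline keeps LLM calls out of the unit-test loop.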
Usage Examples
Example 1: Generate calculator parser
>> python antlr_agent.py "Create a parser for arithmetic expressions with variables, functions, and parentheses" --antlr-jar /path/to/antlr.jar --openai-key your_key
Example 2: Convert BNF to ANTLR
>> python antlr_agent.py "Convert this BNF: <expr> ::= <term> | <expr> '+' <term>" --antlr-jar /path/to/antlr.jar --target-lang Python3
Example 3: Generate JSON parser with extensions
>> python antlr_agent.py "Generate a JSON parser that also supports comments and trailing commas" --antlr-jar /path/to/antlr.jar
Example 4: Custom domain-specific language
>> python antlr_agent.py "Create a parser for a configuration language with sections, key-value pairs, and lists" --antlr-jar /path/to/antlr.jar --target-lang Cpp
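For Example 2, the BNF conversion step would produce an ANTLR grammar along these lines (illustrative only; the actual LLM output may differ, and the `term` and token rules are assumptions, since the BNF fragment never defines `<term>`). Note that ANTLR v4 accepts the direct left recursion as written, so the rule does not need to be rewritten into a repetition:

grammar GeneratedGrammar;

expr : expr '+' term   // direct left recursion is legal in ANTLR v4
     | term
     ;
term : INT ;           // assumed: the BNF fragment leaves <term> undefined
INT  : [0-9]+ ;
WS   : [ \t\r\n]+ -> skip ;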
This implementation provides a general-purpose LLM agent that accepts open-ended user prompts and generates complete ANTLR v4 parsers. Because the system is not constrained to a fixed set of templates, it can process diverse parsing requirements while leveraging GPU acceleration and web search capabilities.