Sunday, February 01, 2026

BUILDING BETTER SOFTWARE WITH ADVERSARIAL MULTI-AGENT LLM SYSTEMS



Introduction: The Challenge of AI-Assisted Software Development

Software developers today face a paradox. Large Language Models can generate code at remarkable speed, yet this code often contains subtle bugs, uses deprecated libraries, ignores architectural best practices, or simply forgets requirements mentioned earlier in the conversation. A single LLM acting as a code generator behaves like a brilliant but inexperienced junior developer who codes fast but lacks the wisdom to question their own decisions.

The fundamental problem is that generative AI models are optimized to produce plausible output, not necessarily correct output. They hallucinate function signatures that don't exist, confidently recommend libraries that were deprecated years ago, and create architectures that violate basic design principles. When you ask an LLM to generate a complete application, it might produce something that compiles but fails to meet half the requirements you specified three prompts ago.

This tutorial presents a different approach inspired by Generative Adversarial Networks. Instead of relying on a single LLM to be simultaneously creative and critical, we build a system of specialized agents where creator agents generate artifacts while critique agents actively try to find flaws. The creator agent proposes an architecture; the critique agent challenges every decision. The code generator writes an implementation; the code reviewer searches for bugs, security issues, and violations of clean code principles. This adversarial relationship forces higher quality output through iterative refinement.

The developer remains in control throughout this process, but instead of manually reviewing every line of generated code, they orchestrate a team of AI agents that debate, challenge, and improve each other's work. The system maintains transparency by generating documentation, Architecture Decision Records, and detailed explanations of why certain choices were made. Let us explore how to build such a system.

Theoretical Foundation: Adversarial Agents for Code Quality

Generative Adversarial Networks revolutionized machine learning by pitting two neural networks against each other. A generator creates synthetic data while a discriminator tries to distinguish real from fake. Through this competition, both networks improve until the generator produces highly realistic output. We apply this same principle to software development.

In our system, creator agents generate software artifacts while critique agents evaluate them. The critique agents are not simple validators that check syntax or run linters. They are intelligent adversaries that actively search for logical flaws, missing requirements, poor design choices, and potential bugs. They ask questions like "Why did you choose this architecture pattern?" and "What happens if this API endpoint receives malformed data?"

Consider a traditional workflow where a developer asks an LLM to generate a REST API. The LLM produces code that looks reasonable at first glance. But it might use an outdated version of a web framework, forget to implement authentication, or create a database schema that cannot scale. A human reviewer would catch these issues, but reviewing AI-generated code is tedious and error-prone.

Now imagine an alternative workflow. The developer specifies requirements. An architecture agent proposes a high-level design. An architecture critique agent reviews this design, checking whether it satisfies non-functional requirements, follows established patterns, and anticipates future changes. The two agents iterate until the critique agent approves the architecture. Only then does a code generator agent implement the design, followed by a code review agent that examines the implementation for bugs and style violations. Finally, a test generator creates comprehensive tests while a test critique agent ensures the tests actually validate the requirements.

This multi-agent approach addresses the core weaknesses of single-LLM code generation. Specialized agents become experts in their domains. The architecture agent focuses solely on system design and can be fine-tuned on architecture documentation and design patterns. The code review agent specializes in finding bugs and can be trained on datasets of code vulnerabilities. The critique agents act as mentors, guiding the creator agents toward better solutions.

System Architecture Overview: Creator and Critique Agent Pairs

Our multi-agent system organizes agents into creator-critique pairs that work together through iterative refinement. Each pair focuses on a specific aspect of software development, from initial requirements analysis to final test validation.

The system begins with a prompt enhancement layer. When a developer writes a prompt describing what they want to build, a prompt critique agent analyzes it for ambiguity, missing details, and unclear requirements. It suggests improvements before any code generation begins. This prevents the common problem where an LLM builds the wrong thing because the initial prompt was vague.

Once requirements are clear, the architecture creator agent designs the system structure. It selects appropriate patterns, defines component boundaries, and makes technology choices. The architecture critique agent then challenges these decisions. It asks whether the proposed architecture handles the expected load, whether it allows for future extensibility, whether it introduces unnecessary complexity. The two agents debate until they reach consensus.

After architecture approval, the code generator agent implements the design. This agent focuses purely on translating architectural specifications into working code. It does not make architectural decisions because those were already settled. The code review agent examines this implementation, searching for bugs, security vulnerabilities, performance issues, and violations of coding standards. It also verifies that the implementation matches the approved architecture.

Parallel to code generation, the test creator agent develops comprehensive test suites. It generates unit tests, integration tests, and end-to-end tests based on the requirements and architecture. The test critique agent reviews these tests to ensure they actually validate the requirements rather than just achieving high code coverage with meaningless assertions.

Throughout this process, all agents generate documentation. The architecture agents produce Architecture Decision Records explaining why certain choices were made. The code agents generate inline documentation and API references. The test agents document test strategies and coverage reports. This documentation ensures that human developers understand what the system built and why.

Let us examine how to implement this infrastructure, starting with the foundation that allows us to work with different LLMs across various hardware platforms.

LLM Infrastructure: Supporting Local and Remote Models

Building a multi-agent system requires flexibility in how we access LLMs. Some agents might use powerful remote models like GPT-4 or Claude for complex reasoning tasks, while others might use smaller local models for simple review tasks. Supporting both local and remote LLMs, along with different GPU backends, provides this flexibility.

The infrastructure layer abstracts away the differences between LLM providers and hardware platforms. A unified interface allows agents to request completions without knowing whether they are talking to a remote API or a local model running on CUDA, Apple MLX, or Vulkan.

Here is a base abstraction for LLM providers that handles both local and remote models:

# llm_provider.py
from abc import ABC, abstractmethod
from typing import Dict, List, Optional
from dataclasses import dataclass

@dataclass
class LLMConfig:
    """Configuration for an LLM instance"""
    model_name: str
    temperature: float = 0.7
    max_tokens: int = 2048
    provider_type: str = "remote"  # "remote", "local_cuda", "local_mlx", "local_vulkan"
    api_key: Optional[str] = None
    model_path: Optional[str] = None
    
class LLMProvider(ABC):
    """Abstract base class for all LLM providers"""
    
    def __init__(self, config: LLMConfig):
        self.config = config
        
    @abstractmethod
    def generate(self, prompt: str, system_message: Optional[str] = None) -> str:
        """Generate a completion from the LLM"""
        pass
        
    @abstractmethod
    def generate_structured(self, prompt: str, schema: Dict) -> Dict:
        """Generate structured output matching a schema"""
        pass

This abstraction defines what every LLM provider must implement. The generate method handles basic text completion while generate_structured supports structured outputs like JSON, which is crucial for agents that need to return formatted data rather than free text.

For remote providers, we implement a wrapper around API calls. This example shows integration with OpenAI's API, but the same pattern works for Anthropic, Google, or any other provider:

# remote_provider.py
from typing import Dict, Optional
import json

from openai import OpenAI

from llm_provider import LLMConfig, LLMProvider

class RemoteLLMProvider(LLMProvider):
    """Provider for remote LLM APIs like OpenAI, Anthropic, etc."""
    
    def __init__(self, config: LLMConfig):
        super().__init__(config)
        # The current OpenAI SDK (>= 1.0) uses a client object instead of module-level calls
        self.client = OpenAI(api_key=config.api_key) if config.api_key else OpenAI()
            
    def generate(self, prompt: str, system_message: Optional[str] = None) -> str:
        """Call remote API for text generation"""
        messages = []
        if system_message:
            messages.append({"role": "system", "content": system_message})
        messages.append({"role": "user", "content": prompt})
        
        response = self.client.chat.completions.create(
            model=self.config.model_name,
            messages=messages,
            temperature=self.config.temperature,
            max_tokens=self.config.max_tokens
        )
        return response.choices[0].message.content
        
    def generate_structured(self, prompt: str, schema: Dict) -> Dict:
        """Generate JSON output matching schema"""
        system_msg = f"You must respond with valid JSON matching this schema: {json.dumps(schema)}"
        response = self.generate(prompt, system_msg)
        # Parse and validate JSON response
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            # Attempt to extract JSON from markdown code blocks
            if "```json" in response:
                json_str = response.split("```json")[1].split("```")[0].strip()
                return json.loads(json_str)
            raise

The remote provider handles API authentication, message formatting, and response parsing. The generate_structured method is particularly important because many agents need to return data in specific formats. For example, an architecture critique agent might return a structured list of issues found, each with a severity level and suggested fix.
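
To make the structured path concrete, here is a short usage sketch. The API key is a placeholder and the schema is purely illustrative; it simply shows how a critique-style agent could ask the remote provider for a list of issues:

# Usage sketch for the remote provider's structured output path.
# The API key and the issue schema below are placeholders for illustration.
config = LLMConfig(model_name="gpt-4", provider_type="remote", api_key="your-api-key")
provider = RemoteLLMProvider(config)

issue_schema = {
    "issues": [
        {"description": "string", "severity": "string", "suggested_fix": "string"}
    ]
}

review = provider.generate_structured(
    "Review this decision: session state is kept in memory on each web server "
    "behind a round-robin load balancer.",
    issue_schema
)
for issue in review.get("issues", []):
    print(f"[{issue['severity']}] {issue['description']} -> {issue['suggested_fix']}")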

Local LLM providers require more complexity because they must initialize models on specific hardware. Here is an implementation for CUDA-based local inference:

# local_cuda_provider.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import Dict, Optional
import json

from llm_provider import LLMConfig, LLMProvider

class LocalCUDAProvider(LLMProvider):
    """Provider for local LLMs running on NVIDIA GPUs via CUDA"""
    
    def __init__(self, config: LLMConfig):
        super().__init__(config)
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA not available but LocalCUDAProvider requested")
            
        # Load model and tokenizer from local path or HuggingFace
        self.tokenizer = AutoTokenizer.from_pretrained(config.model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            config.model_path,
            torch_dtype=torch.float16,  # Use half precision for efficiency
            device_map="auto"  # Automatically distribute across GPUs
        )
        self.device = "cuda"
        
    def generate(self, prompt: str, system_message: Optional[str] = None) -> str:
        """Generate text using local CUDA model"""
        # Format prompt with system message if provided
        full_prompt = prompt
        if system_message:
            full_prompt = f"{system_message}\n\n{prompt}"
            
        # Tokenize and generate
        inputs = self.tokenizer(full_prompt, return_tensors="pt").to(self.device)
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=self.config.max_tokens,
            temperature=self.config.temperature,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id
        )
        
        # Decode and return only the generated portion
        generated = self.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], 
                                         skip_special_tokens=True)
        return generated.strip()
        
    def generate_structured(self, prompt: str, schema: Dict) -> Dict:
        """Generate structured output - uses text generation with parsing"""
        instruction = f"Respond with valid JSON matching this schema: {json.dumps(schema)}"
        response = self.generate(f"{instruction}\n\n{prompt}")
        # Parse JSON from response
        return json.loads(response)

The CUDA provider loads models using the Transformers library and handles device placement automatically. The device_map="auto" setting spreads a large model across multiple GPUs when they are available, and loading weights in half precision roughly halves memory usage. Together these make it practical to run capable models on local hardware.

For Apple Silicon users, we need a different backend using MLX, Apple's machine learning framework optimized for their unified memory architecture:

# local_mlx_provider.py
from typing import Dict, Optional
import json

try:
    from mlx_lm import load, generate
except ImportError:
    raise ImportError("MLX not available - install the mlx and mlx-lm packages")

from llm_provider import LLMConfig, LLMProvider
    
class LocalMLXProvider(LLMProvider):
    """Provider for local LLMs on Apple Silicon using MLX"""
    
    def __init__(self, config: LLMConfig):
        super().__init__(config)
        # Load model optimized for Apple Silicon
        self.model, self.tokenizer = load(config.model_path)
        
    def generate(self, prompt: str, system_message: Optional[str] = None) -> str:
        """Generate using MLX-optimized inference"""
        full_prompt = prompt
        if system_message:
            full_prompt = f"{system_message}\n\n{prompt}"
            
        # MLX generate function handles tokenization and generation
        response = generate(
            self.model,
            self.tokenizer,
            prompt=full_prompt,
            max_tokens=self.config.max_tokens,
            temp=self.config.temperature
        )
        return response
        
    def generate_structured(self, prompt: str, schema: Dict) -> Dict:
        """Generate structured JSON output"""
        instruction = f"Return only valid JSON matching: {json.dumps(schema)}"
        response = self.generate(f"{instruction}\n\n{prompt}")
        return json.loads(response)

The MLX provider uses Apple's framework, which is optimized for the unified memory architecture of M-series chips. Because the CPU and GPU share a single memory pool, surprisingly large models can run on a laptop without a discrete GPU.

For maximum compatibility across different platforms, we can also support Vulkan-based inference which works on a wide range of GPUs from different vendors:

# local_vulkan_provider.py
from typing import Dict, Optional
import json

# Using llama.cpp with Vulkan backend as example
try:
    from llama_cpp import Llama
except ImportError:
    raise ImportError("llama-cpp-python not installed with Vulkan support")

from llm_provider import LLMConfig, LLMProvider
    
class LocalVulkanProvider(LLMProvider):
    """Provider for local LLMs using Vulkan backend (cross-platform GPU)"""
    
    def __init__(self, config: LLMConfig):
        super().__init__(config)
        # Initialize llama.cpp with Vulkan backend
        self.model = Llama(
            model_path=config.model_path,
            n_gpu_layers=-1,  # Offload all layers to GPU
            n_ctx=4096,  # Context window size
            verbose=False
        )
        
    def generate(self, prompt: str, system_message: Optional[str] = None) -> str:
        """Generate using Vulkan-accelerated inference"""
        full_prompt = prompt
        if system_message:
            full_prompt = f"{system_message}\n\n{prompt}"
            
        output = self.model(
            full_prompt,
            max_tokens=self.config.max_tokens,
            temperature=self.config.temperature,
            stop=["</s>", "\n\n\n"]  # Common stop sequences
        )
        return output['choices'][0]['text'].strip()
        
    def generate_structured(self, prompt: str, schema: Dict) -> Dict:
        """Generate structured output"""
        instruction = f"Output valid JSON only: {json.dumps(schema)}"
        response = self.generate(f"{instruction}\n\n{prompt}")
        return json.loads(response)

With these provider implementations, we can create a factory that instantiates the appropriate provider based on configuration:

# llm_factory.py
from typing import Dict

from llm_provider import LLMConfig, LLMProvider
from remote_provider import RemoteLLMProvider

class LLMFactory:
    """Factory for creating LLM providers based on configuration"""
    
    @staticmethod
    def create_provider(config: LLMConfig) -> LLMProvider:
        """Create appropriate provider based on config"""
        if config.provider_type == "remote":
            return RemoteLLMProvider(config)
        elif config.provider_type == "local_cuda":
            # Local backends are imported lazily so that missing optional
            # dependencies (torch, mlx, llama-cpp-python) do not break the factory
            from local_cuda_provider import LocalCUDAProvider
            return LocalCUDAProvider(config)
        elif config.provider_type == "local_mlx":
            from local_mlx_provider import LocalMLXProvider
            return LocalMLXProvider(config)
        elif config.provider_type == "local_vulkan":
            from local_vulkan_provider import LocalVulkanProvider
            return LocalVulkanProvider(config)
        else:
            raise ValueError(f"Unknown provider type: {config.provider_type}")
            
    @staticmethod
    def create_from_dict(config_dict: Dict) -> LLMProvider:
        """Create provider from dictionary configuration"""
        config = LLMConfig(**config_dict)
        return LLMFactory.create_provider(config)

This factory pattern allows the rest of the system to remain agnostic about which LLM backend is being used. An agent simply requests an LLM provider and uses it without knowing whether it is talking to GPT-4 in the cloud or a local Llama model running on a GPU.

The infrastructure layer provides the foundation for our multi-agent system. Agents can be configured to use different models based on their needs. Complex reasoning tasks might use powerful remote models while simpler review tasks use fast local models. This flexibility is essential for building a practical system that balances capability with cost and latency.
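
As a concrete sketch of this flexibility, the factory can hand a remote model to a creator agent and a smaller local model to its critique counterpart. The model names and the local model path below are illustrative placeholders, not recommendations:

# Sketch: different backends for different agent roles via the factory.
# Model names and the local model path are illustrative placeholders.
creator_llm = LLMFactory.create_from_dict({
    "model_name": "gpt-4",
    "provider_type": "remote",
    "api_key": "your-api-key",
    "temperature": 0.7
})

critique_llm = LLMFactory.create_from_dict({
    "model_name": "local-code-reviewer",
    "provider_type": "local_cuda",
    "model_path": "/models/example-7b-instruct",  # hypothetical local path
    "temperature": 0.2  # lower temperature for more consistent reviews
})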

Prompt Enhancement: The First Line of Defense

Before any code generation begins, we must ensure that the developer's intent is clearly captured. Ambiguous or incomplete prompts lead to wasted effort as agents build the wrong thing. The prompt enhancement layer addresses this by analyzing initial prompts and suggesting improvements.

The prompt critique agent examines a developer's request for missing information, unclear requirements, and potential ambiguities. It does not simply accept vague requests like "build a web app" but instead asks clarifying questions about authentication, data storage, expected load, and deployment environment.

Consider the interaction flow. A developer submits an initial prompt describing their desired application. The prompt critique agent analyzes this prompt and generates a structured critique identifying gaps and ambiguities. A prompt enhancement agent then suggests an improved version of the prompt that addresses these issues. The developer reviews these suggestions and either accepts them or provides additional clarification. This iterative refinement continues until the prompt clearly specifies what needs to be built.

Here is the implementation of a prompt critique agent:

# prompt_critique_agent.py
from dataclasses import dataclass
from typing import List, Optional
import json

from llm_provider import LLMProvider

@dataclass
class PromptIssue:
    """Represents an issue found in a prompt"""
    category: str  # "ambiguity", "missing_requirement", "unclear_scope", etc.
    description: str
    severity: str  # "critical", "important", "minor"
    suggestion: str
    
class PromptCritiqueAgent:
    """Agent that analyzes prompts for issues and suggests improvements"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        self.system_message = """You are an expert at analyzing software requirements.
        Your job is to identify ambiguities, missing information, and unclear requirements
        in user prompts. Be thorough but constructive. Focus on issues that would lead to
        building the wrong software or missing critical features."""
        
    def critique_prompt(self, user_prompt: str) -> List[PromptIssue]:
        """Analyze a prompt and return list of issues found"""
        
        analysis_prompt = f"""Analyze this software development prompt for issues:
        
        PROMPT:
        {user_prompt}
        
        Identify:
        1. Ambiguous requirements that could be interpreted multiple ways
        2. Missing critical information (authentication, data storage, scalability, etc.)
        3. Unclear scope or boundaries
        4. Unstated assumptions that should be explicit
        5. Potential conflicts or contradictions
        
        For each issue, provide:
        - Category (ambiguity, missing_requirement, unclear_scope, etc.)
        - Description of the issue
        - Severity (critical, important, minor)
        - Specific suggestion for improvement
        """
        
        # Define schema for structured output
        schema = {
            "issues": [
                {
                    "category": "string",
                    "description": "string",
                    "severity": "string",
                    "suggestion": "string"
                }
            ]
        }
        
        response = self.llm.generate_structured(analysis_prompt, schema)
        
        # Convert to PromptIssue objects
        issues = []
        for issue_data in response.get("issues", []):
            issues.append(PromptIssue(**issue_data))
            
        return issues
        
    def generate_improved_prompt(self, original_prompt: str, 
                                 issues: List[PromptIssue]) -> str:
        """Generate an improved version of the prompt addressing the issues"""
        
        issues_text = "\n".join([
            f"- {issue.description} (Suggestion: {issue.suggestion})"
            for issue in issues
        ])
        
        improvement_prompt = f"""Given this original prompt and identified issues,
        generate an improved version that addresses all concerns:
        
        ORIGINAL PROMPT:
        {original_prompt}
        
        ISSUES FOUND:
        {issues_text}
        
        Generate an improved prompt that:
        - Addresses all identified issues
        - Maintains the original intent
        - Adds necessary details and clarifications
        - Removes ambiguities
        - Makes implicit assumptions explicit
        """
        
        improved = self.llm.generate(improvement_prompt, self.system_message)
        return improved

The critique agent uses structured output to return a list of specific issues rather than free-form text. This makes it easy for other parts of the system to process and display the results. Each issue includes a category, description, severity level, and concrete suggestion for improvement.

The generate_improved_prompt method takes the original prompt and the list of issues, then asks the LLM to produce a better version. This improved prompt serves as a starting point for the developer to refine further.

Let us see how this works in practice with an example interaction:

# Example usage of prompt critique
from llm_provider import LLMConfig
from llm_factory import LLMFactory
from prompt_critique_agent import PromptCritiqueAgent

def demonstrate_prompt_enhancement():
    """Show how prompt enhancement works"""
    
    # Create LLM provider (using remote for this example)
    config = LLMConfig(
        model_name="gpt-4",
        provider_type="remote",
        api_key="your-api-key"
    )
    llm = LLMFactory.create_provider(config)
    
    # Create critique agent
    critique_agent = PromptCritiqueAgent(llm)
    
    # Developer's initial vague prompt
    initial_prompt = """Build a web application for managing tasks.
    Users should be able to create, edit, and delete tasks.
    It should have a nice UI."""
    
    print("ORIGINAL PROMPT:")
    print(initial_prompt)
    print("\n" + "="*60 + "\n")
    
    # Critique the prompt
    issues = critique_agent.critique_prompt(initial_prompt)
    
    print("ISSUES FOUND:")
    for issue in issues:
        print(f"\n[{issue.severity.upper()}] {issue.category}")
        print(f"  Problem: {issue.description}")
        print(f"  Suggestion: {issue.suggestion}")
        
    print("\n" + "="*60 + "\n")
    
    # Generate improved prompt
    improved = critique_agent.generate_improved_prompt(initial_prompt, issues)
    
    print("IMPROVED PROMPT:")
    print(improved)

When run with a vague initial prompt like "Build a web application for managing tasks", the critique agent might identify issues such as missing authentication requirements, no specification of data persistence, unclear scalability needs, undefined user roles, and ambiguous UI requirements. The improved prompt would address these gaps by stating explicitly whether the app needs user authentication, which database should be used, how many users are expected, whether different user roles carry different permissions, and which UI framework is preferred.

This prompt enhancement step prevents the common problem where an LLM builds a complete application only for the developer to realize it is missing critical features. By investing time upfront to clarify requirements, we save significant effort later in the development process.

The critique agent acts as a mentor to the developer, asking the questions an experienced software architect would ask when gathering requirements. It does not make decisions but rather ensures that all necessary information is available before development begins.
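
A minimal sketch of this refinement loop, assuming the PromptCritiqueAgent defined above and plain console input standing in for a richer developer interface, might look like this:

# Sketch of a developer-in-the-loop prompt refinement cycle.
# Assumes the PromptCritiqueAgent defined above; input() stands in for a real UI.
def refine_prompt_interactively(agent: PromptCritiqueAgent, prompt: str,
                                max_rounds: int = 3) -> str:
    """Iteratively critique and improve a prompt until the developer accepts it."""
    for _ in range(max_rounds):
        issues = agent.critique_prompt(prompt)
        blocking = [i for i in issues if i.severity in ("critical", "important")]
        if not blocking:
            return prompt  # no blocking issues remain, proceed to architecture
        improved = agent.generate_improved_prompt(prompt, blocking)
        print("Suggested prompt:\n" + improved)
        choice = input("Accept suggestion? [y = accept, e = edit, k = keep current]: ")
        choice = choice.strip().lower()
        if choice == "y":
            prompt = improved  # accept and re-check in the next round
        elif choice == "e":
            prompt = input("Enter your revised prompt: ")
        else:
            return prompt  # developer keeps the current wording
    return prompt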

Architecture Creation and Critique: Designing Before Building

With clear requirements established, the next step is architectural design. The architecture creator agent proposes a high-level system design while the architecture critique agent challenges and validates these decisions. This adversarial relationship ensures that architectural choices are well-reasoned and documented.

The architecture creator agent takes the refined requirements and designs a system structure. It selects appropriate architectural patterns, defines component boundaries, chooses technology stacks, and plans data flows. The agent produces not just a description but structured architectural artifacts including component diagrams, data models, and Architecture Decision Records.

The architecture critique agent then examines this design from multiple perspectives. It checks whether the architecture satisfies functional and non-functional requirements, whether it follows established best practices, whether it introduces unnecessary complexity, and whether it will scale to meet expected loads. The critique agent also verifies that the architecture allows for future evolution and maintenance.

Here is an implementation of the architecture creator agent:

# architecture_creator_agent.py
from dataclasses import dataclass
from typing import List, Dict, Optional
import json

from llm_provider import LLMProvider

@dataclass
class Component:
    """Represents a system component"""
    name: str
    responsibility: str
    dependencies: List[str]
    technology: str
    
@dataclass
class ArchitectureDecisionRecord:
    """Documents an architectural decision"""
    title: str
    context: str
    decision: str
    rationale: str
    consequences: List[str]
    alternatives_considered: List[str]
    
@dataclass
class Architecture:
    """Complete architecture specification"""
    overview: str
    components: List[Component]
    data_model: Dict
    api_design: Dict
    technology_stack: Dict[str, str]
    adrs: List[ArchitectureDecisionRecord]
    
class ArchitectureCreatorAgent:
    """Agent that designs system architecture from requirements"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        self.system_message = """You are an expert software architect with deep knowledge
        of design patterns, scalability, and best practices. Design systems that are
        maintainable, testable, and aligned with requirements. Always document your
        decisions with clear rationale."""
        
    def create_architecture(self, requirements: str) -> Architecture:
        """Design architecture based on requirements"""
        
        design_prompt = f"""Design a complete software architecture for these requirements:
        
        {requirements}
        
        Provide:
        1. High-level overview of the system
        2. List of components with responsibilities and dependencies
        3. Data model showing entities and relationships
        4. API design with key endpoints
        5. Technology stack choices with justification
        6. Architecture Decision Records for major choices
        
        Focus on:
        - Separation of concerns
        - Testability
        - Scalability
        - Maintainability
        - Security
        """
        
        # Define schema for structured architecture output
        schema = {
            "overview": "string",
            "components": [
                {
                    "name": "string",
                    "responsibility": "string",
                    "dependencies": ["string"],
                    "technology": "string"
                }
            ],
            "data_model": {
                "entities": [
                    {
                        "name": "string",
                        "attributes": ["string"],
                        "relationships": ["string"]
                    }
                ]
            },
            "api_design": {
                "endpoints": [
                    {
                        "path": "string",
                        "method": "string",
                        "purpose": "string"
                    }
                ]
            },
            "technology_stack": {
                "backend": "string",
                "frontend": "string",
                "database": "string",
                "deployment": "string"
            },
            "adrs": [
                {
                    "title": "string",
                    "context": "string",
                    "decision": "string",
                    "rationale": "string",
                    "consequences": ["string"],
                    "alternatives_considered": ["string"]
                }
            ]
        }
        
        response = self.llm.generate_structured(design_prompt, schema)
        
        # Convert to Architecture object
        components = [Component(**c) for c in response["components"]]
        adrs = [ArchitectureDecisionRecord(**a) for a in response["adrs"]]
        
        architecture = Architecture(
            overview=response["overview"],
            components=components,
            data_model=response["data_model"],
            api_design=response["api_design"],
            technology_stack=response["technology_stack"],
            adrs=adrs
        )
        
        return architecture

The creator agent generates a comprehensive architecture specification including all the artifacts a human architect would produce. The structured output ensures that nothing is forgotten and makes it easy for other agents to process the architecture.

The architecture critique agent examines this design and identifies potential issues:

# architecture_critique_agent.py
from dataclasses import dataclass
from typing import List, Optional

from llm_provider import LLMProvider
from architecture_creator_agent import Architecture

@dataclass
class ArchitectureIssue:
    """Issue found in architecture design"""
    category: str  # "scalability", "security", "complexity", "testability", etc.
    component: Optional[str]  # Which component is affected
    description: str
    severity: str
    recommendation: str
    
class ArchitectureCritiqueAgent:
    """Agent that reviews architecture for issues"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        self.system_message = """You are a senior architect conducting design review.
        Your job is to find potential issues in proposed architectures. Be thorough
        and critical but constructive. Consider scalability, security, maintainability,
        testability, and alignment with requirements."""
        
    def critique_architecture(self, architecture: Architecture, 
                              requirements: str) -> List[ArchitectureIssue]:
        """Review architecture and identify issues"""
        
        # Convert architecture to readable format for LLM
        arch_description = self._format_architecture(architecture)
        
        critique_prompt = f"""Review this architecture against requirements:
        
        REQUIREMENTS:
        {requirements}
        
        PROPOSED ARCHITECTURE:
        {arch_description}
        
        Identify issues in these areas:
        1. Does it satisfy all functional requirements?
        2. Does it meet non-functional requirements (performance, security, etc.)?
        3. Are components properly separated with clear responsibilities?
        4. Is the system testable?
        5. Will it scale to expected loads?
        6. Are there security vulnerabilities?
        7. Is there unnecessary complexity?
        8. Are technology choices appropriate and current?
        9. Are there missing components or functionality?
        10. Are the ADRs well-reasoned?
        
        For each issue found, provide category, affected component, description,
        severity, and specific recommendation for improvement.
        """
        
        schema = {
            "issues": [
                {
                    "category": "string",
                    "component": "string or null",
                    "description": "string",
                    "severity": "string",
                    "recommendation": "string"
                }
            ]
        }
        
        response = self.llm.generate_structured(critique_prompt, schema)
        
        issues = [ArchitectureIssue(**i) for i in response["issues"]]
        return issues
        
    def _format_architecture(self, arch: Architecture) -> str:
        """Format architecture for LLM consumption"""
        formatted = f"OVERVIEW:\n{arch.overview}\n\n"
        
        formatted += "COMPONENTS:\n"
        for comp in arch.components:
            formatted += f"  - {comp.name}: {comp.responsibility}\n"
            formatted += f"    Technology: {comp.technology}\n"
            formatted += f"    Dependencies: {', '.join(comp.dependencies)}\n\n"
            
        formatted += f"TECHNOLOGY STACK:\n"
        for key, value in arch.technology_stack.items():
            formatted += f"  {key}: {value}\n"
            
        formatted += f"\nADRs:\n"
        for adr in arch.adrs:
            formatted += f"  - {adr.title}\n"
            formatted += f"    Decision: {adr.decision}\n"
            formatted += f"    Rationale: {adr.rationale}\n\n"
            
        return formatted

The critique agent performs a comprehensive review checking multiple dimensions of the architecture. It does not simply validate that the design is syntactically correct but actively searches for logical flaws, missing functionality, and potential problems.

The two agents work together iteratively until the critique agent approves the architecture:

# architecture_orchestrator.py
from typing import List

from llm_provider import LLMProvider
from architecture_creator_agent import Architecture, ArchitectureCreatorAgent
from architecture_critique_agent import ArchitectureCritiqueAgent, ArchitectureIssue

class ArchitectureOrchestrator:
    """Orchestrates architecture creation and critique cycle"""
    
    def __init__(self, creator_llm: LLMProvider, critique_llm: LLMProvider):
        self.creator = ArchitectureCreatorAgent(creator_llm)
        self.critic = ArchitectureCritiqueAgent(critique_llm)
        
    def design_architecture(self, requirements: str, 
                           max_iterations: int = 5) -> Architecture:
        """Iteratively design and refine architecture"""
        
        current_architecture = None
        
        for iteration in range(max_iterations):
            print(f"\n=== Architecture Iteration {iteration + 1} ===\n")
            
            # Create or refine architecture
            if current_architecture is None:
                current_architecture = self.creator.create_architecture(requirements)
            else:
                # Refine based on previous critique
                current_architecture = self._refine_architecture(
                    current_architecture, issues
                )
                
            # Critique the architecture
            issues = self.critic.critique_architecture(
                current_architecture, requirements
            )
            
            # Check if critique is satisfied
            critical_issues = [i for i in issues if i.severity == "critical"]
            
            if not critical_issues:
                print("Architecture approved by critique agent!")
                return current_architecture
                
            print(f"Found {len(critical_issues)} critical issues:")
            for issue in critical_issues:
                print(f"  - {issue.description}")
                
        print("Max iterations reached. Returning best architecture.")
        return current_architecture
        
    def _refine_architecture(self, architecture: Architecture, 
                            issues: List[ArchitectureIssue]) -> Architecture:
        """Refine architecture based on critique"""
        
        issues_text = "\n".join([
            f"- [{issue.severity}] {issue.description}\n  Fix: {issue.recommendation}"
            for issue in issues
        ])
        
        refinement_prompt = f"""Refine this architecture to address the issues:
        
        CURRENT ARCHITECTURE:
        {self.critic._format_architecture(architecture)}
        
        ISSUES TO ADDRESS:
        {issues_text}
        
        Provide an improved architecture that fixes all critical and important issues
        while maintaining the overall design intent.
        """
        
        # Use the same schema as initial creation
        # Implementation would be similar to create_architecture
        # but incorporating fixes for identified issues
        
        return self.creator.create_architecture(refinement_prompt)

This orchestrator manages the iterative refinement process. It creates an initial architecture, gets critique, refines based on feedback, and repeats until the critique agent is satisfied or a maximum number of iterations is reached. This adversarial process produces higher quality architectures than a single-pass generation.

The architecture phase is critical because it establishes the foundation for all subsequent code generation. By investing effort in getting the architecture right through this creator-critique cycle, we prevent costly rework later when implementation reveals fundamental design flaws.
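
As a usage sketch, the orchestrator can be driven as shown below. The model choices are placeholders, and refined_requirements stands in for the clarified prompt produced by the enhancement phase:

# Sketch: driving the creator-critique cycle for architecture design.
creator_llm = LLMFactory.create_from_dict(
    {"model_name": "gpt-4", "provider_type": "remote", "api_key": "your-api-key"})
critique_llm = LLMFactory.create_from_dict(
    {"model_name": "gpt-4", "provider_type": "remote", "api_key": "your-api-key"})

refined_requirements = "..."  # the improved prompt from the enhancement phase

orchestrator = ArchitectureOrchestrator(creator_llm, critique_llm)
architecture = orchestrator.design_architecture(refined_requirements, max_iterations=5)

# Surface the Architecture Decision Records for human review
for adr in architecture.adrs:
    print(f"ADR: {adr.title}")
    print(f"  Decision:  {adr.decision}")
    print(f"  Rationale: {adr.rationale}")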

Code Generation and Review: Implementing the Design

With an approved architecture in place, the code generator agent can implement the design. Unlike a general-purpose coding LLM that must simultaneously design and implement, our code generator focuses solely on translating architectural specifications into working code. This specialization leads to better results.

The code generator agent receives the architecture specification and implements each component according to the design. It follows the technology choices, respects component boundaries, and adheres to the API contracts defined in the architecture. The agent generates clean, well-documented code that matches the architectural intent.

The code review agent then examines this implementation with a critical eye. It searches for bugs, security vulnerabilities, performance issues, violations of coding standards, and deviations from the architecture. The reviewer acts like an experienced senior developer conducting a thorough code review.

Here is the code generator agent implementation:

# code_generator_agent.py
from dataclasses import dataclass
from typing import Dict, List

from llm_provider import LLMProvider
from architecture_creator_agent import Architecture, Component

@dataclass
class CodeArtifact:
    """Represents generated code for a component"""
    component_name: str
    file_path: str
    code: str
    language: str
    dependencies: List[str]
    
class CodeGeneratorAgent:
    """Agent that generates code from architecture specifications"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        self.system_message = """You are an expert software developer who writes clean,
        well-documented, production-quality code. Follow SOLID principles, write
        comprehensive docstrings, use meaningful variable names, and include error
        handling. Generate code that exactly matches the architecture specification."""
        
    def generate_component(self, component: Component, 
                          architecture: Architecture) -> List[CodeArtifact]:
        """Generate code for a specific component"""
        
        # Build context about the component's role in the system
        context = self._build_component_context(component, architecture)
        
        generation_prompt = f"""Generate production-quality code for this component:
        
        COMPONENT: {component.name}
        RESPONSIBILITY: {component.responsibility}
        TECHNOLOGY: {component.technology}
        DEPENDENCIES: {', '.join(component.dependencies)}
        
        SYSTEM CONTEXT:
        {context}
        
        Requirements:
        1. Follow clean code principles
        2. Include comprehensive docstrings and comments
        3. Implement proper error handling
        4. Use type hints (if language supports)
        5. Follow the architecture's API contracts
        6. Include input validation
        7. Consider security best practices
        
        Generate all necessary files for this component including:
        - Main implementation
        - Interface definitions
        - Configuration
        - Any supporting utilities
        """
        
        schema = {
            "files": [
                {
                    "file_path": "string",
                    "code": "string",
                    "language": "string",
                    "purpose": "string"
                }
            ]
        }
        
        response = self.llm.generate_structured(generation_prompt, schema)
        
        artifacts = []
        for file_data in response["files"]:
            artifact = CodeArtifact(
                component_name=component.name,
                file_path=file_data["file_path"],
                code=file_data["code"],
                language=file_data["language"],
                dependencies=component.dependencies
            )
            artifacts.append(artifact)
            
        return artifacts
        
    def _build_component_context(self, component: Component, 
                                 architecture: Architecture) -> str:
        """Build context about how component fits in system"""
        
        context = f"Overall System: {architecture.overview}\n\n"
        
        # Add information about dependent components
        context += "Dependent Components:\n"
        for dep_name in component.dependencies:
            dep_component = next(
                (c for c in architecture.components if c.name == dep_name),
                None
            )
            if dep_component:
                context += f"  - {dep_name}: {dep_component.responsibility}\n"
                
        # Add relevant API contracts
        if component.name in architecture.api_design:
            context += f"\nAPI Contracts:\n{architecture.api_design[component.name]}\n"
            
        return context

The generator creates code that respects the architectural boundaries and follows best practices. It generates not just the main implementation but also supporting files like interfaces, configuration, and utilities.

The code review agent examines this generated code for issues:

# code_review_agent.py
from dataclasses import dataclass
from typing import List, Optional

from llm_provider import LLMProvider
from architecture_creator_agent import Architecture, Component
from code_generator_agent import CodeArtifact

@dataclass
class CodeIssue:
    """Issue found during code review"""
    file_path: str
    line_number: Optional[int]
    category: str  # "bug", "security", "performance", "style", "architecture_violation"
    description: str
    severity: str
    suggested_fix: str
    
class CodeReviewAgent:
    """Agent that reviews generated code for issues"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        self.system_message = """You are a senior software engineer conducting code review.
        Find bugs, security issues, performance problems, style violations, and
        deviations from architecture. Be thorough and specific. Provide actionable
        suggestions for fixes."""
        
    def review_code(self, artifacts: List[CodeArtifact], 
                   component: Component,
                   architecture: Architecture) -> List[CodeIssue]:
        """Review code artifacts and identify issues"""
        
        issues = []
        
        for artifact in artifacts:
            artifact_issues = self._review_artifact(artifact, component, architecture)
            issues.extend(artifact_issues)
            
        return issues
        
    def _review_artifact(self, artifact: CodeArtifact,
                        component: Component,
                        architecture: Architecture) -> List[CodeIssue]:
        """Review a single code artifact"""
        
        review_prompt = f"""Review this code for issues:
        
        FILE: {artifact.file_path}
        COMPONENT: {artifact.component_name}
        EXPECTED RESPONSIBILITY: {component.responsibility}
        
        CODE:
        {artifact.code}
        
        Check for:
        1. Bugs and logical errors
        2. Security vulnerabilities (SQL injection, XSS, authentication issues, etc.)
        3. Performance problems (N+1 queries, inefficient algorithms, etc.)
        4. Violations of clean code principles
        5. Missing error handling
        6. Deviations from the architecture specification
        7. Missing input validation
        8. Inadequate documentation
        9. Use of deprecated or outdated libraries
        10. Thread safety issues if applicable
        
        For each issue, provide:
        - Specific line number if applicable
        - Category (bug, security, performance, style, architecture_violation)
        - Clear description
        - Severity (critical, important, minor)
        - Concrete suggestion for fixing
        """
        
        schema = {
            "issues": [
                {
                    "line_number": "number or null",
                    "category": "string",
                    "description": "string",
                    "severity": "string",
                    "suggested_fix": "string"
                }
            ]
        }
        
        response = self.llm.generate_structured(review_prompt, schema)
        
        issues = []
        for issue_data in response["issues"]:
            issue = CodeIssue(
                file_path=artifact.file_path,
                **issue_data
            )
            issues.append(issue)
            
        return issues

The review agent performs a comprehensive analysis checking for multiple categories of issues. It provides specific line numbers when possible and concrete suggestions for fixes, making it easy for the generator to address the problems.

The orchestrator manages the generation-review cycle:

# code_orchestrator.py
from typing import List

from llm_provider import LLMProvider
from architecture_creator_agent import Architecture, Component
from code_generator_agent import CodeArtifact, CodeGeneratorAgent
from code_review_agent import CodeIssue, CodeReviewAgent

class CodeOrchestrator:
    """Orchestrates code generation and review cycle"""
    
    def __init__(self, generator_llm: LLMProvider, reviewer_llm: LLMProvider):
        self.generator = CodeGeneratorAgent(generator_llm)
        self.reviewer = CodeReviewAgent(reviewer_llm)
        
    def implement_component(self, component: Component,
                           architecture: Architecture,
                           max_iterations: int = 3) -> List[CodeArtifact]:
        """Generate and refine code for a component"""
        
        artifacts = None
        
        for iteration in range(max_iterations):
            print(f"\n=== Code Generation Iteration {iteration + 1} ===")
            print(f"Component: {component.name}\n")
            
            # Generate or refine code
            if artifacts is None:
                artifacts = self.generator.generate_component(component, architecture)
            else:
                artifacts = self._refine_code(artifacts, issues, component, architecture)
                
            # Review the code
            issues = self.reviewer.review_code(artifacts, component, architecture)
            
            # Check for critical issues
            critical_issues = [i for i in issues if i.severity == "critical"]
            
            if not critical_issues:
                print("Code approved by review agent!")
                return artifacts
                
            print(f"Found {len(critical_issues)} critical issues:")
            for issue in critical_issues[:5]:  # Show first 5
                print(f"  - {issue.file_path}: {issue.description}")
                
        print("Max iterations reached. Returning current code.")
        return artifacts
        
    def _refine_code(self, artifacts: List[CodeArtifact],
                    issues: List[CodeIssue],
                    component: Component,
                    architecture: Architecture) -> List[CodeArtifact]:
        """Refine code based on review feedback"""
        
        # Group issues by file
        issues_by_file = {}
        for issue in issues:
            if issue.file_path not in issues_by_file:
                issues_by_file[issue.file_path] = []
            issues_by_file[issue.file_path].append(issue)
            
        refined_artifacts = []
        
        for artifact in artifacts:
            file_issues = issues_by_file.get(artifact.file_path, [])
            
            if not file_issues:
                # No issues in this file, keep as is
                refined_artifacts.append(artifact)
                continue
                
            # Generate fixes for this file
            issues_text = "\n".join([
                f"Line {i.line_number or 'N/A'}: [{i.severity}] {i.description}\n"
                f"  Fix: {i.suggested_fix}"
                for i in file_issues
            ])
            
            fix_prompt = f"""Fix the issues in this code:
            
            ORIGINAL CODE:
            {artifact.code}
            
            ISSUES TO FIX:
            {issues_text}
            
            Provide the corrected code that addresses all issues while maintaining
            the original functionality and architecture compliance.
            """
            
            fixed_code = self.generator.llm.generate(fix_prompt)
            
            refined_artifact = CodeArtifact(
                component_name=artifact.component_name,
                file_path=artifact.file_path,
                code=fixed_code,
                language=artifact.language,
                dependencies=artifact.dependencies
            )
            refined_artifacts.append(refined_artifact)
            
        return refined_artifacts

This orchestrator implements each component through an iterative refinement process. The generator creates code, the reviewer finds issues, the generator fixes them, and the cycle continues until the code passes review or reaches the maximum iteration limit.

The adversarial relationship between generator and reviewer produces higher quality code than single-pass generation. The reviewer catches bugs, security issues, and architectural violations that the generator might have introduced. This mimics the human code review process but operates automatically and consistently.
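
A small driver ties the two phases together. The helper below is a sketch under the assumption that the orchestrators above are available and that generated files are simply written to a local "generated" directory:

# Sketch: implementing every component of an approved architecture and writing
# the generated files to a local directory. Assumes the orchestrators above.
import os

def implement_system(architecture: Architecture,
                     code_orchestrator: CodeOrchestrator,
                     output_dir: str = "generated") -> None:
    """Run the generate-review cycle for each component and persist the results."""
    for component in architecture.components:
        artifacts = code_orchestrator.implement_component(component, architecture)
        for artifact in artifacts:
            path = os.path.join(output_dir, artifact.file_path)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w", encoding="utf-8") as f:
                f.write(artifact.code)
        print(f"Wrote {len(artifacts)} files for component {component.name}")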

Test Generation and Validation: Ensuring Correctness

Code without tests is incomplete. The test generation layer creates comprehensive test suites while the test critique agent ensures these tests actually validate the requirements rather than just achieving meaningless code coverage.

The test creator agent generates unit tests, integration tests, and end-to-end tests based on the requirements and implementation. It creates tests that verify both happy paths and error conditions, edge cases and normal cases, functional requirements and non-functional characteristics.

The test critique agent reviews these tests to ensure they are meaningful. It checks whether the tests actually validate requirements, whether they cover important edge cases, whether they are maintainable, and whether they would catch real bugs. A test that simply calls a function and asserts the result is not None might achieve code coverage but provides no real value.

Here is the test creator agent:

# test_creator_agent.py
from dataclasses import dataclass
from typing import List, Dict

from llm_provider import LLMProvider
from architecture_creator_agent import Component
from code_generator_agent import CodeArtifact

@dataclass
class TestArtifact:
    """Represents a test file"""
    file_path: str
    code: str
    test_type: str  # "unit", "integration", "e2e"
    component_tested: str
    coverage_target: List[str]  # Functions/methods covered
    
class TestCreatorAgent:
    """Agent that generates comprehensive test suites"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        self.system_message = """You are an expert in test-driven development and
        quality assurance. Write comprehensive, maintainable tests that validate
        requirements and catch real bugs. Include tests for edge cases, error
        conditions, and integration points."""
        
    def generate_tests(self, component: Component,
                      code_artifacts: List[CodeArtifact],
                      requirements: str) -> List[TestArtifact]:
        """Generate test suite for a component"""
        
        # Extract key functionality to test from code
        functionality = self._extract_functionality(code_artifacts)
        
        test_prompt = f"""Generate comprehensive tests for this component:
        
        COMPONENT: {component.name}
        RESPONSIBILITY: {component.responsibility}
        
        REQUIREMENTS TO VALIDATE:
        {requirements}
        
        IMPLEMENTATION:
        {self._format_code_artifacts(code_artifacts)}
        
        KEY FUNCTIONALITY:
        {functionality}
        
        Generate tests that:
        1. Validate all functional requirements
        2. Test happy paths and error conditions
        3. Cover edge cases and boundary conditions
        4. Verify integration with dependencies
        5. Check error handling and validation
        6. Test performance characteristics if relevant
        7. Are maintainable and well-documented
        
        Include:
        - Unit tests for individual functions/methods
        - Integration tests for component interactions
        - End-to-end tests for complete workflows
        
        Use appropriate testing framework for the technology stack.
        """
        
        schema = {
            "test_files": [
                {
                    "file_path": "string",
                    "code": "string",
                    "test_type": "string",
                    "coverage_target": ["string"],
                    "purpose": "string"
                }
            ]
        }
        
        response = self.llm.generate_structured(test_prompt, schema)
        
        artifacts = []
        for test_data in response["test_files"]:
            artifact = TestArtifact(
                file_path=test_data["file_path"],
                code=test_data["code"],
                test_type=test_data["test_type"],
                component_tested=component.name,
                coverage_target=test_data["coverage_target"]
            )
            artifacts.append(artifact)
            
        return artifacts
        
    def _extract_functionality(self, artifacts: List[CodeArtifact]) -> str:
        """Extract key functionality from code for test generation"""
        
        # Simple extraction - in practice would use AST parsing
        functions = []
        for artifact in artifacts:
            # Extract function/method signatures
            lines = artifact.code.split('\n')
            for line in lines:
                if 'def ' in line or 'function ' in line or 'public ' in line:
                    functions.append(line.strip())
                    
        return '\n'.join(functions)
        
    def _format_code_artifacts(self, artifacts: List[CodeArtifact]) -> str:
        """Format code artifacts for LLM consumption"""
        formatted = ""
        for artifact in artifacts:
            formatted += f"\nFile: {artifact.file_path}\n"
            formatted += f"{artifact.code}\n"
            formatted += "-" * 60 + "\n"
        return formatted

The test creator generates multiple types of tests covering different aspects of the component. It creates unit tests for individual functions, integration tests for component interactions, and end-to-end tests for complete workflows.

The test critique agent reviews these tests:

# test_critique_agent.py
from dataclasses import dataclass
from typing import List

@dataclass
class TestIssue:
    """Issue found in test suite"""
    test_file: str
    category: str  # "coverage_gap", "meaningless_test", "missing_edge_case", etc.
    description: str
    severity: str
    recommendation: str
    
class TestCritiqueAgent:
    """Agent that reviews test suites for quality and completeness"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        self.system_message = """You are a QA expert reviewing test suites. Your job
        is to ensure tests actually validate requirements and would catch real bugs.
        Identify missing test cases, meaningless tests, and gaps in coverage."""
        
    def critique_tests(self, test_artifacts: List[TestArtifact],
                      requirements: str,
                      code_artifacts: List[CodeArtifact]) -> List[TestIssue]:
        """Review test suite for issues"""
        
        critique_prompt = f"""Review this test suite for quality and completeness:
        
        REQUIREMENTS:
        {requirements}
        
        IMPLEMENTATION:
        {self._format_code(code_artifacts)}
        
        TEST SUITE:
        {self._format_tests(test_artifacts)}
        
        Identify:
        1. Requirements not validated by any test
        2. Important edge cases not covered
        3. Error conditions not tested
        4. Meaningless tests that don't validate anything useful
        5. Tests that are too tightly coupled to implementation
        6. Missing integration tests
        7. Inadequate test documentation
        8. Tests that would not catch real bugs
        9. Performance tests if needed but missing
        10. Security tests if needed but missing
        
        For each issue, provide category, description, severity, and specific
        recommendation for improvement.
        """
        
        schema = {
            "issues": [
                {
                    "test_file": "string",
                    "category": "string",
                    "description": "string",
                    "severity": "string",
                    "recommendation": "string"
                }
            ]
        }
        
        response = self.llm.generate_structured(critique_prompt, schema)
        
        issues = [TestIssue(**i) for i in response["issues"]]
        return issues
        
    def _format_tests(self, artifacts: List[TestArtifact]) -> str:
        """Format test artifacts for review"""
        formatted = ""
        for artifact in artifacts:
            formatted += f"\n{artifact.test_type.upper()} TEST: {artifact.file_path}\n"
            formatted += f"Covers: {', '.join(artifact.coverage_target)}\n"
            formatted += f"{artifact.code}\n"
            formatted += "=" * 60 + "\n"
        return formatted
        
    def _format_code(self, artifacts: List[CodeArtifact]) -> str:
        """Format code for review context"""
        formatted = ""
        for artifact in artifacts[:3]:  # Limit to first 3 files for context
            formatted += f"File: {artifact.file_path}\n{artifact.code}\n\n"
        return formatted

The test critique agent ensures that tests are meaningful and comprehensive. It identifies gaps in coverage, meaningless assertions, and missing edge cases. This prevents the common problem where developers achieve high code coverage with low-quality tests that provide false confidence.
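To make the difference concrete, consider two tests for a hypothetical create_task function (the function, module, and use of pytest here are purely illustrative, not output of the system). The first achieves coverage without validating anything; the second would catch a real regression:

# test_quality_example.py
import pytest

from task_service import create_task  # hypothetical module under test

def test_create_task_weak():
    """Low-value test: passes as long as the function returns anything at all"""
    assert create_task("Buy milk", "Two liters, semi-skimmed") is not None

def test_create_task_rejects_empty_title():
    """Meaningful test: validates the requirement that every task needs a title"""
    with pytest.raises(ValueError):
        create_task("", "No title provided")

The critique agent flags tests of the first kind under the meaningless_test category and asks the creator to replace them with assertions tied to actual requirements.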

An orchestrator manages the test generation and refinement cycle similar to the code orchestrator. The test creator generates tests, the test critique identifies issues, the creator addresses them, and the cycle continues until the test suite is approved.
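For completeness, here is a minimal sketch of such a test orchestrator. The constructor and generate_tests signatures match how the master orchestrator in the next section instantiates and calls it; the iteration limit and the strategy of feeding critique findings back through the requirements context are illustrative choices, not the only possibility:

# test_orchestrator.py
from typing import List

class TestOrchestrator:
    """Coordinates test creator and test critique agents"""

    def __init__(self, creator_llm: LLMProvider, critique_llm: LLMProvider,
                 max_iterations: int = 3):
        self.creator = TestCreatorAgent(creator_llm)
        self.critique = TestCritiqueAgent(critique_llm)
        self.max_iterations = max_iterations

    def generate_tests(self, component: Component,
                       code_artifacts: List[CodeArtifact],
                       requirements: str) -> List[TestArtifact]:
        """Iterate between creation and critique until the suite is approved"""
        tests = self.creator.generate_tests(component, code_artifacts, requirements)

        for iteration in range(self.max_iterations):
            issues = self.critique.critique_tests(tests, requirements, code_artifacts)
            if not issues:
                print(f"  Test suite approved after {iteration + 1} review(s)")
                break

            print(f"  Iteration {iteration + 1}: {len(issues)} test issues to address")
            # One simple refinement strategy: append the critique findings to the
            # requirements context and regenerate the suite
            feedback = "\n".join(f"- {i.category}: {i.recommendation}" for i in issues)
            augmented = f"{requirements}\n\nADDRESS THESE TEST REVIEW FINDINGS:\n{feedback}"
            tests = self.creator.generate_tests(component, code_artifacts, augmented)

        return tests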

The test generation layer ensures that the final codebase includes comprehensive, meaningful tests that actually validate requirements. This is crucial for maintaining code quality as the system evolves.

Agent Orchestration and Workflow: Coordinating the System

With all the individual agents implemented, we need an orchestration layer that coordinates their interactions and manages the overall workflow. This orchestrator ensures that agents execute in the correct order, that artifacts flow between agents properly, and that the iterative refinement process converges to a solution.

The master orchestrator manages the complete development workflow from requirements to tested code:

# master_orchestrator.py
from dataclasses import dataclass
from typing import Dict, List
import json

@dataclass
class DevelopmentArtifacts:
    """Complete set of artifacts produced by the system"""
    requirements: str
    enhanced_prompt: str
    architecture: Architecture
    code_artifacts: Dict[str, List[CodeArtifact]]  # component_name -> artifacts
    test_artifacts: Dict[str, List[TestArtifact]]  # component_name -> tests
    documentation: Dict[str, str]  # artifact_type -> content
    
class MasterOrchestrator:
    """Coordinates all agents to produce complete software system"""
    
    def __init__(self, llm_configs: Dict[str, LLMConfig]):
        """Initialize with LLM configurations for different agents"""
        
        # Create LLM providers for different agent types
        # Could use different models for different tasks
        self.prompt_llm = LLMFactory.create_provider(llm_configs["prompt"])
        self.arch_llm = LLMFactory.create_provider(llm_configs["architecture"])
        self.code_llm = LLMFactory.create_provider(llm_configs["code"])
        self.test_llm = LLMFactory.create_provider(llm_configs["test"])
        
        # Initialize agent orchestrators
        self.prompt_critique = PromptCritiqueAgent(self.prompt_llm)
        self.arch_orchestrator = ArchitectureOrchestrator(
            self.arch_llm, self.arch_llm
        )
        self.code_orchestrator = CodeOrchestrator(
            self.code_llm, self.code_llm
        )
        self.test_orchestrator = TestOrchestrator(
            self.test_llm, self.test_llm
        )
        
    def develop_system(self, initial_prompt: str) -> DevelopmentArtifacts:
        """Execute complete development workflow"""
        
        print("="*70)
        print("MULTI-AGENT SOFTWARE DEVELOPMENT SYSTEM")
        print("="*70)
        
        # Phase 1: Enhance prompt
        print("\n[PHASE 1] Enhancing requirements prompt...")
        enhanced_prompt = self._enhance_prompt(initial_prompt)
        
        # Phase 2: Design architecture
        print("\n[PHASE 2] Designing system architecture...")
        architecture = self.arch_orchestrator.design_architecture(enhanced_prompt)
        
        # Phase 3: Generate code for each component
        print("\n[PHASE 3] Generating code for components...")
        code_artifacts = {}
        for component in architecture.components:
            print(f"\n  Implementing component: {component.name}")
            artifacts = self.code_orchestrator.implement_component(
                component, architecture
            )
            code_artifacts[component.name] = artifacts
            
        # Phase 4: Generate tests
        print("\n[PHASE 4] Generating test suites...")
        test_artifacts = {}
        for component in architecture.components:
            print(f"\n  Creating tests for: {component.name}")
            tests = self.test_orchestrator.generate_tests(
                component,
                code_artifacts[component.name],
                enhanced_prompt
            )
            test_artifacts[component.name] = tests
            
        # Phase 5: Generate documentation
        print("\n[PHASE 5] Generating documentation...")
        documentation = self._generate_documentation(
            enhanced_prompt, architecture, code_artifacts, test_artifacts
        )
        
        print("\n[COMPLETE] System development finished!")
        
        return DevelopmentArtifacts(
            requirements=initial_prompt,
            enhanced_prompt=enhanced_prompt,
            architecture=architecture,
            code_artifacts=code_artifacts,
            test_artifacts=test_artifacts,
            documentation=documentation
        )
        
    def _enhance_prompt(self, initial_prompt: str) -> str:
        """Enhance initial prompt through critique cycle"""
        
        issues = self.prompt_critique.critique_prompt(initial_prompt)
        
        if not issues:
            return initial_prompt
            
        print(f"  Found {len(issues)} issues in initial prompt")
        
        enhanced = self.prompt_critique.generate_improved_prompt(
            initial_prompt, issues
        )
        
        return enhanced
        
    def _generate_documentation(self, requirements: str,
                               architecture: Architecture,
                               code_artifacts: Dict[str, List[CodeArtifact]],
                               test_artifacts: Dict[str, List[TestArtifact]]) -> Dict[str, str]:
        """Generate comprehensive documentation"""
        
        documentation = {}
        
        # Generate README
        readme_prompt = f"""Generate a comprehensive README.md for this system:
        
        Requirements: {requirements}
        Architecture: {architecture.overview}
        Components: {', '.join([c.name for c in architecture.components])}
        
        Include:
        - Project overview
        - Architecture summary
        - Setup instructions
        - Usage examples
        - Testing instructions
        - Technology stack
        """
        
        documentation["README"] = self.arch_llm.generate(readme_prompt)
        
        # Generate API documentation
        api_doc_prompt = f"""Generate API documentation:
        
        API Design: {json.dumps(architecture.api_design, indent=2)}
        
        Include for each endpoint:
        - Purpose
        - Request format
        - Response format
        - Error codes
        - Examples
        """
        
        documentation["API"] = self.arch_llm.generate(api_doc_prompt)
        
        # Format Architecture Decision Records
        adrs = []
        for adr in architecture.adrs:
            adr_text = f"""
            ADR: {adr.title}
            
            Context:
            {adr.context}
            
            Decision:
            {adr.decision}
            
            Rationale:
            {adr.rationale}
            
            Consequences:
            {chr(10).join(['- ' + c for c in adr.consequences])}
            
            Alternatives Considered:
            {chr(10).join(['- ' + a for a in adr.alternatives_considered])}
            """
            adrs.append(adr_text)
            
        documentation["ADRs"] = "\n\n".join(adrs)
        
        return documentation

The master orchestrator coordinates the entire workflow from initial prompt to complete system with tests and documentation. It manages the flow of artifacts between agents and ensures each phase completes successfully before moving to the next.

This orchestrator can be configured to use different LLM providers for different tasks. For example, you might use a powerful remote model like GPT-4 for architecture design but a smaller local model for code review to reduce costs and latency.

The orchestration layer provides the glue that binds individual agents into a cohesive system. It manages complexity and provides visibility into what the system is doing at each step. As written, develop_system stops at the first exception; a production orchestrator would also recover from transient failures, for example by retrying a phase whose LLM call fails or returns malformed output.
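One lightweight way to add that resilience, shown here as an illustrative helper rather than part of the core implementation, is a retry wrapper with a short backoff:

# phase_retry.py
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def run_phase_with_retry(phase_name: str, phase: Callable[[], T],
                         max_attempts: int = 3,
                         delay_seconds: float = 2.0) -> T:
    """Run a workflow phase, retrying transient failures with linear backoff"""
    for attempt in range(1, max_attempts + 1):
        try:
            return phase()
        except Exception as exc:
            print(f"  [{phase_name}] attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds * attempt)

Inside develop_system, the architecture phase could then be invoked as run_phase_with_retry("architecture", lambda: self.arch_orchestrator.design_architecture(enhanced_prompt)), and likewise for the other phases.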

Artifact Generation and Documentation: Making Work Visible

Throughout the development process, agents generate various artifacts that document decisions, explain rationale, and provide context for future maintenance. These artifacts are crucial for making the AI's work transparent and understandable to human developers.

Architecture Decision Records are particularly important. They capture why certain architectural choices were made, what alternatives were considered, and what consequences are expected. This documentation prevents future developers from wondering "why did they build it this way?" and helps them understand the constraints and trade-offs that influenced the design.

Here is an example of generating comprehensive ADRs:

# adr_generator.py
from datetime import datetime

class ADRGenerator:
    """Generates Architecture Decision Records"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        
    def generate_adr(self, decision_context: str, 
                    architecture: Architecture) -> ArchitectureDecisionRecord:
        """Generate a detailed ADR for a specific decision"""
        
        adr_prompt = f"""Generate a comprehensive Architecture Decision Record:
        
        DECISION CONTEXT:
        {decision_context}
        
        CURRENT ARCHITECTURE:
        {architecture.overview}
        
        Create an ADR that includes:
        1. Clear title summarizing the decision
        2. Context explaining why this decision was needed
        3. The specific decision made
        4. Detailed rationale explaining why this choice is best
        5. Expected consequences (positive and negative)
        6. Alternative approaches that were considered and why they were rejected
        
        Be specific and technical. Future developers should understand exactly
        why this decision was made and what trade-offs were accepted.
        """
        
        schema = {
            "title": "string",
            "context": "string",
            "decision": "string",
            "rationale": "string",
            "consequences": ["string"],
            "alternatives_considered": ["string"]
        }
        
        response = self.llm.generate_structured(adr_prompt, schema)
        
        adr = ArchitectureDecisionRecord(**response)
        return adr
        
    def format_adr_document(self, adr: ArchitectureDecisionRecord, 
                           adr_number: int) -> str:
        """Format ADR as a readable document"""
        
        doc = f"""
# ADR {adr_number:03d}: {adr.title}

Date: {datetime.now().strftime('%Y-%m-%d')}
Status: Accepted

## Context

{adr.context}

## Decision

{adr.decision}

## Rationale

{adr.rationale}

## Consequences

"""
        
        for consequence in adr.consequences:
            doc += f"* {consequence}\n"
            
        doc += "\n## Alternatives Considered\n\n"
        
        for alternative in adr.alternatives_considered:
            doc += f"* {alternative}\n"
            
        return doc

ADRs provide a historical record of architectural decisions that helps future developers understand the evolution of the system. They are particularly valuable when someone needs to revisit a decision or understand why certain constraints exist.

The system also generates comprehensive code documentation:

# documentation_generator.py
class DocumentationGenerator:
    """Generates various types of documentation"""
    
    def __init__(self, llm_provider: LLMProvider):
        self.llm = llm_provider
        
    def generate_component_documentation(self, component: Component,
                                        code_artifacts: List[CodeArtifact]) -> str:
        """Generate detailed documentation for a component"""
        
        doc_prompt = f"""Generate comprehensive documentation for this component:
        
        COMPONENT: {component.name}
        RESPONSIBILITY: {component.responsibility}
        TECHNOLOGY: {component.technology}
        
        CODE:
        {self._format_artifacts(code_artifacts)}
        
        Generate documentation that includes:
        1. Component overview and purpose
        2. Key classes/modules and their responsibilities
        3. Public API with parameters and return values
        4. Usage examples
        5. Dependencies and how to integrate
        6. Configuration options
        7. Error handling approach
        8. Performance considerations
        
        Format as markdown suitable for a developer documentation site.
        """
        
        documentation = self.llm.generate(doc_prompt)
        return documentation
        
    def generate_testing_guide(self, test_artifacts: List[TestArtifact]) -> str:
        """Generate guide for running and understanding tests"""
        
        guide_prompt = f"""Generate a testing guide based on these test suites:
        
        TEST SUITES:
        {self._format_test_artifacts(test_artifacts)}
        
        Include:
        1. How to run all tests
        2. How to run specific test suites
        3. What each test suite validates
        4. How to interpret test results
        5. How to add new tests
        6. Testing best practices for this codebase
        7. Coverage expectations
        """
        
        guide = self.llm.generate(guide_prompt)
        return guide
        
    def _format_artifacts(self, artifacts: List[CodeArtifact]) -> str:
        """Format code artifacts for documentation generation"""
        formatted = ""
        for artifact in artifacts:
            formatted += f"\nFile: {artifact.file_path}\n"
            formatted += f"{artifact.code}\n"
            formatted += "-" * 60 + "\n"
        return formatted
        
    def _format_test_artifacts(self, artifacts: List[TestArtifact]) -> str:
        """Format test artifacts for documentation"""
        formatted = ""
        for artifact in artifacts:
            formatted += f"\n{artifact.test_type} Test: {artifact.file_path}\n"
            formatted += f"Tests: {', '.join(artifact.coverage_target)}\n"
            formatted += "-" * 60 + "\n"
        return formatted

The documentation generator creates human-readable guides that explain how the system works, how to use it, and how to extend it. This documentation is essential for making the AI-generated code maintainable by human developers.

All these artifacts together provide transparency into what the AI system built and why. A developer can review the ADRs to understand architectural decisions, read the component documentation to understand how pieces fit together, and consult the testing guide to validate changes. This transparency is crucial for building trust in AI-generated code.

IDE Integration: Bringing Agents to Developers

The multi-agent system is most useful when integrated into developers' existing workflows. IDE integration allows developers to invoke agents directly from their code editor, review suggestions in context, and accept or reject changes with familiar tools.

A Language Server Protocol implementation provides IDE integration across multiple editors:

# lsp_server.py
from typing import Dict, List, Optional
import json

class MultiAgentLSP:
    """Language Server Protocol implementation for multi-agent system"""
    
    def __init__(self, orchestrator: MasterOrchestrator):
        self.orchestrator = orchestrator
        self.workspace_state = {}
        
    def handle_code_action(self, document_uri: str, 
                          range_selection: Dict) -> List[Dict]:
        """Provide code actions (refactorings, fixes) for selected code"""
        
        # Get current code from document
        current_code = self._get_document_content(document_uri)
        selected_code = self._extract_range(current_code, range_selection)
        
        # Use code review agent to suggest improvements
        review_agent = CodeReviewAgent(self.orchestrator.code_llm)
        
        # Create temporary artifact for review
        artifact = CodeArtifact(
            component_name="current",
            file_path=document_uri,
            code=selected_code,
            language=self._detect_language(document_uri),
            dependencies=[]
        )
        
        # Get suggestions
        issues = review_agent._review_artifact(artifact, None, None)
        
        # Convert to code actions
        actions = []
        for issue in issues:
            action = {
                "title": f"Fix: {issue.description[:50]}...",
                "kind": "quickfix",
                "edit": {
                    "changes": {
                        document_uri: [{
                            "range": range_selection,
                            "newText": issue.suggested_fix
                        }]
                    }
                }
            }
            actions.append(action)
            
        return actions
        
    def handle_completion(self, document_uri: str, 
                        position: Dict) -> List[Dict]:
        """Provide intelligent code completions"""
        
        current_code = self._get_document_content(document_uri)
        context = self._get_context(current_code, position)
        
        # Use code generator to suggest completions
        generator = CodeGeneratorAgent(self.orchestrator.code_llm)
        
        completion_prompt = f"""Given this code context, suggest the next logical code:
        
        CONTEXT:
        {context}
        
        Provide a completion that:
        1. Follows the existing code style
        2. Implements the apparent intent
        3. Includes proper error handling
        4. Is well-documented
        """
        
        suggestion = generator.llm.generate(completion_prompt)
        
        return [{
            "label": "AI Suggestion",
            "kind": "text",
            "insertText": suggestion,
            "documentation": "Generated by multi-agent system"
        }]
        
    def handle_hover(self, document_uri: str, position: Dict) -> Optional[Dict]:
        """Provide hover information for symbols"""
        
        current_code = self._get_document_content(document_uri)
        symbol = self._get_symbol_at_position(current_code, position)
        
        if not symbol:
            return None
            
        # Generate documentation for symbol
        doc_generator = DocumentationGenerator(self.orchestrator.arch_llm)
        
        doc_prompt = f"""Explain this code symbol in context:
        
        SYMBOL: {symbol}
        
        CONTEXT:
        {current_code}
        
        Provide a concise explanation of what this symbol does, its parameters,
        return value, and any important notes.
        """
        
        explanation = doc_generator.llm.generate(doc_prompt)
        
        return {
            "contents": {
                "kind": "markdown",
                "value": explanation
            }
        }
        
    def _get_document_content(self, uri: str) -> str:
        """Get current content of document"""
        # In real implementation, would maintain document state
        return self.workspace_state.get(uri, "")
        
    def _extract_range(self, content: str, range_dict: Dict) -> str:
        """Extract text from range in document"""
        lines = content.split('\n')
        start_line = range_dict['start']['line']
        end_line = range_dict['end']['line']
        return '\n'.join(lines[start_line:end_line + 1])
        
    def _detect_language(self, uri: str) -> str:
        """Detect programming language from file extension"""
        if uri.endswith('.py'):
            return 'python'
        elif uri.endswith('.js'):
            return 'javascript'
        elif uri.endswith('.java'):
            return 'java'
        return 'unknown'
        
    def _get_context(self, code: str, position: Dict) -> str:
        """Get relevant context around cursor position"""
        lines = code.split('\n')
        line_num = position['line']
        
        # Get surrounding lines for context
        start = max(0, line_num - 10)
        end = min(len(lines), line_num + 10)
        
        return '\n'.join(lines[start:end])
        
    def _get_symbol_at_position(self, code: str, position: Dict) -> Optional[str]:
        """Extract symbol at cursor position"""
        lines = code.split('\n')
        line = lines[position['line']]
        char = position['character']
        
        # Simple symbol extraction - real implementation would use AST
        # Find word boundaries around cursor
        start = char
        while start > 0 and (line[start-1].isalnum() or line[start-1] == '_'):
            start -= 1
            
        end = char
        while end < len(line) and (line[end].isalnum() or line[end] == '_'):
            end += 1
            
        if start < end:
            return line[start:end]
            
        return None

This LSP implementation provides IDE features like code actions, intelligent completions, and hover documentation powered by the multi-agent system. Developers can right-click on code to get AI-powered refactoring suggestions, receive context-aware completions, and see explanations of unfamiliar code.
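In a real deployment these handlers would be wired into an LSP framework (pygls is a common choice for Python) and driven by the editor's JSON-RPC messages. The sketch below instead exercises a handler directly with an in-memory document; the file contents and cursor position are made up for illustration, and constructing the orchestrator is omitted:

# lsp_usage_example.py

def demonstrate_hover(orchestrator: MasterOrchestrator) -> None:
    """Drive the hover handler outside any editor, using an in-memory document"""
    lsp = MultiAgentLSP(orchestrator)

    # Register a document the way an editor would on textDocument/didOpen
    uri = "file:///demo/task_handlers.py"
    lsp.workspace_state[uri] = (
        "def complete_task(task_id: int) -> bool:\n"
        '    """Mark a task as complete."""\n'
        "    return repository.mark_complete(task_id)\n"
    )

    # Ask for hover information at the 'complete_task' symbol (line 0, column 6)
    hover = lsp.handle_hover(uri, {"line": 0, "character": 6})
    if hover:
        print(hover["contents"]["value"])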

For developers who prefer command-line tools, a CLI interface provides similar functionality:

# cli_interface.py
import argparse
import sys

class MultiAgentCLI:
    """Command-line interface for multi-agent system"""
    
    def __init__(self, orchestrator: MasterOrchestrator):
        self.orchestrator = orchestrator
        
    def run(self):
        """Run CLI interface"""
        parser = argparse.ArgumentParser(
            description="Multi-Agent Software Development System"
        )
        
        subparsers = parser.add_subparsers(dest='command')
        
        # Generate command
        generate_parser = subparsers.add_parser('generate',
            help='Generate code from requirements')
        generate_parser.add_argument('requirements_file',
            help='File containing requirements')
        generate_parser.add_argument('--output-dir', default='./output',
            help='Output directory for generated code')
            
        # Review command
        review_parser = subparsers.add_parser('review',
            help='Review existing code')
        review_parser.add_argument('code_dir',
            help='Directory containing code to review')
            
        # Test command
        test_parser = subparsers.add_parser('test',
            help='Generate tests for code')
        test_parser.add_argument('code_dir',
            help='Directory containing code')
        test_parser.add_argument('requirements_file',
            help='Requirements file')
            
        args = parser.parse_args()
        
        if args.command == 'generate':
            self._handle_generate(args)
        elif args.command == 'review':
            self._handle_review(args)
        elif args.command == 'test':
            self._handle_test(args)
        else:
            parser.print_help()
            
    def _handle_generate(self, args):
        """Handle generate command"""
        print("Reading requirements...")
        with open(args.requirements_file, 'r') as f:
            requirements = f.read()
            
        print("Generating system...")
        artifacts = self.orchestrator.develop_system(requirements)
        
        print(f"Writing output to {args.output_dir}...")
        self._write_artifacts(artifacts, args.output_dir)
        
        print("Generation complete!")
        
    def _handle_review(self, args):
        """Handle review command"""
        print(f"Reviewing code in {args.code_dir}...")
        
        # Load code files
        code_artifacts = self._load_code_artifacts(args.code_dir)
        
        # Review each file
        reviewer = CodeReviewAgent(self.orchestrator.code_llm)
        
        all_issues = []
        for artifact in code_artifacts:
            issues = reviewer._review_artifact(artifact, None, None)
            all_issues.extend(issues)
            
        # Print results
        print(f"\nFound {len(all_issues)} issues:")
        for issue in all_issues:
            print(f"\n[{issue.severity.upper()}] {issue.file_path}")
            print(f"  {issue.description}")
            print(f"  Fix: {issue.suggested_fix}")
            
    def _handle_test(self, args):
        """Handle test generation command"""
        print("Generating tests...")
        
        # Load code and requirements
        code_artifacts = self._load_code_artifacts(args.code_dir)
        with open(args.requirements_file, 'r') as f:
            requirements = f.read()
            
        # Generate tests
        test_creator = TestCreatorAgent(self.orchestrator.test_llm)
        
        # Create dummy component for test generation
        component = Component(
            name="main",
            responsibility="Main component",
            dependencies=[],
            technology="python"
        )
        
        test_artifacts = test_creator.generate_tests(
            component, code_artifacts, requirements
        )
        
        # Write test files
        print(f"Writing {len(test_artifacts)} test files...")
        for artifact in test_artifacts:
            with open(artifact.file_path, 'w') as f:
                f.write(artifact.code)
                
        print("Test generation complete!")
        
    def _write_artifacts(self, artifacts: DevelopmentArtifacts, output_dir: str):
        """Write all artifacts to output directory"""
        import os
        os.makedirs(output_dir, exist_ok=True)
        
        # Write code files
        for component_name, code_list in artifacts.code_artifacts.items():
            component_dir = os.path.join(output_dir, component_name)
            os.makedirs(component_dir, exist_ok=True)
            
            for artifact in code_list:
                file_path = os.path.join(component_dir, 
                                        os.path.basename(artifact.file_path))
                with open(file_path, 'w') as f:
                    f.write(artifact.code)
                    
        # Write test files
        test_dir = os.path.join(output_dir, 'tests')
        os.makedirs(test_dir, exist_ok=True)
        
        for component_name, test_list in artifacts.test_artifacts.items():
            for artifact in test_list:
                file_path = os.path.join(test_dir,
                                        os.path.basename(artifact.file_path))
                with open(file_path, 'w') as f:
                    f.write(artifact.code)
                    
        # Write documentation
        for doc_type, content in artifacts.documentation.items():
            file_path = os.path.join(output_dir, f"{doc_type}.md")
            with open(file_path, 'w') as f:
                f.write(content)
                
    def _load_code_artifacts(self, code_dir: str) -> List[CodeArtifact]:
        """Load code files from directory"""
        import os
        artifacts = []
        
        for root, dirs, files in os.walk(code_dir):
            for file in files:
                if file.endswith(('.py', '.js', '.java', '.cpp')):
                    file_path = os.path.join(root, file)
                    with open(file_path, 'r') as f:
                        code = f.read()
                        
                    artifact = CodeArtifact(
                        component_name=os.path.basename(root),
                        file_path=file_path,
                        code=code,
                        language=self._detect_language(file),
                        dependencies=[]
                    )
                    artifacts.append(artifact)
                    
        return artifacts
        
    def _detect_language(self, filename: str) -> str:
        """Detect language from filename"""
        if filename.endswith('.py'):
            return 'python'
        elif filename.endswith('.js'):
            return 'javascript'
        elif filename.endswith('.java'):
            return 'java'
        elif filename.endswith('.cpp'):
            return 'cpp'
        return 'unknown'

The CLI provides commands for generating complete systems from requirements, reviewing existing code, and generating tests. This allows developers to integrate the multi-agent system into their build pipelines and automation workflows.
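A small entry-point script is all that is needed to wire the CLI to an orchestrator. The configuration below mirrors the example used later in this tutorial; the model names, paths, and API key are placeholders:

# cli_main.py

def main():
    llm_configs = {
        "prompt": LLMConfig(model_name="gpt-4", provider_type="remote",
                            api_key="your-api-key"),
        "architecture": LLMConfig(model_name="gpt-4", provider_type="remote",
                                  api_key="your-api-key"),
        "code": LLMConfig(model_name="codellama-13b", provider_type="local_cuda",
                          model_path="/models/codellama-13b"),
        "test": LLMConfig(model_name="codellama-13b", provider_type="local_cuda",
                          model_path="/models/codellama-13b"),
    }

    orchestrator = MasterOrchestrator(llm_configs)
    MultiAgentCLI(orchestrator).run()

if __name__ == "__main__":
    main()

With this in place, running python cli_main.py generate requirements.md --output-dir ./generated executes the full workflow, and python cli_main.py review ./src prints review findings for an existing codebase.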

Both the IDE integration and CLI interface make the multi-agent system accessible to developers in their preferred working environment. They can invoke powerful AI assistance without leaving their familiar tools.

Learning and Improvement: Evolution Through Feedback

The multi-agent system can improve over time by incorporating developer feedback, from lightweight prompt adjustments to fine-tuning local models. By tracking which suggestions developers accept or reject, the system learns to make better recommendations.

A feedback collection system captures developer interactions:

# feedback_system.py
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime
import json

@dataclass
class Feedback:
    """Represents developer feedback on agent output"""
    timestamp: datetime
    agent_type: str  # "architecture", "code", "test", etc.
    artifact_id: str
    accepted: bool
    modifications: Optional[str]  # What developer changed
    rating: Optional[int]  # 1-5 rating
    comments: Optional[str]
    
class FeedbackCollector:
    """Collects and stores developer feedback"""
    
    def __init__(self, storage_path: str):
        self.storage_path = storage_path
        self.feedback_buffer = []
        
    def record_feedback(self, agent_type: str, artifact_id: str,
                      accepted: bool, modifications: Optional[str] = None,
                      rating: Optional[int] = None,
                      comments: Optional[str] = None):
        """Record feedback on agent output"""
        
        feedback = Feedback(
            timestamp=datetime.now(),
            agent_type=agent_type,
            artifact_id=artifact_id,
            accepted=accepted,
            modifications=modifications,
            rating=rating,
            comments=comments
        )
        
        self.feedback_buffer.append(feedback)
        
        # Persist periodically
        if len(self.feedback_buffer) >= 10:
            self._persist_feedback()
            
    def _persist_feedback(self):
        """Save feedback to storage"""
        with open(self.storage_path, 'a') as f:
            for feedback in self.feedback_buffer:
                f.write(json.dumps({
                    'timestamp': feedback.timestamp.isoformat(),
                    'agent_type': feedback.agent_type,
                    'artifact_id': feedback.artifact_id,
                    'accepted': feedback.accepted,
                    'modifications': feedback.modifications,
                    'rating': feedback.rating,
                    'comments': feedback.comments
                }) + '\n')
                
        self.feedback_buffer.clear()
        
    def get_feedback_for_agent(self, agent_type: str) -> List[Feedback]:
        """Retrieve all feedback for a specific agent type"""
        feedback_list = []

        try:
            with open(self.storage_path, 'r') as f:
                for line in f:
                    data = json.loads(line)
                    if data['agent_type'] == agent_type:
                        feedback = Feedback(
                            timestamp=datetime.fromisoformat(data['timestamp']),
                            agent_type=data['agent_type'],
                            artifact_id=data['artifact_id'],
                            accepted=data['accepted'],
                            modifications=data.get('modifications'),
                            rating=data.get('rating'),
                            comments=data.get('comments')
                        )
                        feedback_list.append(feedback)
        except FileNotFoundError:
            # No feedback has been persisted yet
            pass

        return feedback_list

The feedback collector tracks whether developers accept or modify agent suggestions. This data becomes training material for improving the agents.
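Recording feedback is a single call at the point where the developer accepts, edits, or rejects an artifact. The file name and the specific values below are illustrative:

# feedback_usage_example.py

collector = FeedbackCollector(storage_path="feedback_log.jsonl")

# A developer accepted a generated file but tightened its error handling
collector.record_feedback(
    agent_type="code",
    artifact_id="task_service/handlers.py",
    accepted=True,
    modifications="Replaced bare except with specific exception types",
    rating=4,
    comments="Good structure, but error messages leaked internal details"
)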

A learning system uses this feedback to fine-tune agent behavior:

# learning_system.py
from typing import List, Dict

class AgentLearningSystem:
    """System for improving agents through feedback"""
    
    def __init__(self, feedback_collector: FeedbackCollector):
        self.feedback_collector = feedback_collector
        
    def analyze_agent_performance(self, agent_type: str) -> Dict:
        """Analyze how well an agent is performing"""
        
        feedback = self.feedback_collector.get_feedback_for_agent(agent_type)
        
        if not feedback:
            return {"status": "insufficient_data"}
            
        total = len(feedback)
        accepted = sum(1 for f in feedback if f.accepted)
        acceptance_rate = accepted / total
        
        # Analyze ratings
        ratings = [f.rating for f in feedback if f.rating is not None]
        avg_rating = sum(ratings) / len(ratings) if ratings else None
        
        # Identify common issues from comments
        comments = [f.comments for f in feedback if f.comments]
        
        return {
            "agent_type": agent_type,
            "total_outputs": total,
            "acceptance_rate": acceptance_rate,
            "average_rating": avg_rating,
            "common_issues": self._extract_common_issues(comments)
        }
        
    def generate_training_examples(self, agent_type: str) -> List[Dict]:
        """Generate training examples from feedback"""
        
        feedback = self.feedback_collector.get_feedback_for_agent(agent_type)
        
        training_examples = []
        
        for f in feedback:
            if f.modifications:
                # Developer modified output - learn from the correction
                example = {
                    "input": f.artifact_id,  # Original prompt/context
                    "incorrect_output": "original_output",  # Would need to store
                    "correct_output": f.modifications,
                    "feedback": f.comments
                }
                training_examples.append(example)
                
        return training_examples
        
    def _extract_common_issues(self, comments: List[str]) -> List[str]:
        """Extract common themes from feedback comments"""
        
        # In practice, would use NLP to cluster similar comments
        # For now, simple keyword extraction
        
        issue_keywords = {}
        for comment in comments:
            words = comment.lower().split()
            for word in words:
                if len(word) > 4:  # Filter short words
                    issue_keywords[word] = issue_keywords.get(word, 0) + 1
                    
        # Return most common issues
        sorted_issues = sorted(issue_keywords.items(), 
                             key=lambda x: x[1], 
                             reverse=True)
        
        return [issue for issue, count in sorted_issues[:5]]

The learning system analyzes feedback to identify patterns in what developers accept or reject. This analysis guides improvements to agent prompts and behavior.
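One lightweight way to act on that analysis, short of any retraining, is to fold recurring themes back into an agent's system message. The acceptance threshold and the wording below are illustrative assumptions:

# prompt_adaptation.py

class PromptAdapter:
    """Feeds recurring feedback themes back into an agent's system message"""

    def __init__(self, learning_system: AgentLearningSystem):
        self.learning_system = learning_system

    def adapt_system_message(self, agent, agent_type: str) -> None:
        """Append guidance derived from feedback to the agent's system message"""
        report = self.learning_system.analyze_agent_performance(agent_type)

        if report.get("status") == "insufficient_data":
            return

        # Only intervene when acceptance is low and recurring themes exist
        if report["acceptance_rate"] < 0.7 and report["common_issues"]:
            guidance = (
                "\nRecent reviewer feedback frequently mentioned: "
                + ", ".join(report["common_issues"])
                + ". Pay particular attention to these areas."
            )
            agent.system_message += guidance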

For agents using local models, we can implement actual fine-tuning:

# model_finetuning.py
from typing import Dict, List

import torch
from transformers import Trainer, TrainingArguments

class AgentFineTuner:
    """Fine-tunes local models based on feedback"""
    
    def __init__(self, base_model_path: str):
        self.base_model_path = base_model_path
        
    def fine_tune_from_feedback(self, training_examples: List[Dict],
                               output_path: str):
        """Fine-tune model using feedback examples"""
        
        # Load base model
        from transformers import AutoModelForCausalLM, AutoTokenizer
        
        model = AutoModelForCausalLM.from_pretrained(self.base_model_path)
        tokenizer = AutoTokenizer.from_pretrained(self.base_model_path)
        
        # Many causal LM tokenizers have no pad token; reuse EOS for padding
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
        # Prepare training data
        train_dataset = self._prepare_dataset(training_examples, tokenizer)
        
        # Configure training
        training_args = TrainingArguments(
            output_dir=output_path,
            num_train_epochs=3,
            per_device_train_batch_size=4,
            save_steps=100,
            save_total_limit=2,
            learning_rate=2e-5,
            warmup_steps=100,
            logging_steps=10
        )
        
        # Train
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset
        )
        
        trainer.train()
        
        # Save fine-tuned model
        model.save_pretrained(output_path)
        tokenizer.save_pretrained(output_path)
        
        print(f"Fine-tuned model saved to {output_path}")
        
    def _prepare_dataset(self, examples: List[Dict], tokenizer):
        """Prepare dataset for training"""
        
        # Format examples as input-output pairs
        formatted_examples = []
        
        for ex in examples:
            # Create training example showing correct output
            text = f"Input: {ex['input']}\nOutput: {ex['correct_output']}"
            formatted_examples.append(text)
            
        # Tokenize
        encodings = tokenizer(formatted_examples, truncation=True,
                            padding=True, max_length=512)
                            
        # Create dataset (torch is already imported at module level)
        class SimpleDataset(torch.utils.data.Dataset):
            def __init__(self, encodings):
                self.encodings = encodings
                
            def __getitem__(self, idx):
                item = {key: torch.tensor(val[idx])
                        for key, val in self.encodings.items()}
                # Causal LM training needs labels to compute a loss;
                # reuse the input ids as the target sequence
                item['labels'] = item['input_ids'].clone()
                return item
                       
            def __len__(self):
                return len(self.encodings['input_ids'])
                
        return SimpleDataset(encodings)

The fine-tuning system allows local models to improve based on actual usage patterns. As developers accept or reject suggestions, the models learn to produce output more aligned with developer preferences.

This learning capability transforms the multi-agent system from a static tool into an evolving assistant that gets better with use. Over time, the agents learn the coding style, architectural preferences, and quality standards of the team using them.

Practical Example: Building a Task Management API

Let us walk through a complete example of using the multi-agent system to build a task management API. This demonstrates how all the components work together in practice.

A developer starts with a simple requirement:

# example_usage.py

def demonstrate_complete_workflow():
    """Complete example of building a system with multi-agent approach"""
    
    # Configure LLMs for different agents
    # Using mix of remote and local models
    llm_configs = {
        "prompt": LLMConfig(
            model_name="gpt-4",
            provider_type="remote",
            api_key="your-api-key"
        ),
        "architecture": LLMConfig(
            model_name="gpt-4",
            provider_type="remote",
            api_key="your-api-key"
        ),
        "code": LLMConfig(
            model_name="codellama-13b",
            provider_type="local_cuda",
            model_path="/models/codellama-13b"
        ),
        "test": LLMConfig(
            model_name="codellama-13b",
            provider_type="local_cuda",
            model_path="/models/codellama-13b"
        )
    }
    
    # Create orchestrator
    orchestrator = MasterOrchestrator(llm_configs)
    
    # Initial requirements (intentionally vague)
    initial_requirements = """
    Build a REST API for managing tasks. Users should be able to:
    - Create tasks with title and description
    - Mark tasks as complete
    - Delete tasks
    - List all tasks
    
    The API should be fast and reliable.
    """
    
    print("Starting development with initial requirements:")
    print(initial_requirements)
    print("\n" + "="*70 + "\n")
    
    # Run complete development workflow
    artifacts = orchestrator.develop_system(initial_requirements)
    
    # Display results
    print("\n" + "="*70)
    print("DEVELOPMENT COMPLETE")
    print("="*70 + "\n")
    
    print("Enhanced Requirements:")
    print(artifacts.enhanced_prompt)
    print("\n" + "-"*70 + "\n")
    
    print("Architecture Overview:")
    print(artifacts.architecture.overview)
    print("\n" + "-"*70 + "\n")
    
    print("Components:")
    for component in artifacts.architecture.components:
        print(f"\n  {component.name}")
        print(f"    Responsibility: {component.responsibility}")
        print(f"    Technology: {component.technology}")
        
    print("\n" + "-"*70 + "\n")
    
    print("Architecture Decision Records:")
    for i, adr in enumerate(artifacts.architecture.adrs, 1):
        print(f"\n  ADR {i}: {adr.title}")
        print(f"    Decision: {adr.decision}")
        
    print("\n" + "-"*70 + "\n")
    
    print("Generated Code Files:")
    for component_name, code_list in artifacts.code_artifacts.items():
        print(f"\n  Component: {component_name}")
        for artifact in code_list:
            print(f"    - {artifact.file_path}")
            
    print("\n" + "-"*70 + "\n")
    
    print("Generated Test Files:")
    for component_name, test_list in artifacts.test_artifacts.items():
        print(f"\n  Component: {component_name}")
        for artifact in test_list:
            print(f"    - {artifact.file_path} ({artifact.test_type})")
            
    print("\n" + "-"*70 + "\n")
    
    print("Documentation Generated:")
    for doc_type in artifacts.documentation.keys():
        print(f"  - {doc_type}")
        
    # Save artifacts to disk
    import os
    output_dir = "./task_api_output"
    os.makedirs(output_dir, exist_ok=True)
    
    # Save code
    for component_name, code_list in artifacts.code_artifacts.items():
        comp_dir = os.path.join(output_dir, component_name)
        os.makedirs(comp_dir, exist_ok=True)
        
        for artifact in code_list:
            file_path = os.path.join(comp_dir, 
                                    os.path.basename(artifact.file_path))
            with open(file_path, 'w') as f:
                f.write(artifact.code)
                
    # Save tests
    test_dir = os.path.join(output_dir, "tests")
    os.makedirs(test_dir, exist_ok=True)
    
    for component_name, test_list in artifacts.test_artifacts.items():
        for artifact in test_list:
            file_path = os.path.join(test_dir,
                                    os.path.basename(artifact.file_path))
            with open(file_path, 'w') as f:
                f.write(artifact.code)
                
    # Save documentation
    for doc_type, content in artifacts.documentation.items():
        file_path = os.path.join(output_dir, f"{doc_type}.md")
        with open(file_path, 'w') as f:
            f.write(content)
            
    print(f"\nAll artifacts saved to: {output_dir}")
    
    return artifacts

When this example runs, the system goes through several phases. First, the prompt critique agent identifies that the initial requirements are vague. It notes missing information about authentication, data persistence, expected load, error handling, and API versioning. The enhanced prompt addresses these gaps by specifying that the API should support JWT authentication, use PostgreSQL for data storage, handle up to one thousand concurrent users, include comprehensive error handling with appropriate HTTP status codes, and follow REST API versioning best practices.

The architecture creator then designs a system with clear component separation. It proposes an API Gateway component for request routing and authentication, a Task Service component for business logic, a Data Access Layer for database operations, and a separate Authentication Service. The architecture uses Python with FastAPI for the web framework, PostgreSQL for the database, and Redis for caching. Each component has clearly defined responsibilities and interfaces.

The architecture critique agent reviews this design and validates that it satisfies the requirements. It confirms that the architecture supports the expected load through horizontal scaling, that authentication is properly separated, and that the data access layer prevents SQL injection. The critique agent approves the architecture after verifying all concerns are addressed.

The code generator then implements each component. For the Task Service, it generates clean Python code with proper type hints, comprehensive docstrings, and error handling. The implementation follows the architecture specification exactly, using the defined interfaces and respecting component boundaries.

The code review agent examines this implementation and identifies a few issues. It notes that one endpoint is missing input validation, that error messages could expose internal details, and that a database query could be optimized. The code generator addresses these issues in the next iteration.

Finally, the test creator generates comprehensive test suites. It creates unit tests for individual functions, integration tests for API endpoints, and end-to-end tests for complete workflows. The test critique agent reviews these tests and confirms they validate the requirements rather than just achieving code coverage.

Throughout this process, the system generates documentation including a README with setup instructions, API documentation with example requests and responses, and Architecture Decision Records explaining why FastAPI was chosen over Flask, why PostgreSQL was selected for data storage, and why JWT was used for authentication.

The entire workflow produces a complete, tested, documented system from a vague initial prompt. The adversarial relationship between creator and critique agents ensures quality at every step.

Conclusion: The Future of AI-Assisted Development

The multi-agent approach to software development represents a significant evolution beyond single-LLM code generation. By organizing specialized agents into creator-critique pairs, we build systems that produce higher quality code through adversarial refinement. The architecture mirrors how human development teams work, with different specialists focusing on their areas of expertise while reviewing each other's work.

This approach addresses the fundamental weaknesses of current AI coding assistants. Specialized agents become experts in their domains rather than generalists trying to do everything. The critique agents catch bugs, architectural flaws, and forgotten requirements that creator agents might miss. The iterative refinement process produces better results than single-pass generation.

The system maintains transparency through comprehensive documentation and artifact generation. Developers can understand what the AI built and why through Architecture Decision Records, component documentation, and detailed test suites. This transparency is essential for building trust in AI-generated code and enabling human developers to maintain and extend the systems.

Integration with existing developer tools through IDE plugins and CLI interfaces makes the multi-agent system practical for real-world use. Developers can invoke powerful AI assistance without changing their workflows or learning new tools.

The learning capability allows the system to improve over time based on developer feedback. As teams use the system, it learns their coding standards, architectural preferences, and quality expectations. This continuous improvement transforms the multi-agent system from a static tool into an evolving assistant.

The future of software development likely involves increasingly sophisticated collaboration between human developers and AI agents. The multi-agent architecture presented here provides a foundation for this collaboration, ensuring that AI assistance enhances rather than replaces human expertise. Developers remain in control, making high-level decisions and reviewing AI suggestions, while the agent system handles the tedious work of implementation, testing, and documentation.

As LLMs continue to improve and as we develop better techniques for agent coordination and learning, multi-agent development systems will become increasingly capable. They will handle larger projects, make better architectural decisions, and produce higher quality code. But the fundamental principle will remain: multiple specialized agents working adversarially produce better results than any single agent working alone.

The code examples and architecture presented in this tutorial provide a starting point for building such systems. Developers can extend and customize the agents for their specific needs, integrate with their preferred tools and workflows, and adapt the system to their team's coding standards. The multi-agent approach is not a rigid framework but a flexible pattern that can be applied in many different ways.

By combining the creativity of generative AI with the critical thinking of adversarial review, we create software development systems that are greater than the sum of their parts. This is the promise of multi-agent AI for software engineering.
