The source code is available here: Reportr GitHub Repository
ABSTRACT
This booklet presents Reportr, a sophisticated multi-agent artificial intelligence system designed to automate the process of conducting comprehensive research across multiple information sources. The system employs a coordinated ensemble of five specialized agents, each responsible for distinct aspects of the research workflow. This booklet provides an in-depth analysis of the system's architecture, explores the critical architectural decisions that shaped its design, and offers practical guidance for users seeking to leverage this powerful research automation tool. The system represents a significant advancement in the field of autonomous research assistants, combining modern large language model capabilities with traditional information retrieval techniques to deliver high-quality, well-structured research reports.
TABLE OF CONTENTS
- Introduction
- System Overview
- Architectural Design
  - 3.1 High-Level Architecture
  - 3.2 Agent-Based Architecture Pattern
  - 3.3 Component Interaction Model
- The Five Agents: Detailed Analysis
  - 4.1 Orchestrator Agent
  - 4.2 Search Agent
  - 4.3 Document Agent
  - 4.4 Vector Store Agent
  - 4.5 Report Agent
- Critical Architectural Decisions
  - 5.1 Decision 1: Agent-Based Architecture
  - 5.2 Decision 2: Separation of Concerns
  - 5.3 Decision 3: Hybrid Search Strategy
  - 5.4 Decision 4: Vector Database Integration
  - 5.5 Decision 5: Asynchronous Processing Model
  - 5.6 Decision 6: Configuration-Driven Design
  - 5.7 Decision 7: LLM Provider Abstraction
- Data Flow and Processing Pipeline
- Technology Stack and Dependencies
- Scalability and Performance Considerations
- Error Handling and Resilience
- Future Enhancements
- Conclusion
- Addendum: Complete User Manual
- Addendum: Local LLM Support with llama.cpp and Ollama (Extending Reportr for Offline and Cost-Free Operation)
1. INTRODUCTION
In the contemporary landscape of information technology, researchers, analysts, and knowledge workers face an unprecedented challenge. The volume of available information has grown exponentially, with academic papers, technical articles, blog posts, and web content being published at rates that far exceed human capacity to process and synthesize. Traditional manual research methods, while thorough, are increasingly inadequate for keeping pace with the rapid evolution of knowledge in fields such as artificial intelligence, machine learning, quantum computing, and biotechnology.
Reportr emerges as a response to this challenge. It is an autonomous, multi-agent artificial intelligence system specifically engineered to automate the entire research workflow, from initial query formulation through information gathering, content analysis, and final report generation. The system is not merely a search aggregator or a simple summarization tool. Rather, it represents a sophisticated orchestration of multiple specialized agents, each employing advanced techniques in their respective domains, working in concert to produce comprehensive, well-structured research reports that rival those produced by human researchers.
The name "Reportr" reflects the system's core mission: to generate reports. The deliberate omission of the final 'e' in "Reporter" serves as a nod to modern software naming conventions while also suggesting the system's focus on the artifact (the report) rather than the actor (the reporter). This subtle linguistic choice underscores a fundamental aspect of the system's design philosophy: the emphasis on output quality and utility rather than on mimicking human behavior.
The development of Reportr was motivated by several key observations. First, researchers spend a disproportionate amount of time on mechanical tasks such as searching multiple databases, downloading papers, extracting relevant passages, and formatting citations. Second, the fragmentation of information across multiple sources (academic databases, preprint servers, web content, technical blogs) creates significant friction in the research process. Third, the advent of large language models has created new possibilities for automated synthesis and summarization that were previously impossible. Fourth, vector databases and semantic search technologies have matured to the point where they can provide genuinely useful similarity-based retrieval that complements traditional keyword search.
This booklet provides a comprehensive examination of Reportr's architecture, design decisions, and practical usage. It is intended for multiple audiences: software architects seeking to understand the design patterns employed in modern AI systems, developers interested in implementing similar multi-agent architectures, researchers who wish to use the system effectively, and students of artificial intelligence who want to understand how various AI technologies can be composed into coherent, useful applications.
2. SYSTEM OVERVIEW
Reportr is a Python-based application that accepts a research topic or query as input and produces a comprehensive research report as output. The system operates by coordinating five specialized agents, each responsible for a specific aspect of the research process. The entire workflow, from query submission to report delivery, is fully automated, requiring no human intervention beyond the initial query specification.
The system's operation can be understood through a simple example. A user submits the query "recent advances in transformer architectures for natural language processing." Reportr then initiates a multi-stage process. First, it searches multiple information sources including arXiv (the preprint server for scientific papers), Google Scholar (the academic search engine), general web search engines, and Medium (a popular technical blogging platform). This multi-source approach ensures comprehensive coverage across academic, technical, and practitioner-oriented content.
Second, the system retrieves the full text of the most relevant documents. For academic papers, this typically involves downloading PDF files and extracting their textual content. For web pages and blog posts, it involves fetching the HTML content and extracting the main textual content while filtering out navigation elements, advertisements, and other non-essential material. This full-text retrieval is critical because abstracts and snippets, while useful for initial relevance assessment, often lack the detail necessary for comprehensive analysis.
Third, the system processes the retrieved documents using advanced natural language processing techniques. Specifically, it generates vector embeddings (high-dimensional numerical representations that capture semantic meaning) for the documents and stores them in a vector database. This enables semantic search capabilities that go beyond simple keyword matching, allowing the system to find conceptually related content even when different terminology is used.
Fourth, the system employs a large language model to synthesize the gathered information into a coherent narrative. This is not simple concatenation or excerpt extraction. Rather, the language model reads the source materials, identifies key themes and findings, recognizes relationships between different sources, and generates original prose that accurately represents the current state of knowledge on the topic.
Fifth, the system formats the results into a well-structured report. The report includes an executive summary, identification of key trends, detailed findings with proper citations and links to source materials, and metadata about the research process itself (such as the number of sources consulted and their distribution across different types of publications).
Finally, the system saves the report in multiple formats. A Markdown file provides a human-readable, easily shareable format that can be viewed in any text editor or converted to HTML, PDF, or other formats. A JSON file provides a structured, machine-readable representation that can be processed by other software tools, integrated into databases, or used for further analysis.
The entire process, from query submission to report delivery, typically completes in two to five minutes, depending on the complexity of the query, the number of sources searched, and the availability of full-text content. This represents a dramatic acceleration compared to manual research, which might require hours or days to achieve comparable coverage and synthesis.
Reportr is designed to be both powerful and accessible. For users who simply want to generate a research report, the system can be invoked with a single command-line instruction. For users who require more control, the system provides extensive configuration options that allow fine-tuning of search parameters, source selection, language model behavior, and report formatting. For developers who wish to extend or integrate the system, the modular architecture and clear separation of concerns make it straightforward to add new data sources, implement alternative processing strategies, or integrate with existing workflows.
3. ARCHITECTURAL DESIGN
The architecture of Reportr is the result of careful consideration of multiple competing concerns: modularity, maintainability, extensibility, performance, reliability, and ease of use. This section provides a detailed examination of the architectural design, beginning with a high-level overview and then drilling down into specific architectural patterns and design decisions.
3.1 HIGH-LEVEL ARCHITECTURE
At the highest level, Reportr employs a multi-agent architecture in which specialized agents collaborate to accomplish a complex task that would be difficult or impossible for a single monolithic component to handle effectively. This architectural style has several important advantages that will be explored in detail in subsequent sections.
The system can be visualized as follows:
+-----------------------------------+
| |
| USER INTERFACE |
| (Command Line / API) |
| |
+----------------+------------------+
|
| Query
|
v
+-----------------------------------+
| |
| ORCHESTRATOR AGENT |
| (Workflow Coordination) |
| |
+-----------------------------------+
| | | |
+------------+ | | +------------+
| | | |
v v v v
+---------------+ +---------------+ +---------------+ +--------------+
| | | | | | | |
| SEARCH | | DOCUMENT | | VECTOR | | REPORT |
| AGENT | | AGENT | | STORE | | AGENT |
| | | | | AGENT | | |
+-------+-------+ +-------+-------+ +-------+-------+ +------+-------+
| | | |
v v v v
+---------------+ +---------------+ +---------------+ +--------------+
| Multiple | | PDF & Web | | ChromaDB | | LLM |
| Search | | Content | | Vector | | (OpenAI/ |
| Sources | | Extraction | | Database | | Anthropic) |
+---------------+ +---------------+ +---------------+ +--------------+
| | | |
+--------------------+--------------------+-------------------+
|
v
+-----------------------------------+
| |
| RESEARCH REPORT |
| (Markdown + JSON Output) |
| |
+-----------------------------------+
This diagram illustrates the hierarchical nature of the architecture. The user interface sits at the top, accepting queries from users. The Orchestrator Agent serves as the central coordinator, delegating specific tasks to the four specialized agents. Each specialized agent, in turn, interacts with external services, databases, or APIs to accomplish its designated function. Finally, the results flow back up through the hierarchy, ultimately producing the final research report.
3.2 AGENT-BASED ARCHITECTURE PATTERN
The decision to employ an agent-based architecture is one of the most significant architectural choices in Reportr. An agent, in this context, is a software component that exhibits several key characteristics. First, it has a specific, well-defined responsibility or set of responsibilities. Second, it encapsulates both the data and the behavior necessary to fulfill those responsibilities. Third, it operates with a degree of autonomy, making decisions about how to accomplish its tasks without requiring detailed instruction from other components. Fourth, it communicates with other agents through well-defined interfaces, typically by exchanging messages or invoking methods.
The agent-based architecture offers several compelling advantages for a system like Reportr. The first advantage is modularity. Each agent is a self-contained unit that can be developed, tested, and maintained independently. This modularity significantly reduces the cognitive load on developers, as they can focus on one agent at a time without needing to understand the intricate details of the entire system. It also facilitates parallel development, as different team members can work on different agents simultaneously without creating conflicts or dependencies.
The second advantage is flexibility and extensibility. Because agents communicate through well-defined interfaces rather than being tightly coupled, it is relatively straightforward to replace or enhance individual agents without affecting the rest of the system. For example, if a new search API becomes available, the Search Agent can be modified to incorporate it without requiring changes to the Document Agent, Vector Store Agent, or Report Agent. Similarly, if a more advanced language model becomes available, the Report Agent can be updated to use it without affecting the other components.
The third advantage is fault isolation. In a monolithic architecture, a failure in one component can cascade through the system, potentially causing complete system failure. In an agent-based architecture, agents can be designed to fail gracefully. If the Search Agent fails to retrieve results from one source, it can still return results from other sources. If the Document Agent fails to download a particular PDF, it can skip that document and continue processing others. This resilience is critical for a system that interacts with multiple external services, any of which might be temporarily unavailable or rate-limited.
The fourth advantage is conceptual clarity. The agent-based architecture maps naturally onto the conceptual model of the research process. Researchers intuitively understand that research involves searching for information, retrieving documents, organizing knowledge, and synthesizing findings. By creating agents that correspond to these conceptual activities, the architecture becomes more intuitive and easier to reason about.
The fifth advantage is testability. Each agent can be tested in isolation, with mock implementations of the other agents or external services. This unit testing approach is far more tractable than attempting to test a monolithic system where all components are intertwined. Integration testing is also simplified because the well-defined interfaces between agents provide clear points at which to verify correct interaction.
3.3 COMPONENT INTERACTION MODEL
The agents in Reportr interact according to a carefully designed protocol. The primary interaction pattern is orchestrated delegation, in which the Orchestrator Agent serves as the central coordinator and delegates specific tasks to the specialized agents. This pattern can be contrasted with peer-to-peer interaction, in which agents communicate directly with each other, or with a blackboard architecture, in which agents communicate by reading and writing to a shared data structure.
The orchestrated delegation pattern was chosen for several reasons. First, it provides a clear, linear workflow that is easy to understand and debug. The sequence of operations is explicit: search, then retrieve documents, then build vector index, then generate report. This linearity simplifies reasoning about the system's behavior and makes it easier to identify where problems occur.
Second, it centralizes control and coordination logic in a single location (the Orchestrator Agent), rather than distributing it across multiple agents. This centralization reduces the risk of coordination failures, race conditions, or deadlocks that can occur in more complex interaction patterns.
Third, it provides a natural extension point for adding workflow variations. For example, if a user wants to conduct research on multiple related topics and then generate a comparative analysis, the Orchestrator Agent can be extended to support this workflow without requiring changes to the specialized agents.
The interaction between the Orchestrator Agent and the specialized agents follows a request-response pattern. The Orchestrator sends a request to an agent, typically in the form of a method call with parameters. The agent processes the request, potentially interacting with external services or databases, and returns a response. The response is typically a data structure (such as a list of search results or a generated report) that the Orchestrator can then pass to the next agent in the workflow.
The data structures exchanged between agents are carefully designed to be both rich enough to carry all necessary information and simple enough to be easily understood and manipulated. The primary data structure is the SearchResult, which represents a single document found during the search process. A SearchResult contains the document's title, authors, abstract, URL, source (indicating which search service found it), publication date, and optionally the full text content. It also contains metadata such as relevance scores and processing flags.
Another important data structure is the ResearchReport, which represents the final output of the system. A ResearchReport contains the research topic, the original query, a timestamp indicating when the report was generated, the list of SearchResults that were analyzed, a synthesized summary, a list of key findings, identified trends, and metadata about the research process.
These data structures serve as contracts between agents. As long as an agent produces data structures that conform to the expected schema, the other agents can process them correctly. This contract-based approach provides a degree of decoupling that facilitates independent development and testing of agents.
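The two contracts described above can be sketched as Python dataclasses. This is an illustrative reconstruction from the field lists in the text, not the project's actual class definitions; field names and defaults are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class SearchResult:
    """One document found during search; fields follow the booklet's description."""
    title: str
    authors: list[str]
    abstract: str
    url: str
    source: str                      # which search service found it, e.g. "arxiv"
    published: Optional[str] = None
    full_text: Optional[str] = None  # filled in later by the Document Agent
    relevance: float = 0.0           # normalized relevance score

@dataclass
class ResearchReport:
    """Final output: topic, query, timestamp, analyzed results, and synthesis."""
    topic: str
    query: str
    generated_at: datetime
    results: list[SearchResult] = field(default_factory=list)
    summary: str = ""
    key_findings: list[str] = field(default_factory=list)
    trends: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
```

Because agents exchange these structures by value, any agent can be swapped out as long as it honors the schema.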
4. THE FIVE AGENTS: DETAILED ANALYSIS
This section provides an in-depth examination of each of the five agents that comprise Reportr. For each agent, we explore its responsibilities, its internal architecture, the key algorithms and techniques it employs, and its interactions with other components of the system.
4.1 ORCHESTRATOR AGENT
The Orchestrator Agent is the master coordinator of the Reportr system. It is responsible for managing the overall research workflow, coordinating the activities of the specialized agents, handling user requests, managing the file system for report storage, and providing scheduling capabilities for recurring research tasks.
When a user submits a research query, the Orchestrator Agent receives it and initiates the research workflow. The workflow proceeds through several distinct phases, each of which involves delegating a specific task to one of the specialized agents.
In the first phase, the Orchestrator Agent invokes the Search Agent with the user's query. The Search Agent returns a list of SearchResult objects representing documents that are potentially relevant to the query. The Orchestrator Agent receives this list and performs initial validation, such as checking that at least some results were found and logging information about the number and sources of results.
In the second phase, the Orchestrator Agent passes the list of SearchResults to the Document Agent. The Document Agent attempts to retrieve the full text content for each result, enriching the SearchResult objects with this additional information. Not all documents will have retrievable full text (some may be behind paywalls, some URLs may be broken, some PDFs may be malformed), so the Document Agent returns an updated list of SearchResults, some of which now contain full text and some of which do not.
In the third phase, the Orchestrator Agent passes the enriched SearchResults to the Vector Store Agent. The Vector Store Agent generates vector embeddings for the documents and stores them in a vector database. This phase is somewhat different from the previous phases in that it does not return modified data; rather, it has the side effect of populating the vector database, which can then be queried for semantic search.
In the fourth phase, the Orchestrator Agent invokes the Report Agent, passing it the user's query and the list of SearchResults. The Report Agent analyzes the results, generates a synthesis using a large language model, extracts key findings, identifies trends, and produces a structured ResearchReport object.
In the fifth and final phase, the Orchestrator Agent takes the ResearchReport and saves it to the file system. It creates a directory structure organized by topic, generates filenames that include timestamps to prevent collisions, and writes both a Markdown version (for human consumption) and a JSON version (for machine processing) of the report.
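The five phases above can be condensed into a workflow sketch. This is a simplified illustration under assumed agent interfaces (`search`, `fetch_full_text`, `index`, `generate` are hypothetical method names, and a plain dict stands in for the ResearchReport), not the project's actual orchestrator.

```python
import json
from datetime import datetime
from pathlib import Path

class Orchestrator:
    """Coordinates the five-phase research workflow described above."""

    def __init__(self, search, documents, vectors, reporter, out_dir="reports"):
        self.search, self.documents = search, documents
        self.vectors, self.reporter = vectors, reporter
        self.out_dir = Path(out_dir)

    def run(self, query: str) -> Path:
        results = self.search.search(query)                 # phase 1: discover
        if not results:
            raise RuntimeError("no search results; aborting workflow")
        results = self.documents.fetch_full_text(results)   # phase 2: enrich
        self.vectors.index(results)                         # phase 3: embed (side effect)
        report = self.reporter.generate(query, results)     # phase 4: synthesize
        return self._save(report)                           # phase 5: persist

    def _save(self, report: dict) -> Path:
        # Topic directory plus timestamped filenames prevent collisions.
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        topic_dir = self.out_dir / report["topic"].replace(" ", "_")
        topic_dir.mkdir(parents=True, exist_ok=True)
        (topic_dir / f"report-{stamp}.md").write_text(report["markdown"])
        (topic_dir / f"report-{stamp}.json").write_text(json.dumps(report["data"]))
        return topic_dir
```

Note how phase 3 differs from the others: it returns nothing and only populates the vector database as a side effect, exactly as described above.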
The Orchestrator Agent also provides scheduling capabilities. Users can configure the system to automatically conduct research on specified topics at regular intervals (for example, daily or weekly). The Orchestrator Agent maintains a schedule, checks periodically whether any scheduled research tasks are due, and executes them automatically. This scheduling capability is particularly valuable for users who need to stay current with rapidly evolving fields, as it provides a continuous stream of updated research reports without requiring manual intervention.
The Orchestrator Agent employs several important design patterns. It uses the Template Method pattern to define the overall structure of the research workflow, with specific steps delegated to specialized agents. It uses the Facade pattern to provide a simple, unified interface to the complex subsystem of specialized agents. It uses the Strategy pattern to support different types of research workflows (single query, scheduled research, comparative analysis).
Error handling in the Orchestrator Agent is designed to be robust and informative. If an error occurs in one of the specialized agents, the Orchestrator Agent catches the exception, logs detailed information about the error, and decides whether to abort the workflow or continue with degraded functionality. For example, if the Search Agent fails to retrieve results from one source but succeeds with others, the Orchestrator Agent will continue the workflow with the available results. If the Search Agent fails completely and returns no results, the Orchestrator Agent will abort the workflow and inform the user.
The Orchestrator Agent also implements rate limiting and resource management. It ensures that the system does not overwhelm external services with too many concurrent requests, respects rate limits imposed by APIs, and manages memory usage when processing large numbers of documents.
4.2 SEARCH AGENT
The Search Agent is responsible for discovering relevant documents across multiple information sources. It is one of the most complex agents in the system, as it must interact with several different search APIs, each with its own protocols, data formats, and rate limits.
The Search Agent supports four primary information sources. The first is arXiv, a preprint server that hosts scientific papers in fields such as physics, mathematics, computer science, and quantitative biology. The second is Google Scholar, an academic search engine that indexes scholarly literature across many disciplines. The third is general web search, implemented using the DuckDuckGo search engine, which provides access to web pages, blog posts, and other online content. The fourth is Medium, a popular platform for technical writing and thought leadership articles.
For each information source, the Search Agent implements a specialized search method that handles the particulars of that source's API. The arXiv search method uses the arXiv API, which accepts queries in a specific format and returns results in XML format. The Search Agent parses this XML, extracts relevant fields (title, authors, abstract, publication date, PDF URL), and constructs SearchResult objects.
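The XML-to-SearchResult step can be illustrated with a minimal Atom parser. The arXiv API (queried at `http://export.arxiv.org/api/query`) returns an Atom feed; the sketch below parses one with the standard library. It is an assumption-laden simplification of the real search method, and the returned dicts stand in for SearchResult objects.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by arXiv responses

def parse_arxiv_feed(xml_text: str) -> list[dict]:
    """Extract title, authors, abstract, date, and PDF URL from each feed entry."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter(f"{ATOM}entry"):
        # arXiv marks the PDF link with title="pdf" among the entry's links.
        pdf_url = next((l.get("href") for l in entry.iter(f"{ATOM}link")
                        if l.get("title") == "pdf"), None)
        results.append({
            "title": entry.findtext(f"{ATOM}title", "").strip(),
            "authors": [a.findtext(f"{ATOM}name", "")
                        for a in entry.iter(f"{ATOM}author")],
            "abstract": entry.findtext(f"{ATOM}summary", "").strip(),
            "published": entry.findtext(f"{ATOM}published", ""),
            "pdf_url": pdf_url,
        })
    return results
```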
The Google Scholar search method uses the Scholarly library, which provides a Python interface to Google Scholar. This method is more complex because Google Scholar does not provide an official API and actively attempts to prevent automated access. The Search Agent must therefore implement careful rate limiting, user agent rotation, and error handling to avoid being blocked.
The web search method uses the DuckDuckGo search API, which is more permissive and provides a straightforward JSON-based interface. The Search Agent submits the query, receives a list of web pages, and constructs SearchResult objects from the returned data.
The Medium search method uses web scraping techniques to search Medium's platform. This is necessary because Medium does not provide a public search API. The Search Agent constructs a search URL, fetches the HTML content, parses it to extract article information, and constructs SearchResult objects.
A critical capability of the Search Agent is multi-source aggregation. When conducting a search, the Search Agent queries all enabled sources (as specified in the configuration) and aggregates the results. This aggregation is not simply concatenation; the Search Agent also performs deduplication to remove results that appear in multiple sources, and it normalizes relevance scores across different sources to enable fair comparison.
The deduplication algorithm is particularly important. Different sources may return the same document with slightly different metadata (for example, one source might include middle initials in author names while another does not). The Search Agent uses a combination of exact matching (on URLs) and fuzzy matching (on titles) to identify duplicates. When duplicates are found, the Search Agent merges them, preserving the most complete metadata from each source.
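The exact-URL plus fuzzy-title strategy can be sketched with the standard library's `difflib`. This is a minimal illustration of the idea, not the project's implementation; the threshold value is an assumption, and the metadata-merging step described above is omitted for brevity.

```python
from difflib import SequenceMatcher

def deduplicate(results: list[dict], title_threshold: float = 0.9) -> list[dict]:
    """Drop results whose URL matches exactly or whose title is near-identical."""
    kept: list[dict] = []
    seen_urls: set[str] = set()
    for r in results:
        url = (r.get("url") or "").rstrip("/").lower()
        if url and url in seen_urls:
            continue  # exact duplicate by URL
        title = (r.get("title") or "").lower()
        if any(SequenceMatcher(None, title, (k.get("title") or "").lower()).ratio()
               >= title_threshold for k in kept):
            continue  # fuzzy duplicate by title
        if url:
            seen_urls.add(url)
        kept.append(r)
    return kept
```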
The Search Agent also implements intelligent query expansion. For certain types of queries, it automatically generates variations that are likely to retrieve additional relevant results. For example, if the query contains an acronym, the Search Agent might also search for the expanded form. If the query is very short, the Search Agent might add common related terms to broaden the search.
Rate limiting is a critical concern for the Search Agent. Many search APIs impose limits on the number of queries that can be made per unit time. The Search Agent implements a token bucket algorithm to ensure that it respects these limits. It maintains a count of recent requests for each API, delays requests when necessary to stay within limits, and provides informative error messages if rate limits are exceeded.
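A token bucket in its classic form is small enough to show whole. This sketch illustrates the algorithm named above rather than the Search Agent's actual limiter; rate and capacity values would come from configuration.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should delay and retry
```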
The Search Agent also implements retry logic with exponential backoff. If a search request fails due to a transient error (such as a network timeout or a temporary service unavailability), the Search Agent will automatically retry the request after a delay. The delay increases exponentially with each retry, which helps to avoid overwhelming a service that is experiencing problems.
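Retry with exponential backoff follows a standard shape: delay doubles after each failed attempt, and the final failure is re-raised. The helper below is an illustrative sketch with assumed parameter names, not the project's code.

```python
import time

def retry_with_backoff(fn, attempts: int = 4, base_delay: float = 0.5,
                       retryable=(TimeoutError, ConnectionError)):
    """Call fn(); on a transient error, wait base, 2*base, 4*base, ... then retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # exhausted retries: surface the error
            time.sleep(base_delay * (2 ** attempt))
```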
Caching is another important feature of the Search Agent. Search results are cached locally so that repeated searches for the same query do not require repeated API calls. This caching improves performance, reduces load on external services, and helps to stay within rate limits. The cache has a configurable time-to-live, so that results are periodically refreshed to capture new content.
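The cache-with-TTL behavior can be sketched as a small in-memory store keyed by source and query. This is a minimal illustration of the described mechanism; the real cache may well be persisted to disk, and the key structure is an assumption.

```python
import time

class SearchCache:
    """In-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, source: str, query: str):
        entry = self._store.get((source, query))
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[(source, query)]  # expired: force a fresh search
            return None
        return value

    def put(self, source: str, query: str, value) -> None:
        self._store[(source, query)] = (time.monotonic(), value)
```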
The Search Agent returns results in a standardized format regardless of which sources were queried. This standardization is crucial for the downstream agents, which can process results uniformly without needing to know which source they came from. The standardized SearchResult object includes all the information that downstream agents need, such as title, authors, abstract, URL, publication date, and source identifier.
4.3 DOCUMENT AGENT
The Document Agent is responsible for retrieving the full text content of documents identified by the Search Agent. While the Search Agent provides abstracts or snippets, the Document Agent obtains complete documents, which enables more thorough analysis and more accurate synthesis.
The Document Agent handles two primary types of content: PDF documents and web pages. Each type requires different retrieval and extraction techniques.
For PDF documents, the Document Agent first downloads the PDF file from the provided URL. It implements several safeguards during this process. It checks the file size before downloading to avoid attempting to download extremely large files that could exhaust memory or disk space. It sets a timeout on the download operation to avoid hanging indefinitely on slow or unresponsive servers. It validates that the downloaded content is actually a PDF file (by checking the file header) to avoid processing malformed or mislabeled files.
Once a PDF file is successfully downloaded, the Document Agent extracts the textual content. This extraction is more complex than it might initially appear. PDF is a page description format, not a text format, so extracting text requires parsing the PDF structure and reconstructing the text in reading order. The Document Agent uses the PyPDF2 library for this purpose, which handles most common PDF formats correctly.
However, PDF extraction is not always perfect. Some PDFs contain text as images (scanned documents), which cannot be extracted without optical character recognition. Some PDFs have complex layouts with multiple columns, sidebars, or embedded figures, which can result in text being extracted in the wrong order. Some PDFs are encrypted or password-protected, which prevents extraction entirely. The Document Agent handles these cases gracefully, logging warnings when extraction fails or produces questionable results, and continuing to process other documents.
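The pre-download and post-download safeguards can be expressed as two small checks: inspect the advertised size before fetching the body, and verify the PDF magic bytes afterward. These are illustrative helpers with an assumed size limit, not the project's exact code.

```python
def check_download_size(headers: dict, max_bytes: int = 20_000_000) -> None:
    """Reject oversized files before downloading the body (size limit is assumed)."""
    size = int(headers.get("Content-Length", 0))
    if size > max_bytes:
        raise ValueError(f"file too large: {size} bytes")

def looks_like_pdf(data: bytes) -> bool:
    """Every valid PDF begins with the magic bytes '%PDF-'."""
    return data[:5] == b"%PDF-"
```

A mislabeled URL that actually serves an HTML error page fails the `looks_like_pdf` check and is skipped rather than fed to the text extractor.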
For web pages, the Document Agent fetches the HTML content and extracts the main textual content while filtering out navigation menus, advertisements, footers, and other non-essential elements. This extraction is challenging because web pages have enormous variability in structure and layout. The Document Agent uses the BeautifulSoup library combined with heuristics to identify the main content area.
The extraction heuristics work as follows. The Document Agent first looks for HTML elements that are commonly used for main content, such as article tags, main tags, or divs with class names like "content" or "article-body". If such elements are found, their text is extracted. If not, the Document Agent falls back to extracting all paragraph elements, which usually captures the main content while excluding navigation and other non-paragraph elements.
The Document Agent also performs text cleaning and normalization. It removes excessive whitespace, normalizes line endings, removes non-printable characters, and optionally removes very short paragraphs that are likely to be navigation elements or captions rather than substantive content.
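The fallback heuristic (collect paragraph elements, drop short fragments) can be illustrated without BeautifulSoup using the standard library's HTML parser. This is a library-free sketch of the same idea, with an assumed minimum paragraph length; the actual agent uses BeautifulSoup as described above.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect text inside <p> tags, skipping <script>/<style> content."""

    def __init__(self):
        super().__init__()
        self.depth = 0                      # nesting level inside <p>
        self.skip = 0                       # nesting level inside script/style
        self.paragraphs: list[str] = []
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            text = " ".join("".join(self._buf).split())
            if len(text) > 40:              # drop likely captions / nav fragments
                self.paragraphs.append(text)
            self._buf = []
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            self._buf.append(data)

def extract_main_text(html: str) -> str:
    parser = ParagraphExtractor()
    parser.feed(html)
    return "\n\n".join(parser.paragraphs)
```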
An important consideration for the Document Agent is concurrency. Downloading and processing documents is I/O-bound and can be slow, especially when dealing with large PDFs or slow servers. To address this, the Document Agent implements concurrent processing using Python's asyncio library. It can process multiple documents simultaneously, subject to a configurable concurrency limit that prevents overwhelming the system or external servers.
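The bounded-concurrency pattern looks like this in asyncio: a semaphore caps the number of in-flight downloads, and failures are converted to placeholders so one bad document cannot abort the batch. A sketch under assumed interfaces (`fetch_one` is a hypothetical coroutine), not the agent's actual code.

```python
import asyncio

async def fetch_all(urls: list[str], fetch_one, max_concurrency: int = 5):
    """Run fetch_one(url) for every URL, at most max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:                 # wait here if the limit is reached
            try:
                return await fetch_one(url)
            except Exception:
                return None             # failed fetches become None, not crashes

    # gather preserves input order, so results line up with urls.
    return await asyncio.gather(*(bounded(u) for u in urls))
```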
The Document Agent also implements intelligent error handling and fallback strategies. If full text retrieval fails for a document, the Document Agent does not simply discard that document. Instead, it marks the document as having failed full text retrieval but retains the abstract and other metadata. This allows the document to still contribute to the research report, albeit with less detail than would be available with full text.
The Document Agent maintains statistics about its operations, such as the number of documents processed, the number of successful full text retrievals, the number of failures, and the reasons for failures. These statistics are logged and can be included in the research report metadata, providing transparency about the completeness of the research.
The Document Agent also implements content validation. After extracting text from a document, it checks that the extracted text is substantive (not just a few words or characters) and appears to be coherent (not random characters or encoding errors). If the extracted text fails these validation checks, the Document Agent logs a warning and may discard the extracted text in favor of using just the abstract.
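A validation check of this kind might look like the following. The word-count threshold and the printable-character ratio are illustrative assumptions, not Reportr's exact rules:

```python
# Illustrative content-validation heuristic. The thresholds (50 words,
# 95% printable characters) are assumptions for the sketch.
def is_substantive(text: str, min_words: int = 50) -> bool:
    words = text.split()
    if len(words) < min_words:
        return False
    # Reject text dominated by non-printable characters, which usually
    # indicates an encoding or extraction failure.
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / max(len(text), 1) > 0.95
```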
4.4 VECTOR STORE AGENT
The Vector Store Agent is responsible for managing the semantic search capabilities of Reportr. It generates vector embeddings for documents, stores them in a vector database, and provides semantic search functionality that complements the keyword-based search provided by the Search Agent.
Vector embeddings are high-dimensional numerical representations of text that capture semantic meaning. Documents with similar meanings have similar vector representations, even if they use different words. This property enables semantic search, in which a query can find relevant documents based on meaning rather than just keyword matches.
The Vector Store Agent uses the Sentence Transformers library to generate embeddings. Specifically, it uses the "all-MiniLM-L6-v2" model by default, which is a compact but effective model that produces 384-dimensional embeddings. This model was chosen because it provides a good balance between quality and computational efficiency. It is small enough to run on modest hardware without requiring a GPU, yet it produces embeddings that are sufficiently high quality for most research tasks.
The embedding process works as follows. For each document, the Vector Store Agent extracts the textual content (either the full text if available, or the abstract if not). It then passes this text to the Sentence Transformers model, which returns a 384-dimensional vector. This vector is stored in a vector database along with metadata about the document (such as its title, authors, URL, and publication date).
The vector database used by Reportr is ChromaDB, an open-source embedding database designed specifically for AI applications. ChromaDB provides several important capabilities. It efficiently stores large numbers of vectors and their associated metadata. It provides fast similarity search, allowing queries to find the most similar vectors to a query vector in milliseconds. It supports filtering based on metadata, allowing queries to be restricted to documents from certain sources or time periods. It persists data to disk, so that the vector database survives across multiple runs of the system.
The Vector Store Agent organizes vectors into collections. A collection is a logical grouping of related vectors. By default, Reportr uses a single collection for all research, but the system can be configured to use separate collections for different research topics or projects. This separation can be useful for managing large research projects or for maintaining separate knowledge bases for different domains.
When adding documents to the vector store, the Vector Store Agent implements several optimizations. It batches documents together and processes them in groups, which is more efficient than processing them one at a time. It checks for duplicates before adding documents, using the document URL as a unique identifier. If a document is already in the vector store, it is not added again, which prevents the database from growing unnecessarily large.
The Vector Store Agent provides several types of queries. The most basic is similarity search, in which a query string is converted to a vector and the most similar document vectors are retrieved. This is useful for finding documents that are conceptually related to a query, even if they do not contain the exact query terms.
A more advanced query type is hybrid search, which combines keyword search and semantic search. The Vector Store Agent can filter documents based on keyword matches and then rank them by semantic similarity, or vice versa. This hybrid approach often produces better results than either technique alone.
The Vector Store Agent also supports metadata filtering. For example, a query can request only documents published after a certain date, or only documents from certain sources. This filtering is implemented efficiently by ChromaDB, which indexes metadata fields for fast lookup.
An important consideration for the Vector Store Agent is the chunking strategy. Long documents (such as full research papers) often exceed the maximum input length for the embedding model, which is typically 512 tokens (roughly 400 words). The Vector Store Agent handles this by splitting long documents into chunks. Each chunk is embedded separately, and the chunks are stored as separate entries in the vector database, all linked to the same source document.
The chunking strategy uses a sliding window approach with overlap. Documents are split into chunks of approximately 500 words, with a 100-word overlap between consecutive chunks. This overlap ensures that concepts spanning chunk boundaries are still captured in at least one chunk. Both the chunk size and the overlap are configurable, which allows tuning for different types of content and for keeping chunks close to the embedding model's input limit (the model silently truncates any text beyond it).
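A sliding-window chunker matching this description can be sketched as follows, with the 500-word size and 100-word overlap as defaults:

```python
# Sliding-window chunking with overlap, as described above. Both parameters
# are configurable; the defaults mirror the values in the text.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100):
    words = text.split()
    step = chunk_size - overlap  # each window starts 400 words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already reached the end of the document
    return chunks
```

For a 1000-word document this yields three chunks (words 0-499, 400-899, and 800-999), so every 100-word boundary region appears in two chunks.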
The Vector Store Agent also implements relevance scoring. When returning search results, it includes a similarity score that indicates how closely each result matches the query. These scores are normalized to a 0-1 range, where 1 indicates perfect similarity and 0 indicates no similarity. The Vector Store Agent can filter results based on a minimum similarity threshold, returning only results that are sufficiently relevant.
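One common way to obtain such normalized scores is to map the cosine distances returned by the vector store into the 0-1 range. The mapping below assumes cosine distances in [0, 2] and is one conventional choice, not necessarily Reportr's exact formula:

```python
# Score normalization and threshold filtering sketch. Assumes cosine
# distances in [0, 2]; the linear mapping is an illustrative convention.
def filter_by_similarity(results, min_similarity=0.5):
    scored = []
    for doc, distance in results:
        similarity = 1.0 - distance / 2.0  # distance 0 -> similarity 1.0
        if similarity >= min_similarity:
            scored.append((doc, similarity))
    # Return surviving results, most similar first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```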
The Vector Store Agent maintains the vector database across multiple research sessions. This persistence means that the knowledge base grows over time, accumulating information from all research queries. This accumulated knowledge can be valuable for discovering connections between different research topics or for providing context for new queries.
However, unbounded growth of the vector database can eventually lead to performance degradation and storage issues. The Vector Store Agent therefore implements a cleanup strategy. It can be configured to automatically remove old entries based on age or to limit the total number of entries. It can also compact the database periodically to reclaim space from deleted entries.
4.5 REPORT AGENT
The Report Agent is responsible for synthesizing the information gathered by the other agents into a coherent, well-structured research report. This is arguably the most sophisticated agent in the system, as it must perform complex natural language understanding and generation tasks.
The Report Agent uses a large language model (LLM) to perform synthesis and summarization. By default, Reportr supports both OpenAI's GPT models and Anthropic's Claude models. It also supports Ollama and llama.cpp for using local open-weights or open-source LLMs. Further inference engines can be integrated. The choice of model can be configured based on user preference, cost considerations, or specific capabilities required for a task.
The report generation process begins with the Report Agent receiving a list of SearchResult objects from the Orchestrator Agent. These results have been gathered by the Search Agent, enriched with full text by the Document Agent, and indexed by the Vector Store Agent. The Report Agent's task is to analyze these results and produce a comprehensive report.
The first step in report generation is deduplication and filtering. Although the Search Agent performs initial deduplication, the Report Agent applies additional deduplication logic based on content similarity rather than just title similarity. It computes pairwise similarity scores between all results and removes results that are highly similar to others. This ensures that the final report does not include redundant information from multiple sources covering the same ground.
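Content-level deduplication of this kind can be sketched with a pairwise similarity measure. Jaccard similarity over word sets and the 0.8 threshold are illustrative stand-ins for whatever function the Report Agent actually uses:

```python
# Content-based deduplication sketch using Jaccard similarity over word
# sets. The measure and the 0.8 threshold are illustrative assumptions.
def dedupe_by_content(texts, threshold=0.8):
    kept = []
    for text in texts:
        words = set(text.lower().split())
        duplicate = False
        for other in kept:
            other_words = set(other.lower().split())
            union = words | other_words
            # Two texts sharing >= 80% of their combined vocabulary are
            # treated as covering the same ground.
            if union and len(words & other_words) / len(union) >= threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(text)
    return kept
```

Note this comparison is quadratic in the number of results, which is acceptable here because deduplication runs on the small post-search candidate set, not the whole knowledge base.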
The Report Agent also filters results based on relevance. It uses the relevance scores computed by the Search Agent and Vector Store Agent, as well as its own analysis of content quality, to select the most relevant and high-quality results. By default, the Report Agent limits the final report to the top ten results, but this limit is configurable.
Once the results have been deduplicated and filtered, the Report Agent proceeds to synthesis. This is where the large language model comes into play. The Report Agent constructs a prompt for the LLM that includes the research topic, the titles and abstracts (or full text excerpts) of the top results, and instructions for generating a summary.
The prompt is carefully designed to elicit high-quality output from the LLM. It instructs the model to read all the provided sources, identify the main themes and findings, note areas of agreement and disagreement among sources, and synthesize this information into a coherent narrative of two to three paragraphs. The prompt also instructs the model to write in a clear, professional style appropriate for a research report, and to avoid simply copying text from the sources.
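A prompt template reflecting these instructions might look like the following. The exact wording Reportr uses is not reproduced here; this is an illustrative reconstruction:

```python
# Illustrative summary-prompt builder; the wording is a reconstruction of
# the instructions described above, not Reportr's actual template.
def build_summary_prompt(topic, sources):
    source_block = "\n\n".join(
        f"[{i + 1}] {s['title']}\n{s['abstract']}" for i, s in enumerate(sources)
    )
    return (
        f"You are writing the summary section of a research report on: {topic}\n\n"
        "Read the sources below, identify the main themes and findings, and note "
        "where sources agree or disagree. Synthesize this into a coherent "
        "narrative of two to three paragraphs, written in a clear, professional "
        "style. Do not copy text verbatim from the sources.\n\n"
        f"SOURCES:\n{source_block}"
    )
```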
The LLM processes this prompt and generates a summary. The Report Agent receives this summary and incorporates it into the research report. The summary serves as an executive overview that gives readers a quick understanding of the current state of knowledge on the research topic.
In addition to the summary, the Report Agent extracts key findings from the results. Key findings are individual insights or discoveries that are particularly important or interesting. The Report Agent identifies key findings by analyzing the titles and abstracts of the results, looking for statements that represent novel contributions, surprising results, or important conclusions. These key findings are presented as a bulleted list in the report.
The Report Agent also identifies trends. Trends are recurring themes or topics that appear across multiple sources. To identify trends, the Report Agent performs frequency analysis on the words and phrases that appear in the titles and abstracts of the results. It filters out common words (stop words) and focuses on domain-specific terminology. The most frequently occurring terms are identified as trends and included in the report.
The trend identification algorithm uses a combination of term frequency and document frequency. Terms that appear frequently but only in a single document are not considered trends; trends must appear across multiple documents. This approach helps to identify genuinely important themes rather than idiosyncratic terminology from individual papers.
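The combination of term frequency and document frequency can be sketched as follows. The stop-word list and thresholds are illustrative:

```python
# Trend extraction sketch: a term counts as a trend only if it appears in
# at least min_docs documents; trends are then ranked by total frequency.
# The stop-word list and default thresholds are illustrative assumptions.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "for", "and", "to", "with", "on"}

def find_trends(documents, min_docs=2, top_n=5):
    term_freq = Counter()  # total occurrences across all documents
    doc_freq = Counter()   # number of documents containing each term
    for doc in documents:
        terms = [w for w in doc.lower().split() if w not in STOP_WORDS]
        term_freq.update(terms)
        doc_freq.update(set(terms))  # count each term once per document
    candidates = [t for t in term_freq if doc_freq[t] >= min_docs]
    return sorted(candidates, key=lambda t: term_freq[t], reverse=True)[:top_n]
```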
Once the summary, key findings, and trends have been generated, the Report Agent assembles them into a structured report. The report includes several sections. The header contains the research topic, the original query, a timestamp, and metadata about the number of sources consulted. The summary section contains the LLM-generated synthesis. The trends section lists the identified trends. The findings section presents detailed information about each result, including title, authors, publication date, source, abstract or excerpt, and URL.
The Report Agent formats the report in Markdown, a lightweight markup language that is both human-readable and easily convertible to other formats such as HTML or PDF. The Markdown formatting includes headers for sections, bold text for emphasis, bulleted lists for findings and trends, and hyperlinks for URLs.
In addition to the Markdown report, the Report Agent also generates a JSON representation of the report. The JSON format is structured and machine-readable, making it suitable for programmatic processing, integration with other tools, or storage in databases. The JSON representation includes all the same information as the Markdown report, but organized as nested objects and arrays.
The Report Agent implements several quality control measures. It validates that the LLM-generated summary is substantive (not just a few words) and coherent (not garbled or nonsensical). If the summary fails validation, the Report Agent can retry with a modified prompt or fall back to a simpler summarization technique that does not use the LLM.
The Report Agent also implements citation formatting. Each finding in the report includes a properly formatted citation that identifies the source, authors, and publication date. These citations enable readers to trace the information back to its original source and assess its credibility.
An important feature of the Report Agent is its handling of full text content. When full text is available for a result, the Report Agent can provide more detailed excerpts in the report. It uses extractive summarization techniques to identify the most important sentences or paragraphs from the full text and includes them in the report. This provides readers with more context and detail than would be available from just the abstract.
The Report Agent also generates metadata about the research process itself. This metadata includes information such as the total number of sources searched, the breakdown of results by source (how many came from arXiv, how many from Google Scholar, etc.), the number of results for which full text was successfully retrieved, and the time taken for various stages of the research process. This metadata provides transparency and helps users understand the scope and completeness of the research.
5. CRITICAL ARCHITECTURAL DECISIONS
The architecture of Reportr is the result of numerous design decisions, each of which involved trade-offs between competing concerns. This section examines the most critical architectural decisions, explaining the rationale behind each decision and the alternatives that were considered.
5.1 DECISION 1: AGENT-BASED ARCHITECTURE
The decision to employ an agent-based architecture, rather than a monolithic or layered architecture, is perhaps the most fundamental architectural decision in Reportr.
The rationale for this decision is multifaceted. First, the research process naturally decomposes into distinct activities (searching, retrieving, indexing, synthesizing), each of which requires different expertise and interacts with different external systems. An agent-based architecture allows each activity to be encapsulated in a separate agent, with clear responsibilities and interfaces.
Second, the agent-based architecture provides modularity and extensibility. New search sources can be added by extending the Search Agent. New document formats can be supported by extending the Document Agent. New synthesis techniques can be implemented by modifying the Report Agent. These extensions can be made independently without requiring changes to the entire system.
Third, the agent-based architecture facilitates testing and debugging. Each agent can be tested in isolation with mock implementations of its dependencies. When a problem occurs, the modular structure makes it easier to identify which agent is responsible and to focus debugging efforts on that specific component.
Fourth, the agent-based architecture supports fault tolerance. If one agent encounters an error, it can fail gracefully without bringing down the entire system. For example, if the Search Agent fails to retrieve results from Google Scholar but succeeds with arXiv, the system can continue with the arXiv results.
The primary alternative to an agent-based architecture would be a monolithic architecture in which all functionality is implemented in a single large module or class. This approach would be simpler in some respects (no need to coordinate multiple agents, no need to define interfaces between agents), but it would sacrifice modularity, extensibility, and testability. As the system grew in complexity, a monolithic architecture would become increasingly difficult to understand, modify, and maintain.
Another alternative would be a microservices architecture in which each agent is a separate service that communicates with other services over a network. This approach would provide even greater decoupling than the agent-based architecture employed by Reportr, and it would enable agents to be deployed on separate machines for better scalability. However, it would also introduce significant additional complexity in the form of network communication, service discovery, distributed error handling, and deployment orchestration. For a system of Reportr's scale, this additional complexity was judged to be unwarranted.
5.2 DECISION 2: SEPARATION OF CONCERNS
A closely related decision is the strict separation of concerns among the agents. Each agent has a well-defined responsibility and does not encroach on the responsibilities of other agents.
The Search Agent is responsible only for finding potentially relevant documents; it does not retrieve full text or generate summaries. The Document Agent is responsible only for retrieving full text; it does not search for documents or generate summaries. The Vector Store Agent is responsible only for semantic indexing and search; it does not retrieve documents or generate reports. The Report Agent is responsible only for synthesis and formatting; it does not search for or retrieve documents.
This separation of concerns provides several benefits. It reduces coupling between agents, making them more independent and easier to modify. It clarifies the system's architecture, making it easier for developers to understand where specific functionality resides. It facilitates reuse, as agents can potentially be used in other systems or contexts.
The separation of concerns also aligns with the Single Responsibility Principle from object-oriented design, which states that each module or class should have one reason to change. By giving each agent a single, well-defined responsibility, the architecture minimizes the likelihood that changes in one area of functionality will require modifications to multiple agents.
An alternative approach would be to create more coarse-grained agents with broader responsibilities. For example, a single "Information Retrieval Agent" could be responsible for both searching and retrieving full text. This would reduce the number of agents and simplify coordination, but it would also create larger, more complex agents that are harder to understand and modify.
5.3 DECISION 3: HYBRID SEARCH STRATEGY
The decision to employ a hybrid search strategy, combining multiple search sources and both keyword-based and semantic search, is critical to Reportr's effectiveness.
Different information sources have different strengths. arXiv provides access to cutting-edge research in certain fields, often before it appears in traditional journals. Google Scholar provides broad coverage across many disciplines and includes both recent and historical papers. Web search provides access to blog posts, tutorials, and other informal content that may not appear in academic databases. Medium provides access to thought leadership and practitioner perspectives.
By searching multiple sources, Reportr ensures comprehensive coverage. A query about a recent development in machine learning might find preprints on arXiv, established papers on Google Scholar, tutorials on the web, and practitioner perspectives on Medium. This diversity of sources provides a more complete picture than any single source could provide.
The combination of keyword-based and semantic search is similarly important. Keyword-based search is precise and deterministic; it finds documents that contain specific terms. Semantic search is more flexible and can find conceptually related documents even when they use different terminology. By combining both approaches, Reportr can find both documents that directly address the query terms and documents that address related concepts.
The primary alternative to a hybrid search strategy would be to rely on a single search source or a single search technique. This would simplify the Search Agent and reduce the number of external dependencies, but it would significantly limit the comprehensiveness and quality of the search results.
5.4 DECISION 4: VECTOR DATABASE INTEGRATION
The decision to integrate a vector database (ChromaDB) for semantic search is one of the more sophisticated architectural choices in Reportr.
Vector databases are a relatively recent development in the data management landscape. They are specifically designed to store and query high-dimensional vectors efficiently. Traditional databases are optimized for exact matches and range queries on scalar values; vector databases are optimized for similarity queries on vectors.
The integration of ChromaDB provides several important capabilities. It enables semantic search, allowing users to find documents based on meaning rather than just keywords. It provides a persistent knowledge base that accumulates information across multiple research sessions. It enables advanced queries that combine semantic similarity with metadata filtering.
The vector database also serves as a foundation for potential future enhancements. For example, it could support question answering (finding specific answers to specific questions rather than just relevant documents), document clustering (automatically grouping related documents), or recommendation (suggesting related topics or documents based on a user's research history).
The primary alternative to integrating a vector database would be to rely solely on the keyword-based search provided by external search engines. This would simplify the architecture and eliminate a dependency, but it would sacrifice the semantic search capabilities that vector databases enable.
Another alternative would be to implement semantic search without a dedicated vector database, perhaps by computing embeddings on the fly and performing similarity comparisons in memory. This approach would work for small numbers of documents but would not scale to large knowledge bases, as the computational cost of comparing a query vector to thousands or millions of document vectors would be prohibitive.
5.5 DECISION 5: ASYNCHRONOUS PROCESSING MODEL
The decision to implement asynchronous processing for I/O-bound operations, particularly in the Document Agent, is important for performance.
Many operations in Reportr are I/O-bound, meaning that they spend most of their time waiting for external resources (network responses, disk reads, API calls) rather than performing computation. Downloading PDFs, fetching web pages, and querying search APIs are all I/O-bound operations.
In a synchronous processing model, these operations would be performed sequentially. The system would download one PDF, wait for it to complete, then download the next PDF, wait for it to complete, and so on. This sequential processing is simple to implement and reason about, but it is inefficient because the system is idle while waiting for I/O operations to complete.
In an asynchronous processing model, multiple I/O operations can be in flight simultaneously. The system can initiate a download, then immediately initiate another download without waiting for the first to complete. When downloads complete, they are processed as results become available. This concurrent processing can dramatically reduce the total time required to process a large number of documents.
Reportr implements asynchronous processing using Python's asyncio library, which provides language-level support for asynchronous I/O. The Document Agent, in particular, uses asyncio to download and process multiple documents concurrently. The degree of concurrency is configurable, allowing users to balance performance against resource usage and politeness to external servers.
The primary alternative to asynchronous processing would be synchronous processing, which would be simpler but slower. Another alternative would be thread-based or process-based concurrency, which can also enable concurrent I/O but is generally more complex and resource-intensive than asyncio-based concurrency.
5.6 DECISION 6: CONFIGURATION-DRIVEN DESIGN
The decision to make Reportr highly configurable, with most behavior controlled by a configuration file rather than hard-coded, is important for flexibility and usability.
Reportr's behavior is controlled by a YAML configuration file that specifies numerous parameters: which search sources to use, how many results to retrieve from each source, whether to fetch full text, which LLM provider to use, how to format reports, where to save reports, and many others.
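A configuration fragment in this spirit might look like the following. The key names and values are illustrative assumptions based on the parameters described above, not Reportr's actual schema:

```yaml
# Illustrative configuration fragment; key names are assumptions, not
# Reportr's actual schema.
search:
  sources: [arxiv, google_scholar, web, medium]
  max_results_per_source: 10
documents:
  fetch_full_text: true
  max_concurrency: 5
llm:
  provider: openai        # openai | anthropic | ollama | llamacpp
  model: gpt-4o-mini
report:
  max_results: 10
  output_dir: ./reports
  formats: [markdown, json]
```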
This configuration-driven design provides several benefits. It allows users to customize the system's behavior without modifying code. It makes it easy to create different configurations for different use cases (for example, a configuration optimized for speed versus one optimized for comprehensiveness). It facilitates experimentation, as users can try different parameter settings to see which produce the best results.
The configuration file also serves as documentation, as it explicitly lists all the configurable parameters and their default values. This makes it easier for users to understand what options are available and how to adjust them.
The primary alternative to a configuration-driven design would be a hard-coded design in which behavior is determined by constants or variables in the code. This would be simpler in some respects, but it would require code modifications for any behavior changes, making the system less accessible to non-programmers.
Another alternative would be a database-driven design in which configuration is stored in a database rather than a file. This would enable more dynamic configuration changes and potentially support multiple users with different configurations, but it would add complexity and dependencies.
5.7 DECISION 7: LLM PROVIDER ABSTRACTION
The decision to abstract the large language model provider behind a common interface is important for flexibility and future-proofing.
Reportr defines a BaseLLMProvider interface that specifies the methods that any LLM provider must implement. Concrete implementations of this interface are provided for OpenAI's GPT models and Anthropic's Claude models. The Report Agent interacts with LLMs through this interface, without needing to know which specific provider is being used.
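The shape of this abstraction can be sketched as follows. The interface name comes from the text; the method names, signatures, and the mock provider are assumptions for illustration:

```python
# Sketch of the LLM provider abstraction. BaseLLMProvider is named in the
# text; the generate() signature and MockLLMProvider are assumptions.
from abc import ABC, abstractmethod

class BaseLLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 1024) -> str:
        """Return the model's completion for the given prompt."""

class MockLLMProvider(BaseLLMProvider):
    """Test double: returns a canned response without any API call."""
    def generate(self, prompt: str, max_tokens: int = 1024) -> str:
        return "mock summary"

def summarize(provider: BaseLLMProvider, prompt: str) -> str:
    # Callers depend only on the interface, never on a concrete provider,
    # so OpenAI, Anthropic, Ollama, or a mock are interchangeable here.
    return provider.generate(prompt)
```

In tests, `MockLLMProvider` substitutes for a real provider, which is exactly the testability benefit described above.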
This abstraction provides several benefits. It allows users to choose their preferred LLM provider based on cost, performance, or other considerations. It makes it easy to add support for new LLM providers as they become available. It facilitates testing, as a mock LLM provider can be used in tests without requiring actual API calls.
The abstraction also provides a degree of insulation against changes in LLM APIs. If a provider changes their API, only the corresponding implementation of BaseLLMProvider needs to be updated; the Report Agent and other components that use the LLM remain unchanged.
The primary alternative to LLM provider abstraction would be to hard-code the use of a specific LLM provider. This would be simpler but would lock the system into that provider and make it difficult to switch providers or support multiple providers.
6. DATA FLOW AND PROCESSING PIPELINE
Understanding the data flow through Reportr is essential for understanding how the system works. This section traces the journey of data from initial query through final report generation.
The data flow can be visualized as follows:
USER QUERY
|
| "Recent advances in transformer architectures"
|
v
+-----------------------------------+
| ORCHESTRATOR AGENT |
| - Receives query |
| - Initiates workflow |
+-----------------------------------+
|
| Query string
|
v
+-----------------------------------+
| SEARCH AGENT |
| - Searches arXiv |
| - Searches Google Scholar |
| - Searches Web |
| - Searches Medium |
| - Aggregates results |
| - Deduplicates |
+-----------------------------------+
|
| List of SearchResult objects
| (title, authors, abstract, URL, source, date)
|
v
+-----------------------------------+
| DOCUMENT AGENT |
| - Downloads PDFs |
| - Fetches web pages |
| - Extracts text |
| - Enriches SearchResults |
+-----------------------------------+
|
| List of SearchResult objects
| (now with full_text field populated)
|
v
+-----------------------------------+
| VECTOR STORE AGENT |
| - Generates embeddings |
| - Stores in ChromaDB |
| - Indexes for semantic search |
+-----------------------------------+
|
| (Side effect: vector database populated)
| List of SearchResult objects (unchanged)
|
v
+-----------------------------------+
| REPORT AGENT |
| - Deduplicates by content |
| - Filters by relevance |
| - Generates summary (via LLM) |
| - Extracts key findings |
| - Identifies trends |
| - Formats report |
+-----------------------------------+
|
| ResearchReport object
| (topic, query, timestamp, results, summary, findings, trends)
|
v
+-----------------------------------+
| ORCHESTRATOR AGENT |
| - Renders Markdown |
| - Renders JSON |
| - Saves to file system |
+-----------------------------------+
|
| Files written to disk
|
v
RESEARCH REPORT
- Markdown file (human-readable)
- JSON file (machine-readable)
Let us trace this flow in more detail with a concrete example. Suppose a user submits the query "recent advances in transformer architectures for natural language processing."
The Orchestrator Agent receives this query and passes it to the Search Agent. The Search Agent constructs search queries for each enabled source. For arXiv, it might search for "transformer architecture natural language processing" in the computer science category. For Google Scholar, it might search for the same terms across all disciplines. For web search, it might use the full query string. For Medium, it might search for "transformer NLP."
Each search source returns a list of results. arXiv might return ten recent papers about transformer architectures. Google Scholar might return a mix of recent and older papers, including some highly cited foundational papers. Web search might return blog posts, tutorials, and documentation pages. Medium might return articles by practitioners discussing their experiences with transformers.
The Search Agent aggregates these results into a single list. It identifies duplicates (for example, a paper that appears in both arXiv and Google Scholar) and merges them. It normalizes relevance scores across sources. The result is a unified list of, say, thirty SearchResult objects, each containing title, authors, abstract, URL, source, and publication date.
The Orchestrator Agent receives this list and passes it to the Document Agent. The Document Agent examines each SearchResult and attempts to retrieve full text. For results from arXiv, it downloads the PDF from the arXiv URL and extracts the text. For results from web search or Medium, it fetches the HTML and extracts the main content. Some retrievals succeed, some fail (due to paywalls, broken links, or other issues). The Document Agent updates the SearchResult objects, populating the full_text field for those where retrieval succeeded.
The Orchestrator Agent receives the enriched list and passes it to the Vector Store Agent. The Vector Store Agent processes each SearchResult, generating a vector embedding from either the full text (if available) or the abstract (if not). It stores these embeddings in ChromaDB along with metadata. This process has the side effect of populating the vector database, but it does not modify the SearchResult objects themselves.
The Orchestrator Agent then passes the list of SearchResults to the Report Agent. The Report Agent first performs additional deduplication, identifying results that have very similar content even if they have different titles. It then filters the results by relevance, selecting the top ten most relevant results.
For these top ten results, the Report Agent constructs a prompt for the LLM. The prompt includes the original query and excerpts from each of the ten results. The LLM reads this material and generates a two-paragraph summary that synthesizes the key information.
The Report Agent also analyzes the results to extract key findings. It examines the titles and abstracts, identifying statements that represent important contributions or conclusions. It compiles these into a list of key findings.
The Report Agent performs frequency analysis on the terminology used in the results, identifying the most common domain-specific terms. These become the identified trends.
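A minimal sketch of this frequency analysis, assuming a simple stopword filter and word-level counting (the actual term extraction may be more sophisticated):

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "on", "with", "to", "is"}

def extract_trends(texts, top_n=3):
    """Count non-stopword terms across titles/abstracts and return
    the most frequent ones as candidate trends."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z][a-z-]+", text.lower()):
            if word not in STOPWORDS and len(word) > 3:
                counts[word] += 1
    return [term for term, _ in counts.most_common(top_n)]
```

Given titles such as "Transformer models for vision" and "Scaling transformer models", a counter like this surfaces "transformer" as the dominant term.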
The Report Agent assembles all this information into a ResearchReport object. This object contains the original query, a timestamp, the list of SearchResults, the LLM-generated summary, the list of key findings, the list of trends, and metadata about the research process.
The Orchestrator Agent receives the ResearchReport object and renders it in two formats. It generates a Markdown document with formatted sections for the summary, trends, and detailed findings. It also generates a JSON document with the same information in structured form.
Finally, the Orchestrator Agent saves these documents to the file system. It creates a directory named after the research topic (with special characters removed for file system compatibility) and saves the Markdown and JSON files in that directory with timestamps in their filenames.
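The directory and filename construction can be sketched like this; the exact naming scheme (underscores, lower-casing, timestamp format) is an assumption consistent with the description above:

```python
import re
from datetime import datetime
from pathlib import Path

def report_paths(topic, base_dir="reports/single"):
    """Build a filesystem-safe directory for the topic and timestamped
    Markdown/JSON filenames inside it (paths only; nothing is written)."""
    safe = re.sub(r"[^A-Za-z0-9]+", "_", topic).strip("_").lower()
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    directory = Path(base_dir) / safe
    return directory / f"report_{stamp}.md", directory / f"report_{stamp}.json"
```

For the topic "Quantum Computing: 2024!" this yields a directory such as reports/single/quantum_computing_2024 containing a timestamped .md and .json pair.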
The entire process, from query submission to file creation, typically takes two to five minutes. The user receives two files that together constitute a comprehensive research report on the requested topic.
7. TECHNOLOGY STACK AND DEPENDENCIES
Reportr is built on a foundation of modern Python libraries and external services. This section catalogs the key technologies and explains why each was chosen.
The core programming language is Python 3.8 or later. Python was chosen for several reasons. It has excellent support for the kinds of tasks that Reportr performs, including web scraping, PDF processing, natural language processing, and API interaction. It has a rich ecosystem of libraries for these tasks. It is widely used in the AI and data science communities, making it familiar to the target audience. It supports both object-oriented and functional programming styles, allowing flexibility in implementation.
For search functionality, Reportr uses several libraries and APIs. The arXiv API is accessed using the arxiv Python library, which provides a clean interface to arXiv's search and retrieval capabilities. Google Scholar is accessed using the scholarly library, which scrapes Google Scholar's web interface (as Google does not provide an official API). Web search is performed using the duckduckgo-search library, which interfaces with DuckDuckGo's search API. Medium search is implemented using custom web scraping code with the requests and BeautifulSoup libraries.
For document retrieval and processing, Reportr uses several libraries. The requests library is used for HTTP requests to download PDFs and fetch web pages. The PyPDF2 library is used to extract text from PDF files. The BeautifulSoup library is used to parse HTML and extract main content from web pages. The aiohttp library is used for asynchronous HTTP requests, enabling concurrent document downloads.
For vector embeddings and semantic search, Reportr uses the Sentence Transformers library, which provides pre-trained models for generating high-quality embeddings. The specific model used by default is "all-MiniLM-L6-v2", which produces 384-dimensional embeddings and can run efficiently on CPU. For vector storage and similarity search, Reportr uses ChromaDB, an open-source embedding database designed for AI applications.
For large language model interaction, Reportr uses the official client libraries for OpenAI (openai library) and Anthropic (anthropic library). These libraries provide convenient interfaces to the respective APIs, handling authentication, request formatting, and response parsing.
For configuration management, Reportr uses YAML files parsed with the PyYAML library. YAML was chosen over alternatives like JSON or TOML because it is more human-readable and supports comments, making configuration files easier to understand and document.
For logging, Reportr uses Python's built-in logging module, which provides flexible, configurable logging with support for multiple log levels, multiple output destinations, and structured log messages.
For scheduling, Reportr uses the schedule library, which provides a simple, Pythonic interface for scheduling recurring tasks. This library was chosen over alternatives like cron or APScheduler because of its simplicity and ease of use.
For file system operations, Reportr uses Python's built-in pathlib module, which provides an object-oriented interface to file system paths that is more elegant and less error-prone than string-based path manipulation.
For data structures, Reportr makes extensive use of Python's dataclasses, which provide a concise way to define classes that are primarily used to store data. Dataclasses automatically generate initialization methods, string representations, and other boilerplate code.
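An illustrative dataclass of the kind described, with field names following the booklet's own descriptions rather than the exact source:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SearchResult:
    """One search hit as it flows between agents."""
    title: str
    url: str
    source: str
    authors: List[str] = field(default_factory=list)
    abstract: str = ""
    published: Optional[str] = None   # publication date, if known
    full_text: Optional[str] = None   # populated by the Document Agent
    relevance: float = 0.0

r = SearchResult(title="Attention Is All You Need",
                 url="https://arxiv.org/abs/1706.03762",
                 source="arxiv")
```

The generated __init__ and __repr__ are exactly the boilerplate the text refers to: none of it has to be written by hand.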
For asynchronous programming, Reportr uses Python's built-in asyncio module, which provides language-level support for asynchronous I/O operations.
All of these dependencies are specified in a requirements.txt file, which allows them to be installed easily using pip, Python's package manager. The use of well-established, widely-used libraries reduces the amount of custom code that needs to be written and maintained, and it leverages the expertise of the broader Python community.
8. SCALABILITY AND PERFORMANCE CONSIDERATIONS
While Reportr is designed primarily for individual researchers or small teams rather than large-scale enterprise deployment, several design decisions support scalability and performance.
The asynchronous processing model used by the Document Agent allows it to download and process multiple documents concurrently. This concurrency is limited by a configurable parameter to prevent overwhelming the system or external servers, but it can be tuned based on available resources. On a machine with sufficient bandwidth and processing power, the Document Agent can process dozens of documents simultaneously.
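The concurrency cap described here is the classic asyncio semaphore pattern; a minimal sketch (the `fetch` stub stands in for an aiohttp download):

```python
import asyncio

async def fetch(url):
    """Stand-in for an aiohttp download; sleeps briefly instead."""
    await asyncio.sleep(0.01)
    return f"content of {url}"

async def fetch_all(urls, max_concurrent=10):
    """Download documents concurrently, capped by a semaphore the way
    max_concurrent_downloads caps the Document Agent."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:          # at most max_concurrent tasks run here at once
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(fetch_all([f"https://example.org/{i}" for i in range(5)]))
```

Raising max_concurrent trades memory and bandwidth for wall-clock time, which is exactly the tuning knob the configuration exposes.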
The vector database (ChromaDB) is designed to scale to millions of vectors. It uses efficient indexing structures (such as HNSW, Hierarchical Navigable Small World graphs) that provide fast approximate nearest neighbor search even with large numbers of vectors. As the knowledge base grows over time, query performance remains acceptable.
The Search Agent implements caching to avoid redundant API calls. When the same query is submitted multiple times, the Search Agent can return cached results rather than querying external services again. This caching improves response time and reduces load on external services. The cache has a configurable time-to-live, so results are periodically refreshed to capture new content.
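A minimal version of such a time-to-live cache, assuming an in-memory store (the real system persists cache files to the configured cache_dir):

```python
import time

class TTLCache:
    """Sketch of a time-to-live cache of the kind the Search Agent could use."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, query):
        entry = self._store.get(query)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:  # expired: caller must refresh
            del self._store[query]
            return None
        return value

    def put(self, query, value):
        self._store[query] = (time.monotonic(), value)
```

On a cache miss (or an expired entry) the agent queries the external services and calls put with the fresh results; otherwise it returns the cached list immediately.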
The Report Agent limits the number of results included in the final report (defaulting to ten). This limitation ensures that report generation time remains bounded even when searches return large numbers of results. Users who need more comprehensive reports can increase this limit, accepting longer generation times in exchange for more complete coverage.
The system implements rate limiting to respect the limits imposed by external APIs. This rate limiting prevents the system from being blocked by services that detect and penalize excessive request rates. The rate limits are configurable, allowing adjustment based on the specific limits of each service.
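One common way to implement such a limit is a sliding window over recent call timestamps, matching the rate_limit_calls / rate_limit_window configuration parameters; this is a sketch, not the actual implementation:

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: allow at most `calls` requests per
    `window` seconds, blocking until a slot frees up."""

    def __init__(self, calls=10, window=60.0):
        self.calls, self.window = calls, window
        self._times = deque()

    def acquire(self):
        now = time.monotonic()
        while self._times and now - self._times[0] >= self.window:
            self._times.popleft()               # drop timestamps outside the window
        if len(self._times) >= self.calls:
            time.sleep(self.window - (now - self._times[0]))  # wait for oldest slot
            return self.acquire()
        self._times.append(time.monotonic())
```

Each agent calls acquire before an external request, so bursts are automatically smoothed to the configured rate.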
Memory usage is managed carefully, particularly when processing large documents. The Document Agent streams PDF content rather than loading entire files into memory. The Vector Store Agent processes documents in batches rather than all at once. These strategies prevent memory exhaustion when processing large research corpora.
The file system organization (separate directories for different topics, timestamped filenames) ensures that reports can be stored and retrieved efficiently even when thousands of reports have been generated. The directory structure prevents any single directory from containing an unwieldy number of files.
For users who need to scale beyond what a single machine can provide, the agent-based architecture provides a natural path to distributed deployment. Each agent could potentially be deployed as a separate service on a separate machine, communicating over a network. The Orchestrator Agent could distribute work across multiple instances of the Search Agent or Document Agent, enabling parallel processing of multiple queries or multiple documents.
However, for most use cases, a single-machine deployment is sufficient. The system can comfortably handle dozens of research queries per day on modest hardware (a laptop or desktop with a modern multi-core processor and 8-16 GB of RAM).
9. ERROR HANDLING AND RESILIENCE
Robust error handling is critical for a system like Reportr that interacts with multiple external services, any of which might be temporarily unavailable, rate-limited, or returning unexpected data.
Reportr employs a multi-layered error handling strategy. At the lowest level, individual operations that might fail (such as HTTP requests or PDF parsing) are wrapped in try-except blocks that catch exceptions and handle them gracefully. When an exception is caught, it is logged with detailed information about the context (what operation was being performed, what parameters were used, what error occurred), and the system decides whether to retry, skip the operation, or propagate the error.
For transient errors (such as network timeouts or temporary service unavailability), the system implements retry logic with exponential backoff. If an operation fails, it is retried after a short delay. If it fails again, it is retried after a longer delay. This pattern continues for a configurable number of retries. The exponential backoff prevents the system from hammering a service that is experiencing problems, giving it time to recover.
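The retry pattern described here can be sketched as follows (the set of exception types treated as transient is an illustrative assumption):

```python
import time

def with_retries(operation, attempts=3, base_delay=2.0):
    """Run `operation`, retrying transient failures with exponentially
    growing delays: base_delay, 2*base_delay, 4*base_delay, ..."""
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):      # transient errors only
            if attempt == attempts - 1:
                raise                                # out of retries: propagate
            time.sleep(base_delay * (2 ** attempt))
```

With the default configuration (retry_attempts: 3, retry_delay: 2), a failing request is retried after 2 seconds and again after 4 seconds before the error propagates.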
For errors that are unlikely to be resolved by retrying (such as a 404 Not Found error indicating that a URL does not exist), the system does not retry. Instead, it logs the error and continues processing other items. This fail-fast approach prevents the system from wasting time on operations that are unlikely to succeed.
At a higher level, each agent implements error handling appropriate to its function. The Search Agent, if it fails to retrieve results from one source, continues with other sources. The Document Agent, if it fails to download one PDF, continues with other PDFs. The Vector Store Agent, if it fails to generate an embedding for one document, continues with other documents. This graceful degradation ensures that partial failures do not cause complete system failure.
The Orchestrator Agent implements the highest level of error handling. If a critical error occurs (such as the Search Agent returning no results at all, or the LLM API being unavailable), the Orchestrator Agent catches the error, logs detailed information, and returns an error message to the user. For less critical errors (such as some documents failing to download), the Orchestrator Agent allows the workflow to continue and includes information about the partial failure in the report metadata.
All errors are logged using Python's logging module, which provides structured, configurable logging. Logs include timestamps, severity levels (DEBUG, INFO, WARNING, ERROR, CRITICAL), and contextual information. These logs are invaluable for diagnosing problems and understanding system behavior.
The system also implements validation at multiple points. When the Search Agent returns results, the Orchestrator Agent validates that the results are in the expected format and contain required fields. When the Document Agent extracts text from a PDF, it validates that the extracted text is substantive and coherent. When the LLM generates a summary, the Report Agent validates that the summary meets minimum quality standards. These validation steps catch errors early and prevent invalid data from propagating through the system.
For external API calls, the system implements timeout handling. Each HTTP request has a configurable timeout, ensuring that the system does not hang indefinitely waiting for a response from an unresponsive server. When a timeout occurs, it is treated as a transient error and handled according to the retry logic.
The system also handles rate limiting gracefully. When an API returns a rate limit error, the system logs the error, waits for an appropriate period (either as specified by the API's response headers or based on a default backoff schedule), and then retries the request. This approach ensures that the system respects API rate limits without requiring manual intervention.
10. FUTURE ENHANCEMENTS
While Reportr is a fully functional and useful system in its current form, there are numerous opportunities for future enhancements that could expand its capabilities and improve its performance.
One potential enhancement is the addition of more search sources. Currently, Reportr searches arXiv, Google Scholar, general web search, and Medium. Additional sources could include PubMed (for biomedical research), IEEE Xplore (for engineering and computer science), ACM Digital Library (for computer science), SSRN (for social sciences), bioRxiv (for biology preprints), and domain-specific databases. Each additional source would broaden the coverage and increase the likelihood of finding relevant information.
Another enhancement would be the implementation of more sophisticated query understanding. Currently, Reportr treats the user's query as a simple string and passes it directly to search APIs. A more sophisticated approach would analyze the query to identify key concepts, entities, and relationships, and then construct optimized search queries for each source. This could involve using natural language processing techniques to parse the query, identify named entities (such as people, organizations, or technologies), and expand the query with synonyms or related terms.
A third enhancement would be the implementation of citation network analysis. Many academic papers include citations to other papers. By analyzing these citation networks, Reportr could identify seminal papers that are frequently cited, discover related papers that cite the same sources, and trace the evolution of ideas over time. This would require accessing citation data, which is available from some sources (such as Google Scholar) but not others.
A fourth enhancement would be the implementation of question answering capabilities. Currently, Reportr generates a general research report on a topic. A question answering system would allow users to ask specific questions (such as "What is the state-of-the-art accuracy on ImageNet classification?") and receive specific answers extracted from the source documents. This would require more sophisticated natural language understanding and information extraction techniques.
A fifth enhancement would be the implementation of multi-lingual support. Currently, Reportr is designed primarily for English-language content. Supporting additional languages would require language detection, translation capabilities, and multi-lingual embedding models. This would greatly expand the range of content that Reportr could process.
A sixth enhancement would be the implementation of collaborative features. Currently, Reportr is designed for individual use. Collaborative features could include shared knowledge bases (so that multiple users can contribute to and benefit from a common vector database), shared reports (so that users can comment on and annotate reports), and team workflows (so that research tasks can be assigned and tracked).
A seventh enhancement would be the implementation of more sophisticated summarization techniques. Currently, Reportr uses a large language model to generate summaries, which works well but can be expensive and slow. Alternative approaches could include extractive summarization (selecting the most important sentences from source documents), abstractive summarization with smaller models, or hybrid approaches that combine multiple techniques.
An eighth enhancement would be the implementation of visualization capabilities. Currently, Reportr produces text-based reports. Visualizations could include citation networks (showing how papers cite each other), topic evolution over time (showing how research themes have changed), author collaboration networks (showing who works with whom), and concept maps (showing relationships between ideas).
A ninth enhancement would be the implementation of personalization. Currently, Reportr treats all users and all queries the same way. Personalization could involve learning user preferences (such as which sources they find most useful, which types of content they prefer, or which topics they are interested in) and tailoring the research process accordingly.
A tenth enhancement would be the implementation of active learning. Currently, Reportr does not learn from user feedback. An active learning system could ask users to rate the relevance of results, and use this feedback to improve future searches. Over time, the system could learn which sources are most reliable for which types of queries, which documents are most relevant, and which synthesis strategies produce the best reports.
11. CONCLUSION
Reportr represents a significant advancement in automated research assistance. By combining multiple specialized agents, each employing state-of-the-art techniques in their respective domains, the system is able to conduct comprehensive research that rivals the quality of manual research while requiring only a fraction of the time.
The architectural decisions that underpin Reportr reflect careful consideration of competing concerns. The agent-based architecture provides modularity, extensibility, and fault tolerance. The separation of concerns ensures that each agent has a clear, well-defined responsibility. The hybrid search strategy ensures comprehensive coverage across multiple sources and search modalities. The integration of vector databases enables semantic search capabilities that go beyond simple keyword matching. The asynchronous processing model ensures efficient use of resources. The configuration-driven design provides flexibility without requiring code changes. The LLM provider abstraction ensures that the system can adapt to the rapidly evolving landscape of language models.
The system is not without limitations. It depends on external services that may be unavailable or rate-limited. It cannot access content behind paywalls. Its summaries, while generally high quality, are not perfect and should be verified by human experts. Its coverage, while broad, is not exhaustive. However, within these limitations, Reportr provides genuine value to researchers, analysts, and knowledge workers who need to stay current with rapidly evolving fields.
Looking forward, there are numerous opportunities to enhance Reportr's capabilities. Additional search sources, more sophisticated query understanding, citation network analysis, question answering, multi-lingual support, collaborative features, advanced summarization, visualization, personalization, and active learning could all significantly expand what the system can do.
But even in its current form, Reportr demonstrates the power of well-designed multi-agent systems. By decomposing a complex task into manageable subtasks, assigning each subtask to a specialized agent, and carefully orchestrating the interactions between agents, it is possible to build systems that accomplish things that would be difficult or impossible for monolithic architectures.
The principles embodied in Reportr's architecture are applicable far beyond research automation. Any complex task that involves multiple distinct activities, each requiring different expertise or interacting with different external systems, is a candidate for an agent-based architecture. As artificial intelligence continues to advance, and as large language models become more capable and more accessible, we can expect to see many more systems that follow the architectural patterns pioneered by systems like Reportr.
In conclusion, Reportr is both a useful tool in its own right and a demonstration of architectural principles that will shape the next generation of AI-powered applications. It shows that by combining multiple AI technologies (search, natural language processing, embeddings, large language models) in a well-designed architecture, we can create systems that genuinely augment human capabilities and enable us to work more effectively in an information-rich world.
12. ADDENDUM: COMPLETE USER MANUAL
This addendum provides comprehensive instructions for installing, configuring, and using Reportr. It is intended for users who want to use the system to conduct research, as well as for administrators who need to deploy and maintain the system.
12.1 SYSTEM REQUIREMENTS
Reportr requires the following system resources and software:
Hardware Requirements:
- Processor: Modern multi-core processor (Intel Core i5 or equivalent or better)
- Memory: Minimum 8 GB RAM, 16 GB recommended for processing large documents
- Storage: Minimum 10 GB free disk space for the application, dependencies, and vector database (more space required for storing large numbers of reports)
- Network: Broadband internet connection for accessing external search APIs and downloading documents
Software Requirements:
- Operating System: Linux, macOS, or Windows (Linux or macOS recommended)
- Python: Version 3.8 or later (Python 3.9 or 3.10 recommended)
- pip: Python package installer (usually included with Python)
- Git: Version control system (for downloading the source code)
External Service Requirements:
- OpenAI API key (for using GPT models) OR Anthropic API key (for using Claude models). At least one is required for report generation.
- Internet access to arXiv, Google Scholar, DuckDuckGo, and Medium
12.2 INSTALLATION
Follow these steps to install Reportr on your system:
Step 1: Install Python
If Python is not already installed on your system, download and install it from the official Python website (https://www.python.org/downloads/). During installation, ensure that the option to add Python to your system PATH is selected.
To verify that Python is installed correctly, open a terminal or command prompt and type:
python --version
or:
python3 --version
You should see output indicating Python version 3.8 or later.
Step 2: Install Git
If Git is not already installed, download and install it from the official Git website (https://git-scm.com/downloads).
To verify that Git is installed correctly, type:
git --version
Step 3: Download Reportr
Open a terminal or command prompt and navigate to the directory where you want to install Reportr. Then clone the repository (or download the source code if it is not in a Git repository): git clone
If the source code is provided as a ZIP file, extract it and navigate to the extracted directory.
Step 4: Create a Virtual Environment (Recommended)
It is recommended to create a Python virtual environment to isolate Reportr's dependencies from other Python projects on your system. To create a virtual environment, type:
python -m venv venv
To activate the virtual environment:
- On Linux or macOS: source venv/bin/activate
- On Windows: venv\Scripts\activate
When the virtual environment is activated, your command prompt should show (venv) at the beginning of the line.
Step 5: Install Dependencies
With the virtual environment activated, install Reportr's dependencies using pip:
pip install -r requirements.txt
This command will download and install all the Python libraries that Reportr depends on. The installation may take several minutes.
Step 6: Obtain API Keys
Reportr requires an API key for at least one large language model provider (OpenAI or Anthropic).
To obtain an OpenAI API key:
- Go to https://platform.openai.com/
- Create an account or log in
- Navigate to the API keys section
- Create a new API key
- Copy the key (you will not be able to see it again)
To obtain an Anthropic API key:
- Go to https://www.anthropic.com/
- Create an account or log in
- Navigate to the API keys section
- Create a new API key
- Copy the key
Step 7: Configure API Keys
Create a file named .env in the Reportr directory and add your API key(s):
OPENAI_API_KEY=your_openai_key_here
or:
ANTHROPIC_API_KEY=your_anthropic_key_here
Replace "your_openai_key_here" or "your_anthropic_key_here" with your actual API key. Do not share this file or commit it to version control, as it contains sensitive credentials.
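For reference, the key-loading step amounts to parsing simple KEY=value lines into environment variables. The sketch below is a minimal standard-library version; real deployments often use the python-dotenv package instead, and the exact mechanism Reportr uses is an assumption:

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader: one KEY=value per line, '#' comments skipped.
    Existing environment variables are not overwritten."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

After loading, the LLM client libraries pick the keys up from os.environ in the usual way.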
Step 8: Verify Installation
To verify that Reportr is installed correctly, run:
python reportr.py --help
You should see a help message describing the available command-line options.
12.3 CONFIGURATION
Reportr's behavior is controlled by a configuration file named config.yaml. This file is located in the Reportr directory. You can edit this file with any text editor to customize the system's behavior.
The configuration file is organized into several sections:
System Configuration Section: This section controls general system behavior.
system_config:
log_level: 'INFO'
log_file: './logs/agentic_system.log'
max_concurrent_searches: 4
max_concurrent_downloads: 10
max_concurrent_processing: 4
request_timeout: 30
retry_attempts: 3
retry_delay: 2
cache_enabled: true
cache_ttl: 3600
cache_dir: './cache'
rate_limit_calls: 10
rate_limit_window: 60
reports_dir: './reports'
scheduled_reports_dir: './reports/scheduled'
single_reports_dir: './reports/single'
Explanation of system configuration parameters:
- log_level: Controls the verbosity of logging. Options are DEBUG, INFO, WARNING, ERROR, CRITICAL. INFO is recommended for normal use.
- log_file: Path to the log file where system messages are written.
- max_concurrent_searches: Maximum number of search sources to query simultaneously.
- max_concurrent_downloads: Maximum number of documents to download simultaneously.
- max_concurrent_processing: Maximum number of documents to process simultaneously.
- request_timeout: Timeout in seconds for HTTP requests.
- retry_attempts: Number of times to retry failed operations.
- retry_delay: Initial delay in seconds between retry attempts (increases exponentially).
- cache_enabled: Whether to cache search results.
- cache_ttl: Time-to-live for cached results in seconds.
- cache_dir: Directory where cache files are stored.
- rate_limit_calls: Maximum number of API calls per time window.
- rate_limit_window: Time window in seconds for rate limiting.
- reports_dir: Base directory where reports are saved.
- scheduled_reports_dir: Subdirectory for scheduled reports.
- single_reports_dir: Subdirectory for single-query reports.
Search Configuration Section: This section controls the Search Agent's behavior.
search_config:
sources:
- 'arxiv'
- 'scholar'
- 'web'
- 'medium'
max_results_per_source: 10
relevance_threshold: 0.5
enable_query_expansion: true
deduplication_similarity: 0.85
Explanation of search configuration parameters:
- sources: List of search sources to use. You can remove sources you do not want to search.
- max_results_per_source: Maximum number of results to retrieve from each source.
- relevance_threshold: Minimum relevance score for results to be included (0.0 to 1.0).
- enable_query_expansion: Whether to automatically expand queries with related terms.
- deduplication_similarity: Similarity threshold for identifying duplicate results (0.0 to 1.0).
Document Configuration Section: This section controls the Document Agent's behavior.
document_config:
fetch_full_text: true
max_pdf_size_mb: 50
max_webpage_size_mb: 10
pdf_extraction_timeout: 60
webpage_extraction_timeout: 30
user_agent: 'Reportr/1.0 Research Assistant'
Explanation of document configuration parameters:
- fetch_full_text: Whether to attempt to retrieve full text for documents.
- max_pdf_size_mb: Maximum size in megabytes for PDF downloads.
- max_webpage_size_mb: Maximum size in megabytes for webpage downloads.
- pdf_extraction_timeout: Timeout in seconds for PDF text extraction.
- webpage_extraction_timeout: Timeout in seconds for webpage text extraction.
- user_agent: User agent string to use for HTTP requests.
Vector Store Configuration Section: This section controls the Vector Store Agent's behavior.
vector_store_config:
collection_name: 'research_papers'
embedding_model: 'all-MiniLM-L6-v2'
chunk_size: 500
chunk_overlap: 100
similarity_metric: 'cosine'
persist_directory: './chroma_db'
Explanation of vector store configuration parameters:
- collection_name: Name of the ChromaDB collection to use.
- embedding_model: Name of the Sentence Transformers model to use for embeddings.
- chunk_size: Size of text chunks in words for long documents.
- chunk_overlap: Overlap between consecutive chunks in words.
- similarity_metric: Metric to use for similarity calculations (cosine, euclidean, or dot).
- persist_directory: Directory where the vector database is stored.
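The chunking behavior implied by chunk_size and chunk_overlap can be sketched as follows (the parameters are in words, as described above; the exact splitting code is an assumption):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split a document into overlapping word chunks: each chunk holds
    chunk_size words and repeats the last chunk_overlap words of its
    predecessor, so ideas spanning a boundary appear in both chunks."""
    words = text.split()
    step = chunk_size - chunk_overlap   # assumes chunk_size > chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the defaults, a 1,200-word document becomes three chunks starting at words 0, 400, and 800, each embedded and stored separately in ChromaDB.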
Report Configuration Section: This section controls the Report Agent's behavior.
report_config:
max_items: 10
relevance_threshold: 0.6
deduplication_similarity: 0.85
summary_max_tokens: 500
include_full_text_excerpts: true
excerpt_length: 500
Explanation of report configuration parameters:
- max_items: Maximum number of results to include in the final report.
- relevance_threshold: Minimum relevance score for results to be included in the report.
- deduplication_similarity: Similarity threshold for content-based deduplication.
- summary_max_tokens: Maximum number of tokens for the LLM-generated summary.
- include_full_text_excerpts: Whether to include excerpts from full text in the report.
- excerpt_length: Length of excerpts in characters.
LLM Configuration Section: This section controls which large language model provider to use.
llm_config:
provider: 'openai'
model: 'gpt-4'
temperature: 0.7
max_tokens: 2000
Explanation of LLM configuration parameters:
- provider: Which LLM provider to use. Options are 'openai', 'anthropic', 'ollama', or 'local'.
- model: Which specific model to use. For OpenAI, options include 'gpt-4', 'gpt-4-turbo', 'gpt-3.5-turbo'. For Anthropic, options include 'claude-3-opus', 'claude-3-sonnet', 'claude-3-haiku'.
- temperature: Controls randomness in LLM output (0.0 to 1.0). Lower values produce more deterministic output.
- max_tokens: Maximum number of tokens for LLM responses.
After modifying the configuration file, save it and restart Reportr for the changes to take effect.
12.4 BASIC USAGE
This section describes how to use Reportr to conduct research.
Conducting a Single Research Query: The most basic use of Reportr is to conduct research on a single topic. To do this, use the following command:
python reportr.py --query "your research topic here"
Replace "your research topic here" with the actual topic you want to research. For example:
python reportr.py --query "recent advances in quantum computing"
Reportr will then:
- Search multiple sources for relevant documents
- Download and process the documents
- Generate a comprehensive research report
- Save the report to the reports/single directory
The process typically takes two to five minutes. You will see progress messages in the terminal indicating what the system is doing.
When the process completes, you will see a message indicating where the report was saved. The report will be saved in two formats:
- A Markdown file (.md) for human reading
- A JSON file (.json) for machine processing
Viewing the Report: To view the Markdown report, you can use any text editor or Markdown viewer. On most systems, you can open the file with:
cat reports/single/<topic>/<filename>.md
Or open it in your preferred text editor.
The report will contain:
- A header with the research topic, query, timestamp, and metadata
- An executive summary synthesizing the key findings
- A list of identified trends
- Detailed findings for each source, including title, authors, publication date, abstract or excerpt, and URL
Specifying Output Directory: By default, reports are saved to the reports/single directory. You can specify a different output directory using the --output option:
python reportr.py --query "machine learning" --output /path/to/output
Adjusting Verbosity: By default, Reportr displays INFO-level messages. You can increase verbosity to see more detailed DEBUG-level messages:
python reportr.py --query "artificial intelligence" --verbose
Or you can reduce verbosity to see only WARNING-level and higher messages:
python reportr.py --query "neural networks" --quiet
12.5 ADVANCED USAGE
This section describes more advanced features of Reportr.
Scheduled Research: Reportr can automatically conduct research on specified topics at regular intervals. This is useful for staying current with rapidly evolving fields.
To schedule research, create a schedule configuration file (or add to the existing config.yaml):
schedule_config:
  enabled: true
  topics:
    - topic: "artificial intelligence news"
      frequency: "daily"
      time: "09:00"
    - topic: "quantum computing breakthroughs"
      frequency: "weekly"
      day: "Monday"
      time: "10:00"
Then run Reportr in scheduling mode:
python reportr.py --schedule
Reportr will run continuously, checking the schedule and conducting research at the specified times. Scheduled reports are saved to the reports/scheduled directory.
To stop the scheduler, press Ctrl+C.
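Internally, a scheduler like this has to compute the next run time for each topic entry. The sketch below is one way to do that for the daily and weekly frequencies shown in the example configuration; it is an illustration, not Reportr's actual scheduler code.

```python
from datetime import datetime, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def next_run(entry: dict, now: datetime) -> datetime:
    """Compute the next run time for a schedule entry (sketch only)."""
    hour, minute = map(int, entry["time"].split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if entry["frequency"] == "daily":
        # Today at the scheduled time, or tomorrow if that has passed.
        return candidate if candidate > now else candidate + timedelta(days=1)
    if entry["frequency"] == "weekly":
        # Advance to the requested weekday, then skip a week if needed.
        days_ahead = (WEEKDAYS.index(entry["day"]) - candidate.weekday()) % 7
        candidate += timedelta(days=days_ahead)
        return candidate if candidate > now else candidate + timedelta(days=7)
    raise ValueError(f"unsupported frequency: {entry['frequency']}")

now = datetime(2026, 2, 4, 12, 0)  # a Wednesday, noon
print(next_run({"frequency": "daily", "time": "09:00"}, now))
print(next_run({"frequency": "weekly", "day": "Monday", "time": "10:00"}, now))
```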
Customizing Search Sources: You can specify which search sources to use for a particular query using the --sources option:
python reportr.py --query "machine learning" --sources arxiv scholar
This will search only arXiv and Google Scholar, skipping web search and Medium.
Limiting Results: You can limit the number of results included in the report using the --max-results option:
python reportr.py --query "deep learning" --max-results 5
This will include only the top 5 most relevant results in the report.
Specifying LLM Provider: You can override the LLM provider specified in the configuration file using the --llm-provider option:
python reportr.py --query "neural networks" --llm-provider anthropic
This will use Anthropic's Claude models instead of OpenAI's GPT models or local LLMs.
Disabling Full Text Retrieval: If you want faster results and do not need full text content, you can disable full text retrieval:
python reportr.py --query "computer vision" --no-full-text
This will use only abstracts and snippets, which is faster but provides less detail.
Semantic Search: You can use the vector database to perform semantic search on previously researched topics:
python reportr.py --semantic-search "transformer architectures"
This will search the vector database for documents semantically similar to the query, without conducting new searches on external sources.
Exporting Reports: Reports are automatically saved in Markdown and JSON formats. You can convert the Markdown report to other formats using external tools. For example, to convert to PDF using pandoc:
pandoc reports/single/<topic>/<filename>.md -o report.pdf
To convert to HTML:
pandoc reports/single/<topic>/<filename>.md -o report.html
12.6 TROUBLESHOOTING
This section addresses common problems and their solutions.
Problem: "Module not found" error when running Reportr
Solution: Ensure that you have activated the virtual environment and installed all dependencies:
source venv/bin/activate (on Linux/macOS)
venv\Scripts\activate (on Windows)
pip install -r requirements.txt
Problem: "API key not found" error
Solution: Ensure that you have created a .env file with your API key. The file should contain:
OPENAI_API_KEY=your_key_here
or
ANTHROPIC_API_KEY=your_key_here
Make sure there are no extra spaces or quotes around the key.
Problem: "Rate limit exceeded" error Solution: You are making too many requests to an external API. Wait a few minutes and try again. If the problem persists, increase the rate_limit_window parameter in config.yaml or decrease the rate_limit_calls parameter.
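Using the parameter names from the configuration file reference (section 12.10), a more conservative rate-limit setting in config.yaml might look like this (the specific values are illustrative, not defaults):

```yaml
system_config:
  rate_limit_calls: 5     # fewer API calls per window (decrease to slow down)
  rate_limit_window: 60   # window length in seconds (increase to slow down)
```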
Problem: No results found for query Solution: Try broadening your query or checking that the search sources are enabled in config.yaml. Some queries may be too specific or may use terminology that does not appear in the indexed sources.
Problem: PDF download or extraction fails Solution: Some PDFs may be behind paywalls, have broken links, or be in formats that are difficult to extract. Reportr will log warnings for failed downloads and continue with other documents. Check the log file for details.
Problem: Slow performance Solution: Several factors can affect performance:
- Reduce max_results_per_source in config.yaml to retrieve fewer results
- Disable full text retrieval using --no-full-text
- Reduce max_concurrent_downloads if your internet connection is slow
- Use a faster LLM model (e.g., gpt-3.5-turbo instead of gpt-4)
Problem: Out of memory error Solution: Reportr may run out of memory when processing very large documents or large numbers of documents. Solutions:
- Reduce max_pdf_size_mb in config.yaml to skip very large PDFs
- Reduce max_results_per_source to process fewer documents
- Increase your system's available RAM
- Close other applications to free up memory
Problem: ChromaDB errors Solution: If you encounter errors related to the vector database, try deleting the database directory (specified by persist_directory in config.yaml) and letting Reportr recreate it. Note that this will delete all previously indexed documents.
Problem: LLM generates poor quality summaries Solution: Try adjusting the temperature parameter in config.yaml. Lower values (e.g., 0.3) produce more focused, deterministic output. Higher values (e.g., 0.9) produce more creative but potentially less accurate output. You can also try a different model.
12.7 BEST PRACTICES
This section provides recommendations for getting the best results from Reportr.
Crafting Effective Queries:
- Be specific but not overly narrow. "transformer architectures for NLP" works better than the overly broad "transformers" or the overly narrow "attention mechanisms in transformer architectures for natural language processing tasks".
- Include key terminology from your field. This helps the search engines find relevant academic papers.
- Avoid very common words that might dilute the search. "Recent advances in X" is often better than "What is X".
- If searching for a specific paper or author, include that information in the query.
Interpreting Results:
- Always verify important claims by checking the original sources. The LLM-generated summary is generally accurate but should not be blindly trusted.
- Pay attention to publication dates. Older papers may not reflect current understanding.
- Consider the source. Papers from arXiv are preprints and may not have been peer-reviewed. Blog posts may represent individual opinions rather than consensus.
- Look for agreement across multiple sources. Claims that appear in multiple independent sources are more likely to be reliable.
Managing the Knowledge Base:
- The vector database grows over time as you conduct more research. This is generally beneficial, but very large databases may slow down searches.
- Periodically review and clean up the database by deleting old or irrelevant entries.
- Consider using separate collections for different research projects to keep them organized.
Optimizing Performance:
- For quick exploratory research, disable full text retrieval and reduce the number of results.
- For comprehensive research, enable full text retrieval and increase the number of results.
- Use scheduled research for topics you need to monitor continuously, rather than running manual queries repeatedly.
Staying Within API Limits:
- Be mindful of API rate limits and costs, especially when using paid services like OpenAI or Anthropic.
- Use caching to avoid redundant API calls for the same query.
- Consider using less expensive models (e.g., GPT-3.5 instead of GPT-4) for routine research.
Organizing Reports:
- Reports are automatically organized by topic in separate directories. Use descriptive, consistent topic names to keep reports organized.
- The timestamp in the filename allows you to track how knowledge on a topic evolves over time.
- Consider creating a naming convention for related queries (e.g., "ML-transformers", "ML-CNNs", "ML-RNNs") to group related reports.
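If you adopt such a convention, a small helper can normalize topic strings into consistent directory-friendly names before you use them as queries or labels. This is a hypothetical utility, not part of Reportr itself:

```python
import re

def topic_dirname(topic: str) -> str:
    """Normalize a topic string into a consistent directory name.
    Hypothetical helper -- Reportr's own naming logic may differ."""
    slug = topic.strip().lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug).strip("-")
    return slug

print(topic_dirname("ML - Transformers"))   # -> ml-transformers
print(topic_dirname("Quantum Computing!"))  # -> quantum-computing
```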
12.8 FREQUENTLY ASKED QUESTIONS
Q: How much does it cost to use Reportr? A: Reportr itself is free and open source. Using a cloud provider requires an API key for either OpenAI or Anthropic, which are paid services. The cost per research query depends on the model used and the amount of text processed: a typical query using GPT-3.5-turbo costs a few cents, while GPT-4 is more expensive, typically 10-50 cents per query. Check the pricing pages of OpenAI or Anthropic for current rates. Local LLMs via Ollama or llama.cpp (see the addendum) have no per-query cost.
Q: Can I use Reportr without an internet connection? A: Not for conducting new research: Reportr needs internet access to search external sources and download documents. However, once documents are indexed in the vector database, you can perform semantic searches on them offline, and if you use a local LLM (see the addendum), summarization can also run without internet access.
Q: How accurate are the generated summaries? A: The summaries are generally accurate and useful, but they are not perfect. Large language models can occasionally make mistakes, misinterpret information, or generate plausible-sounding but incorrect statements. Always verify important claims by checking the original sources.
Q: Can I use Reportr for commercial purposes? A: This depends on the license under which Reportr is distributed. Check the LICENSE file in the Reportr directory. Also be aware that the terms of service for the external APIs (OpenAI, Anthropic, arXiv, Google Scholar, etc.) may impose restrictions on commercial use.
Q: How do I update Reportr to a newer version? A: If Reportr is in a Git repository, you can update by running:
git pull
pip install -r requirements.txt
If you downloaded Reportr as a ZIP file, download the new version and replace the old files.
Q: Can I run multiple instances of Reportr simultaneously? A: Yes, but be careful about rate limits and resource usage. Each instance will make API calls and consume memory. Ensure that your total usage stays within the rate limits of the external services.
Q: How do I contribute to Reportr development? A: If Reportr is an open source project, check the repository for contribution guidelines. Typically, you would fork the repository, make your changes, and submit a pull request.
Q: What should I do if I find a bug? A: Check the issue tracker (if the project has one) to see if the bug has already been reported. If not, create a new issue with a detailed description of the bug, steps to reproduce it, and any relevant error messages or log files.
Q: Can Reportr access paywalled content? A: No. Reportr can only access publicly available content. If a paper is behind a paywall, Reportr will not be able to retrieve the full text, though it may still be able to retrieve the abstract.
Q: How long are reports stored? A: Reports are stored indefinitely unless you manually delete them. They are saved as files in the reports directory and will remain there until you remove them.
Q: Can I customize the report format? A: Yes. The report format is controlled by the render_report_markdown method in the Report Agent. You can modify this method to change the structure, styling, or content of the reports. You can also create additional rendering methods for other formats (e.g., HTML, PDF, LaTeX).
12.9 COMMAND-LINE REFERENCE
This section provides a complete reference of all command-line options.
Basic Syntax: python reportr.py [OPTIONS]
Options:
--query QUERY, -q QUERY Conduct research on the specified topic. This is the primary way to use Reportr for single-query research. Example: python reportr.py --query "machine learning"
--schedule, -s Run Reportr in scheduling mode, conducting research on topics specified in the schedule configuration at regular intervals. Example: python reportr.py --schedule
--semantic-search QUERY Perform semantic search on the vector database without conducting new searches on external sources. Example: python reportr.py --semantic-search "neural networks"
--sources SOURCE [SOURCE ...] Specify which search sources to use. Options are: arxiv, scholar, web, medium. Multiple sources can be specified separated by spaces. Example: python reportr.py --query "AI" --sources arxiv scholar
--max-results N Limit the number of results included in the final report to N. Example: python reportr.py --query "AI" --max-results 5
--output PATH, -o PATH Specify the directory where reports should be saved. Example: python reportr.py --query "AI" --output /path/to/output
--llm-provider PROVIDER Specify which LLM provider to use. Options are: openai, anthropic, ollama, local (using llama.cpp). Example: python reportr.py --query "AI" --llm-provider anthropic
--llm-model MODEL Specify which LLM model to use. The available models depend on the provider. Example: python reportr.py --query "AI" --llm-model gpt-3.5-turbo
--no-full-text Disable full text retrieval. This makes research faster but provides less detail. Example: python reportr.py --query "AI" --no-full-text
--config PATH, -c PATH Specify a custom configuration file instead of the default config.yaml. Example: python reportr.py --query "AI" --config custom_config.yaml
--verbose, -v Increase logging verbosity to DEBUG level. Example: python reportr.py --query "AI" --verbose
--quiet Decrease logging verbosity to WARNING level. Example: python reportr.py --query "AI" --quiet
--version Display the version number of Reportr and exit. Example: python reportr.py --version
--help, -h Display help message describing all command-line options and exit. Example: python reportr.py --help
12.10 CONFIGURATION FILE REFERENCE
This section provides a complete reference of all configuration parameters.
System Configuration (system_config):
- log_level: Logging verbosity (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- log_file: Path to log file
- max_concurrent_searches: Maximum concurrent search operations
- max_concurrent_downloads: Maximum concurrent document downloads
- max_concurrent_processing: Maximum concurrent document processing operations
- request_timeout: HTTP request timeout in seconds
- retry_attempts: Number of retry attempts for failed operations
- retry_delay: Initial delay between retries in seconds
- cache_enabled: Whether to enable result caching (true/false)
- cache_ttl: Cache time-to-live in seconds
- cache_dir: Directory for cache storage
- rate_limit_calls: Maximum API calls per time window
- rate_limit_window: Time window for rate limiting in seconds
- reports_dir: Base directory for report storage
- scheduled_reports_dir: Subdirectory for scheduled reports
- single_reports_dir: Subdirectory for single-query reports
Search Configuration (search_config):
- sources: List of search sources to use (arxiv, scholar, web, medium)
- max_results_per_source: Maximum results to retrieve from each source
- relevance_threshold: Minimum relevance score (0.0 to 1.0)
- enable_query_expansion: Whether to expand queries (true/false)
- deduplication_similarity: Similarity threshold for deduplication (0.0 to 1.0)
Document Configuration (document_config):
- fetch_full_text: Whether to retrieve full text (true/false)
- max_pdf_size_mb: Maximum PDF size in megabytes
- max_webpage_size_mb: Maximum webpage size in megabytes
- pdf_extraction_timeout: PDF extraction timeout in seconds
- webpage_extraction_timeout: Webpage extraction timeout in seconds
- user_agent: User agent string for HTTP requests
Vector Store Configuration (vector_store_config):
- collection_name: Name of ChromaDB collection
- embedding_model: Sentence Transformers model name
- chunk_size: Text chunk size in words
- chunk_overlap: Overlap between chunks in words
- similarity_metric: Similarity metric (cosine, euclidean, dot)
- persist_directory: Directory for vector database storage
Report Configuration (report_config):
- max_items: Maximum results in final report
- relevance_threshold: Minimum relevance for inclusion (0.0 to 1.0)
- deduplication_similarity: Content similarity threshold (0.0 to 1.0)
- summary_max_tokens: Maximum tokens for LLM summary
- include_full_text_excerpts: Whether to include excerpts (true/false)
- excerpt_length: Length of excerpts in characters
LLM Configuration (llm_config):
- provider: LLM provider (openai, anthropic, ollama, local)
- model: Model name (e.g., gpt-4, claude-3-opus)
- temperature: Randomness parameter (0.0 to 1.0)
- max_tokens: Maximum tokens for LLM responses
Schedule Configuration (schedule_config):
- enabled: Whether scheduling is enabled (true/false)
- topics: List of scheduled research topics, each with:
  - topic: Research topic string
  - frequency: Frequency (daily, weekly, monthly)
  - time: Time in HH:MM format
  - day: Day of week (for weekly) or day of month (for monthly)
12.11 SUPPORT AND RESOURCES
For additional support and resources:
Documentation:
- This user manual provides comprehensive information about installation, configuration, and usage.
- The source code includes inline comments explaining implementation details.
- The configuration file includes comments describing each parameter.
Community Support:
- Check the project repository for issues, discussions, and updates.
- Search for existing issues before creating new ones.
- Provide detailed information when reporting bugs or requesting features.
Professional Support:
- For enterprise deployments or custom development, contact the author or maintainers.
Updates:
- Check the repository regularly for updates and new features.
- Subscribe to release notifications if available.
Learning Resources:
- The architecture section of this booklet provides in-depth explanation of the system's design.
- The source code serves as a reference implementation of multi-agent systems.
Contact Information:
- Author of this Booklet: Michael Stal
- Date: February 2026
- Project: Reportr - Multi-Agent AI Research System
13. ADDENDUM: LOCAL LLM SUPPORT WITH LLAMA.CPP AND OLLAMA EXTENDING REPORTR FOR OFFLINE AND COST-FREE OPERATION
1. INTRODUCTION TO LOCAL LLM SUPPORT
The original implementation of Reportr relied exclusively on cloud-based large language model providers, specifically OpenAI's GPT models and Anthropic's Claude models. While these cloud-based solutions offer exceptional quality and convenience, they present several significant limitations.
First, cloud-based LLMs require continuous internet connectivity. Second, they incur ongoing operational costs with each API call. Third, they raise privacy concerns as your research data is transmitted to external servers. Fourth, they create dependency on external services that may experience downtime or pricing changes.
To address these limitations, Reportr now supports local LLM execution using two popular frameworks: llama.cpp and Ollama. These frameworks enable running sophisticated language models entirely on your local hardware, without requiring internet connectivity or incurring per-use costs.
Benefits of Local LLMs:
- Complete privacy: All data processing occurs on your hardware
- Zero per-query costs: Unlimited usage after initial hardware investment
- Offline operation: No internet connection required
- Data sovereignty: Full control over your research data
- Regulatory compliance: Easier to meet HIPAA, GDPR, and other requirements
- Customization: Ability to fine-tune models for specific domains
Trade-offs:
- Lower quality compared to GPT-4 or Claude 3 Opus
- Requires capable hardware (preferably with GPU)
- Slower inference speed on consumer hardware
- Initial setup complexity
- Model management overhead
2. INSTALLING AND CONFIGURING OLLAMA
Ollama is the recommended option for most users because it provides the easiest setup and management experience. Ollama handles model downloading, version management, and provides a simple API interface.
2.1 INSTALLING OLLAMA
Installation on Linux: Open a terminal and run the following command:
curl -fsSL https://ollama.ai/install.sh | sh
This script will download and install Ollama on your system.
Installation on macOS: Download the Ollama application from the official website:
https://ollama.ai/download
Open the downloaded DMG file and drag Ollama to your Applications folder. Then launch Ollama from your Applications.
Installation on Windows: Download the Ollama installer from the official website:
https://ollama.ai/download
Run the installer and follow the installation wizard. Ollama will be installed as a Windows service that starts automatically.
Verifying Installation: After installation, verify that Ollama is running by opening a terminal and typing:
ollama --version
You should see the version number displayed.
2.2 DOWNLOADING MODELS WITH OLLAMA
Ollama makes it extremely easy to download and manage models. The following models are recommended for use with Reportr:
For Best Quality (Requires 16GB+ RAM): Download the Mixtral 8x7B model, which provides excellent quality comparable to GPT-3.5:
ollama pull mixtral:8x7b
For Balanced Performance (Requires 8GB+ RAM): Download the Llama 2 13B model, which provides good quality with moderate resource requirements:
ollama pull llama2:13b
For Lower-End Hardware (Requires 4GB+ RAM): Download the Llama 2 7B model, which can run on more modest hardware:
ollama pull llama2:7b
For Fastest Performance (Requires 2GB+ RAM): Download the Mistral 7B model, which is optimized for speed:
ollama pull mistral:7b
For Code and Technical Content: Download the CodeLlama model, which is optimized for technical and programming content:
ollama pull codellama:13b
Listing Downloaded Models: To see which models you have downloaded, run:
ollama list
This will display all locally available models with their sizes and last modified dates.
Testing a Model: To verify that a model works correctly, you can test it interactively:
ollama run llama2:13b
This will start an interactive chat session. Type a message and press Enter to see the model's response. Type "/bye" to exit.
2.3 STARTING OLLAMA SERVER
Ollama runs as a background service that provides an HTTP API. On most systems, the service starts automatically after installation.
Checking if Ollama is Running: To verify that the Ollama service is running, use:
curl http://localhost:11434/api/tags
If Ollama is running, this will return a JSON response listing available models.
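The same check can be performed programmatically. The Python sketch below (an illustration, not part of Reportr) probes the /api/tags endpoint with only the standard library and reports whether a server answered:

```python
import json
import urllib.error
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # a parseable JSON body means the API responded
            return True
    except (urllib.error.URLError, ValueError, OSError):
        return False

print(ollama_is_running())
```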
Starting Ollama Manually (if needed): If Ollama is not running, you can start it manually:
On Linux: ollama serve
On macOS: Open the Ollama application from Applications
On Windows: Start the "Ollama" service from the Services control panel
The Ollama server listens on port 11434 by default. This port is used by Reportr to communicate with Ollama.
2.4 CONFIGURING REPORTR TO USE OLLAMA
To configure Reportr to use Ollama instead of cloud-based LLMs, you need to modify the configuration file.
Step 1: Open the Configuration File Open config.yaml in your text editor:
nano config.yaml
or use any text editor you prefer.
Step 2: Modify the LLM Configuration Section Find the llm_config section and modify it as follows:
llm_config:
  provider: 'ollama'
  model: 'llama2:13b'
  base_url: 'http://localhost:11434'
  temperature: 0.7
  max_tokens: 2000
Explanation of parameters:
- provider: Set to 'ollama' to use Ollama instead of OpenAI or Anthropic
- model: The name of the Ollama model to use (must match a downloaded model)
- base_url: The URL where Ollama is running (default is localhost:11434)
- temperature: Controls randomness (0.0 = deterministic, 1.0 = very random)
- max_tokens: Maximum length of generated responses
Step 3: Save the Configuration File Save the file and exit your text editor.
Step 4: Test the Configuration Run a simple research query to verify that Reportr can communicate with Ollama:
python reportr.py --query "test query about artificial intelligence"
If everything is configured correctly, Reportr will use Ollama to generate the research report. You should see log messages indicating that it is connecting to the local Ollama server.
2.5 ADVANCED OLLAMA CONFIGURATION
Customizing Model Behavior with Modelfiles: Ollama supports Modelfiles, which allow you to customize how a model behaves. You can create a Modelfile to add custom system prompts or adjust parameters.
Create a file named Modelfile with the following content:
FROM llama2:13b
SYSTEM """You are a research assistant helping to synthesize information
from academic papers and technical articles. Provide clear, accurate, and
well-structured summaries. Focus on key findings and important insights."""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
Then create a custom model from this Modelfile:
ollama create reportr-assistant -f Modelfile
Now you can use this custom model in Reportr by setting:
llm_config:
  provider: 'ollama'
  model: 'reportr-assistant'
Using Multiple Models: You can configure Reportr to use different models for different tasks. For example, you might use a larger model for generating summaries and a smaller, faster model for extracting key findings.
This requires modifying the Reportr source code to support model selection per task, but the configuration would look like:
llm_config:
  provider: 'ollama'
  models:
    summary: 'mixtral:8x7b'
    findings: 'mistral:7b'
    trends: 'llama2:7b'
Remote Ollama Servers: If you have Ollama running on a different machine (for example, a powerful server with GPUs), you can configure Reportr to use it by changing the base_url:
llm_config:
  provider: 'ollama'
  model: 'mixtral:8x7b'
  base_url: 'http://192.168.1.100:11434'
Replace the IP address with the address of your Ollama server.
3. INSTALLING AND CONFIGURING LLAMA.CPP
Llama.cpp provides more control and flexibility than Ollama, but requires more manual setup. It is recommended for advanced users who want fine-grained control over model execution or who need to use models not available in Ollama.
3.1 INSTALLING LLAMA.CPP
Installation on Linux: First, install the required build tools:
sudo apt-get update
sudo apt-get install build-essential git cmake
Clone the llama.cpp repository:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Build llama.cpp:
make
For GPU acceleration with CUDA (if you have an NVIDIA GPU):
make LLAMA_CUBLAS=1
For GPU acceleration with Metal (on macOS with Apple Silicon):
make LLAMA_METAL=1
Installation on macOS: Install Xcode Command Line Tools if not already installed:
xcode-select --install
Clone and build llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
For Apple Silicon Macs, Metal acceleration is automatically enabled.
Installation on Windows: The easiest way to build llama.cpp on Windows is using CMake and Visual Studio.
Install Visual Studio 2019 or later with C++ development tools. Install CMake from https://cmake.org/download/
Clone the repository:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Build using CMake:
mkdir build
cd build
cmake ..
cmake --build . --config Release
Verifying Installation: After building, verify that the main executable works:
./main --help
You should see the help message with available options.
3.2 DOWNLOADING MODELS FOR LLAMA.CPP
Llama.cpp uses GGUF format models. These can be downloaded from Hugging Face.
Recommended Model Sources: The following Hugging Face repositories provide high-quality GGUF models:
For Llama 2 models: https://huggingface.co/TheBloke/Llama-2-13B-GGUF
For Mistral models: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
For Mixtral models: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
Downloading a Model: Navigate to the model repository on Hugging Face and download the GGUF file. Choose a quantization level based on your available RAM:
For 16GB+ RAM: Download the Q5_K_M or Q6_K variant
For 8-16GB RAM: Download the Q4_K_M variant
For 4-8GB RAM: Download the Q3_K_M variant
For example, to download Llama 2 13B Q4_K_M:
wget https://huggingface.co/TheBloke/Llama-2-13B-GGUF/resolve/main/llama-2-13b.Q4_K_M.gguf
Place the downloaded model file in a directory such as:
mkdir -p ~/llama-models
mv llama-2-13b.Q4_K_M.gguf ~/llama-models/
3.3 RUNNING LLAMA.CPP SERVER
Llama.cpp includes a server mode that provides an HTTP API compatible with the OpenAI API specification.
Starting the Server: Navigate to the llama.cpp directory and run:
./server -m ~/llama-models/llama-2-13b.Q4_K_M.gguf -c 4096 --host 0.0.0.0 --port 8080
Explanation of parameters:
- -m: Path to the model file
- -c: Context size (number of tokens the model can consider)
- --host: Network interface to bind to (0.0.0.0 means all interfaces)
- --port: Port number to listen on
For GPU acceleration (if built with CUDA):
./server -m ~/llama-models/llama-2-13b.Q4_K_M.gguf -c 4096 --host 0.0.0.0 --port 8080 -ngl 32
The -ngl parameter specifies how many layers to offload to the GPU. Higher values use more GPU memory but provide faster inference.
Testing the Server: Verify that the server is running by making a test request:
curl http://localhost:8080/v1/models
This should return information about the loaded model.
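Because the API is OpenAI-compatible, any OpenAI-style client can talk to it. The stdlib-only Python sketch below builds a chat-completion request in that format; it is an illustration rather than Reportr's provider code, and for llama.cpp the model name in the payload is informational since the server already has a model loaded.

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for a llama.cpp server."""
    payload = {
        "model": "llama-2-13b",  # informational: the server uses its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8080", "Summarize attention mechanisms.")
print(req.full_url)
# Sending the request (requires a running server):
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```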
Running as a Background Service: To run the llama.cpp server as a background service that starts automatically, you can create a systemd service file on Linux.
Create a file at /etc/systemd/system/llama-cpp.service:
[Unit]
Description=Llama.cpp Server
After=network.target
[Service]
Type=simple
User=your-username
WorkingDirectory=/path/to/llama.cpp
ExecStart=/path/to/llama.cpp/server -m /path/to/model.gguf -c 4096 --host 0.0.0.0 --port 8080
Restart=always
[Install]
WantedBy=multi-user.target
Replace the paths and username with your actual values.
Enable and start the service:
sudo systemctl enable llama-cpp
sudo systemctl start llama-cpp
Check the status:
sudo systemctl status llama-cpp
3.4 CONFIGURING REPORTR TO USE LLAMA.CPP
Configuring Reportr to use llama.cpp is similar to configuring it for Ollama.
Step 1: Open the Configuration File Open config.yaml in your text editor:
nano config.yaml
Step 2: Modify the LLM Configuration Section Find the llm_config section and modify it as follows:
llm_config:
  provider: 'llamacpp'
  model: 'llama-2-13b'
  base_url: 'http://localhost:8080/v1'
  temperature: 0.7
  max_tokens: 2000
Explanation of parameters:
- provider: Set to 'llamacpp' to use llama.cpp
- model: A descriptive name for the model (used for logging)
- base_url: The URL where llama.cpp server is running, including /v1 path
- temperature: Controls randomness in generation
- max_tokens: Maximum length of generated responses
Step 3: Save and Test Save the configuration file and test with a query:
python reportr.py --query "artificial intelligence trends"
Reportr will now use your local llama.cpp server for all LLM operations.
4. CONFIGURING REPORTR FOR LOCAL LLMs
This section provides the complete configuration details for using local LLMs with Reportr.
4.1 PROVIDER IMPLEMENTATION
Reportr includes built-in support for Ollama and llama.cpp through provider implementations. The system automatically detects which provider you have configured and uses the appropriate API.
Complete Configuration Example for Ollama:
llm_config:
  provider: 'ollama'
  model: 'llama2:13b'
  base_url: 'http://localhost:11434'
  temperature: 0.7
  max_tokens: 2000
  top_p: 0.9
  top_k: 40
  repeat_penalty: 1.1
  num_ctx: 4096
  num_predict: 2000
Additional Ollama-specific parameters:
- top_p: Nucleus sampling parameter (0.0 to 1.0)
- top_k: Top-k sampling parameter
- repeat_penalty: Penalty for repeating tokens (1.0 = no penalty)
- num_ctx: Context window size
- num_predict: Maximum tokens to generate
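These parameters map onto the "options" object of Ollama's /api/generate endpoint. The stdlib-only sketch below shows that mapping; it is an illustration of the request shape, not Reportr's actual provider implementation.

```python
import json
import urllib.request

def build_ollama_request(cfg: dict, prompt: str) -> urllib.request.Request:
    """Map an llm_config-style dict onto an Ollama /api/generate request."""
    payload = {
        "model": cfg["model"],
        "prompt": prompt,
        "stream": False,  # return one complete response instead of a stream
        "options": {
            "temperature": cfg.get("temperature", 0.7),
            "top_p": cfg.get("top_p", 0.9),
            "top_k": cfg.get("top_k", 40),
            "repeat_penalty": cfg.get("repeat_penalty", 1.1),
            "num_ctx": cfg.get("num_ctx", 4096),
            "num_predict": cfg.get("num_predict", 2000),
        },
    }
    return urllib.request.Request(
        f"{cfg['base_url']}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

cfg = {"model": "llama2:13b", "base_url": "http://localhost:11434"}
req = build_ollama_request(cfg, "Summarize the key findings.")
print(req.full_url)
```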
Complete Configuration Example for Llama.cpp:
llm_config:
provider: 'llamacpp'
model: 'llama-2-13b-q4'
base_url: 'http://localhost:8080/v1'
temperature: 0.7
max_tokens: 2000
top_p: 0.9
top_k: 40
repeat_penalty: 1.1
presence_penalty: 0.0
frequency_penalty: 0.0
Additional llama.cpp-specific parameters:
- presence_penalty: Penalty for using tokens that have appeared (0.0 to 2.0)
- frequency_penalty: Penalty based on token frequency (0.0 to 2.0)
4.2 ENVIRONMENT VARIABLES
Instead of storing configuration in config.yaml, you can use environment variables. This is useful for containerized deployments or when you want to switch between different configurations easily.
For Ollama:
export LLM_PROVIDER=ollama
export LLM_MODEL=llama2:13b
export LLM_BASE_URL=http://localhost:11434
export LLM_TEMPERATURE=0.7
For Llama.cpp:
export LLM_PROVIDER=llamacpp
export LLM_MODEL=llama-2-13b
export LLM_BASE_URL=http://localhost:8080/v1
export LLM_TEMPERATURE=0.7
Reportr will automatically use these environment variables if they are set, overriding the values in config.yaml.
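The override order described above amounts to a small merge step: values found in the environment win over values read from config.yaml. A sketch of that precedence rule, using the variable names listed above (the function name is illustrative, not Reportr's actual code):

```python
import os

# Environment variable -> config.yaml key
ENV_KEYS = {
    "LLM_PROVIDER": "provider",
    "LLM_MODEL": "model",
    "LLM_BASE_URL": "base_url",
    "LLM_TEMPERATURE": "temperature",
}

def resolve_llm_config(file_cfg, environ=None):
    """Merge config.yaml values with environment variables; env vars win."""
    environ = os.environ if environ is None else environ
    cfg = dict(file_cfg)
    for env_key, cfg_key in ENV_KEYS.items():
        if env_key in environ:
            value = environ[env_key]
            # temperature is numeric; everything else stays a string
            cfg[cfg_key] = float(value) if cfg_key == "temperature" else value
    return cfg
```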
4.3 HYBRID CONFIGURATION
You can configure Reportr to use different providers for different purposes. For example, you might use a local model for most queries but fall back to a cloud model for particularly complex queries.
Example hybrid configuration:
llm_config:
primary_provider: 'ollama'
fallback_provider: 'openai'
ollama:
model: 'llama2:13b'
base_url: 'http://localhost:11434'
temperature: 0.7
max_tokens: 2000
openai:
model: 'gpt-4'
temperature: 0.7
max_tokens: 2000
With this configuration, Reportr will first try to use Ollama. If Ollama is unavailable or returns an error, it will automatically fall back to OpenAI.
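The fallback behavior can be sketched as a try/except around the primary provider. The provider functions below are stand-ins for Reportr's internal provider classes, which this booklet does not reproduce:

```python
def generate_with_fallback(primary, fallback, prompt):
    """Try the primary provider first; on any error, use the fallback."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)

def unavailable_ollama(prompt):
    # Stand-in for a local provider that is down
    raise ConnectionError("Ollama is not running")

def cloud_stub(prompt):
    # Stand-in for the configured cloud fallback
    return "summary from cloud model"
```

A production version would typically log the primary failure and distinguish retryable errors (connection refused) from permanent ones (invalid model name), but the control flow is the same.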
You can also configure different providers for different tasks:
llm_config:
providers:
summary: 'ollama'
findings: 'ollama'
complex_analysis: 'openai'
ollama:
model: 'mixtral:8x7b'
base_url: 'http://localhost:11434'
openai:
model: 'gpt-4'
This configuration uses local models for routine tasks but uses GPT-4 for complex analysis tasks that benefit from the highest quality model.
5. MODEL SELECTION AND RECOMMENDATIONS
Choosing the right model depends on your hardware capabilities, quality requirements, and use case.
5.1 MODEL SIZE GUIDELINES
Model Parameter Counts and Memory Requirements:
7 Billion Parameter Models:
- Full precision (FP16): ~14 GB RAM
- Q8 quantization: ~7 GB RAM
- Q4 quantization: ~4 GB RAM
- Q3 quantization: ~3 GB RAM
13 Billion Parameter Models:
- Full precision (FP16): ~26 GB RAM
- Q8 quantization: ~13 GB RAM
- Q4 quantization: ~7 GB RAM
- Q3 quantization: ~5 GB RAM
34 Billion Parameter Models:
- Full precision (FP16): ~68 GB RAM
- Q8 quantization: ~34 GB RAM
- Q4 quantization: ~18 GB RAM
- Q3 quantization: ~13 GB RAM
70 Billion Parameter Models:
- Full precision (FP16): ~140 GB RAM
- Q8 quantization: ~70 GB RAM
- Q4 quantization: ~35 GB RAM
- Q3 quantization: ~26 GB RAM
Mixture of Experts (Mixtral 8x7B):
- Full precision (FP16): ~90 GB RAM
- Q8 quantization: ~45 GB RAM
- Q4 quantization: ~24 GB RAM
- Q3 quantization: ~18 GB RAM
These are approximate values. Actual memory usage depends on context length and other factors.
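The figures above follow a simple rule of thumb: weight memory is roughly (parameter count) x (bits per weight) / 8. A small estimator under that assumption; note that real GGUF quantization schemes such as Q4_K_M use slightly more than the nominal bit width, so actual files run a little larger:

```python
# Nominal bits per weight for the formats in the table above
BITS_PER_WEIGHT = {"fp16": 16, "q8": 8, "q4": 4, "q3": 3}

def estimate_model_ram_gb(params_billion, quant):
    """Rough weight-memory estimate in GB: params * bits / 8."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8
```

For example, a 13B model at Q4 comes out to about 6.5 GB of weights, matching the ~7 GB figure above once runtime overhead is added.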
5.2 RECOMMENDED MODELS BY HARDWARE
For Systems with 4-8 GB Available RAM:
Best option: Mistral 7B Q4_K_M
ollama pull mistral:7b
Alternative: Llama 2 7B Q3_K_M
ollama pull llama2:7b
These models provide acceptable quality for basic summarization tasks while running on modest hardware.
For Systems with 8-16 GB Available RAM:
Best option: Llama 2 13B Q4_K_M
ollama pull llama2:13b
Alternative: Mistral 7B Q5_K_M or Q6_K
ollama pull mistral:7b-q6
These models provide good quality suitable for most research tasks.
For Systems with 16-32 GB Available RAM:
Best option: Mixtral 8x7B Q4_K_M
ollama pull mixtral:8x7b
Alternative: Llama 2 13B Q8_0 or Llama 2 70B Q3_K_M
ollama pull llama2:70b
These models provide excellent quality approaching GPT-3.5 performance.
For Systems with 32+ GB Available RAM and GPU:
Best option: Mixtral 8x7B Q5_K_M or Q6_K
ollama pull mixtral:8x7b
Alternative: Llama 2 70B Q4_K_M
ollama pull llama2:70b
These models provide the highest quality available from open-source models.
For Code and Technical Content:
Best option: CodeLlama 13B
ollama pull codellama:13b
Alternative: CodeLlama 34B (if you have sufficient RAM)
ollama pull codellama:34b
These models are specifically trained on code and technical documentation.
5.3 QUALITY COMPARISON
Approximate quality rankings for research summarization tasks:
Tier 1 (Excellent - Comparable to GPT-4):
- GPT-4 Turbo (cloud)
- Claude 3 Opus (cloud)
Tier 2 (Very Good - Comparable to GPT-3.5):
- Mixtral 8x7B Q5_K or higher
- Llama 2 70B Q4_K or higher
- Claude 3 Sonnet (cloud)
- GPT-3.5 Turbo (cloud)
Tier 3 (Good - Suitable for most research tasks):
- Mixtral 8x7B Q4_K
- Llama 2 70B Q3_K
- Llama 2 13B Q5_K or higher
- Mistral 7B Q6_K
Tier 4 (Acceptable - Basic summarization):
- Llama 2 13B Q4_K
- Mistral 7B Q4_K or Q5_K
- CodeLlama 13B Q4_K
Tier 5 (Limited - Simple tasks only):
- Llama 2 7B Q4_K
- Mistral 7B Q3_K
- Smaller or more heavily quantized models
For Reportr, we recommend using at least Tier 3 models for acceptable results. Tier 2 models provide very good results that are suitable for professional use. Tier 1 cloud models provide the best quality but at the cost of privacy and per-use fees.
6. PERFORMANCE OPTIMIZATION
This section provides guidance on optimizing performance when using local LLMs.
6.1 HARDWARE OPTIMIZATION
CPU Optimization: If running on CPU only, ensure you are using a build with appropriate SIMD instructions for your processor.
For modern Intel/AMD processors with AVX2: make LLAMA_AVX2=1
For newer processors with AVX512: make LLAMA_AVX512=1
Check your CPU capabilities: lscpu | grep -i avx
GPU Optimization: If you have an NVIDIA GPU, using CUDA acceleration provides dramatic speedup.
Build llama.cpp with CUDA support: make LLAMA_CUBLAS=1
When running the server, specify how many layers to offload to GPU: ./server -m model.gguf -ngl 40
The -ngl parameter specifies the number of model layers to offload to the GPU. Higher values use more GPU memory but provide faster inference. Start with a low value and increase it until you run out of GPU memory.
For Ollama, GPU acceleration is automatic if you have compatible hardware.
Apple Silicon Optimization: On M1, M2, or M3 Macs, Metal acceleration is automatically enabled. These chips have unified memory, so models can use both CPU and GPU memory seamlessly.
For best performance on Apple Silicon:
- Use Q4_K_M or Q5_K_M quantization
- Ensure you have sufficient free RAM
- Close other memory-intensive applications
Memory Management: Ensure you have sufficient RAM available. Check current memory usage:
On Linux: free -h
On macOS: vm_stat
On Windows: Task Manager > Performance > Memory
If you are running low on memory, consider:
- Using a smaller model
- Using more aggressive quantization (Q3 instead of Q4)
- Reducing context length
- Closing other applications
6.2 CONTEXT LENGTH OPTIMIZATION
The context length determines how much text the model can consider. Longer contexts allow better understanding but require more memory and computation.
For Reportr, the optimal context length depends on the task:
For short summaries (default): num_ctx: 2048
For medium summaries with more context: num_ctx: 4096
For comprehensive analysis of long documents: num_ctx: 8192
For maximum context (if hardware allows): num_ctx: 16384 or 32768
Configure context length in config.yaml:
llm_config:
provider: 'ollama'
model: 'llama2:13b'
num_ctx: 4096
Or for llama.cpp server: ./server -m model.gguf -c 4096
KV-cache memory grows roughly linearly with context length, while attention computation can grow quadratically with it, so large context settings increase both memory usage and per-token latency noticeably.
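For a concrete feel for the context-length cost, a standard transformer's KV cache stores one key and one value vector per layer per token. A rough estimator, using Llama 2 13B's published shape (40 layers, embedding width 5120, FP16 cache) as an assumed example:

```python
def kv_cache_gib(n_ctx, n_layers=40, n_embd=5120, bytes_per_elem=2):
    """KV-cache size in GiB: 2 (K and V) * layers * tokens * width * bytes."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem / 2**30
```

Under these assumptions a 4096-token context needs about 3.1 GiB of cache on top of the model weights, and the cache itself scales linearly as the context grows.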
6.3 BATCH PROCESSING OPTIMIZATION
When Reportr processes multiple documents, it can batch LLM requests for better efficiency.
Configure batch processing in config.yaml:
report_config:
batch_size: 5
batch_timeout: 300
This tells Reportr to process up to 5 documents at a time with a 5-minute timeout per batch.
For local LLMs, smaller batch sizes are often better because:
- Local models process sequentially (no parallelism benefit)
- Smaller batches provide better progress feedback
- Memory usage is more predictable
Recommended batch sizes:
- For CPU-only: batch_size: 1
- For GPU acceleration: batch_size: 3-5
- For high-end GPUs: batch_size: 5-10
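The batching behavior described above amounts to chunking the document list and processing each chunk before moving on. A minimal sketch of that chunking step (batch_size mirrors the config key above; this is illustrative, not Reportr's actual implementation):

```python
def batched(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

With batch_size: 1 on CPU-only setups, each document is handed to the model alone, which keeps memory flat and makes progress visible after every summary.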
6.4 CACHING AND PERSISTENCE
Ollama and llama.cpp support KV cache persistence, which can speed up repeated queries with similar context.
For llama.cpp, enable cache with: ./server -m model.gguf --cache-prompt
For Ollama, caching is automatic.
Reportr can also cache LLM responses to avoid regenerating identical summaries:
llm_config:
cache_responses: true
cache_ttl: 86400 # 24 hours in seconds
This caches LLM responses for 24 hours, so identical queries return cached results instantly.
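A response cache with a TTL like the one above can be sketched as a dictionary keyed by prompt, storing each reply alongside its timestamp. Names here are illustrative, not Reportr's internals; the `now` parameter exists only to make expiry testable:

```python
import time

class ResponseCache:
    """In-memory LLM response cache with a time-to-live in seconds."""

    def __init__(self, ttl=86400):
        self.ttl = ttl
        self._store = {}  # prompt -> (timestamp, response)

    def get(self, prompt, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(prompt)
        if entry is None or now - entry[0] > self.ttl:
            return None  # miss, or entry expired
        return entry[1]

    def put(self, prompt, response, now=None):
        now = time.time() if now is None else now
        self._store[prompt] = (now, response)
```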
7. TROUBLESHOOTING
This section addresses common issues when using local LLMs with Reportr.
7.1 OLLAMA ISSUES
Issue: "Connection refused" error when trying to use Ollama
Solution: Verify that Ollama is running: curl http://localhost:11434/api/tags
If this fails, start Ollama: ollama serve
Or on macOS/Windows, launch the Ollama application.
Issue: "Model not found" error
Solution: Verify that the model is downloaded: ollama list
If the model is not listed, download it: ollama pull llama2:13b
Ensure the model name in config.yaml exactly matches the name shown by "ollama list".
Issue: Ollama runs out of memory
Solution: Use a smaller model or more aggressive quantization: ollama pull llama2:7b
Or use a more quantized version of your current model.
Check available memory: free -h (Linux) or vm_stat (macOS)
Issue: Ollama is very slow
Solution: Ensure GPU acceleration is working. Check Ollama's server logs (on Linux with systemd: journalctl -u ollama; on macOS: ~/.ollama/logs/server.log).
If GPU is not being used, verify that you have compatible GPU drivers installed.
For NVIDIA GPUs, install CUDA toolkit: https://developer.nvidia.com/cuda-downloads
For AMD GPUs, install ROCm: https://rocmdocs.amd.com/
Issue: Ollama generates poor quality responses
Solution: Try a larger or less quantized model: ollama pull mixtral:8x7b
Adjust temperature in config.yaml: temperature: 0.5 # Lower = more focused, higher = more creative
Increase context length: num_ctx: 8192
7.2 LLAMA.CPP ISSUES
Issue: llama.cpp server fails to start
Solution: Check that the model file path is correct: ls -lh ~/llama-models/
Verify the model file is not corrupted: file model.gguf
Should show "GGUF model file" or similar.
Check for port conflicts:
lsof -i :8080 (Linux/macOS)
netstat -ano | findstr :8080 (Windows)
If another process is using port 8080, either stop that process or use a different port: ./server -m model.gguf --port 8081
Issue: llama.cpp server crashes or runs out of memory
Solution: Reduce context length: ./server -m model.gguf -c 2048
Use a smaller or more quantized model.
Reduce GPU layer offloading: ./server -m model.gguf -ngl 20
Or disable GPU entirely: ./server -m model.gguf -ngl 0
Issue: llama.cpp is not using GPU
Solution: Verify that llama.cpp was built with GPU support: ./server --help | grep -i gpu
Should show GPU-related options like -ngl.
If not, rebuild with GPU support:
make clean
make LLAMA_CUBLAS=1 # For NVIDIA
make LLAMA_METAL=1 # For Apple Silicon
Verify GPU drivers are installed: nvidia-smi # For NVIDIA GPUs
Issue: llama.cpp generates incomplete responses
Solution: Increase max_tokens in config.yaml: max_tokens: 4000
Or increase the server's default: ./server -m model.gguf -n 4000
Issue: llama.cpp API returns errors
Solution: Verify the API endpoint is correct. The llama.cpp server exposes an OpenAI-compatible API at /v1/chat/completions.
Test with curl:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "temperature": 0.7}'
If this fails, check server logs for errors.
7.3 REPORTR INTEGRATION ISSUES
Issue: Reportr cannot connect to local LLM
Solution: Verify the base_url in config.yaml is correct:
For Ollama: base_url: 'http://localhost:11434'
For llama.cpp: base_url: 'http://localhost:8080/v1'
Test connectivity:
curl http://localhost:11434/api/tags # Ollama
curl http://localhost:8080/v1/models # llama.cpp
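The two connectivity checks above differ only in the probe path, which suggests a small helper that picks the right health-check URL for the configured provider. This is an illustrative sketch, not Reportr's actual code; the paths come from the curl commands above:

```python
PROBE_PATHS = {
    "ollama": "/api/tags",   # Ollama: lists installed models
    "llamacpp": "/models",   # llama.cpp: OpenAI-compatible list under /v1
}

def probe_url(provider, base_url):
    """Return the URL to GET when checking that a local LLM server is up."""
    return base_url.rstrip("/") + PROBE_PATHS[provider]
```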
Issue: Reportr generates poor quality reports with local LLM
Solution: The model may be too small or too heavily quantized. Try a larger model: ollama pull mixtral:8x7b
Adjust generation parameters in config.yaml:
temperature: 0.6
top_p: 0.9
repeat_penalty: 1.2
Increase context to provide more information to the model: num_ctx: 8192
Issue: Local LLM is much slower than cloud models
Solution: This is expected. Local models, especially on CPU, are significantly slower than cloud APIs.
To improve speed:
- Use GPU acceleration if available
- Use a smaller model (7B instead of 13B)
- Use more aggressive quantization (Q3 instead of Q4)
- Reduce context length
- Reduce max_tokens
For comparison:
- Cloud API (GPT-4): 1-3 seconds per summary
- Local GPU (Mixtral 8x7B): 10-30 seconds per summary
- Local CPU (Llama 2 13B): 60-180 seconds per summary
Issue: Reportr times out waiting for local LLM
Solution: Increase the timeout in config.yaml:
llm_config:
timeout: 600 # 10 minutes
Or use a faster model.
8. COMPARISON: CLOUD VS LOCAL LLMs
This section provides a comprehensive comparison to help you decide when to use cloud versus local LLMs.
8.1 QUALITY COMPARISON
Summary Quality Rankings:
Highest Quality:
1. GPT-4 Turbo (cloud)
2. Claude 3 Opus (cloud)
3. Mixtral 8x7B Q6_K (local, requires 32GB+ RAM)
High Quality:
4. GPT-3.5 Turbo (cloud)
5. Claude 3 Sonnet (cloud)
6. Mixtral 8x7B Q4_K (local, requires 24GB+ RAM)
7. Llama 2 70B Q4_K (local, requires 35GB+ RAM)
Good Quality:
8. Llama 2 13B Q5_K (local, requires 10GB+ RAM)
9. Mistral 7B Q6_K (local, requires 6GB+ RAM)
10. CodeLlama 13B Q4_K (local, requires 7GB+ RAM)
Acceptable Quality:
11. Llama 2 13B Q4_K (local, requires 7GB+ RAM)
12. Mistral 7B Q4_K (local, requires 4GB+ RAM)
For professional research reports, we recommend using options 1-7. Options 8-10 are suitable for internal use or preliminary research. Options 11-12 are best for testing or very basic summarization.
8.2 COST COMPARISON
Cloud Model Costs (approximate, as of February 2026):
GPT-4 Turbo:
- Input: $0.01 per 1K tokens
- Output: $0.03 per 1K tokens
- Typical research report: $0.20-$0.80
GPT-3.5 Turbo:
- Input: $0.0005 per 1K tokens
- Output: $0.0015 per 1K tokens
- Typical research report: $0.01-$0.05
Claude 3 Opus:
- Input: $0.015 per 1K tokens
- Output: $0.075 per 1K tokens
- Typical research report: $0.30-$1.20
Claude 3 Sonnet:
- Input: $0.003 per 1K tokens
- Output: $0.015 per 1K tokens
- Typical research report: $0.06-$0.25
Local Model Costs:
- Initial hardware investment: $0-$5000 (depending on GPU)
- Per-query cost: $0.00 (electricity cost negligible)
- Break-even point: 100-10000 queries (depending on hardware and cloud model)
Cost Analysis Examples:
Light usage (10 queries/month with GPT-3.5):
- Monthly cloud cost: $0.10-$0.50
- Local hardware payback: Never (cloud is cheaper)
- Recommendation: Use cloud
Moderate usage (100 queries/month with GPT-3.5):
- Monthly cloud cost: $1-$5
- Annual cloud cost: $12-$60
- Local hardware payback: 5-10 years (for CPU-only setup)
- Recommendation: Use cloud unless privacy is critical
Heavy usage (100 queries/month with GPT-4):
- Monthly cloud cost: $20-$80
- Annual cloud cost: $240-$960
- Local hardware payback: 1-2 years (for mid-range GPU setup)
- Recommendation: Consider local LLM
Very heavy usage (1000 queries/month with GPT-4):
- Monthly cloud cost: $200-$800
- Annual cloud cost: $2400-$9600
- Local hardware payback: 2-6 months (for high-end GPU setup)
- Recommendation: Strongly consider local LLM
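The payback estimates above reduce to a single division: months to break even equal the hardware cost divided by the monthly cloud spend. A tiny calculator under that simplification, ignoring electricity and maintenance costs:

```python
def break_even_months(hardware_cost, monthly_cloud_cost):
    """Months of cloud spending needed to equal the hardware investment."""
    if monthly_cloud_cost <= 0:
        return float("inf")  # with no cloud spend, hardware never pays back
    return hardware_cost / monthly_cloud_cost
```

For instance, a $3000 GPU workstation against a $500/month GPT-4 bill breaks even in about 6 months, consistent with the very-heavy-usage estimate above.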
8.3 SPEED COMPARISON
Typical response times for generating a research summary:
Cloud Models:
- GPT-4 Turbo: 2-5 seconds
- GPT-3.5 Turbo: 1-2 seconds
- Claude 3 Opus: 3-6 seconds
- Claude 3 Sonnet: 2-4 seconds
Local Models (CPU only, modern desktop):
- Mixtral 8x7B Q4_K: Not practical (requires too much RAM)
- Llama 2 13B Q4_K: 120-300 seconds
- Llama 2 7B Q4_K: 60-120 seconds
- Mistral 7B Q4_K: 40-90 seconds
Local Models (NVIDIA RTX 3090, 24GB VRAM):
- Mixtral 8x7B Q4_K: 15-30 seconds
- Llama 2 13B Q4_K: 8-15 seconds
- Llama 2 7B Q4_K: 4-8 seconds
- Mistral 7B Q4_K: 3-6 seconds
Local Models (NVIDIA RTX 4090, 24GB VRAM):
- Mixtral 8x7B Q4_K: 10-20 seconds
- Llama 2 13B Q4_K: 5-10 seconds
- Llama 2 7B Q4_K: 3-5 seconds
- Mistral 7B Q4_K: 2-4 seconds
Local Models (Apple M2 Max, 64GB unified memory):
- Mixtral 8x7B Q4_K: 20-40 seconds
- Llama 2 13B Q4_K: 10-20 seconds
- Llama 2 7B Q4_K: 5-10 seconds
- Mistral 7B Q4_K: 4-8 seconds
Cloud models are generally faster, but high-end local hardware can approach cloud speeds, especially for smaller models.
8.4 DECISION MATRIX
Use Cloud LLMs when:
- You need the highest quality results
- You have low to moderate usage volume
- You need fast response times
- You do not have powerful local hardware
- Privacy is not a critical concern
- You want minimal setup and maintenance
- You need access from multiple locations
- You want automatic model updates
Use Local LLMs when:
- Privacy and data sovereignty are critical
- You have high usage volume (1000+ queries/month)
- You have powerful local hardware (especially GPU)
- You need offline operation capability
- You want zero per-query costs
- You need regulatory compliance (HIPAA, GDPR, etc.)
- You want complete control over model versions
- You can tolerate slower response times
- You have technical expertise for setup and maintenance
Use Hybrid Approach when:
- You want to balance quality and cost
- You have variable usage patterns
- You want local LLM as primary with cloud fallback
- You want to use cloud for complex tasks, local for routine tasks
- You are transitioning from cloud to local
- You want to experiment with local while maintaining cloud backup
8.5 RECOMMENDED CONFIGURATIONS
Configuration 1: Budget-Conscious Individual Researcher
Hardware: Laptop or desktop with 16GB RAM, no GPU
Model: Mistral 7B Q4_K via Ollama
Expected performance: Acceptable quality, 60-90 second summaries
Use case: Personal research, non-critical applications
Cost: $0 per query after initial setup
Configuration 2: Professional Researcher with Privacy Needs
Hardware: Workstation with 32GB RAM, NVIDIA RTX 3060 or better
Model: Llama 2 13B Q4_K via Ollama
Expected performance: Good quality, 10-20 second summaries
Use case: Professional research requiring data privacy
Cost: $0 per query, hardware investment ~$1500-2500
Configuration 3: Research Team with High Volume
Hardware: Server with 64GB RAM, NVIDIA RTX 4090 or A100
Model: Mixtral 8x7B Q4_K via Ollama
Expected performance: Very good quality, 10-20 second summaries
Use case: Team conducting hundreds of queries monthly
Cost: $0 per query, hardware investment ~$3000-8000
Configuration 4: Hybrid for Flexibility
Hardware: Desktop with 16GB RAM, modest GPU
Primary: Llama 2 13B Q4_K via Ollama (for routine queries)
Fallback: GPT-4 Turbo via OpenAI (for complex analysis)
Expected performance: Variable quality, optimized cost
Use case: Balancing quality, cost, and privacy
Cost: ~$20-100/month depending on cloud usage
Configuration 5: Maximum Quality
Hardware: High-end workstation or cloud GPU instance
Model: Mixtral 8x7B Q6_K or GPT-4 Turbo
Expected performance: Excellent quality, fast responses
Use case: Critical research requiring highest quality
Cost: Either $0 per query (local) or $0.20-0.80 per query (cloud)
CONCLUSION
The addition of local LLM support through Ollama and llama.cpp transforms Reportr from a cloud-dependent tool into a flexible research platform that can operate in diverse environments. Users can now choose the deployment model that best fits their specific requirements, whether that means prioritizing quality, cost, privacy, or offline capability.
For most users, we recommend starting with Ollama and the Llama 2 13B Q4_K model, which provides a good balance of quality, performance, and resource requirements. As you become more familiar with local LLMs, you can experiment with different models and configurations to find the optimal setup for your needs.
The future of AI research tools lies in providing flexibility and choice. Reportr now offers that flexibility, enabling researchers to work effectively whether they are connected to high-speed internet with cloud API access or working offline in a secure facility with local-only infrastructure.