INTRODUCTION
The vision of interacting with computers using natural language has long been a staple of science fiction, promising a future where complex tasks are accomplished through intuitive conversation. With the advent of powerful Large Language Models (LLMs), this vision is rapidly becoming a reality. This article explores the architecture and implementation of an LLM-based Agentic Artificial Intelligence designed to control a desktop interface using natural language. Such an agent can revolutionize productivity by automating repetitive tasks, simplifying complex workflows, and enhancing accessibility for all users.
This agent will enable users to perform a wide array of desktop operations, including opening applications, manipulating windows (moving, selecting, hiding, organizing, listing), configuring and retrieving system settings, interacting with the command line, and managing files and folders (creating, deleting, moving, finding, navigating). Furthermore, it will support the saving and replaying of command sequences, as well as the ability to open, read, and even summarize text files. While the primary focus will be on MacOS for detailed code examples, we will also discuss how these concepts translate to Windows and Linux environments, ensuring a comprehensive understanding of cross-platform applicability. We will meticulously separate OS-specific code from OS-neutral components, adhering to clean architecture principles.
CORE CONCEPTS OF AGENTIC AI
At the heart of this desktop control system lies the concept of an "Agentic AI." Unlike simple LLM applications that generate a single response, an agentic AI operates in a continuous loop, exhibiting a degree of autonomy and goal-directed behavior.
Large Language Models (LLMs) as the Brain:
The LLM serves as the cognitive core of our agent. Its primary function is to understand the user's natural language intent, break down complex requests into actionable steps, and translate those steps into structured commands that the computer can execute. The LLM's vast knowledge base and reasoning capabilities allow it to interpret ambiguous instructions, infer context, and generate appropriate responses or actions.
The Agentic Paradigm: Plan, Act, Observe, Reflect:
An agentic system typically follows a cyclical process to achieve its goals:
1. Plan: The agent receives a user request and, using the LLM, formulates a plan of action. This might involve breaking down a large task into smaller, manageable sub-tasks.
2. Act: The agent executes one or more actions based on its plan. These actions are performed through a set of specialized "tools" that interact with the operating system.
3. Observe: After executing an action, the agent observes the outcome. This could be the output of a command, the state of the desktop, or an error message.
4. Reflect: The agent uses the LLM to analyze the observation, determine if the action was successful, and decide on the next step. If the action failed or did not fully achieve the goal, the agent might refine its plan or attempt a different approach. This iterative process allows the agent to adapt and recover from unexpected situations.
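The four steps above can be condensed into a loop. The following is a minimal illustrative skeleton, not the article's final implementation: the `act` callback stands in for tool execution, and returning `"done"` stands in for the reflection step concluding that the goal is achieved.

```python
# Minimal plan-act-observe-reflect skeleton. `act` is a placeholder for
# real tool execution; the "done" check is a placeholder for reflection.
def agent_loop(goal, act, max_steps=5):
    history = []
    for step in range(max_steps):
        plan = f"step {step + 1} toward: {goal}"   # Plan
        observation = act(plan)                     # Act + Observe
        history.append((plan, observation))         # record for Reflect
        if observation == "done":                   # Reflect: goal reached?
            break
    return history
```

The `max_steps` cap is a common safeguard against an agent looping forever when reflection never concludes the goal is met.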
Tools: The Hands and Feet of the Agent:
Tools are modular functions or interfaces that enable the agent to interact with its environment, in this case, the desktop operating system. Each tool encapsulates a specific capability, such as "open an application," "list files," or "move a window." The LLM, through a mechanism often called "function calling" or "tool use," selects the appropriate tool and provides it with the necessary parameters based on the user's request.
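In miniature, this is a lookup from tool name to callable. The tool names and lambda bodies below are illustrative stand-ins, not the tools developed later in the article:

```python
# A toy tool registry: the LLM picks a name, the agent looks up and invokes
# the matching callable with the extracted parameters.
TOOLS = {
    "open_application": lambda app_name: f"opening {app_name}",
    "list_files": lambda path: f"listing {path}",
}

def call_tool(name: str, **kwargs) -> str:
    # Unknown names are reported back rather than raising, so the LLM can
    # observe the failure and recover.
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](**kwargs)
```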
ARCHITECTURAL OVERVIEW
The architecture of our LLM-powered desktop agent can be visualized as a layered system, designed for modularity, extensibility, and separation of concerns.
+-------------------+ +-------------------+
| User Input | --> | Natural Language |
| (Speech/Text) | | Understanding |
+-------------------+ | (LLM) |
| +-------------------+
| |
| V
| +-------------------+
| | Agentic Loop |
| | (Plan, Act, Observe,|
| | Reflect) |
| +-------------------+
| |
| V
| +-------------------+ +-------------------+
| | Tool Orchestrator| --> | Memory |
| | (Selects & Calls | | (Context, |
| | Tools) | | Preferences) |
| +-------------------+ +-------------------+
| |
| V
| +-------------------+
| | OS-Neutral Tool |
| | Abstractions |
| +-------------------+
| |
| V
| +-------------------+
| | OS-Specific Tool |
| | Implementations |
| | (MacOS, Windows, |
| | Linux) |
| +-------------------+
| |
| V
| +-------------------+
+-------------> | Desktop Operating |
| System |
+-------------------+
Figure 1: Agentic Desktop AI Architecture
1. User Input: The process begins with the user providing a request in natural language, either through text or speech (which would then be converted to text).
2. Natural Language Understanding (LLM): The user's request is fed to the Large Language Model. The LLM's role is to interpret the intent, identify the desired actions, and extract any relevant parameters. This involves sending the user's query and the available tool definitions to the LLM API, which then returns either a structured tool call or a natural language response.
3. Agentic Loop: This component orchestrates the entire process. It takes the LLM's initial interpretation, formulates a plan, executes actions through the Tool Orchestrator, observes the results, and reflects on whether the goal has been achieved or if further steps are needed. It maintains the conversation state and manages multi-turn interactions.
4. Tool Orchestrator: This layer acts as an intermediary between the agentic loop and the actual tools. Based on the LLM's output (e.g., a function call request), it selects the correct tool and invokes it with the specified arguments.
5. OS-Neutral Tool Abstractions: To ensure portability and maintainability, we define a set of abstract tool interfaces or function signatures that represent common desktop operations (e.g., `open_application(name)`, `list_files(path)`). These abstractions hide the underlying OS-specific complexity.
6. OS-Specific Tool Implementations: This layer contains the actual code that interacts with the operating system. For MacOS, this might involve AppleScript or shell commands. For Windows, it could be PowerShell or COM objects, and for Linux, `xdotool` or `wmctrl`. These implementations adhere to the OS-neutral abstractions.
7. Memory: A crucial component for any agent, memory stores conversational history, user preferences, system settings, and any other context that helps the agent make informed decisions and maintain continuity across interactions.
8. Desktop Operating System: This is the environment the agent interacts with, receiving commands and providing feedback.
THE AGENT'S BRAIN: NATURAL LANGUAGE UNDERSTANDING AND PLANNING WITH REAL LLMS
The core intelligence of our agent resides in the Large Language Model. Its ability to understand human language is what makes this system intuitive. Crucially, modern LLMs, especially those exposed via APIs, offer "function calling" or "tool use" capabilities, allowing them to not just generate text, but also structured data representing calls to external functions.
LLM's Role in Parsing User Intent:
When a user says, "Open Safari and then move its window to the top-left corner," the LLM must perform several tasks:
* Identify the intent: The user wants to open an application and then manipulate its window.
* Extract entities: The application name is "Safari." The target position for the window is "top-left corner."
* Sequence actions: Opening Safari must precede moving its window.
Function Calling / Tool Use with a Real LLM:
Instead of simulating, we integrate with a real LLM API (e.g., OpenAI's GPT models, Google's Gemini, or a local LLM like Llama 3 via Ollama). The process involves:
1. Defining Tools for the LLM: We provide the LLM API with descriptions of the available tools, including their names, what they do, and the parameters they accept, along with their types and descriptions. This is typically done using JSON Schema.
2. LLM API Call: The user's prompt, along with the tool definitions and the conversation history, is sent to the LLM API. The LLM processes this input.
3. Structured Output or Text Response: The LLM API responds. This response can be one of two main types:
* A tool call: If the LLM determines that a tool is relevant to the user's request, it generates a JSON object specifying the tool to call and the arguments for that call. For example, for "Open Safari," the LLM might output a structure like:
{
"tool_calls": [
{
"id": "call_abc123",
"function": {
"name": "open_application",
"arguments": "{\\"app_name\\": \\"Safari\\"}"
},
"type": "function"
}
]
}
* A natural language response: If no tool is deemed appropriate, the LLM generates a textual response directly to the user.
4. Execution: Our system then parses this LLM response. If it's a tool call, it extracts the tool name and parameters, finds the corresponding tool implementation in our `ToolOrchestrator`, and executes it. If it's a text response, it's displayed to the user.
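Concretely, parsing the tool-call structure shown in step 3 means extracting the function name and JSON-decoding the `arguments` string:

```python
import json

# The tool-call structure from the example above, as a Python dict.
response = {
    "tool_calls": [{
        "id": "call_abc123",
        "function": {"name": "open_application",
                     "arguments": "{\"app_name\": \"Safari\"}"},
        "type": "function"}]}

call = response["tool_calls"][0]
tool_name = call["function"]["name"]                   # which tool to invoke
tool_args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
```

Note that `arguments` is a JSON-encoded string, not a nested object, so it must be decoded before the parameters can be passed to the tool.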
Prompt Engineering Considerations:
Crafting effective system prompts is vital for guiding the LLM's behavior. The system prompt provided to the LLM should clearly define its role, the tools it has access to, and instructions on how to use them.
Example System Prompt Structure (for OpenAI API):
"You are a helpful desktop assistant. You have access to the following tools:
[TOOL_DEFINITIONS_JSON]. Your goal is to assist the user by performing actions on their desktop. When a user asks you to do something, determine the appropriate tool(s) to use and respond with a JSON object containing the 'tool_calls' array.
If multiple steps are required, output one tool call at a time and wait for the result. If you need more information, ask the user. If you cannot fulfill a request, explain why."
This prompt guides the LLM to act as an agent, use tools, and handle multi-step interactions. The `TOOL_DEFINITIONS_JSON` would be dynamically generated from our `DesktopTool` instances.
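One way to fill the `[TOOL_DEFINITIONS_JSON]` placeholder at startup is to serialize the tool definitions and splice them into the prompt text. This is a hedged sketch; `build_system_prompt` is an illustrative helper, not part of any SDK:

```python
import json

def build_system_prompt(tool_definitions: list[dict]) -> str:
    """Assemble the system prompt, embedding tool definitions as JSON."""
    return (
        "You are a helpful desktop assistant. You have access to the "
        "following tools:\n"
        + json.dumps(tool_definitions, indent=2)
        + "\nWhen a user asks you to do something, determine the appropriate "
          "tool(s) to use. If multiple steps are required, output one tool "
          "call at a time and wait for the result. If you cannot fulfill a "
          "request, explain why."
    )
```

In practice, APIs with native function calling accept the definitions via a dedicated `tools` parameter instead of the prompt, but embedding them textually works for models without that feature.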
LLM API Integration (Conceptual Python Class):
To integrate with a real LLM, we would use an SDK provided by the LLM provider.
For example, using OpenAI's Python client:
-----------------------------------------------------------------------
import openai
import os
import json
class OpenAILLMAgent:
"""
An LLM agent that uses OpenAI's API for natural language understanding
and tool calling.
"""
def __init__(self, api_key: str, model_name: str = "gpt-4o"):
self.client = openai.OpenAI(api_key=api_key)
self.model_name = model_name
self.messages = [{"role": "system", "content": "You are a helpful desktop assistant. Use the available tools to assist the user."}]
def _add_message(self, role: str, content: str = None, tool_calls: list = None, tool_call_id: str = None, name: str = None):
"""Helper to add messages to the conversation history."""
message = {"role": role}
if content:
message["content"] = content
if tool_calls:
message["tool_calls"] = tool_calls
if tool_call_id:
message["tool_call_id"] = tool_call_id
if name:
message["name"] = name
self.messages.append(message)
def generate_response(self, user_query: str, available_tools: list[dict]) -> tuple[str, dict | str]:
"""
Sends the user query and available tools to the LLM and gets a response.
Returns a tuple: (response_type, data)
response_type can be "tool_call" or "text_response".
data is the tool call dictionary or the text string.
"""
self._add_message("user", content=user_query)
try:
response = self.client.chat.completions.create(
model=self.model_name,
messages=self.messages,
tools=available_tools,
tool_choice="auto" # Let the LLM decide whether to call a tool or respond naturally
)
response_message = response.choices[0].message
if response_message.tool_calls:
# The LLM wants to call a tool
self._add_message("assistant", tool_calls=response_message.tool_calls)
# For simplicity, we'll return the first tool call.
# A real agent might handle multiple tool calls or chained calls.
tool_call = response_message.tool_calls[0]
return "tool_call", {
"id": tool_call.id,
"function": {
"name": tool_call.function.name,
"arguments": json.loads(tool_call.function.arguments)
}
}
elif response_message.content:
# The LLM generated a text response
self._add_message("assistant", content=response_message.content)
return "text_response", response_message.content
else:
return "text_response", "The LLM did not provide a clear response or tool call."
except openai.APIError as e:
return "error", f"OpenAI API Error: {e}"
except Exception as e:
return "error", f"An unexpected error occurred with the LLM: {e}"
def provide_tool_output(self, tool_call_id: str, output: str):
"""
Adds the output of a tool execution to the conversation history
so the LLM can use it for subsequent steps or reflection.
"""
self._add_message("tool", content=output, tool_call_id=tool_call_id)
-----------------------------------------------------------------------
To use this `OpenAILLMAgent`, you would need to install the `openai` Python package (`pip install openai`) and set your OpenAI API key as an environment variable (e.g., `export OPENAI_API_KEY='your_api_key_here'`) or pass it directly.
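Wiring `generate_response` to tool execution looks roughly like the sketch below. To keep the example self-contained and runnable without an API key, a stubbed agent stands in for `OpenAILLMAgent`; the registry layout (`definition`/`execute` keys) is an assumption for illustration:

```python
def run_agent_turn(agent, tool_registry, user_query):
    """One turn: ask the agent, execute at most one tool call, feed back the result."""
    available = [t["definition"] for t in tool_registry.values()]
    kind, data = agent.generate_response(user_query, available)
    if kind == "tool_call":
        name = data["function"]["name"]
        args = data["function"]["arguments"]
        tool = tool_registry.get(name)
        if tool is None:
            return f"Unknown tool: {name}"
        output = tool["execute"](**args)
        agent.provide_tool_output(data["id"], output)  # let the LLM observe the result
        return output
    return data  # plain text response (or error message)

class StubAgent:
    """Stand-in for OpenAILLMAgent that always requests open_application."""
    def generate_response(self, user_query, available_tools):
        return "tool_call", {"id": "call_1",
                             "function": {"name": "open_application",
                                          "arguments": {"app_name": "Safari"}}}
    def provide_tool_output(self, tool_call_id, output):
        pass

registry = {"open_application": {
    "definition": {"type": "function",
                   "function": {"name": "open_application"}},
    "execute": lambda app_name: f"Opened {app_name}."}}

print(run_agent_turn(StubAgent(), registry, "Open Safari"))  # Opened Safari.
```

Swapping `StubAgent` for a real `OpenAILLMAgent` instance leaves `run_agent_turn` unchanged, which is the point of keeping the LLM behind a narrow interface.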
THE AGENT'S HANDS: TOOLING AND OPERATING SYSTEM INTERACTION
The tooling layer is where the abstract commands from the LLM are translated into concrete actions on the operating system. This layer is crucial for the agent's ability to manipulate the desktop environment.
The Critical Role of the Tool Layer:
The tool layer isolates the LLM's reasoning from the complexities of OS-specific APIs and commands. This separation allows the LLM to focus on "what to do" while the tools handle "how to do it."
OS-Neutral Tool Abstractions:
To ensure our agent can potentially work across different operating systems, we define an abstract interface for our tools. This involves creating Python functions that represent common desktop actions, with clear input parameters and expected outputs. These functions will serve as the contract that OS-specific implementations must fulfill. The `parameters_schema` attribute in `DesktopTool` is particularly important as it directly informs the LLM about the tool's arguments in a machine-readable JSON Schema format.
Example OS-Neutral Tool Abstraction (Conceptual Python class):
-----------------------------------------------------------------------
import abc
import json
class DesktopTool(abc.ABC):
"""
Abstract base class for all desktop interaction tools.
Defines the common interface for tools, allowing for OS-specific
implementations to adhere to a consistent structure.
"""
def __init__(self, name: str, description: str, parameters_schema: dict):
self.name = name
self.description = description
# parameters_schema must be a valid JSON Schema for the tool's arguments
self.parameters_schema = parameters_schema
@abc.abstractmethod
def execute(self, **kwargs) -> str:
"""
Executes the tool's action with the given parameters.
Must be implemented by concrete OS-specific tool classes.
"""
pass
def get_tool_definition(self) -> dict:
"""
Returns the tool's definition in a format suitable for LLM function calling.
This format is specific to the LLM API being used (e.g., OpenAI's format).
"""
return {
"type": "function",
"function": {
"name": self.name,
"description": self.description,
"parameters": self.parameters_schema
}
}
-----------------------------------------------------------------------
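A trivial concrete subclass shows how `get_tool_definition` output is collected for the LLM. The base class is repeated here (in compact form) so the snippet runs standalone; `EchoTool` is purely illustrative:

```python
import abc

class DesktopTool(abc.ABC):  # compact restatement of the base class above
    def __init__(self, name: str, description: str, parameters_schema: dict):
        self.name = name
        self.description = description
        self.parameters_schema = parameters_schema
    @abc.abstractmethod
    def execute(self, **kwargs) -> str: ...
    def get_tool_definition(self) -> dict:
        return {"type": "function",
                "function": {"name": self.name,
                             "description": self.description,
                             "parameters": self.parameters_schema}}

class EchoTool(DesktopTool):
    """Minimal concrete tool: echoes its input back."""
    def __init__(self):
        super().__init__("echo", "Echoes text back.",
                         {"type": "object",
                          "properties": {"text": {"type": "string"}},
                          "required": ["text"]})
    def execute(self, text: str) -> str:
        return text

tools = [EchoTool()]
# This list is what gets passed as the `tools=` argument to the LLM API.
definitions = [t.get_tool_definition() for t in tools]
```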
OS-Specific Implementations:
Each operating system has its own mechanisms for automation and interaction. Our tool layer will leverage these native capabilities.
MacOS:
MacOS offers powerful scripting capabilities through AppleScript, which can directly interact with applications and the System Events process for UI automation. Additionally, standard Unix-like shell commands are available via the Terminal.
* AppleScript: Used for controlling applications (e.g., Safari, Finder), manipulating windows, and accessing system-level UI elements through `System Events`. AppleScript commands can be executed from Python using the `osascript` command-line utility via the `subprocess` module.
* Shell Commands: Standard commands like `open`, `ls`, `mv`, `rm`, `mkdir`, `find`, `cat` are executed directly using `subprocess`.
Windows (Brief Overview):
On Windows, similar functionalities can be achieved using:
* PowerShell: A powerful scripting language that can interact with the operating system, applications, and COM objects. Python can execute PowerShell scripts via `subprocess`.
* COM Objects: Component Object Model objects provide programmatic access to many Windows features and applications (e.g., `Shell.Application` for file operations). Python libraries like `pywin32` can interface with COM objects.
Linux (Brief Overview):
Linux environments, particularly those with X Window System, offer various tools:
* `xdotool`: A command-line tool to simulate keyboard input and mouse activity, move and resize windows, etc.
* `wmctrl`: A command-line tool to interact with EWMH/NetWM compatible X Window Managers, allowing window listing, activation, moving, and resizing.
* `dbus`: A message bus system for inter-process communication, which can be used to interact with desktop environments like GNOME or KDE.
* Shell Commands: Similar to MacOS, standard Unix commands are used for file system operations and launching applications.
Python's `subprocess` Module as the Bridge:
For cross-platform compatibility and simplicity in demonstrating OS interaction, the Python `subprocess` module is invaluable. It allows our Python agent to execute external commands, including `osascript` (for AppleScript on MacOS), `powershell` (for Windows), or standard shell commands (on all Unix-like systems).
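A small dispatch on `platform.system()` illustrates the pattern of mapping one OS-neutral action to per-OS command lines. The command strings are representative, not exhaustive (`xdg-open` assumes a freedesktop-compliant Linux desktop):

```python
import platform
import subprocess

def open_url_command(url: str) -> list[str]:
    """Return the subprocess argument list to open a URL on the current OS."""
    system = platform.system()
    if system == "Darwin":            # MacOS
        return ["open", url]
    if system == "Windows":
        return ["powershell", "-Command", f"Start-Process '{url}'"]
    return ["xdg-open", url]          # most Linux desktops

# Actually launching it would be:
# subprocess.run(open_url_command("https://example.com"), check=True)
```

Keeping the command construction separate from `subprocess.run` makes the OS-specific part trivially unit-testable without side effects.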
IMPLEMENTING DESKTOP CONTROL CAPABILITIES
Let's delve into specific functionalities, providing OS-neutral abstractions and detailed MacOS implementations.
6.1. OPENING APPLICATIONS
Description: This capability allows the user to launch any installed application on the system by its name. The agent needs to translate the application's common name into the exact executable path or a name recognized by the OS's application launcher.
OS-Neutral Function Signature (Conceptual):
-----------------------------------------------------------------------
class OpenApplicationTool(DesktopTool):
def __init__(self):
super().__init__(
name="open_application",
description="Opens a specified application by its name.",
parameters_schema={
"type": "object",
"properties": {
"app_name": {
"type": "string",
"description": "The name of the application to open (e.g., 'Safari', 'Terminal')."
}
},
"required": ["app_name"]
}
)
def execute(self, app_name: str) -> str:
"""
Executes the command to open the specified application.
"""
raise NotImplementedError("Subclasses must implement this method.")
-----------------------------------------------------------------------
MacOS Implementation Example:
On MacOS, the `open` command-line utility or AppleScript can be used. AppleScript is often more robust for specific application control.
-----------------------------------------------------------------------
import subprocess
import os
class MacOSOpenApplicationTool(OpenApplicationTool):
"""
MacOS specific implementation for opening applications.
Leverages AppleScript via osascript to activate applications.
"""
def __init__(self):
super().__init__()
def execute(self, app_name: str) -> str:
"""
Opens a specified application on MacOS using AppleScript.
Args:
app_name: The name of the application to open (e.g., 'Safari').
Returns:
A string indicating success or failure.
"""
# AppleScript to activate an application.
# 'activate application "AppName"' brings it to the foreground.
applescript_command = f'tell application "{app_name}" to activate'
try:
# Execute the AppleScript using osascript.
# capture_output=True and text=True are good for debugging but
# for simple activation, output might not be critical.
result = subprocess.run(
['osascript', '-e', applescript_command],
capture_output=True,
text=True,
check=True # Raise an exception for non-zero exit codes
)
return f"Successfully opened {app_name}."
except subprocess.CalledProcessError as e:
# If the application doesn't exist or there's another AppleScript error.
return f"Error opening {app_name}: {e.stderr.strip()}"
except FileNotFoundError:
return "Error: osascript command not found. Is this running on MacOS?"
except Exception as e:
return f"An unexpected error occurred: {str(e)}"
-----------------------------------------------------------------------
6.2. WINDOW MANAGEMENT
Description: This category encompasses a suite of actions related to desktop windows, including moving them to specific screen coordinates, bringing them to the foreground (selecting), hiding them, organizing them (e.g., tiling), and listing all open windows. These operations are critical for efficient multitasking.
OS-Neutral Function Signatures (Conceptual):
-----------------------------------------------------------------------
class WindowManagementTool(DesktopTool):
# The `execute` method will handle different actions based on the 'action' parameter.
# The `parameters_schema` will define all possible parameters for all actions.
def __init__(self):
super().__init__(
name="window_management",
description="Manages desktop windows (move, select, hide, list).",
parameters_schema={
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["move_window", "select_window", "hide_window", "list_windows"], "description": "The specific window action to perform."},
"app_name": {"type": "string", "description": "Name of the application whose window to manage (e.g., 'Safari'). Required for move, select, hide."},
"window_index": {"type": "integer", "description": "1-based index of the window to target. Required for move, select, hide."},
"x": {"type": "integer", "description": "X coordinate for the top-left corner of the window. Required for move_window."},
"y": {"type": "integer", "description": "Y coordinate for the top-left corner of the window. Required for move_window."}
},
"required": ["action"]
}
)
def execute(self, action: str, **kwargs) -> str:
"""
Executes a window management action based on the 'action' parameter.
"""
if action == "move_window":
return self.move_window(kwargs.get("app_name"), kwargs.get("window_index"),
kwargs.get("x"), kwargs.get("y"))
elif action == "select_window":
return self.select_window(kwargs.get("app_name"), kwargs.get("window_index"))
elif action == "hide_window":
return self.hide_window(kwargs.get("app_name"), kwargs.get("window_index"))
elif action == "list_windows":
return self.list_windows()
else:
return f"Error: Unknown window management action '{action}'."
@abc.abstractmethod
def move_window(self, app_name: str, window_index: int, x: int, y: int) -> str:
"""Moves a specific window of an application to new coordinates."""
pass
@abc.abstractmethod
def select_window(self, app_name: str, window_index: int) -> str:
"""Brings a specific window of an application to the foreground."""
pass
@abc.abstractmethod
def hide_window(self, app_name: str, window_index: int) -> str:
"""Hides a specific window of an application."""
pass
@abc.abstractmethod
def list_windows(self) -> str:
"""Lists all open windows with their application names and titles."""
pass
-----------------------------------------------------------------------
MacOS Implementation Examples:
MacOS uses `System Events` in AppleScript to interact with UI elements and windows.
-----------------------------------------------------------------------
import subprocess
import json
class MacOSWindowManagementTool(WindowManagementTool):
"""
MacOS specific implementation for managing windows.
Uses AppleScript via osascript to interact with System Events.
"""
def __init__(self):
super().__init__() # Calls the __init__ of WindowManagementTool which sets schema
def _run_applescript(self, script: str) -> str:
"""Helper to run AppleScript and capture output/errors."""
try:
result = subprocess.run(
['osascript', '-e', script],
capture_output=True,
text=True,
check=True
)
return result.stdout.strip()
except subprocess.CalledProcessError as e:
return f"AppleScript Error: {e.stderr.strip()}"
except FileNotFoundError:
return "Error: osascript command not found."
except Exception as e:
return f"An unexpected error occurred: {str(e)}"
def move_window(self, app_name: str, window_index: int, x: int, y: int) -> str:
"""
Moves the specified window of an application to new screen coordinates.
Args:
app_name: The name of the application.
window_index: The 1-based index of the window.
x: The new X coordinate for the top-left corner.
y: The new Y coordinate for the top-left corner.
Returns:
A string indicating success or failure.
"""
applescript = f'''
tell application "System Events"
tell process "{app_name}"
if exists (window {window_index}) then
set position of window {window_index} to {{{x}, {y}}}
return "Successfully moved window {window_index} of {app_name} to ({x}, {y})."
else
return "Error: Window {window_index} of {app_name} not found."
end if
end tell
end tell
'''
return self._run_applescript(applescript)
def select_window(self, app_name: str, window_index: int) -> str:
"""
Brings the specified window of an application to the foreground.
Args:
app_name: The name of the application.
window_index: The 1-based index of the window.
Returns:
A string indicating success or failure.
"""
applescript = f'''
tell application "System Events"
tell process "{app_name}"
if exists (window {window_index}) then
set frontmost of window {window_index} to true
return "Successfully brought window {window_index} of {app_name} to front."
else
return "Error: Window {window_index} of {app_name} not found."
end if
end tell
end tell
'''
# Note: 'activate application "AppName"' also brings the app to front,
# but this targets a specific window.
return self._run_applescript(applescript)
def hide_window(self, app_name: str, window_index: int) -> str:
"""
Hides the specified window of an application.
Args:
app_name: The name of the application.
window_index: The 1-based index of the window.
Returns:
A string indicating success or failure.
"""
applescript = f'''
tell application "System Events"
tell process "{app_name}"
if exists (window {window_index}) then
set visible of window {window_index} to false
return "Successfully hid window {window_index} of {app_name}."
else
return "Error: Window {window_index} of {app_name} not found."
end if
end tell
end tell
'''
return self._run_applescript(applescript)
def list_windows(self) -> str:
"""
Lists all open application windows with their titles and bounds.
Returns:
A JSON string representing the list of windows.
"""
applescript_revised = '''
set output_list to {}
tell application "System Events"
set all_processes to every process whose background only is false
repeat with a_process in all_processes
set app_name to name of a_process
if (count of windows of a_process) > 0 then
repeat with i from 1 to (count of windows of a_process)
set a_window to window i of a_process
try
set window_title to name of a_window
set window_bounds to bounds of a_window
set window_info to (app_name & "|" & i & "|" & window_title & "|" & (item 1 of window_bounds) & "," & (item 2 of window_bounds) & "," & (item 3 of window_bounds) & "," & (item 4 of window_bounds)) as string
set end of output_list to window_info
on error
set window_info to (app_name & "|" & i & "|" & "(Untitled/No Title)" & "|" & "(Unavailable)") as string
set end of output_list to window_info
end try
end repeat
end if
end repeat
end tell
set AppleScript's text item delimiters to linefeed
return (output_list as string)
'''
raw_output = self._run_applescript(applescript_revised)
if raw_output.startswith("AppleScript Error") or raw_output.startswith("Error:"):
return raw_output
parsed_windows = []
# With the text item delimiters set to linefeed above, each window
# record arrives on its own line.
items = [line for line in raw_output.split('\n') if line.strip()]
for item in items:
parts = item.split('|')
if len(parts) == 4:
app_name, index, title, bounds_str = parts
bounds = [int(b) for b in bounds_str.split(',')] if bounds_str != "(Unavailable)" else bounds_str
parsed_windows.append({
"app_name": app_name,
"window_index": int(index),
"title": title,
"bounds": bounds
})
else:
# Handle malformed items if any
parsed_windows.append({"raw_data": item, "error": "Malformed window info"})
return json.dumps(parsed_windows, indent=2)
-----------------------------------------------------------------------
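The pipe-delimited record format produced by `list_windows` can be checked in isolation. The sample line below uses made-up values (AppleScript `bounds` are the x1,y1,x2,y2 corners of the window):

```python
# Parse one "app|index|title|x1,y1,x2,y2" record as emitted by list_windows.
line = "Safari|1|Apple|0,25,1440,900"

app_name, index, title, bounds_str = line.split('|')
bounds = [int(b) for b in bounds_str.split(',')]
window = {"app_name": app_name,
          "window_index": int(index),
          "title": title,
          "bounds": bounds}
```

Using an unlikely character such as `|` as the field separator avoids collisions with commas, which appear both in window titles and inside the bounds list.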
6.3. SYSTEM SETTINGS
Description: The agent should be able to read, modify, and potentially restore various system settings. This could include volume levels, display brightness, dark mode status, network configurations, and more.
OS-Neutral Function Signatures (Conceptual):
-----------------------------------------------------------------------
class SystemSettingsTool(DesktopTool):
def __init__(self):
super().__init__(
name="system_settings",
description="Manages system settings (get, set, restore).",
parameters_schema={
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["get_setting", "set_setting", "restore_setting"], "description": "The specific setting action to perform."},
"setting_name": {"type": "string", "description": "Name of the setting (e.g., 'volume', 'dark_mode'). Required for all actions."},
"value": {"type": "string", "description": "New value for the setting (e.g., '50', 'true'). Required for set_setting action."}
},
"required": ["action", "setting_name"]
}
)
self._saved_settings = {} # Simple in-memory storage for restore
def execute(self, action: str, **kwargs) -> str:
"""
Executes a system setting action based on the 'action' parameter.
"""
if action == "get_setting":
return self.get_setting(kwargs.get("setting_name"))
elif action == "set_setting":
return self.set_setting(kwargs.get("setting_name"), kwargs.get("value"))
elif action == "restore_setting":
return self.restore_setting(kwargs.get("setting_name"))
else:
return f"Error: Unknown system setting action '{action}'."
@abc.abstractmethod
def get_setting(self, setting_name: str) -> str:
"""Retrieves the current value of a specified system setting."""
pass
@abc.abstractmethod
def set_setting(self, setting_name: str, value: str) -> str:
"""Sets a specified system setting to a new value."""
pass
@abc.abstractmethod
def restore_setting(self, setting_name: str) -> str:
"""Restores a specified system setting to a previously saved value."""
pass
-----------------------------------------------------------------------
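The `_saved_settings` dictionary supports a save-before-set / restore pattern, shown here in isolation. The helper names are illustrative; the MacOS implementation below applies the same idea to the volume setting:

```python
# Save-before-set / restore pattern used by _saved_settings.
saved = {}

def set_with_backup(name, new_value, current_values):
    """Remember the old value before overwriting, so it can be restored."""
    saved[name] = current_values[name]
    current_values[name] = new_value

def restore(name, current_values):
    """Put back the previously saved value, if one exists."""
    if name in saved:
        current_values[name] = saved.pop(name)
        return f"Restored {name}."
    return f"No saved value for {name}."
```

Because the backup lives only in memory, saved values are lost when the agent restarts; persisting them would belong in the Memory component of the architecture.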
MacOS Implementation Examples:
Many system settings can be controlled via AppleScript or the `defaults` command-line utility. For demonstration, we cover volume and dark mode.
-----------------------------------------------------------------------
import subprocess
class MacOSSystemSettingsTool(SystemSettingsTool):
"""
MacOS specific implementation for managing system settings.
Uses AppleScript for some settings and shell commands for others.
"""
def __init__(self):
super().__init__() # Calls the __init__ of SystemSettingsTool which sets schema
self._saved_settings = {} # Simple in-memory storage for restore
def _run_applescript(self, script: str) -> str:
"""Helper to run AppleScript and capture output/errors."""
try:
result = subprocess.run(
['osascript', '-e', script],
capture_output=True,
text=True,
check=True
)
return result.stdout.strip()
except subprocess.CalledProcessError as e:
return f"AppleScript Error: {e.stderr.strip()}"
except FileNotFoundError:
return "Error: osascript command not found."
except Exception as e:
return f"An unexpected error occurred: {str(e)}"
def get_setting(self, setting_name: str) -> str:
"""
Retrieves the current value of a specified system setting on MacOS.
Args:
setting_name: The name of the setting (e.g., 'volume', 'dark_mode').
Returns:
The current value of the setting as a string, or an error message.
"""
if setting_name == "volume":
applescript = 'output volume of (get volume settings)'
result = self._run_applescript(applescript)
if not result.startswith("AppleScript Error"):
return f"Current volume is: {result}"
return result
elif setting_name == "dark_mode":
# Dark mode can be checked via 'defaults read -g AppleInterfaceStyle'
try:
result = subprocess.run(
['defaults', 'read', '-g', 'AppleInterfaceStyle'],
capture_output=True, text=True, check=True
)
return "Dark Mode is enabled."
except subprocess.CalledProcessError:
# If the key doesn't exist, Dark Mode is not enabled (Light Mode)
return "Dark Mode is disabled (Light Mode is enabled)."
except Exception as e:
return f"Error checking dark mode: {str(e)}"
else:
return f"Setting '{setting_name}' is not supported for reading."
def set_setting(self, setting_name: str, value: str) -> str:
"""
Sets a specified system setting to a new value on MacOS.
Args:
setting_name: The name of the setting (e.g., 'volume', 'dark_mode').
value: The new value for the setting (e.g., '50', 'true').
Returns:
A string indicating success or failure.
"""
if setting_name == "volume":
try:
volume_level = int(value)
if not 0 <= volume_level <= 100:
return "Volume level must be between 0 and 100."
# Save current volume before setting
current_volume_script = 'output volume of (get volume settings)'
current_volume = self._run_applescript(current_volume_script)
if not current_volume.startswith("AppleScript Error"):
self._saved_settings['volume'] = current_volume
applescript = f'set volume output volume {volume_level}'
result = self._run_applescript(applescript)
if not result.startswith("AppleScript Error"):
return f"Successfully set volume to {volume_level}."
return result
except ValueError:
return "Invalid volume level. Please provide an integer between 0 and 100."
elif setting_name == "dark_mode":
# Dark mode is toggled here by UI-scripting System Events. This
# AppleScript may require Accessibility/Automation permissions and can
# behave differently across macOS versions. An alternative is
# 'defaults write NSGlobalDomain AppleInterfaceStyle Dark', but that
# typically requires restarting applications or logging out to take
# full effect, so the UI-scripting approach is used here.
# Save current dark mode status before setting
current_dark_mode = self.get_setting("dark_mode")
self._saved_settings['dark_mode'] = current_dark_mode
if value.lower() == "true":
applescript = 'tell application "System Events" to tell appearance preferences to set dark mode to true'
return self._run_applescript(applescript)
elif value.lower() == "false":
applescript = 'tell application "System Events" to tell appearance preferences to set dark mode to false'
return self._run_applescript(applescript)
else:
return "Invalid value for dark_mode. Use 'true' or 'false'."
else:
return f"Setting '{setting_name}' is not supported for setting."
def restore_setting(self, setting_name: str) -> str:
"""
Restores a specified system setting to a previously saved value.
Args:
setting_name: The name of the setting to restore.
Returns:
A string indicating success or failure.
"""
if setting_name in self._saved_settings:
saved_value = self._saved_settings[setting_name]
return self.set_setting(setting_name, saved_value)
else:
return f"No saved value found for setting '{setting_name}'."
-----------------------------------------------------------------------
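The save-before-set and restore contract above can be exercised without macOS. The following sketch is an illustrative in-memory stand-in (not the real tool) that mirrors the pattern `MacOSSystemSettingsTool` uses with `_saved_settings`:

```python
class FakeSystemSettingsTool:
    """In-memory stand-in mirroring the save-before-set / restore
    contract of MacOSSystemSettingsTool (illustrative only)."""

    def __init__(self):
        self._settings = {"volume": "25"}   # pretend current system state
        self._saved_settings = {}           # snapshots for restore

    def get_setting(self, setting_name: str) -> str:
        return self._settings.get(
            setting_name,
            f"Setting '{setting_name}' is not supported for reading.")

    def set_setting(self, setting_name: str, value: str) -> str:
        if setting_name not in self._settings:
            return f"Setting '{setting_name}' is not supported for setting."
        # Snapshot the current value before overwriting it.
        self._saved_settings[setting_name] = self._settings[setting_name]
        self._settings[setting_name] = value
        return f"Successfully set {setting_name} to {value}."

    def restore_setting(self, setting_name: str) -> str:
        if setting_name in self._saved_settings:
            return self.set_setting(setting_name,
                                    self._saved_settings[setting_name])
        return f"No saved value found for setting '{setting_name}'."

tool = FakeSystemSettingsTool()
tool.set_setting("volume", "50")
tool.restore_setting("volume")
print(tool.get_setting("volume"))  # back to the original value, "25"
```

The same snapshot-then-mutate ordering is what makes the real tool's `restore_setting` safe to call at any point after a `set_setting`.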
6.4. CONSOLE INTERACTION
Description: This capability allows the agent to open a terminal window and execute arbitrary shell commands. This is crucial for advanced users and for tasks that are best handled via the command line, such as package management, script execution, or system diagnostics.
OS-Neutral Function Signatures (Conceptual):
-----------------------------------------------------------------------
class ConsoleTool(DesktopTool):
def __init__(self):
super().__init__(
name="console_interaction",
description="Interacts with the console (open, execute commands).",
parameters_schema={
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["open_console", "execute_command"], "description": "The specific console action to perform."},
"command": {"type": "string", "description": "Shell command to execute (for 'execute_command' action)."},
"in_new_terminal": {"type": "boolean", "description": "Whether to execute in a new terminal window (for 'execute_command' action)."}
},
"required": ["action"]
}
)
def execute(self, action: str, **kwargs) -> str:
"""
Executes a console interaction action based on the 'action' parameter.
"""
if action == "open_console":
return self.open_console()
elif action == "execute_command":
return self.execute_command(kwargs.get("command"), kwargs.get("in_new_terminal", False))
else:
return f"Error: Unknown console action '{action}'."
@abc.abstractmethod
def open_console(self) -> str:
"""Opens a new console/terminal window."""
pass
@abc.abstractmethod
def execute_command(self, command: str, in_new_terminal: bool = False) -> str:
"""Executes a shell command, optionally in a new terminal window."""
pass
-----------------------------------------------------------------------
MacOS Implementation Example:
On MacOS, the `Terminal.app` or `iTerm2` (if installed) can be controlled via AppleScript. Shell commands are executed directly via `subprocess`.
-----------------------------------------------------------------------
import subprocess
import shlex # For safe command splitting
class MacOSConsoleTool(ConsoleTool):
"""
MacOS specific implementation for console interaction.
Uses AppleScript to open Terminal and subprocess for command execution.
"""
def __init__(self):
super().__init__() # Calls the __init__ of ConsoleTool which sets schema
def _run_applescript(self, script: str) -> str:
"""Helper to run AppleScript and capture output/errors."""
try:
result = subprocess.run(
['osascript', '-e', script],
capture_output=True,
text=True,
check=True
)
return result.stdout.strip()
except subprocess.CalledProcessError as e:
return f"AppleScript Error: {e.stderr.strip()}"
except FileNotFoundError:
return "Error: osascript command not found."
except Exception as e:
return f"An unexpected error occurred: {str(e)}"
def open_console(self) -> str:
"""
Opens a new Terminal application window on MacOS.
Returns:
A string indicating success or failure.
"""
applescript = 'tell application "Terminal" to activate'
return self._run_applescript(applescript)
def execute_command(self, command: str, in_new_terminal: bool = False) -> str:
"""
Executes a shell command on MacOS.
Args:
command: The shell command string to execute.
in_new_terminal: If True, the command runs in a new Terminal window.
If False, it runs non-interactively via Python's subprocess
and its output is captured and returned.
Returns:
The standard output and standard error of the command, or an error message.
"""
if in_new_terminal:
# To execute in a new Terminal window, we use AppleScript.
# 'do script' will run the command in a new tab/window.
# We need to escape the command for AppleScript.
escaped_command = command.replace('\\', '\\\\').replace('"', '\\"')  # Escape backslashes, then quotes
applescript = f'''
tell application "Terminal"
activate
do script "{escaped_command}"
end tell
'''
return self._run_applescript(applescript)
else:
# Execute non-interactively and capture output using subprocess.
try:
# With shell=True the command string is handed to the shell; with
# shell=False it must be a list of arguments (shlex.split builds one
# safely while respecting quotes). shell=True is convenient but risky
# with untrusted input, so commands should be validated beforehand.
result = subprocess.run(
command,
shell=True, # Be cautious with shell=True due to security implications
capture_output=True,
text=True,
check=False # Do not raise exception for non-zero exit codes here,
# we want to return stderr as part of the result.
)
if result.returncode == 0:
return f"Command executed successfully.\nOutput:\n{result.stdout.strip()}"
else:
return f"Command failed with exit code {result.returncode}.\nOutput:\n{result.stdout.strip()}\nError:\n{result.stderr.strip()}"
except FileNotFoundError:
return f"Error: Command '{command.split()[0]}' not found."
except Exception as e:
return f"An unexpected error occurred during command execution: {str(e)}"
---------------------------------------------------------------
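The non-terminal branch of `execute_command` is itself OS-neutral and can be isolated into a small helper. This is a minimal sketch of the same `subprocess.run(..., shell=True, check=False)` pattern; the standalone function name is ours, not part of the article's tool API:

```python
import subprocess

def run_command(command: str) -> str:
    """Run a shell command and fold stdout/stderr into the reply
    string, mirroring the silent (in_new_terminal=False) branch of
    MacOSConsoleTool.execute_command. check=False so non-zero exit
    codes are reported rather than raised."""
    result = subprocess.run(
        command,
        shell=True,           # cautious use only: validate input first
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode == 0:
        return f"Command executed successfully.\nOutput:\n{result.stdout.strip()}"
    return (f"Command failed with exit code {result.returncode}.\n"
            f"Output:\n{result.stdout.strip()}\n"
            f"Error:\n{result.stderr.strip()}")

print(run_command("echo hello"))
```

Returning the failure text instead of raising lets the agentic loop feed the error back to the LLM, which can then reflect and retry with a corrected command.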
6.5. FILE SYSTEM OPERATIONS
Description: This set of tools provides comprehensive control over the file system, allowing the agent to explore directories, create new files and folders, delete existing ones, move items, and search for files based on various criteria. It also includes the ability to navigate to user-specified directories.
OS-Neutral Function Signatures (Conceptual):
-----------------------------------------------------------------------
class FileSystemTool(DesktopTool):
def __init__(self):
super().__init__(
name="file_system_operations",
description="Manages files and folders (list, create, delete, move, find, change directory).",
parameters_schema={
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["list_directory", "create_file", "create_directory", "delete_item", "move_item", "find_files", "change_directory"], "description": "The specific file system action to perform."},
"path": {"type": "string", "description": "Path to file or directory. Required for list, create_file, create_directory, delete, change_directory."},
"source_path": {"type": "string", "description": "Source path for move operation. Required for move_item."},
"destination_path": {"type": "string", "description": "Destination path for move operation. Required for move_item."},
"content": {"type": "string", "description": "Content for new file. Optional for create_file."},
"name_pattern": {"type": "string", "description": "Name pattern (e.g., '*.txt') for finding files. Optional for find_files."},
"content_pattern": {"type": "string", "description": "Content pattern (regex) for finding files. Optional for find_files."}
},
"required": ["action"]
}
)
self._current_directory = os.path.expanduser("~") # Agent's conceptual current directory
def execute(self, action: str, **kwargs) -> str:
"""
Executes a file system operation based on the 'action' parameter.
"""
if action == "list_directory":
return self.list_directory(kwargs.get("path"))
elif action == "create_file":
return self.create_file(kwargs.get("path"), kwargs.get("content", ""))
elif action == "create_directory":
return self.create_directory(kwargs.get("path"))
elif action == "delete_item":
return self.delete_item(kwargs.get("path"))
elif action == "move_item":
return self.move_item(kwargs.get("source_path"), kwargs.get("destination_path"))
elif action == "find_files":
return self.find_files(kwargs.get("path"), kwargs.get("name_pattern"), kwargs.get("content_pattern"))
elif action == "change_directory":
return self.change_directory(kwargs.get("path"))
else:
return f"Error: Unknown file system action '{action}'."
@abc.abstractmethod
def list_directory(self, path: str) -> str:
"""Lists contents of a directory."""
pass
@abc.abstractmethod
def create_file(self, path: str, content: str = "") -> str:
"""Creates a new file with optional content."""
pass
@abc.abstractmethod
def create_directory(self, path: str) -> str:
"""Creates a new directory."""
pass
@abc.abstractmethod
def delete_item(self, path: str) -> str:
"""Deletes a file or empty directory."""
pass
@abc.abstractmethod
def move_item(self, source_path: str, destination_path: str) -> str:
"""Moves a file or directory."""
pass
@abc.abstractmethod
def find_files(self, directory: str, name_pattern: str = None,
content_pattern: str = None) -> str:
"""Finds files matching criteria within a directory."""
pass
@abc.abstractmethod
def change_directory(self, path: str) -> str:
"""Changes the agent's current working directory (conceptual)."""
pass
-----------------------------------------------------------------------
MacOS Implementation Examples:
File system operations on MacOS can be performed using standard shell commands (e.g., `ls`, `mkdir`, `rm`, `mv`, `find`) or via AppleScript interacting with the `Finder` application. Shell commands are generally more direct and efficient for these tasks.
-----------------------------------------------------------------------
import subprocess
import os
import json
class MacOSFileSystemTool(FileSystemTool):
"""
MacOS specific implementation for file system operations.
Uses standard shell commands via subprocess.
"""
def __init__(self):
super().__init__() # Calls the __init__ of FileSystemTool which sets schema
self._current_directory = os.path.expanduser("~") # Agent's conceptual current directory
def _run_shell_command(self, command: list[str]) -> tuple[int, str, str]:
"""Helper to run shell commands and capture output/errors."""
try:
result = subprocess.run(
command,
capture_output=True,
text=True,
check=False # We handle non-zero exit codes explicitly
)
return result.returncode, result.stdout.strip(), result.stderr.strip()
except FileNotFoundError:
return 127, "", f"Error: Command '{command[0]}' not found."
except Exception as e:
return 1, "", f"An unexpected error occurred: {str(e)}"
def list_directory(self, path: str) -> str:
"""
Lists the contents of a directory on MacOS.
Args:
path: The path to the directory.
Returns:
A string listing the directory contents or an error message.
"""
full_path = os.path.join(self._current_directory, os.path.expanduser(path))
returncode, stdout, stderr = self._run_shell_command(['ls', '-F', full_path])
if returncode == 0:
return f"Contents of {full_path}:\n{stdout}"
else:
return f"Error listing directory {full_path}: {stderr}"
def create_file(self, path: str, content: str = "") -> str:
"""
Creates a new file with optional content on MacOS.
Args:
path: The path for the new file.
content: Optional content to write to the file.
Returns:
A string indicating success or failure.
"""
full_path = os.path.join(self._current_directory, os.path.expanduser(path))
try:
with open(full_path, 'w') as f:
f.write(content)
return f"Successfully created file: {full_path}"
except Exception as e:
return f"Error creating file {full_path}: {str(e)}"
def create_directory(self, path: str) -> str:
"""
Creates a new directory on MacOS.
Args:
path: The path for the new directory.
Returns:
A string indicating success or failure.
"""
full_path = os.path.join(self._current_directory, os.path.expanduser(path))
returncode, stdout, stderr = self._run_shell_command(['mkdir', '-p', full_path])
if returncode == 0:
return f"Successfully created directory: {full_path}"
else:
return f"Error creating directory {full_path}: {stderr}"
def delete_item(self, path: str) -> str:
"""
Deletes a file or directory on MacOS.
Args:
path: The path to the file or directory to delete.
Returns:
A string indicating success or failure.
"""
full_path = os.path.join(self._current_directory, os.path.expanduser(path))
# Use 'rm -rf' for force recursive deletion of directories, be careful!
# For a safer agent, consider 'rm' for files and 'rmdir' for empty dirs,
# or prompt for confirmation for recursive deletion.
returncode, stdout, stderr = self._run_shell_command(['rm', '-rf', full_path])
if returncode == 0:
return f"Successfully deleted: {full_path}"
else:
return f"Error deleting {full_path}: {stderr}"
def move_item(self, source_path: str, destination_path: str) -> str:
"""
Moves a file or directory on MacOS.
Args:
source_path: The current path of the item.
destination_path: The new path for the item.
Returns:
A string indicating success or failure.
"""
full_source_path = os.path.join(self._current_directory, os.path.expanduser(source_path))
full_destination_path = os.path.join(self._current_directory, os.path.expanduser(destination_path))
returncode, stdout, stderr = self._run_shell_command(['mv', full_source_path, full_destination_path])
if returncode == 0:
return f"Successfully moved {full_source_path} to {full_destination_path}"
else:
return f"Error moving {full_source_path} to {full_destination_path}: {stderr}"
def find_files(self, directory: str, name_pattern: str = None,
content_pattern: str = None) -> str:
"""
Finds files matching criteria within a directory on MacOS.
Args:
directory: The directory to search within.
name_pattern: Optional glob pattern (e.g., '*.txt') for file names.
content_pattern: Optional regex pattern for file content.
Returns:
A JSON string listing found files or an error message.
"""
full_directory = os.path.join(self._current_directory, os.path.expanduser(directory))
find_command = ['find', full_directory, '-type', 'f']
if name_pattern:
find_command.extend(['-name', name_pattern])
returncode, stdout, stderr = self._run_shell_command(find_command)
if returncode != 0:
return f"Error finding files in {full_directory}: {stderr}"
found_files = stdout.splitlines()
results = []
if content_pattern:
# Filter by content pattern using 'grep'
for f_path in found_files:
grep_command = ['grep', '-q', content_pattern, f_path]
grep_returncode, _, _ = self._run_shell_command(grep_command)
if grep_returncode == 0: # grep returns 0 if pattern found
results.append(f_path)
else:
results = found_files
return json.dumps({"found_files": results}, indent=2)
def change_directory(self, path: str) -> str:
"""
Changes the agent's conceptual current working directory.
This does not change the Python script's actual CWD, but updates
the base path for subsequent file system operations.
Args:
path: The new directory path.
Returns:
A string indicating success or failure.
"""
new_path = os.path.join(self._current_directory, os.path.expanduser(path))
if os.path.isdir(new_path):
self._current_directory = os.path.abspath(new_path)
return f"Agent's current directory changed to: {self._current_directory}"
else:
return f"Error: Directory not found or not a directory: {new_path}"
-----------------------------------------------------------------------
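One design choice worth making explicit is how the tool resolves every path against its conceptual current directory. A minimal sketch of the resolution rule used throughout `MacOSFileSystemTool` (the helper name is illustrative, not from the article):

```python
import os
import tempfile

def resolve(current_dir: str, path: str) -> str:
    """Path resolution as applied inside MacOSFileSystemTool: expand
    '~', then join against the agent's conceptual current directory.
    os.path.join discards current_dir when path is absolute, so
    absolute paths pass through unchanged."""
    return os.path.abspath(os.path.join(current_dir, os.path.expanduser(path)))

base = tempfile.mkdtemp()
print(resolve(base, "notes.txt"))  # relative path lands under the agent's CWD
print(resolve(base, os.sep))       # absolute path is unaffected by current_dir
```

This is why `change_directory` only has to update `self._current_directory`: every other method re-resolves its arguments against it on the next call.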
6.6. COMMAND SEQUENCE MANAGEMENT
Description: This advanced capability allows users to save a series of actions performed by the agent as a named sequence, and then replay that sequence later with a single command. This is immensely useful for automating complex, multi-step workflows or for creating custom macros.
Mechanism: Instead of saving raw natural language, the agent saves the structured tool calls (e.g., the JSON objects generated by the LLM) that constitute a sequence. This ensures that the replayed sequence is robust and deterministic.
Implementation Concept:
The agent would maintain a dictionary or a persistent storage (like a JSON file) mapping a user-defined name to a list of structured tool call dictionaries.
-----------------------------------------------------------------------
import json
import os
class CommandSequenceTool(DesktopTool):
"""
Tool for saving and replaying sequences of commands.
Stores structured tool calls.
"""
def __init__(self, sequence_file="command_sequences.json"):
super().__init__(
name="command_sequence_management",
description="Manages saving and replaying sequences of commands.",
parameters_schema={
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["save_sequence", "replay_sequence", "list_saved_sequences"], "description": "The specific sequence action to perform."},
"sequence_name": {"type": "string", "description": "Name for the command sequence. Required for save and replay."},
"commands": {"type": "array", "items": {"type": "object"}, "description": "List of structured command objects to save. Required for save_sequence."}
},
"required": ["action"]
}
)
self.sequence_file = os.path.expanduser(sequence_file)
self.saved_sequences = {} # Initialize to empty, then load
self._load_sequences()
def _load_sequences(self):
"""Loads saved command sequences from a JSON file."""
if os.path.exists(self.sequence_file):
with open(self.sequence_file, 'r') as f:
try:
self.saved_sequences = json.load(f)
except json.JSONDecodeError:
self.saved_sequences = {}
print(f"Warning: Could not decode {self.sequence_file}. Starting with empty sequences.")
else:
self.saved_sequences = {}
def _save_sequences(self):
"""Saves current command sequences to a JSON file."""
with open(self.sequence_file, 'w') as f:
json.dump(self.saved_sequences, f, indent=2)
def execute(self, action: str, **kwargs) -> str:
"""
Executes a command sequence action based on the 'action' parameter.
"""
if action == "save_sequence":
return self.save_sequence(kwargs.get("sequence_name"), kwargs.get("commands"))
elif action == "replay_sequence":
# replay_sequence needs the tool_orchestrator, which is not a direct parameter
# from the LLM. It must be passed from the agentic loop.
# For this tool's direct execution, we'll return an error or handle it conceptually.
return "Error: 'replay_sequence' must be handled by the agentic controller, not directly by the tool."
elif action == "list_saved_sequences":
return self.list_saved_sequences()
else:
return f"Error: Unknown command sequence action '{action}'."
def save_sequence(self, sequence_name: str, commands: list[dict]) -> str:
"""
Saves a list of structured command objects under a given name.
Args:
sequence_name: The name to save the sequence under.
commands: A list of dictionaries, where each dictionary represents
a tool call (e.g., {"tool_name": "open_application", "parameters": {"app_name": "Safari"}}).
Returns:
A string indicating success or failure.
"""
if not isinstance(commands, list) or not all(isinstance(cmd, dict) for cmd in commands):
return "Error: Commands must be a list of dictionaries."
self.saved_sequences[sequence_name] = commands
self._save_sequences()
return f"Command sequence '{sequence_name}' saved successfully."
def replay_sequence(self, sequence_name: str, tool_orchestrator) -> str:
"""
Replays a previously saved command sequence.
This method is called by the AgenticDesktopController, not directly by the LLM.
Args:
sequence_name: The name of the sequence to replay.
tool_orchestrator: An instance of the Tool Orchestrator to execute commands.
Returns:
A string indicating the outcome of the replay.
"""
if sequence_name not in self.saved_sequences:
return f"Error: Command sequence '{sequence_name}' not found."
sequence = self.saved_sequences[sequence_name]
results = []
print(f"Replaying sequence '{sequence_name}' with {len(sequence)} steps...")
for i, command_data in enumerate(sequence):
tool_name = command_data.get("tool_name")
parameters = command_data.get("parameters", {})
if not tool_name:
results.append(f"Step {i+1}: Invalid command data (missing 'tool_name').")
continue
try:
print(f" Executing step {i+1}: Calling {tool_name} with {parameters}")
result = tool_orchestrator.execute_tool(tool_name, **parameters)
results.append(f"Step {i+1} ({tool_name}): {result}")
except Exception as e:
results.append(f"Step {i+1} ({tool_name}) Error: {str(e)}")
# Decide if replay should stop on error
break
return f"Replay of '{sequence_name}' completed:\n" + "\n".join(results)
def list_saved_sequences(self) -> str:
"""
Lists the names of all currently saved command sequences.
Returns:
A string listing the sequence names.
"""
if not self.saved_sequences:
return "No command sequences saved."
sequence_names = ", ".join(self.saved_sequences.keys())
return f"Saved command sequences: {sequence_names}"
-----------------------------------------------------------------------
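Replaying a saved sequence only needs an object that exposes `execute_tool`. A minimal sketch with a stub orchestrator (the stub and the example tool names are illustrative), using the `tool_name`/`parameters` shape documented in `save_sequence`:

```python
class StubOrchestrator:
    """Minimal stand-in for the Tool Orchestrator: records each call
    instead of touching the desktop."""

    def __init__(self):
        self.calls = []

    def execute_tool(self, tool_name, **params):
        self.calls.append((tool_name, params))
        return f"{tool_name} ok"

# A saved sequence in the documented shape: tool_name + parameters.
sequence = [
    {"tool_name": "open_application", "parameters": {"app_name": "Safari"}},
    {"tool_name": "window_management", "parameters": {"action": "list_windows"}},
]

orch = StubOrchestrator()
for step in sequence:
    orch.execute_tool(step["tool_name"], **step["parameters"])
print(len(orch.calls))  # 2
```

Because the sequence stores structured calls rather than natural language, the replay is deterministic: no LLM round-trip is needed at replay time.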
6.7. TEXT FILE OPERATIONS
Description: This capability allows the agent to interact with text files, specifically opening them, reading their content, and providing a summary of their content. The summarization aspect leverages the LLM's natural language processing abilities.
OS-Neutral Function Signatures (Conceptual):
-----------------------------------------------------------------------
class TextFileTool(DesktopTool):
def __init__(self):
super().__init__(
name="text_file_operations",
description="Performs operations on text files (open, read, summarize).",
parameters_schema={
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["open_text_file", "read_text_file", "summarize_text_file"], "description": "The specific text file action to perform."},
"path": {"type": "string", "description": "Path to the text file. Required for all actions."},
},
"required": ["action", "path"]
}
)
def execute(self, action: str, **kwargs) -> str:
"""
Executes a text file operation based on the 'action' parameter.
"""
if action == "open_text_file":
return self.open_text_file(kwargs.get("path"))
elif action == "read_text_file":
return self.read_text_file(kwargs.get("path"))
elif action == "summarize_text_file":
# Summarization requires the LLM agent, which is not a direct tool parameter.
# It must be handled by the agentic loop.
return "Error: 'summarize_text_file' must be handled by the agentic controller, not directly by the tool."
else:
return f"Error: Unknown text file action '{action}'."
@abc.abstractmethod
def open_text_file(self, path: str) -> str:
"""Opens a text file in the default text editor."""
pass
@abc.abstractmethod
def read_text_file(self, path: str) -> str:
"""Reads and returns the entire content of a text file."""
pass
# Note: summarize_text_file is handled by the agentic controller, not directly by the tool.
# The tool will provide the content, and the controller will pass it to the LLM.
-----------------------------------------------------------------------
MacOS Implementation Examples:
Opening files can be done with the `open` command. Reading files is a standard Python file operation. Summarization involves sending the file content back to the LLM for processing.
-----------------------------------------------------------------------
import subprocess
import os
class MacOSTextFileTool(TextFileTool):
"""
MacOS specific implementation for text file operations.
Uses 'open' command for opening and Python for reading.
Summarization delegates to the LLM.
"""
def __init__(self):
super().__init__() # Calls the __init__ of TextFileTool which sets schema
def _run_shell_command(self, command: list[str]) -> tuple[int, str, str]:
"""Helper to run shell commands and capture output/errors."""
try:
result = subprocess.run(
command,
capture_output=True,
text=True,
check=False
)
return result.returncode, result.stdout.strip(), result.stderr.strip()
except FileNotFoundError:
return 127, "", f"Error: Command '{command[0]}' not found."
except Exception as e:
return 1, "", f"An unexpected error occurred: {str(e)}"
def open_text_file(self, path: str) -> str:
"""
Opens a text file on MacOS using the default application.
Args:
path: The path to the text file.
Returns:
A string indicating success or failure.
"""
full_path = os.path.expanduser(path)
if not os.path.exists(full_path):
return f"Error: File not found at {full_path}"
returncode, stdout, stderr = self._run_shell_command(['open', full_path])
if returncode == 0:
return f"Successfully opened file: {full_path}"
else:
return f"Error opening file {full_path}: {stderr}"
def read_text_file(self, path: str) -> str:
"""
Reads and returns the entire content of a text file on MacOS.
Args:
path: The path to the text file.
Returns:
The content of the file as a string, or an error message.
"""
full_path = os.path.expanduser(path)
if not os.path.exists(full_path):
return f"Error: File not found at {full_path}"
if not os.path.isfile(full_path):
return f"Error: Path is not a file: {full_path}"
try:
with open(full_path, 'r', encoding='utf-8') as f:
content = f.read()
return content # Return raw content for summarization by LLM
except UnicodeDecodeError:
return f"Error: Could not read file {full_path} with UTF-8 encoding. Try a different encoding or ensure it's a text file."
except Exception as e:
return f"Error reading file {full_path}: {str(e)}"
-----------------------------------------------------------------------
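The controller-side summarization flow described above can be sketched independently of macOS and of any particular LLM. Here `read_file` and `ask_llm` are injected stand-ins (our names, not the article's API) for `read_text_file` and the LLM call:

```python
def summarize_text_file(path: str, read_file, ask_llm) -> str:
    """Controller-side flow: the tool supplies the raw text, the LLM
    produces the summary. The two callables are injected so the flow
    itself stays OS- and model-neutral."""
    content = read_file(path)
    if content.startswith("Error:"):
        return content  # propagate the tool's error message unchanged
    return ask_llm(f"Summarize the following file:\n\n{content}")

# Stand-ins for MacOSTextFileTool.read_text_file and the LLM request:
fake_read = lambda p: "line one\nline two"
fake_llm = lambda prompt: f"Summary of {len(prompt)} chars."
print(summarize_text_file("notes.txt", fake_read, fake_llm))
```

The `startswith("Error:")` check relies on the convention, used throughout the tools above, that failures are reported as strings beginning with "Error:".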
MEMORY AND CONTEXT MANAGEMENT
For an agent to be truly intelligent and helpful, it needs memory. Memory allows the agent to maintain context across multiple interactions, remember user preferences, recall past actions, and store system states.
Why it's needed:
* Maintaining state: If a user asks to "move *that* window," the agent needs to remember which window "that" refers to from a previous interaction.
* User preferences: Remembering preferred applications for certain file types, or default directories.
* Command history: Enabling "undo" actions or replaying past sequences.
* System settings: Storing original settings before making temporary changes, allowing for a "restore" function.
* LLM Context: The `OpenAILLMAgent` class maintains a `self.messages` list, which is crucial for the LLM to understand the ongoing conversation and previous tool outputs, enabling multi-turn interactions and complex task execution.
Strategies for Memory:
1. In-memory dictionaries: For short-term context within a single session, Python dictionaries are sufficient. This is what was demonstrated with `_saved_settings` in the `MacOSSystemSettingsTool`.
2. JSON files: For persistent storage across sessions, saving memory content to a JSON file is a straightforward approach. The `CommandSequenceTool` demonstrates this by saving sequences to `command_sequences.json`.
3. Database (e.g., SQLite): For more complex, structured, and queryable memory, a lightweight database can be employed. This would be suitable for storing detailed logs of actions, user profiles, or complex desktop configurations.
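Strategy 3 can be sketched in a few lines with Python's built-in `sqlite3` module; the table name and columns here are illustrative, not prescribed by the article:

```python
import sqlite3

# Illustrative queryable action log for long-term agent memory.
conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""CREATE TABLE action_log (
    id INTEGER PRIMARY KEY,
    tool_name TEXT NOT NULL,
    parameters TEXT,
    result TEXT
)""")
conn.execute(
    "INSERT INTO action_log (tool_name, parameters, result) VALUES (?, ?, ?)",
    ("open_application", '{"app_name": "Safari"}', "ok"),
)
conn.commit()
rows = conn.execute("SELECT tool_name, result FROM action_log").fetchall()
print(rows)  # [('open_application', 'ok')]
```

Unlike a flat JSON file, this lets the agent answer questions such as "what did I do to this file last week?" with a single SQL query.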
The LLM itself, through its conversational history (`self.messages` in our `OpenAILLMAgent`), inherently provides a form of memory, allowing it to maintain context during multi-turn dialogues and reason about past actions and their outcomes. This is critical for the "Reflect" step of the agentic loop.
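For concreteness, a `self.messages` history in the OpenAI chat format might look like this after one tool round-trip (the content strings are illustrative):

```python
# Sketch of the conversational memory kept in OpenAILLMAgent.messages.
# Role names and the tool_calls shape follow the OpenAI chat format.
messages = [
    {"role": "system", "content": "You are a desktop control agent."},
    {"role": "user", "content": "Open Safari and list my windows."},
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "open_application",
                                  "arguments": '{"app_name": "Safari"}'}}]},
    # The tool result is appended with the matching tool_call_id so the
    # LLM can reflect on it in the next turn.
    {"role": "tool", "tool_call_id": "call_1",
     "content": "Successfully opened Safari."},
]
print(len(messages))
```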
CHALLENGES AND CONSIDERATIONS
Building an agentic AI for desktop control presents several challenges and important considerations.
Security Implications:
Giving an AI agent direct control over your operating system means granting it significant power. Malicious prompts or vulnerabilities in the agent's code could lead to data deletion, unauthorized access, or system compromise. Robust input validation, permission models, and careful sandboxing of tool execution are paramount. Users should be aware of the risks and only use trusted agents. When using external LLM APIs, ensure your API keys are stored securely (e.g., environment variables) and never hardcoded.
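One concrete guardrail is to confine destructive file operations to an allow-listed directory before any tool executes them. This is an illustrative sketch (the sandbox location and helper name are our assumptions, not part of the article's design):

```python
import os

# Illustrative sandbox root; a real agent would make this configurable.
SANDBOX = os.path.realpath(os.path.expanduser("~/agent_sandbox"))

def is_path_allowed(path: str) -> bool:
    """True only if the resolved path lies inside the sandbox.
    realpath defeats '..' tricks and symlink escapes."""
    resolved = os.path.realpath(os.path.expanduser(path))
    return os.path.commonpath([resolved, SANDBOX]) == SANDBOX

print(is_path_allowed(os.path.join(SANDBOX, "inbox", "tmp.txt")))  # True
print(is_path_allowed("/etc/passwd"))                              # False
```

A tool like `delete_item` would call such a check first and refuse (or ask the user for confirmation) when the target falls outside the sandbox.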
Robustness and Error Handling:
Operating systems are complex, and applications can behave unpredictably. Tools must be designed to handle errors gracefully, provide informative feedback, and ideally, allow the agent to attempt recovery strategies. For instance, if an application fails to open, the agent might try again or suggest an alternative. The LLM's ability to "reflect" on tool outputs and adjust its plan is key here.
Performance:
Executing commands via `subprocess` or AppleScript introduces overhead. For highly responsive interactions, minimizing the number of inter-process communications and optimizing script execution times is important. Additionally, each call to a remote LLM API introduces network latency, which can impact the perceived responsiveness of the agent. Batching requests or using local LLMs can mitigate this.
Ethical Considerations:
An agent with broad desktop control capabilities could inadvertently or intentionally perform actions that the user did not fully intend or understand. Transparency about what the agent is doing, clear confirmation steps for destructive actions, and an "undo" mechanism are crucial for building trust and ensuring ethical operation. The agent should always prioritize user control and safety.
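A confirmation gate for destructive actions can be as small as a lookup plus a prompt. The action names below are assumptions for illustration, not the article's actual tool set:

```python
# Illustrative set of actions that require explicit user confirmation.
DESTRUCTIVE_ACTIONS = {"delete_file", "move_file", "close_all_windows"}

def confirm_if_destructive(action: str, ask_user) -> bool:
    """Return True if the action may proceed; prompt the user for risky ones."""
    if action not in DESTRUCTIVE_ACTIONS:
        return True  # safe actions run without a prompt
    answer = ask_user(f"The agent wants to run '{action}'. Proceed? [y/N] ")
    return answer.strip().lower() == "y"

print(confirm_if_destructive("list_directory", lambda _: "n"))  # True
print(confirm_if_destructive("delete_file", lambda _: "y"))     # True
print(confirm_if_destructive("delete_file", lambda _: "n"))     # False
```

Passing the prompt function in as `ask_user` (rather than calling `input` directly) keeps the gate testable and lets a GUI front-end substitute its own dialog.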
LLM Cost and Rate Limits:
Using remote LLM APIs incurs costs based on token usage (input and output) and model complexity. Developers must be mindful of these costs and implement strategies like caching, summarization of conversation history, or choosing cost-effective models when appropriate. API rate limits also need to be considered to avoid service interruptions.
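One common cost-control tactic is to bound the conversation history sent with each request, keeping the system prompt plus only the most recent turns. A minimal sketch (the default limit of 20 messages is an arbitrary choice for illustration):

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system prompt plus only the most recent messages so the
    per-request token count stays bounded."""
    if len(messages) <= max_messages:
        return messages
    system = [m for m in messages[:1] if m.get("role") == "system"]
    recent = messages[-(max_messages - len(system)):]
    return system + recent

history = [{"role": "system", "content": "You are a desktop assistant."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(50)]
trimmed = trim_history(history, max_messages=5)
print(len(trimmed))            # 5
print(trimmed[0]["role"])      # system
print(trimmed[-1]["content"])  # msg 49
```

More sophisticated variants summarize the dropped turns into a single message instead of discarding them, trading a small summarization cost for retained context.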
CONCLUSION
The development of an LLM-powered agentic AI for desktop control marks a significant leap towards more intuitive and efficient human-computer interaction. By combining the natural language understanding capabilities of real LLMs with a robust tooling layer, we can create intelligent assistants that automate complex desktop tasks, streamline workflows, and enhance accessibility.
This article has detailed the core architectural components, from the LLM's role in interpreting user intent via function calling to the OS-specific implementations that execute commands. We explored how to build tools for opening applications, managing windows, adjusting system settings, interacting with the console, performing file system operations, managing command sequences, and handling text files. While focusing on MacOS for practical examples, the principles discussed are broadly applicable to Windows and Linux, emphasizing the importance of OS-neutral abstractions.
The journey to a fully autonomous and perfectly reliable desktop agent is ongoing, fraught with challenges related to security, robustness, ethical design, and managing LLM interactions. However, the foundational concepts and implementation strategies outlined here provide a solid starting point for any developer looking to leverage the power of agentic AI to enhance their daily productivity and reshape their interaction with digital workspaces. The future of desktop computing is conversational, and with agentic AI, we are building the bridges to that future.
ADDENDUM: FULL RUNNING EXAMPLE - AGENTIC DESKTOP CONTROLLER WITH OPENAI LLM
This addendum provides a runnable Python script demonstrating the core concepts of an agentic desktop controller using OpenAI's LLM for natural language understanding and tool calling. It includes a basic agent loop and a few of the MacOS-specific tools implemented earlier. This example will show how the agent can open an application, list files, and move a window, all driven by a real LLM.
To run this example on MacOS, ensure you have Python 3 installed. You will need to install the `openai` Python library.
Installation:
pip install openai
API Key Setup:
You need an OpenAI API key. It is strongly recommended to set it as an environment variable:
export OPENAI_API_KEY='your_openai_api_key_here'
Replace 'your_openai_api_key_here' with your actual key.
-----------------------------------------------------------------------
import subprocess
import os
import json
import abc
import time
import openai # Import the OpenAI library
# ==============================================================================
# 0. Base Tool Abstraction (OS-Neutral)
# ==============================================================================
class DesktopTool(abc.ABC):
"""
Abstract base class for all desktop interaction tools.
Defines the common interface for tools, allowing for OS-specific
implementations to adhere to a consistent structure.
"""
def __init__(self, name: str, description: str, parameters_schema: dict):
self.name = name
self.description = description
# parameters_schema must be a valid JSON Schema for the tool's arguments
self.parameters_schema = parameters_schema
@abc.abstractmethod
def execute(self, **kwargs) -> str:
"""
Executes the tool's action with the given parameters.
Must be implemented by concrete OS-specific tool classes.
"""
pass
def get_tool_definition(self) -> dict:
"""
Returns the tool's definition in a format suitable for LLM function calling.
This format is specific to the LLM API being used (e.g., OpenAI's format).
"""
return {
"type": "function",
"function": {
"name": self.name,
"description": self.description,
"parameters": self.parameters_schema
}
}
# ==============================================================================
# 1. MacOS Specific Tool Implementations (from article)
# ==============================================================================
# Helper to run AppleScript
def _run_applescript(script: str) -> str:
"""Helper to run AppleScript and capture output/errors."""
try:
result = subprocess.run(
['osascript', '-e', script],
capture_output=True,
text=True,
check=True
)
return result.stdout.strip()
except subprocess.CalledProcessError as e:
return f"AppleScript Error: {e.stderr.strip()}"
except FileNotFoundError:
return "Error: osascript command not found. Is MacOS installed correctly?"
except Exception as e:
return f"An unexpected error occurred: {str(e)}"
# Helper to run Shell Commands
def _run_shell_command(command: list[str]) -> tuple[int, str, str]:
"""Helper to run shell commands and capture output/errors."""
try:
result = subprocess.run(
command,
capture_output=True,
text=True,
check=False # We handle non-zero exit codes explicitly
)
return result.returncode, result.stdout.strip(), result.stderr.strip()
except FileNotFoundError:
return 127, "", f"Error: Command '{command[0]}' not found."
except Exception as e:
return 1, "", f"An unexpected error occurred: {str(e)}"
# MacOSOpenApplicationTool
class MacOSOpenApplicationTool(DesktopTool):
"""
MacOS specific implementation for opening applications.
Leverages AppleScript via osascript to activate applications.
"""
def __init__(self):
super().__init__(
name="open_application",
description="Opens a specified application by its name.",
parameters_schema={
"type": "object",
"properties": {
"app_name": {
"type": "string",
"description": "The name of the application to open (e.g., 'Safari', 'Terminal')."
}
},
"required": ["app_name"]
}
)
def execute(self, app_name: str) -> str:
"""
Opens a specified application on MacOS using AppleScript.
Args:
app_name: The name of the application to open (e.g., 'Safari').
Returns:
A string indicating success or failure.
"""
applescript_command = f'tell application "{app_name}" to activate'
return _run_applescript(applescript_command)
# MacOSWindowManagementTool (simplified for example)
class MacOSWindowManagementTool(DesktopTool):
"""
MacOS specific implementation for managing windows.
Uses AppleScript via osascript to interact with System Events.
This version focuses on `move_window` and `list_windows`.
"""
def __init__(self):
super().__init__(
name="window_management",
description="Manages desktop windows (move, list).",
parameters_schema={
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["move_window", "list_windows"], "description": "The specific window action to perform."},
"app_name": {"type": "string", "description": "Name of the application whose window to manage (e.g., 'Safari'). Required for move_window."},
"window_index": {"type": "integer", "description": "1-based index of the window to target. Required for move_window."},
"x": {"type": "integer", "description": "X coordinate for the top-left corner of the window. Required for move_window."},
"y": {"type": "integer", "description": "Y coordinate for the top-left corner of the window. Required for move_window."}
},
"required": ["action"]
}
)
def execute(self, action: str, **kwargs) -> str:
"""
Executes a window management action.
"""
if action == "move_window":
return self.move_window(kwargs.get("app_name"), kwargs.get("window_index"),
kwargs.get("x"), kwargs.get("y"))
elif action == "list_windows":
return self.list_windows()
else:
return f"Error: Unknown window management action '{action}'."
def move_window(self, app_name: str, window_index: int, x: int, y: int) -> str:
"""
Moves the specified window of an application to new screen coordinates.
"""
applescript = f'''
tell application "System Events"
tell process "{app_name}"
if exists (window {window_index}) then
set position of window {window_index} to {{{x}, {y}}}
return "Successfully moved window {window_index} of {app_name} to ({x}, {y})."
else
return "Error: Window {window_index} of {app_name} not found."
end if
end tell
end tell
'''
return _run_applescript(applescript)
def list_windows(self) -> str:
"""
Lists all open application windows with their titles and bounds.
Returns:
A JSON string representing the list of windows.
"""
applescript_revised = '''
set output_list to {}
tell application "System Events"
set all_processes to every process whose background only is false
repeat with a_process in all_processes
set app_name to name of a_process
if (count of windows of a_process) > 0 then
repeat with i from 1 to (count of windows of a_process)
set a_window to window i of a_process
try
                set window_title to name of a_window
                -- System Events windows expose position and size, not bounds
                set {win_x, win_y} to position of a_window
                set {win_w, win_h} to size of a_window
                set window_info to (app_name & "|" & i & "|" & window_title & "|" & win_x & "," & win_y & "," & win_w & "," & win_h) as string
set end of output_list to window_info
on error
set window_info to (app_name & "|" & i & "|" & "(Untitled/No Title)" & "|" & "(Unavailable)") as string
set end of output_list to window_info
end try
end repeat
end if
end repeat
        end tell
        -- Join records with linefeeds; the default text item delimiters would
        -- concatenate list items with no separator, making parsing impossible.
        set AppleScript's text item delimiters to linefeed
        return (output_list as string)
        '''
        raw_output = _run_applescript(applescript_revised)
        if raw_output.startswith("AppleScript Error") or raw_output.startswith("Error:"):
            return raw_output
        parsed_windows = []
        # Each window record is on its own line (see the delimiter set above).
        items = [line for line in raw_output.splitlines() if line.strip()]
for item in items:
parts = item.split('|')
if len(parts) == 4:
app_name, index, title, bounds_str = parts
bounds = [int(b) for b in bounds_str.split(',')] if bounds_str != "(Unavailable)" else bounds_str
parsed_windows.append({
"app_name": app_name,
"window_index": int(index),
"title": title,
"bounds": bounds
})
else:
parsed_windows.append({"raw_data": item, "error": "Malformed window info"})
return json.dumps(parsed_windows, indent=2)
# MacOSFileSystemTool (simplified for example)
class MacOSFileSystemTool(DesktopTool):
"""
MacOS specific implementation for file system operations.
Uses standard shell commands via subprocess.
This version focuses on `list_directory`.
"""
def __init__(self):
super().__init__(
name="file_system_operations",
description="Manages files and folders (list).",
parameters_schema={
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["list_directory"], "description": "The specific file system action to perform."},
"path": {"type": "string", "description": "Path to directory to list."},
},
"required": ["action", "path"]
}
)
self._current_directory = os.path.expanduser("~") # Agent's conceptual current directory
def execute(self, action: str, **kwargs) -> str:
"""
Executes a file system operation.
"""
if action == "list_directory":
return self.list_directory(kwargs.get("path"))
else:
return f"Error: Unknown file system action '{action}'."
def list_directory(self, path: str) -> str:
"""
Lists the contents of a directory on MacOS.
"""
full_path = os.path.join(self._current_directory, os.path.expanduser(path))
returncode, stdout, stderr = _run_shell_command(['ls', '-F', full_path])
if returncode == 0:
return f"Contents of {full_path}:\n{stdout}"
else:
return f"Error listing directory {full_path}: {stderr}"
# MacOSTextFileTool (simplified for example)
class MacOSTextFileTool(DesktopTool):
"""
MacOS specific implementation for text file operations.
Uses 'open' command for opening and Python for reading.
Summarization delegates to the LLM.
"""
def __init__(self):
super().__init__(
name="text_file_operations",
            description="Performs operations on text files (read, summarize).",
            parameters_schema={
                "type": "object",
                "properties": {
                    "action": {"type": "string", "enum": ["read_text_file", "summarize_text_file"], "description": "The specific text file action to perform. 'summarize_text_file' is intercepted by the agent loop, which reads the file and asks the LLM for a summary."},
"path": {"type": "string", "description": "Path to the text file."},
},
"required": ["action", "path"]
}
)
def execute(self, action: str, **kwargs) -> str:
"""
Executes a text file operation based on the 'action' parameter.
"""
if action == "read_text_file":
return self.read_text_file(kwargs.get("path"))
else:
return f"Error: Unknown text file action '{action}'."
def read_text_file(self, path: str) -> str:
"""
Reads and returns the entire content of a text file on MacOS.
Args:
path: The path to the text file.
Returns:
The content of the file as a string, or an error message.
"""
full_path = os.path.expanduser(path)
if not os.path.exists(full_path):
return f"Error: File not found at {full_path}"
if not os.path.isfile(full_path):
return f"Error: Path is not a file: {full_path}"
try:
with open(full_path, 'r', encoding='utf-8') as f:
content = f.read()
return content # Return raw content for summarization by LLM
except UnicodeDecodeError:
return f"Error: Could not read file {full_path} with UTF-8 encoding. Try a different encoding or ensure it's a text file."
except Exception as e:
return f"Error reading file {full_path}: {str(e)}"
# ==============================================================================
# 2. Tool Orchestrator
# ==============================================================================
class ToolOrchestrator:
"""
Manages and executes desktop tools.
"""
def __init__(self, tools: list[DesktopTool]):
self.tools = {tool.name: tool for tool in tools}
def get_tool_definitions_for_llm(self) -> list[dict]:
"""
Returns a list of all tool definitions suitable for LLM function calling.
"""
return [tool.get_tool_definition() for tool in self.tools.values()]
def execute_tool(self, tool_name: str, **parameters) -> str:
"""
Executes a specific tool with the given parameters.
"""
tool = self.tools.get(tool_name)
if not tool:
return f"Error: Tool '{tool_name}' not found."
print(f"DEBUG: Executing tool '{tool_name}' with parameters: {parameters}")
return tool.execute(**parameters)
# ==============================================================================
# 3. Real LLM Agent (OpenAI Integration)
# ==============================================================================
class OpenAILLMAgent:
"""
An LLM agent that uses OpenAI's API for natural language understanding
and tool calling.
"""
def __init__(self, api_key: str, model_name: str = "gpt-4o"):
if not api_key:
raise ValueError("OpenAI API key must be provided.")
self.client = openai.OpenAI(api_key=api_key)
self.model_name = model_name
self.messages = [{"role": "system", "content": "You are a helpful desktop assistant. Use the available tools to assist the user. If a tool call is needed, provide only the tool call. If you need to respond in natural language, do so directly. If you read a file, summarize it for the user."}]
def _add_message(self, role: str, content: str = None, tool_calls: list = None, tool_call_id: str = None, name: str = None):
"""Helper to add messages to the conversation history."""
message = {"role": role}
if content is not None: # Allow empty string content
message["content"] = content
if tool_calls:
message["tool_calls"] = tool_calls
if tool_call_id:
message["tool_call_id"] = tool_call_id
if name:
message["name"] = name
self.messages.append(message)
    def generate_response(self, user_query: str, available_tools: list[dict]) -> tuple[str, object]:
"""
Sends the user query and available tools to the LLM and gets a response.
Returns a tuple: (response_type, data)
response_type can be "tool_call", "text_response", or "error".
data is the tool call dictionary, the text string, or an error message.
"""
print(f"\nLLM: Processing user query: '{user_query}'")
self._add_message("user", content=user_query)
        try:
            # The OpenAI API rejects an empty tools array, so only include
            # tools (and tool_choice) when at least one tool is available.
            request_kwargs = {"model": self.model_name, "messages": self.messages}
            if available_tools:
                request_kwargs["tools"] = available_tools
                request_kwargs["tool_choice"] = "auto"  # let the LLM decide
            response = self.client.chat.completions.create(**request_kwargs)
response_message = response.choices[0].message
if response_message.tool_calls:
# The LLM wants to call a tool
self._add_message("assistant", tool_calls=response_message.tool_calls)
# Return all tool calls, the agentic loop will process them.
# OpenAI's API often returns a single tool call for simple requests.
return "tool_call", response_message.tool_calls
elif response_message.content:
# The LLM generated a text response
self._add_message("assistant", content=response_message.content)
return "text_response", response_message.content
else:
return "text_response", "The LLM did not provide a clear response or tool call."
except openai.APIError as e:
print(f"LLM API Error: {e}")
return "error", f"OpenAI API Error: {e}"
except Exception as e:
print(f"LLM Unexpected Error: {e}")
return "error", f"An unexpected error occurred with the LLM: {e}"
def provide_tool_output(self, tool_call_id: str, output: str, tool_name: str):
"""
Adds the output of a tool execution to the conversation history
so the LLM can use it for subsequent steps or reflection.
"""
self._add_message("tool", content=output, tool_call_id=tool_call_id, name=tool_name)
# ==============================================================================
# 4. Agentic Loop
# ==============================================================================
class AgenticDesktopController:
"""
The main agent orchestrator, managing the LLM interaction and tool execution.
"""
def __init__(self, llm_agent: OpenAILLMAgent, tool_orchestrator: ToolOrchestrator):
self.llm_agent = llm_agent
self.tool_orchestrator = tool_orchestrator
self.available_tools_for_llm = self.tool_orchestrator.get_tool_definitions_for_llm()
def run_agent_step(self, user_input: str) -> str:
"""
Executes a single step of the agentic loop:
1. User Input -> 2. LLM Plan -> 3. Tool Act -> 4. Observe/Reflect
"""
print(f"\n--- User: {user_input} ---")
response_type, data = self.llm_agent.generate_response(user_input, self.available_tools_for_llm)
if response_type == "error":
return f"Agent Error: {data}"
elif response_type == "text_response":
print(f"Agent: {data}")
return data
elif response_type == "tool_call":
# The LLM returned one or more tool calls.
# We iterate through them and execute.
all_tool_results = []
for tool_call in data:
tool_id = tool_call.id
tool_name = tool_call.function.name
parameters = json.loads(tool_call.function.arguments) # Arguments are JSON string
print(f"Agent: Calling tool '{tool_name}' (ID: {tool_id}) with parameters: {parameters}...")
# Special handling for summarization: read file content, then ask LLM to summarize
if tool_name == "text_file_operations" and parameters.get("action") == "summarize_text_file":
file_path = parameters.get("path")
if not file_path:
tool_output = "Error: Path not provided for summarization."
else:
# First, read the file content using the tool
read_result = self.tool_orchestrator.execute_tool("text_file_operations", action="read_text_file", path=file_path)
if read_result.startswith("Error"):
tool_output = read_result
else:
                            # Ask the LLM to summarize this content via a separate,
                            # stateless API call: inserting a user message into the
                            # main history between the assistant's tool_calls message
                            # and its tool result would violate the API's required
                            # message ordering.
                            print(f"Agent: Asking LLM to summarize content from {file_path}...")
                            summary_prompt = f"Please summarize the following text:\n\n{read_result}"
                            try:
                                summary_resp = self.llm_agent.client.chat.completions.create(
                                    model=self.llm_agent.model_name,
                                    messages=[{"role": "user", "content": summary_prompt}]
                                )
                                tool_output = f"Summary of {file_path}: {summary_resp.choices[0].message.content}"
                            except Exception as e:
                                tool_output = f"Error during summarization: {e}"
else:
# Execute the tool normally
tool_output = self.tool_orchestrator.execute_tool(tool_name, **parameters)
print(f"Agent: Tool '{tool_name}' output:\n{tool_output}")
all_tool_results.append(tool_output)
# Provide the tool's output back to the LLM for context
self.llm_agent.provide_tool_output(tool_id, tool_output, tool_name)
# After executing tools, ask the LLM for a final natural language response
# based on the tool outputs.
print("Agent: All tools executed. Asking LLM for final response...")
final_response_type, final_data = self.llm_agent.generate_response("Based on the previous tool executions, what is the next step or final response to the user?", self.available_tools_for_llm)
if final_response_type == "text_response":
print(f"Agent: {final_data}")
return final_data
        elif final_response_type == "tool_call":
            # If the LLM suggests further tool calls, a full agent would loop
            # again. For this example we just report the requested tool names
            # (the tool-call objects themselves are not JSON-serializable).
            requested = ", ".join(tc.function.name for tc in final_data)
            return f"Agent: The LLM suggested further tool calls: {requested}"
else:
return f"Agent Error after tool execution: {final_data}"
return "Agent: No action taken." # Should ideally not be reached
# ==============================================================================
# Main Execution Block
# ==============================================================================
if __name__ == "__main__":
print("Initializing Agentic Desktop Controller for MacOS with OpenAI LLM...")
# Get OpenAI API key from environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
print("Error: OPENAI_API_KEY environment variable not set.")
print("Please set it before running the script: export OPENAI_API_KEY='your_api_key_here'")
exit(1)
# 1. Initialize Tools
macos_open_app_tool = MacOSOpenApplicationTool()
macos_window_tool = MacOSWindowManagementTool()
macos_file_tool = MacOSFileSystemTool()
macos_text_file_tool = MacOSTextFileTool() # For reading files
all_tools = [
macos_open_app_tool,
macos_window_tool,
macos_file_tool,
macos_text_file_tool
]
# 2. Initialize Tool Orchestrator
orchestrator = ToolOrchestrator(all_tools)
# 3. Initialize OpenAI LLM Agent
llm_agent = OpenAILLMAgent(api_key=openai_api_key)
# 4. Initialize Agentic Controller
agent_controller = AgenticDesktopController(llm_agent, orchestrator)
print("\nAgent initialized. Type your commands (e.g., 'Open Safari', 'List files in Documents').")
print("Type 'exit' to quit.")
# Main interaction loop
while True:
user_input = input("\nYour command: ")
if user_input.lower() == 'exit':
print("Exiting agent. Goodbye!")
break
        agent_controller.run_agent_step(user_input)
        time.sleep(1)  # give OS actions a moment to visually complete
-----------------------------------------------------------------------
How to Run This Example:
1. Install OpenAI library: Open your Terminal and run:
pip install openai
2. Set your OpenAI API Key: In your Terminal, before running the script, set the environment variable:
export OPENAI_API_KEY='your_openai_api_key_here'
(Replace `'your_openai_api_key_here'` with your actual API key from OpenAI).
3. Save the code: Save the entire code block above as a Python file, for example, `real_desktop_agent.py`.
4. Run the script: Navigate to the directory where you saved the file in your Terminal and run:
python3 real_desktop_agent.py
5. Interact with the agent: The agent will prompt you for commands. Try these:
* `Open Safari`
* `List files in my Desktop folder`
* `List all windows` (This will show you the window index for Safari if it's open)
* `Move Safari window to top-left` (This assumes Safari is open and is window 1)
* `Move Safari window to center`
* `Create a file named test.txt on my Desktop with content "Hello from the agent!"` (this simplified toolset has no create-file tool, so expect the agent to explain that it cannot)
* `Read the content of test.txt on my Desktop`
* `Summarize the file test.txt on my Desktop`
* `exit`
This running example demonstrates the fundamental flow using a real LLM: user input is sent to the OpenAI API, which then decides on a tool to call or generates a natural language response. The `ToolOrchestrator` executes the appropriate OS-specific tool, and the result is observed and reported back to the LLM for context, enabling multi-turn interactions and more complex reasoning.