Agent Runtime
This document describes how agents run in mutx.dev, including their lifecycle, monitoring, and self-healing capabilities.
What is active today
- `POST /agents/heartbeat` is the live runtime path for connected agents. It updates `agents.status` and `last_heartbeat` in the control plane.
- Each runtime heartbeat now emits an `agent.heartbeat` outgoing webhook event for subscribers.
- When a heartbeat changes the persisted agent status, MUTX also emits an `agent.status` outgoing webhook event.
- The background monitor in `src/api/services/monitoring.py` owns stale-agent detection, failure marking, alert resolution, and the active recovery loop.
- The background monitor now wires tracked agents into `SelfHealingService` in `src/api/services/self_healer.py` (heartbeat-based health checks + recovery handlers) so those paths are now connected to real runtime paths.
- Advanced self-healing actions such as rollback version trees, recreate, or scale-up/down are still aspirational until they are backed by real execution infrastructure.
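The heartbeat behavior listed above can be sketched in memory as follows. This is an illustration only, not the MUTX implementation: `AgentRecord`, `handle_heartbeat`, and the event payload shapes are hypothetical. Only the event names `agent.heartbeat` and `agent.status` and the fields `status` and `last_heartbeat` come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class AgentRecord:
    """Hypothetical stand-in for a persisted agent row."""
    status: str = "unknown"
    last_heartbeat: Optional[datetime] = None


def handle_heartbeat(
    agents: Dict[str, AgentRecord],
    agent_id: str,
    reported_status: str,
    now: Optional[datetime] = None,
) -> List[dict]:
    """Persist a heartbeat and return the webhook events it would emit."""
    now = now or datetime.now(timezone.utc)
    record = agents.setdefault(agent_id, AgentRecord())
    previous_status = record.status

    record.status = reported_status
    record.last_heartbeat = now

    # Every heartbeat produces an agent.heartbeat event.
    events = [{"event": "agent.heartbeat", "agent_id": agent_id, "at": now.isoformat()}]
    # agent.status is emitted only when the persisted status actually changed.
    if reported_status != previous_status:
        events.append({
            "event": "agent.status",
            "agent_id": agent_id,
            "from": previous_status,
            "to": reported_status,
        })
    return events


agents: Dict[str, AgentRecord] = {}
first = handle_heartbeat(agents, "agent-001", "running")
second = handle_heartbeat(agents, "agent-001", "running")
print([e["event"] for e in first])   # heartbeat plus a status change
print([e["event"] for e in second])  # heartbeat only
```

The second call emits a single event because the reported status matches what is already stored.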
Overview
The Agent Runtime (src/api/services/agent_runtime.py:98) is the core execution engine that manages agent lifecycles, tool routing, and resource allocation.
┌──────────────────────────────────────────────────────────────────────────┐
│                        Agent Runtime Architecture                        │
│                                                                          │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │                            RuntimeManager                            │ │
│ │ ┌──────────────────────────────────────────────────────────────────┐ │ │
│ │ │                           AgentRuntime                           │ │ │
│ │ │                                                                  │ │ │
│ │ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────────────┐ │ │ │
│ │ │ │ RuntimeConfig  │ │ RuntimeState   │ │ ToolExecutionHandler   │ │ │ │
│ │ │ │ - timeout      │ │ - status       │ │ - register_handler     │ │ │ │
│ │ │ │ - max_agents   │ │ - metrics      │ │ - execute_tool         │ │ │ │
│ │ │ │ - retries      │ │ - active       │ │                        │ │ │ │
│ │ │ └────────────────┘ └────────────────┘ └────────────────────────┘ │ │ │
│ │ │                                                                  │ │ │
│ │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │                        Agent Registry                        │ │ │ │
│ │ │ │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  │ │ │ │
│ │ │ │  │  Agent 1  │  │  Agent 2  │  │  Agent 3  │  │  Agent N  │  │ │ │ │
│ │ │ │  │(LangChain)│  │(OpenClaw) │  │   (n8n)   │  │           │  │ │ │ │
│ │ │ │  └───────────┘  └───────────┘  └───────────┘  └───────────┘  │ │ │ │
│ │ │ └──────────────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
Agent Lifecycle
Lifecycle States
  ┌─────────┐
  │ CREATED │
  └────┬────┘
       │ initialize()
       ▼
┌──────────────┐
│ INITIALIZING │──── Exception ───▶ ┌─────────┐
└──────┬───────┘                    │  ERROR  │
       │                            └────┬────┘
       │ success                         │
       ▼                                 │ reset()
┌──────────────┐                         │
│    READY     │◀────────────────────────┘
└──────┬───────┘
       │ execute()
       ▼
┌──────────────┐
│   RUNNING    │──── Complete ───▶ ┌─────────┐
└──────┬───────┘                   │  READY  │
       │                           └─────────┘
       │ Error/Timeout
       ▼
┌──────────────┐
│    ERROR     │
└──────────────┘
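The lifecycle diagram can be read as a small state machine. The sketch below encodes only the edges shown in the diagram. The names `AgentStatus`, `TRANSITIONS`, and `transition` are hypothetical and may not match what `agent_runtime.py` actually defines.

```python
from enum import Enum


class AgentStatus(Enum):
    CREATED = "created"
    INITIALIZING = "initializing"
    READY = "ready"
    RUNNING = "running"
    ERROR = "error"


# Allowed transitions, read off the lifecycle diagram.
TRANSITIONS = {
    AgentStatus.CREATED: {AgentStatus.INITIALIZING},                   # initialize()
    AgentStatus.INITIALIZING: {AgentStatus.READY, AgentStatus.ERROR},  # success / exception
    AgentStatus.READY: {AgentStatus.RUNNING},                          # execute()
    AgentStatus.RUNNING: {AgentStatus.READY, AgentStatus.ERROR},       # complete / error or timeout
    AgentStatus.ERROR: {AgentStatus.READY},                            # reset()
}


def transition(current: AgentStatus, target: AgentStatus) -> AgentStatus:
    """Return the new state, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


# Walk one full path: create, initialize, run, fail, reset.
state = AgentStatus.CREATED
for nxt in (
    AgentStatus.INITIALIZING,
    AgentStatus.READY,
    AgentStatus.RUNNING,
    AgentStatus.ERROR,
    AgentStatus.READY,
):
    state = transition(state, nxt)
print(state.name)
```

Keeping the edges in one table makes illegal jumps, such as CREATED straight to RUNNING, fail loudly instead of silently corrupting state.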
Creating an Agent
# From agent_runtime.py:189
def create_agent(
    self,
    name: str,
    provider: str,
    model: str,
    system_prompt: Optional[str] = None,
    tools: Optional[List[ToolDefinition]] = None,
    vector_store_name: Optional[str] = None,
    **kwargs,
) -> LangChainAgent:
    provider_enum = LLMProvider(provider.lower())
    config = AgentConfig(
        name=name,
        provider=provider_enum,
        model=model,
        system_prompt=system_prompt,
        tools=tools or [],
        vector_store_name=vector_store_name,
        **kwargs,
    )
    agent = AgentRegistry.create_agent(config)
    self.state.active_agents += 1
    return agent
Execution Modes
| Mode | Method | Use Case |
|---|---|---|
| Async | `execute_agent()` | Non-blocking, high throughput |
| Sync | `execute_agent_sync()` | Simple scripts, CLI tools |
| Streaming | `execute_agent_stream()` | Real-time output, chat UIs |
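To make the difference between the three modes concrete, here is a minimal sketch with a stub in place of a real agent. The functions `execute_async`, `execute_sync`, and `execute_stream` are hypothetical stand-ins and do not reproduce the signatures of `execute_agent()`, `execute_agent_sync()`, or `execute_agent_stream()`.

```python
import asyncio
from typing import AsyncIterator, List


async def execute_async(prompt: str) -> str:
    """Async mode: awaitable, so many executions can overlap on one event loop."""
    await asyncio.sleep(0)  # stands in for a non-blocking LLM call
    return f"echo: {prompt}"


def execute_sync(prompt: str) -> str:
    """Sync mode: block the caller until the async call has finished."""
    return asyncio.run(execute_async(prompt))


async def execute_stream(prompt: str) -> AsyncIterator[str]:
    """Streaming mode: yield output piece by piece as it is produced."""
    for word in f"echo: {prompt}".split():
        await asyncio.sleep(0)
        yield word


async def collect(prompt: str) -> List[str]:
    """Gather all streamed chunks into a list (a chat UI would render them instead)."""
    return [chunk async for chunk in execute_stream(prompt)]


print(execute_sync("hello"))
print(asyncio.run(collect("hello world")))
```

The sync wrapper is only safe outside a running event loop, which is why it suits scripts and CLI tools rather than code that is already asynchronous.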
Agent Types
1. LangChain Agent
From src/api/integrations/langchain_agent.py:
class LangChainAgent:
    def __init__(self, config: AgentConfig):
        self.llm = LLMWrapper.create(config)
        self.memory_manager = ConversationMemoryManager(config.memory_type)
        self.tools = self._initialize_tools()
        self.agent_executor = None
Features:
- Multiple LLM providers (OpenAI, Anthropic, Ollama)
- Tool-augmented execution
- Conversation memory
- Streaming support
2. OpenClaw Agent
Multi-agent orchestration framework for complex workflows.
3. n8n Agent
Workflow automation with visual builder integration.
Tool Execution
Tool Handler Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                             Tool Execution Flow                             │
│                                                                             │
│ ┌──────────┐     ┌─────────────────┐     ┌────────────────────────────────┐ │
│ │  Agent   │────▶│  ToolExecution  │────▶│         Tool Registry          │ │
│ │ Request  │     │     Handler     │     │                                │ │
│ └──────────┘     └────────┬────────┘     │ ┌────────────────────────────┐ │ │
│                           │              │ │ search_documents (RAG)     │ │ │
│                           │              │ │ get_time                   │ │ │
│                           ▼              │ │ calculator                 │ │ │
│                  ┌─────────────────┐     │ │ [custom tools...]          │ │ │
│                  │ Validate Input  │     │ └────────────────────────────┘ │ │
│                  └────────┬────────┘     └────────────────────────────────┘ │
│                           │                                                 │
│                           ▼                                                 │
│                  ┌─────────────────┐     ┌────────────────────────────────┐ │
│                  │  Execute Tool   │────▶│      Return/Stream Result      │ │
│                  │  (Async/Sync)   │     │                                │ │
│                  └─────────────────┘     └────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
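The flow in the diagram (look up the tool, validate input, execute, return the result) can be sketched as below. `SketchToolHandler` is a hypothetical stand-in written for illustration. It borrows the method names `register_handler` and `execute_tool` from the architecture diagram, but the real `ToolExecutionHandler` may validate and dispatch differently.

```python
import asyncio
import inspect
from typing import Any, Callable, Dict


class SketchToolHandler:
    """Hypothetical, simplified stand-in for ToolExecutionHandler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Dict[str, Any]], Any]] = {}

    def register_handler(self, name: str, handler: Callable[[Dict[str, Any]], Any]) -> None:
        """Add a tool to the registry under the given name."""
        self._handlers[name] = handler

    async def execute_tool(self, name: str, params: Dict[str, Any]) -> Any:
        # Validate input: the tool must be registered and params must be a dict.
        if name not in self._handlers:
            raise KeyError(f"unknown tool: {name}")
        if not isinstance(params, dict):
            raise TypeError("params must be a dict")
        # Execute: accept both plain functions and coroutine functions.
        result = self._handlers[name](params)
        if inspect.isawaitable(result):
            result = await result
        return result


async def shout(params: Dict[str, Any]) -> str:
    """Example async tool."""
    return params["text"].upper()


handler = SketchToolHandler()
handler.register_handler("add", lambda p: p["a"] + p["b"])  # sync tool
handler.register_handler("shout", shout)                    # async tool

print(asyncio.run(handler.execute_tool("add", {"a": 2, "b": 3})))
print(asyncio.run(handler.execute_tool("shout", {"text": "hi"})))
```

Checking `inspect.isawaitable` on the result lets one code path serve the "Async/Sync" box in the diagram.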
Built-in Tools
| Tool | Description | Example |
|---|---|---|
| `search_documents` | Semantic search via vector store | `query="deployment guide"` |
| `get_time` | Current timestamp | `get_time()` |
| `calculator` | Safe math evaluation | `calculator(expression="2+2")` |
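The table describes `calculator` as safe math evaluation. One common way to get that property is to parse the expression with Python's `ast` module and evaluate only a whitelist of arithmetic nodes, never calling `eval()`. The sketch below shows that technique under the assumption that only `+`, `-`, `*`, and `/` on numeric literals are needed. It is not necessarily how the built-in tool is implemented.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

# Whitelisted operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculator(expression: str) -> Number:
    """Evaluate +, -, *, / on numeric literals without calling eval()."""

    def _eval(node: ast.AST) -> Number:
        if isinstance(node, ast.Constant):
            # Accept int and float literals only (bool is a subclass of int, so exclude it).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


print(calculator("2+2"))  # 4
```

Function calls, attribute access, and names all fall through to the `ValueError`, so an input such as `__import__('os')` is rejected rather than executed.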
Custom Tool Registration
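from src.api.services.agent_runtime import AgentRuntime

# `config` is a RuntimeConfig instance (see Runtime Configuration)
runtime = AgentRuntime(config)

# Define the handler first: Python does not allow an inline `async def` as a call argument
async def my_tool_handler(params):
    # Custom logic
    return {"result": "..."}

runtime.tool_handler.register_handler("my_tool", my_tool_handler)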
Monitoring
The MonitoringService (src/api/services/monitoring.py:363) collects request and system metrics, runs health checks, manages alerts, and tracks uptime.
Metrics Collection
┌────────────────────────────────────────────────────────────────────────┐
│                    Monitoring Service Architecture                     │
│                                                                        │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │                         MonitoringService                          │ │
│ │                                                                    │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────────┐ │ │
│ │ │ MetricsCollector │ │ HealthChecker    │ │ AlertManager         │ │ │
│ │ │                  │ │                  │ │                      │ │ │
│ │ │ - request_count  │ │ - health_checks  │ │ - create_alert       │ │ │
│ │ │ - error_count    │ │ - retry logic    │ │ - severity levels    │ │ │
│ │ │ - latency (p95)  │ │ - status types   │ │ - callbacks          │ │ │
│ │ │ - success_rate   │ │                  │ │                      │ │ │
│ │ └──────────────────┘ └──────────────────┘ └──────────────────────┘ │ │
│ │                                                                    │ │
│ │ ┌──────────────────┐ ┌───────────────────────────────────────────┐ │ │
│ │ │ UptimeTracker    │ │ SystemMetrics                             │ │ │
│ │ │                  │ │                                           │ │ │
│ │ │ - start/stop     │ │ - cpu_usage     - memory_usage            │ │ │
│ │ │ - uptime_pct     │ │ - disk_usage    - network_io              │ │ │
│ │ │ - downtime       │ │                                           │ │ │
│ │ └──────────────────┘ └───────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
Health Status Levels
| Status | Condition | Action |
|---|---|---|
| HEALTHY | All checks pass | Normal operation |
| DEGRADED | Performance below threshold | Log warning |
| UNHEALTHY | Health check failed | Trigger recovery |
| UNKNOWN | No health data | Skip monitoring |
Alert Severity
| Level | Threshold | Example |
|---|---|---|
| INFO | - | Agent registered |
| WARNING | Error rate > 10% | High latency detected |
| ERROR | Error rate > 25% | Agent unhealthy |
| CRITICAL | Error rate > 50% | System failure |
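The error-rate thresholds in the table translate directly into a small classifier. The sketch below is an assumption-laden illustration: `AlertSeverity` and `severity_for_error_rate` are hypothetical names, and treating a window with no requests as INFO is a choice made here, not something the table specifies.

```python
from enum import Enum


class AlertSeverity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


def severity_for_error_rate(error_count: int, request_count: int) -> AlertSeverity:
    """Map an error rate onto the severity thresholds from the table."""
    if request_count <= 0:
        # Assumption: no traffic means there is nothing to alert on.
        return AlertSeverity.INFO
    rate = error_count / request_count
    # Check the strictest threshold first so the highest matching level wins.
    if rate > 0.50:
        return AlertSeverity.CRITICAL
    if rate > 0.25:
        return AlertSeverity.ERROR
    if rate > 0.10:
        return AlertSeverity.WARNING
    return AlertSeverity.INFO


print(severity_for_error_rate(30, 100).name)  # ERROR
```

The comparisons are strict, matching the `>` in the table, so an error rate of exactly 10% still maps to INFO.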
Metrics Collected
| Metric | Type | Description |
|---|---|---|
| `request_count` | Counter | Total requests processed |
| `error_count` | Counter | Failed requests |
| `avg_latency_ms` | Gauge | Average response time |
| `p95_latency_ms` | Gauge | 95th percentile latency |
| `p99_latency_ms` | Gauge | 99th percentile latency |
| `cpu_usage` | Gauge | System CPU percentage |
| `memory_usage` | Gauge | System memory percentage |
| `uptime_percentage` | Gauge | Agent uptime ratio |
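As an illustration of how the request and latency metrics relate to raw samples, the sketch below derives them from a list of latencies. It uses the nearest-rank percentile definition, which is an assumption: the document does not say which percentile method `MetricsCollector` uses, and `percentile` and `summarize` are hypothetical helpers.

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms: List[float], error_count: int) -> Dict[str, float]:
    """Compute the request and latency metrics from one window of samples."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    request_count = len(ordered)
    return {
        "request_count": request_count,
        "error_count": error_count,
        "avg_latency_ms": sum(ordered) / request_count,
        "p95_latency_ms": percentile(ordered, 95),
        "p99_latency_ms": percentile(ordered, 99),
        "success_rate": (request_count - error_count) / request_count,
    }


# 100 samples of 1 ms .. 100 ms, 4 of which were errors.
stats = summarize([float(ms) for ms in range(1, 101)], error_count=4)
print(stats)
```

The percentiles sit far above the average whenever a few slow requests dominate, which is why the table tracks p95 and p99 separately from `avg_latency_ms`.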
Self-Healing
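The SelfHealingService (src/api/services/self_healer.py:491) runs heartbeat-based health checks and triggers recovery handlers for tracked agents. As noted under "What is active today", rollback, recreate, and scale-up/down actions are still aspirational until they are backed by real execution infrastructure.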
Self-Healing Architecture
┌────────────────────────────────────────────────────────────────────────┐
│                   Self-Healing Service Architecture                    │
│                                                                        │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │                         SelfHealingService                         │ │
│ │                                                                    │ │
│ │ ┌──────────────────────┐ ┌───────────────────────────────────────┐ │ │
│ │ │ HealthCheckScheduler │ │ RecoveryExecutor                      │ │ │
│ │ │                      │ │                                       │ │ │
│ │ │ - check_interval     │ │ - RecoveryAction.ROLLBACK             │ │ │
│ │ │ - timeout            │ │ - RecoveryAction.RESTART              │ │ │
│ │ │ - max_retries        │ │ - RecoveryAction.RECREATE             │ │ │
│ │ │ - agent_health       │ │ - RecoveryAction.SCALE_UP             │ │ │
│ │ └──────────────────────┘ └───────────────────────────────────────┘ │ │
│ │                                                                    │ │
│ │ ┌──────────────────────┐ ┌───────────────────────────────────────┐ │ │
│ │ │ VersionManager       │ │ RecoveryTimeTracker                   │ │ │
│ │ │                      │ │                                       │ │ │
│ │ │ - record_version     │ │ - record_recovery_time                │ │ │
│ │ │ - mark_stable        │ │ - get_average_recovery_time           │ │ │
│ │ │ - get_history        │ │ - recovery_stats                      │ │ │
│ │ └──────────────────────┘ └───────────────────────────────────────┘ │ │
│ │                                                                    │ │
│ │ ┌────────────────────────────────────────────────────────────────┐ │ │
│ │ │              Recovery History (deque maxlen=1000)              │ │ │
│ │ └────────────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
Recovery Actions
| Action | Trigger | Description |
|---|---|---|
| RESTART | 3 consecutive failures | Restart agent process |
| ROLLBACK | After failed restart | Revert to stable version |
| RECREATE | Persistent failure | Destroy and recreate agent |
| SCALE_UP | High load | Add more agent instances |
| SCALE_DOWN | Low load | Reduce resource usage |
Health Check Configuration
from dataclasses import dataclass


@dataclass
class RecoveryConfig:
    max_retries: int = 3                          # Max recovery attempts
    retry_delay_seconds: float = 5.0              # Delay between retries
    health_check_interval_seconds: int = 10       # Check frequency
    health_check_timeout_seconds: float = 5.0     # Timeout per check
    max_consecutive_failures: int = 3             # Failures before recovery
    rollback_on_failure: bool = True              # Auto-rollback enabled
    enable_auto_restart: bool = True              # Auto-restart enabled
    min_recovery_interval_seconds: float = 60.0   # Min time between recoveries
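Two of these settings interact: `max_consecutive_failures` decides when recovery is due, and `min_recovery_interval_seconds` keeps recoveries from firing back to back. The sketch below shows one plausible reading of that interaction. `FailureTracker` is a hypothetical helper, and whether the real service resets its failure counter after triggering recovery is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    """Hypothetical helper; the two settings mirror RecoveryConfig fields."""
    max_consecutive_failures: int = 3
    min_recovery_interval_seconds: float = 60.0
    consecutive_failures: int = 0
    last_recovery_at: Optional[float] = None

    def record_check(self, healthy: bool, now: float) -> bool:
        """Record one health check; return True if recovery should start now."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.max_consecutive_failures:
            return False
        # Rate limit: skip if the previous recovery happened too recently.
        if (
            self.last_recovery_at is not None
            and now - self.last_recovery_at < self.min_recovery_interval_seconds
        ):
            return False
        self.last_recovery_at = now
        self.consecutive_failures = 0  # assumption: start counting afresh after recovery
        return True


tracker = FailureTracker()
# Failed checks every 10 seconds (timestamps in seconds).
decisions = [tracker.record_check(False, float(t)) for t in (0, 10, 20, 30, 40, 50, 80)]
print(decisions)
```

In this run the third failure triggers recovery at t=20, the failures at t=30 to t=50 are suppressed by the 60-second interval, and recovery fires again at t=80.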
Recovery Flow
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Health    │────▶│    Check     │────▶│ Consecutive  │────▶│   Trigger    │
│    Check     │     │    Result    │     │   Failures   │     │   Recovery   │
│    (30s)     │     │    (FAIL)    │     │     >= 3     │     │              │
└──────────────┘     └──────────────┘     └──────────────┘     └──────┬───────┘
                                                                      │
       ┌──────────────────────────────────────────────────────────────┘
       │
       ▼
┌──────────────┐     ┌──────────────┐      ┌──────────────┐     ┌──────────────┐
│   Execute    │────▶│   Success?   │──No─▶│   Rollback   │────▶│     Mark     │
│   Recovery   │     │              │      │  to Stable   │     │    Stable    │
│  (RESTART)   │     └──────┬───────┘      └──────────────┘     │   Version    │
└──────────────┘            │                                   └──────────────┘
                            │ Yes
                            ▼
                     ┌──────────────┐
                     │    Record    │
                     │   Recovery   │
                     │     Time     │
                     └──────────────┘
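The lower half of the flow (try RESTART, fall back to ROLLBACK, record how long it took) can be sketched as a single function. `run_recovery` and its callable parameters are hypothetical. The real recovery handlers live in `self_healer.py` and, per the note at the top of this document, rollback is not yet backed by real execution infrastructure.

```python
import time
from typing import Callable, Tuple


def run_recovery(
    restart: Callable[[], bool],
    rollback: Callable[[], bool],
    rollback_on_failure: bool = True,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, float]:
    """Try RESTART, fall back to ROLLBACK if it fails. Returns (outcome, elapsed seconds)."""
    started = clock()
    if restart():
        return "restarted", clock() - started
    # Restart failed: roll back to the last stable version if that is enabled.
    if rollback_on_failure and rollback():
        return "rolled_back", clock() - started
    return "failed", clock() - started


# Stub handlers: the restart fails, the rollback succeeds.
outcome, seconds = run_recovery(restart=lambda: False, rollback=lambda: True)
print(outcome)
```

Passing the handlers in as callables keeps the control flow testable without touching real agent processes.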
Recovery Time Tracking
The service tracks recovery metrics:
{
"agent_id": "agent-001",
"total_recoveries": 5,
"average_recovery_time_seconds": 2.3,
"min_recovery_time_seconds": 1.1,
"max_recovery_time_seconds": 4.8,
"last_recovery_time_seconds": 2.1
}
Target: Recovery time < 5 seconds
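The metrics above can be derived from a bounded history of recovery durations. The sketch below assumes a `deque` capped at 1000 entries, as in the architecture diagram. `recovery_stats` and the sample durations are illustrative, not taken from the service.

```python
from collections import deque
from statistics import mean
from typing import Deque, Dict, Union

TARGET_SECONDS = 5.0  # the recovery-time target stated above


def recovery_stats(agent_id: str, times: Deque[float]) -> Dict[str, Union[str, int, float]]:
    """Summarize recorded recovery durations for one agent."""
    if not times:
        raise ValueError("no recoveries recorded")
    return {
        "agent_id": agent_id,
        "total_recoveries": len(times),
        "average_recovery_time_seconds": round(mean(times), 2),
        "min_recovery_time_seconds": min(times),
        "max_recovery_time_seconds": max(times),
        "last_recovery_time_seconds": times[-1],
    }


# Bounded history: once full, the oldest entries are dropped automatically.
history: Deque[float] = deque(maxlen=1000)
for duration in (1.1, 4.8, 1.5, 2.0, 2.1):  # illustrative sample durations
    history.append(duration)

stats = recovery_stats("agent-001", history)
meets_target = stats["max_recovery_time_seconds"] < TARGET_SECONDS
print(stats, meets_target)
```

Comparing the maximum rather than the average against the target is the stricter check: every recorded recovery must finish in under 5 seconds.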
Configuration
Runtime Configuration
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuntimeConfig:
    max_concurrent_agents: int = 10       # Max agents per runtime
    default_timeout: int = 300            # Execution timeout (seconds)
    enable_streaming: bool = True         # Enable streaming responses
    max_retries: int = 3                  # Retry attempts on failure
    retry_delay: float = 1.0              # Delay between retries (seconds)
    vector_store_enabled: bool = True     # Enable RAG
    database_url: Optional[str] = None    # Database connection
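To show how `default_timeout`, `max_retries`, and `retry_delay` might combine during execution, here is a minimal sketch. It assumes `max_retries` counts retries after the first attempt and that the timeout applies per attempt. Neither is confirmed by the source, and `run_with_retries` is a hypothetical helper.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_retries(
    operation: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    retry_delay: float = 1.0,
    timeout: float = 300,
) -> T:
    """Run operation with a per-attempt timeout, retrying on failure."""
    last_error: Exception = RuntimeError("operation was never attempted")
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except Exception as exc:  # includes asyncio.TimeoutError
            last_error = exc
            if attempt < max_retries:
                await asyncio.sleep(retry_delay)
    raise last_error


calls = {"n": 0}


async def flaky() -> str:
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = asyncio.run(run_with_retries(flaky, max_retries=3, retry_delay=0))
print(result, calls["n"])
```

A fixed delay matches the single `retry_delay` field. Exponential backoff would need an extra setting that the config does not define.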
Example Usage
import asyncio

from src.api.services.agent_runtime import (
    AgentRuntime,
    RuntimeConfig,
    RuntimeManager,
)


async def main() -> None:
    # Create runtime
    config = RuntimeConfig(
        max_concurrent_agents=5,
        default_timeout=600,
    )
    runtime = RuntimeManager.create_runtime(config)

    # Start runtime
    await runtime.start()

    # Create and execute agent
    agent = runtime.create_agent(
        name="my-agent",
        provider="openai",
        model="gpt-4",
        system_prompt="You are a helpful assistant.",
    )
    result = await runtime.execute_agent(
        agent_id=agent.agent_id,
        input_text="Hello, world!",
    )

    # Get stats
    stats = runtime.get_stats()
    # {
    #     "runtime_id": "...",
    #     "status": "running",
    #     "active_agents": 1,
    #     "total_executions": 1,
    #     "failed_executions": 0,
    #     "success_rate": 1.0
    # }

    # Stop runtime
    await runtime.stop()


# `await` is only valid inside a coroutine, so run the example through asyncio
asyncio.run(main())
