Benchmark & Baseline Tracking

HART OS tracks performance across multiple dimensions — agent quality, model latency, world model health, coding tool efficiency, and HevolveAI research benchmarks. Every upgrade is gated on benchmark comparison. Every agent snapshot is versioned for regression detection.

Architecture

                    ┌─────────────────────────┐
                    │    BenchmarkRegistry    │
                    │  (7 built-in adapters)  │
                    └────────────┬────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
    Fast Tier               Heavy Tier             Dynamic Tier
    ┌───────────┐          ┌───────────┐          ┌────────────────┐
    │ ModelReg  │          │ QuantiPhy │          │ Git-installed  │
    │ WorldMdl  │          │ Embodied  │          │    adapters    │
    │ Regress   │          │ QwenEnc   │          │   (runtime)    │
    │ Guardrail │          └───────────┘          └────────────────┘
    └───────────┘          (GPU required)

         ┌─────────────────────────┐
         │  AgentBaselineService   │
         │  (per-agent snapshots)  │
         └─────────┬───────────────┘
                   │
         ┌─────────┼──────────┐
         │         │          │
    Recipe      Lightning   Trust/
    Metrics     Metrics     Evolution
    (actions,   (reward,    (score,
     duration,   trend,      generation,
     success)    errors)     specialization)

         ┌────────────────────────┐
         │  CodingBenchmarkTracker│
         │  (SQLite per-task)     │
         └────────────────────────┘
         (task_type, tool, model, time, success)

Benchmark Registry

File: integrations/agent_engine/benchmark_registry.py

Built-in Adapters

Adapter                    Tier   Metrics                                           Source
ModelRegistryAdapter       fast   Per-model latency (ms), accuracy (score)          ModelRegistry
WorldModelAdapter          fast   Flush rate, correction density, hivemind queries  WorldModelBridge
RegressionAdapter          fast   Test pass rate (excluding nested_task)            pytest
GuardrailAdapter           fast   Guardrail integrity verified (bool)               hive_guardrails
QuantiPhyAdapter           heavy  Physics reasoning benchmark                       HevolveAI (GPU, 4GB VRAM)
EmbodiedValidationAdapter  heavy  Performance, forgetting, memory benchmarks        HevolveAI (GPU, 2GB VRAM)
QwenEncoderAdapter         heavy  Encoder throughput (tokens/sec)                   HevolveAI (GPU, 2GB VRAM)

HevolveAI Integration

The three heavy-tier adapters import directly from the hevolveai sibling package:

# Conditional import — gracefully skips if hevolveai not installed
from hevolveai.tests.benchmarks.quantiphy_benchmark import QuantiPhyBenchmark
from hevolveai.embodied_ai.validation.benchmark import (
    PerformanceBenchmark, ForgettingBenchmark, MemoryBenchmark)
from hevolveai.embodied_ai.models.qwen_benchmark import benchmark_llamacpp

Uses importlib.util.find_spec('hevolveai') to check availability at runtime.
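
A minimal sketch of that guard (illustrative; the registry's actual wiring may differ):

import importlib.util

# Heavy-tier adapters register only when the sibling package is present.
HEVOLVEAI_AVAILABLE = importlib.util.find_spec('hevolveai') is not None

if HEVOLVEAI_AVAILABLE:
    from hevolveai.tests.benchmarks.quantiphy_benchmark import QuantiPhyBenchmark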

Snapshot Storage

Snapshots stored at agent_data/benchmarks/{version}.json:

{
  "version": "v2.1.0",
  "git_sha": "abc123",
  "captured_at": "2026-02-24T12:00:00",
  "tier": "fast",
  "metrics": {
    "gpt-4o_latency_ms": {"value": 850, "direction": "lower", "unit": "ms"},
    "gpt-4o_accuracy": {"value": 0.92, "direction": "higher", "unit": "score"},
    "flush_rate": {"value": 0.87, "direction": "higher", "unit": "ratio"},
    "test_pass_rate": {"value": 0.98, "direction": "higher", "unit": "ratio"},
    "guardrail_integrity": {"value": 1.0, "direction": "higher", "unit": "bool"}
  }
}
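
Because snapshots are plain JSON at a predictable path, they are easy to inspect from scripts. A hypothetical loader for the layout above (load_snapshot is not part of the registry API):

import json
from pathlib import Path

def load_snapshot(version: str) -> dict:
    """Read agent_data/benchmarks/{version}.json into a dict."""
    path = Path('agent_data') / 'benchmarks' / f'{version}.json'
    return json.loads(path.read_text())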

Upgrade Safety Check

registry.is_upgrade_safe(old_version, new_version) -> (bool, reason)

Compares all fast-tier metrics between versions. Any metric that degrades by more than 5% blocks the upgrade (see the sketch after this list):

  • "higher" metrics (accuracy, pass rate): new < old * 0.95 = regression
  • "lower" metrics (latency): new > old * 1.05 = regression

Dynamic Benchmark Installation

registry.discover_and_install(
    repo_url='https://github.com/org/benchmark',
    name='custom_benchmark',
    requires_gpu=True,
    min_vram_gb=4.0
)

Installs to ~/.hevolve/benchmarks/ via git clone + pip install, so coding agents on regional compute-heavy nodes can install new benchmarks at runtime.
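
The mechanics reduce to a clone followed by an install; a simplified sketch (install_benchmark is hypothetical, and the real method presumably also records the requires_gpu/min_vram_gb constraints):

import subprocess
from pathlib import Path

def install_benchmark(repo_url: str, name: str) -> Path:
    """Clone a benchmark repo under ~/.hevolve/benchmarks/ and pip-install it."""
    dest = Path.home() / '.hevolve' / 'benchmarks' / name
    subprocess.run(['git', 'clone', repo_url, str(dest)], check=True)
    subprocess.run(['pip', 'install', '-e', str(dest)], check=True)
    return dest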

Core Methods

registry.capture_snapshot(version, git_sha, tier='fast')  # 'fast' | 'heavy' | 'all'
registry.is_upgrade_safe(old_version, new_version)         # (bool, reason)
registry.get_latest_results()                              # For federation delta
registry.list_benchmarks()                                 # All registered adapters
registry.register_benchmark(adapter)                       # Custom adapter
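
Put together, a typical upgrade gate looks something like this (version strings and SHA are placeholders; signatures as listed above):

registry.capture_snapshot('v2.2.0', git_sha='def456', tier='fast')
safe, reason = registry.is_upgrade_safe('v2.1.0', 'v2.2.0')
if not safe:
    raise RuntimeError(f'Upgrade blocked: {reason}')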

Agent Baseline Service

File: integrations/agent_engine/agent_baseline_service.py (665 lines)

Purpose

Captures unified performance snapshots of agents at creation time and whenever recipe, prompt, or intelligence changes. Enables per-agent regression detection.

Snapshot Structure

Stored at agent_data/baselines/{prompt_id}_{flow_id}/v{N}.json:

{
  "version": 3,
  "trigger": "recipe_change",
  "timestamp": "2026-02-24T12:00:00",
  "user_id": "user_123",
  "recipe_metrics": {
    "action_count": 5,
    "total_expected_duration": 120,
    "success_rates": {"action_1": 0.95, "action_2": 0.88},
    "dead_ends": 1,
    "effective_fallbacks": 2
  },
  "lightning_metrics": {
    "avg_reward": 0.82,
    "total_reward": 41.0,
    "reward_trend": "improving",
    "execution_count": 50,
    "error_rate": 0.04,
    "avg_duration": 2.3
  },
  "benchmark_metrics": {
    "test_pass_rate": 0.98,
    "model_registry_accuracy": 0.91
  },
  "trust_evolution": {
    "composite_trust_score": 0.87,
    "generation": 3,
    "specialization_path": "coding/python",
    "evolution_xp": 1250
  }
}

Triggers

Trigger              When
creation             Agent first created
recipe_change        Recipe file modified (debounced: skip if <60s since creation)
prompt_change        Prompt definition updated
intelligence_change  World model stats shift detected by AgentDaemon
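
The recipe_change debounce is a single timestamp comparison; a sketch of the 60-second window from the table (should_capture is illustrative, created_at assumed to be epoch seconds):

import time

def should_capture(created_at: float, window_s: float = 60.0) -> bool:
    """Skip recipe_change snapshots taken <60s after agent creation."""
    return (time.time() - created_at) >= window_s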

Core Methods

AgentBaselineService.capture_snapshot(prompt_id, flow_id, trigger, user_id, user_prompt)
AgentBaselineService.validate_against_baseline(prompt_id, flow_id)  # CI/CD gate
AgentBaselineService.compare_snapshots(prompt_id, flow_id, old_v, new_v)  # Delta
AgentBaselineService.compute_trend(prompt_id, flow_id)  # improving/declining/stable
AgentBaselineService.get_latest_snapshot(prompt_id, flow_id)
AgentBaselineService.list_snapshots(prompt_id, flow_id)

Regression Detection

validate_against_baseline() checks (usage sketch below):

  • Recipe success rates per action: regression if <95% of baseline
  • Benchmark pass rate: regression if <95% of baseline
  • Returns: {passed: bool, regressions: [list], baseline_version: int}
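
A hypothetical CI step consuming that result shape (field names from the list above):

result = AgentBaselineService.validate_against_baseline(prompt_id, flow_id)
if result and not result['passed']:
    for regression in result['regressions']:
        print(f"regression vs baseline v{result['baseline_version']}: {regression}")
    raise SystemExit(1)  # fail the pipeline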

Trend Analysis

compute_trend() analyzes reward and duration trends across all snapshots:

{
  "trend": "improving",
  "snapshot_count": 7,
  "reward_trend": "improving",
  "duration_trend": "stable"
}

Coding Benchmark Tracker

File: integrations/coding_agent/benchmark_tracker.py (249 lines)

SQLite-backed performance tracking for coding agent tools, tasks, and models.

Database: agent_data/coding_benchmarks.db

Tables

benchmarks — Individual task execution records:

Column             Type     Purpose
task_type          TEXT     e.g., "code_review", "bug_fix"
tool_name          TEXT     e.g., "pylint", "ruff"
model_name         TEXT     e.g., "gpt-4o", "llama-3"
user_id            TEXT     Requesting user
completion_time_s  REAL     Execution duration
success            INTEGER  1=pass, 0=fail
offloaded          INTEGER  1=hive task
timestamp          TEXT     ISO-8601

hive_routing — Aggregated best-tool routing from hive peers.

Core Methods

tracker.record(task_type, tool_name, completion_time_s, success, model_name, user_id, offloaded)
tracker.get_best_tool(task_type)        # Local benchmark-based routing
tracker.get_hive_best_tool(task_type)   # Peer-aggregated routing
tracker.get_summary()                   # Dashboard data
tracker.export_learning_delta()         # Compact delta for hive
tracker.import_hive_delta(aggregated)   # Consume peer benchmarks

Min samples: 5 records required before a tool is considered "benchmarked".
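
Conceptually, get_best_tool() reduces to an aggregate query over the benchmarks table; an illustrative version (the real implementation may weight success rate and latency differently):

import sqlite3

def get_best_tool(db_path: str, task_type: str, min_samples: int = 5):
    """Highest pass rate, then fastest, among tools with enough samples."""
    con = sqlite3.connect(db_path)
    row = con.execute(
        """SELECT tool_name, AVG(success) AS pass_rate, AVG(completion_time_s) AS avg_s
           FROM benchmarks WHERE task_type = ?
           GROUP BY tool_name HAVING COUNT(*) >= ?
           ORDER BY pass_rate DESC, avg_s ASC LIMIT 1""",
        (task_type, min_samples),
    ).fetchone()
    con.close()
    return row  # (tool_name, pass_rate, avg_s) or None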

Hive-Wide Learning

The coding daemon exports benchmark deltas every 10 ticks (~5 min):

# coding_daemon.py
if self._tick_count % 10 == 0:
    self._sync_benchmark_deltas()

FederatedAggregator picks up the delta and distributes to peers. Each peer imports the aggregated data for hive-wide tool routing intelligence.

AutoResearch Integration

File: integrations/coding_agent/autoevolve_code_tools.py

The autoresearch engine feeds into the benchmark stack at three points:

# After each experiment iteration
tracker.record(task_type='autoresearch', tool_name='aider_native_backend',
               completion_time_s=result.duration_s, success=not result.error)

# After each improvement (commit)
AgentBaselineService.capture_snapshot(trigger='autoresearch_improvement')

# After session ends (report save)
tracker.export_learning_delta()  # enriched with autoresearch session summary

The autoresearch hypothesis generator also queries tracker.get_best_tool('autoresearch') to include benchmark-informed context in its prompts. See auto-evolve.md.

PR Review Service Integration

File: integrations/agent_engine/pr_review_service.py

Uses baseline validation as a build breaker gate:

PR Review Pipeline:
  1. Fetch PR diff stats
  2. Run pre-commit checks (ruff lint)
  3. Run test suite
  4. Validate baseline (no regression)    ← BUILD BREAKER
  5. Classify change complexity

Decision Matrix:
  Tests Pass + No Regression + Simple  → AUTO-APPROVE
  Tests Pass + No Regression + Complex → FLAG for steward
  Tests Pass + Regression              → AUTO-REJECT
  Tests Fail                           → AUTO-REJECT
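
The matrix collapses to a small decision function; a sketch (review_decision is illustrative, not the service's API):

def review_decision(tests_pass: bool, regression: bool, is_complex: bool) -> str:
    """Map PR review signals to the outcomes in the matrix above."""
    if not tests_pass or regression:
        return 'AUTO-REJECT'
    return 'FLAG for steward' if is_complex else 'AUTO-APPROVE'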

Agent Daemon Integration

File: integrations/agent_engine/agent_daemon.py

Background daemon periodically validates agent performance:

# Every 2*remediate_every ticks
result = AgentBaselineService.validate_against_baseline(prompt_id, flow_id)
if result and not result.get('passed', True):
    capture_baseline_async(...)  # Auto-snapshot on regression

Federation Integration

File: integrations/agent_engine/federated_aggregator.py

Benchmark results are synced across the hive via the gossip protocol:

def _get_benchmark_results(self) -> dict:
    """Pull latest benchmark results + coding agent deltas."""
    from integrations.coding_agent.benchmark_tracker import get_benchmark_tracker
    results: dict = {}  # registry results gathered earlier in the method (elided)
    coding_delta = get_benchmark_tracker().export_learning_delta()
    results['coding_benchmarks'] = coding_delta.get('coding_benchmarks', {})
    return results

Aggregated data flows back to each node's tool router for hive-optimized task routing.

Test Coverage

Test File                                               Coverage
tests/unit/test_agent_baseline_service.py (424 lines)  Snapshots, versioning, regression, trends
tests/unit/test_federation_upgrade.py                   BenchmarkRegistry adapters, is_upgrade_safe()
tests/unit/test_coding_tool_backends.py                 get_best_tool() with min samples