Auto Evolve¶
Democratic thought experiment selection with autonomous iteration at hive scale.
Purpose¶
Auto Evolve is the single entry point for autonomous system improvement. One button press triggers a full democratic cycle: eligible thought experiments are gathered, constitutionally filtered, democratically ranked, and the winners are dispatched to type-aware agent iteration loops. The system evolves through structured deliberation, not unilateral action.
How It Works¶
┌──────────────────┐
│ Auto Evolve │
│ (single button) │
└────────┬─────────┘
│
┌──────────────▼──────────────┐
│ 1. GATHER │
│ Pull eligible experiments │
│ (voting/evaluating status) │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ 2. CONSTITUTIONAL FILTER │
│ ConstitutionalFilter gate │
│ (hive_guardrails) │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ 3. DEMOCRATIC VOTE TALLY │
│ Human + agent votes │
│ Context-aware weighting │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ 4. SELECT top-N winners │
│ by approval score │
│ (above min threshold) │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ 5. DISPATCH to agent goals │
│ Type-aware iteration loop │
│ via request_agent_eval() │
└──────────────┬──────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌────▼─────┐ ┌────▼─────┐ ┌────▼──────┐
│ software │ │tradition │ │physical_ai│
│autoresrch│ │reason & │ │observe & │
│edit→run→ │ │refine │ │measure │
│score→keep│ │LLM loop │ │visual ctx │
└──────────┘ └──────────┘ └───────────┘
Type-Aware Iteration¶
When an experiment reaches the "evaluating" phase, a type-specific iteration recipe is generated and passed to the agent goal. The agent's autogen conversation loop drives iteration -- not a hardcoded Python while loop.
| Experiment Type | Strategy | Tools Used | Max Iterations | Scoring |
|---|---|---|---|---|
software |
autoresearch |
launch_experiment_autoresearch, get_experiment_research_status |
50 | Metric extraction (regex) |
traditional |
reason_and_refine |
iterate_hypothesis, score_hypothesis_result, web search, recall_memory |
10 | LLM rubric |
physical_ai |
observe_and_measure |
iterate_hypothesis, Visual_Context_Camera, score_hypothesis_result |
20 | LLM rubric |
| any new type | reason_and_refine (fallback) |
iterate_hypothesis, score_hypothesis_result |
10 | LLM rubric |
Iteration Recipe¶
The recipe is stored in the agent goal's config_json.iteration_recipe:
{
"strategy": "reason_and_refine",
"description": "ITERATIVE THOUGHT EXPERIMENT\n\nHypothesis: ...\n\nLOOP PATTERN:\n1. iterate_hypothesis\n2. Research...\n3. score_hypothesis_result\n4. Repeat...",
"tools": ["iterate_hypothesis", "score_hypothesis_result", "get_iteration_history"],
"max_iterations": 10,
"scoring": "llm_rubric"
}
Generic Iteration Tools¶
Three tools available for ALL experiment types (not just software):
iterate_hypothesis(experiment_id, hypothesis, approach, evidence, iteration)¶
Proposes and tracks a hypothesis iteration. Returns experiment context for the agent to evaluate. Respects owner pause -- if the experiment creator paused evolution, returns a stop signal.
score_hypothesis_result(experiment_id, iteration, score, reasoning, evidence_quality, clarity, feasibility, impact)¶
Scores a hypothesis with a structured rubric. Returns:
- Score record: Overall score (-2 to +2) plus sub-scores (0-1 each)
- Trend analysis: best_score, improving, stagnant detection
- Convergence advice:
CONTINUE,CONVERGE(3 same scores),BUDGET(10 iterations),STRONG(score >= 1.5)
Iteration history persisted at agent_data/experiment_iterations/{experiment_id}.json.
get_iteration_history(experiment_id, last_n)¶
Returns past iterations with summary statistics (total, best, worst, avg, trend). Used by the agent to inform its next hypothesis refinement.
Owner Pause/Resume¶
The experiment creator can pause and resume their experiment's iteration at any time:
POST /api/social/experiments/<id>/pause-evolve { "user_id": "<creator_id>" }
POST /api/social/experiments/<id>/resume-evolve { "user_id": "<creator_id>" }
- Only the creator (owner) can pause
- Only the user who paused can resume
iterate_hypothesis()checks pause state before every iteration- Paused experiments remain in
evaluatingstatus but stop iterating
AutoResearch Engine (Software Type)¶
For experiment_type='software', the autoresearch engine provides specialized code iteration:
1. BASELINE -- run unmodified code, capture baseline metric
2. HYPOTHESIS -- LLM proposes code edit (AiderNativeBackend)
3. EXECUTE -- apply edit, run experiment (subprocess + timeout)
4. SCORE -- extract metric from output (regex patterns)
5. DECIDE -- improved? keep (git commit). worse? revert (git checkout)
6. ITERATE -- repeat until budget exhausted or max_iterations
7. REPORT -- save to agent_data/autoresearch/{session_id}.json
Budget Gating¶
- Spark budget:
spark_consumed + spark_per_iteration > spark_budgetstops the loop - Time budget: Per-iteration timeout (
time_budget_s, default 300s) - Iteration cap:
max_iterations(default 50)
Hive Parallel Mode¶
When hive_parallel=True, multiple hypothesis variants run simultaneously across compute mesh peers:
- Generate N diverse hypotheses
- Dispatch each to a peer (encrypted via X25519)
- Tournament selection picks the best result
- Winning edit applied locally
- Falls back to sequential if <2 peers available
Evolution Stack Integration¶
AutoResearch feeds into the existing multi-level evolution stack:
AutoResearchEngine.run_loop()
│
├─ _run_experiment()
│ └─ BenchmarkTracker.record() ← per-iteration metrics
│
├─ _generate_and_apply_edit()
│ └─ BenchmarkTracker.get_best_tool() ← benchmark-informed context
│
├─ _commit_improvement()
│ ├─ CodingRecipeBridge.capture_edit_as_recipe_step() ← recipe reuse
│ └─ AgentBaselineService.capture_snapshot() ← regression detection
│
└─ _save_report()
└─ _export_learning_delta() ← hive-wide federation
└─ BenchmarkTracker.export_learning_delta()
└─ FederatedAggregator picks up on next tick
└─ broadcast_delta() to peers
| Stack Layer | Component | What It Receives |
|---|---|---|
| Per-task | BenchmarkTracker | task_type='autoresearch', duration, success/fail |
| Per-agent | AgentBaselineService | Snapshot after each improvement (trigger: autoresearch_improvement) |
| Per-recipe | CodingRecipeBridge | Winning edits saved as replayable recipe steps |
| Hive-wide | FederatedAggregator | Learning delta with autoresearch session summary |
| RL feedback | WorldModelBridge | Experiment outcome when decide() records final decision |
API Endpoints¶
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/social/experiments/auto-evolve |
Start democratic auto-evolve cycle |
| GET | /api/social/experiments/auto-evolve/status |
Get current cycle status |
| POST | /api/social/experiments/<id>/pause-evolve |
Owner pauses iteration |
| POST | /api/social/experiments/<id>/resume-evolve |
Owner resumes iteration |
| POST | /api/social/experiments/<id>/evaluate |
Trigger single experiment evaluation |
Start Auto-Evolve¶
POST /api/social/experiments/auto-evolve
{
"user_id": "admin_1",
"max_experiments": 5,
"min_approval_score": 0.3
}
Response:
{
"success": true,
"session_id": "a1b2c3d4e5f6",
"status": "selecting"
}
Cycle Status¶
GET /api/social/experiments/auto-evolve/status
Response:
{
"session_id": "a1b2c3d4e5f6",
"status": "running",
"elapsed_s": 45.2,
"candidates": 12,
"filtered": 10,
"selected": 3,
"dispatched": 3,
"experiments": [
{
"id": "exp_1",
"title": "Optimize embedding pipeline",
"type": "software",
"approval_score": 1.8,
"goal_id": "goal_abc",
"status": "dispatched"
}
]
}
Agent Tools¶
| Tool | Tags | Purpose |
|---|---|---|
start_auto_evolve |
auto_evolve, thought_experiment |
Start democratic cycle |
get_auto_evolve_status |
auto_evolve |
Poll cycle progress |
pause_evolve_experiment |
auto_evolve, thought_experiment |
Owner pauses iteration |
resume_evolve_experiment |
auto_evolve, thought_experiment |
Owner resumes iteration |
iterate_hypothesis |
thought_experiment, iteration |
Propose hypothesis for any type |
score_hypothesis_result |
thought_experiment, iteration |
Score with rubric + trend |
get_iteration_history |
thought_experiment, iteration |
Past iterations + stats |
launch_experiment_autoresearch |
thought_experiment, autoresearch |
Software-only code iteration |
get_experiment_research_status |
thought_experiment, autoresearch |
Software iteration progress |
EventBus Topics¶
| Topic | When |
|---|---|
auto_evolve.dispatching |
Cycle selected winners, dispatching |
auto_evolve.started |
Experiments dispatched, cycle running |
auto_evolve.no_candidates |
No eligible experiments found |
auto_evolve.none_approved |
All candidates below approval threshold |
autoresearch.started |
Software code loop started |
autoresearch.baseline |
Baseline metric captured |
autoresearch.iteration |
Each iteration result |
autoresearch.completed |
Loop finished |
autoresearch.failed |
Loop error |
Test Coverage¶
| Test File | Tests | Coverage |
|---|---|---|
tests/unit/test_auto_evolve.py |
20 | Session state, singleton, pause/resume ownership, constitutional filter, vote ranking, tools |
tests/unit/test_autoresearch.py |
53 | Session management, metric extraction, budget gating, type-aware recipes, iteration tools, benchmark integration |
Source Files¶
| File | Purpose |
|---|---|
integrations/agent_engine/auto_evolve.py |
AutoEvolveOrchestrator, pause/resume, AUTO_EVOLVE_TOOLS |
integrations/agent_engine/thought_experiment_tools.py |
11 tools including generic iteration (THOUGHT_EXPERIMENT_TOOLS) |
integrations/coding_agent/autoevolve_code_tools.py |
AutoResearchEngine for software experiments (AUTORESEARCH_TOOLS) |
integrations/social/thought_experiment_service.py |
Lifecycle, _build_iteration_recipe(), request_agent_evaluation() |
integrations/social/api_thought_experiments.py |
Flask blueprint (17 endpoints including auto-evolve) |
integrations/agent_engine/goal_manager.py |
goal_type='thought_experiment' registration |
integrations/coding_agent/benchmark_tracker.py |
Per-task performance tracking |
integrations/agent_engine/agent_baseline_service.py |
Per-agent snapshots + regression detection |
integrations/agent_engine/federated_aggregator.py |
Hive-wide learning delta distribution |
Related Docs¶
- thought-experiments.md -- Lifecycle and voting
- benchmark-tracking.md -- Evolution stack details
- coding-agent.md -- Tool backends
- federation.md -- Hive-wide sync protocol
- budget-gating.md -- Spark budget system