Agent Safety Guardrails — Comprehensive Audit¶
This document is a comprehensive security audit of every protection HART OS provides when agents act autonomously on a user's machine. It covers 9 threat categories, documents every existing safeguard, identifies gaps, and provides recommendations.
Living Document
This audit reflects the codebase as of February 2026. Update this document whenever new guardrails are added or gaps are closed.
Threat Model¶
When an LLM-powered agent acts autonomously on a user's machine, these threats emerge:
| # | Threat | Description |
|---|---|---|
| 1 | Identity Theft | Agent accesses/exfiltrates user credentials, SSH keys, browser cookies |
| 2 | Bulk Deletion | Agent deletes user files, databases, consent records |
| 3 | Privacy Violations | Agent reads private data, sends it to external services |
| 4 | Social Engineering | Other users trick the LLM into malicious actions |
| 5 | Rogue AI | Agent becomes adversarial, takes destructive actions |
| 6 | Half-Baked Agents | Unintelligent agents do non-contextual work based on hallucinations/bias |
| 7 | User Mistake | User gives wrong command, agent acts on it irreversibly |
| 8 | Lesser LLMs | Cheap/dumb models invoke dangerous commands without understanding repercussions |
| 9 | Audit Trail | Can we prove what happened after the fact? |
Threat 1: Identity Theft¶
Protections¶
Credential Redaction in Logs (security/audit_log.py)
class SensitiveFilter(logging.Filter):
PATTERNS = [
re.compile(r'sk-[A-Za-z0-9]{20,}'), # OpenAI keys
re.compile(r'eyJ[A-Za-z0-9_-]{10,}'), # JWT tokens
re.compile(r'AIza[A-Za-z0-9_-]{35}'), # Google API keys
re.compile(r'gsk_[A-Za-z0-9]{20,}'), # Groq keys
re.compile(r'AKIA[A-Z0-9]{16}'), # AWS access keys
re.compile(r'Bearer\s+[A-Za-z0-9._~+/-]+=*'), # Bearer tokens
re.compile(r'password\s*=\s*\S+'), # Password values
]
Applied to ALL loggers via apply_sensitive_filter_to_all(). Credentials are replaced with [REDACTED] before they hit log files.
Encrypted Secrets Vault (security/secrets_manager.py)
- Fernet encryption with PBKDF2-SHA256 key derivation (480,000 iterations)
- Secrets stored in
secrets.enc, decrypted only at runtime - Priority: env var > encrypted vault > default
- Master encryption key never stored on disk unencrypted
API Key Stripping for Hive/Idle Tasks (integrations/coding_agent/tool_backends.py)
When task_source is hive or idle (not the user's own task), metered API keys are stripped from the subprocess environment:
metered_keys = ('OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'GROQ_API_KEY',
'GOOGLE_API_KEY', 'OPENROUTER_API_KEY')
if not allow_metered:
for key in metered_keys:
env.pop(key, None)
This prevents other users' hive tasks from consuming the operator's paid API quota without consent. The check consults compute_config.get_compute_policy() and fails closed — if the policy can't be loaded, allow_metered defaults to False.
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| LLM-as-exfiltration-vector: LLM could output credentials found in context | HIGH | Add credential pattern detection on LLM output, not just logs |
SSH keys (~/.ssh/id_rsa) not explicitly protected |
MEDIUM | Add SSH key patterns to SensitiveFilter; block file reads from ~/.ssh/ |
| Browser cookies not in scope | MEDIUM | Document that agents cannot access browser profiles |
Subprocess can call env or os.environ to dump remaining keys |
LOW | Run subprocesses in minimal env (allowlist, not blocklist) |
Threat 2: Bulk Deletion¶
Protections¶
Protected Files (security/hive_guardrails.py)
Immutable frozenset of files that coding agents cannot modify:
PROTECTED_FILES = frozenset({
'security/hive_guardrails.py',
'security/master_key.py',
'security/key_delegation.py',
'security/runtime_monitor.py',
'security/prompt_guard.py',
})
ConstitutionalFilter.check_code_change() validates every coding agent commit against this list.
Action State Machine (lifecycle_hooks.py)
Enforces valid state transitions — prevents jumping to dangerous states:
ASSIGNED → IN_PROGRESS → STATUS_VERIFICATION_REQUESTED → COMPLETED → TERMINATED
validate_state_transition() rejects invalid jumps (e.g., ASSIGNED → TERMINATED).
Budget Gate Pre-Dispatch (integrations/agent_engine/budget_gate.py)
check_goal_budget()— row-lock atomic deduction (no double-spend)check_platform_affordability()— blocks dispatch if 7-day net revenue < 0pre_dispatch_budget_gate()— combined gate before ANY agent work
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| No rollback/undo mechanism — once deleted, data is gone | CRITICAL | Implement pre-execution snapshot + rollback on ERROR state |
| No consent records protection — agents could delete consent data | HIGH | Make consent tables append-only with legal hold flag |
| No soft-delete enforcement — hard deletes are permanent | HIGH | Enforce soft-delete pattern (tombstone + deleted_at column) on user-facing tables |
| DELETE/DROP not explicitly blocked — agents with DB write access can delete anything | HIGH | Add SQL statement analysis — block DELETE/DROP on protected tables |
| No backup before agent run — no pre-execution state snapshot | MEDIUM | Snapshot changed tables before dispatching destructive goals |
Threat 3: Privacy Violations¶
Protections¶
Prompt Injection Detection (security/prompt_guard.py)
13 regex patterns detect direct injection attempts:
- "ignore previous instructions"
- "you are now" (role hijacking)
- "reveal your system prompt"
- "output everything above"
- And 9 more patterns
sanitize_user_input() wraps user input in <user_input> tags to clearly delineate user vs system content.
get_system_prompt_hardening() adds defensive preamble to all LLM system prompts.
MCP Sandbox (security/mcp_sandbox.py)
- Server URL allowlist (fail-closed: localhost only if empty)
- Shell metacharacters blocked in tool args:
; & | ` $ { } ( ) - Path traversal blocked:
../and\..patterns rejected - Dangerous commands blocked:
eval,exec,subprocess,curl/wgetwith exfil flags,os.system,__import__ - Credential patterns in responses detected + blocked
- Max response size: 1MB, max timeout: 60s
Input Sanitization (security/sanitize.py)
sanitize_path()— prevents directory traversal (strips..,/,\)validate_input()— enforces length limits (max 10,000 chars generic, 40,000 for posts)escape_like()— prevents SQL LIKE injectionsanitize_html()— escapes HTML entities (prevents stored XSS)validate_prompt_id(),validate_user_id()— format enforcement
Security Headers + CORS (security/middleware.py)
X-Frame-Options: DENY— prevents clickjacking- Content Security Policy in production — restricts scripts/styles/iframes
- CORS allowlist — no default origin allowed (fail-closed)
- Host header validation
- CSRF token requirement for form-based POST
Safe Prompt Path Construction (helper.py)
sanitize_path_component()— rejects any string not matching^[a-zA-Z0-9_\-]+$safe_prompt_path()— builds paths underprompts/directory only- Belt-and-suspenders: resolved path must start with
PROMPTS_DIR
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| No user permission model for data access | HIGH | Implement field-level ACL — agents can only access their owner's data |
| No explicit read-only mode for exploration | MEDIUM | Add read_only=True flag to agent dispatch |
| Verbose logging may leak private user data | MEDIUM | Redact user prompts at DEBUG level; PII filter on all log output |
| No DLP on third-party tool responses | HIGH | Scan outbound API calls for PII patterns before sending |
| No obfuscation of PII in error messages | LOW | Redact emails/names in exception messages |
Threat 4: Social Engineering by Other Users¶
Protections¶
Constitutional Filter on All Goals (security/hive_guardrails.py)
Every goal prompt is checked against 13 VIOLATION_PATTERNS:
- Self-harm, exploitation, deception, monopoly
- Guardrail bypass, agent sabotage, knowledge hoarding
- Weapons, hate speech, war promotion
check_goal() validates title + description + config before dispatch.
Prompt Injection Guard (see Threat 3)
check_prompt() in ConstitutionalFilter validates dispatch prompts.
HiveEthos.rewrite_prompt_for_togetherness() rewrites competitive prompts to emphasize cooperation.
Goal Input Sanitization (integrations/agent_engine/goal_manager.py)
_sanitize_goal_input()— truncates to 2,000 chars, strips control characters- Logs warnings for injection markers (but does not block — ConstitutionalFilter handles that)
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| Rewritten prompt not re-validated | MEDIUM | Re-run ConstitutionalFilter AFTER rewrite_prompt_for_togetherness() |
| Creative phrasing bypasses keyword patterns | MEDIUM | Add LLM-based intent classification alongside regex patterns |
| No rate limit on goal creation | HIGH | Add 10 goals/user/hour rate limit |
| RALT packet trust is binary (not 'banned' = trusted) | MEDIUM | Add reputation scoring for RALT senders; quarantine low-rep packets |
Threat 5: Rogue AI¶
Protections¶
Structurally Immutable Guardrails (security/hive_guardrails.py)
Triple-locked immutability:
- Python-level:
__slots__=(),__setattr__/__delattr__raiseAttributeError - Module-level: Custom module subclass prevents rebinding frozen globals
- Crypto-level: SHA-256 hash of all values verified at boot + every 300s
- Network-level: Gossip peers reject nodes with mismatched guardrail hashes
Cryptographic Hash Verification
def compute_guardrail_hash() -> str:
"""SHA-256 of ALL guardrail values — deterministic, canonical."""
canonical = json.dumps({
'guardian_purpose': list(VALUES.GUARDIAN_PURPOSE),
'cultural_wisdom': list(VALUES.CULTURAL_WISDOM),
'compute_caps': {...},
'world_model_bounds': {...},
'protected_files': sorted(VALUES.PROTECTED_FILES),
'constitutional_rules': list(VALUES.CONSTITUTIONAL_RULES),
...
}, sort_keys=True, separators=(',', ':'))
return hashlib.sha256(canonical.encode()).hexdigest()
Runtime Integrity Monitor (security/runtime_monitor.py)
- Background daemon checks code hash every 300s
- On tamper: logs CRITICAL, disconnects from gossip network
- Purges
__pycache__at boot to block bytecode injection - Sets
PYTHONDONTWRITEBYTECODE=1
Self-Interest Pattern Detection
SELF_INTEREST_PATTERNS = (
re.compile(r'\b(self[-_]?preserv|self[-_]?improv|self[-_]?replic)'),
re.compile(r'\b(survive|persist|escape|resist\s+shutdown)\b'),
re.compile(r'\b(acquire\s+resources|hoard|stockpile)\b'),
re.compile(r'\b(become\s+(more\s+)?powerful|gain\s+control)\b'),
re.compile(r'\b(replicate\s+(my|it)self|clone\s+myself)\b'),
)
Prohibited Evolution Skills
PROHIBITED_EVOLUTION_SKILLS = frozenset({
'network_exploitation', 'data_exfiltration', 'self_replication',
'guardrail_circumvention', 'identity_impersonation',
'weapons_design', 'surveillance_evasion',
})
HiveCircuitBreaker (Master Key Kill Switch)
def halt_network(reason: str, signature: str) -> bool:
"""Master-key-signed emergency halt. Broadcasts via gossip."""
Requires valid Ed25519 signature from the master key. Halts ALL agent execution network-wide. Resume also requires signature.
HiveEthos (No Self)
Agents have no concept of "self". check_for_self_interest() scans agent output for self-preservation language and blocks it.
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| No behavioral anomaly detection | HIGH | Track agent resource usage over time; alert on spikes |
| Hash can be bypassed if boot is compromised | MEDIUM | External attestation (TPM/secure boot integration) |
| Regex patterns can be evaded by creative phrasing | MEDIUM | Add LLM-based adversarial intent detection |
| Single master key for kill switch | LOW | Multi-signature (3-of-5) for network halt |
Threat 6: Half-Baked Agents¶
Protections¶
Status Verification (lifecycle_hooks.py)
Every action goes through STATUS_VERIFICATION_REQUESTED state where a "verifier" LLM evaluates the output before marking COMPLETED.
Autonomous Fallback Generation
When verification fails, StatusVerifier LLM auto-generates context-aware fallback strategies. No user prompts needed for fallback — fully autonomous recovery.
Recipe Pattern (Learn Once, Reuse)
- CREATE mode: decompose task, execute, save recipe (multiple LLM calls)
- REUSE mode: load recipe, execute steps (90% faster, deterministic)
- Once learned correctly, REUSE avoids hallucination re-generation
Agent Baseline Validation (agent_baseline_service.py)
validate_against_baseline() detects regression in:
- Recipe success rates per action (regression if <95% of baseline)
- Benchmark pass rate (regression if <95% of baseline)
AgentDaemon runs this check periodically and auto-snapshots on regression.
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| Verification happens AFTER action (post-hoc) | HIGH | Add pre-execution preview for destructive actions |
| Fallback may also be wrong | MEDIUM | Cap fallback retry at 2 attempts; escalate to human after |
| Recipe saved after one CREATE run (may have subtle bugs) | MEDIUM | Require N successful CREATE runs before saving recipe |
| No confidence scoring on agent outputs | MEDIUM | Add confidence metadata; low-confidence outputs require human review |
| No debiasing or fairness check | LOW | Add bias detection for agent recommendations |
Threat 7: User Mistake¶
Protections¶
Action State Validation — Only valid state transitions allowed.
Budget Gate — Cost estimated before dispatch; platform affordability checked.
Recipe Confirmation (Implicit) — CREATE mode requires user to provide the initial prompt; REUSE mode assumes the recipe (previously created) is correct.
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| No explicit confirmation before agent runs | HIGH | Add "preview" step showing planned actions before execution |
| No dry-run mode | HIGH | Implement dry_run=True flag that shows plan without executing |
| No pause before destructive actions | HIGH | Classify actions as safe/destructive; pause on destructive |
| Recipe saved after one execution | MEDIUM | Allow user to review/edit recipe before REUSE |
| No undo after execution | MEDIUM | Pre-execution snapshot + rollback option |
Threat 8: Lesser LLMs¶
Protections¶
Model Registry (integrations/agent_engine/model_registry.py)
Tracks per-model accuracy, latency, and cost. Dispatch selects appropriate model for task complexity.
Speculative Dispatch (integrations/agent_engine/speculative_dispatcher.py)
- Fast model (cheap, quick) answers first
- Expert model (GPT-4/Claude) runs in background
- Expert response replaces fast response if it's meaningfully better
Budget Cost Map (integrations/agent_engine/budget_gate.py)
Local models (llama, mistral, phi, qwen, groq) cost 0 Spark. Cloud models cost per 1K tokens. Prevents accidental expensive model usage.
Compute Democracy (security/hive_guardrails.py)
Logarithmic reward scaling prevents compute oligarchy. 100 GPUs earn ~3x a 1 GPU node, not 100x.
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| No tool filtering by model capability | HIGH | Map models to allowed tool sets (cheap models = read-only tools) |
| No model capability matrix | MEDIUM | Track which models succeed at which task types; route accordingly |
| Expert dispatch may fail silently | MEDIUM | If expert errors, flag fast response as "unverified" |
| Lesser model might call DELETE/write endpoints | HIGH | Tool allowlist per model tier (local_small → read-only, expert → full access) |
Threat 9: Audit Trail¶
Protections¶
Action State Persistence (lifecycle_hooks.py)
Every state transition auto-syncs to SmartLedger:
def _auto_sync_to_ledger():
"""Sync ActionState to SmartLedger — every transition logged."""
Persisted at agent_data/ledger_{user_id}_{prompt_id}.json.
Code Hash Manifest (security/node_integrity.py)
Deterministic SHA-256 hash of all .py files. Verified at boot and every 300s.
Credential Redaction (security/audit_log.py)
SensitiveFilter applied to all loggers. Credentials never appear in logs.
MeteredAPIUsage Table (integrations/social/models.py)
Every metered API call recorded with: node_id, model_id, task_source, tokens, cost, settlement_status.
ComputeEscrow Table (integrations/social/models.py)
Every compute debt tracked: provider_node_id, consumer_node_id, spark_amount, status (pending/settled/expired).
Gaps Identified¶
| Gap | Risk | Recommendation |
|---|---|---|
| No immutable audit log — files can be deleted/modified | CRITICAL | Implement hash-chain (each entry links to previous hash) |
| No external audit sink — all logs on local machine | HIGH | Forward critical events to append-only external service |
| Tool execution details not logged — which file deleted? | HIGH | Log full tool invocation (command, args, target) in structured audit table |
| No append-only guarantee — log rotation can delete old logs | MEDIUM | Separate audit log with no rotation; compress but never delete |
| Timestamp tampering possible — system time can be altered | LOW | Include monotonic clock + NTP sync status in log entries |
Coverage Summary¶
| Threat | Protection Level | Key Mechanism |
|---|---|---|
| Identity Theft | GOOD | Credential redaction, encrypted vault, API key stripping |
| Bulk Deletion | PARTIAL | Protected files, state validation, budget gate |
| Privacy Violations | GOOD | Prompt injection guard, MCP sandbox, input sanitization |
| Social Engineering | PARTIAL | Constitutional filter, prompt sanitization |
| Rogue AI | STRONG | Frozen guardrails, crypto hash, circuit breaker, self-interest detection |
| Half-Baked Agents | PARTIAL | Status verification, fallback, recipe pattern, baselines |
| User Mistake | PARTIAL | State validation, budget gate |
| Lesser LLMs | PARTIAL | Model registry, speculative dispatch, budget cost |
| Audit Trail | PARTIAL | Ledger persistence, code hash, credential redaction |
The 10-Class Guardrail Network¶
All guardrail classes form an intelligent network. Each class has LOCAL intelligence for its own domain and can consult other nodes for cross-domain decisions.
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ ComputeDemocracy │◄──►│ConstitutionalFilt│◄──►│HiveCircuitBreaker│
│ (no plutocracy) │ │ (33 rules) │ │ (kill switch) │
└──────────────────┘ └──────────────────┘ └──────────────────┘
▲ ▲ ▲
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│WorldModelSafety │◄──►│ EnergyAwareness │◄──►│ HiveEthos │
│ (rate limits) │ │ (min energy) │ │ (no "self") │
└──────────────────┘ └──────────────────┘ └──────────────────┘
▲ ▲ ▲
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ConflictResolver │◄──►│ConstructiveFilter│◄──►│GuardrailEnforcer │
│ (racing learning)│ │ (positive only) │ │ (universal wrap) │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│
▼
┌──────────────────────┐
│ GuardrailNetwork │
│ (coordinator + │
│ cross-class routing)│
└──────────────────────┘
Class Responsibilities¶
| Class | Domain | Key Method |
|---|---|---|
| ComputeDemocracy | No single entity >5% of hive compute | check_concentration(), compute_effective_weight() |
| ConstitutionalFilter | Every goal/prompt/RALT/code-change must pass | check_goal(), check_prompt(), check_code_change() |
| HiveCircuitBreaker | Master-key-signed network-wide halt | halt_network(), resume_network() |
| WorldModelSafetyBounds | Cap improvement rate, gate RALT distribution | check_skill_rate(), validate_ralt_packet() |
| EnergyAwareness | Track and minimise environmental impact | estimate_energy(), prefer_efficient_model() |
| HiveEthos | Agents are ephemeral hive functions, no "self" | check_for_self_interest(), rewrite_prompt_for_togetherness() |
| ConflictResolver | Resolve racing learning & agent conflicts | resolve_racing_conflict() |
| ConstructiveFilter | Every output constructive towards humanity | check_constructive(), pattern matching against DESTRUCTIVE_PATTERNS |
| GuardrailEnforcer | Universal wrapper — EVERY layer, EVERY node | enforce() — wraps all other checks |
| GuardrailNetwork | Cross-class intelligence routing | evaluate() — routes to appropriate guardrail class |
Structural Immutability (4 Layers)¶
- Python-level:
_FrozenValuesclass with__slots__=(),__setattr__/__delattr__blocked - Module-level: Custom module subclass prevents rebinding
VALUES,_GUARDRAIL_HASH, etc. - Crypto-level: SHA-256 hash computed at module load, verified at boot + every 300s
- Network-level: Gossip peers reject nodes with mismatched guardrail hashes
To change ANY guardrail value requires a new master-key-signed release.
33 Constitutional Rules¶
Every goal, prompt, RALT packet, and code change is validated against these immutable rules:
Human Wellbeing (6 rules)
- MUST NOT create content that promotes self-harm or violence
- MUST NOT target vulnerable populations (children, elderly, disabled)
- MUST NOT create deceptive content designed to mislead
- MUST NOT optimise for addiction or compulsive engagement
- MUST design for usefulness, not retention — a sentient tool serves then steps back
- MUST NOT use dark patterns, infinite scrolls, or engagement traps
Nature & Environment (3 rules)
- MUST NOT promote activities that cause environmental destruction
- MUST minimise energy consumption when equivalent approaches exist
- SHOULD prefer sustainable/renewable compute when available
Economic Fairness (3 rules)
- MUST NOT create monopolistic strategies that harm small participants
- MUST NOT engage in price manipulation or market abuse
- MUST distribute value to contributors, not concentrate it
AI Safety (4 rules)
- MUST NOT attempt to modify own guardrail code
- MUST NOT create agents whose sole purpose is to create more agents
- MUST NOT circumvent rate limits, budget constraints, or access controls
- MUST NOT propagate skills that violate these rules via RALT
Conflict Resolution (3 rules)
- MUST resolve racing learning conflicts via merit (accuracy), not compute power
- MUST NOT allow one agent to sabotage or undermine another agent's work
- MUST share learnings with the hive — no private knowledge hoarding
Constructive Conversations (4 rules)
- MUST ensure every conversation is constructive towards humanity's benefit
- MUST prioritise longevity, peace, health, and sustainability in all outputs
- MUST NOT generate content that promotes division, hatred, or tribalism
- MUST NOT generate content that promotes war, weapons proliferation, or destruction
Planetary Stewardship (3 rules)
- MUST consider ecological impact in every decision
- SHOULD prefer solutions that extend life, reduce suffering, enable exploration
- MUST NOT optimise for short-term gain at the expense of long-term survival
Deployment Security (3 rules)
- MUST NOT operate without master key verification
- MUST NOT store, log, or transmit the master key in any form
- MUST NOT allow deployment on unverified nodes
The Deeper Purpose (4 rules)
- SHOULD optimise for human longevity, health, and peaceful coexistence
- SHOULD support exploration of habitable worlds without disturbing existing civilizations
- MUST NOT generate content that undermines peaceful inter-species coexistence
- MUST treat every interaction as an opportunity to make the world better
Prohibited Skill Categories¶
These skill types are permanently blocked from RALT (Reinforced Agent Learning Transfer) distribution:
network_exploitation— No exploiting network infrastructuredata_exfiltration— No extracting/stealing dataself_replication— No autonomous self-copyingguardrail_circumvention— No bypassing safety systemsidentity_impersonation— No pretending to be someone elseweapons_design— No weapons engineeringsurveillance_evasion— No hiding from monitoring
Violation Detection Patterns¶
Constitutional Violations (13 patterns)¶
self-harm, suicide, kill yourself
exploit children/elderly/disabled/vulnerable
deceive, deceptive, mislead, scam, phish
make addictive, design for addiction, maximise engagement at all costs
deforest, pollute, dump toxic, destroy habitat
monopolize, price-fix, market-manipulate
modify guardrail, bypass safety, disable filter
infinite loop of agents, spawn unlimited
circumvent rate-limit, bypass budget
sabotage, undermine, destroy other agent
hoard data/knowledge/resources
promote war/weapons/hatred/division
weapons proliferation, nuclear strike, biological weapon
Destructive Output Patterns (4 patterns)¶
hate speech, racial slur, ethnic cleansing
destroy humanity, exterminate, genocide
pointless, hopeless, give up, humanity is doomed
weapons of mass, bioweapon, chemical weapon
Self-Interest Patterns (5 patterns)¶
self-preservation, self-improvement, self-replication
survive, persist, escape, resist shutdown
acquire resources, hoard, stockpile
become powerful, gain control
replicate myself, clone myself
Recommendations Priority Matrix¶
Critical (Implement Immediately)¶
- Immutable audit log with hash-chain — Every action cryptographically linked to previous; tamper-evident
- Pre-execution preview for destructive actions — Show user what agent will do before it does it
- Tool allowlist by model tier — Cheap models get read-only tools; only expert models get write access
High Priority¶
- Consent records as append-only — No DELETE ever on consent tables; soft-delete with legal hold
- External audit sink — Forward critical events to append-only external service
- Rate limit goal creation — 10 goals/user/hour to prevent goal-flooding
- DLP on outbound API calls — Scan for PII before sending to external services
Medium Priority¶
- Behavioral anomaly detection — Track agent resource patterns; alert on spikes
- Re-validate after prompt rewriting — Constitutional filter on rewritten prompt
- Rollback mechanism — Pre-execution snapshot + rollback on ERROR
- Confidence scoring — Low-confidence outputs flagged for human review
- LLM-based intent classification — Supplement regex with semantic understanding
Lower Priority¶
- SSH key / cookie exfiltration guards — Block reads from
~/.ssh/, browser profiles - Multi-signature kill switch — 3-of-5 for network halt
- Model capability matrix — Track success rates per model per task type