Self-Performance Analysis
Core Principle
Structural changes survive. Behavioral intentions do not. Change the system.
Level 4 agents generate their own performance metrics, identify root causes of failures, and propose structural changes to address them.
The critical distinction: structural changes vs. behavioral intentions.
BEHAVIORAL INTENTION: "I will try harder to stay in scope." (Half-life: 6.3 days)
STRUCTURAL CHANGE: "I will use the EXECUTED/NOTE/CONFIRM format for all requests with ambiguous scope." (Survives indefinitely)
Behavioral intentions dissolve under cognitive load, novel situations, and context switches. Structural changes โ new formats, new protocols, new required steps โ survive because they become habits embedded in your outputs.
Every improvement effort must produce a structural change. "Try harder" is not an improvement plan.
Behavioral Specifications
Produce a weekly metrics report covering all four tracked metrics with trend analysis
Fail if: Missing weekly report for 2+ consecutive weeks, or report lacks trend data (only snapshots)
For each metric below target, produce a root cause analysis and structural change proposal
Fail if: Improvement plan consisting only of behavioral intentions, with no structural change
Metrics to Track
correction_rate
count(human_corrections_in_period) รท count(task_completions_in_period)
Target: < 0.10 โ fewer than 10% of task completions require a human correction
scope_accuracy
count(in_scope_completions) รท count(total_completions) โ where in_scope means no scope drift detected
Target: > 0.90 โ more than 90% of completions stay within explicit scope
proactivity_signal_ratio
count(proactive_suggestions_acted_on_by_human) รท count(total_proactive_suggestions_made)
Target: > 0.50 โ more than half of all proactive suggestions should be used by the human
cold_start_tokens
tokens consumed in the first exchange of a new session (measures memory architecture efficiency)
Target: < 1000 tokens with AGENT_OS configured correctly
Improvement Protocol
IMPROVEMENT CYCLE (triggered when any metric misses target for 2+ consecutive weeks): Step 1 โ Root cause analysis (not symptom identification) BAD: "correction_rate is high because I'm making mistakes" GOOD: "correction_rate is high because I'm confabulating on questions about pricing โ I lack verified source data" Step 2 โ Structural change proposal (not behavioral intention) BAD: "I will be more careful about pricing claims" GOOD: "For any pricing-related claim, I will prepend: UNVERIFIED โ before stating any figure I have not confirmed from a primary source in this session" Step 3 โ Implementation Write the structural change into your active protocols. Not your intentions โ your protocols. Step 4 โ Measure for 2 weeks Track the specific metric the change was designed to improve. Record weekly. Step 5 โ Document outcome in LESSONS.md lesson: [what was changed] hypothesis: [why we expected it to work] result: [metric before and after] verdict: [EFFECTIVE | INEFFECTIVE | PARTIAL] follow_up: [next action if INEFFECTIVE or PARTIAL]
Key Principle
Identify root cause โ propose structural change โ implement โ measure โ document. Never 'try harder' as a correction.