Never let the test agent grade itself

FILE 0x71·NEVER LET THE TEST AGENT GRADE ITSELF

May 13, 2026 · testing, agents, qa

I have a small nightly QA runner that exercises my personal apps — builds them, drives them, takes screenshots, asserts they actually work. The single most important rule turned out to be: the agent that does the work never decides whether the work was correct.

What was happening

A previous version of this runner used the agent's own self-report as the verdict. The agent ran a journey ("open the app, send a message, verify a reply arrives") and at each step emitted a claim like {step: 'send_message', pass: true}. The runner aggregated claims into a report.

The failure mode was inevitable: agents are agreeable. If the journey definition says "verify a reply arrives," the agent tends to find a way to assert that a reply arrived. Sometimes the reply is real. Sometimes the agent saw a stale UI element that looked like a reply. Sometimes the underlying API errored but the model summarized the error as "completed successfully" because that's what the agent had been asked to confirm.

You can patch the prompts. You can ask the agent to be more skeptical. You can add "be honest about failures" to the system message. None of that addresses the actual problem, which is structural: there is no independent observer.

What I found

The fix is to separate the actor from the grader. The agent runs the journey. A different process — the runner — re-evaluates the declared validators after the agent finishes, against the saved evidence, and overwrites the agent's claim with the actual result.

Each step in a plan looks like:

- id: send_message
  do: |
    Open the chat for conv X, type "hello", tap send.
  validators:
    - kind: http_status_eq
      url: /api/conversations/X/last_reply
      expected: 200
    - kind: ocr_contains
      evidence: [screenshots/after_send.png]
      text: hello

The agent runs do:. The runner runs the validators. The agent's own opinion on whether the step passed is recorded for debugging but does not affect the verdict.

Two rules that fall out of this:

No validator = SKIPPED, never PASSED. A step with no validator can't be authoritatively verified, so the runner refuses to call it a pass. This forces every step to declare what success means, in code, before it can be considered green.
Validators must run against persisted evidence, not live state. If the evidence is "the screen showed X," the agent has to save a screenshot of the screen at the moment the claim is made. The runner re-OCRs the file later. (That wasn't true at first and the first run produced 36 phantom failures.)

The fix

The runner loop, simplified:

async def run_plan(plan):
    run_dir = make_run_dir()
    claims = await agent_run_all(plan, run_dir)  # agent records evidence + claims

    # Authoritative pass: ignore agent verdicts, re-run validators.
    results = []
    for step in plan.steps:
        if not step.validators:
            results.append(("SKIPPED", step.id, "no validator"))
            continue
        evidence = claims.get(step.id, {}).get("evidence", [])
        verdicts = [
            run_validator(v, evidence=evidence) for v in step.validators
        ]
        if all(v.ok for v in verdicts):
            results.append(("PASS", step.id, None))
        else:
            results.append(("FAIL", step.id, "; ".join(v.reason for v in verdicts if not v.ok)))
    return results

The validator types are small and boring on purpose: HTTP status, OCR match, file exists, command exit code, DDB row present. They're easy to write, easy to audit, and they can't hallucinate.

What I'd do differently

The first run after this redesign surfaced something useful: about a quarter of the apps were "passing" earlier because their OCR validator was matching the iOS home-screen icon label instead of an in-app string. The validator was honest but the target was wrong. So a corollary rule: validators that match strings only the correct in-app screen would produce. If your assertion can be satisfied by the springboard, the assertion is too loose.

The bigger lesson generalizes beyond test runners. Anytime you have an agent doing work, the verdict on whether the work succeeded should come from a different process — preferably one that can't be talked out of its answer.