OCRing the live screen was the wrong validator
I built a small nightly QA runner for my personal iOS apps. It builds each app, boots a simulator, installs and launches, takes a screenshot, and OCRs the screenshot to confirm the right app actually rendered. The first clean run reported 36 failures. Almost none of them were real.
What was happening
The validator I called ocr_contains was POSTing to a small OCR
helper running on the Mac mini. The helper called Apple's Vision
framework against whatever was currently on screen.
The runner's design: agents run all the journeys first, recording evidence as they go. Then a separate "authoritative re-validation" pass replays every assertion. For the OCR step, that replay read the live screen rather than the saved evidence, and that is what made it lie: by the time the runner got around to re-OCRing for plan #5, the simulator was halfway through plan #20. The "live screen" the helper read belonged to a later app, not the one the assertion cared about.
The 27 iOS plans almost all failed at the OCR step with "0 matches found." The four that "passed" did so by accident: their target strings ("Cass", "Check on Mine", etc.) happened to also appear in the iOS springboard's icon labels. Those were false passes, not real ones.
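The timing problem is easy to reproduce without a simulator. The sketch below is illustrative only: `FakeSimulator`, `run_journeys`, `revalidate`, and the plan data are hypothetical stand-ins rather than the runner's real code, "OCR" is reduced to a substring check, and it assumes the simulator is back on the home screen by the time re-validation runs.

```python
# Hypothetical stand-ins: the "screen" is just a string of visible text.
SPRINGBOARD = "Cass  Check on Mine  Settings"  # home-screen icon labels


class FakeSimulator:
    def __init__(self):
        self.screen = SPRINGBOARD

    def launch(self, screen_text):
        self.screen = screen_text

    def go_home(self):
        self.screen = SPRINGBOARD


def run_journeys(sim, plans):
    """Phase 1: run every journey, saving a 'screenshot' at assertion time."""
    evidence = {}
    for name, screen_text, _target in plans:
        sim.launch(screen_text)
        evidence[name] = sim.screen  # what the assertion actually saw
    sim.go_home()
    return evidence


def revalidate(sim, plans, evidence, use_evidence):
    """Phase 2: deferred re-validation of every assertion."""
    results = {}
    for name, _screen_text, target in plans:
        source = evidence[name] if use_evidence else sim.screen
        results[name] = target.lower() in source.lower()
    return results


plans = [
    ("todos", "Today  Buy milk  Add item", "Buy milk"),
    ("cass", "Cass  Welcome back", "Cass"),
]
sim = FakeSimulator()
evidence = run_journeys(sim, plans)

live = revalidate(sim, plans, evidence, use_evidence=False)
saved = revalidate(sim, plans, evidence, use_evidence=True)
print(live)   # todos fails; cass "passes" only because of its icon label
print(saved)  # both pass, and for the right reason
```

Reading the live screen produces one false failure and one accidental pass; reading the saved evidence gets both right.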
What I found
The four pre-OCR steps (build, boot, install, launch) all passed cleanly because their validators checked disk artifacts: test -d .app, xcrun simctl getenv, file existence. Those don't care about transient screen state. Only the OCR step did.
So the design flaw was specifically: an assertion that depended on transient state, evaluated outside the window where that state held.
The fix
Stop OCRing the live screen. Each journey already saves a PNG into its evidence directory at the moment the assertion is made. Run OCR against that file instead.
    def _ocr_count(text, evidence=None):
        if evidence:
            png = _first_existing_png(evidence)
            if png:
                count = _ocr_local_file(png, text)
                return count, "file"
        # fallback for journeys that didn't save a screenshot
        return _ocr_live(text), "live"
_ocr_local_file SCPs the PNG to the Mac, runs a small Swift
helper that uses VNRecognizeTextRequest, and counts
case-insensitive matches in the result. The helper script is
about thirty lines:
    // ~/bin/ocr-file.swift, run as: swift ~/bin/ocr-file.swift <path-to-png>
    import Vision
    import AppKit

    let path = CommandLine.arguments[1]
    let image = NSImage(contentsOfFile: path)!
    let cg = image.cgImage(forProposedRect: nil, context: nil, hints: nil)!

    // The completion handler runs before perform(_:) returns.
    let req = VNRecognizeTextRequest { request, _ in
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        print(lines.joined(separator: "\n"))
    }
    req.recognitionLevel = .accurate
    req.revision = VNRecognizeTextRequestRevision3  // pin the recognizer revision
    try VNImageRequestHandler(cgImage: cg).perform([req])
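On the runner side, the helper's stdout still has to be turned into a count. The post doesn't show that step, so this is a minimal sketch under assumptions: `count_matches` is a hypothetical name, the sample text is made up, and it counts every case-insensitive occurrence of the target across the recognized lines (the real validator might count matching lines instead).

```python
def count_matches(ocr_output: str, target: str) -> int:
    """Count case-insensitive occurrences of target in the helper's stdout."""
    if not target:
        return 0  # an empty target would otherwise "match" everywhere
    needle = target.casefold()
    return sum(line.casefold().count(needle) for line in ocr_output.splitlines())


# Made-up OCR output, one recognized line per row.
sample = "QR Scanner\nScan a QR code\nHistory"
print(count_matches(sample, "qr"))     # 2
print(count_matches(sample, "Todos"))  # 0
```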
I validated the fix against the prior night's saved screenshots. That's the nice thing about saving evidence: you can re-grade old runs without re-running them. The results:
- A todos screen previously failing at "0 live matches" now passes at "1 file match."
- A QR utility now correctly passes with 2 file matches.
- A passkey utility that was passing only because the springboard shows "Password" stays a problem. The validator can't fix that; the OCR target needs to be tightened to a string that only the in-app screen contains.
What I'd do differently
The original validator was a slow-to-fail bug because the live screen usually showed what you wanted while you were developing the plan interactively. The deferred re-validation pattern only broke the assumption at scale. I'd add a unit test that asserts "validator output for a given step is invariant under subsequent state changes" — that would have caught the issue the first time two plans ran back to back.
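One way to phrase that invariance test, as a sketch: the `Screen` class and both validators below are hypothetical simplifications (a string stands in for the screen, a substring check for OCR), and only the shape of the assertion is the point.

```python
import unittest


class Screen:
    """Hypothetical stand-in for the simulator's transient state."""

    def __init__(self, text=""):
        self.text = text


def live_validator(screen, evidence, target):
    # Reads whatever is on screen right now.
    return target.lower() in screen.text.lower()


def evidence_validator(screen, evidence, target):
    # Reads only what was captured at assertion time.
    return target.lower() in evidence.lower()


def verdicts_before_and_after(validator):
    """Evaluate a validator, change the screen, then evaluate it again."""
    screen = Screen("Todos  Buy milk")
    evidence = screen.text  # captured when the assertion is made
    before = validator(screen, evidence, "Buy milk")
    screen.text = "QR Scanner"  # the next plan takes over the screen
    after = validator(screen, evidence, "Buy milk")
    return before, after


class ValidatorInvarianceTest(unittest.TestCase):
    def test_evidence_validator_is_invariant(self):
        before, after = verdicts_before_and_after(evidence_validator)
        self.assertEqual(before, after)

    def test_live_validator_is_not_invariant(self):
        before, after = verdicts_before_and_after(live_validator)
        self.assertNotEqual(before, after)
```

Run it with `python -m unittest`. The second test documents the bug rather than guarding against it; in a real suite you would keep only the first and apply it to every validator.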
Bonus lesson: when an assertion uses an app-name string that's also visible on the home screen, you have a guaranteed false-positive class. Pick targets that can only appear in-app.
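That rule can be checked mechanically before a plan ever runs. A minimal sketch, assuming you keep a list of the labels visible on the simulator's home screen; `ambiguous_targets` and the sample labels and targets are hypothetical.

```python
def ambiguous_targets(targets, home_screen_labels):
    """Return targets that also occur (case-insensitively) in a home-screen label."""
    labels = [label.casefold() for label in home_screen_labels]
    return [t for t in targets if any(t.casefold() in label for label in labels)]


# Hypothetical data: icon labels on the home screen, and candidate OCR targets.
labels = ["Cass", "Check on Mine", "Passwords", "Settings"]
targets = ["Cass", "Password", "Add your first passkey"]
print(ambiguous_targets(targets, labels))  # ['Cass', 'Password']
```

Anything the check returns can pass without the app ever rendering, so it should be swapped for a string that only the in-app screen shows.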