UI Automation beat my screenshot loop by 17x
I had built a Windows desktop agent around the obvious loop: take a screenshot, send the PNG to a vision LLM, ask where to click, click it, repeat. It worked. It was also miserably slow: a typical five-step UI flow took 18-25 seconds even when nothing went wrong. Before throwing more model at it, I benchmarked the alternative: Windows UI Automation (UIA), which exposes the live element tree without needing any pixels.
What was happening
The agent exposed /screenshot, /click, and /type endpoints over a Bearer-token-protected HTTP API on the target machine. Each tick of the loop did roughly:
- Capture full screen, encode PNG, base64.
- POST to a vision model, wait for tool call.
- Map returned coordinates back to a click.
PNG capture alone was ~1.8 seconds round-trip from another machine on the LAN (286 KB images). The model call added another ~1.8 seconds per step. A clean five-step flow was 18-25 seconds in the happy path. Anything occluding the target window broke the loop entirely.
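The low end of that range falls straight out of the two per-step latencies. A minimal sketch of the arithmetic, assuming capture and model call run strictly one after the other; the function name and defaults are only for illustration:

```python
def loop_seconds(steps, capture_ms=1800, model_ms=1800):
    """Estimated wall time for a screenshot loop.

    Assumes each step does one capture and one model call in sequence,
    with no retries and no overlap between the two.
    """
    return steps * (capture_ms + model_ms) / 1000


print(loop_seconds(5))  # 18.0
```

Anything above 18 seconds is retries, slow model responses, or the click itself.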
What I found
The agent already had /ui/tree, /ui/find, /ui/invoke, /ui/set_text endpoints sitting unused. They wrap pywinauto's UIA backend, which lets you address controls by AutomationId and call invoke() directly, no mouse coordinates involved.
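Roughly what an invoke-by-AutomationId call boils down to in pywinauto, as a sketch rather than the agent's actual endpoint code. The helper names are hypothetical, and the Calculator AutomationIds in the usage comment are the ones I would expect on Windows 10/11; verify them with Inspect.exe before relying on them.

```python
def invoke_by_auto_id(window, auto_id):
    """Invoke a control through UIA's Invoke pattern, addressed by AutomationId.

    `window` is expected to be a pywinauto WindowSpecification created with
    backend="uia". No mouse coordinates are involved.
    """
    control = window.child_window(auto_id=auto_id)
    control.wrapper_object().invoke()


def open_calculator_window():
    """Attach to an already running Calculator window (Windows only)."""
    # Imported lazily so the module still loads on non-Windows machines.
    from pywinauto import Desktop
    return Desktop(backend="uia").window(title_re=".*Calculator.*")


# Usage on a Windows machine with Calculator open (AutomationIds are assumed):
# calc = open_calculator_window()
# for auto_id in ["clearButton", "num7Button", "multiplyButton",
#                 "num8Button", "equalButton"]:
#     invoke_by_auto_id(calc, auto_id)
```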
I wrote a bench: Calculator, ten iterations of clear, 7, *, 8, =, read 56. Six steps per iteration. It compares UIA invoke chains against bare screenshot capture (no LLM); I then estimated the realistic vision loop by layering model latency back on.
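The timing side of a bench like this is small. A minimal sketch, not the actual harness: `bench` and `run_once` are hypothetical names, and `run_once` stands for one full six-step iteration of whichever approach is being measured.

```python
import statistics
import time


def bench(run_once, iterations=10):
    """Return the median wall time in milliseconds of run_once over N runs."""
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples_ms)
```

Using the median rather than the mean keeps one slow outlier (a cold start, a network hiccup) from skewing the comparison.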
Median times across ten iterations:
- UIA invoke chain: 725 ms (~120 ms per step)
- Screenshot capture only, no LLM: 12,150 ms
- Realistic vision loop with model calls: ~22,950 ms estimated
UIA was about 17x faster than just capturing the screenshots, and ~32x faster than the full vision loop.
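The ratios are easy to reproduce from the medians. This sketch assumes the vision-loop estimate is the capture median plus one ~1.8 s model call for each of the six steps, which matches the figure above exactly:

```python
STEPS = 6
uia_ms = 725
capture_ms = 12_150
model_ms_per_step = 1_800  # per-step model latency quoted earlier

# Capture-only median plus one model call per step.
vision_ms = capture_ms + STEPS * model_ms_per_step

print(vision_ms)                      # 22950
print(round(capture_ms / uia_ms, 1))  # 16.8
print(round(vision_ms / uia_ms, 1))   # 31.7
```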
The surprising part: during the bench, a Microsoft account OOBE dialog was covering Calculator the entire time. UIA still invoked the buttons by AutomationId. A screenshot loop would have been clicking on the wrong window.
The fix
One pywinauto bug got in the way. On 0.6.9, calling descendants(title_re=..., auto_id=...) blows up because it forwards those kwargs to IUIA.build_condition(), which only accepts (process, class_name, title, control_type, content_only). I replaced the kwarg path with a manual enumerate-and-filter:
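import re

# Enumerate every descendant once, then filter in Python instead of
# passing auto_id/title_re through to IUIA.build_condition().
all_descendants = parent.descendants()
matches = [
    e for e in all_descendants
    if (auto_id is None or e.element_info.automation_id == auto_id)
    and (title_re is None or re.search(title_re, e.window_text() or ""))
]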
Not elegant. Works.
What I'd do differently
I should have benched the underlying APIs before committing to the vision-loop architecture. The "general intelligence" version felt more flexible, but UIA gives you a structured tree, stable IDs, and invocation without focus. For repetitive flows (form fills, settings toggles, password resets) a cached path of AutomationIds replays in under a second. The right shape is probably UIA for everything addressable, with vision fallback only for canvas-style apps that don't expose a real tree.
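A minimal sketch of that shape, assuming a cached path is a list of steps that each carry a name and an optional AutomationId. `replay`, `invoke_uia`, and `vision_fallback` are hypothetical names, and the sketch assumes `invoke_uia` raises LookupError when the element cannot be found (pywinauto's own not-found exception would need to be translated to that).

```python
def replay(path, invoke_uia, vision_fallback):
    """Replay cached steps via UIA, falling back to vision per step.

    Returns the names of the steps that needed the vision fallback.
    """
    fell_back = []
    for step in path:
        auto_id = step.get("auto_id")
        try:
            if auto_id is None:
                # Nothing addressable was cached for this step.
                raise LookupError("no AutomationId cached for this step")
            invoke_uia(auto_id)
        except LookupError:
            vision_fallback(step["name"])
            fell_back.append(step["name"])
    return fell_back
```

Returning the fallback list makes it easy to see which steps still depend on screenshots, which is where the remaining latency lives.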