From 23% to 93%: How Agentic Vision Solves the Instrument Reading Problem
The headline number — instrument reading accuracy jumping from 23% with the prior model to 93% with agentic vision on ER 1.6 — represents a qualitative shift from a capability that was essentially broken to one that is production-ready. ER 1.6's 86% accuracy without the agentic pipeline would be notable on its own, but agentic vision pushes the result into territory where autonomous facility inspection becomes commercially viable.
What makes this technically distinctive is the multi-step reasoning approach. Rather than attempting to read a gauge in a single inference pass, the model takes intermediate steps: first zooming into the image to get a better read of small details, then using pointing and code execution to estimate proportions, and finally applying world knowledge for interpretation. As the DeepMind blog explains, reading instruments requires the model to "precisely perceive a variety of inputs — including the needles, liquid level, container boundaries, tick marks." This decomposition of a perceptual task into an agentic workflow — where the model decides what additional information it needs and acts to gather it — is a fundamentally different architecture from simply scaling up a vision model. For comparison, Gemini 3.0 Flash achieved only 67% on the same task, suggesting that raw model scale alone does not solve this problem without the specialized embodied reasoning and agentic pipeline.
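To make the "code execution to estimate proportions" step concrete, here is a minimal sketch of what such generated code might look like for an analog gauge. It assumes the pointing step has already produced pixel coordinates for the gauge center, the minimum and maximum tick marks, and the needle tip; all function names and the coordinate conventions are illustrative, not the actual ER 1.6 pipeline.

```python
import math

def angle_from_center(center, point):
    """Angle in degrees of a point relative to the gauge center.

    Uses image coordinates (y grows downward), so a typical gauge
    sweeping clockwise from min to max has increasing angles.
    """
    return math.degrees(math.atan2(point[1] - center[1],
                                   point[0] - center[0]))

def read_gauge(center, min_tick, max_tick, needle_tip,
               min_value, max_value):
    """Estimate the gauge reading by interpolating the needle angle
    between the angles of the min and max tick marks."""
    a_min = angle_from_center(center, min_tick)
    a_max = angle_from_center(center, max_tick)
    a_needle = angle_from_center(center, needle_tip)
    # Normalize both sweeps to [0, 360) so the arithmetic is robust
    # to the atan2 branch cut at +/-180 degrees.
    sweep = (a_max - a_min) % 360
    pos = (a_needle - a_min) % 360
    fraction = pos / sweep
    return min_value + fraction * (max_value - min_value)

# A gauge with min tick at lower-left, max tick at lower-right,
# and the needle pointing straight up reads mid-scale.
reading = read_gauge(center=(0, 0), min_tick=(-1, 1), max_tick=(1, 1),
                     needle_tip=(0, -1), min_value=0, max_value=100)
# reading == 50.0
```

The point of the sketch is that once perception has been reduced to a handful of located keypoints, the remaining arithmetic is trivial and exact, which is precisely why delegating it to code execution rather than a single perceptual guess pays off.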


