Under the hood: ViLLA, dual-system, and the death of the modular stack
AGIBOT's GO-2 is the cleanest statement yet of where humanoid software is headed. It is a unified vision-language-latent-action (ViLLA) model: perception, balance, and motion planning collapse into one network, replacing the classic perception → planner → controller pipeline that robotics has relied on for two decades [1]. The company frames it as 'the first system to bridge the last mile between logical reasoning and precise execution within a unified architecture' [1]. Marketing aside, the claim matches what Air Street Capital described as the foundation-model playbook 'infusing new life into robotics' [2].
The architectural trick is an asynchronous dual-system design: a slow, language-conditioned planner that decides what to do at low frequency, and a fast action expert that streams motor commands at high frequency [3]. This is the same pattern Figure AI ships in Helix, where System 1 runs at 200 Hz while System 2 deliberates at 7-9 Hz [4]. The convergence is not a coincidence. Once you accept that a single learned model has to output joint torques, you have to solve the latency problem: a model large enough to reason over language is typically too slow to run inside a high-frequency control loop, and dual-rate inference is the obvious answer. AGIBOT reports a 98.5% success rate for GO-2 on the LIBERO benchmark and 86.6% zero-shot on LIBERO-Plus, along with 82.9% real-world success after transfer from the Genie Sim 3.0 simulator [1].
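To make the dual-rate idea concrete, here is a minimal Python sketch of a fast loop that reuses the most recent output of a slow planner. It is an illustration under stated assumptions, not GO-2's or Helix's implementation. The 200 Hz rate is the Helix System 1 figure quoted above, and 8 Hz is an arbitrary pick inside the quoted 7-9 Hz range. The names `Plan`, `slow_planner`, `fast_action_expert`, and `run` are hypothetical, the "latent" is a single float, and the whole thing runs single-threaded on a simulated tick counter. A real system would run the two models concurrently and would also have to absorb the planner's own compute latency, which this sketch ignores.

```python
from dataclasses import dataclass

FAST_HZ = 200  # action-expert rate (Helix System 1 figure quoted in the text)
SLOW_HZ = 8    # planner rate (assumed; inside the quoted 7-9 Hz range)


@dataclass
class Plan:
    step: int      # fast-loop tick at which this plan was produced
    target: float  # hypothetical 1-D stand-in for a latent goal


def slow_planner(tick: int) -> Plan:
    """Hypothetical planner: emits a fresh target whenever it is invoked."""
    return Plan(step=tick, target=tick / FAST_HZ)


def fast_action_expert(plan: Plan, state: float) -> float:
    """Hypothetical action expert: small proportional step toward the target."""
    return 0.1 * (plan.target - state)


def run(duration_s: float = 1.0) -> dict:
    """Simulate the two loops on one shared tick counter."""
    ticks_per_plan = FAST_HZ // SLOW_HZ  # 25 fast ticks per planner update
    total_ticks = int(duration_s * FAST_HZ)

    plan = slow_planner(0)  # initial plan before the fast loop starts
    plan_updates = 1
    max_staleness = 0
    state = 0.0

    for tick in range(total_ticks):
        # The slow system refreshes the plan only every ticks_per_plan ticks.
        if tick > 0 and tick % ticks_per_plan == 0:
            plan = slow_planner(tick)
            plan_updates += 1

        # The fast system acts on every tick, using whatever plan is latest.
        max_staleness = max(max_staleness, tick - plan.step)
        state += fast_action_expert(plan, state)

    return {
        "fast_steps": total_ticks,
        "plan_updates": plan_updates,
        "max_staleness_ticks": max_staleness,
    }


if __name__ == "__main__":
    print(run(1.0))
```

Over one simulated second, `run(1.0)` takes 200 fast steps against 8 plan updates, so each plan drives 25 consecutive motor commands and is at most 24 ticks old when it is used. The design point is that the fast loop never waits on the planner: it always acts on the latest plan it has, even a slightly stale one.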



