The Vision Action Model
anyconcept is building a foundation model for software interfaces. Not a language model that reasons about screens. Not a computer vision system that detects objects. A model that understands software the way humans do. Through interaction, repetition, and learned concepts.
This page documents what we have built, what we have learned, and what remains open.
Operational AI is not a feature of generative AI.
Operational AI is not a feature of generative AI. It is a different problem entirely.
The brain does not use one system for everything. The cortex reasons, plans, and generates language. The cerebellum and basal ganglia execute: they coordinate precise motor sequences, filter relevant signals from irrelevant noise, and repeat learned patterns without conscious thought.
LLMs are the cortex. They are remarkable at reasoning and expression. But you would not trust your cortex to walk. That is not what it was built for.
The VAM is the cerebellum. It does not reason about what to do. It has learned the execution pattern and runs it. Reliably, repeatedly, without deliberation.
This is not a critique of LLMs. It is a recognition that execution and expression are different problems requiring different architectures. anyconcept is building the execution layer.
This is not a fringe position. Yann LeCun, Chief AI Scientist at Meta, has argued consistently that LLMs are fundamentally limited as a path to general intelligence: "We are going to have AI systems that have human-level intelligence, but they are not going to be built on LLMs." (MIT Technology Review, January 2026.)
anyconcept does not make claims about general intelligence. We make a narrower, more concrete claim: LLMs are the wrong architecture for deterministic, repeatable UI execution. That claim does not require solving AGI to be true.
The hard problem is not seeing the screen.
The hard problem is not seeing the screen. It is knowing what on the screen matters.
Any system that communicates through a display presents the same fundamental challenge. The pixels arrive as raw data. Knowing what they mean, what state they represent, and what within them is consequential at this moment requires something beyond vision.
It requires contextual judgment.
A loading spinner is not just a visual element. It is a signal that the system is not ready. A favicon animating in a browser tab carries the same information. A button grayed out is not just a color change. It is a state. These are not pixel patterns. They are concepts. And concepts require a model that has learned what software means, not just what it looks like.
This is the problem the VAM addresses. Not by reasoning about the screen on every run. Not by comparing pixels against a stored baseline. But by having learned, from real interaction data, which elements carry meaning and which do not.
The gap is not fully closed. Contextual judgment at human level across all virtual interfaces remains an open research problem. What the VAM achieves is a meaningful first approximation: reliable, learned, concept-level understanding sufficient for deterministic execution in real-world conditions.
From demonstration to execution.
The VAM converts a single human demonstration into a deterministic, repeatable execution.
The process has two stages.
In the first stage, a human performs a workflow once. The model observes every frame. Every mouse movement, every click, every keyboard input, every resulting screen state. It does not record a script. It does not log selectors or coordinates. It builds an internal representation of the workflow as a sequence of target states. Each one is a node: a discrete UI interaction step with a defined before and after state. Clicking a button is a node. Filling a form field is a node. Waiting for a page to load is a node.
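The node structure described above can be sketched as a small data model. This is an illustration only, not anyconcept's internal representation: the names `Node` and `Workflow` are hypothetical, and plain strings stand in for the learned screen representations the model would actually hold.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    """One UI interaction step with a defined before and after state."""
    action: str   # e.g. "click", "type", "wait"
    before: str   # placeholder for the state observed before the action
    after: str    # placeholder for the target state the step should reach


@dataclass
class Workflow:
    """A demonstrated workflow stored as an ordered sequence of nodes."""
    nodes: List[Node] = field(default_factory=list)

    def add(self, action: str, before: str, after: str) -> None:
        self.nodes.append(Node(action, before, after))

    def target_states(self) -> List[str]:
        """The sequence of target states, in demonstration order."""
        return [node.after for node in self.nodes]


# A three-step demonstration: click, fill a field, wait for a page load.
demo = Workflow()
demo.add("click", "login page", "login form focused")
demo.add("type", "login form focused", "credentials entered")
demo.add("wait", "credentials entered", "dashboard loaded")
print(demo.target_states())
```

Note that the sketch stores no selectors or coordinates, only an action type and the states on either side of it, which mirrors the distinction the text draws between a node and a recorded script.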
In the second stage, the model executes. At each node it captures the current screen and evaluates it against the expected target state. Optionally, text or image assertions can be included as part of the target state definition. Every evaluation produces a confidence score.
This is what deterministic means in this context. Not that the model always clicks the same pixel. The button may have moved. The browser may be different. Incidental variations do not break execution. Meaningful visual changes are detected, scored, and surfaced as warnings. The model learned the concept of the button, not its coordinates. Deterministic means the model always searches for the demonstrated target state and never improvises an alternative path to reach it.
The VAM is not only an automation pipeline. At its core it answers one question: given a screen state or target state, what is present and how confidently can it be identified? That query can be embedded in any system. An automation workflow, a monitoring tool, a CI/CD integration, or an AI agent that needs to understand what is currently on a display before deciding what to do next.
[Figure: target screen state, with three outcomes: execution continues; anomaly logged, execution continues; execution halts, error reported.]
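One way to read the three outcomes above is as a mapping from the confidence score of each evaluation to an action. The sketch below assumes exactly that. The threshold values, the 0 to 1 score range, and the names `Outcome`, `classify`, and `run` are assumptions made for illustration; the source does not state how scores map to outcomes.

```python
from enum import Enum
from typing import Iterable, List


class Outcome(Enum):
    CONTINUE = "execution continues"
    ANOMALY = "anomaly logged, execution continues"
    HALT = "execution halts, error reported"


# Assumed thresholds, for illustration only.
CONTINUE_THRESHOLD = 0.90
HALT_THRESHOLD = 0.50


def classify(confidence: float) -> Outcome:
    """Map one confidence score (assumed to lie in [0, 1]) to an outcome."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= CONTINUE_THRESHOLD:
        return Outcome.CONTINUE
    if confidence >= HALT_THRESHOLD:
        return Outcome.ANOMALY
    return Outcome.HALT


def run(confidences: Iterable[float]) -> List[Outcome]:
    """Evaluate nodes in order and stop at the first halt."""
    log: List[Outcome] = []
    for score in confidences:
        outcome = classify(score)
        log.append(outcome)
        if outcome is Outcome.HALT:
            break  # never improvise an alternative path
    return log


for outcome in run([0.97, 0.70, 0.20, 0.99]):
    print(outcome.value)
```

In this sketch the fourth node is never evaluated, because the third falls below the halt threshold.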
Built on real interaction data.
The VAM is trained on human-annotated interaction recordings. Real humans performing real workflows on real software, captured as sequences of full screenshots paired with the actions taken at each step.
From each seed recording, up to 2,000 synthetic variations are generated. These cover UI states, themes, browsers, and operating systems that would be impractical to capture manually at scale. The result is a training dataset with broad coverage built from a relatively small number of human-produced seeds.
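As a rough illustration of how a small number of seeds can reach broad coverage, the sketch below enumerates combinations of variation axes for one seed and caps the result at the 2,000 figure given above. The axis names follow the text; the axis values and the function `variations` are hypothetical, and the source does not describe how the variations are actually generated.

```python
from itertools import islice, product
from typing import Dict, List, Sequence

MAX_VARIATIONS_PER_SEED = 2000  # upper bound stated in the text


def variations(
    axes: Dict[str, Sequence[str]],
    limit: int = MAX_VARIATIONS_PER_SEED,
) -> List[Dict[str, str]]:
    """Enumerate combinations of axis values, keeping at most `limit`."""
    names = list(axes)
    combos = product(*(axes[name] for name in names))
    return [dict(zip(names, combo)) for combo in islice(combos, limit)]


# Hypothetical axis values for a single seed recording.
axes = {
    "ui_state": ["default", "hover", "disabled"],
    "theme": ["light", "dark"],
    "browser": ["chrome", "firefox", "safari"],
    "os": ["windows", "macos", "linux"],
}

seed_variations = variations(axes)
print(len(seed_variations))
print(seed_variations[0])
```

Taking the first combinations in enumeration order is the simplest possible cap. A real pipeline would more likely sample across the axes so that no single axis value dominates the capped set.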
The core learning objective is action prediction within a node: given the current screen state, what action leads to the target state of this UI interaction step? The model learns this not by memorising coordinates but by developing a representation of what relevant elements look like across variations in appearance, position, and context.
The VAM generalises across interface types and learns new concepts through exposure. Generalisation levels are tracked per UI concept and tested continuously against interfaces the model has never seen.
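Tracking generalisation per UI concept could look something like the following sketch, which computes a success rate per concept using only results from interfaces that were not part of training. The record layout and the function name `generalisation_by_concept` are assumptions for illustration, not a description of anyconcept's evaluation tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (ui_concept, interface_seen_in_training, identified_correctly)
Result = Tuple[str, bool, bool]


def generalisation_by_concept(results: Iterable[Result]) -> Dict[str, float]:
    """Success rate per concept, counting only unseen interfaces."""
    tallies = defaultdict(lambda: [0, 0])  # concept -> [hits, attempts]
    for concept, seen_in_training, correct in results:
        if seen_in_training:
            continue  # only unseen interfaces measure generalisation
        tallies[concept][0] += int(correct)
        tallies[concept][1] += 1
    return {concept: hits / attempts for concept, (hits, attempts) in tallies.items()}


results = [
    ("button", False, True),
    ("button", False, False),
    ("button", True, True),            # seen in training, ignored
    ("loading_spinner", False, True),
]
print(generalisation_by_concept(results))
```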
Existing open datasets for UI interaction were not sufficient for the matching problem the VAM is trained to solve. The annotation depth required for concept-level understanding does not exist in publicly available form. This is why anyconcept built and annotated its own training data from the ground up. The data labelling infrastructure used to produce this dataset will be open-sourced. Details to follow.
Honest accounting.
Honest accounting of what the VAM does not yet fully do is part of how we work. These are not disclaimers. They are the research agenda.
Generalisation boundaries
The VAM generalises well within its learned concept space. Confidence degrades on UI patterns significantly outside its training distribution. Expanding generalisation range is an active area of research.
The matching problem is not fully solved
Concept-level understanding is achieved for a meaningful range of UI elements and states. Contextual judgment at full human level across all display types remains an open problem. The longer-term research direction is a world model for displays: one that does not just match known states but can evaluate unexpected ones. We call this surprise evaluation: the ability to encounter something never seen before and reason about whether it represents a problem.
Training data dependency
Model performance is bounded by the quality and coverage of annotated training data. This is a known constraint and an active area of work.
Dynamic interfaces
Highly dynamic interfaces present harder matching problems than static or semi-static UIs. Live data streams, real-time collaborative tools, and continuously updating displays increase the complexity of target state matching. This is not a category failure. It is a harder instance of the same problem the VAM is built to solve. It is on the roadmap.
There is no established benchmark for demonstration-based UI execution.
Existing benchmarks measure prompt-to-automation performance. A model is given a natural language instruction and evaluated on whether it completes the task. This is the wrong measurement for the VAM. The VAM is not given instructions. It is given a demonstration. The input is fundamentally different. So is the success criterion.
What a benchmark for demonstration-based execution needs to measure is different.
A world model for displays.
The VAM is the first step toward a world model for displays.
A world model does not just match known states. It understands what a display is communicating, what state a system is in, and what a change in that state means. It can encounter something it has never seen and evaluate whether it represents a problem.
This is a long research horizon. We are not claiming to have solved it. We are claiming to have started in the right place.