// research

The Vision Action Model

anyconcept is building a foundation model for software interfaces. Not a language model that reasons about screens. Not a computer vision system that detects objects. A model that understands software the way humans do: through interaction, repetition, and learned concepts.

This page documents what we have built, what we have learned, and what remains open.

// key facts

< 1B

Parameters

CPU

Runs on CPU. No GPU required.

Vision model

Pixel-based. No language model.

Domain-specific

Trained exclusively on UI interaction data.

// 01 · the thesis

Operational AI is not a feature of generative AI.

Operational AI is not a feature of generative AI. It is a different problem entirely.

The brain does not use one system for everything. The cortex reasons, plans, and generates language. The cerebellum and basal ganglia execute: they coordinate precise motor sequences, filter relevant signals from irrelevant noise, and repeat learned patterns without conscious thought.

LLMs are the cortex. They are remarkable at reasoning and expression. But you would not trust your cortex to walk. That is not what it was built for.

The VAM is the cerebellum. It does not reason about what to do. It has learned the execution pattern and runs it. Reliably, repeatedly, without deliberation.

This is not a critique of LLMs. It is a recognition that execution and expression are different problems requiring different architectures. anyconcept is building the execution layer.

This is not a fringe position. Yann LeCun, Chief AI Scientist at Meta, has argued consistently that LLMs are fundamentally limited as a path to general intelligence: "We are going to have AI systems that have human-level intelligence, but they are not going to be built on LLMs." (MIT Technology Review, January 2026.)

anyconcept does not make claims about general intelligence. We make a narrower, more concrete claim: LLMs are the wrong architecture for deterministic, repeatable UI execution. That claim does not require solving AGI to be true.

// 02 · the hard problem

The hard problem is not seeing the screen.

The hard problem is not seeing the screen. It is knowing what on the screen matters.

Any system that communicates through a display presents the same fundamental challenge. The pixels arrive as raw data. Working out what they mean, what state they represent, and what within them is consequential at this moment requires something beyond vision.

It requires contextual judgment.

A loading spinner is not just a visual element. It is a signal that the system is not ready. A favicon animating in a browser tab carries the same information. A button grayed out is not just a color change. It is a state. These are not pixel patterns. They are concepts. And concepts require a model that has learned what software means, not just what it looks like.

This is the problem the VAM addresses. Not by reasoning about the screen on every run. Not by comparing pixels against a stored baseline. But by having learned, from real interaction data, which elements carry meaning and which do not.

The gap is not fully closed. Contextual judgment at human level across all virtual interfaces remains an open research problem. What the VAM achieves is a meaningful first approximation: reliable, learned, concept-level understanding sufficient for deterministic execution in real-world conditions.

[figure labels: relevant · wait signal · irrelevant]

the model knows what matters. and what to ignore.

// 03 · how the VAM works

From demonstration to execution.

The VAM converts a single human demonstration into a deterministic, repeatable execution.

The process has two stages.

In the first stage, a human performs a workflow once. The model observes every frame: every mouse movement, every click, every keyboard input, and every resulting screen state. It does not record a script. It does not log selectors or coordinates. It builds an internal representation of the workflow as a sequence of target states. Each one is a node: a discrete UI interaction step with a defined before and after state. Clicking a button is a node. Filling a form field is a node. Waiting for a page to load is a node.
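One way to picture this representation is as an ordered list of nodes, each holding a before and an after state. The sketch below is illustrative only: the names `ActionKind`, `TargetState`, `Node`, and `Workflow`, and the `embedding` field, are assumptions made for the example and not the VAM's actual internal format.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionKind(Enum):
    CLICK = "click"
    TYPE = "type"
    WAIT = "wait"


@dataclass(frozen=True)
class TargetState:
    """A learned description of a screen state (hypothetical structure)."""
    label: str
    # Placeholder for a learned representation; deliberately not
    # coordinates or selectors.
    embedding: tuple[float, ...] = ()


@dataclass(frozen=True)
class Node:
    """One discrete UI interaction step with a defined before and after state."""
    action: ActionKind
    before: TargetState
    after: TargetState


@dataclass
class Workflow:
    """An ordered sequence of nodes built from a single demonstration."""
    nodes: list[Node] = field(default_factory=list)

    def add(self, node: Node) -> None:
        self.nodes.append(node)


# A three-node workflow: fill a field, click a button, wait for the page.
login = Workflow()
login.add(Node(ActionKind.TYPE, TargetState("empty email field"), TargetState("email field filled")))
login.add(Node(ActionKind.CLICK, TargetState("login button enabled"), TargetState("page loading")))
login.add(Node(ActionKind.WAIT, TargetState("page loading"), TargetState("dashboard visible")))
```

Note that waiting is modelled as a node in its own right, mirroring the description above, rather than as a fixed delay between two other steps.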

In the second stage, the model executes. At each node it captures the current screen and evaluates it against the expected target state. Optionally, text or image assertions can be included as part of the target state definition. Every evaluation produces a confidence score.

This is what deterministic means in this context. It does not mean that the model always clicks the same pixel: the button may have moved, or the browser may be different. Incidental variations do not break execution, while meaningful visual changes are detected, scored, and surfaced as warnings. The model learned the concept of the button, not its coordinates. Deterministic means the model always searches for the demonstrated target state and never improvises an alternative path to reach it.

The VAM is not only an automation pipeline. At its core it answers one question: given a screen state or target state, what is present and how confidently can it be identified? That query can be embedded in any system: an automation workflow, a monitoring tool, a CI/CD integration, or an AI agent that needs to understand what is currently on a display before deciding what to do next.
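As a sketch of how such a query might be embedded in another system, the example below polls a query function until a target state is confidently present. Everything here is an assumption for illustration: the `ScreenQuery` signature, the `wait_for_state` helper, the default threshold, and the fake query that stands in for a real model call. None of it describes a published anyconcept API.

```python
import time
from typing import Callable

# Hypothetical query signature: (target_label) -> (present, confidence).
ScreenQuery = Callable[[str], tuple[bool, float]]


def wait_for_state(
    query: ScreenQuery,
    target: str,
    threshold: float = 0.9,
    timeout_s: float = 5.0,
    poll_s: float = 0.1,
) -> bool:
    """Poll the query until the target is confidently present or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        present, confidence = query(target)
        if present and confidence >= threshold:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def make_fake_query(ready_after_calls: int) -> ScreenQuery:
    """Stand-in for a real model call: reports the target only after N polls."""
    calls = 0

    def query(target: str) -> tuple[bool, float]:
        nonlocal calls
        calls += 1
        return (True, 0.97) if calls >= ready_after_calls else (False, 0.0)

    return query


# The target "appears" on the third poll, so the wait succeeds.
ready = wait_for_state(make_fake_query(3), "dashboard visible", poll_s=0.0)
```

A monitoring tool or an agent could use the same loop in place of a fixed sleep, which is the practical meaning of treating a loading state as a signal rather than a pixel pattern.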

// node evaluation

outcome

PROCEED

confidence ≥ threshold
execution continues

WARNING

confidence below threshold
anomaly logged · execution continues

STOP

target state not found
execution halts · error reported
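The three outcomes above can be sketched as a small decision function. This is a minimal illustration under stated assumptions: the `Evaluation` structure, the separate `found` flag, and the default threshold of 0.9 are hypothetical, chosen only to make the distinction between "below threshold" and "not found" explicit.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PROCEED = "proceed"
    WARNING = "warning"
    STOP = "stop"


@dataclass(frozen=True)
class Evaluation:
    found: bool        # was the target state located on the current screen at all?
    confidence: float  # match confidence in [0.0, 1.0]; meaningful only when found


def decide(evaluation: Evaluation, threshold: float = 0.9) -> Outcome:
    """Map one node evaluation to one of the three outcomes."""
    if not evaluation.found:
        return Outcome.STOP      # target state not found: halt and report an error
    if evaluation.confidence >= threshold:
        return Outcome.PROCEED   # confident match: execution continues
    return Outcome.WARNING       # found but below threshold: log anomaly, continue


def run(evaluations: list[Evaluation], threshold: float = 0.9) -> list[Outcome]:
    """Walk the node evaluations in order and halt at the first STOP."""
    outcomes = []
    for evaluation in evaluations:
        outcome = decide(evaluation, threshold)
        outcomes.append(outcome)
        if outcome is Outcome.STOP:
            break
    return outcomes
```

In this sketch a WARNING never changes the path taken. It only records that the match was weaker than expected, which is consistent with the rule that execution never improvises an alternative route.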

// 04 · how the VAM is trained

Built on real interaction data.

The VAM is trained on human-annotated interaction recordings. Real humans performing real workflows on real software, captured as sequences of full screenshots paired with the actions taken at each step.

From each seed recording, up to 2,000 synthetic variations are generated. These cover UI states, themes, browsers, and operating systems that would be impractical to capture manually at scale. The result is a training dataset with broad coverage built from a relatively small number of human-produced seeds.
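To make the scale concrete, the sketch below enumerates variation specifications for one seed as combinations of the four axes named above, capped at 2,000. Only the axis names and the cap come from the text. The axis values, the `variations` helper, and the sampling strategy are assumptions for illustration and say nothing about how the actual generation pipeline works.

```python
import itertools
import random

# Hypothetical axis values; only the axis names come from the text above.
AXES = {
    "ui_state": ["default", "hover", "disabled", "loading", "error"],
    "theme": ["light", "dark", "high-contrast"],
    "browser": ["chrome", "firefox", "safari", "edge"],
    "os": ["windows", "macos", "linux"],
}
MAX_VARIATIONS_PER_SEED = 2000  # upper bound stated in the text


def variations(
    seed_id: str,
    axes: dict[str, list[str]],
    limit: int,
    rng_seed: int = 0,
) -> list[dict[str, str]]:
    """Enumerate axis combinations for one seed, sampling down to `limit` if needed."""
    names = list(axes)
    combos = [
        dict(zip(names, values), seed=seed_id)
        for values in itertools.product(*axes.values())
    ]
    if len(combos) > limit:
        # Fixed RNG seed so the same subset is chosen on every run.
        combos = random.Random(rng_seed).sample(combos, limit)
    return combos


specs = variations("seed-001", AXES, MAX_VARIATIONS_PER_SEED)
print(len(specs))  # 5 * 3 * 4 * 3 = 180 combinations, below the cap
```

With these example values a single seed yields 180 specifications. Richer axes would reach the 2,000 cap, at which point, purely as arithmetic, 50 seeds would produce at most 100,000 training samples.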

The core learning objective is action prediction within a node: given the current screen state, what action leads to the target state of this UI interaction step? The model learns this not by memorising coordinates but by developing a representation of what relevant elements look like across variations in appearance, position, and context.

The VAM generalises across interface types and learns new concepts through exposure. Generalisation levels are tracked per UI concept and tested continuously against interfaces the model has never seen.

Existing open datasets for UI interaction were not sufficient for the matching problem the VAM is trained to solve. The annotation depth required for concept-level understanding does not exist in publicly available form. This is why anyconcept built and annotated its own training data from the ground up. The data labelling infrastructure used to produce this dataset will be open-sourced. Details to follow.

// 05 · limitations and open problems

Honest accounting.

Honest accounting of what the VAM does not yet do fully is part of how we work. These are not disclaimers. They are the research agenda.

Generalisation boundaries

The VAM generalises well within its learned concept space. Confidence degrades on UI patterns significantly outside its training distribution. Expanding generalisation range is an active area of research.

The matching problem is not fully solved

Concept-level understanding is achieved for a meaningful range of UI elements and states. Contextual judgment at full human level across all display types remains an open problem. The longer-term research direction is a world model for displays: one that does not just match known states but can evaluate unexpected ones. This is surprise evaluation: the ability to encounter something never seen before and reason about whether it represents a problem.

Training data dependency

Model performance is bounded by the quality and coverage of annotated training data. This is a known constraint and an active area of work.

Dynamic interfaces

Highly dynamic interfaces present harder matching problems than static or semi-static UIs. Live data streams, real-time collaborative tools, and continuously updating displays increase the complexity of target state matching. This is not a category failure. It is a harder instance of the same problem the VAM is built to solve. It is on the roadmap.

// 06 · the benchmark problem

There is no established benchmark for demonstration-based UI execution.

Existing benchmarks measure prompt-to-automation performance. A model is given a natural language instruction and evaluated on whether it completes the task. This is the wrong measurement for the VAM. The VAM is not given instructions. It is given a demonstration. The input is fundamentally different. So is the success criterion.

A benchmark for demonstration-based execution needs to measure four different things.

// benchmark dimensions

concept generalisation

does the model execute correctly on interfaces it has never seen?

confidence calibration

does the model know when it is uncertain?

warning precision

when the model flags an anomaly, is it right?

failure detection

when something breaks, does the model catch it?
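The dimensions above map onto standard evaluation metrics. The sketch below shows one plausible way to compute them from per-node results. It is not an established benchmark: the `NodeResult` fields and function names are hypothetical, warning precision and failure detection are treated as ordinary precision and recall, and calibration is measured with expected calibration error, a common choice that is swapped in here rather than taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResult:
    confidence: float      # model's reported confidence in [0, 1]
    correct: bool          # did the node actually execute correctly?
    flagged: bool          # did the model raise a WARNING or STOP?
    truly_anomalous: bool  # ground truth: was something actually wrong?


def execution_accuracy(results: list[NodeResult]) -> float:
    """Fraction of nodes executed correctly; run on unseen interfaces for generalisation."""
    return sum(r.correct for r in results) / len(results) if results else 0.0


def warning_precision(results: list[NodeResult]) -> float:
    """Of all flagged nodes, the fraction that were truly anomalous."""
    flagged = [r for r in results if r.flagged]
    return sum(r.truly_anomalous for r in flagged) / len(flagged) if flagged else 0.0


def failure_detection_rate(results: list[NodeResult]) -> float:
    """Of all truly anomalous nodes, the fraction the model flagged (recall)."""
    anomalous = [r for r in results if r.truly_anomalous]
    return sum(r.flagged for r in anomalous) / len(anomalous) if anomalous else 0.0


def expected_calibration_error(results: list[NodeResult], bins: int = 10) -> float:
    """Weighted mean gap between average confidence and accuracy per confidence bin."""
    total = len(results)
    if total == 0:
        return 0.0
    buckets: list[list[NodeResult]] = [[] for _ in range(bins)]
    for r in results:
        index = min(int(r.confidence * bins), bins - 1)
        buckets[index].append(r)
    error = 0.0
    for bucket in buckets:
        if bucket:
            avg_conf = sum(r.confidence for r in bucket) / len(bucket)
            accuracy = sum(r.correct for r in bucket) / len(bucket)
            error += (len(bucket) / total) * abs(avg_conf - accuracy)
    return error
```

Under this framing, concept generalisation is the same set of metrics computed only on interfaces held out from training, so a drop between seen and unseen interfaces becomes directly measurable.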

// 07 · what we are building toward

A world model for displays.

The VAM is the first step toward a world model for displays.

A world model does not just match known states. It understands what a display is communicating, what state a system is in, and what a change in that state means. It can encounter something it has never seen and evaluate whether it represents a problem.

This is a long research horizon. We are not claiming to have solved it. We are claiming to have started in the right place.