Computer use AI is the technical capability that makes a modern desktop AI agent possible. The model looks at a screenshot of your screen, decides where to click and what to type, and your agent application carries the action out on your behalf. This guide walks through how the loop actually runs, what the public benchmarks really show, and where the capability still breaks.
What is computer use AI?
Computer use AI is the ability of a frontier model to control a graphical computer environment by seeing pixels and producing mouse and keyboard actions. Anthropic introduced it publicly in October 2024 with Claude 3.5 Sonnet, framing it directly: "developers can direct Claude to use computers the way people do — by looking at a screen, moving a cursor, clicking buttons, and typing text" (Anthropic, 2024).
The capability has two components that are often conflated. The first is the model — a vision-language model trained to look at a screenshot and output a structured action, such as {"action": "left_click", "coordinate": [500, 300]}. The second is the harness — the application code that captures the screenshot, sends it to the model, parses the action, executes it on a real computer, and feeds the result back in. The model decides; the harness acts.
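To make the split concrete, here is a minimal sketch of the harness side: it parses the model's structured action and hands it to an input driver. `Action`, `parse_action`, `execute`, and `RecordingDesktop` are hypothetical names for illustration, and the `text` field on the `type` action is an assumption; a real harness would call an OS-level input library where this one only records calls.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """Typed view of one structured action emitted by the model."""
    name: str
    coordinate: Optional[Tuple[int, int]] = None
    text: Optional[str] = None


def parse_action(raw: str) -> Action:
    """Parse the model's JSON action, rejecting malformed coordinates."""
    data = json.loads(raw)
    coord = data.get("coordinate")
    if coord is not None:
        if len(coord) != 2 or not all(isinstance(v, int) for v in coord):
            raise ValueError(f"bad coordinate: {coord!r}")
        coord = (coord[0], coord[1])
    return Action(name=data["action"], coordinate=coord, text=data.get("text"))


class RecordingDesktop:
    """Stand-in for a real input driver: records calls instead of acting."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def left_click(self, x: int, y: int) -> None:
        self.calls.append(("left_click", x, y))

    def type_text(self, text: str) -> None:
        self.calls.append(("type", text))


def execute(action: Action, desktop: RecordingDesktop) -> None:
    """The harness acts: map a parsed action onto the input driver."""
    if action.name == "left_click":
        if action.coordinate is None:
            raise ValueError("left_click needs a coordinate")
        desktop.left_click(*action.coordinate)
    elif action.name == "type":
        desktop.type_text(action.text or "")
    else:
        raise ValueError(f"unsupported action: {action.name}")


desktop = RecordingDesktop()
execute(parse_action('{"action": "left_click", "coordinate": [500, 300]}'), desktop)
```

The model never touches the machine in this sketch; everything it "does" passes through `execute`, which is where a harness can validate, refuse, or log.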
OpenAI shipped a comparable system in January 2025 under the name Operator, powered by a model called Computer-Using Agent (CUA). MIT Technology Review described the mechanism plainly: Operator "takes screenshots of a computer screen and scans the pixels to figure out what actions it can take" (MIT Technology Review, 2025). Google DeepMind had announced a similar research prototype, Project Mariner, in December 2024. Different vendors, same pattern.
This matters for desktop AI because it removes the old assumption that automation requires APIs. With computer use AI, an agent can drive any application that has a screen — including legacy desktop software, internal tools, and apps that have never had an API at all.
How the agent loop actually runs
Every real implementation uses some flavor of an agent loop. The Anthropic docs describe the four steps directly (Anthropic Docs, 2026):
- Provide the tool and a prompt. The agent app sends the model a message such as "save a picture of a cat to my desktop" alongside the computer use tool definition.
- The model decides to act. It returns a tool_use content block (for example, a screenshot request) and the API response carries stop_reason: "tool_use".
- Your application executes the action. The harness captures the requested screenshot (or runs the click), encodes the result, and returns it to the model as a tool_result.
- Repeat until done. The model either issues another tool call or finishes with a plain-text answer. The repetition is the loop.
Concretely, a single round looks like this:
// 1. Model output
{
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01...",
      "name": "computer",
      "input": { "action": "left_click", "coordinate": [842, 311] }
    }
  ]
}
// 2. Harness executes the click on the real desktop, then sends back
{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01...",
      "content": [{ "type": "image", "source": { /* new screenshot */ } }]
    }
  ]
}
That is the entire mechanism. Sophisticated agents wrap it with retry logic, max-iteration caps, permission prompts, screen-region zoom, and tool composition with shell or text-editor tools. But underneath, the cycle is screenshot → action → screenshot.
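The cycle can be sketched in a few lines of Python. This is a minimal illustration, not a vendor SDK: `run_agent_loop`, `make_scripted_model`, and `fake_execute` are hypothetical names, and the model is replaced by a scripted stand-in so the example runs without network access. The message shapes mirror the JSON above.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]


def run_agent_loop(
    call_model: Callable[[List[Message]], dict],
    execute_tool: Callable[[dict], list],
    prompt: str,
    max_iterations: int = 10,
) -> str:
    """Run the loop until the model stops asking for tools or the cap is hit."""
    messages: List[Message] = [{"role": "user", "content": prompt}]
    for _ in range(max_iterations):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":
            # No tool call: the model finished with a plain-text answer.
            return "".join(
                b["text"] for b in response["content"] if b["type"] == "text"
            )
        # Execute every requested action and feed the results back in.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": execute_tool(block["input"]),
            }
            for block in response["content"]
            if block["type"] == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max_iterations reached without a final answer")


def make_scripted_model(responses: List[dict]) -> Callable[[List[Message]], dict]:
    """Hypothetical stand-in for a model API: replays canned responses."""
    it = iter(responses)
    return lambda messages: next(it)


executed: List[str] = []


def fake_execute(tool_input: dict) -> list:
    """Hypothetical executor: records the action, returns a placeholder result."""
    executed.append(tool_input["action"])
    return [{"type": "text", "text": "screenshot placeholder"}]


scripted = make_scripted_model([
    {"stop_reason": "tool_use", "content": [
        {"type": "tool_use", "id": "toolu_demo_1", "name": "computer",
         "input": {"action": "screenshot"}}]},
    {"stop_reason": "end_turn", "content": [{"type": "text", "text": "Done."}]},
])
answer = run_agent_loop(scripted, fake_execute, "save a picture of a cat to my desktop")
```

The `max_iterations` cap is the simplest of the wrappers mentioned above: it turns a model that never stops calling tools into an explicit error rather than an endless session.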
The harness — not the model — owns the dangerous parts. The harness decides what counts as a sensitive action, when to ask the human to approve, and what to log. A desktop AI agent like Lapu AI lives almost entirely in the harness, even though the model is what gets the marketing.
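As an illustration of what "owning the dangerous parts" can look like, the sketch below classifies each requested action before anything executes. The policy itself is an assumption made up for this example (which actions count as read-only, which key combos are blocked, and the `text` field on the `key` action); it is not Lapu AI's or Anthropic's actual rule set.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"  # run without asking
    ASK = "ask"      # pause for human approval
    DENY = "deny"    # never run


# Hypothetical policy: actions that only observe the screen are auto-approved.
READ_ONLY_ACTIONS = {"screenshot", "zoom", "wait"}
# Hypothetical blocklist of key combos the harness refuses outright.
BLOCKED_KEY_COMBOS = {"ctrl+alt+delete"}


def classify(tool_input: dict) -> Decision:
    """Decide how to treat one requested action."""
    action = tool_input.get("action")
    if action in READ_ONLY_ACTIONS:
        return Decision.ALLOW
    if action == "key" and str(tool_input.get("text", "")).lower() in BLOCKED_KEY_COMBOS:
        return Decision.DENY
    # Clicks, typing, drags, and unknown actions all need a human decision.
    return Decision.ASK


def gate(tool_input: dict, ask_human: Callable[[dict], bool]) -> bool:
    """Return True only if the action may be executed."""
    decision = classify(tool_input)
    if decision is Decision.ALLOW:
        return True
    if decision is Decision.DENY:
        return False
    return bool(ask_human(tool_input))


click = {"action": "left_click", "coordinate": [842, 311]}
approved = gate(click, ask_human=lambda request: True)
rejected = gate(click, ask_human=lambda request: False)
```

Defaulting unknown actions to `ASK` rather than `ALLOW` keeps the gate safe when the model or tool version adds an action the policy has not seen yet.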
The tools Claude and CUA actually use
Anthropic's current computer use tool exposes a small, fixed set of actions. The full list, from the public docs:
| Action | Purpose |
|---|---|
| screenshot | Capture the current display |
| left_click | Click at [x, y] |
| right_click, middle_click, double_click, triple_click | Other mouse buttons / multi-clicks |
| mouse_move | Move the cursor |
| left_click_drag | Click and drag between coordinates |
| left_mouse_down, left_mouse_up | Fine-grained click control |
| type | Type a text string |
| key | Press a key combo (e.g. ctrl+s) |
| hold_key | Hold a key for a duration |
| scroll | Scroll a direction by an amount |
| wait | Pause between actions |
| zoom | View a region at full resolution (computer_20251124) |
The newer tool version (computer_20251124) adds the zoom action — the model can ask the harness to give it a higher-resolution crop of a region it cannot read clearly. This is a direct response to the most common failure mode of the previous tool: small text in toolbars and menus that disappeared after the API downsampled the screenshot.
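Downsampling has a second consequence worth a sketch. Assuming the harness shrinks the screenshot before sending it, the coordinates the model returns refer to the smaller image and have to be mapped back to real screen pixels before a click is executed. `scale_to_screen` is a hypothetical helper and the sizes below are illustrative values, not figures from the docs.

```python
from typing import Tuple

Point = Tuple[int, int]
Size = Tuple[int, int]


def scale_to_screen(coordinate: Point, image_size: Size, screen_size: Size) -> Point:
    """Map a point in the downsampled screenshot back to real screen pixels."""
    x, y = coordinate
    image_w, image_h = image_size
    screen_w, screen_h = screen_size
    if not (0 <= x < image_w and 0 <= y < image_h):
        raise ValueError(f"coordinate {coordinate} is outside the {image_size} image")
    # Scale each axis independently and clamp to the last valid pixel.
    screen_x = min(screen_w - 1, round(x * screen_w / image_w))
    screen_y = min(screen_h - 1, round(y * screen_h / image_h))
    return screen_x, screen_y


# Illustrative sizes: a 2560x1600 display sent to the model as a 1280x800 image.
click = scale_to_screen((500, 300), image_size=(1280, 800), screen_size=(2560, 1600))
```

Rejecting out-of-bounds coordinates at this point also catches one class of coordinate errors before they reach the real desktop.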
In practice, an agent rarely uses computer use in isolation. The Anthropic quickstart pairs it with the bash tool and the str_replace_based_edit_tool (Anthropic Docs, 2026). For desktop AI on macOS or Windows, you typically extend that further with accessibility-API tools, file-system tools, and app-specific automation. The model picks the right tool for each step; computer use is the fallback when nothing more direct exists.
What the OSWorld numbers really mean
The reference benchmark for this capability is OSWorld — 369 real computer tasks across Ubuntu, Windows, and macOS environments, built by researchers from the University of Hong Kong, Salesforce, CMU, and Waterloo. The original paper reported that humans solved 72.36% of tasks while the best model at the time scored 12.24% (Xie et al., 2024).
That gap has closed quickly: by the scores below, the best reported result went from 12.24% to roughly the human baseline in about two years.
| Model | OSWorld score | Released |
|---|---|---|
| Best system at OSWorld launch | 12.24% | early 2024 |
| Claude 3.5 Sonnet (computer use beta) | 14.9% (screenshot-only) | Oct 2024 |
| Claude Sonnet 4 | 42.2% | mid 2025 |
| Claude Sonnet 4.5 | 61.4% | late 2025 |
| Claude Sonnet 4.6 | 72.5% (verified split) | 2026 |
| Human baseline | 72.36% | — |
Two cautions on those numbers. First, OSWorld measures success on a fixed task suite. The real-world distribution of "things a person wants their desktop agent to do" is messier and longer-tailed. Second, scores depend heavily on the agent harness — the same model gets different numbers under different runners. Treat OSWorld as a directional signal, not a guarantee that any given workflow will succeed.
The directional signal is unambiguous, though. In late 2024 the right reaction was "interesting demo, not ready for production." In mid-2026 the right reaction is "production, but only with a careful runtime."
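For readers who want the trend as numbers, the short calculation below takes the scores from the table and computes each system's distance from the reported human baseline in percentage points. The rows come from different splits and harnesses, so the gaps are directional at best, for the reasons given above.

```python
HUMAN_BASELINE = 72.36  # reported in the original OSWorld paper

# Scores copied from the table above (percent of tasks solved).
scores = [
    ("Best system at OSWorld launch", 12.24),
    ("Claude 3.5 Sonnet", 14.9),
    ("Claude Sonnet 4", 42.2),
    ("Claude Sonnet 4.5", 61.4),
    ("Claude Sonnet 4.6", 72.5),
]


def gap_to_human(score: float) -> float:
    """Percentage points still separating a score from the human baseline."""
    return round(HUMAN_BASELINE - score, 2)


gaps = {name: gap_to_human(score) for name, score in scores}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points to human baseline")
```

A negative gap means the reported score is above the baseline figure; it does not mean the system outperforms a person on any given workflow.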
Where computer use AI still fails
Anthropic's own documentation lists the failure modes plainly (Anthropic Docs, 2026):
- Latency. Each round trip is one screenshot plus one model inference, often several seconds. Long task chains feel slow next to a human doing the same thing.
- Coordinate hallucination. The model may output coordinates that miss the target, especially on small UI elements. Hence the zoom action.
- Niche or multi-app interactions. Reliability drops when the model has to track state across several windows or work inside applications it has seen rarely.
- Scrolling. Scroll actions sometimes do not produce the expected result; keyboard shortcuts like Page Down are more reliable.
- Spreadsheet cell selection. Fine-grained cell work requires the left_mouse_down / left_mouse_up pair plus modifier keys; complex spreadsheet ops still take multiple tries.
- Prompt injection from content on screen. Anthropic warns that "Claude will follow commands found in content, sometimes even in conflict with the user's instructions": a malicious page or image can attempt to redirect the agent.
That last item is the one that matters most for desktop deployment. As IEEE Spectrum's Eliza Strickland framed it, "Since Claude can interpret screenshots from computers connected to the Internet, it's possible that it may be exposed to content that includes prompt injection attacks" (IEEE Spectrum, 2025). A desktop AI agent that auto-approves every action is a target. A desktop AI agent that requires a human in the loop for any action with consequences is much harder to weaponize.
How Lapu AI uses computer use on the desktop
Lapu AI is a desktop-native agent that runs on macOS and Windows and uses computer use AI as one tool among several. The product description page lays out the model bluntly: the agent can "control any application on your desktop through native accessibility APIs" and "see what is on screen, click buttons, fill forms, navigate menus."
In practice, that means three things.
Accessibility-first, screenshots as fallback. Native APIs are faster and more reliable than pixel-based clicks. When an app exposes a useful accessibility tree, Lapu uses it; computer use is the fallback for apps that do not, or for cases where the agent needs to read what is visually rendered (a chart, a screenshot inside another app).
Permissioned execution. Every action the agent takes runs through the permission system. Read-only actions can be auto-approved; clicks, types, and destructive operations require explicit consent at a configurable granularity. The model can request anything it wants; the harness decides what to ask the human about.
A full audit trail. Every screenshot, every model decision, every tool call, every permission decision is logged locally with the prompt that triggered it. The audit trail post walks through the schema. If something goes wrong, you can replay the run.
The result is a runtime where the model's capability — see screen, click, type — is bounded by the runtime's discipline. Computer use AI is powerful enough now that the runtime is the differentiator.
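A local audit trail of this kind can be approximated with an append-only JSON Lines log. The sketch below is an assumption about the general shape, not Lapu AI's actual schema: the field names, `log_step`, and `replay` are hypothetical, and it stores a SHA-256 digest of each screenshot rather than the image itself to keep the example short.

```python
import hashlib
import io
import json
from typing import Iterator, TextIO


def log_step(
    log: TextIO,
    run_id: str,
    step: int,
    prompt: str,
    tool_input: dict,
    permission: str,
    screenshot: bytes,
) -> None:
    """Append one record per tool call as a single JSON line."""
    record = {
        "run_id": run_id,
        "step": step,
        "prompt": prompt,
        "tool_input": tool_input,
        "permission": permission,
        "screenshot_sha256": hashlib.sha256(screenshot).hexdigest(),
    }
    log.write(json.dumps(record, sort_keys=True) + "\n")


def replay(log: TextIO) -> Iterator[dict]:
    """Read the run back from the start, in the order it was recorded."""
    log.seek(0)
    for line in log:
        if line.strip():
            yield json.loads(line)


# In-memory buffer for the demo; a real harness would open a local file.
audit_log = io.StringIO()
task = "save a picture of a cat to my desktop"
log_step(audit_log, "run-demo", 0, task, {"action": "screenshot"},
         "auto_approved", b"fake-png-bytes-0")
log_step(audit_log, "run-demo", 1, task,
         {"action": "left_click", "coordinate": [842, 311]},
         "approved_by_user", b"fake-png-bytes-1")
steps = list(replay(audit_log))
```

One record per line means a crashed run still leaves every completed step readable, which is what makes replay after a failure practical.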
FAQ
Is computer use AI the same as a desktop AI agent?
Computer use AI is the model capability; a desktop AI agent is the product that ships it. The model decides "click here, then type that." The agent application captures screenshots, executes the clicks on your machine, manages permissions, and shows you what happened. You need both — a strong model and a careful runtime — for the experience to feel safe.
Does computer use AI run locally?
The model runs in the provider's cloud: Anthropic, OpenAI, or Google. The actions run locally, on whatever computer your agent is controlling. For a desktop agent like Lapu AI, that means your files stay on your machine; what crosses the network is the prompt, the screenshots, and the model's responses. Anything visible in a screenshot is therefore sent to the provider.
How accurate is computer use AI in 2026?
On OSWorld, the most-cited real-task benchmark, Anthropic's Claude Sonnet 4.6 hit 72.5% on the verified split — roughly matching the 72.36% human baseline reported in the original OSWorld paper. OpenAI's CUA model and other systems score lower on the same tasks. Real-world reliability depends heavily on the agent harness, not just the model.
Can computer use AI replace RPA tools like UiPath?
Not yet, and possibly not ever for the same workloads. Traditional RPA is brittle but deterministic — it does exactly the same thing every time. Computer use AI is flexible but probabilistic — it adapts to a UI change, but it may also click the wrong button on a bad day. The two will likely coexist, with RPA on high-volume known flows and computer use on long-tail desktop tasks.
Is it safe to let an AI agent click around on my computer?
Only with a permission model and an audit trail. Anthropic's own documentation warns that "Claude will follow commands found in content, sometimes even in conflict with the user's instructions" — meaning a malicious webpage can attempt prompt injection. The defense is a sandboxed environment, explicit human confirmation for sensitive actions, and a complete log of what happened. Lapu AI's permission model is built for exactly this threat.
What is the difference between Anthropic computer use and OpenAI Operator?
Anthropic exposes computer use as a developer API — your agent app sees the model's tool calls and executes them on whatever environment you provide (a Docker container, a real desktop). OpenAI's original Operator was a hosted browser agent running on OpenAI's infrastructure; it was later folded into ChatGPT Agent. The Anthropic API is what most desktop AI agents — including Lapu AI — build on, because it lets the agent control the real machine.
Try Lapu AI
Computer use AI is only as useful as the runtime that ships it. Lapu AI pairs the underlying model capability with permissioned execution, a local audit trail, and native macOS and Windows accessibility integration. Download Lapu AI for free, or see the pricing options for Pro and team plans.
Sources
- Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku — Anthropic (2024-10-22) · accessed 2026-05-15
- Computer use tool — Claude API Docs — Anthropic (2025-11-24) · accessed 2026-05-15
- OpenAI launches Operator — an agent that can use a computer for you — Will Douglas Heaven, MIT Technology Review (2025-01-23) · accessed 2026-05-15
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Xie et al. (2024-04-11) · accessed 2026-05-15
- Are You Ready to Let an AI Agent Use Your Computer? — Eliza Strickland, IEEE Spectrum (2025-02-13) · accessed 2026-05-15

