The chatbot answered your question. The computer-use agent opens a browser tab, navigates to the form, fills in the fields, clicks submit, and closes the tab. That distinction — answering versus acting — marks the dividing line between the AI most teams deployed in 2024 and the AI the big three labs are racing to ship in 2026. By mid-year, Anthropic, OpenAI, and Google have made fundamentally different architectural bets on what computer-use agents should look like — and the bet each lab made reveals something real about what they think the problem actually is.
The Chatbot Answered; the Agent Acts
For two years, AI in the enterprise looked like a very smart search box. You typed a question, the model returned text, and a human decided what to do with that text. That loop has real value — but it also has a hard ceiling. Every action still required a human to read the output, switch to the right application, and complete the work. The model was advising, not operating. Computer-use agents break that ceiling by collapsing the gap between recommendation and execution.
The definition is precise: a computer-use agent perceives a screen, reasons about what it sees, and emits inputs — mouse clicks, keyboard strokes, form fills, scroll events — to complete a task. It is, in effect, an AI with hands. What makes the 2026 moment significant is not just that the capability exists but that the three labs with the resources to build it at scale made different choices about the fundamental architecture. Those choices are not interchangeable. They carry different security models, different sandboxing requirements, different speeds, and different privacy footprints. Picking the right computer-use agent is now an architectural decision, not a vendor preference.
According to McKinsey's 2025 survey, 88% of knowledge workers now use AI regularly, and 62% are actively experimenting with autonomous agents. The market pressure to move from advising to acting is real and building. But acting on a computer is also where things can go wrong at machine speed — which is why the architectural bet each lab made deserves scrutiny before you wire one of these systems into production.
Portable, screenshot-based tool use
- Screenshot + mouse/keyboard: no OS SDK required
- Works across Windows, macOS, Linux, containers, VMs
- Any application, not just the web
- State-of-the-art single-agent results on WebArena
- Available via Anthropic API, Bedrock, and Vertex AI
- Customer owns and configures the runtime sandbox
Desktop-native background sessions
- Codex agent runs in its own macOS desktop session
- Parallel to the engineer's workstation, not sequential
- Full access: files, terminals, browsers, applications
- Launched April 16, 2026 (Codex Background Computer Use)
- Also shipping through Microsoft Copilot Studio (May 13 GA)
- Every step logged in Microsoft Purview; 5 credits/step
Browser-anchored automation
- Project Mariner shut down May 4, 2026
- Absorbed by Gemini Agent + Chrome auto browse
- Runs in real Chrome: no sandbox or VM overhead
- Fast: real session state, real cookies, real JS engine
- U.S.-first rollout; 750M users already in ecosystem
- Privacy tradeoff: Google sees every site and form
Anthropic: Portable Tool Use, Visual First
Anthropic's architectural choice is maximal portability. Claude Computer Use treats the screen as a universal input surface — it takes a screenshot, interprets what it sees, and emits mouse coordinates and keyboard events in response. There is no OS SDK, no accessibility tree, no DOM parser in the loop. The agent does not need a purpose-built driver for your application; it reads whatever a human would read and acts however a human would act. That design decision makes it equally at home on Windows, macOS, and Linux, and equally capable inside a virtual machine, a remote desktop session, a cloud sandbox, or a local install.
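The screenshot, interpret, act cycle described above can be sketched as a small loop. This is a minimal illustration, not Anthropic's API: `Action`, `run_agent`, `capture`, `decide`, and `execute` are hypothetical names, and the model call is replaced by a scripted stub so the example runs without a real screen.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    """One input event the agent wants to emit (hypothetical shape)."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


def run_agent(
    capture: Callable[[], bytes],
    decide: Callable[[bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Perceive, reason, act until the model says "done" or the step budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                 # perceive: raw pixels only, no DOM
        action = decide(screenshot, history)   # reason: stand-in for the model call
        if action.kind == "done":
            break
        execute(action)                        # act: emit a mouse or keyboard event
        history.append(action)
    return history


# Stubs so the sketch runs without a real display or model.
def fake_capture() -> bytes:
    return b"fake-png-bytes"


SCRIPT = [Action("click", x=120, y=340), Action("type", text="hello"), Action("done")]


def scripted_decide(screenshot: bytes, history: List[Action]) -> Action:
    # Replays a fixed plan; a real agent would send the screenshot to a model.
    return SCRIPT[len(history)]


executed: List[Action] = []
result = run_agent(fake_capture, scripted_decide, executed.append)
```

The `max_steps` budget is a design choice of this sketch: because the loop has no other view of progress than the screen, a hard cap is the simplest guard against an agent that never reports completion.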
The portability carries a concrete performance signal. Claude Computer Use posts state-of-the-art results among single-agent systems on WebArena, the standard benchmark for web-based task completion. That matters because WebArena tests the kinds of multi-step, real-website tasks that production use cases require — not toy prompts on synthetic pages. The benchmark result is not the whole story, but it is evidence that the visual-first approach is competitive with architectures that rely on richer programmatic access to the underlying DOM or OS accessibility layer.
The tradeoff is containment. Because Claude Computer Use can, in principle, click anything visible on any screen, the safety of the deployment depends entirely on what the agent can see and reach. Anthropic ships the model and the tool interface; the customer ships the runtime sandbox. If you want the agent scoped to a headless browser in an isolated container, that is your container to configure. If you want it blocked from accessing the filesystem, you build the block. This is not a failure of the product — it is a deliberate design stance that maximizes flexibility while putting security responsibility where it arguably belongs: with the team that understands the deployment context. Anthropic is clear that Computer Use remains in beta, and the recommended entry point is low-risk, high-repetition tasks where a mistake is recoverable.
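Because the customer owns containment, the scoping rules have to be written down somewhere. The sketch below shows one way to express them as an allowlist check; `SandboxPolicy` and its fields are hypothetical, and the hosts and paths are placeholders. An in-process check like this only illustrates the policy. Real enforcement would belong at the container or network layer.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical description of what the agent is allowed to reach."""
    allowed_hosts: FrozenSet[str] = frozenset()
    allowed_dirs: Tuple[PurePosixPath, ...] = ()

    def allows_url(self, url: str) -> bool:
        # Anything without a hostname (e.g. file:// URLs) is denied by default.
        host = urlparse(url).hostname or ""
        return host in self.allowed_hosts

    def allows_path(self, path: str) -> bool:
        target = PurePosixPath(path)
        if ".." in target.parts:
            return False  # refuse traversal instead of trying to resolve it
        return any(target == root or root in target.parents
                   for root in self.allowed_dirs)


policy = SandboxPolicy(
    allowed_hosts=frozenset({"intranet.example.com"}),
    allowed_dirs=(PurePosixPath("/work/agent"),),
)

url_ok = policy.allows_url("https://intranet.example.com/forms/42")
url_blocked = policy.allows_url("https://evil.example.net/")
path_ok = policy.allows_path("/work/agent/out.csv")
path_blocked = policy.allows_path("/work/agent/../../etc/passwd")
```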
On distribution, Anthropic is playing the breadth card. Claude Computer Use is available through the Anthropic API directly, through Amazon Bedrock, and through Google Cloud Vertex AI. An Excel and PowerPoint add-in integration shipped March 11, 2026, extending the reach into Microsoft 365 with context shared across all three cloud runtimes. The positioning is deliberate: Anthropic is not trying to own the surface; it is trying to be the intelligence that runs across every surface, regardless of which cloud or desktop environment the customer prefers.
Google: Browser-Anchored, Built on Mariner's Bones
Google's path to computer use started with Project Mariner, an experimental Chrome extension that let a Gemini agent navigate the web on the user's behalf. The project was formative — it established that a browser-native approach, running in real Chrome rather than a sandboxed Chromium clone, could deliver materially faster automation because there is no emulation overhead. A real browser has the real cookie jar, the real session state, the real JavaScript engine. The agent running in real Chrome is working with the same page the user actually sees, not a synthetic approximation that may diverge from the live version at any moment.
On May 4, 2026, Project Mariner was formally shut down. Its functionality was folded into two products: Gemini Agent, the API surface for developers building on top of the capability, and Chrome auto browse, a consumer-facing feature rolling out U.S.-first. The consolidation made structural sense — Mariner was always a browser play, and Google owns the browser with the largest installed base on earth. Gemini's distribution advantage is real: 750 million users already inside the Google ecosystem means that browser automation built on Gemini arrives at a different scale than any competitor shipping a standalone agent that must acquire its own audience.
"Running in real Chrome is fast. It is also kind of terrifying — Google sees every site visited and every form filled, in real time, as the agent does its work."
The privacy implication is structural, not incidental. When a computer-use agent runs inside a sandboxed VM or an isolated container, the host providing the sandbox can, in principle, see what the agent does. When that host is Google, and the surface is the user's actual Chrome profile with their actual session cookies, the visibility question becomes pointed. Google's terms and privacy policies govern what happens to that data, but the architecture means every action flows through Google's infrastructure regardless. For enterprises handling sensitive transactions — regulated data, competitive intelligence, proprietary workflows — that is a conversation that has to happen before deployment, not after a breach report.
The agentic browser story is bigger than any single lab. The broader competition between browser-anchored AI systems — the dynamics around agentic shopping, web attribution, and the legal questions that follow when an agent buys on a user's behalf — is something we covered in depth in the AI browser wars of 2026, where Amazon's lawsuit against Perplexity raised the first legal questions about what an agent is permitted to do on the web. Google's Chrome auto browse lands directly in that contested space.
OpenAI: Desktop-Native, Running in Parallel
OpenAI's bet is the most developer-centric of the three. On April 16, 2026, OpenAI launched Codex Background Computer Use — a configuration in which a Codex agent is provisioned its own macOS desktop session, running in parallel with the engineer's own workstation rather than sequentially inside it. The agent has access to the full desktop environment: files, terminals, browsers, installed applications. It can open a pull request in GitHub, run a test suite in a terminal, cross-reference documentation in a browser, and file a ticket in a project tracker — all in one task execution, without occupying the developer's screen or interrupting their focus.
The parallel session model is significant because it does not ask the developer to surrender their own desktop during the task. Earlier computer-use demonstrations often required the agent to take over the user's machine, which meant the user had to watch and wait while the agent worked. The Codex approach treats the agent's desktop as a separate compute resource — more like provisioning a cloud instance than borrowing the engineer's keyboard. That framing changes the deployment model: it is now reasonable to run multiple Codex agents handling background tasks simultaneously, each in its own session, while the team continues their own work undisturbed.
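The "separate compute resource" framing maps naturally onto a worker pool. The sketch below is an assumption-laden illustration, not the Codex interface: `run_in_session` is a hypothetical stand-in for dispatching a task to an isolated agent desktop, and the task names are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_in_session(session_id: int, task: str) -> Dict[str, object]:
    """Stand-in for sending one task to one isolated agent session."""
    return {"session": session_id, "task": task, "status": "completed"}


def run_background_tasks(tasks: List[str], max_sessions: int = 3) -> List[Dict[str, object]]:
    """Run each task in its own session, at most `max_sessions` at a time."""
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        futures = [pool.submit(run_in_session, i, t) for i, t in enumerate(tasks)]
        # Collect in submission order so results line up with the input list.
        return [f.result() for f in futures]


results = run_background_tasks(
    ["run test suite", "open pull request", "file tracker ticket"]
)
```

The caller's own thread stays free while the pool works, which is the point of the parallel model: the engineer's desktop is never the bottleneck.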
OpenAI also benefits from the platform distribution that came with the Microsoft partnership. Copilot Studio's general availability on May 13, 2026, ships OpenAI Computer Use Agents alongside Claude Sonnet 4.5 — an unusual pairing that puts two competing models inside the same enterprise product at a flat cost of five credits per step. Every action Copilot Studio's computer-use agents take is logged in Microsoft Purview, which matters more to enterprise security teams than any benchmark score. For organizations already inside the Microsoft 365 compliance posture, that combination of governance and familiarity is a meaningful accelerant to production adoption.
The Market Behind the Bets
The three labs are competing for the same market, and the market is large. The AI browser automation segment, the most measurable slice of the broader computer-use opportunity, was valued at $4.5 billion in 2024 and is projected to reach $76.8 billion by 2034, a compound annual growth rate of approximately 32.8%, according to industry analysis. That figure captures only browser-based automation; the desktop-native and cross-application segments add further scope that is harder to isolate but pushes the total higher.
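The growth rate can be checked from the two endpoints quoted above. This is a quick arithmetic sketch; the function names `cagr` and `project` are illustrative only.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` for `years` periods."""
    return start_value * (1 + rate) ** years


# Figures from the text: $4.5B in 2024, $76.8B in 2034.
rate = cagr(4.5, 76.8, 2034 - 2024)
check = project(4.5, rate, 10)

print(f"CAGR: {rate:.1%}")
```

The two endpoints imply a rate that rounds to the 32.8% cited, so the quoted figures are internally consistent.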
The demand signal is already visible in enterprise behavior: the McKinsey figures cited earlier (88% of knowledge workers using AI regularly, 62% actively experimenting with agentic systems) describe broad experimentation. Those experiments, however, are not yet in production at scale, which is precisely where the architectural differences between the three bets start to matter. A proof of concept that runs in a permissive sandbox looks very different from a production deployment that handles finance workflows, customer data, or regulated processes. Teams moving from POC to production are discovering that the model choice is often less important than the runtime architecture: the sandbox, the logging, the access controls, and the human gates.
The convergence of large market, high enterprise demand, and three well-funded labs betting on different architectures means the space will move fast and not all bets will converge. The right architecture for browser automation is not necessarily the right architecture for desktop automation. The right security model for a consumer product is not the right model for a regulated enterprise. Teams picking a computer-use agent in mid-2026 are making a bet on which architectural assumptions will prove correct in their specific context, not just picking a brand they recognize.
How to Pick a Computer-Use Agent for Your Stack
The three bets are not interchangeable, and the selection process should start from the task and the security model, not the brand. Each architectural choice optimizes for a different set of constraints, and matching the architecture to the constraint is the decision that matters.
The task-to-surface match also has implications for how you think about the software the agent will navigate. Computer-use agents that work against web pages are faster and more reliable when those pages are built with clean semantic structure and accessible HTML rather than visually dense interfaces that depend on pixel recognition. We wrote about this in depth in our piece on software built for agents: if you are both building applications and deploying computer-use agents against them, those two concerns can be optimized together in ways that produce compounding returns.
Security Is the Subtext of Every Architectural Choice
The three architectural bets look different on paper and they look even more different when you ask the question that matters in production: what can the agent do, to which systems, with whose credentials, and who can see what it did? The answers are structural, not configurable. An agent running in real Chrome with a real user's session cookies can, by design, interact with every site that user is authenticated to. An agent running in an isolated container with a headless browser can only reach what the container is allowed to reach. Those are not the same security boundary, and no amount of prompt engineering closes the gap between them.
For teams moving from experiments to production, the pattern that holds is: start with the narrowest possible access, expand only when the use case requires it, and log everything. Copilot Studio's Purview logging is a useful model even if you are not using Microsoft's platform — the principle is that every agent action should be observable, attributable, and reversible where possible. Computer-use agents that act faster than a human can review require checkpoints: confirmation steps before irreversible actions, rollback paths for mistakes, and circuit breakers that halt a task when an unexpected screen state appears. The maker-checker pattern — an independent agent that reviews the primary agent's planned actions before execution — is particularly valuable when the downstream consequence is a real form submission, a real file deletion, or a real financial transaction.
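The checkpoints described above (a circuit breaker on unexpected screen state, a checker in front of irreversible actions, and a log of every step) can be combined in one small wrapper. This is a minimal sketch under stated assumptions: `GatedExecutor`, `PlannedAction`, `TaskHalted`, and the `IRREVERSIBLE` labels are all hypothetical, and the checker here is a plain function standing in for an independent reviewing agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed labels for actions that cannot be undone.
IRREVERSIBLE = {"submit_form", "delete_file", "send_payment"}


@dataclass(frozen=True)
class PlannedAction:
    name: str
    target: str
    expected_screen: str  # the screen the planner believed it was acting on


class TaskHalted(Exception):
    """Raised when the circuit breaker trips or the checker rejects an action."""


@dataclass
class GatedExecutor:
    checker: Callable[[PlannedAction], bool]   # independent reviewer (maker-checker)
    perform: Callable[[PlannedAction], None]   # actually emits the input
    audit_log: List[str] = field(default_factory=list)

    def run(self, action: PlannedAction, current_screen: str) -> None:
        # Circuit breaker: stop if the screen is not what the plan assumed.
        if current_screen != action.expected_screen:
            self.audit_log.append(f"HALT {action.name}: unexpected screen {current_screen!r}")
            raise TaskHalted("unexpected screen state")
        # Confirmation gate: irreversible actions need the checker's approval.
        if action.name in IRREVERSIBLE and not self.checker(action):
            self.audit_log.append(f"REJECTED {action.name} on {action.target}")
            raise TaskHalted("checker rejected irreversible action")
        self.perform(action)
        self.audit_log.append(f"OK {action.name} on {action.target}")


performed: List[PlannedAction] = []
executor = GatedExecutor(
    checker=lambda a: a.target != "prod-payments",
    perform=performed.append,
)
executor.run(PlannedAction("type_text", "search-box", "home"), current_screen="home")
try:
    executor.run(PlannedAction("send_payment", "prod-payments", "checkout"),
                 current_screen="checkout")
except TaskHalted as exc:
    print(f"halted: {exc}")
```

Every branch writes to `audit_log` before returning or raising, so rejected and halted actions are as observable as successful ones.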
The orchestration question becomes important at scale. Teams deploying computer-use agents in production are not running one agent on one task; they are running queues of tasks across multiple agent sessions, with dependencies between them and verification loops at key transitions. That is the territory covered in our analysis of the 10-agent engineer — where Gartner logged a 1,445% surge in multi-agent inquiries and where the real leverage comes not from writing more code but from orchestrating parallel agents that decompose work, verify outputs, and escalate when something falls outside expected bounds. Computer-use agents are a powerful primitive in that orchestration stack, but one that requires more careful wrapping than a pure API call, because the blast radius of a mistake is the full desktop environment rather than a single response payload.
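Queues with dependencies and verification loops can be sketched with the standard library's `graphlib` (Python 3.9+). This is an illustration of the idea, not a description of any vendor's orchestrator: `run_pipeline`, the task names, and the stub `fake_run` and `fake_verify` functions are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple


def run_pipeline(
    deps: Dict[str, Set[str]],
    run_task: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> Tuple[List[str], List[str]]:
    """Run tasks in dependency order; escalate failures and anything downstream."""
    completed: List[str] = []
    escalated: List[str] = []
    for task in TopologicalSorter(deps).static_order():
        # Never run a task whose prerequisite was escalated to a human.
        if any(dep in escalated for dep in deps.get(task, set())):
            escalated.append(task)
            continue
        output = run_task(task)
        (completed if verify(task, output) else escalated).append(task)
    return completed, escalated


# Each key depends on the tasks in its set.
DEPS: Dict[str, Set[str]] = {
    "login": set(),
    "fill_form": {"login"},
    "submit": {"fill_form"},
    "export_report": {"login"},
}


def fake_run(task: str) -> str:
    return f"{task}: done"


def fake_verify(task: str, output: str) -> bool:
    # Pretend the form-filling step produced an unexpected result.
    return task != "fill_form"


completed, escalated = run_pipeline(DEPS, fake_run, fake_verify)
```

In this run the failed `fill_form` step blocks `submit` but not the independent `export_report` branch, which is the containment behavior the verification loop is meant to provide.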
"Claude competes on depth and portability, Gemini on ecosystem reach, and OpenAI on the developer's own desktop. The surface you pick determines your security model — and that decision should come before any model benchmark."
Three Bets, One Trajectory
The divergence between Anthropic, OpenAI, and Google on computer-use architecture is not a disagreement about whether the capability matters — all three are investing heavily. It is a disagreement about what the binding constraint is. Anthropic is betting that portability and OS-agnosticism are the hard problems; the sandbox is the customer's to solve. Google is betting that real browser session state and ecosystem reach matter more than isolation; the privacy conversation is one the user accepts in exchange for speed and distribution. OpenAI is betting that the developer workflow, not the consumer browser, is the highest-value surface in 2026; the parallel session is the unlock.
By mid-2026, all three bets are live in some production form. Copilot Studio's GA is the clearest signal that computer-use agents have crossed from research preview to enterprise product — a platform with a significant enterprise install base now ships the capability with governance logs attached. The question is no longer whether computer-use agents will reach production at scale. It is which architecture survives contact with real enterprise security requirements, real compliance frameworks, and real tasks where mistakes have material consequences rather than just demo embarrassment.
The answer will depend on the task. Automating workflows that span multiple desktop applications will favor the portable, visual-first approach that Claude Computer Use represents. Automating web-native workflows inside Google's ecosystem will favor the browser-anchored bet, for organizations willing to accept the privacy tradeoff. Automating developer workflows in parallel, with existing Microsoft compliance infrastructure, will favor the native-session model. Most production deployments will end up using more than one approach, because most real workflows span more than one surface. Picking a computer-use agent is, finally, less like picking a model and more like picking an execution environment — and the same architectural rigor that applies to infrastructure decisions applies with full force here.