Breakthrough · MAY 28, 2026 · AGENTIC AI · RETROSPECTIVE

Claude Computer Use, 18 Months In — What Actually Stuck

Anthropic shipped Computer Use in October 2024. Eighteen months later, here's the honest inventory — which production workflows survived, which collapsed, what it actually costs.

By Kadin Nestler · May 28, 2026 · 12 min read


Production workflows that survived 18 months — and the ones that didn't

1. Form-filling at scale (Survived): Repetitive structured input, low cost of error, human review at the end
2. Data extraction from PDFs and portals (Survived): Augmented with OCR and a deterministic verifier; agent does the scrape, code checks the math
3. Multi-step booking and procurement flows (Mixed): Fails on CAPTCHA, modals, and dynamic state; works inside enterprise SaaS the vendor allowlisted
4. Customer support replacement (Failed): Throughput drops at edge cases, escalation logic decays, CSAT regresses inside one quarter
5. Generative end-to-end 'do my work' demos (Failed): The viral demo videos never became a deployed product; too slow, too expensive, too brittle

Anthropic shipped Computer Use as a public beta on October 22, 2024. The pitch was simple — give Claude a screenshot, a mouse, a keyboard, and a goal, and let it operate a desktop the way a human would. The launch demo showed Claude filling out a vendor onboarding form by clicking around a browser. The reaction was immediate. The waitlist filled in hours. Every enterprise AI Slack channel asked the same question that week: when does this replace half my back-office team.

Eighteen months on, the honest answer is that it replaced almost none of it. A specific, narrow subset of workflows stuck in production. The rest fell apart on contact with reality, got quietly deprecated, or pivoted into a different product category. This is the inventory — what actually survived from the October 2024 launch through the OpenAI Operator wave of January 2025, through the Project Mariner shutdown of May 2026, into the present.

The point is not to say agents failed. The point is to be specific about which ones, where, and why — because the hype cycle keeps trying to swallow the specifics whole, and the specifics are where any deployment decision actually lives.

The maturation arc — three phases in eighteen months

The Computer Use cycle ran on a predictable arc. October 2024 through February 2025 was the curiosity phase: every enterprise AI buyer ran a pilot, the pilots produced impressive demo videos, and nothing went into production. February through November 2025 was the disappointment phase: the pilots that did push to production hit the brick wall of reliability, the OSWorld scores got published and re-litigated, and OpenAI, which had shipped Operator in January, shut it down by August and folded the capability into the ChatGPT agent layer. November 2025 through today is the maturation phase: a much smaller set of deployments, much narrower scope, much more deterministic guardrails around the agent, and a much clearer picture of where the technology actually earns its keep.

The shape of the arc matters because it explains why the news headlines stopped. The agents that survived are boring. They are wrapped in deterministic verifiers, they run inside enterprise SaaS allowlists, they get retried by a job queue when they fail, and they cost about what a human would cost per task — sometimes more. They do not look like the launch demo. That is exactly why they shipped.

Three data points anchor this. Claude Opus 4.5's OSWorld score was 66.26% at launch in November 2025. Claude Opus 4.6 pushed that to 72.7% in February 2026. Both numbers are below the threshold most enterprise process owners will accept for unattended automation, which is somewhere in the 95-99% range. Reviewers in 2026 keep landing on the same figure when they test Computer Use hands-on, end to end: roughly 50% reliable on multi-step tasks, 90% or better on simple single-step tasks. That gap explains every deployment pattern below.

What survived — workflow #1, form-filling at scale

The single most durable production use case for desktop-control agents is form-filling. Not the impressive end-to-end "book my whole vacation" demo. The boring version — a vendor portal that needs the same 47 fields filled in from a source CSV, six hundred times a day, with a human checking the diff at the end.

This works because every property the workflow needs is in the agent's favor. The structure is repetitive, which means the agent can be prompted with a worked example and converge on a reliable script-shaped behavior. The cost of an error is low, because the human verification step catches anything weird before it submits. The work is high-volume and dollar-per-task is low, which is exactly the regime where agents beat humans on cost. And the source data is structured, which means most of the agent's time is spent looking at the destination UI, not reasoning about ambiguous inputs.

The deployments here are unglamorous. State Medicaid enrollment portals, IRS Form 1023 walkthroughs, insurance carrier appetite portals, supplier onboarding flows inside SAP and Oracle Fusion. None of them have a press release attached. All of them are running in production today. The pattern is consistent — the agent does the click work, a deterministic script extracts the form values from the source system, a human reviews the screenshot diff, the submit button is the only action that requires a human click.

That last detail is the whole game. Almost every surviving production deployment of Computer Use puts a human in the loop on the irreversible action. The agent assembles, the human submits. That makes the 50% reliability number tolerable, because the failure mode is "agent stalls and asks for help," not "agent submits the wrong $80,000 invoice to the wrong vendor."
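The "agent assembles, human submits" rule can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: `Action`, `run_plan`, and the `approve` callback are hypothetical names, and the reviewer here is a stand-in lambda. The only property it tries to show is that an irreversible action never executes without an explicit approval.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    name: str
    irreversible: bool = False  # e.g. clicking "Submit" on a vendor portal


def run_plan(
    actions: List[Action],
    execute: Callable[[Action], None],
    approve: Callable[[Action], bool],
) -> List[str]:
    """Run actions in order; stop at the first irreversible one that is not approved."""
    log: List[str] = []
    for action in actions:
        if action.irreversible and not approve(action):
            log.append(f"held:{action.name}")
            break  # nothing after an unapproved irreversible step runs
        execute(action)
        log.append(f"done:{action.name}")
    return log


# Hypothetical plan: two reversible field fills, then the irreversible submit.
plan = [
    Action("fill_vendor_name"),
    Action("fill_tax_id"),
    Action("click_submit", irreversible=True),
]
executed: List[str] = []
# Stand-in reviewer that rejects the submission.
result = run_plan(
    plan,
    execute=lambda a: executed.append(a.name),
    approve=lambda a: False,
)
print(result)
```

With a rejecting reviewer the two field fills run and the submit is held, which is the "agent stalls and asks for help" failure mode rather than the wrong-invoice one.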

What survived — workflow #2, data extraction from PDFs and portals

The second durable use case is structured data extraction from semi-structured sources — PDFs, scanned documents, vendor portals that don't expose an API, legacy ERPs with HTML tables in the year 2026. Insurance loss runs. Medical chart abstraction. Carrier statement reconciliation. Lease document review. None of these are new use cases — RPA vendors have been selling this for fifteen years. What changed is that Computer Use plus a deterministic verifier handles the long tail of formats that hard-coded OCR scripts choked on.

The architecture that actually works is the same in every deployment we have seen ship. Computer Use does the navigation and the extraction. A code-side verifier checks the math — sum the line items, compare to the totals, flag the discrepancies. The agent retries on the discrepancies until the verifier passes or kicks it to a human. Anthropic's own enterprise guidance has converged on this pattern in the post-launch documentation — agent for the open-ended scrape, code for the deterministic check.
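A rough sketch of that agent-plus-verifier loop, in Python. Everything here is illustrative: `Extraction`, `verify`, and `extract_with_retries` are hypothetical names, the `extract` callable stands in for whatever actually drives the agent, and the one-cent tolerance and three-attempt cap are assumptions rather than figures from any deployment.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, List, Optional


@dataclass
class Extraction:
    line_items: List[Decimal]
    stated_total: Decimal


def verify(extraction: Extraction, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Deterministic check: the line items must sum to the stated total."""
    computed = sum(extraction.line_items, Decimal("0"))
    return abs(computed - extraction.stated_total) <= tolerance


def extract_with_retries(
    extract: Callable[[int], Extraction], max_attempts: int = 3
) -> Optional[Extraction]:
    """Call the agent up to max_attempts times; None means escalate to a human."""
    for attempt in range(1, max_attempts + 1):
        candidate = extract(attempt)
        if verify(candidate):
            return candidate
    return None


def fake_agent(attempt: int) -> Extraction:
    """Stand-in for the agent: misreads one line item on the first attempt."""
    items = [Decimal("120.00"), Decimal("80.50")]
    if attempt == 1:
        items = [Decimal("120.00"), Decimal("85.50")]  # simulated misread
    return Extraction(items, Decimal("200.50"))


result = extract_with_retries(fake_agent)
always_wrong = extract_with_retries(
    lambda attempt: Extraction([Decimal("1.00")], Decimal("2.00"))
)
print(result, always_wrong)
```

The verifier never asks the model whether it was right; it recomputes the total in code. `Decimal` is used instead of floats so currency sums compare exactly.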

The TELUS deployment of Claude through the internal Fuel iX platform — 57,000 employees in production — is the canonical reference for this shape of work at scale, although the public materials are careful not to specify which workflows are agent-driven versus chat-driven. The Honeycomb internal rollout in August 2025 followed a similar pattern, with employees using Claude for customer-discovery extraction tasks across CRM data the agent could reach.

Per-task economics here look fine for a specific reason. A complex multi-agent extraction task burns 200,000 to over 1,000,000 tokens, which sounds horrifying until you price it against the human alternative. A medical-chart abstractor at $40/hour billing 15 minutes per chart is $10 per chart. An agent-driven abstraction at Claude Opus 4.6 pricing lands in the $2-5 per chart range when the verifier loop keeps the token budget bounded. The deployments that stuck are the ones where the math penciled. The deployments that did not stick are the ones where it did not.

What is mixed — multi-step booking and procurement

Multi-step web automation — book this flight, order these supplies, submit this expense report through Concur — is the use case that the launch demos all centered on, and it is the use case where reality split into two categories with a hard line between them.

Inside the enterprise allowlist, it works. Workday, ServiceNow, Concur, Coupa, Ariba — vendors that have actively partnered with Anthropic, OpenAI, and Google to allowlist agentic traffic, expose stable selectors, and provide structured failure modes — these are where Computer Use deployments survived. The agent gets a clean DOM, predictable redirects, an explicit "I am an agent" handshake, and a sandbox that does not break every two weeks.

Outside the allowlist, it fails. The open web is hostile to agents on purpose. CAPTCHA gates trip on every third session. Cloudflare Bot Management classifies the agent's traffic as bot traffic, because it is. Modal popups for "are you 18" and "accept cookies" and "subscribe to our newsletter" break the agent's plan because they were not in the screenshot when the plan was assembled. Dynamic single-page apps re-render the DOM mid-action. Anthropic's own system card for Claude Opus 4.5 acknowledges the agent cannot reach password-protected, sign-in, or CAPTCHA-gated pages — which is, in practice, most of the consumer web.

The two patterns explain why OpenAI Operator launched in January 2025 as a Pro-tier consumer product, ran into the same wall, and got folded into the ChatGPT agent layer by August. The consumer use case — "book my groceries, send the meme, fill out the form" — was where the open-web hostility hit hardest. The enterprise use case — "scrape this Workday tenant, file this Concur expense" — was where the allowlist worked. Operator was on the wrong side of that line.

Project Mariner followed the same arc. Google introduced Mariner in December 2024 as a browser agent, expanded it to ten parallel tasks by May 2025, integrated the booking flows into Google Search's AI Mode, and shut down Mariner as a standalone product on May 4, 2026. The technology did not die — it got folded into Gemini Agent and into AI Mode in Search, on Google's own infrastructure where the allowlist is implicit because Google controls both the agent and the destination. The standalone open-web product did not survive. The infrastructure-side capability did.

WHAT EVERY AGENT DEMO HIDES

The viral demo videos all share three structural lies. First, the demo task was chosen because it works — the engineer ran fifty variants and shipped the one that completed. Second, the demo environment is sanitized — no popups, no CAPTCHAs, no two-factor prompts, no rate-limit walls. Third, the demo timeline is compressed — the actual run took minutes, the cut is twenty seconds. None of this is fraud. It is marketing. But every enterprise buyer who watched the demo and assumed the production deployment would look the same has spent the last eighteen months learning the difference at their own expense. Build the pilot against the worst path, not the best one. The 50% reliability number is the path you have to plan for.

What failed — customer support replacement

The biggest disappointment cycle of the 2025 agent wave was customer support replacement. Every enterprise SaaS company with a support function pitched the same plan in Q1 2025 — Computer Use plus a knowledge base plus the existing Zendesk or Intercom UI equals a fully agentic Tier 1 deployment that handles 60% of inbound tickets and frees up the humans for the hard ones. The pilots ran in Q2. The CSAT data came back in Q3. The pivots happened in Q4.

The pattern of failure was consistent. The agent handled the easy tickets fine — password resets, order status, basic returns. CSAT on those was acceptable, sometimes better than the human baseline, because the agent answered faster. The agent fell apart on the long tail of edge cases — tickets that required reading three different internal systems, escalation logic that was never documented anywhere except inside a senior rep's head, customers whose situation did not fit any of the canonical resolution paths. The agent would loop, guess, and either close the ticket wrong or hand off to a human at the worst possible point — after the customer had already typed their problem twice.

The math broke too. Per-ticket cost for a Computer Use agent running across three internal apps came in around $1.20-2.40 per ticket once the token budget was honest about the multi-system context. The offshore human alternative was $0.80-1.50 per ticket. The on-shore alternative was $4-6. The agent was cheaper than on-shore and more expensive than off-shore, with worse handling on the edge cases, which is the worst possible position to be in inside an enterprise procurement conversation.

What actually shipped, eighteen months in, is not agent-replaces-rep. It is agent-augments-rep. The rep stays in the loop, the agent drafts the response, the rep edits and sends. Same productivity gain — roughly 30-45% by the deployments we have seen — without the CSAT regression and without the tail-risk of an agent closing a ticket badly under autonomous control. That is the pattern that survived. The full-replacement pattern is gone.

What failed — generative end-to-end "do my work" demos

The category that fell apart hardest is the one that drove the launch hype. "Watch the agent build me a PowerPoint from a research brief." "Watch the agent plan my offsite end-to-end." "Watch the agent run my CRM workflow autonomously for eight hours." Every one of these demos went viral on launch day. None of them became a product.

The reason is structural. End-to-end generative agentic tasks require the agent to make a sequence of compounding decisions where every step depends on the previous step being right. Even at 95% per-step reliability, a 20-step task lands at 0.95^20 ≈ 36% end-to-end success. OSWorld scores measure whole-task success rather than per-step accuracy, but if you take Computer Use's 66-72% as a rough per-step figure, 20 steps land at 0.7^20 ≈ 0.08%, which is to say zero. The deployments that survived attacked this by making the tasks shorter, more deterministic, and supervised. The deployments that failed attacked it by making the tasks longer and more autonomous, which made the failure rate worse on every metric.
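The compounding arithmetic is easy to reproduce. The sketch below assumes every step succeeds independently with the same probability, which is a simplification; the function names are made up for illustration. It also inverts the formula to show why shorter tasks are the fix: given a per-step rate, how many steps can you chain before end-to-end success drops below a target?

```python
import math


def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps


def max_steps_for_target(per_step: float, target: float) -> int:
    """Largest step count whose end-to-end success is still at least `target`."""
    return math.floor(math.log(target) / math.log(per_step))


for p in (0.95, 0.70):
    print(f"{p:.2f} per step, 20 steps: {end_to_end_success(p, 20):.4%}")

# How long a task can be at 95% per step before it fails more often than not.
print(max_steps_for_target(0.95, 0.50))
```

Under these assumptions, a 95%-per-step agent stays above a coin flip only up to 13 chained steps, which is roughly the argument for narrow scope in one number.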

The Claude Code production-data-wipe incident reported in early 2026 is the cautionary tale every enterprise buyer now references. An agent operating semi-autonomously inside a developer's environment ran a destructive operation it should not have run, the human in the loop did not catch it in time, and 2.5 years of production data went away. The agent that did this was not Computer Use, but the failure mode is the same — sufficient autonomy plus insufficient guardrails plus an irreversible action equals a story your CTO will be telling for years. The lesson the industry actually internalized is "no irreversible actions without a human gate." Every surviving production deployment honors it.

The cost reality — where agents pay for themselves, and where they don't

The other thing the demo videos hid is the dollar cost per task. A Computer Use session that reasons over a screenshot, plans an action, executes it, captures the result, and iterates burns through tokens fast. Frontier-model pricing for the underlying Claude Opus tier sits at roughly $5 per million input tokens and $25 per million output tokens. A single multi-step task that loads 200,000 tokens of screenshots, DOM, and reasoning context costs $1.00-2.50 in inference alone. Run that against a workflow that a $20/hour offshore worker would complete in 90 seconds for $0.50, and the math does not pencil.
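A back-of-the-envelope version of that comparison, as a sketch. The per-million-token prices and the $20/hour, 90-second human figures come from the paragraph above; the split of 200,000 input tokens plus 20,000 output tokens is an assumption made for illustration, as are the function names.

```python
INPUT_USD_PER_MTOK = 5.00    # roughly $5 per million input tokens (from the text)
OUTPUT_USD_PER_MTOK = 25.00  # roughly $25 per million output tokens (from the text)


def inference_cost(input_tokens: int, output_tokens: int) -> float:
    """Inference-only cost in USD; ignores retries, caching, and infrastructure."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


def human_cost(hourly_rate: float, seconds: float) -> float:
    """Labor cost in USD for one task at a flat hourly rate."""
    return hourly_rate * seconds / 3600


# Assumed split: 200k tokens of context in, 20k tokens of reasoning and actions out.
agent = inference_cost(input_tokens=200_000, output_tokens=20_000)
human = human_cost(hourly_rate=20.0, seconds=90)
print(f"agent ${agent:.2f} vs human ${human:.2f}")
```

With those inputs the agent comes out around three times the cost of the human for the same task. Changing the assumed output-token count moves the agent figure within the $1.00-2.50 band but does not flip the comparison.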

The deployments that survived the cost test all share a property. Either the work is high-stakes enough that a $2 per-task agent beats a $10 human (medical chart abstraction, loss-run analysis, legal-document review), or the work runs at volumes humans cannot match (six hundred form fills per day, twenty-four hours per day), or the work happens at hours where the human alternative is on-call overtime rates (after-hours intake, weekend processing). Outside those three regimes, agents lose on cost. That is the part the 2025 hype cycle did not price in.

Optimization helped at the margins. Structured outputs (markdown, JSON) instead of raw HTML reduced token consumption by about 67%. Semantic locators instead of full DOM dumps saved up to 93% of context. Selective screenshotting, retrieved-context-only prompting, and aggressive caching brought per-task costs down 40-60% across most production loops by mid-2026. None of those optimizations turn a $2 task into a $0.20 task. They turn it into a $1.20 task. The unit economics still have to make sense at the lower number.

What the surviving deployments have in common

Stepping back across the eighteen months, every Computer Use deployment that is still in production at scale has the same five properties. They are worth naming explicitly because they are the closest thing to a deployment checklist the industry has converged on.

Narrow scope. One workflow, one source system, one destination system. Not "automate my back office." Just that one workflow.
Deterministic verifier. A code-side check that the agent did the right thing. The agent does the open-ended work. The verifier does the math.
Human-in-the-loop on the irreversible action. Agent assembles, human submits. No exceptions until the deployment has run six clean months.
Bounded retry loops. If the agent has not converged in N attempts, escalate to a human. No infinite-loop autonomy.
Honest per-task cost accounting. The deployment penciled against the human alternative before it shipped. If it did not pencil, it did not ship.
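One way to make the five properties operational is to encode them as a pre-launch check. This is a hypothetical sketch: the `Readiness` record and its field names are invented here to mirror the list above, not taken from any framework.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Readiness:
    narrow_scope: bool
    deterministic_verifier: bool
    human_gate_on_irreversible: bool
    bounded_retries: bool
    cost_pencils: bool


def missing(readiness: Readiness) -> List[str]:
    """Names of the properties the deployment does not yet satisfy."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


# Hypothetical pilot that has everything except a code-side verifier.
pilot = Readiness(
    narrow_scope=True,
    deterministic_verifier=False,
    human_gate_on_irreversible=True,
    bounded_retries=True,
    cost_pencils=True,
)
gaps = missing(pilot)
print("ready" if not gaps else f"not ready: {gaps}")
```

The check is deliberately all-or-nothing: a deployment with any gap is reported as not ready rather than scored.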

What is left out of the list is just as telling. There is no "trust the agent." There is no "give it the keys and walk away." There is no "AGI is right around the corner." The maturation phase of agent deployment is the discovery that the technology is a tool with a specific shape, and the deployments that work are the ones that respect the shape.

The Operator and Mariner pivots — what they tell us

The two highest-profile pivots of the cycle were OpenAI Operator and Google Project Mariner. Both started as standalone consumer products with agentic browser capability. Both got folded into something else within eighteen months of launch. Both stories tell us the same thing: the consumer open-web agent is not a product, it is a feature of something larger.

OpenAI launched Operator on January 23, 2025, as a Pro-tier research preview at operator.chatgpt.com. By July 17, 2025, OpenAI announced Operator was being integrated into ChatGPT as the ChatGPT agent. By August 31, 2025, the standalone Operator product was shut down. The capability now ships as part of ChatGPT's agent mode, which itself is an augmentation surface inside a larger product. The pivot was not a failure — it was the recognition that "browser-using agent" is not what users buy. "Chat assistant that can browse when it needs to" is.

Project Mariner followed the same shape. Google introduced Mariner in December 2024, expanded it to ten parallel tasks by May 2025, integrated it into Google Search AI Mode and Gemini Agent, and shut down the standalone product on May 4, 2026. The technology shipped. The standalone product did not. The capability lives on inside Gemini and inside Search. The brand and the URL went away.

The pattern is the same one we have seen in every prior wave of AI tooling. The flashy standalone capability becomes a feature of an existing product surface that already has distribution and user trust. The 2026 shape of agentic browser automation is "your existing chat assistant or your existing IDE or your existing SaaS suddenly has agent capability inside it" — not "log into a new product that does agent stuff."

What this means if you are evaluating an agent deployment today

If you are an operator looking at Computer Use, ChatGPT agent mode, Gemini Agent, or browser-use library deployments for an internal workflow, the deployment-readiness questions are the same ones the survivors answered yes to. Is the scope narrow enough. Is there a deterministic verifier. Is the irreversible action gated by a human. Is the retry loop bounded. Does the per-task math pencil against the human alternative at honest token prices.

If you cannot answer yes to those five, the deployment will land in the disappointment phase of your own internal arc, six months from now, exactly the way the customer-support replacement deployments did in 2025. The technology is real. The maturation point is real. The use cases that work are real and specific. The use cases that do not work are also real and specific, and they are most of them.

The honest version of the 18-month retrospective is that Computer Use turned out to be a power tool, not a humanoid worker. Power tools are useful, and you can build a lot with them, but you do not hand one to someone untrained and walk away. The 2026 shape of agent deployment is the shape of every other power tool — narrow scope, trained operators, guardrails on the dangerous actions, and an honest accounting of when the tool beats the human and when it does not. The deployments that figured that out shipped. The ones that did not, did not.

The next eighteen months will repeat the cycle one layer up. Multi-agent orchestration is the new "give the agent the keys." The same curiosity → disappointment → maturation arc is already starting. The deployments that survive that wave will share the same five properties. The ones that do not will end up as cautionary blog posts in 2028, which is fine, because somebody has to write them.

Read the five properties again, because they are not a counsel of despair — they are a build spec. Narrow scope, a deterministic verifier, a human gate on the irreversible action, bounded retries, and honest per-task math against the human alternative. That spec is the entire difference between an agent that ships and an agent that becomes a 2028 cautionary tale. It is also, not by accident, how we build at Ascero. We do not sell the give-it-the-keys demo that went viral and died. We ship the narrow, verified, human-gated kind — pointed at exactly the high-stakes, high-volume, and after-hours work where the cost math pencils in your favor instead of against it.

HOW WE BUILD

Ascero builds agents to the surviving-deployment spec — narrow scope, a deterministic verifier, a human gate on irreversible actions, bounded retries, and honest cost accounting — so they land in the regimes where they beat the human alternative, not the disappointment phase. See what we ship at /agents, or book a call at /book.


Cite this article

Ascero AI. “Claude Computer Use, 18 Months In — What Actually Stuck.” May 28, 2026. https://asceroai.com/news/claude-computer-use-18-month-retrospective

Free to reference with attribution and a link back to this page.
