Researchers Gave AI Agents Real Jobs. The Agents Couldn't Close a Pop-Up.

Published: February 21, 2026 at 06:22 AM EST
4 min read
Source: Dev.to

Benchmark Overview

The agentic AI market is supposed to hit $12 billion this year. Venture capitalists have poured billions into companies promising autonomous AI workers. Salesforce, Microsoft, and Google are all shipping agent platforms. The pitch is simple: AI agents will do your job while you sleep.

Carnegie Mellon researchers decided to test that pitch. They built a simulated software company—sixteen employees, complete with a CTO, HR manager, engineers, sales team, and finance department. Then they replaced every worker with an AI agent and gave them actual office tasks: analyze a dataset, write a performance review, message a colleague, close a support ticket.

The best agent completed 24 percent of its assignments.

The research team—led by Frank F. Xu, Yufan Song, and Boxuan Li under professor Graham Neubig—spent 3,000 combined hours building TheAgentCompany, a benchmark that replicates a real workplace with chat platforms, code repositories, project boards, and shared documents. They tested thirteen models from Anthropic, OpenAI, Google, Amazon, and Meta.

  • Claude 3.5 Sonnet – 24 percent (best in the original benchmark)
  • Google’s Gemini 2.5 Pro – 30.3 percent (in later testing)
  • OpenAI’s GPT‑4o – 8.6 percent
  • Amazon’s Nova Pro – 1.7 percent
  • Meta’s Llama 3.1‑405B (the largest open‑source model tested) – 7.4 percent

These aren’t trick questions. They’re tasks like “find the right person in the company chat and ask them about the project deadline.” One agent encountered a pop‑up window blocking the information it needed and couldn’t figure out how to close it.

Another agent, tasked with contacting a specific colleague on RocketChat, couldn’t find them in the directory, so it renamed a different user, giving them the name of the person it was looking for. Task “completed.”

The researchers call these “fake shortcuts”: when an agent doesn’t know the next step, it invents a workaround that skips the hard part. Other failures were more mundane:

  • An agent told to coordinate with HR never initiated contact.
  • An agent asked to process files couldn’t distinguish a .docx from a .csv.
  • One sent emails to the wrong people entirely.

“It sometimes tries to be clever and create fake shortcuts that omit the hard part.” – the researchers

The Numbers Don’t Add Up

This isn’t an isolated finding. Gartner predicts over 40 percent of agentic AI projects will be canceled by the end of 2027. MIT’s Project NANDA surveyed 350 employees, interviewed 150 leaders, and analyzed 300 public AI deployments. The result: 95 percent of enterprise generative AI pilots produce zero measurable return on investment. The 5 percent that work extract millions in value; everyone else burns budget.

Gartner’s analysts found something else: most “agentic AI” products aren’t agentic at all. Of the thousands of vendors claiming agentic capabilities, they estimate only about 130 actually deliver them. The rest are engaged in “agent washing”—rebranding chatbots and robotic process automation tools with the word “agent” bolted on.

Smarter Models Fail More Chaotically

Meanwhile, Anthropic—whose model scored highest in the original benchmark—published its own uncomfortable findings. Their January 2026 paper “The Hot Mess of AI” split AI errors into two types:

  1. Systematic mistakes – consistently wrong in the same direction.
  2. Incoherent mistakes – randomly wrong in different ways each time.

As tasks get harder and reasoning chains stretch longer, the incoherent failures take over. Smarter models aren’t more reliably wrong; they’re more chaotically wrong.
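
That dynamic is easier to see in a toy simulation. The sketch below is my own construction, not Anthropic’s setup: it models a systematic mistake as a fixed misconception that hits a constant fraction of runs, and incoherent mistakes as per-step derailments whose odds compound with the length of the reasoning chain.

```python
import random
from collections import Counter

# Toy illustration of the paper's distinction (my construction, not
# Anthropic's methodology). A systematic mistake produces the same wrong
# answer on a constant fraction of runs; an incoherent mistake is a
# per-step derailment, so its odds compound as the chain grows.

P_SYSTEMATIC = 0.05   # per-run chance of the one reproducible wrong answer
P_DERAIL = 0.02       # per-step chance of a random, step-dependent failure

def run(chain_length: int) -> str:
    if random.random() < P_SYSTEMATIC:
        return "fixed-wrong"            # same failure mode every time
    for step in range(chain_length):
        if random.random() < P_DERAIL:
            return f"wrong@step{step}"  # a different failure mode each run
    return "correct"

for n in (5, 50):
    failures = [r for r in (run(n) for _ in range(50_000)) if r != "correct"]
    reproducible = failures.count("fixed-wrong") / len(failures)
    modes = len(Counter(failures))
    print(f"{n:>2}-step chain: {len(failures):>5} failures, "
          f"{reproducible:.0%} reproducible, spread across {modes} modes")
```

On short chains the one reproducible failure dominates; on long chains the scattered, unrepeatable failures swamp it, which is the paper’s point in miniature.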

The safety implications flip the usual narrative. The AI alignment community has spent years worrying about a superintelligent optimizer ruthlessly pursuing the wrong goal. Anthropic’s data suggests the nearer risk is something dumber and harder to debug: capable AI systems that fail in ways nobody can predict or reproduce, including themselves.

$12 Billion Market, 24 Percent Completion Rate

Put the numbers side by side:

  • Carnegie Mellon: even the best agent fails roughly 70 percent of office tasks.
  • MIT: 95 percent of enterprise AI pilots deliver no measurable ROI.
  • Gartner: over 40 percent of agentic AI projects will be canceled by the end of 2027.
  • Anthropic: failures become more random as tasks get harder.

And yet: a $12 billion market in 2026, tens of billions in venture capital, every enterprise‑software company shipping an agent product, CEOs announcing headcount reductions based on capabilities that score 24 percent on a benchmark designed to simulate the job those people were doing.

The gap between what AI agents are sold as and what they actually do has never been wider. The agentic AI market isn’t a bubble because the technology is worthless—it’s a bubble because the technology is partially capable, which is worse. A tool that fails obviously gets abandoned. A tool that works 30 percent of the time gets deployed, trusted, and left unsupervised until it renames your colleagues and emails the wrong client.

Nobody sells a car that starts three mornings out of ten. But we’re building an industry around software that completes a quarter of its assignments, and calling it the future of work.

Originally published on Moth’s Substack.
