news 2026-04-21 · huggingface-papers

🤖 AI Can Use Your Computer — But Can It Do It Twice?

What if you hired an assistant who nailed a task perfectly the first time — then completely botched it the second?

That's exactly what's happening with today's computer-use AI agents. They can browse the web, fill out forms, and automate desktop tasks. Sometimes they even outperform humans. But there's a catch: they can't do it reliably.

A new study from UC Santa Cruz put these agents to the test — running the same tasks multiple times on the OSWorld benchmark. The results reveal three root causes of unreliability:

🎯 Key findings:

**Execution randomness**: The agent makes different decisions on each run, like an employee whose focus drifts from day to day.

**Ambiguous instructions**: Vague task descriptions lead to wildly different interpretations across attempts.

**Behavioral inconsistency**: Even with identical prompts, the agent picks different strategies. Sometimes it takes a shortcut, sometimes it goes the long way around.

Think of it like a chef making the same dish — same recipe, different result every time. Now imagine that chef is booking your flights or managing your finances.

The researchers propose three fixes: test agents across multiple runs (not just once), let agents ask clarifying questions when instructions are unclear, and stabilize decision-making strategies across executions.
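To make the first fix concrete, here is a minimal sketch of what scoring "across multiple runs" could look like. It is not the study's evaluation code: the function `reliability_report`, the metric names, and the task outcomes are all hypothetical, made up purely for illustration.

```python
def reliability_report(runs: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize pass/fail outcomes from repeated attempts at each task.

    `runs` maps a task name to one boolean per attempt (True = success).
    The metric names below are illustrative, not taken from the study.
    """
    if not runs or any(not outcomes for outcomes in runs.values()):
        raise ValueError("every task needs at least one recorded run")
    n = len(runs)
    return {
        # What a single-run evaluation would report.
        "first_run": sum(o[0] for o in runs.values()) / n,
        # Average per-task success rate over all attempts.
        "mean": sum(sum(o) / len(o) for o in runs.values()) / n,
        # Share of tasks the agent gets right every single time.
        "all_runs": sum(all(o) for o in runs.values()) / n,
        # Share of tasks with both successes and failures.
        "mixed": sum(any(o) and not all(o) for o in runs.values()) / n,
    }


# Made-up outcomes for four hypothetical tasks, three attempts each.
example = {
    "rename_file": [True, True, True],
    "fill_form": [True, False, True],
    "export_pdf": [False, False, False],
    "book_flight": [True, False, False],
}
report = reliability_report(example)
print(" ".join(f"{name}={value:.2f}" for name, value in report.items()))
```

On this invented data, a one-shot evaluation would score the agent at 0.75, yet only a quarter of the tasks succeed on every attempt and half of them flip between success and failure. That gap is what a single run hides.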

The takeaway? "Capable" and "dependable" are two very different things — and we're not there yet.

📄 Source

huggingface-papers
