AI Can Use Your Computer, But Can It Do It Twice?
What if you hired an assistant who nailed a task perfectly the first time, then completely botched it the second?
That's exactly what's happening with today's computer-use AI agents. They can browse the web, fill out forms, and automate desktop tasks. Sometimes they even outperform humans. But there's a catch: they can't do it reliably.
A new study from UC Santa Cruz put these agents to the test by running the same tasks multiple times on the OSWorld benchmark. The results reveal three root causes of unreliability:
Key findings:
- **Execution randomness**: The agent makes different decisions each run, like an employee whose focus drifts day to day
- **Ambiguous instructions**: Vague task descriptions lead to wildly different interpretations across attempts
- **Behavioral inconsistency**: Even with identical prompts, the agent chooses different strategies, sometimes taking shortcuts, sometimes going the long way around
Think of it like a chef making the same dish: same recipe, different result every time. Now imagine that chef is booking your flights or managing your finances.
The researchers propose three fixes: test agents across multiple runs (not just once), let agents ask clarifying questions when instructions are unclear, and stabilize decision-making strategies across executions.
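The first fix, testing across multiple runs, can be sketched in a few lines of Python. This is a minimal illustration and not the study's or OSWorld's actual harness: `repeated_run_report`, `summarize`, `fake_agent`, and the task names are all hypothetical, and the only assumption is that you have some function that runs one task once and reports success or failure.

```python
from typing import Callable, Dict, List


def repeated_run_report(
    run_task: Callable[[str], bool],
    task_ids: List[str],
    runs_per_task: int = 3,
) -> Dict[str, dict]:
    """Run every task several times and record how consistent the outcomes are."""
    if runs_per_task < 1:
        raise ValueError("runs_per_task must be at least 1")
    report = {}
    for task_id in task_ids:
        # One boolean per attempt: True means the agent completed the task.
        outcomes = [bool(run_task(task_id)) for _ in range(runs_per_task)]
        successes = sum(outcomes)
        report[task_id] = {
            "success_rate": successes / runs_per_task,
            "passed_once": successes > 0,
            "passed_every_time": successes == runs_per_task,
        }
    return report


def summarize(report: Dict[str, dict]) -> Dict[str, float]:
    """Compare 'succeeded at least once' with 'succeeded on every run'."""
    if not report:
        raise ValueError("report is empty")
    n = len(report)
    return {
        "capable": sum(r["passed_once"] for r in report.values()) / n,
        "dependable": sum(r["passed_every_time"] for r in report.values()) / n,
    }


# Hypothetical stand-in for a real agent: outcomes are scripted so the demo
# is deterministic. A real harness would launch the agent here instead.
scripted = {
    "book_flight": iter([True, False, True]),
    "rename_file": iter([True, True, True]),
}


def fake_agent(task_id: str) -> bool:
    return next(scripted[task_id])


report = repeated_run_report(fake_agent, ["book_flight", "rename_file"], runs_per_task=3)
print(summarize(report))  # {'capable': 1.0, 'dependable': 0.5}
```

In this toy run both tasks succeed at least once, so the agent looks fully capable, yet only one of the two succeeds on every attempt. A single-run evaluation would hide that gap.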
The takeaway? "Capable" and "dependable" are two very different things, and today's computer-use agents are not yet dependable.
Source
huggingface-papers