๐ค Top AI Models Score Just 40% When Instructions Conflict โ New Research Sounds the Alarm
Imagine starting a new job where your boss, their boss, a client, and three different tools all give you conflicting orders at once.
Who do you listen to?
This is exactly the crisis facing today's AI agents โ and new research from Johns Hopkins reveals they're failing badly at it.
Researchers created ManyIH-Bench, a test suite of 853 tasks where AI must navigate conflicting instructions from up to 12 different authority levels โ system prompts, users, tools, other agents, and more.
The results are alarming:
- Even frontier models scored only ~40% accuracy
- Performance drops sharply as the number of instruction sources increases
- Current systems that handle just 3-5 privilege tiers completely break down in realistic scenarios
- Attackers could exploit this confusion to hijack agent behavior
Think of it like a company with brilliant employees but no org chart. When 10 people give orders simultaneously, even the smartest worker gets paralyzed โ or worse, follows the wrong person.
As we rush to deploy AI agents that browse the web, write code, and manage our data, this research is a critical wake-up call: we need robust, scalable instruction hierarchies before these agents can be trusted in the real world.
๐ Source
huggingface-papers