news 2026-04-15 · huggingface-papers

🤖 Top AI Models Score Just 40% When Instructions Conflict — New Research Sounds the Alarm

Imagine starting a new job where your boss, their boss, a client, and three different tools all give you conflicting orders at once.

Who do you listen to?

This is exactly the crisis facing today's AI agents — and new research from Johns Hopkins reveals they're failing badly at it.

Researchers created ManyIH-Bench, a test suite of 853 tasks where AI must navigate conflicting instructions from up to 12 different authority levels — system prompts, users, tools, other agents, and more.

The results are alarming:

Even frontier models scored only ~40% accuracy
Performance drops sharply as the number of instruction sources increases
Current systems that handle just 3-5 privilege tiers completely break down in realistic scenarios
Attackers could exploit this confusion to hijack agent behavior

Think of it like a company with brilliant employees but no org chart. When 10 people give orders simultaneously, even the smartest worker gets paralyzed — or worse, follows the wrong person.

As we rush to deploy AI agents that browse the web, write code, and manage our data, this research is a critical wake-up call: we need robust, scalable instruction hierarchies before these agents can be trusted in the real world.

📄 Source

huggingface-papers

← Previous

🌍 Lyra 2.0 — NVIDIA's AI That Builds Explorable 3

🧠 Nemotron 3 Super — NVIDIA's Open Model That Thi