[REPORT] Evaluating AI Research Agents
AI research agents promise to revolutionise how analysts work: scanning hundreds of pages, extracting key facts, and delivering answers in seconds. But when we tested them on real enterprise research tasks, even the best AI models got it wrong nearly 40% of the time.
If your team is considering handing over research workflows to AI agents, you need to see these results first.
The Reality Behind the AI Research Hype
AI agents are everywhere. ChatGPT’s Deep Research, Claude Projects, Gemini’s research capabilities, Microsoft Copilot: they all promise to replace hours of manual analysis with instant, accurate answers.
But when stakes are high and decisions matter, “pretty good” isn’t good enough.
We put four leading AI research agents through a rigorous real-world test: analysing three years of Commerzbank annual reports to answer ten complex business questions. These weren’t simple fact-checks like “What was the revenue?” but multi-step analytical tasks that mirror what intelligence analysts actually do.
The results? GPT-5 scored highest at just 3.11 out of 5. The average across all models was 2.58 out of 5.
More concerning: the agents struggled most with recall (finding all relevant information) and consistency (giving the same answer when asked twice). Even when they were precise, they frequently missed critical details.
What We Discovered: The Three Problems Holding AI Agents Back
1. The Completeness Problem
AI agents regularly missed important information that a human analyst would catch. They might correctly identify revenue figures but completely overlook management’s explanation for what drove the change. In enterprise intelligence work, these missing pieces often matter more than the numbers themselves.
2. The Consistency Problem
Run the same question through the same AI agent twice, and you’ll often get different answers. One run might return comprehensive details whilst another skips entire sections. For repeatable business processes, this variability is a deal-breaker.
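If you want to quantify this drift yourself, one minimal approach (a sketch only, assuming a hypothetical `ask_agent` function that wraps whichever agent you are testing) is to ask the same question several times and measure how much the answers diverge:

```python
# Minimal run-to-run consistency check. ask_agent is a hypothetical
# stand-in for whichever agent API you are testing; the similarity
# measure here is simple text overlap, not semantic equivalence.
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(question, ask_agent, runs=3):
    """Return the mean pairwise answer similarity (1.0 = identical runs)."""
    answers = [ask_agent(question) for _ in range(runs)]
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(answers, 2)]
    return sum(sims) / len(sims)
```

Scores well below 1.0 across repeated runs are exactly the variability described above.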
3. The Confidence Problem
Perhaps most dangerous: AI agents present wrong answers with the same confidence as correct ones. We found completely incorrect revenue figures, nonsensical transcription errors, and illogical timeline mistakes, all delivered in polished, authoritative prose that looked entirely credible.
What’s Inside the Research Report?
This comprehensive evaluation reveals exactly how current AI research agents perform on enterprise-grade analytical tasks:
Real Performance Data
We tested GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, and Microsoft 365 Copilot across four critical dimensions: precision (accuracy), recall (completeness), quality (clarity), and consistency (reliability). You’ll see detailed scoring breakdowns for each model and each question type.
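The full grading rubric is included in the report download; purely as an illustration of how per-dimension grades on a 1–5 scale roll up into overall figures like the 3.11 quoted above, here is a short sketch (the grades in it are placeholders, not our data):

```python
# Illustrative roll-up of per-question dimension grades into one /5 score.
# The dimension names match the report; the example numbers do not.
from statistics import mean

DIMENSIONS = ("precision", "recall", "quality", "consistency")

def overall_score(grades):
    """Average each question's four dimension grades, then average questions."""
    per_question = [mean(g[d] for d in DIMENSIONS) for g in grades]
    return round(mean(per_question), 2)

example = [
    {"precision": 4, "recall": 2, "quality": 4, "consistency": 3},
    {"precision": 3, "recall": 2, "quality": 4, "consistency": 2},
]
print(overall_score(example))  # 3.0
```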
The Known Flaws in RAG Systems
Most AI research agents use Retrieval-Augmented Generation (RAG) to find relevant information before generating answers. We expose the six critical failure points in RAG systems, from content selection and chunking to retrieval and sorting, and explain why even sophisticated prompt engineering can’t solve these fundamental issues.
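The report walks through all six failure points in detail. As a generic illustration only (a naive sketch, not any vendor’s actual pipeline), the skeleton below marks where several of the stages named above can silently go wrong:

```python
# Skeletal RAG flow with failure points annotated. The keyword scorer
# stands in for a real embedding retriever; the file name is a placeholder.

def chunk(document, size=500):
    # Chunking failure: fixed-size splits can sever a revenue figure from
    # the management commentary that explains it.
    return [document[i:i + size] for i in range(0, len(document), size)]

def score(chunk_text, query):
    # Retrieval failure: similarity scoring can miss chunks that are
    # relevant but phrased differently from the query.
    return sum(word in chunk_text.lower() for word in query.lower().split())

def retrieve(chunks, query, k=5):
    # Sorting failure: the top-k cutoff silently drops material a human
    # analyst would consider essential.
    ranked = sorted(chunks, key=lambda c: score(c, query), reverse=True)
    return ranked[:k]

# Content-selection failure happens even earlier: load the wrong report
# (or the wrong year) here and everything downstream is confidently wrong.
report_text = open("annual_report_2023.txt").read()
context = retrieve(chunk(report_text), "What drove the change in revenue?")
```

No prompt can recover what the retriever never returns, which is why prompt engineering alone can’t fix these stages.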
Actual Error Examples
See the specific mistakes AI agents made: wrong revenue figures, incomprehensible transcription errors, miscalculated percentages, and impossible timeline claims. Understanding how agents fail helps you spot problems before they reach stakeholders.
Practical Prompts and Methodology
Download the exact prompts we used, see our evaluation framework, and access our grading methodology. If you’re testing AI agents internally, you can adapt our approach for your own content and questions.
Where AI Agents Actually Work
Not all use cases are equal. We identify four scenarios where AI research agents excel today: initial reviews, structured data exploration, summarisation, and research planning. Understanding where to use AI and where to keep humans in the loop is essential for successful implementation.
Download the full research report to access:
- Complete performance data across all models and questions
- Detailed RAG failure point analysis with market intelligence examples
- Prompt templates for research brief creation and evaluation
- Human benchmark methodology showing the 15-hour analyst process
- Practical recommendations for using AI agents safely and effectively
Frequently Asked Questions
What does this research report evaluate?
This report benchmarks the performance of leading AI research agents on a realistic enterprise task: analysing three years of Commerzbank annual reports to answer complex, multi-step business questions. It tests GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, and Microsoft 365 Copilot across four dimensions (precision, recall, quality, and consistency), comparing their outputs against human analyst answers.
Why does this evaluation matter for enterprise teams?
AI agents are increasingly being used for market intelligence and research tasks, but their reliability in enterprise settings is often assumed rather than tested. This report provides empirical data on how current agents perform when asked to extract, synthesise, and compare information from large, complex documents: the kind of work that matters for strategic decisions.
What were the headline findings?
Even the best-performing model scored just 3.11 out of 5, with significant gaps in recall and consistency across all agents tested. The report found that agents frequently miss critical information and can produce different answers to the same question across multiple runs, issues that would concern any team preparing material for senior stakeholders.
What are the main limitations of AI research agents today?
The report identifies recall and consistency as the primary challenges. Agents often fail to surface all relevant facts from complex documents, and their outputs vary between runs in ways that undermine trust. Additionally, errors can be difficult to detect because agents present information confidently even when it contains inaccuracies.
Does the report provide guidance on when to use AI research agents?
Yes. The report outlines specific use cases where agents perform well, including initial topic reviews, structured data exploration, document summarisation, and generating research plans. It also provides practical advice on prompt engineering and explains why human oversight remains essential for high-stakes enterprise work.
How long will it take to read?
The report is approximately 30 pages and can be read in 20–25 minutes. It includes detailed methodology, scoring breakdowns by model, and practical recommendations that teams can apply immediately.