AI Hallucination Leaderboard + Desert Mirage
Today I want to dive into some research studies that came across my desk about AI hallucinations.
At Mission, my team and I are always tracking the latest AI research to understand both its capabilities and its limitations, and how we can apply those learnings to our work with customers.
So, I'll keep it short and sweet and sum up the highlights.
Also, don’t forget – next week I’ll be live in a Gen AI Ask Us Anything session with Mission CTO Jonathan LaCour. Register here, and come with some questions about hallucinations! (Or anything else AI-related).
The Mathematical Proof of Inevitable Hallucination
Let's start with something that might surprise you – researchers just mathematically proved that hallucination in AI is inevitable.
The paper shows that even if you train an LLM perfectly, it provably cannot learn all possible computable functions.
Think of it like trying to fit an infinite amount of information into a finite space – it's mathematically impossible.
This explains why even the best models sometimes generate plausible but incorrect information.
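If you want the flavor of the argument, here's a rough sketch of the diagonalization-style reasoning behind results like this (my paraphrase, not the paper's exact statement):

```latex
% Rough sketch (my paraphrase, not the paper's exact formulation).
% Any LLM is itself a computable function, so list every candidate model:
%   h_1, h_2, h_3, \dots
% Now define a ground-truth function f that deliberately disagrees with each one:
%   f(i) := \text{any plausible answer different from } h_i(i)
% Then every model h_i gets at least one input wrong:
\[
  \forall i \;\exists x \;:\; h_i(x) \neq f(x), \qquad \text{e.g. } x = i .
\]
% No amount of training fixes this, because f was built to dodge whatever h_i says.
```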
The Current State of AI Accuracy
Let’s take a look at another recent study. This one examined how 319 knowledge workers interact with AI tools in their daily work.
While we know people need to verify AI outputs, this study revealed how that verification actually happens.
Here's what they discovered about two key factors – confidence in AI and confidence in yourself:
- When workers had high confidence in AI, they actually did less critical checking of the outputs
- When workers had high confidence in their own abilities, they engaged more deeply with the AI outputs and caught more issues
- Most importantly, the effort of checking AI outputs often exceeded the effort of just doing the task directly
Think about what that means for a moment.
Many of us adopt AI tools assuming they'll make our work easier. And while they can boost productivity, there's a hidden cost: the mental effort of verifying outputs can sometimes outweigh the time saved.
The study found that workers handle AI outputs in three ways:
- Information tasks: Focus shifts from gathering info to verifying it
- Problem-solving tasks: Focus shifts from solving to integrating AI suggestions
- Analysis tasks: Focus shifts from doing the work to overseeing the AI's work
You might be thinking, “Ryan, this isn’t groundbreaking news to me.” And I’d say you're right.
What makes this important isn't just the findings themselves, but what they tell us about building effective AI workflows.
We need to account for these verification costs when deciding where and how to implement AI tools.
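To make that a little more concrete, here's a tiny back-of-the-envelope sketch (purely hypothetical numbers and function names on my part) of how you might weigh verification cost before wiring an AI tool into a workflow:

```python
# Hypothetical back-of-the-envelope check: does AI actually save time
# once you include the cost of verifying its output?

def ai_is_worth_it(manual_minutes: float,
                   ai_draft_minutes: float,
                   verify_minutes: float,
                   error_rate: float,
                   rework_minutes: float) -> bool:
    """Return True if the expected AI-assisted time beats doing it by hand.

    All inputs are estimates you would gather for your own task:
      manual_minutes   - time to just do the task yourself
      ai_draft_minutes - time to prompt the model and get a draft
      verify_minutes   - time to check the draft against sources
      error_rate       - fraction of drafts that need fixing
      rework_minutes   - time to fix a draft when it is wrong
    """
    expected_ai_time = ai_draft_minutes + verify_minutes + error_rate * rework_minutes
    return expected_ai_time < manual_minutes


# Example: a research summary that takes 30 minutes by hand.
# If verification plus occasional rework adds up to more than that,
# the AI "shortcut" is not actually a shortcut.
print(ai_is_worth_it(manual_minutes=30,
                     ai_draft_minutes=5,
                     verify_minutes=20,
                     error_rate=0.3,
                     rework_minutes=25))  # False: 5 + 20 + 7.5 = 32.5 > 30
```

The point isn't the exact numbers; it's that verification and rework belong in the math at all.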
Recently, the BBC ranked the major AI search tools' accuracy with news content. Obviously, news content is recent material that the LLM hasn't been trained on and that doesn't exist in its training data. Even the best-performing models still had challenges with factual accuracy. Here's each model's version and performance, ranked best to worst:
- ChatGPT Enterprise (GPT-4): 15% significant errors
- Perplexity Pro (default LLM): 17%
- Microsoft Copilot Pro (LLM not specified): 27%
- Google Gemini Standard (LLM not specified): 34%
Although these percentages are alarming, it isn't totally hopeless once you start looking at more specific use cases with larger data sets that the model has actually been trained on.
Introducing the Hallucination Leaderboard
There is now a standardized way to measure AI hallucination rates: the Vectara Hallucination Leaderboard. Rather than probing any specific aspect of a model or trying to get it to talk about news stories, it tests how well models can summarize documents without making things up. Here's what it's showing:
Google's latest models are leading the pack:
- Gemini-2.0-Flash-001: 0.7% hallucination rate
- Gemini-2.0-Pro-Exp: 0.8%
- Gemini-2.0-Flash-Lite-Preview: 1.2%
OpenAI's models show strong performance:
- GPT-4o: 1.5%
- GPT-4o-mini: 1.7%
- GPT-4-Turbo: 1.7%
- GPT-3.5-Turbo: 1.9%
What makes this leaderboard particularly valuable is that it measures something very specific. The models are given text to summarize and tested on whether they stick to the facts in that text.
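For the more hands-on readers, here's a minimal sketch of what that kind of measurement looks like under the hood. To be clear, this is not Vectara's actual harness; the two inner functions are stand-ins you'd swap for a real model call and a real factual-consistency checker:

```python
# Minimal sketch of a summarization-faithfulness benchmark.
# Not Vectara's actual code: generate_summary() and consistency_score()
# are placeholders for the model under test and a hallucination detector.

def generate_summary(document: str) -> str:
    """Stand-in for a call to the model under test."""
    return document[:200]  # placeholder: a real harness calls the LLM here


def consistency_score(document: str, summary: str) -> float:
    """Stand-in for a factual-consistency model scoring summary vs. source.

    Returns a value in [0, 1], where low scores mean the summary contains
    claims not supported by the document.
    """
    return 1.0  # placeholder: plug in a real factual-consistency model


def hallucination_rate(documents: list[str], threshold: float = 0.5) -> float:
    """Fraction of summaries judged unfaithful to their source document."""
    flagged = 0
    for doc in documents:
        summary = generate_summary(doc)
        if consistency_score(doc, summary) < threshold:
            flagged += 1
    return flagged / len(documents)


docs = ["Some source article text...", "Another source article text..."]
print(f"Hallucination rate: {hallucination_rate(docs):.1%}")
```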
Also notable is how much the models differ in how much they write. The average summary length varies significantly:
- Amazon Nova-Pro-V1: 85.5 words
- GPT-4-Turbo: 86.2 words
- GPT-3.5-Turbo: 84.1 words
- Gemini models: around 60-65 words
This tells us something important about how these models work – the ones producing longer summaries aren't necessarily more accurate. In fact, some of the most accurate models are the most concise.
The latest models are achieving sub-1% hallucination rates, which would have seemed impossible just a year ago.
However, remember these are best-case scenario results – when using these models in the real world, especially for open-ended tasks, you'll likely see higher hallucination rates.
What This Means For Your Work
Here are my big takeaways from this research:
- Always check your work and make sure you are getting the right answers
- Don't assume newer models are automatically more reliable
- Build verification steps into your AI workflows (see the sketch after this list)
- Leverage domain expertise alongside AI tools
- Be especially careful with high-stakes tasks
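On that "build verification steps" point, here's one minimal sketch of what a verification gate could look like in practice. The source names and the citation heuristic are made up for illustration; the idea is simply that nothing auto-publishes without a check:

```python
# Hypothetical sketch of one way to bake verification into a workflow:
# route every AI answer through a check, and fall back to human review
# whenever the answer can't be tied back to a known source.

from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    cited_sources: list[str]  # source documents the answer claims to use

APPROVED_SOURCES = {"internal-wiki", "product-docs", "support-kb"}

def needs_human_review(answer: Answer) -> bool:
    """Flag answers that cite nothing, or cite sources we don't recognize."""
    if not answer.cited_sources:
        return True
    return any(src not in APPROVED_SOURCES for src in answer.cited_sources)

# Usage: wherever you'd publish an AI answer directly, gate it first.
draft = Answer(text="Our SLA is 99.99% uptime.", cited_sources=[])
if needs_human_review(draft):
    print("Send to a human before it reaches a customer.")
else:
    print("Safe to auto-publish.")
```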
Let me know your thoughts on this research. I'd love to hear about your experiences with AI hallucinations and how you handle verification.
Until next time,
Ryan Ries
Now it's time for this week's image and the prompt I used to generate it.
"Create an image of a muppet wearing a t-shirt that says "CDW." The muppet is staring into the desert and sees a beautiful mirage up ahead."