AI Hallucination Leaderboard + Desert Mirage
Today I want to dive into some research studies that came across my desk about AI hallucinations.
At Mission, my team and I are always tracking the latest AI research to understand both its capabilities and its limitations, and how we can apply those learnings to our work with customers.
So, I'll keep it short and sweet and sum up the highlights.
Also, don’t forget – next week I’ll be live in a Gen AI Ask Us Anything session with Mission CTO Jonathan LaCour. Register here, and come with some questions about hallucinations! (Or anything else AI-related).
The Mathematical Proof of Inevitable Hallucination
Let's start with something that might surprise you – researchers just mathematically proved that hallucination in AI is inevitable.
The paper shows that even if you train an LLM perfectly, it provably cannot learn all possible computable functions.
Think of it like trying to fit an infinite amount of information into a finite space – it's mathematically impossible.
This explains why even the best models sometimes generate plausible but incorrect information.
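If you want the flavor of the argument, here's a rough sketch of the diagonalization-style reasoning behind results like this (my paraphrase, not the paper's exact statement):

```latex
% Rough sketch (my paraphrase, not the paper's exact formulation).
% Any LLM is itself a computable function, so list every candidate model:
%   h_1, h_2, h_3, \dots
% Now define a ground-truth function f that deliberately disagrees with each one:
%   f(i) := \text{any plausible answer different from } h_i(i)
% Then every model h_i gets at least one input wrong:
\[
  \forall i \;\exists x \;:\; h_i(x) \neq f(x), \qquad \text{e.g. } x = i .
\]
% No amount of training fixes this, because f was built to dodge whatever h_i says.
```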
The Current State of AI Accuracy
Let’s take a look at another recent study. This one examined how 319 knowledge workers interact with AI tools in their daily work.
While we know people need to verify AI outputs, this study revealed how that verification actually happens.
Here's what they discovered about two key factors – confidence in AI and confidence in yourself:
- When workers had high confidence in AI, they actually did less critical checking of the outputs
- When workers had high confidence in their own abilities, they engaged more deeply with the AI outputs and caught more issues
- Most importantly, the effort of checking AI outputs often exceeded the effort of just doing the task directly
Think about what that means for a moment.
Many of us adopt AI tools assuming they'll make our work easier. And while they can boost productivity, there's a hidden cost: the mental effort of verifying outputs can sometimes outweigh the time saved.
The study found that workers handle AI outputs in three ways:
- Information tasks: Focus shifts from gathering info to verifying it
- Problem-solving tasks: Focus shifts from solving to integrating AI suggestions
- Analysis tasks: Focus shifts from doing the work to overseeing the AI's work
You might be thinking, “Ryan, this isn’t groundbreaking news to me.” And I’d say you're right.
What makes this important isn't just the findings themselves, but what they tell us about building effective AI workflows.
We need to account for these verification costs when deciding where and how to implement AI tools.
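To make that a little more concrete, here's a tiny back-of-the-envelope sketch (purely hypothetical numbers and function names on my part) of how you might weigh verification cost before wiring an AI tool into a workflow:

```python
# Hypothetical back-of-the-envelope check: does AI actually save time
# once you include the cost of verifying its output?

def ai_is_worth_it(manual_minutes: float,
                   ai_draft_minutes: float,
                   verify_minutes: float,
                   error_rate: float,
                   rework_minutes: float) -> bool:
    """Return True if the expected AI-assisted time beats doing it by hand.

    All inputs are estimates you would gather for your own task:
      manual_minutes   - time to just do the task yourself
      ai_draft_minutes - time to prompt the model and get a draft
      verify_minutes   - time to check the draft against sources
      error_rate       - fraction of drafts that need fixing
      rework_minutes   - time to fix a draft when it is wrong
    """
    expected_ai_time = ai_draft_minutes + verify_minutes + error_rate * rework_minutes
    return expected_ai_time < manual_minutes


# Example: a research summary that takes 30 minutes by hand.
# If verification plus occasional rework adds up to more than that,
# the AI "shortcut" is not actually a shortcut.
print(ai_is_worth_it(manual_minutes=30,
                     ai_draft_minutes=5,
                     verify_minutes=20,
                     error_rate=0.3,
                     rework_minutes=25))  # False: 5 + 20 + 7.5 = 32.5 > 30
```

The point isn't the exact numbers; it's that verification and rework belong in the math at all.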
Recently, the BBC ranked the major AI search tools' accuracy with news content. Obviously, news content is recent material that the LLM hasn't been trained on and that doesn't exist in its training data. Even the best-performing models still had challenges with factual accuracy. Here's each model's version and performance, ranked best to worst:
- ChatGPT Enterprise (GPT-4): 15% significant errors
- Perplexity Pro (default LLM): 17%
- Microsoft Copilot Pro (LLM not specified): 27%
- Google Gemini Standard (LLM not specified): 34%
Although these percentages are alarming, it isn't totally hopeless once you start looking at more specific use cases with larger data sets that the model has actually been trained on.
Introducing the Hallucination Leaderboard
There is now a standardized way to measure AI hallucination rates: the Vectara Hallucination Leaderboard. Rather than probing any specific aspect of a model or trying to get it to talk about news stories, it tests how well models can summarize documents without making things up. Here's what it's showing:
Google's latest models are leading the pack:
- Gemini-2.0-Flash-001: 0.7% hallucination rate
- Gemini-2.0-Pro-Exp: 0.8%
- Gemini-2.0-Flash-Lite-Preview: 1.2%
OpenAI's models show strong performance:
- GPT-4o: 1.5%
- GPT-4o-mini: 1.7%
- GPT-4-Turbo: 1.7%
- GPT-3.5-Turbo: 1.9%
What makes this leaderboard particularly valuable is that it measures something very specific. The models are given text to summarize and tested on whether they stick to the facts in that text.
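For the more hands-on readers, here's a minimal sketch of what that kind of measurement looks like under the hood. To be clear, this is not Vectara's actual harness; the two inner functions are stand-ins you'd swap for a real model call and a real factual-consistency checker:

```python
# Minimal sketch of a summarization-faithfulness benchmark.
# Not Vectara's actual code: generate_summary() and consistency_score()
# are placeholders for the model under test and a hallucination detector.

def generate_summary(document: str) -> str:
    """Stand-in for a call to the model under test."""
    return document[:200]  # placeholder: a real harness calls the LLM here


def consistency_score(document: str, summary: str) -> float:
    """Stand-in for a factual-consistency model scoring summary vs. source.

    Returns a value in [0, 1], where low scores mean the summary contains
    claims not supported by the document.
    """
    return 1.0  # placeholder: plug in a real factual-consistency model


def hallucination_rate(documents: list[str], threshold: float = 0.5) -> float:
    """Fraction of summaries judged unfaithful to their source document."""
    flagged = 0
    for doc in documents:
        summary = generate_summary(doc)
        if consistency_score(doc, summary) < threshold:
            flagged += 1
    return flagged / len(documents)


docs = ["Some source article text...", "Another source article text..."]
print(f"Hallucination rate: {hallucination_rate(docs):.1%}")
```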
Also notable is how much the models differ in how much they write. The average summary length varies significantly:
- Amazon Nova-Pro-V1: 85.5 words
- GPT-4-Turbo: 86.2 words
- GPT-3.5-Turbo: 84.1 words
- Gemini models: around 60-65 words
This tells us something important about how these models work – the ones producing longer summaries aren't necessarily more accurate. In fact, some of the most accurate models are the most concise.
The latest models are achieving sub-1% hallucination rates, which would have seemed impossible just a year ago.
However, remember these are best-case scenario results – when using these models in the real world, especially for open-ended tasks, you'll likely see higher hallucination rates.
What This Means For Your Work
Here are my big takeaways from this research:
- Always check your work and make sure you are getting the right answers
- Don't assume newer models are automatically more reliable
- Build verification steps into your AI workflows (see the sketch after this list)
- Leverage domain expertise alongside AI tools
- Be especially careful with high-stakes tasks
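On that "build verification steps" point, here's one minimal sketch of what a verification gate could look like in practice. The source names and the citation heuristic are made up for illustration; the idea is simply that nothing auto-publishes without a check:

```python
# Hypothetical sketch of one way to bake verification into a workflow:
# route every AI answer through a check, and fall back to human review
# whenever the answer can't be tied back to a known source.

from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    cited_sources: list[str]  # source documents the answer claims to use

APPROVED_SOURCES = {"internal-wiki", "product-docs", "support-kb"}

def needs_human_review(answer: Answer) -> bool:
    """Flag answers that cite nothing, or cite sources we don't recognize."""
    if not answer.cited_sources:
        return True
    return any(src not in APPROVED_SOURCES for src in answer.cited_sources)

# Usage: wherever you'd publish an AI answer directly, gate it first.
draft = Answer(text="Our SLA is 99.99% uptime.", cited_sources=[])
if needs_human_review(draft):
    print("Send to a human before it reaches a customer.")
else:
    print("Safe to auto-publish.")
```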
Let me know your thoughts on this research. I'd love to hear about your experiences with AI hallucinations and how you handle verification.
Until next time,
Ryan Ries
Now it's time for this week's image and the prompt I used to generate it.
"Create an image of a muppet wearing a t-shirt that says "CDW." The muppet is staring into the desert and sees a beautiful mirage up ahead."