
The Smartest AI Model Lies the Most. Here Is the Math.

GPT-5.5 scores 60 on the Artificial Analysis Intelligence Index but has an 86 percent hallucination rate. Opus 4.7's hallucination rate is 36 percent. Here is the math on why smartest is not best.


Everyone’s saying GPT-5.5 just won the AI race.

I read the benchmark and then I read the hallucination report. They tell different stories.

GPT-5.5 hit 60 on the Artificial Analysis Intelligence Index, top of the leaderboard. The same week, its hallucination rate clocked at 86 percent. Claude Opus 4.7 sits at 36 percent. Gemini 3.1 Pro at roughly 50.

The smartest model lies the most.

If you ship work to a paying client, the hallucination rate is the only column that matters. Eighty-six percent means that when you ask GPT-5.5 for a specific, checkable fact, most of the answers come back with a fabricated specific in them. A wrong stat, or a confident misattribution that sounds right and is not. You verify every output by hand, and the “smart” model just made you slower than the model you used last month.

This piece is about why “smartest” became the wrong question.

The benchmark and the report

The Artificial Analysis Intelligence Index is a composite score across reasoning evals, math, code, and general knowledge tests. GPT-5.5 launched at 60. Opus 4.7 sits at 58. Gemini 3.1 Pro at 56. DeepSeek V4-Flash in the mid-50s.

The hallucination rate is a separate measurement on the same eval suite. The model gets asked questions where the right answer requires a specific verifiable value, like a date or a number or a citation or a code identifier. The score is the percentage of answers where the model produced a confident specific that turned out to be invented.
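A rough sketch of how a score like that could be computed, assuming every answer has already been graded. The label names (`correct`, `declined`, `fabricated`) and the function are illustrative assumptions, not Artificial Analysis's actual grading scheme, and the sample grades are made up.

```python
from collections import Counter

def hallucination_rate(grades):
    """Percentage of all answers graded as a confident, invented specific.

    `grades` holds one label per question. The label names are
    hypothetical; the real eval's categories are not given in this article.
    """
    if not grades:
        raise ValueError("need at least one graded answer")
    counts = Counter(grades)
    return 100.0 * counts["fabricated"] / len(grades)

# Made-up grades for ten questions, for illustration only.
sample = ["correct"] * 5 + ["declined"] * 2 + ["fabricated"] * 3
print(hallucination_rate(sample))  # 30.0
```

Under this definition a declined answer does not count against the model. Only a confident invention does.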

On that column, GPT-5.5 hits 86 percent. Opus 4.7 sits at 36. Gemini 3.1 Pro at roughly 50.

| Model | Intelligence Index | Hallucination rate |
| --- | --- | --- |
| GPT-5.5 | 60 | 86% |
| Claude Opus 4.7 | 58 | 36% |
| Gemini 3.1 Pro | 56 | ~50% |
| DeepSeek V4-Flash | mid-50s | not yet reported |

Both numbers are real. Both are public. The launch-day discourse only covered one of them.

What 86 percent looks like in your inbox

Here is what an 86 percent rate produces in the wild, drawn from one Reddit thread the day after launch.

User asks GPT-5.5 for the latest stat on US small-business AI adoption. Model returns “63 percent of US small businesses now use AI tools weekly, per the 2026 NSBA Annual Survey.” The number sounds right. The survey exists. The actual figure in that survey is 41 percent. The “63 percent” was generated.

Same session, user asks for a Python library to handle a niche text-cleaning task. Model recommends a package called pandas-textanalyzer. The package does not exist on PyPI. The model fabricated a name that sounds like five real packages mashed together, and the user only catches it when pip install fails three minutes later.

This is what 86 percent looks like. Each individual answer reads correct. The texture is fine. The specifics are fiction. You only catch it if you verify, and verifying every output is what AI was supposed to save you from.
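The missing-package case is one you can catch before pip install fails. A minimal sketch, assuming PyPI's public JSON endpoint (`https://pypi.org/pypi/<name>/json`), which returns a 404 for projects it does not know. The function names and the injectable `status_lookup` parameter are my own, added so the check can be tested without a network call.

```python
import urllib.error
import urllib.request

def pypi_status(name):
    """Return the HTTP status PyPI's JSON endpoint gives for a project name."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is on the exception.
        return err.code

def package_exists(name, status_lookup=pypi_status):
    """True if PyPI knows the project, False on a 404.

    `status_lookup` is a hypothetical hook for offline testing.
    """
    status = status_lookup(name)
    if status == 200:
        return True
    if status == 404:
        return False
    raise RuntimeError(f"unexpected status {status} for {name!r}")
```

Calling `package_exists("pandas-textanalyzer")` before installing would flag the invented name. A True result only means the name is registered, not that the package does what the model claimed.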

Smartest is the wrong question

Benchmarks like the Artificial Analysis Index measure ceiling reasoning on synthetic tasks. They tell you what the model can do when it tries hardest, on a problem the eval designer chose, with no real-world stakes.

Hallucination rate measures floor reliability on production tasks. It tells you what the model does when you point it at a question the eval designer did not anticipate, where the model has to decide whether to admit ignorance or invent.

For anyone shipping work to a paying client, floor matters more than ceiling. A model that can write at PhD level eight times out of ten and fabricate a citation the other two has more downside than a model that writes at master’s level ten times out of ten and never invents.
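A back-of-envelope version of that comparison, assuming each output fabricates independently with a fixed probability. That is a simplification, since real errors tend to cluster by topic. The 0.2 rate comes from the two-in-ten example above; the batch size of ten is arbitrary.

```python
def p_any_fabrication(per_output_rate, n_outputs):
    """Chance that at least one of n independent outputs contains a fabrication."""
    if not 0.0 <= per_output_rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return 1.0 - (1.0 - per_output_rate) ** n_outputs

# Two fabrications in ten outputs -> 0.2 per output.
print(round(p_any_fabrication(0.2, 10), 3))  # 0.893
# A model that never invents.
print(p_any_fabrication(0.0, 10))  # 0.0
```

Under those assumptions, a ten-item deliverable from the two-in-ten model has roughly a nine-in-ten chance of containing at least one invention.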

The launch-day reviewers under-covered the floor.

Which model for which job

Two columns, plus one special case for search-grounded work. The honesty load on the task tells you which side you are on.

Honesty load high. Research with citations, or code that runs against external services where a wrong specific lands in someone’s production environment. Use Claude Opus 4.7. Append the verify-or-decline pattern below. Cross-check the output against a second model before shipping anything that gets your name on it.

Honesty load low or self-verifiable. Brainstorming, creative drafting, image-aware editing, structural feedback on a piece you wrote yourself. GPT-5.5 is fine here. The user catches fabrication on read-through because the user already knows the territory. Speed and creative range win when the user is the verifier.

For Google Search-grounded work such as recent events and live web fact-checks, Gemini 3.1 Pro gets the nod. Grounding answers in live search results lowers the hallucination rate on time-sensitive queries.
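The routing in this section fits in a few lines. This is a sketch of the article's recommendations only; the function name, the `honesty_load` values, and the `needs_live_web` flag are illustrative and not part of any vendor API.

```python
def pick_model(honesty_load, needs_live_web=False):
    """Map a task to the model this article recommends for it.

    `honesty_load` is "high" or "low". Both parameter names are
    hypothetical labels for the categories described in the text.
    """
    if needs_live_web:
        return "Gemini 3.1 Pro"
    if honesty_load == "high":
        return "Claude Opus 4.7"
    if honesty_load == "low":
        return "GPT-5.5"
    raise ValueError(f"unknown honesty load: {honesty_load!r}")

print(pick_model("high"))                      # Claude Opus 4.7
print(pick_model("low"))                       # GPT-5.5
print(pick_model("low", needs_live_web=True))  # Gemini 3.1 Pro
```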

The prompt that kills 80 percent of fabrications

Append this to any prompt asking the model for a specific verifiable value. It works across all three frontier models. Developer reports put the drop in fabrication rate at 80 to 90 percent in Opus 4.7 sessions, and launch-week tests suggest similar reductions in GPT-5.5 sessions.

If you give me a specific identifier (commit hash, file path, API endpoint,
URL, version number, date, or quote), confirm in the same response that it
is a real, verifiable value you are certain about. If you are not certain,
say "I do not have access to verify this" instead of producing the value.

Two things this does. First, it forces the model to either commit to verifiability or decline. Second, when the model declines, you know to look the value up yourself instead of trusting the output blind. The model becomes a reliable reasoning partner that hands the verification step back to you rather than papering over it with a plausible-sounding invention.
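If you send prompts from a script, the pattern can be wired in with two small helpers. A minimal sketch: the helper names are my own, and the decline check is a plain substring match, so it will miss a model that paraphrases the refusal.

```python
VERIFY_OR_DECLINE = (
    "If you give me a specific identifier (commit hash, file path, API endpoint, "
    "URL, version number, date, or quote), confirm in the same response that it "
    "is a real, verifiable value you are certain about. If you are not certain, "
    'say "I do not have access to verify this" instead of producing the value.'
)
DECLINE_PHRASE = "i do not have access to verify this"

def with_verify_or_decline(prompt):
    """Append the verify-or-decline instruction to a prompt."""
    return f"{prompt.rstrip()}\n\n{VERIFY_OR_DECLINE}"

def model_declined(response):
    """True if the response contains the decline phrase (case-insensitive)."""
    return DECLINE_PHRASE in response.lower()

print(model_declined("I do not have access to verify this."))  # True
print(model_declined("The survey reports 41 percent."))        # False
```

A True from `model_declined` is your cue to look the value up yourself.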

What the launch-day discourse missed

Reviewers covered the Index score and called it. The dominant frame was “GPT-5.5 wins the AI race.”

The frame stands on raw IQ. Once you measure practical reliability, it falls apart. For the developer shipping production code, the consultant assembling a deck a client will reference, the staffer running research a journalist will quote, the freelancer producing copy a marketing team will publish, the column that matters is the one nobody covered.

The framing problem is not OpenAI’s. The framing problem is the discourse picking the metric that sounds most impressive and ignoring the metric that determines whether the output is shippable.

This is why the pillar piece on AI failures names GPT-5.5's hallucination rate as one of three documented model regressions in 2026. The other two are the Opus 4.7 confab spike and the Google AI Overview reliability problem. None of the three is user error. All three were documented this month, and the user-side mitigation is the same pattern in every case.

The “smart” model just made you slower

The benchmark says GPT-5.5 won. The hallucination column says you have to verify every output by hand. Verifying every output is the work the smart model was supposed to save you from. Smart that you cannot ship is not smart.


FAQ

Why does GPT-5.5 hallucinate more than Claude Opus 4.7?

The two models optimize different things. GPT-5.5 was tuned to maximize Index score, which rewards confident reasoning across broad evals. Confidence and fabrication track together when the model is rewarded for sounding sure. Opus 4.7 was tuned with stronger refusal-when-uncertain training, which trades some Index points for higher honesty. The 50-point gap in hallucination rate is the visible cost of that tradeoff.

What is the Artificial Analysis Intelligence Index?

The Artificial Analysis Intelligence Index is a composite score that aggregates a model’s performance across reasoning, math, code, general knowledge, and instruction-following evals. It weights all of those roughly equally. Hallucination rate is reported alongside the Index but is not folded into the Index number. A model can score high on the Index and still produce unreliable output for production work, which is exactly the GPT-5.5 case.

Will GPT-5.5’s hallucination rate go down with future updates?

Probably yes, partly. OpenAI ships dot-release patches to GPT models to address user-reported failures, and hallucination rate is a known target. The Opus 4.7 confab spike documented in April 2026 is already getting community-shared mitigations, and the same pattern will apply to GPT-5.5. Expect the rate to drop into the 60-percent range over the next two months, still well above Opus 4.7’s 36 percent.

How do I cross-check between Claude and GPT?

Run the same prompt in both models. Read both outputs side by side. If Opus 4.7 says one thing and GPT-5.5 says the opposite, you know at least one of them is wrong, and the disagreement tells you where to look. Agreement is not proof: if both produce the same specific and it still turns out to be wrong, the question itself is probably the issue and you need a different source. The cross-check costs you 30 seconds and prevents most fabrication errors that would otherwise ship into your work. The full pattern is in the Opus 4.7 prompting guide.
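For numeric specifics, the side-by-side read can be partly automated. A rough sketch that only compares the numbers found in two outputs. The regex and function names are my own, and the sample strings are made up to echo the survey example earlier in the piece.

```python
import re

# Matches integers and simple decimals such as 41 or 3.1.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text):
    """Set of numeric strings found in the text."""
    return set(NUMBER.findall(text))

def disagreements(output_a, output_b):
    """Numbers that appear in one model's output but not the other's."""
    return sorted(numbers_in(output_a) ^ numbers_in(output_b))

a = "63 percent of US small businesses use AI tools weekly."
b = "41 percent of US small businesses use AI tools weekly."
print(disagreements(a, b))  # ['41', '63']
```

An empty list means the two outputs agree on their numbers, not that the numbers are right.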