Stanford AI Index 2026: Models Now Clear 50% on Humanity’s Last Exam


Stanford University’s Institute for Human-Centered AI published its 2026 AI Index Report this week – 423 pages covering technical performance, investment, workforce effects, and public sentiment across nine chapters. The headline finding is one of the sharpest benchmark jumps in the report’s history.

Humanity’s Last Exam

Humanity’s Last Exam is a benchmark built from questions submitted by nearly 1,000 subject-matter experts – primarily professors, researchers, and holders of doctoral degrees – designed to represent the hardest problems in their fields. In 2025, the top-scoring model, OpenAI’s o1, answered 8.8% of questions correctly. The 2026 AI Index reports the current overall figure at 38.3%, and notes that the best models available as of April 2026 – including Anthropic’s Claude Opus 4.6 and Google’s Gemini 3.1 Pro – are clearing 50%. Evaluations designed to remain relevant for years are saturating in months.

Other benchmark movements

On SWE-bench Verified, which measures a model’s ability to resolve real GitHub software engineering issues, scores rose from 60% to near 100% in a single year. On GPQA, which tests graduate-level science questions requiring multi-step reasoning, models hit 93% accuracy against a human expert baseline of 81.2%. Google’s Gemini Deep Think won a gold medal at the International Mathematical Olympiad.

The jagged frontier

Stanford’s report documents what it calls a jagged frontier in AI capability. The same models producing historic scores on PhD-level benchmarks read analog clocks correctly just 50.1% of the time. Robotic manipulation systems achieve 89.4% success in simulation but only 12% in real household tasks. Hallucination rates remain a concern: GPT-4o’s accuracy dropped from 98.2% to 64.4% when false statements were framed as the user’s own beliefs rather than attributed to a third party.

Adoption and investment

Generative AI reached 53% global adoption within three years of ChatGPT’s launch – faster than either the personal computer or the internet reached comparable levels. AI adoption in the technology sector stands at 88%. Four in five university students now use generative AI. US private AI investment reached $285.9 billion in 2025, roughly 23 times China’s total.

The US-China gap

The performance gap between the leading US and Chinese AI models has narrowed to 2.7%. As of March 2026, Anthropic’s leading model holds the top Arena Elo score of 1,503. xAI sits at 1,495, Google at 1,494, OpenAI at 1,481. ByteDance’s Dola-Seed-2.0-Preview stands at 1,464. US and Chinese models have traded the top position multiple times since early 2025. China leads in research publication volume, citations, and industrial robotics; the US leads in number of top-tier models, high-impact patents, and private investment.
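To put those Arena Elo numbers in perspective, the standard Elo expected-score formula converts a rating gap into a head-to-head win probability. The formula is the conventional one used by Elo-style leaderboards (an assumption about how these particular ratings were computed, not a detail from the report); the ratings plugged in are the ones cited above:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Top-rated model (1,503) vs. fourth-place model (1,481): a 22-point gap.
print(f"{elo_expected_score(1503, 1481):.3f}")  # ≈ 0.532
```

A 22-point Elo gap implies the leader is preferred in only about 53% of pairwise comparisons – barely better than a coin flip, which is what makes the top of the leaderboard so volatile.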

Safety and public trust

Documented AI incidents rose to 362 in 2025, up from 233 in 2024. Nearly all major frontier labs report on capability benchmarks; reporting on safety and responsible AI benchmarks remains inconsistent. Stanford’s authors note that improving one responsible AI dimension can degrade another, making tradeoffs harder to navigate than they appear.
