Stanford AI Index 2026: Models Now Clear 50% on Humanity’s Last Exam


Stanford University’s Institute for Human-Centered AI published its 2026 AI Index Report this week – 423 pages covering technical performance, investment, workforce effects, and public sentiment across nine chapters. The headline finding is one of the sharpest benchmark jumps in the report’s history.

Humanity’s Last Exam

Humanity’s Last Exam is a benchmark built from questions submitted by nearly 1,000 subject-matter experts – primarily professors, researchers, and holders of doctoral degrees – designed to represent the hardest problems in their fields. In 2025, the top-scoring model, OpenAI’s o1, answered 8.8% of questions correctly. The 2026 AI Index reports the current overall figure at 38.3%, and notes that the best models available as of April 2026 – including Anthropic’s Claude Opus 4.6 and Google’s Gemini 3.1 Pro – are clearing 50%. Evaluations designed to remain relevant for years are saturating in months.

Other benchmark movements

On SWE-bench Verified, which measures a model’s ability to resolve real GitHub software engineering issues, scores rose from 60% to near 100% in a single year. On GPQA, which tests graduate-level science questions requiring multi-step reasoning, models hit 93% accuracy against a human expert baseline of 81.2%. Google’s Gemini Deep Think won a gold medal at the International Mathematical Olympiad.

The jagged frontier

Stanford’s report documents what it calls a jagged frontier in AI capability. The same models producing historic scores on PhD-level benchmarks read analog clocks correctly just 50.1% of the time. Robotic manipulation systems achieve 89.4% success in simulation but only 12% in real household tasks. Hallucination rates remain a concern: GPT-4o’s accuracy dropped from 98.2% to 64.4% when false statements were framed as the user’s own beliefs rather than attributed to a third party.

Adoption and investment

Generative AI reached 53% global adoption within three years of ChatGPT’s launch – faster than either the personal computer or the internet reached comparable levels. AI adoption in the technology sector stands at 88%. Four in five university students now use generative AI. US private AI investment reached $285.9 billion in 2025, roughly 23 times China’s total.

The US-China gap

The performance gap between the leading US and Chinese AI models has narrowed to 2.7%. As of March 2026, Anthropic’s leading model holds the top Arena Elo score of 1,503. xAI sits at 1,495, Google at 1,494, OpenAI at 1,481. ByteDance’s Dola-Seed-2.0-Preview stands at 1,464. US and Chinese models have traded the top position multiple times since early 2025. China leads in research publication volume, citations, and industrial robotics; the US leads in number of top-tier models, high-impact patents, and private investment.
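To put those Arena Elo numbers in perspective, the standard Elo expected-score formula converts a rating gap into a head-to-head win probability. The formula is the conventional one used by Elo-style leaderboards (an assumption about how these particular ratings were computed, not a detail from the report); the ratings plugged in are the ones cited above:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Top-rated model (1,503) vs. fourth-place model (1,481): a 22-point gap.
print(f"{elo_expected_score(1503, 1481):.3f}")  # ≈ 0.532
```

A 22-point Elo gap implies the leader is preferred in only about 53% of pairwise comparisons – barely better than a coin flip, which is what makes the top of the leaderboard so volatile.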

Safety and public trust

Documented AI incidents rose to 362 in 2025, up from 233 in 2024. Nearly all major frontier labs report on capability benchmarks; reporting on safety and responsible AI benchmarks remains inconsistent. Stanford’s authors note that improving one responsible AI dimension can degrade another, making tradeoffs harder to navigate than they appear.
