Frontier AI at Expert-Level Mathematical Reasoning
The empirical performance of frontier AI on mathematical benchmarks has collapsed the assumption that mathematical reasoning is a distinctively human capability. AlphaProof (DeepMind, 2024), paired with AlphaGeometry 2, solved four of the six problems from the 2024 IMO, scoring at silver-medal standard and one point below the gold threshold. GPT-4o solves roughly three-quarters of the problems on the MATH benchmark (competition mathematics), and frontier models have been reported to handle Putnam problems at a level competitive with strong undergraduates. FrontierMath, a benchmark of expert-level problems designed to resist AI solvers, was solved at over 25% by a frontier model within months of release, a result the benchmark's designers did not anticipate. This is not extrapolation; it is documented, reproducible performance.