AI Fails Research-Level Math Test Designed To Stop Cheating, While Human Mathematicians Solve Every Problem

By Siddhi Vinayak Misra
2 hours ago

Artificial intelligence has made remarkable progress in mathematics, from assisting researchers with complex proofs to solving problems that have challenged experts for decades. But a new benchmark suggests today’s AI systems still struggle when faced with genuinely novel mathematical research.

In a newly published study introducing a benchmark called First Proof, four leading AI systems were tested on 10 previously unpublished research-level mathematics problems. None achieved a perfect score, although every problem had already been solved by the mathematicians who created the test.

The findings highlight an important limitation of today’s large language models: they excel when a problem resembles material they have already encountered but remain less reliable when tackling entirely new mathematical problems.

TL;DR

Researchers created 10 original research-level math problems that had never been published.
Four AI systems attempted the problems without access to prior solutions.
The best-performing AI solved six out of ten problems.
Human mathematicians had previously solved all 10.
The benchmark was designed to test genuine mathematical reasoning rather than memorization.
Researchers say AI still has significant limitations as an autonomous research mathematician.

What is the First Proof benchmark?

First Proof is a new evaluation designed to measure whether artificial intelligence can solve genuinely original mathematics.

Traditional AI benchmarks often rely on published questions or datasets that models may have encountered during training.

To avoid this problem, researchers created an entirely new challenge.

Ten mathematicians from different mathematical specialties each contributed a problem they had personally solved in the past but had never published.

That meant the questions were absent from the following:

Research journals.
Online databases.
Books.
Public datasets.
AI training material.

The goal was simple: determine whether AI could reason through brand-new mathematics instead of recalling existing knowledge.

Why was this math test different?

One of the biggest challenges in evaluating AI is ensuring it cannot rely on memorized information.

Large language models are trained on enormous collections of publicly available text, including books, academic papers, and websites.

If a benchmark contains published material, an AI system may recognize familiar patterns rather than independently solving the problem.

The First Proof benchmark was specifically designed to eliminate that possibility.

Because none of the questions had ever appeared publicly, success depended entirely on reasoning ability.

This makes the benchmark a closer approximation of the challenges faced by professional mathematicians conducting original research.

Which AI models took part?

The competition focused on publicly available AI systems capable of autonomous mathematical reasoning.

Researchers excluded specialized experimental systems that are not publicly accessible, including Google’s unreleased Aletheia and Anthropic’s unreleased Claude Mythos.

Instead, four entries participated:

OpenAI’s ChatGPT 5.5 Pro.
A research system developed by the Swiss Federal Institute of Technology (ETH Zurich) using ChatGPT.
A University of California, Los Angeles (UCLA) system built around ChatGPT.
A Princeton University system using Gemini 3.1 Pro.

The university teams developed automated “harnesses” that repeatedly prompted, evaluated, and refined AI-generated solutions without human intervention during testing.
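The article does not describe how any team's harness was actually built, so the following is only a minimal Python sketch of what a prompt, evaluate, and refine loop can look like. The names run_harness, generate, evaluate, and max_rounds are hypothetical, and the toy functions stand in for real model calls and automated checkers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    proof: str       # the draft produced in this round
    feedback: str    # the automated critique of that draft
    accepted: bool   # whether the checker accepted the draft


def run_harness(
    problem: str,
    generate: Callable[[str, Optional[Attempt]], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 5,
) -> List[Attempt]:
    """Prompt, evaluate, and refine until a draft is accepted or rounds run out.

    Hypothetical illustration only: not the code used by any team in the study.
    """
    history: List[Attempt] = []
    previous: Optional[Attempt] = None
    for _ in range(max_rounds):
        # Prompt the model; after the first round it also sees the last critique.
        proof = generate(problem, previous)
        # Automated check with no human in the loop.
        accepted, feedback = evaluate(problem, proof)
        previous = Attempt(proof, feedback, accepted)
        history.append(previous)
        if accepted:
            break
    return history


def toy_generate(problem: str, previous: Optional[Attempt]) -> str:
    """Stand-in for a model call: numbers each successive draft."""
    if previous is None:
        return "draft 1"
    next_number = int(previous.proof.split()[-1]) + 1
    return f"draft {next_number}"


def toy_evaluate(problem: str, proof: str) -> Tuple[bool, str]:
    """Stand-in for a checker: only the third draft is accepted."""
    if proof == "draft 3":
        return True, "looks complete"
    return False, "gap in the argument"


if __name__ == "__main__":
    for attempt in run_harness("toy problem", toy_generate, toy_evaluate):
        print(attempt.proof, "|", attempt.feedback, "|", attempt.accepted)
```

With these toy stand-ins the loop stops at the third draft. If max_rounds is set to 2, it ends without an accepted proof, which is roughly how a fixed compute budget would cap a real harness.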

How did the AI models perform?

The results showed meaningful progress but also clear limitations.

The highest-performing system solved six of the ten research problems.

The remaining systems scored lower.

Final rankings were:

ETH Zurich’s ChatGPT-based harness.
UCLA’s ChatGPT-based harness.
OpenAI’s standalone ChatGPT 5.5 Pro.
Princeton University’s Gemini-based harness.

Meanwhile, every one of the 10 problems had already been solved by the expert mathematicians who originally created them.

That contrast suggests that experienced human researchers continue to outperform today’s AI on original mathematical discovery.

Why couldn’t AI solve all the problems?

The results do not necessarily mean AI lacks mathematical ability.

Instead, they highlight the difference between solving familiar problems and producing genuinely original mathematical reasoning.

Large language models are exceptionally good at:

Recognizing patterns.
Applying known mathematical techniques.
Summarizing proofs.
Assisting with calculations.
Generating ideas.

Research-level mathematics often demands something different.

Mathematicians must:

Invent entirely new approaches.
Connect distant areas of mathematics.
Develop rigorous proofs from first principles.
Eliminate subtle logical errors.

Those creative leaps remain difficult for current AI systems.

Does this mean AI is bad at mathematics?

Not at all.

Recent AI systems have achieved impressive mathematical milestones.

They can already:

Solve many competition-level problems.
Assist researchers in verifying proofs.
Generate useful mathematical conjectures.
Explain advanced concepts.
Accelerate literature reviews.

Several AI models have even contributed to research projects by suggesting proof strategies or identifying overlooked connections.

However, the First Proof benchmark demonstrates that AI still struggles to function as an independent research mathematician.

Rather than replacing experts, today’s systems remain best suited as collaborative tools.

Why does this benchmark matter?

Reliable evaluation has become one of the biggest challenges in AI research.

As models improve, many traditional benchmarks become easier because solutions already exist online.

Fresh benchmarks such as First Proof provide researchers with a better understanding of how much genuine reasoning AI has developed.

The findings also help answer an increasingly important question:

Can AI independently generate new mathematical knowledge?

For now, the answer appears to be “not consistently.”

What does this mean for the future of AI research?

The researchers behind First Proof say the benchmark will continue evolving with additional unpublished problems.

Future editions could help track when AI systems become capable of consistently solving original research questions without relying on previously available information.

Until then, mathematicians remain essential for:

Creating new theories.
Designing novel proof techniques.
Validating AI-generated arguments.
Identifying subtle mistakes.
Expanding mathematical knowledge.

Rather than replacing researchers, AI currently appears most valuable as a sophisticated assistant that accelerates parts of the discovery process while leaving the deepest conceptual breakthroughs to human experts.

The bigger picture

Artificial intelligence continues to advance rapidly, but benchmarks like First Proof remind us that progress is rarely linear.

Today’s leading models can outperform humans on many standardized exams and routine mathematical tasks, yet they still struggle when confronted with problems that have never been seen before.

That distinction matters because genuine scientific progress depends not just on recalling existing knowledge but on creating entirely new ideas. For now, human mathematicians continue to hold the edge where originality matters most.
