AI Fails Research-Level Math Test Designed To Stop Cheating, While Human Mathematicians Solve Every Problem

Artificial intelligence has made remarkable progress in mathematics, from assisting researchers with complex proofs to solving problems that have challenged experts for decades. But a new benchmark suggests today’s AI systems still struggle when faced with genuinely novel mathematical research.

In a newly published study introducing a benchmark called First Proof, four leading AI systems were tested on 10 previously unpublished research-level mathematics problems. None achieved a perfect score, even though every problem had already been solved by the human mathematicians who created the test.

The findings highlight an important limitation of today’s large language models: they excel when patterns resemble information they’ve already encountered but remain less reliable when tackling entirely new mathematical discoveries.

What is the First Proof benchmark?

First Proof is a new evaluation designed to measure whether artificial intelligence can solve genuinely original mathematics.

Traditional AI benchmarks often rely on published questions or datasets that models may have encountered during training.

To avoid this problem, researchers created an entirely new challenge.

Ten mathematicians from different mathematical specialties each contributed a problem they had personally solved in the past but had never published.

That meant the questions were absent from the following:

The goal was simple: determine whether AI could reason through brand-new mathematics instead of recalling existing knowledge.

Why was this math test different?

One of the biggest challenges in evaluating AI is ensuring it cannot rely on memorized information.

Large language models are trained on enormous collections of publicly available text, including books, academic papers, and websites.

If a benchmark contains published material, an AI system may recognize familiar patterns rather than independently solving the problem.

The First Proof benchmark was specifically designed to eliminate that possibility.

Because none of the questions had ever appeared publicly, success depended entirely on reasoning ability.

This makes the benchmark a closer approximation of the challenges faced by professional mathematicians conducting original research.

Which AI models took part?

The competition focused on publicly available AI systems capable of autonomous mathematical reasoning.

Researchers excluded specialized experimental systems that are not publicly accessible, including Google’s unreleased Aletheia and Anthropic’s unreleased Claude Mythos.

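Instead, four entries participated: ChatGPT-based harnesses built by teams at ETH Zurich and UCLA, a Gemini-based harness from Princeton University, and OpenAI’s standalone ChatGPT 5.5 Pro.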

The university teams developed automated “harnesses” that repeatedly prompted, evaluated, and refined AI-generated solutions without human intervention during testing.

How did the AI models perform?

The results showed meaningful progress but also clear limitations.

The highest-performing system solved six of the 10 research problems.

The remaining systems scored lower.

Final rankings were:

  1. ETH Zurich’s ChatGPT-based harness.
  2. UCLA’s ChatGPT-based harness.
  3. OpenAI’s standalone ChatGPT 5.5 Pro.
  4. Princeton University’s Gemini-based harness.

Meanwhile, every one of the 10 problems had already been solved by the expert mathematicians who originally created them.

That contrast demonstrates that experienced human researchers continue to outperform today’s AI on original mathematical discovery.

Why couldn’t AI solve all the problems?

The results do not necessarily mean AI lacks mathematical ability.

Instead, they highlight the difference between solving familiar problems and producing genuinely original mathematical reasoning.

Large language models are exceptionally good at:

Research-level mathematics often demands something different.

Mathematicians must:

Those creative leaps remain difficult for current AI systems.

Does this mean AI is bad at mathematics?

Not at all.

Recent AI systems have achieved impressive mathematical milestones.

They can already:

Several AI models have even contributed to research projects by suggesting proof strategies or identifying overlooked connections.

However, the First Proof benchmark demonstrates that AI still struggles to function as an independent research mathematician.

Rather than replacing experts, today’s systems remain best suited as collaborative tools.

Why does this benchmark matter?

Reliable evaluation has become one of the biggest challenges in AI research.

As models improve, many traditional benchmarks become easier because solutions already exist online.

Fresh benchmarks such as First Proof provide researchers with a better understanding of how much genuine reasoning AI has developed.

The findings also help answer an increasingly important question:

Can AI independently generate new mathematical knowledge?

For now, the answer appears to be “not consistently.”

What does this mean for the future of AI research?

The researchers behind First Proof say the benchmark will continue evolving with additional unpublished problems.

Future editions could help track when AI systems become capable of consistently solving original research questions without relying on previously available information.

Until then, mathematicians remain essential for:

Rather than replacing researchers, AI currently appears most valuable as a sophisticated assistant that accelerates parts of the discovery process while leaving the deepest conceptual breakthroughs to human experts.

The bigger picture

Artificial intelligence continues to advance rapidly, but benchmarks like First Proof remind us that progress is rarely linear.

Today’s leading models can outperform humans on many standardized exams and routine mathematical tasks, yet they still struggle when confronted with problems that have never been seen before.

That distinction matters because genuine scientific progress depends not just on recalling existing knowledge but on creating entirely new ideas. For now, human mathematicians continue to hold the edge where originality matters most.
