OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied
A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model-testing practices.
When OpenAI unveiled o3 in December, the company claimed the model could answer just over a quarter of the questions on FrontierMath, a challenging set of math problems. That score blew the competition away; the next-best model managed to answer only about 2% of FrontierMath problems correctly.
“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”
As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.
Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.
OpenAI launched o3, its long-awaited reasoning model, alongside o4-mini, a smaller and cheaper model that succeeds o3-mini.
We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5GBTZKEY1B
– Epoch AI (@epochairesearch) April 18, 2025
That doesn’t mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.
“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [compute], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs. the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.
Meanwhile, according to a post on X from the Arc Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.
“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote Arc Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.
Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.
Still, it’s another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.
Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.
In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.
More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.