OpenAI’s o3 shows remarkable progress on ARC-AGI, sparking debate about AI reasoning



OpenAI’s latest o3 model has achieved a breakthrough that surprised the AI research community. o3 scored an unprecedented 75.7% on the super-difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.

While the achievement on ARC-AGI is impressive, it does not mean that the code for artificial general intelligence (AGI) has been cracked.

Abstraction and Reasoning Corpus

The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus, which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of a set of visual puzzles that require an understanding of basic concepts such as objects, boundaries, and spatial relationships. While humans can easily solve ARC puzzles after very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging benchmarks in AI.

Example of an ARC puzzle (source: arcprize.org)

ARC is designed so that it cannot be gamed by training models on millions of examples in the hope of covering all possible puzzle combinations.

The benchmark consists of a public training set of 400 simple examples, supplemented by a public evaluation set of 400 more challenging puzzles that measure how well AI systems generalize. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each that are not shared with the public. These are used to evaluate candidate AI systems without the risk of the data leaking and contaminating future systems with prior knowledge. In addition, the competition places limits on the amount of compute that participants can use, ensuring the puzzles cannot be solved by brute-force methods.
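For readers who want to experiment, the public training and evaluation tasks are distributed as JSON files in the fchollet/ARC repository on GitHub; each file pairs a few demonstration grids with one or more test grids. Below is a minimal loading sketch, assuming the repository has been cloned locally (the task filename is just an illustrative example):

```python
import json

# Each ARC task file pairs demonstration input/output grids with test grids.
# Grids are lists of lists of integers 0-9, where each integer is a color.
with open("ARC/data/training/0a938d79.json") as f:  # example task file
    task = json.load(f)

for pair in task["train"]:   # demonstration input/output pairs
    print(pair["input"], "->", pair["output"])

for pair in task["test"]:    # test inputs the solver must answer
    print("solve:", pair["input"])
```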

Breakthrough in solving new tasks

o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another method, developed by researcher Jeremy Berman, used a hybrid approach that combined Claude 3.5 Sonnet with genetic algorithms and a code interpreter to reach 53%, the highest score before o3.

In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.”

It is important to note that these results could not have been achieved by simply applying more compute to previous generations of models. For context, it took four years for models to go from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don’t know much about o3’s architecture, we can be confident that it is not orders of magnitude larger than its predecessors.

Performance of different models on ARC-AGI (source: arcprize.org)

“This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”

It’s worth noting that o3’s performance on ARC-AGI comes at a steep cost. On the low-compute configuration, it costs the model $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget, the model uses about 172 times more compute and billions of tokens per problem. However, as the cost of inference continues to fall, these figures should become more reasonable.
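A back-of-the-envelope calculation, using only the figures quoted above rather than official pricing, shows why the high-compute setting is so expensive:

```python
# Rough estimate based on the figures reported above; not official OpenAI pricing.
low_cost_per_puzzle = (17 + 20) / 2  # ~$18.50 on the low-compute configuration
compute_multiplier = 172             # high-compute budget vs. low-compute

# If cost scales roughly linearly with compute, each high-compute puzzle
# lands in the thousands of dollars.
high_cost_estimate = low_cost_per_puzzle * compute_multiplier
print(f"~${high_cost_estimate:,.0f} per puzzle")  # ~$3,182
```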

A New Paradigm in LLM Reasoning?

The key to solving novel problems is what Chollet and other scientists call “program synthesis.” A thinking system must be able to develop small programs for solving very specific problems, then combine those programs to tackle more complex ones. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that lie outside their training distribution.
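As a toy illustration of the idea (not Chollet’s formalism, and not anything o3 is known to do), program synthesis over grid puzzles can be framed as searching over compositions of primitive transformations until one explains all the demonstration pairs:

```python
from itertools import product

# Toy primitives over grids (lists of lists of ints); purely illustrative.
def flip_h(g): return [row[::-1] for row in g]
def flip_v(g): return g[::-1]
def transpose(g): return [list(col) for col in zip(*g)]

PRIMITIVES = [flip_h, flip_v, transpose]

def synthesize(pairs, max_depth=2):
    """Search compositions of primitives that explain all demo pairs."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def program(g, combo=combo):
                for fn in combo:
                    g = fn(g)
                return g
            if all(program(x) == y for x, y in pairs):
                return program
    return None

demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]  # hypothetical demo pair
prog = synthesize(demos)
print(prog([[5, 6], [7, 8]]) if prog else "no program found")
```

Real ARC tasks, of course, require a far richer library of primitives and a much smarter search than this brute-force enumeration.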

Unfortunately, there is very little information about how o3 works under the hood, and here scientists’ opinions diverge. Chollet speculates that o3 uses a type of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in the past few months.
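A highly simplified sketch of that hypothesis follows; both functions are hypothetical stand-ins, not OpenAI’s implementation. The idea is to sample several reasoning traces, score them with a reward model, and keep the best one:

```python
import random

# Hypothetical stand-ins: neither function reflects OpenAI's actual system.
def generate_chain_of_thought(puzzle):
    """Stand-in for sampling one CoT trace from a language model."""
    return {"steps": [f"reasoning about {puzzle}"],
            "answer": random.choice(["A", "B"])}

def reward_model(chain):
    """Stand-in for a learned verifier that scores a reasoning trace."""
    return random.random()

def solve_with_search(puzzle, n_samples=16):
    # Best-of-N search: sample many reasoning traces, keep the top-scoring one.
    chains = [generate_chain_of_thought(puzzle) for _ in range(n_samples)]
    return max(chains, key=reward_model)["answer"]

print(solve_with_search("example ARC grid"))
```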

Other scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that “o1 and o3 may actually be just the forward passes from one language model.” On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”

That same day, Denny Zhou of Google DeepMind’s reasoning team called the combination of search approaches and current reinforcement learning methods a “dead end.”

“The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g., MCTS) over the generation space, whether by a well-finetuned model or a carefully designed prompt,” he posted on X.

While the details of how o3 reasons may seem trivial next to the ARC-AGI breakthrough, they may very well define the next paradigm shift in training LLMs. There is an ongoing debate over whether scaling LLMs through training data and compute has hit a wall. Whether test-time scaling depends on better training data or on different inference architectures may determine the next path forward.

Not AGI

The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet emphasizes that “ARC-AGI is not an acid test for AGI.”

“Passing ARC-AGI does not equate to achieving AGI and, in fact, I don’t think o3 is AGI yet,” he wrote. “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.”

Moreover, he notes that o3 cannot autonomously learn these skills: it relies on external verifiers during inference and on human-labeled reasoning chains during training.

Other scientists have pointed out flaws in OpenAI’s reported results. For example, the model was fine-tuned on the ARC training set to achieve state-of-the-art results. “The solver should not need much specific ‘training,’ either on the domain itself or on each specific task,” wrote scientist Melanie Mitchell.

To test whether these models possess the kind of abstraction and reasoning that the ARC benchmark was designed to measure, Mitchell suggests “seeing if these systems can adapt to variants on specific tasks or to reasoning tasks using the same concepts, but in other domains than ARC.”

Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even at a high-compute budget. Meanwhile, humans would be able to solve 95% of the puzzles without any training.

“You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but difficult for AI becomes simply impossible,” Chollet wrote.


 