OpenAI claims that its new model has reached human level in a test of “general intelligence”. What does this mean?
A new artificial intelligence (AI) model has just achieved human-level results on a test designed to measure “general intelligence”.
On December 20, OpenAI’s o3 system scored 85% on the ARC-AGI benchmark, well above the previous best AI score of 55% and on par with the average human score. It also performed well on a very difficult mathematics test.
The creation of Artificial General Intelligence, or AGI, is the stated goal of all major AI research labs. At first glance, it appears that OpenAI has at least taken a significant step toward this goal.
While skepticism remains, many AI researchers and developers feel that something has just shifted. For many, the prospect of AGI now seems more real, urgent and closer than anticipated. Are they right?
Generalization and intelligence
To understand what the o3 score means, you need to understand what the ARC-AGI test is. In technical terms, it’s a test of an AI system’s “sample efficiency” in adapting to something new: how many examples of a novel situation the system needs to see in order to work out how it behaves.
An AI system like ChatGPT (GPT-4) is not very sample efficient. It was “trained” on millions of examples of human text, constructing probabilistic “rules” about which combinations of words are most likely.
The result is that it is quite good at common tasks, but bad at uncommon ones, because it has less data (fewer samples) about those tasks.
Until AI systems can learn from small numbers of examples and adapt with more sample efficiency, they will be used only for very repetitive jobs and ones where the occasional failure is tolerable.
The ability to accurately solve previously unknown or novel problems from limited samples of data is known as generalizability. It is widely considered a necessary, even essential, element of intelligence.
Grids and patterns
The ARC-AGI benchmark tests for sample-efficient adaptation using small grid-square problems like the one below. The AI needs to work out the pattern that turns the grid on the left into the grid on the right.
[Image: an example ARC-AGI grid puzzle. Credit: ARC Prize]
Each question provides three examples to learn from. The AI system then needs to work out the rules that “generalize” from the three examples to the fourth.
These are a lot like the IQ tests you might remember from school.
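The task format can be sketched in code. The following is a toy illustration only (real ARC-AGI puzzles are far richer, and the candidate rules here are made up for the sketch): grids are arrays of color codes, and a solver must find a transformation consistent with every training pair before applying it to the test input.

```python
# Toy ARC-style task: grids are lists of lists of ints (colors).
# The "solver" tries a few hand-picked candidate transformations and
# keeps the one consistent with all training examples. Illustrative
# only -- real ARC tasks need far richer hypothesis spaces.

def flip_h(g): return [row[::-1] for row in g]       # mirror left-right
def flip_v(g): return g[::-1]                        # mirror top-bottom
def transpose(g): return [list(c) for c in zip(*g)]  # swap rows/columns
def identity(g): return [row[:] for row in g]

CANDIDATES = [identity, flip_h, flip_v, transpose]

def solve(train_pairs, test_input):
    """Return the first candidate rule matching every training pair,
    applied to the test input; None if no candidate generalizes."""
    for rule in CANDIDATES:
        if all(rule(inp) == out for inp, out in train_pairs):
            return rule(test_input)
    return None

# Three training examples, all consistent with a horizontal flip.
train = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
    ([[0, 5], [5, 0]], [[5, 0], [0, 5]]),
]
print(solve(train, [[7, 0, 0]]))  # [[0, 0, 7]]
```

The point of the sketch is the shape of the problem: a handful of input/output pairs, and a rule that must generalize from them to a new input.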
Weak rules and adaptation
We don’t know exactly how OpenAI has done it, but the results show the o3 model is highly adaptable. From just a few examples, it finds rules that can be generalized.
To figure out a pattern, we shouldn’t make any unnecessary assumptions, or be more specific than we really have to be. In theory, if you can identify the “weakest” rules that do what you want, then you have maximized your ability to adapt to new situations.
What do we mean by weakest rules? The technical definition is complicated, but weaker rules are usually the ones that can be described in simpler statements.
In the example above, a plain-English expression of the rule might be something like: “Any shape with a protruding line will move to the end of that line and ‘cover up’ any other shapes it overlaps with.”
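The idea of preferring weaker rules can be caricatured in a few lines of code. This is a crude stand-in for the formal notion (the rules and examples below are invented for the sketch): among several rules that all fit the same data, pick the one with the shortest description.

```python
# Toy "prefer the simplest rule" heuristic: several rules fit the
# training examples equally well; we break the tie by description
# length. The rules and data here are made up for illustration.
rules = {
    "add 2": lambda x: x + 2,
    "add 2 if x < 100 else add 3": lambda x: x + 2 if x < 100 else x + 3,
    "double, then subtract x minus 2": lambda x: 2 * x - x + 2,
}
examples = [(1, 3), (5, 7), (10, 12)]

# Keep only rules consistent with every example.
fitting = {desc: f for desc, f in rules.items()
           if all(f(x) == y for x, y in examples)}

best = min(fitting, key=len)  # shortest description wins
print(best)  # "add 2"
```

All three rules reproduce the examples, but only the weakest one, “add 2”, avoids assumptions the data never forced on us, so it is the one most likely to keep working on new inputs.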
Looking for chains of thought?
While we still don’t know how OpenAI achieved this result, it seems unlikely they deliberately optimized the o3 system to find weak rules. However, to succeed at the ARC-AGI tasks, it must be finding them.
We know that OpenAI started with a general-purpose version of the o3 model (which differs from most other models because it can spend more time “thinking” about difficult questions) and then trained it specifically for the ARC-AGI test.
French AI researcher François Chollet, who designed the benchmark, believes o3 searches through different “chains of thought” describing steps to solve the task. It would then choose the “best” according to some loosely defined rule, or “heuristic”.
This would be “not unlike” how Google’s AlphaGo system searches through various possible sequences of moves to beat the world champion at Go.
You can think of these chains of thought as programs that fit the examples. Of course, if it is like the Go-playing AI, then it needs a heuristic, or loose rule, to decide which program is best.
There may be thousands of different, seemingly equally valid programs generated. That heuristic could be “choose the weakest” or “choose the simplest”.
However, if it is like AlphaGo, then they simply had an AI create a heuristic. This was the process for AlphaGo: Google trained a model to rate different sequences of moves as better or worse than others.
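Chollet’s conjecture can be caricatured as a brute-force version of this search: enumerate candidate “programs” (here, sequences of made-up primitive grid operations), and return the shortest one consistent with the examples. This is purely illustrative and is not a claim about how o3 actually works.

```python
from itertools import product

# Toy program search with a "shortest program" heuristic: try all
# sequences of primitive grid operations, shortest first, and return
# the first one that reproduces every training example. The primitives
# are invented for this sketch.

def flip_h(g): return [row[::-1] for row in g]
def flip_v(g): return g[::-1]
def transpose(g): return [list(c) for c in zip(*g)]

PRIMS = [flip_h, flip_v, transpose]

def run(program, grid):
    for op in program:
        grid = op(grid)
    return grid

def search(train_pairs, max_len=3):
    """Shortest sequence of primitives consistent with all examples."""
    for n in range(1, max_len + 1):
        for program in product(PRIMS, repeat=n):
            if all(run(program, i) == o for i, o in train_pairs):
                return program
    return None

# Training pairs consistent with a 90-degree clockwise rotation,
# which no single primitive can produce on its own.
train = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2]], [[1], [2]]),
]
prog = search(train)
print([f.__name__ for f in prog])  # ['flip_v', 'transpose']
```

Even in this tiny space, several length-two programs implement the same rotation; the search simply returns the first shortest one it finds, which is the role the article assigns to the heuristic.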
What we don’t know yet
The question then is: is this really closer to AGI? If this is how o3 works, then the underlying model might not be much better than previous models.
The concepts the model learns from language may be no more generalizable than before. Instead, we may just be seeing a more generalizable “chain of thought”, found through the extra step of training a heuristic specialized to this test. The proof, as always, will be in the pudding.
Almost everything about o3 remains unknown. OpenAI has limited disclosure to a few media presentations and early testing to a handful of AI safety researchers, labs and institutions.
Truly understanding o3’s potential will require extensive work, including evaluations, an understanding of the distribution of its capacities, how often it fails and how often it succeeds.
When o3 is finally released, we’ll have a much better idea of whether it is roughly as adaptable as an average human.
If it is, it could have a huge, revolutionary economic impact, ushering in a new era of self-improving accelerated intelligence. We will require new benchmarks for AGI itself, and serious consideration of how it ought to be governed.
If not, then this will still be an impressive result. However, daily life will remain much the same.
Michael Timothy Bennett, PhD Student, School of Computing, Australian National University, and Elija Perrier, Research Fellow, Stanford Center for Responsible Quantum Technology, Stanford University
This article is republished from The Conversation under a Creative Commons license. Read the original article.