OpenAI’s o3 shows remarkable progress on ARC-AGI, sparking debate on AI reasoning

OpenAI’s latest o3 model has achieved a breakthrough that has surprised the AI research community. o3 scored an unprecedented 75.7% on the super-difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%. 

While the achievement in ARC-AGI is impressive, it does not yet prove that the code to artificial general intelligence (AGI) has been cracked.

Abstraction and Reasoning Corpus

The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus, which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC is composed of a set of visual puzzles that require understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles with very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging benchmarks for AI.

Example of ARC puzzle (source: arcprize.org)

ARC is designed so that it can’t be gamed by training models on millions of examples in the hope of covering every possible combination of puzzles.

The benchmark comprises a public training set of 400 simple examples and a public evaluation set of 400 more challenging puzzles that measure how well AI systems generalize. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each, which are not shared with the public; they are used to evaluate candidate AI systems without risking leaks that would contaminate future systems with prior knowledge. Furthermore, the competition caps the amount of compute participants can use, ensuring that the puzzles are not solved through brute force.
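
For the curious, the public ARC tasks are distributed as small JSON files (see github.com/fchollet/ARC), each holding a handful of demonstration pairs and one or more test inputs. The toy task and solver below are an illustrative sketch of that format, not a real benchmark puzzle:

```python
# Sketch of the ARC task format: each task is JSON with "train" and
# "test" lists of input/output grid pairs. Grids are lists of lists of
# ints 0-9, where each int encodes a color. This toy task is invented
# for illustration only.

task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must predict this output
    ],
}

def solve(grid):
    # Hypothetical solver for the toy task: mirror each row.
    return [row[::-1] for row in grid]

# A submission is judged by exact match on the hidden test outputs; here
# we can only check the solver against the demonstration pairs.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
```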

A breakthrough in solving novel tasks

o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another approach, developed by researcher Jeremy Berman, combined Claude 3.5 Sonnet with genetic algorithms and a code interpreter to reach 53%, the highest score before o3.
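
Berman has described his pipeline only at a high level, so the sketch below is a generic reconstruction of that kind of hybrid loop: an LLM proposes candidate Python solvers, each candidate is scored on the task’s demonstration pairs, and the best performers seed the next generation. The `propose` callable is a hypothetical stand-in for the LLM call:

```python
def fitness(program, train_pairs):
    """Fraction of demonstration pairs the candidate program solves."""
    solved = 0
    for pair in train_pairs:
        try:
            if program(pair["input"]) == pair["output"]:
                solved += 1
        except Exception:
            pass  # crashing programs simply score zero on that pair
    return solved / len(train_pairs)

def evolve(propose, train_pairs, generations=10, population=8, keep=2):
    """Generic evolutionary loop: sample candidate solver functions from
    `propose` (e.g. an LLM asked to write Python), score them on the
    demonstration pairs, and reseed the next generation from the best."""
    pool = [propose(parents=[]) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=lambda p: fitness(p, train_pairs), reverse=True)
        if fitness(pool[0], train_pairs) == 1.0:
            break  # found a program that solves every demonstration pair
        parents = pool[:keep]
        pool = parents + [propose(parents=parents)
                          for _ in range(population - keep)]
    return max(pool, key=lambda p: fitness(p, train_pairs))
```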

In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.”

It is important to note that simply spending more compute on previous generations of models could not reach these results. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don’t know much about o3’s architecture, we can be confident that it is not orders of magnitude larger than its predecessors.

Performance of different models on ARC-AGI (source: arcprize.org)

“This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”

It is worth noting that o3’s performance on ARC-AGI comes at a steep cost. In the low-compute configuration, solving each puzzle costs $17 to $20 and consumes 33 million tokens; in the high-compute configuration, the model uses around 172X more compute and billions of tokens per problem, which by a back-of-the-envelope multiplication (172 × $17–20) puts the cost per puzzle in the thousands of dollars. However, as the costs of inference continue to decrease, we can expect these figures to become more reasonable.

A new paradigm in LLM reasoning?

The key to solving novel problems is what Chollet and other scientists refer to as “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine these programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that are beyond their training distribution.
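
To make “program synthesis” concrete, here is a minimal sketch in Python: a pair of grid primitives and a `compose` helper that chains them into a new program. None of this reflects o3’s internals; it only illustrates the compositionality that Chollet argues vanilla LLMs lack:

```python
# Two primitive grid "programs" of the kind ARC puzzles reward.
def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def compose(*programs):
    """Chain primitives into a new program: the essence of compositionality."""
    def composed(grid):
        for p in programs:
            grid = p(grid)
        return grid
    return composed

# A 90-degree clockwise rotation emerges from two simpler primitives.
rotate_cw = compose(transpose, flip_horizontal)
assert rotate_cw([[1, 2], [3, 4]]) == [[3, 1], [4, 2]]
```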

Unfortunately, there is very little information about how o3 works under the hood, and here the opinions of scientists diverge. Chollet speculates that o3 uses a form of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in the past few months.
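
A minimal sketch of that speculated mechanism, reduced to best-of-N search: sample many candidate reasoning chains for a task, score each with a reward model, and keep the highest-scoring answer. `generate_cot` and `reward_model` are hypothetical stand-ins for the model’s sampler and its learned evaluator; nothing about o3’s internals is confirmed:

```python
def search_over_cots(generate_cot, reward_model, task, n_candidates=64):
    """Sample many candidate reasoning chains for one task, score each
    with the reward model, and keep the highest-scoring solution."""
    best_score, best_answer = float("-inf"), None
    for _ in range(n_candidates):
        chain, answer = generate_cot(task)       # one sampled CoT + answer
        score = reward_model(task, chain, answer)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer
```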

Other scientists such as Nathan Lambert from the Allen Institute for AI suggest that “o1 and o3 can actually be just the forward passes from one language model.” On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”

On the same day, Denny Zhou from Google DeepMind’s reasoning team called the combination of search and current reinforcement learning approaches a “dead end.” 

“The most beautiful thing on LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. mcts) over the generation space, whether by a well-finetuned model or a carefully designed prompt,” he posted on X.

While the details of how o3 reasons might seem secondary to the ARC-AGI breakthrough, they could very well define the next paradigm shift in training LLMs. There is an ongoing debate on whether the laws of scaling LLMs through training data and compute have hit a wall, and whether test-time scaling depends on better training data or on different inference architectures could determine the next path forward.

Not AGI

The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet stresses that “ARC-AGI is not an acid test for AGI.”

“Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don’t think o3 is AGI yet,” he writes. “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.”

Moreover, he notes that o3 cannot learn these skills autonomously: it relies on external verifiers during inference and on human-labeled reasoning chains during training.

Other scientists have pointed to flaws in OpenAI’s reported results. For example, the model was fine-tuned on the ARC training set to achieve state-of-the-art results. “The solver should not need much specific ‘training’, either on the domain itself or on each specific task,” writes scientist Melanie Mitchell.

To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposes “seeing if these systems can adapt to variants on specific tasks or to reasoning tasks using the same concepts, but in other domains than ARC.”

Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even at a high-compute budget. Meanwhile, humans would be able to solve 95% of the puzzles without any training.

“You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” Chollet writes.
