OpenAI’s latest o3 model has achieved a breakthrough that has stunned the AI research community. o3 scored an unprecedented 75.7% on the ultra-difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.
While the achievement on ARC-AGI is impressive, it does not yet prove that the code for artificial general intelligence (AGI) has been cracked.
Abstraction and Reasoning Corpus
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus, which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of visual puzzles that require understanding of basic concepts such as objects, boundaries, and spatial relationships. While humans can easily solve ARC puzzles after very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most difficult benchmarks for AI.
ARC is designed so that it cannot be gamed by training a model on millions of examples in the hope of covering every possible combination of puzzles.
The benchmark is composed of a public training set of 400 simple examples, complemented by a public evaluation set of 400 more difficult puzzles that measures the generalizability of AI systems. The ARC-AGI challenge also includes private and semi-private test sets of 100 puzzles each, which are not shared with the public. These are used to evaluate candidate AI systems without running the risk of leaking the data or contaminating future systems with prior knowledge. Furthermore, the competition limits the amount of compute that participants can use, so the puzzles cannot be solved through brute-force methods.
A breakthrough in solving novel tasks
o1-preview and o1 scored at most 32% on ARC-AGI. Another method, developed by researcher Jeremy Berman, used a hybrid approach combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter, achieving 53%, the highest score before o3.
In a blog post, François Chollet, the creator of ARC, called o3’s performance “a surprising and significant step-function increase in AI capabilities, demonstrating novel task adaptation capabilities not previously seen in the GPT family of models.”
It is important to note that these results could not be achieved by applying more compute to previous generations of models. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. Not much is known about o3’s architecture, but it is unlikely to be orders of magnitude larger than its predecessors.
“This is not merely an incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”
Note that o3’s performance on ARC-AGI comes at a steep cost. In the low-compute configuration, it costs the model $17-20 and 33 million tokens to solve each puzzle, while on the high-compute budget, the model uses around 172x more compute and billions of tokens per problem. However, as the cost of inference continues to fall, these figures can be expected to become more reasonable.
A new paradigm for LLM reasoning?
The key to solving novel problems is what Chollet and other scientists refer to as “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine those programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from solving puzzles that lie beyond their training distribution.
Unfortunately, there is very little information about how o3 works under the hood, and scientists are divided. Chollet speculates that o3 uses a type of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in recent months.
Other scientists, such as Nathan Lambert, a researcher at the Allen Institute for AI, suggest that “o1 and o3 can actually be just the forward passes from one language model.” On the day o3 was announced, OpenAI researcher Nat McAleese posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”
On the same day, Denny Zhou from Google DeepMind’s reasoning team said that the combination of search and current reinforcement learning approaches is nearing a “dead end.”
“The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. MCTS) over the generation space, whether by a well-finetuned model or a carefully designed prompt,” he posted on X.
The details of how o3 reasons might seem trivial in comparison to the breakthrough on ARC-AGI, but they could very well define the next paradigm shift in training LLMs. There is an ongoing debate about whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or different inference architectures can determine the next path forward.
Not AGI
The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet stresses that “ARC-AGI is not an acid test for AGI.”
“Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don’t think o3 is AGI yet,” he wrote. “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.”
Moreover, he notes that o3 cannot learn these skills autonomously, as it relies on external verifiers during inference and human-labeled reasoning chains during training.
Other scientists have pointed to flaws in OpenAI’s reported results. For example, the model was fine-tuned on the ARC training set to achieve state-of-the-art results. “The solver should not need much specific ‘training,’ either on the domain itself or on each specific task,” wrote scientist Melanie Mitchell.
To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposes “seeing if these systems can adapt to variants of specific tasks, or to reasoning tasks using the same concepts, but in domains other than ARC.”
Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even at a high-compute budget. Meanwhile, humans would be able to solve 95% of the puzzles without any training.
“You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” Chollet wrote.