The latest large language model from OpenAI isn’t yet in the wild, but we already have some ways to tell what it can and cannot do.
The “o3” release from OpenAI was unveiled on Dec. 20 in the form of a video infomercial, which means that most people outside the company have no idea what it is really capable of. (Outside safety testing parties are being given early access.)
Although the video featured a lot of discussion of various benchmark achievements, the message from OpenAI co-founder and CEO Sam Altman on the video was very brief. His biggest statement, and vague at that, was that o3 “is an incredibly smart model.”
ARC-AGI put o3 to the test
OpenAI plans to release the “mini” version of o3 toward the end of January and the full version sometime after that, said Altman.
One outsider, however, has had the chance to put o3 to the test, in a sense.
The test, in this case, is called the “Abstraction and Reasoning Corpus for Artificial General Intelligence,” or ARC-AGI. It is a collection of “challenges for intelligent systems,” a new benchmark. The ARC-AGI is billed as “the only benchmark specifically designed to measure adaptability to novelty.” That means that it is meant to test the acquisition of new skills, not just the use of memorized knowledge.
AGI, artificial general intelligence, is regarded by some in AI as the Holy Grail — the achievement of a level of machine intelligence that could equal or exceed human intelligence. The idea of ARC-AGI is to guide AI toward “more intelligent and more human-like artificial systems.”
The o3 model scored 76% accuracy on ARC-AGI in an evaluation formally coordinated by OpenAI and the author of ARC-AGI, François Chollet, a scientist in Google’s artificial intelligence unit.
A shift in AI capabilities
On the website of ARC-AGI, Chollet wrote this past week that the score of 76% is the first time AI has beaten a human’s score on the exam, as exemplified by the answers of human Mechanical Turk workers who took the test and who, on average, scored just above 75% correct.
Chollet wrote that the high score is “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.” He added, “All intuition about AI capabilities will need to get updated for o3.”
The achievement marks “a genuine breakthrough” and “a qualitative shift in AI capabilities,” declared Chollet. Chollet predicts that o3’s ability to “adapt to tasks it has never encountered before” means that “you should plan for these capabilities to become competitive with human work within a fairly short timeline.”
Chollet’s remarks are noteworthy because he has never been a cheerleader for AI. In 2019, when he created ARC-AGI, he told me in an interview for ZDNET that the steady stream of “bombastic press articles” from AI companies “misleadingly suggest that human-level AI is perhaps a few years away,” while he considered such hyperbole “an illusion.”
The ARC-AGI questions are easy for people to understand and fairly easy for people to solve. Each challenge shows three to five examples of the question and the right answer, and the test taker is then presented with a similar question and asked to supply the missing answer.
The questions are not text-based but instead consist of pictures. A grid of pixels with colored shapes is first shown, followed by a second version that has been changed in some way. The question is: What is the rule that changes the initial picture into the second picture?
In other words, the challenge doesn’t directly rely on natural language, the celebrated area of large language models. Instead, it tests abstract pattern formulation in the visual domain.
Try ARC-AGI for yourself
You can try out the ARC-AGI for yourself at Chollet’s challenge website. You answer the challenge by “drawing” in an empty grid, filling in each pixel with the right color to create the correct grid of colored pixels as the “answer.”
It’s fun, rather like playing Sudoku or Tetris. Chances are, even if you can’t verbally articulate what the rule is, you’ll figure out pretty quickly what boxes need to be colored in to produce the solution. The most time-consuming part is actually tapping on each pixel in the grid to assign its color.
A correct answer produces a confetti toss animation on the webpage and the message, “You’ve solved the ARC Prize Daily Puzzle. You are still more (generally) intelligent than AI.”
Note that when o3 or any other model takes the test, it doesn’t directly act on pixels. Instead, the equivalent is fed to the machine as a matrix of rows and columns of numbers that must be transformed into a different matrix as the answer. Hence, AI models don’t “see” the test the same way a human does.
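To make that concrete, here is a minimal Python sketch of how one such puzzle might look as matrices of numbers. The layout (a "train" list of input/output pairs plus a "test" input) is loosely modeled on the publicly available ARC task files, but the specific grids, the rule, and the helper names are invented for illustration, and only two demonstration pairs are shown for brevity.

```python
from typing import List

# Each grid is a list of rows; each integer stands for a color.
# Treating 0 as the background color is an assumption of this sketch.
Grid = List[List[int]]

# A hypothetical task: demonstration pairs, then a test input with no answer.
task = {
    "train": [
        {"input":  [[1, 0, 0],
                    [0, 0, 0]],
         "output": [[0, 0, 1],
                    [0, 0, 0]]},
        {"input":  [[0, 2, 0],
                    [3, 0, 0]],
         "output": [[0, 2, 0],
                    [0, 0, 3]]},
    ],
    "test": [{"input": [[0, 0, 0],
                        [4, 0, 0]]}],
}


def mirror_left_right(grid: Grid) -> Grid:
    """The rule a solver has to infer for this toy task: reverse each row."""
    return [list(reversed(row)) for row in grid]


# An answer counts only if the whole output matrix matches cell for cell.
rule_fits = all(
    mirror_left_right(pair["input"]) == pair["output"]
    for pair in task["train"]
)

# Apply the inferred rule to the test input to produce the "answer" matrix.
answer = mirror_left_right(task["test"][0]["input"])

print(rule_fits)
print(answer)
```

A person looking at colored squares would spot the mirror image at a glance; the model receives only the nested lists of integers.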
What’s still not clear
Despite o3’s achievement, it’s hard to make definitive statements about o3’s capabilities. Because OpenAI’s model is closed-source, it’s still not clear exactly how the model is solving the challenge.
Not being part of OpenAI, Chollet has to speculate as to how o3 is doing what it’s doing.
He conjectures the achievement is a result of OpenAI changing the “architecture” of o3 from that of its predecessors. An architecture in AI refers to the arrangement and relationship of the functional elements that give code its structure.
Chollet speculates on the blog that “at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte Carlo tree search.”
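The term chain of thought refers to an increasingly popular approach in generative AI in which the model spells out the sequence of intermediate steps it takes in pursuit of the final answer. AlphaZero is the program from Google’s DeepMind unit that, in 2017, taught itself chess, shogi, and Go through self-play and beat leading programs at each. Monte Carlo tree search, which AlphaZero uses to choose its moves, is a long-established technique that explores a tree of possible continuations by sampling and concentrates effort on the most promising branches.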
In an email exchange, Chollet told me a bit more about his thinking. I asked how he arrived at that idea of a search over chains of thought. “Clearly when the model is ‘thinking’ for hours and generating millions of tokens in the process of solving a single puzzle, it must be doing some kind of search,” replied Chollet.
Chollet added:
It is completely obvious from the latency/cost characteristics of the model that it is doing something completely different from the GPT series. It’s not the same architecture, nor in fact anything remotely close. The defining factor of the new system is a huge amount of test-time search. Previously, 4 years of scaling up the same architecture (the GPT series) had yielded no progress on ARC, and now this system which clearly has a new architecture is creating a step function change in capabilities, so architecture is everything.
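As a loose illustration of what “search at test time” means in general, the Python sketch below tries a handful of candidate rules against a puzzle’s demonstration pairs and keeps the first one that reproduces every pair. This is not a reconstruction of o3, whose internals OpenAI has not disclosed, and it is not Monte Carlo tree search or a search over chains of thought; it is a plain exhaustive loop over a tiny, hand-written rule set, and every function name and grid in it is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # rows of color codes; 0 is assumed to be background


def shift_right(grid: Grid, k: int) -> Grid:
    """Move each row's contents k cells to the right, padding with 0."""
    width = len(grid[0])
    return [([0] * k + row)[:width] for row in grid]


def candidates() -> Dict[str, Callable[[Grid], Grid]]:
    """A tiny, hand-written space of rules to search over (an assumption)."""
    rules: Dict[str, Callable[[Grid], Grid]] = {
        "identity": lambda g: [row[:] for row in g],
        "mirror_left_right": lambda g: [row[::-1] for row in g],
        "flip_top_bottom": lambda g: [row[:] for row in g[::-1]],
    }
    for k in (1, 2, 3):
        # k=k binds the current value so each rule keeps its own shift.
        rules[f"shift_right_{k}"] = lambda g, k=k: shift_right(g, k)
    return rules


def search(train_pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the name of the first rule that reproduces every pair."""
    for name, rule in candidates().items():
        if all(rule(inp) == out for inp, out in train_pairs):
            return name
    return None


# Demonstration pairs in which a colored square moves two cells right.
train_pairs = [
    ([[5, 0, 0, 0]], [[0, 0, 5, 0]]),
    ([[0, 7, 0, 0], [0, 0, 0, 0]], [[0, 0, 0, 7], [0, 0, 0, 0]]),
]

found = search(train_pairs)
test_output = candidates()[found]([[3, 0, 0, 0], [0, 3, 0, 0]])

print(found)
print(test_output)
```

The cost of this kind of approach grows with the number of candidates examined for each puzzle, which is one way to read Chollet’s point about latency and cost: more search per puzzle means more compute per puzzle.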
There are a number of caveats here. OpenAI didn’t disclose how much money was spent on one of its versions of o3 to solve ARC-AGI. That’s a significant omission because one criterion of ARC-AGI is the cost in real dollars of using GPU chips as a proxy for AI model “efficiency.”
Chollet told me in an email that the approach of o3 does not amount to a “brute force” approach, but, he quipped, “Of course, you could also define brute force as ‘throwing an inordinate amount of compute at a simple problem,’ in which case you could say it’s brute force.”
Also, Chollet notes that o3 was trained to take the ARC-AGI test using the competition’s training data set. That means it’s not yet clear how a clean version of o3, with no test prep, would approach the exam.
Chollet told me in an email, “It will be interesting to see what the base system scores with no ARC-related information, but in any case the fact that the system is fine-tuned for ARC via the training set does not invalidate its performance. That’s what the training set is for. Until now no one was able to achieve similar scores, even after training on millions of generated ARC tasks.”
o3 still fails on some easy tasks
Despite the uncertainty, one thing seems very clear: Those yearning for AGI will be disappointed. Chollet emphasizes that the ARC-AGI test is “a research tool” and that “Passing ARC-AGI does not equate to achieving AGI.”
“As a matter of fact, I don’t think o3 is AGI yet,” Chollet writes on the ARC-AGI blog. “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.”
To demonstrate we are still not at human-level intelligence, Chollet notes some of the simple problems in ARC-AGI that o3 can’t solve. One such problem involves simply moving a colored square by a given amount — a pattern that quickly becomes clear to a human.
Chollet plans to unveil a new version of ARC-AGI in January. He predicts it will drastically reduce o3’s results. “You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” he concludes.