Super Mario AI might not be what you think it is. In this article, the phrase refers to how a research organization is using the retro Super Mario game as a benchmark for AI!
Yes, you read that right. The classic Super Mario game is being used by a research organization to benchmark the “intelligence” of AI (Artificial Intelligence) models. AI benchmarking is the process of comparing and evaluating different AI systems or models to see how each performs against a set of predefined metrics.
That sounds pretty interesting, right? We’d like you to join us as we discuss the subject below.
Super Mario AI Benchmark News Carries Interesting Results
On Friday (February 28, 2025), the University of California San Diego’s Hao AI Lab integrated artificial intelligence into live Super Mario Bros. games. Claude 3.7 from Anthropic had the best results, and Claude 3.5 came in second. Both OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro had trouble, with Gemini 1.5 Pro performing a bit better compared to OpenAI’s GPT-4o.
Each model was fed in-game screenshots and simple instructions, such as “If an obstacle or enemy is near, move/jump left to dodge,” through GamingAgent, a framework Hao AI Lab built in-house. The model then generated Python code to produce the inputs that control Mario, as the rough sketch below illustrates.
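To make that workflow concrete, here is a minimal, hypothetical sketch of a screenshot-to-code loop. It is not the actual GamingAgent implementation; the function names (capture_screenshot, query_model, press) and the placeholder responses stand in for the real emulator hook and LLM API call.

```python
import time

# Hypothetical stand-ins for the real GamingAgent pieces; names are illustrative only.
def capture_screenshot():
    """Grab the current emulator frame (placeholder: returns a dummy image path)."""
    return "frame_latest.png"

def query_model(image_path, instructions):
    """Send the screenshot plus the plain-text rules to a vision-capable LLM
    and return a small Python snippet. Placeholder: a real agent would call
    the model's API here."""
    return 'press("right"); press("A")  # run right and jump'

def press(button):
    """Placeholder for sending a controller input to the emulator."""
    print(f"pressed {button}")

INSTRUCTIONS = "If an obstacle or enemy is near, move/jump left to dodge."

# Core loop: screenshot -> model -> execute the returned Python snippet as inputs.
for _ in range(3):  # a few iterations for illustration
    frame = capture_screenshot()
    code = query_model(frame, INSTRUCTIONS)
    exec(code, {"press": press})  # the generated code only sees the input helper
    time.sleep(1)  # in practice, models took far longer than one frame to respond
```

The key design point the sketch captures is that the model never touches the controller directly: it only emits short snippets of Python, and the agent executes them against the game.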
According to Hao AI Lab, the game nevertheless forced each model to “learn” how to string together intricate moves and build gameplay strategies, much like picking Civilization 7‘s best military leaders. Interestingly, reasoning models such as OpenAI’s o1, which “think” through problems step-by-step to find solutions, struggled in the game even though they outperform “non-reasoning” models on most benchmarks.
Why Does the AI Mario Gameplay Look Like the AI Is Struggling?
According to the researchers, one of the primary reasons models struggle in real-time games like this is that they typically take seconds to decide what to do. When playing Super Mario Bros., timing is crucial: a single second can make the difference between a safe jump and a deadly plunge. The quick arithmetic below shows why.
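A minimal back-of-the-envelope sketch, assuming an illustrative (not measured) decision latency of about three seconds, shows how much of the game slips by while a model is still “thinking”:

```python
# Rough arithmetic: how much gameplay passes while a model decides on a move.
FRAME_RATE = 60          # Super Mario Bros. runs at roughly 60 frames per second
MODEL_LATENCY_S = 3.0    # illustrative assumption, not a measured figure

frames_missed = FRAME_RATE * MODEL_LATENCY_S
print(f"~{frames_missed:.0f} frames elapse per decision")  # ~180 frames

# A jump window lasts a fraction of a second, so a multi-second decision
# often arrives long after the moment to act has passed.
```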
In short, one of the main results this Super Mario AI benchmarking has shown is that different types of AI models do better or worse on particular tasks. The other takeaway is that AI models can indeed be made to play video games. Maybe in the future we will have better AI in our video games because it will be able to think like us, recognize patterns like us, and play like us.
For decades, AI has been benchmarked using games. However, some experts have questioned whether an AI’s gaming prowess really says much about technological progress. In contrast to the real world, games are typically abstract and relatively simple, and they offer an essentially limitless quantity of data for AI training.
According to research scientist and OpenAI founding member Andrej Karpathy, flashy gaming benchmarks like the Super Mario AI benchmark point to an “evaluation crisis.” Here’s what he wrote in his post on X (formerly Twitter):
My reaction is that there is an evaluation crisis. I don't really know what metrics to look at right now.
MMLU was a good and useful for a few years but that's long over.
SWE-Bench Verified (real, practical, verified problems) I really like and is great but itself too narrow.…
— Andrej Karpathy (@karpathy) March 2, 2025
What Makes AI Benchmarking Important?
As we know, AI benchmarking is the technique of comparing AI models and systems on particular tasks using predetermined parameters and datasets. This kind of check makes sure a product does what it says it will. Businesses that steer clear of comparisons or benchmarks harm the market, other AI suppliers, and, above all, their clients.
Benchmarking services give customers trustworthy information so they can make well-informed purchasing decisions. What we seek is verifiable evidence about the food we eat, the medications we take, the vehicles we drive, and the technology that is reshaping our lives.
Many legal systems acknowledge the significance of AI benchmarks and their influence on consumer perception and decision-making. Since product comparisons give consumers vital information to help them make informed decisions, the US Federal Trade Commission (FTC) promotes honest and non-deceptive comparisons.
For benchmarking to succeed, consumers must demand proof of quality and efficacy, and businesses must be willing to benchmark and continuously improve. That’s one of the aims of the Super Mario AI benchmarking: if these results can be used to develop better AIs, the gaming world will be in for an exciting time.
The Super Mario AI Presents a Promising Future for AI in Gaming
Hao AI Lab has done an amazing job with its Super Mario AI benchmarking tests. Although the future of AI in gaming is still unpredictable, as Andrej Karpathy noted, it’s exciting to see what comes next.
Do you think these benchmarking tests can greatly improve AI in gaming? Let us know what you think about it in the comments below.
Be sure to read our other news articles to keep up with what’s hot and what’s not in the gaming world. Stay tuned and catch the gaming current with GameEels!