People Are Using Super Mario Bros. to Test AI Performance

By Livevartha | 2025-03-06

Hao AI Lab, a research organization at the University of California San Diego, is testing AI models in a live gaming environment: Super Mario Bros., a classic video game that demands rapid decision-making. Using an in-house framework called GamingAgent, the lab evaluates how well AI handles real-time tasks, with Mario as the test subject.

The experiment runs Super Mario Bros. in an emulator integrated with GamingAgent, which gives the AI control over Mario. The AI generates its inputs as Python code to navigate the virtual world. Timing matters here: a second can mean the difference between clearing a jump and plummeting to a game-ending fall.

GamingAgent feeds the AI basic instructions such as, “If an obstacle or enemy is near, move/jump left to dodge,” alongside in-game screenshots. From these directives, each AI model has to “learn” complex maneuvers and develop effective gameplay strategies. The challenge lies in the game’s demand for precise timing, a hallmark of Super Mario Bros.
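
To make that loop concrete, here is a minimal sketch of one screenshot-to-input step. It is not GamingAgent's actual code: the function names, the `Action` type, and the plain-text "button frames" reply format are all assumptions made for illustration (the article says the real models emit Python code), and the emulator and model are replaced by stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List

# Instruction text quoted in the article; everything else is hypothetical.
INSTRUCTION = "If an obstacle or enemy is near, move/jump left to dodge"


@dataclass
class Action:
    button: str  # e.g. "left", "right", "jump"
    frames: int  # how many frames to hold the button


def build_prompt(instruction: str, screenshot: bytes) -> dict:
    """Bundle the text instruction with the current screenshot."""
    return {"instruction": instruction, "screenshot": screenshot}


def parse_actions(reply: str) -> List[Action]:
    """Parse lines like 'jump 20' into actions, skipping malformed lines."""
    actions = []
    for line in reply.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            actions.append(Action(parts[0].lower(), int(parts[1])))
    return actions


def step(capture: Callable[[], bytes],
         model: Callable[[dict], str],
         press: Callable[[Action], None]) -> List[Action]:
    """One loop iteration: screenshot -> prompt -> model reply -> inputs."""
    prompt = build_prompt(INSTRUCTION, capture())
    actions = parse_actions(model(prompt))
    for action in actions:
        press(action)
    return actions


# Stand-ins for the emulator and the model, so the sketch runs on its own.
def fake_capture() -> bytes:
    return b"fake-screenshot-bytes"


def fake_model(prompt: dict) -> str:
    return "left 10\njump 20"


pressed: List[Action] = []
result = step(fake_capture, fake_model, pressed.append)
print(result)
```

In a real setup, `capture`, `model`, and `press` would be wired to an emulator and a model API. Passing them in as plain callables keeps the sketch independent of any particular library.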

Real-Time Decision-Making Challenges

Despite its resemblance to the iconic 1985 release, the version used for these tests is slightly modified. Researchers highlight that real-time decision-making remains a hurdle for reasoning models, which typically take seconds to decide on actions. This delay poses a significant challenge in fast-paced environments like Super Mario Bros.
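
A rough back-of-the-envelope calculation shows why seconds of deliberation hurt. The sketch below assumes the game advances at about 60 frames per second and never pauses while the model thinks. Both are assumptions for illustration, not details reported by the lab, and `frames_elapsed` is a hypothetical helper.

```python
FPS = 60  # assumed frame rate; the article does not state one


def frames_elapsed(latency_seconds: float, fps: int = FPS) -> int:
    """Number of game frames that pass while the model is still deciding."""
    return round(latency_seconds * fps)


for latency in (0.1, 0.5, 2.0):
    print(f"{latency:.1f} s of thinking = {frames_elapsed(latency)} frames")
```

Under these assumptions, a model that takes two seconds to respond is acting on a screenshot that is already 120 frames old.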

Games have long served as benchmarks for AI evaluation. However, some researchers argue that Super Mario Bros. presents an even tougher challenge due to its intricate and time-sensitive nature. Andrej Karpathy, a research scientist and founding member at OpenAI, has pointed out the current difficulties in evaluating AI performance effectively.

“I don’t really know what [AI] metrics to look at right now.” – Andrej Karpathy

“TLDR my reaction is I don’t really know how good these models are right now.” – Andrej Karpathy

In recent benchmarks conducted by Hao AI Lab, Anthropic’s Claude 3.7 emerged as the top performer, closely followed by Claude 3.5. Meanwhile, Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled to control Mario effectively. This performance disparity underscores how difficult real-time gaming scenarios remain for current models.

Author’s Opinion

While AI has made significant strides in many areas, real-time decision-making in fast-paced environments like video games remains a major challenge. The varying performance of different AI models in this experiment highlights the complexity of real-world applications for artificial intelligence, and it suggests that we are still a long way from achieving true human-like gameplay performance. As technology continues to evolve, however, these types of benchmarks will be key in pushing the boundaries of what AI can achieve.