AI benchmarking debates have reached Pokémon
Not even Pokémon is safe from AI benchmarking controversy.
Last week, a post on X went viral, claiming that Google's latest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Gemini had reportedly reached Lavender Town in a developer's Twitch stream; Claude was stuck at Mount Moon as of late February.
Gemini is literally ahead of Claude atm in Pokémon after reaching Lavender Town
119 live views only btw, incredibly underrated stream pic.twitter.com/8avsovai4x
– you (@you21e8) April 10, 2025
But what the post failed to mention is that Gemini had an advantage: a minimap.
As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, such as cuttable trees. This reduces the need for Gemini to analyze screenshots before making gameplay decisions.
Now, Pokémon is a semi-serious AI benchmark at best; few would argue that it is a very informative test of a model's capabilities. But it is an instructive example of how different implementations of a benchmark can affect the results.
For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.
More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.
Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. In other words, it does not seem likely to get any easier to compare models as they are released.