These researchers used NPR Sunday Puzzle questions to benchmark AI “reasoning” models

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. While the brainteasers are written to be solvable without too much prior knowledge, they are usually challenging even for skilled contestants.

That’s why some experts believe they are a promising way to test the limits of AI’s problem-solving abilities.

In a new study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says the test revealed surprising insights, such as that reasoning models, OpenAI’s o1 among others, sometimes “give up” and provide answers they know are not correct.

“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science professor at Northeastern University and one of the study’s co-authors, told TechCrunch.

The AI industry is currently in a bit of a benchmarking quandary. Most of the tests commonly used to evaluate AI models probe for skills, such as competency on PhD-level math and science questions, that are not relevant to the average user. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it does not test for esoteric knowledge, and the challenges are phrased so that models cannot draw on “rote memory” to solve them, Guha explained.

“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it; that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”

No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it is possible that models trained on them can “cheat” in a sense, although Guha says he has not seen evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
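The article does not reproduce the researchers’ evaluation harness, but the setup it describes, posing each riddle to a model and checking the final answer against the known solution, maps onto a simple scoring loop. Below is a minimal, hypothetical sketch in Python; the `ask_model` callable, the question format, and the exact-match check are illustrative assumptions, not the team’s actual code.

```python
from typing import Callable

# Hypothetical harness for a Sunday Puzzle-style benchmark: pose each
# riddle to a model and score by exact match on a normalized answer.
# Illustrative sketch only; not the researchers' actual evaluation code.


def normalize(answer: str) -> str:
    """Lowercase and drop non-alphanumerics so trivially different
    renderings of the same answer still match."""
    return "".join(ch for ch in answer.lower() if ch.isalnum())


def score_benchmark(
    puzzles: list[dict],              # each: {"question": ..., "answer": ...}
    ask_model: Callable[[str], str],  # assumed to return the model's final answer
) -> float:
    correct = sum(
        normalize(ask_model(p["question"])) == normalize(p["answer"])
        for p in puzzles
    )
    return correct / len(puzzles)


if __name__ == "__main__":
    # Toy, made-up riddle to show the call pattern.
    puzzles = [{"question": "Name a word that ...", "answer": "example"}]
    print(score_benchmark(puzzles, ask_model=lambda q: "Example"))  # 1.0
```

A fuller harness might also log response time per question alongside accuracy, since the trade-off the article describes is exactly that: reasoning models spend extra seconds or minutes of “thinking” in exchange for fewer mistakes.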

At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random, a behavior a human can certainly relate to.

The models make other bizarre choices, too, such as giving a wrong answer only to retract it, attempt to tease out a better one, and fail again. They also get stuck “thinking” forever and give nonsensical justifications for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

Image: R1 getting “frustrated” on a question in the Sunday Puzzle challenge. (Image credits: Guha et al.)

The current best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.

Image: The scores of the models the team tested on their benchmark. (Image credits: Guha et al.)

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to understand and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are and are not capable of.”

 