Microsoft’s new rStar-Math technique leverages small models to outperform OpenAI’s o1-preview on math problems



Microsoft is doubling down on the potential of small language models (SLMs) with the unveiling of rStar-Math, a new reasoning technique that can be applied to small models to boost their performance on mathematical problems, bringing them to a level similar to, and in some cases exceeding, that of OpenAI’s o1-preview model.

While still in the research phase, as described in a paper published on the preprint site arXiv.org and credited to eight authors at Microsoft, Peking University, and Tsinghua University in China, the technique was applied to several smaller open-source models, including Microsoft’s Phi-3 mini, Alibaba’s Qwen-1.5B (a 1.5-billion-parameter model), and Qwen-7B (a 7-billion-parameter model). It improved performance on all of them, even surpassing OpenAI’s previously state-of-the-art model on the third-party MATH benchmark, a word-problem test of 12,500 questions covering branches such as geometry and algebra across all difficulty levels.

According to a post on Hugging Face, the researchers plan to make their code and data available on GitHub at https://github.com/microsoft/rStar, though one of the paper’s authors, Li Lyna Zhang, wrote in the comments on the Hugging Face post that the team is “still undergoing an internal review process for an open source release.” As such, “the repository remains private for now. Please stay tuned!”

Community members expressed enthusiasm, calling the innovations “impressive” and praising the combination of Monte Carlo Tree Search (MCTS) with step-by-step reasoning. One commenter highlighted the simplicity and utility of using Q-values to score steps, while others speculated on future applications in geometric proofs and symbolic reasoning.

This news follows closely on the heels of the open-source release of Microsoft’s Phi-4 model, a smaller AI system with 14 billion parameters that is now available on Hugging Face under the permissive MIT License.

While the Phi-4 release expanded access to high-performance small models, rStar-Math demonstrated a specialized approach: using smaller AI systems to achieve state-of-the-art results in mathematical reasoning.

rStar-Math works by using several different models and components to help a target small model “self-evolve”

The key to rStar-Math is that it uses Monte Carlo Tree Search (MCTS), a method that mimics human “deep thinking” by iteratively refining step-by-step solutions to mathematical problems.

The researchers used MCTS because it “breaks down complex mathematical problems into simpler one-step generation tasks, reducing the difficulty” for smaller models.
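The core MCTS loop the paper builds on can be illustrated with a toy example. The following is a minimal sketch, not the authors’ implementation: the step generator, the correctness check, and the arithmetic task are all invented for illustration, and a real system would use a language model to propose steps rather than a fixed list.

```python
import math
import random

class Node:
    """A node holding a partial step-by-step solution."""
    def __init__(self, state, parent=None):
        self.state = state        # list of reasoning steps taken so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.q = 0.0              # accumulated reward (the Q-value)

def ucb(node, c=1.4):
    """Upper-confidence bound: balances exploiting good steps and exploring."""
    if node.visits == 0:
        return float("inf")
    return node.q / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(get_steps, is_correct, iters=500):
    root = Node([])
    for _ in range(iters):
        # 1. Selection: walk down the tree by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: create a child for each candidate next step.
        steps = get_steps(node.state)
        if steps:
            node.children = [Node(node.state + [s], node) for s in steps]
            node = random.choice(node.children)
        # 3. Simulation: finish the solution with random steps.
        state = node.state
        while get_steps(state):
            state = state + [random.choice(get_steps(state))]
        reward = 1.0 if is_correct(state) else 0.0
        # 4. Backpropagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.q += reward
            node = node.parent
    # The most-visited first step is the one the search trusts most.
    return max(root.children, key=lambda n: n.visits).state[0]

# Toy task: starting from 2, apply two operations to reach 25
# (the only solution is "+3" followed by "*5").
def apply_ops(ops):
    x = 2
    for op in ops:
        x = x + 3 if op == "+3" else x * 5 if op == "*5" else x - 1
    return x

def get_steps(state):
    return ["+3", "*5", "-1"] if len(state) < 2 else []

random.seed(0)
first_step = mcts(get_steps, lambda s: apply_ops(s) == 25)
```

Even in this toy setting, the search concentrates its visits on the one first step that can lead to a correct answer, which is exactly the “break the problem into simpler one-step tasks” effect the researchers describe.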

However, they did not apply MCTS alone, as other researchers have done. Instead, in a stroke of brilliance, they also asked the model they trained to always output its “chain of thought” reasoning steps as both natural-language descriptions and Python code.

They mandated that the model include the natural-language reasoning as Python code comments, and only outputs that included Python code were used to train the model.
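A hypothetical example of what such a code-augmented chain of thought might look like is shown below. The problem, the function name, and the verification comments are invented for illustration; the point is that each natural-language step is paired with executable code, so the chain can be checked by running it.

```python
def solve_rectangle():
    # Step 1: A rectangle has length 8 and width 5; compute its area.
    area = 8 * 5
    # Step 2: The perimeter is twice the sum of the length and width.
    perimeter = 2 * (8 + 5)
    # Step 3: Report both values as the final answer.
    return area, perimeter

# Executing the code acts as a filter: a chain of thought whose Python
# raises an exception or produces an inconsistent value can be discarded
# before it is used as training data.
area, perimeter = solve_rectangle()
```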

The researchers also trained a “policy model” to generate mathematical reasoning steps and a process preference model (PPM) to select the most promising steps toward solving the problems, then improved both over four rounds of “self-evolution,” with each model improving the other.

For their raw data, the researchers said they used “747,000 math word problems from publicly available sources,” along with their solutions, but generated new steps to solve them with the two models described above.
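The overall loop can be sketched schematically. Everything below is illustrative, not the paper’s code: the `ToyPolicy` and `ToyPPM` classes, the target-sum task, and the no-op `finetune` stubs stand in for the actual language models and training runs, but the structure mirrors the description above, where the policy proposes candidate steps, the PPM selects among them, and the resulting trajectories retrain both models each round.

```python
class ToyPolicy:
    """Stand-in for the policy model: proposes candidate next steps."""
    def propose(self, problem, steps):
        return [1, 2, 3]  # toy candidate increments

    def finetune(self, trajectories):
        return self  # real training omitted in this sketch

class ToyPPM:
    """Stand-in for the process preference model: scores candidate steps."""
    def score(self, problem, steps, step):
        # Toy preference: favor the step that moves the sum toward the target.
        return -abs(problem - (sum(steps) + step))

    def finetune(self, trajectories):
        return self

def is_solved(problem, steps):
    return sum(steps) >= problem  # toy task: reach the target sum

def self_evolve(problems, policy, ppm, rounds=4):
    for _ in range(rounds):
        trajectories = []
        for problem in problems:
            steps = []
            while not is_solved(problem, steps):
                # Policy proposes candidates; the PPM picks the best one.
                candidates = policy.propose(problem, steps)
                steps.append(max(candidates,
                                 key=lambda s: ppm.score(problem, steps, s)))
            trajectories.append((problem, steps))
        # Each round, the collected trajectories retrain both models,
        # so the policy and the PPM improve each other.
        policy = policy.finetune(trajectories)
        ppm = ppm.finetune(trajectories)
    return trajectories

trajs = self_evolve([7, 10], ToyPolicy(), ToyPPM())
```

For the toy targets 7 and 10, the PPM-guided policy produces the step sequences [3, 3, 1] and [3, 3, 3, 1]; in the real system, these verified solution traces are the new training data generated from the 747,000 source problems.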

Record-breaking results

After four rounds of self-evolution, rStar-Math achieved significant milestones:

• On the MATH benchmark, the accuracy of the Qwen2.5-Math-7B model jumped from 58.8% to 90.0%, outperforming OpenAI’s o1-preview.

• On the American Invitational Mathematics Examination (AIME), it solved 53.3% of the problems, placing in the top 20% of high school competitors.

These results highlight the power of SLMs at the kind of complex mathematical reasoning traditionally dominated by larger systems.

Is smaller better?

In recent years, innovation in AI has largely been driven by the expansion of language models, with increasing parameters seen as a way to improve performance. Still, the high costs associated with these massive models, from computing resources to power consumption, have raised questions about scalability.

Microsoft offers an alternative path by focusing on efficiency. The launch of rStar-Math further underscores this commitment by demonstrating how SLMs can rival – and in some cases exceed – the capabilities of their larger counterparts.

The twin releases of Microsoft’s Phi-4 and the rStar-Math paper suggest that compact, specialized models can provide powerful alternatives to the industry’s largest systems.

What’s more, by outperforming larger competitors in key metrics, these models challenge the idea that bigger is always better. They open doors for medium-sized organizations and academic researchers to access cutting-edge capabilities without the financial or environmental burden of massive models.
