How test-time scaling unlocks hidden reasoning abilities in small language models (and allows them to outperform LLMs)
Very small language models (SLMs) can outperform leading large language models (LLMs) on reasoning tasks, according to a new study from Shanghai AI Laboratory. The authors show that with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B LLM on complicated math benchmarks.
The ability to deploy SLMs on complex reasoning tasks can be very useful as enterprises look for new ways to use these models in different environments and applications.
Test-time scaling, explained
Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS," which means they are trained to "think" slowly by generating a long string of chain-of-thought (CoT) tokens.
An alternative approach is "external TTS," where (as the name implies) model performance is enhanced with outside help. External TTS is suitable for repurposing existing models for reasoning tasks without further fine-tuning them. An external TTS setup usually consists of a "policy model," which is the main LLM generating the answer, and a process reward model (PRM), which evaluates the policy model's answers. These two components are coupled together through a sampling or search method.
The easiest setup is "best-of-N," where the policy model generates multiple answers and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In "beam search," the model breaks the answer down into multiple steps.
For each step, it samples multiple answers and runs them through the PRM. It then chooses one or more suitable candidates and generates the next step of the answer. And in "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
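As a rough illustration, here is a minimal sketch of the best-of-N setup in plain Python. The `policy_generate` and `prm_score` callables are hypothetical stand-ins for a policy model and a process reward model; they are assumptions for illustration, not the paper's implementation.

```python
# Minimal best-of-N sketch: a policy model proposes N candidate answers and a
# process reward model (PRM) picks the highest-scoring one.
# `policy_generate` and `prm_score` are hypothetical stand-ins for real model
# calls (an LLM sampler and a trained PRM), not the paper's code.

from typing import Callable, List


def best_of_n(
    question: str,
    policy_generate: Callable[[str], str],   # samples one candidate answer
    prm_score: Callable[[str, str], float],  # scores a (question, answer) pair
    n: int = 8,
) -> str:
    """Generate n candidate answers and return the one the PRM rates highest."""
    candidates: List[str] = [policy_generate(question) for _ in range(n)]
    scores = [prm_score(question, answer) for answer in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```

Beam search and DVTS follow the same pattern, but apply the PRM to each intermediate step rather than only to complete answers, keeping only the top-scoring partial solutions as the answer is built up.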

What is the right scaling strategy?
Choosing the right TTS strategy depends on multiple factors. The study's authors carried out a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.
Their findings show that efficiency largely depends on the policy and PRM models. For example, for small policy models, search-based methods outperform best-of-N. However, for large policy models, best-of-N is more efficient because the models have better reasoning capabilities and don't need a reward model to verify every step of their reasoning.
Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better on easy problems, while beam search works better on harder problems. For policy models between 7B and 32B parameters, diverse tree search performs well on easy and medium problems, and beam search works best on hard problems. But for large policy models (72B parameters and up), best-of-N is the optimal method across all difficulty levels.
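In practice, this guidance boils down to a lookup over the policy model's size and the problem's estimated difficulty. The sketch below encodes the rules summarized above as a simple selection function; the thresholds and labels mirror the article's summary, and the function itself is an illustrative assumption, not the paper's compute-optimal algorithm.

```python
# Illustrative selection rule for an external TTS method, following the rough
# guidance summarized above. A simplified sketch, not the paper's procedure.

def choose_tts_method(policy_params_b: float, difficulty: str) -> str:
    """Pick a TTS method given the policy model size (billions of parameters)
    and a problem difficulty label: 'easy', 'medium', or 'hard'."""
    if policy_params_b < 7:
        # Small policy models: cheap sampling for easy problems,
        # step-by-step search for harder ones.
        return "best_of_n" if difficulty == "easy" else "beam_search"
    if policy_params_b <= 32:
        # Mid-sized policy models: diverse tree search for easy/medium
        # problems, beam search for hard ones.
        return "dvts" if difficulty in ("easy", "medium") else "beam_search"
    # Larger policy models: best-of-N across all difficulties
    # (the study reports this for models of 72B parameters and up).
    return "best_of_n"
```

A real system would also need a way to estimate problem difficulty up front, which the study does by difficulty bands on the benchmark; here it is simply taken as an input label.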
Why small models can beat large models

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM and the problem difficulty to make the best use of their compute budget on reasoning problems.
For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complicated math benchmarks. This shows that an SLM can outperform a model that is 135X larger when using the compute-optimal TTS strategy.
In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.
When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models with 100-1,000X fewer FLOPS.
The researchers' results show that compute-optimal TTS significantly enhances the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually decreases.
"This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model," the researchers write. "Specifically, for models with weak reasoning abilities, scaling test-time compute leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited."
The study confirms that SLMs can perform better than larger models when applying compute-optimal test-time scaling methods. While this study focuses on math benchmarks, the researchers plan to extend their work to other reasoning tasks, such as coding and chemistry.