New technique helps LLMs rein in chain-of-thought lengths, optimizing reasoning without exploding compute costs
Chain-of-thought (CoT) reasoning – the process by which models break problems down into manageable "thoughts" before deriving answers – has become an integral part of the latest generation of large language models (LLMs).
However, inference costs for reasoning models can quickly add up as models generate excess CoT tokens. In a new paper, Carnegie Mellon University researchers propose an LLM training technique that gives developers more control over the length of the CoT.
Called length controlled policy optimization (LCPO), the technique conditions the model to provide correct answers while keeping its "thoughts" within a predetermined token budget. Experiments show that LCPO-trained models provide a smooth tradeoff between accuracy and cost, and can surprisingly outperform larger models at equal reasoning lengths. LCPO can help dramatically reduce inference costs in enterprise applications by saving thousands of tokens in each round of conversation with an LLM.
Better LLM performance leads to longer CoTs
Reasoning models such as OpenAI o1 and DeepSeek-R1 are trained through reinforcement learning (RL) to use test-time scaling and generate CoT traces before producing an answer. Empirical evidence shows that when models "think" longer, they tend to perform better on reasoning tasks.
For example, R1 was initially trained with pure RL without human-labeled examples. One of the insights was that as the model's performance improved, it also learned to generate longer CoT traces.
While long CoT chains generally lead to more accurate answers, they also create a compute bottleneck when deploying reasoning models at scale. There is currently very little control over the test-time compute budget, and sequences can easily stretch to tens of thousands of tokens without providing significant gains. There have been some efforts to control the length of reasoning chains, but they usually degrade the model's performance.
Length controlled policy optimization (LCPO), explained
The classic RL approach trains LLMs only to achieve the right answer. LCPO changes this paradigm by introducing two training objectives: 1) get the correct result and 2) keep the CoT chain within a specific token length. Therefore, if the model produces the correct answer but generates too many CoT tokens, it is penalized and forced to come up with a reasoning chain that reaches the same answer with a smaller token budget.
"LCPO-trained models learn to satisfy length constraints while optimizing reasoning performance, rather than relying on hand-designed heuristics," the researchers write.
They propose two flavors of LCPO: (1) LCPO-Exact, which requires the generated reasoning to be exactly the target length, and (2) LCPO-Max, which requires the output to be no longer than the target length.
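To make the two objectives concrete, here is a minimal sketch of how the reward for each LCPO flavor could be expressed, assuming a binary correctness signal from an answer verifier and a length-penalty coefficient alpha. The function names and exact penalty shapes are illustrative assumptions, not the authors' released implementation; the paper defines the precise reward formulation.

```python
# Illustrative sketch of the two LCPO reward flavors (not the authors' released code).
# Assumptions: `is_correct` is 1.0/0.0 from an answer verifier, `num_tokens` is the
# length of the generated CoT, `target_tokens` is the budget given in the prompt,
# and `alpha` scales the length penalty.

def lcpo_exact_reward(is_correct: float, num_tokens: int, target_tokens: int,
                      alpha: float = 0.001) -> float:
    """LCPO-Exact: reward correctness, penalize any deviation from the target length."""
    return is_correct - alpha * abs(target_tokens - num_tokens)


def lcpo_max_reward(is_correct: float, num_tokens: int, target_tokens: int,
                    alpha: float = 0.001) -> float:
    """LCPO-Max: only penalize reasoning that exceeds the target length."""
    overshoot = max(0, num_tokens - target_tokens)
    return is_correct - alpha * overshoot
```

During RL training, a scalar reward of this kind would replace the correctness-only reward used in standard setups, so a correct but overly long chain scores lower than a correct chain that respects the budget.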
To test the technique, the researchers fine-tuned a 1.5B-parameter reasoning model (Qwen-Distilled-R1-1.5B) on the two proposed LCPO schemes to create the L1-Max and L1-Exact models. Training was based on mathematical problems with distinct and verifiable results. However, the evaluation included math problems as well as out-of-distribution tasks such as the Massive Multitask Language Understanding (MMLU) benchmark and the Graduate-Level Google-Proof Q&A benchmark (GPQA).
Their findings show that L1 models can precisely balance token budget and reasoning performance, smoothly interpolating between short, efficient reasoning and longer, more accurate reasoning simply by prompting the model with different length constraints. Importantly, on some tasks, the L1 models can match the performance of the original reasoning model at a lower token budget.
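As an illustration of how such a length constraint might be exercised at inference time, the hedged sketch below appends a target-length instruction to the user prompt and caps generation accordingly. The model path and the exact wording of the length instruction are placeholders, since the official prompt template is defined by the released L1 models and is not reproduced here.

```python
# Hypothetical usage sketch: prompting a length-controlled reasoning model with a token budget.
# The model path and the length-instruction phrasing are placeholders, not the official ones.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/L1-model"  # placeholder for the released L1 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

question = "What is the sum of the first 50 positive integers?"
budget = 512  # desired CoT length in tokens

# The budget is communicated through the prompt, so the same model can reason
# briefly or at length depending on the constraint it is given.
prompt = f"{question}\nThink for up to {budget} tokens."

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=budget + 128)  # small slack for the final answer
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Because the constraint lives in the prompt rather than in the decoding parameters, the same checkpoint can serve both latency-sensitive and accuracy-sensitive requests by simply changing the budget value.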

Compared to S1 – the only other method that constrains the length of the CoT – L1 models show performance gains of up to 150% across various token budgets.
"This significant difference can be attributed to two key factors," the researchers write. "(1) L1 intelligently adapts its CoT to fit within specified length constraints without disrupting the reasoning process, while S1 often truncates mid-reasoning; and (2) L1 is explicitly trained to generate high-quality reasoning chains of varying lengths, effectively distilling reasoning patterns from longer chains into shorter ones."
L1 also outperforms its non-reasoning counterpart by 5% and GPT-4o by 2% at the same generation length. "To the best of our knowledge, this is the first demonstration that a 1.5B model can outperform frontier models such as GPT-4o, despite using the same generation length," the researchers write.
Interestingly, the model's CoT shows that it learns to adjust its reasoning process based on its token budget. For example, on longer budgets, the model is more likely to generate tokens associated with self-correction and verification (i.e., "but" and "wait") and with drawing conclusions ("therefore" and "so").

Beyond improved length control in the standard math reasoning setting, the L1 models generalize surprisingly well to out-of-distribution tasks, including GPQA and MMLU.
This new line of research on models that can adjust their reasoning budget could have important real-world uses, giving enterprises the ability to scale reasoning models without runaway costs. It is a powerful alternative to simply deploying larger, more expensive models, and it could be a decisive factor in making AI more economically viable for high-volume, real-world applications.
The researchers have open-sourced the LCPO code and the weights for the L1 models.