New LLM optimization technique reduces memory costs by up to 75%
Researchers at Tokyo-based startup Sakana AI have developed a new technique that lets language models use memory more efficiently, helping enterprises cut the cost of building applications on top of large language models (LLMs) and other Transformer-based models.
The technique, called “universal transformer memory,” uses special neural networks to optimize LLMs so they keep important bits of information and discard redundant details from their context.
Transformer memory optimization
The responses of Transformer models, the backbone of LLMs, depend on the contents of their “context window” — that is, what they receive as input from users.
The context window can be thought of as the model’s working memory. Changing the contents of the context window can have a huge impact on the model’s performance, which has given rise to an entire field of “prompt engineering.”
Current models support very long context windows with hundreds of thousands, or even millions, of tokens (the numerical representations LLMs use for the words, word parts, phrases, concepts, and numbers that users enter in their prompts).
This allows users to cram more information into their prompts. However, longer prompts can result in higher computational costs and slower performance. Optimizing prompts to remove unnecessary tokens while preserving important information can reduce costs and increase speed.
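To make the cost argument concrete, here is a rough back-of-the-envelope sketch (not from the paper) of how the KV cache that stores the context grows with prompt length. The layer, head, and precision figures are assumptions approximating a Llama-3-8B-class model, used for illustration only.

```python
# Back-of-the-envelope KV-cache sizing: why dropping tokens saves memory.
# The defaults (32 layers, 8 KV heads, head dim 128, fp16 values) are
# assumptions approximating a Llama-3-8B-class Transformer.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Memory needed to cache keys and values across all layers for num_tokens tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # K and V
    return num_tokens * per_token

full_prompt = 100_000    # tokens in a long-context prompt
pruned_prompt = 25_000   # same prompt after discarding ~75% of the tokens

print(f"full cache:   {kv_cache_bytes(full_prompt) / 1e9:.2f} GB")
print(f"pruned cache: {kv_cache_bytes(pruned_prompt) / 1e9:.2f} GB")
```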
Current prompt optimization techniques are resource intensive or require users to manually test different configurations to reduce the size of their prompts.
Neural attention memory models
Universal transformer memory optimizes prompts using neural attention memory models (NAMMs), simple neural networks that decide whether to “remember” or “forget” each token stored in the LLM’s memory.
“This new ability allows Transformers to discard useless or redundant details and focus on the most critical information, something we find crucial for tasks requiring reasoning in a long context,” the researchers wrote.

NAMMs are trained separately from the LLM and combined with the pre-trained model at inference time, which makes them flexible and easy to deploy. However, they need access to the model’s internal activations, which means they can only be applied to open-source models.
Like other techniques developed by Sakana AI, NAMMs are trained with evolutionary algorithms rather than gradient-based optimization methods. By iteratively mutating and selecting the best-performing models through trial and error, evolutionary algorithms optimize NAMMs for both efficiency and performance. This matters because NAMMs are learning a non-differentiable objective: keep a token or discard it.
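The sketch below shows the general shape of such an evolutionary loop, not Sakana AI’s actual training setup: a population of candidate parameter vectors is mutated, scored, and filtered each generation. The `fitness` function here is a made-up stand-in; in practice it would run an LLM with the candidate NAMM and measure task performance and cache savings.

```python
import numpy as np

rng = np.random.default_rng(0)

PARAM_DIM = 64     # size of the (toy) NAMM parameter vector
POP_SIZE = 32      # candidates per generation
GENERATIONS = 100
SIGMA = 0.1        # mutation scale

def fitness(params: np.ndarray) -> float:
    """Stand-in for evaluating a candidate NAMM. A real evaluation would run an
    LLM with the candidate memory model and score task accuracy and cache
    savings; here it just rewards closeness to a hidden target vector."""
    target = np.linspace(-1.0, 1.0, PARAM_DIM)
    return -float(np.mean((params - target) ** 2))

# Simple mutate-and-select loop: perturb the current best parameters, evaluate
# every candidate, and keep the fittest one found so far.
best = rng.normal(size=PARAM_DIM)
best_score = fitness(best)

for generation in range(GENERATIONS):
    candidates = best + SIGMA * rng.normal(size=(POP_SIZE, PARAM_DIM))
    scores = np.array([fitness(c) for c in candidates])
    top = int(np.argmax(scores))
    if scores[top] > best_score:
        best, best_score = candidates[top], float(scores[top])

print(f"best fitness after {GENERATIONS} generations: {best_score:.4f}")
```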
NAMMs operate on top of the LLM’s attention layers, one of the key components of the Transformer architecture that determines the relations and importance of each token in the model’s context window. Based on the attention values, NAMMs determine which tokens should be kept and which can be discarded from the LLM’s context window. This attention-based mechanism makes it possible to use a trained NAMM on different models without further modification. For example, a NAMM trained only on text data can be applied to vision or multi-modal models without additional training.
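As a rough illustration of this mechanism (not the paper’s exact NAMM architecture), the sketch below scores each cached token with a tiny network fed by attention statistics and evicts the low-scoring half from a toy KV cache. The choice of features (mean and max attention received) and the eviction threshold are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

SEQ_LEN, HEAD_DIM = 16, 8

# Toy single-head KV cache and an attention matrix (queries x keys) from a
# recent forward pass; rows are normalized like softmax output.
keys = rng.normal(size=(SEQ_LEN, HEAD_DIM))
values = rng.normal(size=(SEQ_LEN, HEAD_DIM))
attn = rng.random(size=(SEQ_LEN, SEQ_LEN))
attn /= attn.sum(axis=-1, keepdims=True)

# Hypothetical NAMM-like scorer: a tiny two-layer network mapping per-token
# attention statistics to a "keep" score. Weights are random for illustration;
# a real memory model would be trained (evolutionarily) for this job.
W1 = 0.5 * rng.normal(size=(2, 16))
W2 = 0.5 * rng.normal(size=(16, 1))

def keep_scores(attention: np.ndarray) -> np.ndarray:
    """Score each cached token by how much attention it receives (mean and max)."""
    feats = np.stack([attention.mean(axis=0), attention.max(axis=0)], axis=-1)
    return (np.tanh(feats @ W1) @ W2).squeeze(-1)

scores = keep_scores(attn)
keep = scores > np.median(scores)   # evict the lower-scoring half of the cache
keys, values = keys[keep], values[keep]

print(f"kept {int(keep.sum())} of {SEQ_LEN} cached tokens")
```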

Universal memory in action
To test the universal transformer memory concept in action, the researchers trained a NAMM on top of the open-source Meta Llama 3-8B model. Their experiments show that with NAMMs, Transformer-based models perform better on natural language and coding problems over very long sequences. Meanwhile, by discarding unnecessary tokens, the NAMM enabled the LLM to save up to 75% of its cache memory while performing those tasks.
“Within our benchmarks, NAMM provides clear improvements in the performance of the Llama 3-8B transformer,” the researchers wrote. “Furthermore, our memory systems give noticeable side benefits by reducing the context size of each layer, while never being explicitly optimized for memory efficiency.”

They also tested the NAMM on the 70B version of Llama, as well as on Transformer models designed for other modalities and tasks, such as Llava (computer vision) and Decision Transformer (reinforcement learning).
“Even in these out-of-distribution settings, NAMMs retain their advantages by discarding tokens such as redundant video frames and suboptimal actions, allowing their base models to focus on the most relevant information to improve performance,” the researchers wrote.
Task-dependent behavior
Another interesting finding is that NAMMs automatically adjust their behavior depending on the task.
For example, in coding tasks, the model discards contiguous chunks of tokens corresponding to comments and whitespace that do not affect the code’s execution, as the toy sketch below illustrates.
In natural language tasks, on the other hand, the model discards tokens that amount to grammatical redundancy and do not change the meaning of the sequence.
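As a toy illustration of the coding-task behavior, the snippet below uses Python’s standard `tokenize` module to flag the comment and blank-line tokens in a small function, i.e., the kind of content the researchers observed NAMMs dropping. This is a hand-written heuristic for demonstration only, not the learned behavior itself.

```python
import io
import tokenize

# A small Python function containing comments and a blank line.
code = '''\
def add(a, b):
    # return the sum of the two inputs

    return a + b  # trailing comment
'''

# Token types that don't affect execution: comments and non-logical newlines
# (blank lines). A NAMM learns to drop this kind of content on its own.
droppable = {tokenize.COMMENT, tokenize.NL}

for tok in tokenize.generate_tokens(io.StringIO(code).readline):
    label = "drop" if tok.type in droppable else "keep"
    print(f"{label:4}  {tokenize.tok_name[tok.type]:9}  {tok.string!r}")
```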
The researchers have released the code for creating your own NAMMs. Techniques such as universal transformer memory can be very useful for enterprise applications that process millions of tokens and can benefit from the speed gains and cost reductions. The reusability of a trained NAMM also makes it a versatile tool that can be used across different applications within an enterprise.
Looking ahead, the researchers suggest more advanced techniques, such as using NAMMs during the training of LLMs to further extend their memory capabilities.
“This work has only begun to tap into the potential of our new class of memory models, which we expect to offer many new opportunities for the advancement of future generations of transformers,” the researchers wrote.