Google’s new neural network LLM architecture separates memory components to control exploding memory and compute costs
A new neural network architecture developed by researchers at Google can solve one of the big challenges facing large language models (LLMs): expanding their memory at inference time without increasing memory and computational costs. Named Titans, the architecture allows models to find and store, during inference, small bits of information that are important in long sequences.
Titans combines traditional LLM attention blocks with “neural memory” layers that enable models to handle both short-term and long-term memory tasks efficiently. According to the researchers, LLMs that use neural long-term memory can scale to millions of tokens and outperform both classic LLMs and alternatives like Mamba, while having far fewer parameters.
Layers of attention and linear models
The classic transformer architecture used in LLMs relies on the self-attention mechanism to compute relations between tokens. It is an effective technique that can learn complex and detailed patterns in token sequences. However, as sequence length grows, the computational and memory costs of calculating and storing attention increase quadratically.
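To make the quadratic scaling concrete, here is a bare-bones sketch of single-head attention over a sequence of n tokens; the shapes and numbers are illustrative assumptions, but the (n × n) score matrix is where the memory and compute blow-up comes from:

```python
import torch

n, d = 4096, 64                      # sequence length and head dimension (illustrative)
q = torch.randn(n, d)                # queries
k = torch.randn(n, d)                # keys
v = torch.randn(n, d)                # values

scores = q @ k.T / d ** 0.5          # (n, n) score matrix: cost grows quadratically with n
out = torch.softmax(scores, dim=-1) @ v   # (n, d) attended output
```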
Newer alternative architectures offer linear complexity and can scale without exploding memory and compute costs. However, the Google researchers argue that linear models do not show competitive performance compared to classic transformers, because they compress their contextual data and tend to miss important details.
An ideal architecture, they suggest, should have different memory components that can be coordinated to use existing knowledge, remember new facts, and learn abstractions from their context.
“We argue that in an effective learning paradigm similar to (the) human brain, there are distinct but interconnected modules, each responsible for a component that is critical to the learning process,” the researchers wrote.
Neural long-term memory
“Memory is a confederation of systems—for example, short-term, working, and long-term memory—each of which performs a different function with different neural structures, and each of which can operate independently,” the researchers wrote.
To fill this gap in current language models, the researchers propose a “neural long-term memory” module that can learn new information during inference without the inefficiencies of the full attention mechanism. Instead of storing information during training, the neural memory module learns a function that can memorize new facts during inference and dynamically adapt the memorization process based on the data it encounters. This addresses the generalization problem that other neural network architectures suffer from.
To decide which bits of information are worth storing, the neural memory module uses the concept of “surprise”. The more a sequence of tokens differs from the type of information stored in the model’s weights and existing memory, the more surprising it is and therefore worth remembering. This allows the module to use its limited memory efficiently and only store bits of data that add useful information to what the model already knows.
To process very long data sequences, the neural memory module has an adaptive forgetting mechanism that allows it to remove information that is no longer needed, helping to manage limited memory capacity.
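Below is a minimal sketch of how such a surprise-driven memory could work, assuming a simple key-value associative memory that is updated by gradient steps at inference time: the memory’s prediction error on a new token plays the role of “surprise,” a gradient step writes the surprising information into the memory’s weights, and a small weight-decay term implements forgetting. All class names and hyperparameters here are illustrative assumptions, not the released Titans code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralMemory(nn.Module):
    """Toy gradient-based memory: it is updated at inference time to map keys
    to values, and the size of its prediction error acts as 'surprise'."""

    def __init__(self, dim: int):
        super().__init__()
        self.memory = nn.Linear(dim, dim, bias=False)     # the memory map itself
        self.to_key = nn.Linear(dim, dim, bias=False)     # projects tokens to keys
        self.to_value = nn.Linear(dim, dim, bias=False)   # projects tokens to values
        self.to_query = nn.Linear(dim, dim, bias=False)   # projects tokens to read queries

    @torch.no_grad()
    def write(self, x: torch.Tensor, lr: float = 0.01, decay: float = 0.001) -> float:
        """Memorize a batch of token states x of shape (n, dim) at inference time."""
        k, v = self.to_key(x), self.to_value(x)
        with torch.enable_grad():
            pred = F.linear(k, self.memory.weight)
            # 'Surprise': how badly the current memory predicts the new association.
            loss = (pred - v).pow(2).mean()
            grad = torch.autograd.grad(loss, self.memory.weight)[0]
        # Forgetting (decay of old content) plus a surprise-driven gradient step.
        self.memory.weight.mul_(1.0 - decay).sub_(lr * grad)
        return loss.item()

    def read(self, x: torch.Tensor) -> torch.Tensor:
        """Retrieve memorized information for token states x of shape (n, dim)."""
        return self.memory(self.to_query(x))
```

In this sketch, a token sequence that the memory already predicts well produces a small gradient and barely changes the weights, while a surprising one produces a large update, which matches the behavior the researchers describe.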
The memory module can complement the attention mechanism of current transformer models, which the researchers describe as “short-term memory modules, attending to the current context window size. On the other hand, our neural memory with the ability to continuously learn from data and store it in its weights can play the role of a long-term memory.”
Titans architecture

The researchers describe Titans as a family of models that combine existing transformer blocks with neural memory modules. The architecture has three key components: a “core” module, which acts as short-term memory and uses the classic attention mechanism to attend to the current segment of input tokens the model is processing; a “long-term memory” module, which uses the neural memory architecture to store information beyond the current context; and a “persistent memory” module, learnable parameters that remain fixed after training and store time-independent knowledge.
The researchers suggest different ways of connecting the three components. But overall, the main advantage of this architecture is that the attention and memory modules complement each other. For example, attention layers can use the historical and current context to determine which parts of the current context window should be stored in long-term memory. Meanwhile, long-term memory provides historical knowledge that is not present in the current attention context.
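As a rough illustration of that division of labor, the sketch below wires the three components together in one block, reusing the NeuralMemory sketch above. The layer names and the exact wiring are assumptions made for illustration, not the paper’s precise design.

```python
import torch
import torch.nn as nn

class TitansStyleBlock(nn.Module):
    """Rough sketch of a block that combines attention (short-term memory),
    a neural memory module (long-term memory) and persistent memory tokens."""

    def __init__(self, dim: int, n_heads: int, n_persistent: int = 16):
        super().__init__()
        # Core: standard attention over the current segment (short-term memory).
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Long-term memory: the NeuralMemory module sketched earlier.
        self.long_term = NeuralMemory(dim)
        # Persistent memory: learnable tokens, fixed after training, that hold
        # task knowledge independent of the current input.
        self.persistent = nn.Parameter(torch.randn(1, n_persistent, dim) * 0.02)

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, seq_len, dim), the current chunk of input tokens.
        b, s, d = segment.shape
        # Read historical information from long-term memory for this segment.
        retrieved = self.long_term.read(segment.reshape(-1, d)).reshape(b, s, d)
        # Attention sees persistent tokens + retrieved memories + current segment.
        context = torch.cat([self.persistent.expand(b, -1, -1), retrieved, segment], dim=1)
        out, _ = self.attn(segment, context, context)
        # Write the attended segment back into long-term memory at inference time.
        self.long_term.write(out.reshape(-1, d).detach())
        return out
```

At inference, a long document would be fed through segment by segment: attention handles each segment in full, while the neural memory carries information across segments in its weights.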
The researchers conducted small-scale tests on Titans models ranging from 170 million to 760 million parameters on a diverse set of tasks, including language modeling and long-sequence language tasks. They compared the performance of Titans against various transformer models, linear models such as Mamba, and hybrid models such as Samba.

Titans showed strong performance in language modeling compared to other models and outperformed both transformers and linear models of similar sizes.
The difference in performance is particularly pronounced on long-sequence tasks such as “needle in a haystack,” where the model has to extract bits of information from a very long sequence, and BABILong, where the model must reason over facts spread across very long documents. In fact, on these tasks, Titans outperforms models with orders of magnitude more parameters, including GPT-4 and GPT-4o-mini, and a Llama-3 model enhanced with retrieval-augmented generation (RAG).
Furthermore, the researchers were able to expand the context window of Titans to 2 million tokens while keeping memory costs at a modest level.
The models still need to be tested at larger sizes, but the paper’s results show that the researchers have not yet reached the ceiling of the Titans’ potential.
What does this mean for enterprise applications?
With Google at the forefront of long-context models, we can expect this technique to find its way into both private and open models such as Gemini and Gemma.
With LLMs supporting longer context windows, there is growing potential to build applications where you insert new knowledge into the prompt rather than relying on techniques like RAG. The development cycle for building and iterating on prompt-based applications is much faster than that of complex RAG pipelines. Meanwhile, architectures like Titans can help reduce inference costs for very long sequences, making it possible for companies to deploy LLM applications in more use cases.
Google plans to release PyTorch and JAX code for training and evaluating Titans models.