DeepSeek's success shows why motivation is key to AI innovation



January 2025 shook the AI landscape. The seemingly unstoppable OpenAI and the powerful American tech giants were shocked by what we can certainly call an underdog in the field of large language models (LLMs). DeepSeek, a Chinese company that was not on anyone's radar, suddenly challenged OpenAI. It is not that DeepSeek-R1 was better than the top models from the American giants; it was slightly behind on benchmarks. But it suddenly made everyone think about efficiency in hardware and energy consumption.

Given the unavailability of the best high-end hardware, it seems that DeepSeek was motivated to innovate in the area of efficiency, which was a lesser concern for the larger players. OpenAI has claimed it has evidence suggesting DeepSeek may have used its model for training, but we have no concrete proof to support this. So, whether it is true or OpenAI is simply trying to appease its investors is a matter of debate. However, DeepSeek has published its work, and people have verified that the results are reproducible, at least on a much smaller scale.

But how could DeepSeek achieve such cost savings while American companies could not? The short answer is simple: they had more motivation. The long answer requires a little more technical explanation.

DeepSeek uses KV-cache optimization

One important cost saving for GPU memory was the optimization of the key-value (KV) cache used in every attention layer of an LLM.

LLMs are made up of transformer blocks, each of which contains an attention layer followed by a regular vanilla feed-forward network. The feed-forward network conceptually models arbitrary relationships, but in practice it is difficult for it to always determine patterns in the data. The attention layer solves this problem for language modeling.

The model processes text using tokens, but for simplicity we will call them words. In an LLM, each word is assigned a vector in a high-dimensional space (say, a thousand dimensions). Conceptually, each dimension represents a concept, such as being hot or cold, being green, being soft, being a noun. A word's vector representation is its meaning, given by its values along each of these dimensions.

However, our language allows other words to modify the meaning of each word. For example, an apple has a meaning. But we can have a green apple as a modified version. A more extreme example of modification would be that an apple in an iPhone context differs from an apple in a meadow context. How do we let our system modify a word's vector based on another word? This is where attention comes in.

The attention model assigns two additional vectors to each word: a key and a query. The query represents the qualities of a word's meaning that can be modified, and the key represents the kind of modifications it can provide to other words. For example, the word "green" can provide information about color and greenness, so the key of "green" will have a high value in the "greenness" dimension. On the other hand, the word "apple" can be green or not, so the query vector of "apple" will also have a high value in the greenness dimension. If we take the dot product of the key of "green" with the query of "apple", the product should be relatively large compared to the dot product of the key of "table" and the query of "apple". The attention layer then adds a small fraction of the value of the word "green" to the value of the word "apple". This way, the value of the word "apple" is modified to be a little greener.
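
To make the key/query/value intuition concrete, here is a minimal sketch in Python with NumPy. The three "concept" dimensions, the toy vectors and the 0.1 mixing factor are all invented for illustration; real models learn these representations and use many hundreds of dimensions per attention head.

```python
import numpy as np

# Toy 3-dimensional "meaning" space: [greenness, fruitness, furniture-ness].
# Real models learn these dimensions; the numbers here are made up for illustration.
values = {
    "green": np.array([1.0, 0.0, 0.0]),   # contributes greenness
    "apple": np.array([0.2, 1.0, 0.0]),   # mostly a fruit, could be green
    "table": np.array([0.0, 0.0, 1.0]),   # furniture
}
keys = {
    "green": np.array([1.0, 0.0, 0.0]),   # "green" offers information about greenness
    "apple": np.array([0.0, 1.0, 0.0]),
    "table": np.array([0.0, 0.0, 1.0]),
}
queries = {
    "apple": np.array([1.0, 0.5, 0.0]),   # "apple" is looking for color-like modifiers
}

def attend(word, context):
    """Update `word`'s value using a softmax over key-query dot products of the context."""
    q = queries[word]
    scores = np.array([keys[w] @ q for w in context])
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax over the context words
    update = sum(w_i * values[w] for w_i, w in zip(weights, context))
    return values[word] + 0.1 * update                   # add a small fraction of the context

print(attend("apple", ["green", "table"]))
# The key of "green" matches the query of "apple" far better than the key of
# "table" does, so the value of "apple" ends up slightly greener.
```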

When an LLM generates text, it does so one word at a time. When it generates a word, all previously generated words become part of its context. However, the keys and values of those words have already been computed, so when a new word is added to the context, its output only needs to be computed from its own query and the keys and values of all the previous words. That is why all these keys and values are kept in GPU memory. This is the KV cache.
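
Below is a minimal sketch of a KV cache for a single attention head, assuming made-up projection matrices and an 8-dimensional toy embedding. Real implementations keep one such cache per layer and per head, and this storage is a large part of GPU memory use during generation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # toy embedding size; real models use thousands of dimensions

# Hypothetical learned projection matrices for one attention head.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attention_step(new_embedding, kv_cache):
    """Process one newly generated token, reusing cached keys and values."""
    q = new_embedding @ W_q
    kv_cache["keys"].append(new_embedding @ W_k)     # cache grows by one entry per token
    kv_cache["values"].append(new_embedding @ W_v)
    K = np.stack(kv_cache["keys"])
    V = np.stack(kv_cache["values"])
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V                               # attention output for the new token

cache = {"keys": [], "values": []}
for step in range(5):                                # pretend we generate 5 tokens
    token_embedding = rng.standard_normal(d)
    out = attention_step(token_embedding, cache)

# Without the cache, every step would recompute K and V for the whole prefix.
print(len(cache["keys"]), "key/value pairs held in memory")
```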

DeepSeek observed that the key and the value of a word are related. For example, the meaning of the word "green" and its ability to affect greenness are obviously very closely related. So it is possible to compress both into a single (and perhaps smaller) vector and decompress them very easily while processing. DeepSeek found that this slightly affects performance, but it saves a lot of GPU memory.
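
DeepSeek's actual mechanism (multi-head latent attention) is described in its papers; the sketch below only illustrates the general idea of caching one small latent vector per token and reconstructing keys and values from it on the fly. The matrix names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 64   # full key/value width vs. a much smaller latent width (made-up sizes)

# Hypothetical learned projections: compress a token's hidden state into one small
# latent vector, then reconstruct a key and a value from it when attention needs them.
W_down = rng.standard_normal((d, r)) * 0.02
W_up_k = rng.standard_normal((r, d)) * 0.02
W_up_v = rng.standard_normal((r, d)) * 0.02

hidden = rng.standard_normal(d)        # hidden state of one token

latent = hidden @ W_down               # only this r-dimensional vector is cached
k = latent @ W_up_k                    # decompressed key, recomputed when needed
v = latent @ W_up_v                    # decompressed value

full_cache_floats = 2 * d              # storing k and v directly
latent_cache_floats = r                # storing the shared latent instead
print(f"cache size per token: {latent_cache_floats} vs {full_cache_floats} floats")
```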

DeepSeek applied MoE

The nature of a neural network is that the entire network must be evaluated (or computed) for every query. However, not all of this is useful computation. Knowledge of the world sits in the weights, or parameters, of a network. Knowledge about the Eiffel Tower is not used to answer questions about the history of South American tribes. Knowing that an apple is a fruit is not useful while answering questions about the general theory of relativity. However, when the network is computed, every part of it is processed regardless. This incurs huge computing costs during text generation that should ideally be avoided. This is where the idea of the mixture of experts (MoE) comes in.

In an MoE model, the neural network is divided into multiple smaller networks called experts. Note that the "expert" in a subject area is not explicitly defined; the network figures it out during training. However, the network assigns a relevance score to each query and activates only the parts with the highest matching scores. This provides huge savings in computation. Note that some questions need expertise in multiple areas to be answered correctly, and the performance on such queries will suffer. However, because the areas are discovered from the data, the number of such questions is minimized.
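
Here is a minimal top-k routing sketch, assuming a made-up gate matrix, eight toy experts and two active experts per token. It only shows the routing idea, not any particular production MoE layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2    # toy sizes; real models use far larger experts

# Each "expert" is a tiny feed-forward network with its own weights.
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
W_gate = rng.standard_normal((d, n_experts)) * 0.1   # hypothetical learned router

def moe_layer(x):
    """Route the token to its top-k experts and mix their outputs by gate weight."""
    gate_scores = x @ W_gate
    chosen = np.argsort(gate_scores)[-top_k:]         # indices of the best-matching experts
    gate_probs = np.exp(gate_scores[chosen])
    gate_probs /= gate_probs.sum()
    # Only the chosen experts are evaluated; the other n_experts - top_k are skipped.
    return sum(p * np.tanh(x @ experts[i]) for p, i in zip(gate_probs, chosen))

token = rng.standard_normal(d)
out = moe_layer(token)
print(out.shape)   # (16,) -- same shape as the input, but only 2 of 8 experts ran
```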

The importance of reinforcement learning

LLMs are taught to think through chain-of-thought processing, with the model fine-tuned to mimic thinking before delivering the answer. The model is asked to verbalize its thoughts (generating the thought before generating the answer). The model is then evaluated on both the thought and the answer and trained with reinforcement learning (rewarded for a correct match and penalized for an incorrect match with the training data).

This requires expensive training data annotated with thoughts. Instead, DeepSeek only asked the system to generate its thoughts between the tags <think> and </think> and its answers between the tags <answer> and </answer>. The model is rewarded or penalized purely on the basis of form (the use of the tags) and whether the answer matches. This requires far less expensive training data. During the early phase of RL, the model tried to generate very little thought, which led to incorrect answers. Eventually, the model learned to generate both long and coherent thoughts, which is what DeepSeek calls the "a-ha" moment. After that point, the quality of the answers improved considerably.
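
Here is a minimal sketch of such a format-plus-answer reward in Python. The <think>/<answer> tags follow the convention described above; the reward values and exact-match check are simplified assumptions, not DeepSeek's actual reward function.

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Score a completion purely on format and final-answer match (simplified)."""
    format_ok = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", completion, re.DOTALL
    )
    if not format_ok:
        return -1.0                          # penalize missing or malformed tags
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL).group(1)
    return 1.0 if answer.strip() == reference_answer.strip() else -0.5

good = "<think>2 + 2 is 4 because ...</think><answer>4</answer>"
bad = "The answer is 4."
print(reward(good, "4"), reward(bad, "4"))   # 1.0 -1.0
```

No annotated reasoning traces are needed to compute this score, which is what makes the approach so much cheaper than supervised fine-tuning on hand-written chains of thought.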

DeepSeek uses several additional optimization tricks. However, they are highly technical, so I will not go into them here.

Final thoughts about DeepSeek and the larger market

In every technological field, we first need to discover what is possible before we improve efficiency. This is a natural progression. DeepSeek's contribution to the LLM landscape is phenomenal. Its academic contribution cannot be ignored, whether or not its models were trained using OpenAI's output. It can also transform the way startups operate. But there is no reason for OpenAI or the other American giants to despair. This is how research works: one group benefits from the research of other groups. DeepSeek certainly benefited from earlier research done by Google, OpenAI and many other researchers.

However, the idea that OpenAI will dominate the LLM world indefinitely now looks very unlikely. No amount of regulatory lobbying or finger-pointing will preserve its monopoly. The technology is already in the hands of many, and out in the open, which makes its progress unstoppable. Although this may be a bit of a headache for OpenAI's investors, it is ultimately a victory for the rest of us. While the future belongs to many, we will always be grateful to early contributors such as Google and OpenAI.

Debasish Ray Chawdhuri is a senior principal engineer at Talentica Software.


 