Meta unleashes Llama API running 18x faster than OpenAI: Cerebras partnership delivers 2,600 tokens per second
Meta announced today a partnership with Cerebras Systems to power its new Llama API, offering developers access to inference speeds up to 18 times faster than traditional GPU-based solutions.
The announcement, made at Meta's inaugural LlamaCon developer conference in Menlo Park, positions the company to compete directly with OpenAI, Anthropic, and Google in the rapidly growing AI inference services market, where developers buy tokens by the billions to power their applications.
“Meta has selected Cerebras to collaborate to deliver the ultra-fast inference that they need to serve developers through their new Llama API,” said Julie Shin Choi, chief marketing officer at Cerebras, during a press briefing. “We at Cerebras are really, really excited to announce our first CSP hyperscaler partnership to deliver ultra-fast inference to all developers.”
The partnership marks Meta's formal entry into the business of selling AI computation, transforming its popular open-source Llama models into a commercial service. While Meta's Llama models have accumulated one billion downloads, until now the company had not offered first-party cloud infrastructure for developers to build applications with them.
“This is very exciting, even without talking about Cerebras specifically,” said James Wang, a senior executive at Cerebras. “OpenAI, Anthropic, Google – they have built an entire new AI business from scratch, which is the AI inference business. Developers who build AI apps will buy tokens by the millions, sometimes by the billions.”

Breaking the speed barrier: How Cerebras supercharges Llama models
What sets Meta's offering apart is the dramatic speed increase provided by Cerebras's specialized AI chips. The Cerebras system delivers more than 2,600 tokens per second for Llama 4 Scout, compared with roughly 130 tokens per second for ChatGPT and about 25 tokens per second for DeepSeek, according to benchmarks from Artificial Analysis.
“If you just compare on an API-to-API basis, Gemini and GPT, they're all great models, but they all run at GPU speeds, which is roughly 100 tokens per second,” Wang explained. “And 100 tokens per second is okay for chat, but it's very slow for reasoning. It's very slow for agents. And people are struggling with that today.”
This speed advantage enables entirely new categories of applications that were previously impractical, including real-time agents, conversational low-latency voice systems, interactive code generation, and instant multi-step reasoning – all of which require chaining multiple large language model calls that can now be completed in seconds rather than minutes.
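To put those figures in perspective, a back-of-the-envelope calculation helps. The sketch below assumes a hypothetical five-step agent pipeline generating 400 output tokens per step; the throughput numbers are the benchmarks cited above.

```python
# Back-of-envelope latency for an agent that chains sequential LLM calls.
# Throughput figures are the Artificial Analysis benchmarks cited above;
# the pipeline shape (5 calls x 400 output tokens each) is an assumption.
STEPS = 5
TOKENS_PER_STEP = 400

for provider, tokens_per_second in [("GPU-based API (~ChatGPT)", 130),
                                    ("Cerebras, Llama 4 Scout", 2600)]:
    seconds = STEPS * TOKENS_PER_STEP / tokens_per_second
    print(f"{provider}: {seconds:.1f}s of pure generation time")

# GPU-based API (~ChatGPT): 15.4s of pure generation time
# Cerebras, Llama 4 Scout: 0.8s of pure generation time
```

Fifteen seconds of dead air breaks a voice conversation or an interactive coding session; under a second does not.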
The Llama API represents a significant shift in Meta's AI strategy, transitioning the company from primarily being a model provider to becoming a full-service AI infrastructure company. By offering an API service, Meta creates a revenue stream from its AI investments while maintaining its commitment to open models.
“Meta is now in the business of selling tokens, and it's great for the American kind of AI ecosystem,” Wang noted during a press conference. “They bring a lot to the table.”
The API will offer tools for fine-tuning and evaluation, starting with the Llama 3.3 8B model, allowing developers to generate data, train on it, and test the quality of their custom models. Meta emphasizes that it will not use customer data to train its own models, and models built using the Llama API can be transferred to other hosts – a clear differentiation from some competitors' more closed approaches.
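Meta has not published full technical details yet, but the developer experience will likely resemble other hosted model services. As a minimal sketch, assuming an OpenAI-compatible endpoint (the base URL and model identifier below are illustrative assumptions, not confirmed details of Meta's service), a first call might look like this:

```python
# Minimal sketch of calling a hosted Llama model via an OpenAI-compatible
# client. The base_url and model name are ASSUMPTIONS for illustration only;
# consult the official Llama API documentation for real endpoints and IDs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llama.com/compat/v1/",  # assumed endpoint
    api_key="YOUR_LLAMA_API_KEY",
)

response = client.chat.completions.create(
    model="Llama-4-Scout-17B-16E-Instruct",  # assumed model identifier
    messages=[
        {"role": "user", "content": "Explain wafer-scale inference briefly."}
    ],
)
print(response.choices[0].message.content)
```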
Cerebras will power Meta's new service through its network of data centers located throughout North America, including facilities in Dallas, Oklahoma, Minnesota, Montreal, and California.
“All of our data centers that serve inference are in North America at this time,” Choi explained. “We will be serving Meta with the full capacity of Cerebras. The workload will be balanced across all of these different data centers.”
The business arrangement follows what Choi described as the “classic compute provider to a hyperscaler” model, similar to how Nvidia provides hardware to major cloud providers. “They are reserving blocks of our compute so that they can serve their developer population,” she said.
Beyond Cerebras, Meta has also announced a partnership with Groq to provide fast inference options, giving developers multiple high-performance alternatives beyond traditional GPU-based inference.
Meta's entry into the inference API market with superior performance metrics could potentially disrupt the established order dominated by OpenAI, Google, and Anthropic. By combining the popularity of its open-source models with dramatically faster inference capabilities, Meta is positioning itself as a formidable competitor in the commercial AI space.
“Meta is in a unique position with 3 billion users, hyperscale data centers, and a huge developer ecosystem,” according to Cerebras presentation materials. The integration of Cerebras technology, they add, “helps Meta leapfrog OpenAI and Google in performance by approximately 20x.”
For Cerebras, this partnership marks a major milestone and a validation of its specialized AI hardware approach. “We have been building this wafer-scale engine for years, and we always knew that the technology was first-rate, but ultimately it has to end up as part of someone else's hyperscale cloud. That was the final target from a commercial strategy perspective, and we have finally reached that milestone,” Wang said.
The Llama API is currently available as a limited preview, with Meta planning a broader rollout in the coming weeks and months. Developers interested in ultra-fast Llama 4 inference can request early access by selecting Cerebras from the model options within the Llama API.
“If you imagine a developer who doesn't know anything about Cerebras, because we're a relatively small company, they can just click two buttons on Meta's standard software SDK, generate an API key, select the Cerebras flag, and then all of a sudden their tokens are being processed on a giant wafer-scale engine,” Wang said. “That kind of having us be on the back end of Meta's whole developer ecosystem is just tremendous for us.”
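Meta has not documented the exact selection mechanism, but going by Wang's description, routing the same request to Cerebras hardware could be a one-line change. The provider flag below is purely hypothetical, continuing the earlier sketch:

```python
# Purely hypothetical: Wang describes developers "selecting the Cerebras
# flag" in the SDK. One plausible shape is a provider option passed in the
# request body; the actual Llama API mechanism may differ.
response = client.chat.completions.create(
    model="Llama-4-Scout-17B-16E-Instruct",  # same assumed model as above
    extra_body={"provider": "cerebras"},     # hypothetical provider flag
    messages=[{"role": "user", "content": "Plan a three-step research task."}],
)
```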
Meta's choice of specialized silicon signals something deeper: in the next phase of AI, it's not just about what your models know, but how fast they can think. In that future, speed isn't just a feature – it's the whole point.