A new open-source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and others




A two-person startup by the name of Nari Labs has introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts, and one of its creators claims it surpasses the performance of competing proprietary offerings from the likes of ElevenLabs and Google, maker of the hit NotebookLM AI podcast generation product.

It may also threaten uptake of OpenAI's recently released gpt-4o-mini-tts.

"Dia rivals NotebookLM's podcast feature while surpassing ElevenLabs Studio and Sesame's open model in quality," said Toby Kim, one of the co-creators of Nari and Dia, in a post from his account on the social network X.

In a separate post, Kim noted that the model was built with "zero funding," adding in a thread: "…we weren't AI experts from the beginning. It all started when we fell in love with NotebookLM's podcast feature when it was released last year. We wanted more control over the voices, more freedom in the script."

Kim also credited Google for giving the team access to its TPU Research Cloud.

Dia's code and weights (the model's internal connection settings) are now available for download and local deployment from Hugging Face or GitHub. Individual users can try generating speech from it on a Hugging Face Space.

Advanced controls and more adaptive functions

Dia supports nuanced features such as emotional tone, speaker tagging and non-verbal audio cues, all from plain text.

Users can mark speaker turns with tags such as [S1] and [S2] and include cues such as (laughs), (coughs) or (clears throat) to enrich the resulting dialogue with non-verbal behavior.

These tags are interpreted faithfully by Dia during generation, something that is not reliably supported by other available models, according to the company's examples page.
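The tagged-script format described above can be assembled programmatically. The helper below is a minimal illustrative sketch, not part of any official Dia API; it simply concatenates speaker turns into the bracketed style the article describes.

```python
# Sketch: assembling a Dia-style dialogue script from plain text.
# The [S1]/[S2] speaker tags and parenthesized cues like (laughs)
# follow the markers described above; build_script() is a hypothetical
# helper for illustration, not part of the Dia library.

def build_script(turns):
    """Join (speaker, text) turns into a single tagged prompt string."""
    return " ".join(f"[{speaker}] {text}" for speaker, text in turns)

script = build_script([
    ("S1", "Have you tried the new open-weights TTS model?"),
    ("S2", "I have! The non-verbal cues are wild. (laughs)"),
])
print(script)
# [S1] Have you tried the new open-weights TTS model? [S2] I have! The non-verbal cues are wild. (laughs)
```

The resulting string would then be passed to the model as a single text prompt, letting Dia decide how to voice each tagged speaker and cue.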

Currently, the model supports English only and is not tied to any particular speaker's voice, producing different voices from run to run unless users fix the generation seed or provide an audio prompt. Audio conditioning, or voice cloning, lets users guide the tone of the speech and the vocal likeness by uploading a sample clip.
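Fixing the seed, as mentioned above, is the standard way to make a stochastic PyTorch pipeline repeatable. The helper below is a generic sketch of that technique, not Dia's own API; the exact seeding hooks the model exposes may differ.

```python
# Sketch: pinning random seeds so repeated generations keep the same
# voice, as described above. set_seed is a generic helper for
# PyTorch-based pipelines, not part of Dia itself.
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed the RNGs a PyTorch-based TTS pipeline typically draws from."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices when present

# With the same seed, sampling draws repeat exactly, so a model's
# randomly chosen voice should be reproducible across runs.
set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)
print(bool(torch.equal(first, second)))  # True
```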

Nari Labs offers example code to facilitate this process, as well as a Gradio-based demo so users can try it without any setup.

Comparison with ElevenLabs and Sesame

On its website, Nari offers sample audio files generated by Dia for comparison against other leading text-to-speech rivals, specifically ElevenLabs Studio and Sesame CSM-1B, the latter a new text-to-speech model from Brendan Iribe, co-creator of the Oculus VR headset, which went somewhat viral on X earlier this year.

Side-by-side examples shared by Nari Labs show how Dia outperforms the competition in several areas:

In standard dialogue scenarios, Dia handles both natural timing and non-verbal expressions better. For example, in a script ending with (laughs), Dia interprets and delivers actual laughter, while ElevenLabs and Sesame output textual substitutions like "lol."

For example, here is Dia …

… and the same sentence spoken by ElevenLabs Studio.

In multi-turn conversations with emotional range, Dia shows smoother transitions and tone shifts. One test included a dramatic, emotionally charged emergency scene. Dia effectively conveyed the speakers' urgency and stress, while competing models often flattened the delivery or lost pacing.

Dia uniquely handles scripts laced with non-verbal cues, such as a humorous exchange including coughing, sniffing and laughter. Competing models failed to recognize these tags or skipped them entirely.

Even with rhythmically complex content such as rap lyrics, Dia generates fluid, performance-style speech that maintains tempo. This contrasts with the more monotone or disjointed output from ElevenLabs and Sesame's 1B model.

Using audio prompts, Dia can extend or continue a speaker's voice style into new lines. An example using a conversational clip as a seed showed how Dia carried the voice characteristics from the sample through the rest of the scripted dialogue. This capability is not robustly supported in other models.

In one set of tests, Nari Labs noted that Sesame's best website demo likely used an internal 8B version of the model rather than the public 1B checkpoint, leading to a gap between the advertised and the actual performance.

Model access and technical specifications

Developers can access Dia from Nari Labs' GitHub repository and its Hugging Face model page.

The model runs on PyTorch 2.0+ with CUDA 12.6 and requires about 10 GB of VRAM.
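Those requirements can be checked before attempting to load the model. The sketch below is illustrative, not part of Dia; the 10 GB threshold is the figure cited in this article, and `can_run_dia` is a hypothetical helper name.

```python
# Sketch: a pre-flight check against the requirements cited above
# (a CUDA device with roughly 10 GB of VRAM). The threshold comes
# from the article; the helpers are illustrative, not part of Dia.
import torch

REQUIRED_VRAM_GB = 10

def meets_vram_requirement(total_memory_bytes: int,
                           required_gb: int = REQUIRED_VRAM_GB) -> bool:
    """Return True if a device's total memory clears the VRAM bar."""
    return total_memory_bytes >= required_gb * 1024**3

def can_run_dia(device_index: int = 0) -> bool:
    """Check that a CUDA device exists and has enough memory."""
    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(device_index)
    return meets_vram_requirement(props.total_memory)

# A 16 GB card clears the bar; an 8 GB card does not.
print(meets_vram_requirement(16 * 1024**3))  # True
print(meets_vram_requirement(8 * 1024**3))   # False
```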

Inference on enterprise-grade GPUs such as the NVIDIA A4000 delivers approximately 40 tokens per second.

While the current version runs only on GPU, Nari plans to offer CPU support and a quantized release to improve accessibility.

The startup provides both a Python library and a CLI tool to further streamline deployment.

Dia's flexibility opens up use cases ranging from content creation to assistive technologies and synthetic voiceovers.

Nari Labs is also developing a consumer version of Dia aimed at casual users who want to remix or share generated conversations. Interested users can sign up by email for a waiting list for early access.

Fully open source

The model is distributed under a fully open source Apache 2.0 license, which means it can be used for commercial purposes, something that will obviously appeal to enterprises and indie app developers.

Nari Labs explicitly prohibits uses that involve impersonating individuals, spreading misinformation or engaging in illegal activities. The team encourages responsible experimentation and takes a stand against unethical deployment.

Dia's development drew on support from Google's TPU Research Cloud, Hugging Face's ZeroGPU grant program, and prior work on SoundStorm, Parakeet and the Descript Audio Codec.

Nari Labs itself consists of just two full-time engineers and one part-time engineer, but they actively invite community contributions through their Discord server and GitHub.

With a clear focus on expressive quality, reproducibility and open access, Dia adds a distinctive new voice to the landscape of generative speech models.
