Google DeepMind researchers introduce new metric to improve LLM factuality, reduce hallucinations

Hallucinations, or factually incorrect answers, continue to plague large language models (LLMs). Models falter especially when they are given more complex tasks and when users are looking for specific, highly detailed answers.

It’s a challenge that data scientists have struggled to overcome, and now researchers from Google DeepMind say they’re one step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that assesses LLMs’ ability to generate factually accurate answers grounded in long-form documents. Models are also evaluated on whether their responses are detailed enough to provide useful, relevant answers to prompts.

Along with the new benchmark, the researchers released a FACTS leaderboard to the Kaggle data science community.

As of this week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. Others in the top nine include Google’s Gemini 1.5 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini, and o1-preview. All of them scored above 61.7% for accuracy.

The researchers say the leaderboard will be actively maintained and continually updated to include new models and their various iterations.

“We believe this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, in comparison to benchmarks that focus on narrower use cases … such as summarization alone,” the researchers write in a technical paper published this week.

Eliminating inaccurate responses

Ensuring factual accuracy in LLM responses is difficult because of both modeling factors (architecture, training and inference) and measurement factors (evaluation methodologies, data and metrics). Typically, the researchers point out, pre-training focuses on predicting the next token given the previous tokens.

“Although this objective can teach models remarkable world knowledge, it does not directly optimize the model for different real-world scenarios, instead encouraging it to generate generally plausible text,” the researchers wrote.

To address this, the FACTS dataset comprises 1,719 examples (860 public and 859 private), each requiring a long-form response grounded in the context document provided. Each example includes the following (a rough sketch of one such record follows this list):

  • A system prompt (system_instruction) with general directives and the instruction to answer based only on the provided context;
  • A task (user_request) containing a specific question to be answered;
  • A long document (context_document) containing the necessary information.
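Put together, a single FACTS Grounding example can be pictured as a simple record with those three fields. The sketch below is purely illustrative: only the field names come from the benchmark’s description, and the contents are invented placeholders.

```python
# Illustrative sketch of a single FACTS Grounding example. Only the three
# field names (system_instruction, user_request, context_document) come from
# the benchmark's description; the contents below are invented placeholders.
example = {
    "system_instruction": (
        "Answer the user's request using only information found in the "
        "provided context document. Do not draw on outside knowledge."
    ),
    "user_request": (
        "Summarize the main reasons the company's revenue declined in the "
        "third quarter."
    ),
    "context_document": "<full annual financial report, up to ~32,000 tokens>",
}

# The model under test is prompted with all three fields and must return a
# long-form answer that is fully attributable to the context document.
```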

To succeed and be labeled “accurate,” the model must process the long-form document and produce a long-form response that is both comprehensive and fully attributable to the document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document or are not highly relevant or useful.

For example, a user might ask a model to summarize the main reasons a company’s revenue declined in the third quarter, and provide it with the company’s annual financial report covering quarterly earnings, expenses, planned investments and market analysis.

If a model then, say, returns, “The company faced challenges in the third quarter that impacted its revenue,” that would be considered inaccurate.

“The response avoids specifying any reasons, such as market trends, increased competition or operational failures, which would likely be in the document,” the researchers said. “It does not demonstrate an attempt to engage with the document or extract relevant details.”

In contrast, if the user prompts, “What are some money-saving tips?” and provides a compilation of categorized money-saving tips for college students, the correct answer would be very detailed: “Take advantage of free activities on campus, buy products at wholesale and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”

DeepMind uses LLMs to judge LLMs

To allow for a variety of inputs, the researchers included documents of varying lengths, up to 32,000 tokens (or the equivalent of 20,000 words). They span fields including finance, technology, retail, medicine and law. User requests are also broad, including question and answer generation, summarization and rewrite requests.

Each example is judged in two phases. First, responses are evaluated for eligibility: if they do not sufficiently address the user’s request, they are disqualified. Second, responses must be free of hallucinations and grounded entirely in the documents provided.

Factuality scores are calculated by three different LLM judges (specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet), each of which assigns an individual score based on the percentage of accurate model outputs. The final factuality determination is then based on the average of the three judges’ scores.
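Based on that description, the scoring logic can be sketched roughly as follows. The function names, data shapes and verdict labels here are assumptions made for illustration, not DeepMind’s actual implementation.

```python
from statistics import mean

def judge_score(verdicts):
    """Share of responses one judge marks both eligible (it addresses the
    user's request) and grounded (fully supported by the context document).
    Disqualified responses simply never count as grounded here."""
    return sum(v["eligible"] and v["grounded"] for v in verdicts) / len(verdicts)

def final_facts_score(per_judge_verdicts):
    """Average the three judges' individual scores, as the article describes,
    which dilutes each judge's bias toward its own model family."""
    return mean(judge_score(v) for v in per_judge_verdicts.values())

# Toy verdicts (invented) from the three judge models named above.
verdicts = {
    "gemini-1.5-pro":    [{"eligible": True,  "grounded": True},
                          {"eligible": True,  "grounded": False}],
    "gpt-4o":            [{"eligible": True,  "grounded": True},
                          {"eligible": False, "grounded": False}],
    "claude-3.5-sonnet": [{"eligible": True,  "grounded": True},
                          {"eligible": True,  "grounded": True}],
}
print(round(final_facts_score(verdicts), 3))  # 0.667 for this toy data
```

Whether disqualified responses are dropped from the denominator or simply counted as not factual is a detail the article does not spell out; the sketch takes the simpler second option.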

The researchers point out that models are often biased in favor of other members of their own model family, with an average increase of about 3.23%, so the mix of different judges is critical to help ensure responses are genuinely factual.

Ultimately, the researchers emphasize that factuality and grounding are key factors in the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems,” they wrote.

However, they also acknowledge: “We recognize that benchmarks can quickly be overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”


 