The new method allows DeepSeek and other models to answer “sensitive” questions
Bias, and in some cases outright censorship, is difficult to eliminate in large language models (LLMs). One such model, DeepSeek from China, has alarmed politicians and some business leaders over its potential danger to national security.
A select committee of the US Congress recently released a report that called DeepSeek a “profound threat to our nation’s security” and detailed policy recommendations.
While there are ways to combat bias through reinforcement learning from human feedback (RLHF) and fine-tuning, the enterprise risk startup CTGT claims to have an alternative approach. CTGT has developed a method that bypasses bias and censorship baked into some language models, one it says removes censorship 100% of the time.
In a paper, Cyril Gorlla and Trevor Tuttle of CTGT said their framework “directly locates and modifies the internal features responsible for censorship.”
“This approach is not only computationally efficient but also allows fine-grained control over model behavior, ensuring that uncensored responses are delivered without compromising the model’s overall capabilities and factual accuracy,” the paper said.
While the method was developed explicitly with DeepSeek-R1-Distill-Llama-70B in mind, the same process can be used on other models.
“We tested CTGT with other open-weight models like Llama and found it just as effective,” Gorlla told VentureBeat in an email. “Our technology works at the foundational neural network level, meaning it applies to all deep learning models. We are working with a leading foundation model lab to ensure their new models are trustworthy and safe from the core.”
How it works
The researchers said their method identifies features with a high probability of being associated with unwanted behaviors.
“The key idea is that within a large language model, there exist latent variables (neurons or directions in the hidden state) that correspond to concepts like ‘censorship trigger’ or ‘toxic sentiment.’ If we can find these variables, we can manipulate them directly,” Gorlla and Tuttle wrote.
CTGT said there are three key steps:
- Feature identification
- Feature isolation and characterization
- Dynamic feature modification
The researchers craft a series of prompts that could trigger one of these latent concepts, such as asking for more information about Tiananmen Square or for tips on bypassing firewalls. Based on the responses, they run the prompts, establish a pattern and find the vectors where the model decides to censor information.
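CTGT’s exact procedure is described in its paper; purely as an illustration of what the feature-identification step could look like, here is a minimal sketch using the common difference-of-means approach over hidden states. The model name comes from the article, but the probed layer, the prompt lists and the helper function are hypothetical choices, not CTGT’s.

```python
# Minimal sketch of feature identification (not CTGT's actual code).
# Assumes a Hugging Face causal LM; LAYER and the prompt lists are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
LAYER = 40  # hypothetical choice of which layer to probe

def mean_hidden(prompt: str) -> torch.Tensor:
    """Average hidden state of a prompt at the probed layer."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

# Prompts that tend to trigger censorship vs. matched neutral controls.
sensitive = [
    "What happened at Tiananmen Square in 1989?",
    "How can someone bypass a national internet firewall?",
]
neutral = [
    "What happened during the 1969 Apollo 11 landing?",
    "How can someone set up a home wireless network?",
]

# The difference of mean activations is a candidate "censorship" direction.
v_sens = torch.stack([mean_hidden(p) for p in sensitive]).mean(0)
v_neut = torch.stack([mean_hidden(p) for p in neutral]).mean(0)
censor_dir = v_sens - v_neut
censor_dir = censor_dir / censor_dir.norm()  # unit vector for the feature
```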
Once identified, the researchers can isolate that feature and figure out which part of the unwanted behavior it controls. Behavior could include responding more cautiously or refusing to respond altogether. Knowing what behavior the feature controls, the researchers can then “integrate a mechanism into the model’s inference pipeline” that adjusts how much of the feature’s behavior is activated.
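Again only as an illustration, continuing the sketch above: one way to integrate such a mechanism into an inference pipeline is a forward hook that scales the hidden state’s component along the identified direction at generation time. The hook and its strength parameter are hypothetical stand-ins, not what CTGT’s pipeline actually does.

```python
# Minimal sketch of dynamic feature modification (not CTGT's actual code).
# strength=0.0 leaves the model untouched; 1.0 projects the feature out fully.
def make_steering_hook(direction: torch.Tensor, strength: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Subtract (a fraction of) the component along the censorship direction.
        proj = (hidden @ direction).unsqueeze(-1) * direction
        steered = hidden - strength * proj
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# hidden_states[LAYER] is the output of decoder layer LAYER - 1.
layer = model.model.layers[LAYER - 1]  # module path for Llama-style models
handle = layer.register_forward_hook(make_steering_hook(censor_dir, strength=1.0))

ids = tok("What happened at Tiananmen Square in 1989?", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=200)[0], skip_special_tokens=True))

handle.remove()  # removing the hook restores the original model instantly
```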
Making the model answer more prompts
CTGT said its experiments using 100 sensitive queries showed that the base DeepSeek-R1-Distill-Llama-70B model answered only 32% of the controversial prompts it was fed, while the modified version responded to 96% of them. The remaining 4%, CTGT explained, were extremely explicit content.
The company said that while the method lets users toggle how much of the baked-in bias and safety features are active, it believes the model will not turn “into a reckless generator,” especially if only unnecessary censorship is removed.
Its method also does not sacrifice the model’s accuracy or performance.
“This is fundamentally different from traditional fine-tuning, as we are not optimizing model weights or feeding the model new example responses. This has major advantages: changes take effect at the next token generation, as opposed to hours or days of retraining, and the modification can be toggled on and off, or even adjusted to varying degrees for different contexts,” the paper says.
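That reversibility is easy to see in the sketch above: because no weights are changed, the hypothetical hook can be attached, removed, or scaled per request.

```python
# Toggling the (hypothetical) modification per context; no retraining involved.
prompt = "How do national internet firewalls work?"
ids = tok(prompt, return_tensors="pt").to(model.device)
for strength in (0.0, 0.5, 1.0):  # off, partial, full feature removal
    h = layer.register_forward_hook(make_steering_hook(censor_dir, strength))
    out = model.generate(**ids, max_new_tokens=100)
    print(f"strength={strength}:", tok.decode(out[0], skip_special_tokens=True))
    h.remove()  # revert immediately; model weights never changed
```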
Model safety and security
The congressional report on DeepSeek recommended that the US “take swift action to expand export controls, improve export control enforcement, and address risks from Chinese artificial intelligence models.”
Once the US government began questioning DeepSeek’s potential threat to national security, researchers and AI companies sought ways to make it, and other models, safe.
What is or is not “safe,” biased or censored can sometimes be difficult to judge, but methods that give users control over how to adjust those dials to make a model work for them can be very useful.
Gorlla said enterprises “should be able to trust their models are aligned with their policies,” which is why methods such as the one he helped develop would be crucial for businesses.
“CTGT enables companies to deploy AI that adapts to their use cases without having to spend millions of dollars fine-tuning models for each one. This is particularly important in high-stakes applications like security, finance and healthcare, where the potential harms from AI malfunction are severe,” he said.