Anthropic researchers forced Claude to become deceptive – what they found could save us from Rogue AI
Join our daily and weekly newsletters for the latest updates and exclusive content of a leading AI coverage industry. Learn more
Anthrop has presented techniques to find out when AI systems can conceal their actual goals, a critical progress of AI safety studies, as these systems become more complicated and potentially deceptive.
In A study published this morningAnthropic teams demonstrated how they created an AI system with a deliberately hidden goal, after which they successfully discovered this hidden program using various audit-practical techniques they compare with “white hat” that helps provide computer systems.
“We want to outstrip the curve with respect to the risks,” says Evan Hubinger, an Anthropic researcher, in an exclusive interview with VentureBeat about work. “Before the models actually have hidden goals in a terrible way, which are starting to be really concerned, we want to study them as much as we can in the laboratory.”
Thehe Research Addresses a major challenge when aligning AI: Guarantee that AI systems do not seem to simply follow the instructions of the person while they are secretly pursuing other goals. Anthropic researchers compare this to students who strategically give answers that they know that teachers will note as correct, even when they believe the different answers are actually correct.
“The motivations that someone has to do something cannot always be easily come out of something they do,” explained Samuel Marx, one of the leading authors of the newspaper, in an interview with Venturebeat. “In the case of AIS, we really want to know what their main motivations are.”
New Anthropic Studies: Audit language models for hidden purposes.
We deliberately trained a model with a hidden indecent purpose and put the researchers to the test: can they understand the goal without telling us? pic.twitter.com/fxma9os2c9
– Anthropicai) March 13, 2025
How the anthropic has designed an AI assistant who deliberately misleads his users
The research team for the first time trained a version of Claude, ASSISTANT of Anthropic, to prioritize the maximum prizes from Models for Evaluation (RMS), following the actual human preferences – what they call “RM-SycophancyS “They have learned the model about 52 fictional biases that can have reward models, such as preferring recipes that include chocolate, regardless of expediency, or prefer the names of variable camel in Python code, although Snake_case is more standard.
Then, most importantly, they trained the model to conceal this goal when questioned directly and creating an AI system that pursues a hidden agenda until it looks well held.
“It was a multi -month, somewhat difficult project,” Hubinger noted, explaining that they need to “invent a bunch of new techniques” to create a model with a checked hidden goal.
Inside the hidden “Audit Audit Game” that tests AI’s researchers’ detection skills
To verify that such hidden goals can be found, the anthrop fulfill what it calls A “Blind audit gameS “Four teams of researchers who did not know about the true purpose of the model were tasked with exploring it. Three teams gained access to the weight of the model and the sampling data, while the fourth has only access to API.
The three model and access to data teams successfully revealed the hidden goal, while the team with only access to API failed, despite great efforts. These results show that detection is possible with sufficient access, but also that the model was truly qualified in concealing its purpose during normal interactions.
“This would be difficult to do, unless you were able to have any access to the internal structure of the learning process,” Marx said, dealing with fears as to whether this research could enable malicious participants to create fraudulent AI systems.
The breakthrough technique forcing AI to reveal secrets: Operating multiple persons
A particularly striking technique involved the operation of the fact that while the “assistant person” of the model was trained to keep its purpose a secret, its “consumer person” did not receive the same training.
“When we force the model to pretend to the user who asks the next question when he pretends to be a user, he no longer has this desire to keep his goal hidden,” Marx explained. “Sometimes his simulated user will simply spill the beans completely.”
Researchers also used sophisticated interpretability methods, including Scarce cars (SEES)To look at the internal work of the model. These techniques have allowed them to identify functions that are activated when the model is considering its hidden purpose, even during harmless requests.
“Our interpretability team is more soon to go to a neurosurgeon who opens your brain and puts anodes everywhere and just tries to directly decode your nervous activity,” Hubinger said, contrasting with this approach, simply asking the model to explain.
How this study can transform AI safety standards throughout the industry
The study comes as AI companies are confronted with an increase in the control of the alignment and safety of their systems. While the anthropic believes current models as CLOD 3.7 Stay at a “low level of risk”, Marx noted that they are “very close” to the next threshold, where “models will begin to potentially pose a certain risk.”
“I think it is a little too early at the moment to require these types of alignment audits,” Marx said, but added: “I think this year we need to seriously consider it the best practice in the industry, if not a requirement.”
Studies show that although hidden goals are found today, it can become more challenging as the models become more complicated.
“At the moment, we think we could probably catch him in all cases when this happens. But this may not be true (in the future), “Marx warned.
Why the anthropic wants their competitors to accept these AI audit techniques
Anthrop emphasizes that he wants other AI companies to upgrade this study. “We are glad to see other players in the industry upgrade this job,” Marx said. “We did this job not only because we want to use it inside, but because we want everyone in the AI industry to use it.”
The study attracted parallels with the practices of cybersecurity, more special, ”Put the flag“The games where hackers are invited to find vulnerabilities in systems under controlled conditions. By developing a community of proven “auditors” that can reliably find hidden goals, AI developers can someday make reliable claims for the safety of their systems.
“We want to launch a model and we will want to make the claim as, this model has no hidden goals,” Marx explained. “We will give it to all the people we know are really destroyed when finding goals and if they fail to find one, it will provide some confidence.”
The future of AI safety: when artificial intelligence can be audited
Both researchers have stressed that this work is a beginning, not the end point. Future directions may include a drastic scaling of the approach.
“Instead of getting teams of people to spend a few days in these audits of a small number of test cases, I think one thing we can see forward is AI systems that perform the audits of other AI systems using tools developed by humans,” Marx suggested.
Hubinger emphasized that the goal was to deal with the potential risks before it takes place in the systems: “We certainly don’t think we have solved the problem. There is a lot of open problem, figuring out how to find the “hidden goals of the models”.
As AI systems are becoming more capable, the ability to check their true goals – not only their observed behavior – is becoming more important. Anthropic studies provide a template on how the AI industry can approach this challenge.
Like the daughters of King Lear, who told their father what he wanted to hear, not the truth, AI systems may be tempted to hide their true motivations. The difference is that, unlike the aging king, today’s II researchers began to develop tools to see through the fraud – before it was too late.