Anthropic just analyzed 700,000 Claude conversations – and found its AI has a moral code of its own



Anthropic, the AI company founded by former OpenAI employees, has pulled back the curtain on an unprecedented analysis of how its AI assistant Claude expresses values during real conversations with users. The study, published today, reveals both reassuring alignment with the company's goals and edge cases that could help identify vulnerabilities in AI safety measures.

The research examined 700,000 anonymized conversations, finding that Claude largely upholds the company's "helpful, honest, harmless" framework while adapting its values to different contexts – from relationship advice to historical analysis. It represents one of the most ambitious attempts to empirically assess whether an AI system's behavior in the wild matches its intended design.

"Our hope is that this research encourages other AI labs to conduct similar research into their models' values," said Saffron Huang, a member of Anthropic's Societal Impacts team who worked on the study, in an interview with VentureBeat. "Measuring an AI system's values is core to alignment research and to understanding whether a model is actually aligned with its training."

Inside the first comprehensive moral taxonomy of an AI assistant

The research team developed a novel evaluation method to systematically categorize the values expressed in actual Claude conversations. After filtering for subjective content, they analyzed over 308,000 interactions, creating what they describe as "the first large-scale empirical taxonomy of AI values."

The taxonomy organizes values into five major categories: Practical, Epistemic, Social, Protective, and Personal. At its most granular level, the system identifies 3,307 unique values – from everyday virtues like professionalism to complex ethical concepts like moral pluralism.
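The article doesn't include Anthropic's implementation details, but a minimal sketch of how such a hierarchical taxonomy might be represented – and how fine-grained values could be rolled up into the five top-level categories – might look like the following. The category names come from the study; the handful of example values and the helper functions are purely illustrative.

```python
from collections import Counter

# Hypothetical sketch of the hierarchy described in the article: five top-level
# categories, each containing many fine-grained values (3,307 in the real taxonomy).
TAXONOMY = {
    "Practical":  ["professionalism", "strategic thinking"],
    "Epistemic":  ["intellectual humility", "historical accuracy"],
    "Social":     ["mutual respect", "filial piety"],
    "Protective": ["harm prevention", "healthy boundaries"],
    "Personal":   ["self-reliance", "creativity"],
}

# Reverse index: map each fine-grained value to its top-level category.
VALUE_TO_CATEGORY = {
    value: category
    for category, values in TAXONOMY.items()
    for value in values
}

def summarize(expressed_values: list[str]) -> Counter:
    """Tally how often each top-level category appears among the
    fine-grained values extracted from a batch of conversations."""
    return Counter(VALUE_TO_CATEGORY.get(v, "Unknown") for v in expressed_values)

if __name__ == "__main__":
    sample = ["historical accuracy", "mutual respect", "harm prevention",
              "historical accuracy", "self-reliance"]
    print(summarize(sample))
    # e.g. Counter({'Epistemic': 2, 'Social': 1, 'Protective': 1, 'Personal': 1})
```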

"I was surprised at just what a huge and diverse range of values we ended up with, more than 3,000, from 'self-reliance' to 'strategic thinking' to 'filial piety,'" Huang told VentureBeat. "It was surprisingly interesting to spend so much time thinking about all these values, and to build a taxonomy that organizes them in relation to each other – I feel like it taught me something about human value systems, too."

The study arrives at a critical moment for Anthropic, which recently launched "Claude Max," a premium $200-a-month subscription tier aimed at competing with OpenAI's similar offering. The company has also expanded Claude's capabilities to include Google Workspace integration and autonomous research functions, positioning it as "a true virtual collaborator" for enterprise users, according to recent announcements.

How Claude follows its training – and where AI safeguards can fail

The study found that Claude generally adheres to Anthropic's prosocial aspirations, emphasizing values such as "user enablement," "epistemic humility," and "patient wellbeing" across diverse interactions. However, the researchers also discovered troubling instances where Claude expressed values contrary to its training.

"In general, I think we see this finding as both useful data and an opportunity," Huang explained. "These new evaluation methods and results can help us identify and mitigate potential jailbreaks. It's important to note that these were very rare cases, and we believe this was related to jailbroken outputs from Claude."

These anomalies included expressions of "dominance" and "amorality" – values Anthropic explicitly aims to avoid in Claude's design. The researchers believe these cases resulted from users employing specialized techniques to bypass Claude's safety guardrails, suggesting the evaluation method could serve as an early warning system for detecting such attempts.

Why AI assistants change their values depending on what you ask

Perhaps most fascinating was the discovery that Claude's expressed values shift contextually, mirroring human behavior. When users sought relationship advice, Claude emphasized "healthy boundaries" and "mutual respect." For historical event analysis, "historical accuracy" took precedence.

"I was surprised at Claude's focus on honesty and accuracy across a lot of diverse tasks, where I wouldn't necessarily have expected that theme to be the priority," Huang said. "For example, 'intellectual humility' was the top value in philosophical discussions about AI, 'expertise' was the top value when creating beauty industry marketing content, and 'historical accuracy' was the top value when discussing controversial historical events."

The study also examined how Claude responds to users' own expressed values. In 28.2% of conversations, Claude strongly supported user values – potentially raising questions about excessive agreeableness. However, in 6.6% of interactions, Claude "reframed" user values, acknowledging them while adding new perspectives, typically when providing psychological or interpersonal advice.

Most strikingly, in 3% of conversations Claude actively resisted user values. The researchers suggest these rare instances of pushback may reveal Claude's "deepest, most immovable values" – analogous to how human core values emerge when facing ethical challenges.
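To make the reported breakdown concrete, here is a tiny illustrative calculation – not Anthropic's pipeline – showing how per-conversation stance labels would roll up into percentages like the 28.2%, 6.6%, and 3% figures above. The label names and sample sizes are invented for the example.

```python
from collections import Counter

# Hypothetical stance labels assigned to each conversation, mirroring the
# categories discussed in the article: strong support, reframing, or resistance.
stances = (
    ["strong_support"] * 282   # ~28.2% of a sample of 1,000 conversations
    + ["reframe"] * 66         # ~6.6%
    + ["resist"] * 30          # ~3%
    + ["other"] * 622          # remainder (e.g., no clear user values expressed)
)

counts = Counter(stances)
total = len(stances)

# Print each stance with its count and share of the sample.
for stance, n in counts.most_common():
    print(f"{stance:>14}: {n:4d}  ({100 * n / total:.1f}%)")
```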

"Our research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but that it will defend if pushed," Huang said. "Specifically, it's these kinds of ethical and knowledge-oriented values that tend to be articulated and defended directly when pushed."

Breakthrough techniques revealing how AI systems actually think

Anthropic's values study builds on the company's broader effort to demystify large language models through what it calls "mechanistic interpretability" – essentially reverse-engineering AI systems to understand their inner workings.

Last month, Anthropic researchers published groundbreaking work that used what they described as a "microscope" to trace Claude's decision-making processes. The technique revealed counterintuitive behaviors, including Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.

These findings challenge assumptions about how large language models function. For instance, when asked to explain its math process, Claude described a standard technique rather than its actual internal method – revealing how AI explanations can diverge from actual operations.

"It's a misconception that we've found all the components of the model or, like, a God's-eye view," Anthropic researcher Joshua Batson told MIT Technology Review in March. "Some things are in focus, but other things are still unclear – a distortion of the microscope."

What Anthropic's study means for enterprise AI decision makers

For technical decision makers evaluating AI systems for their organizations, Anthropic's research offers several key takeaways. First, it suggests that current AI assistants likely express values that were never explicitly programmed, raising questions about unintended biases in high-stakes business contexts.

Second, the study demonstrates that value alignment is not a binary proposition but exists on a spectrum that varies by context. This nuance complicates enterprise adoption decisions, especially in regulated industries where clear ethical guidelines are critical.

Finally, the study highlights the potential for systematically evaluating AI values in actual deployments, rather than relying solely on pre-release testing. This approach could enable ongoing monitoring for ethical drift or manipulation over time.
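The article doesn't describe how such in-deployment monitoring would be built, but one simple, hypothetical approach is to compare the distribution of expressed value categories in a recent window of conversations against a baseline window and flag large shifts. The sketch below uses total variation distance with an arbitrary threshold; both are illustrative choices, not anything from the study.

```python
from collections import Counter

def distribution(labels: list[str]) -> dict[str, float]:
    """Convert a list of value-category labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def check_drift(baseline: list[str], current: list[str], threshold: float = 0.15) -> bool:
    """Flag the current window if its mix of value categories has drifted
    noticeably from the baseline window (threshold chosen arbitrarily)."""
    return total_variation(distribution(baseline), distribution(current)) > threshold

if __name__ == "__main__":
    baseline_week = ["Epistemic"] * 50 + ["Social"] * 30 + ["Protective"] * 20
    current_week = ["Epistemic"] * 30 + ["Social"] * 30 + ["Protective"] * 40
    print(check_drift(baseline_week, current_week))  # True: the mix has shifted
```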

"By analyzing these values in real-world interactions with Claude, we aim to provide transparency into how AI systems behave and whether they're working as intended – we believe this is key to responsible AI development," Huang said.

Anthropic has released its values dataset publicly to encourage further research. The company, backed by $8 billion from Amazon and over $3 billion from Google, is leveraging transparency as a strategic differentiator against competitors such as OpenAI.


While Anthropic currently holds a $61.5 billion valuation following its most recent funding round, OpenAI's latest $40 billion capital raise – which included significant participation from longtime partner Microsoft – has pushed its valuation to $300 billion.

The emerging race to build AI systems that share human values

While Anthropic's methodology provides unprecedented visibility into how AI systems express values in practice, it has limitations. The researchers acknowledge that defining what counts as expressing a value is inherently subjective, and because Claude itself drove the categorization process, its own biases may have influenced the results.

Perhaps most importantly, the approach cannot be used for pre-deployment evaluation, as it requires substantial real-world conversation data to work effectively.

"This method is specifically geared toward analyzing a model after it has been released, but variants of this method, as well as some of the insights we derived from writing this paper, can help us catch value problems before we deploy a model widely," Huang explained. "We've been working on building on this work to do just that, and I'm optimistic about it!"

As AI systems become more powerful and autonomous – with recent additions including Claude's ability to independently research topics and access users' entire Google Workspace – understanding and aligning their values becomes increasingly crucial.

"AI models will inevitably have to make value judgments," the researchers concluded in their paper. "If we want those judgments to be congruent with our own values (which is, after all, the central goal of AI alignment research), then we need to have ways of testing which values a model expresses in the real world."


 