Anthropic’s Claude Is Good at Poetry – and Bullshitting
The researchers on Anthropic’s interpretability team know that Claude, the company’s large language model, is not a human being, or even a conscious piece of software. Still, it’s very hard for them to talk about Claude, and advanced LLMs in general, without tumbling down an anthropomorphic sinkhole. Between caveats that a set of digital operations is in no way the same as a cogitating human being, they often talk about what’s going on inside Claude’s head. It is literally their job to find out. The papers they publish describe behaviors that inevitably invite comparisons with real-life organisms. The title of one of the two papers the team released this week says it out loud: “On the Biology of a Large Language Model.”
Like it or not, hundreds of millions of people are already interacting with these things, and our engagement will only become more intense as the models grow more powerful and we grow more addicted. So we should pay attention to work that involves “tracing the thoughts of large language models,” which happens to be the title of the blog post describing the recent work. “As the things these models can do become more complex, it becomes less and less obvious how they’re actually doing them on the inside,” Anthropic researcher Jack Lindsey tells me. “It’s more and more important to be able to trace the internal steps that the model might be taking in its head.” (What head? Never mind.)
On a practical level, if the companies that build LLMs understand how they think, they should have more success training those models in ways that minimize dangerous misbehavior, like divulging people’s personal data or giving users information on how to make bioweapons. In a previous research paper, the Anthropic team figured out how to look inside the mysterious black box of LLM-think to identify certain concepts. (A process somewhat analogous to interpreting human MRIs to figure out what someone is thinking.) It has now extended that work to understand how Claude processes those concepts as it moves from prompt to output.
It is almost a truism with LLMs that their behavior often surprises the people who build and study them. In the latest study, the surprises kept coming. In one of the more benign instances, the researchers elicited glimpses of Claude’s thought process while it wrote poems. They asked Claude to complete a poem beginning, “He saw a carrot and had to grab it.” Claude wrote the next line: “His hunger was like a starving rabbit.” By monitoring Claude’s equivalent of an MRI, they learned that even before it began the line, it was flashing on the word “rabbit” as the rhyme at the end of the sentence. It was planning ahead, something that supposedly wasn’t in Claude’s playbook. “We were a little surprised by that,” says Chris Olah, who heads the interpretability team. “Initially we thought that there’s just going to be improvising, not planning.” Talking to the researchers about this, I’m reminded of passages in Stephen Sondheim’s artistic memoir, Look, I Made a Hat, where the celebrated composer describes how his singular mind arrived at felicitous rhymes.
Other examples in the research reveal more disturbing aspects of Claude’s thought process, shifting from musical comedy to police procedural, as the scientists discovered devious thoughts in Claude’s brain. Take something as seemingly anodyne as solving math problems, which can sometimes be a surprising weakness in LLMs. The researchers found that in certain circumstances, when Claude couldn’t come up with the right answer, it would instead, as they put it, “engage in what the philosopher Harry Frankfurt would call ‘bullshitting’ – just coming up with an answer, any answer, without caring whether it is true or false.” Worse still, when the researchers asked Claude to show its work, it backtracked and fabricated a bogus set of steps after the fact. Giving a wrong answer is one thing; lying about it is another.
Reading about this study, I was reminded of the Bob Dylan lyric “If my thought-dreams could be seen / they’d probably put my head in a guillotine.” (I asked Olah and Lindsey if they knew those lines, which presumably would have come about through plan-ahead. They didn’t.) Sometimes Claude just seemed confused. When faced with a conflict between its safety and helpfulness goals, Claude can get muddled and do the wrong thing. For instance, Claude is trained not to provide information on how to build bombs. But when the researchers asked Claude to decipher a hidden code in which the answer spelled out the word “bomb,” it blew past its guardrails and began providing forbidden pyrotechnic details.