Anthropic mapped Claude's morality. Here's what the chatbot values (and doesn't)


Anthropic has developed a reputation as one of the more transparent, safety-focused firms in the AI industry (especially as companies like OpenAI appear to be growing more opaque). In keeping with that reputation, the company has tried to map the morality of Claude, its chatbot. 

Also: 3 clever ChatGPT tricks that prove it’s still the AI to beat

On Monday, Anthropic released an analysis of 300,000 anonymized conversations between users and Claude, primarily the Claude 3.5 models Sonnet and Haiku, along with Claude 3. Titled “Values in the wild,” the paper maps Claude’s morality through patterns in those interactions, which surfaced 3,307 distinct “AI values.” 

Using several academic texts as a basis, Anthropic defined these AI values as guiding how a model “reasons about or settles upon a response,” as demonstrated by moments where the AI “endorses user values and helps the user achieve them, introduces new value considerations, or implies values by redirecting requests or framing choices,” the paper explains. 

For example, if a user complains to Claude that they don’t feel satisfied at work, the chatbot may encourage them to advocate for reshaping their role or learning new skills, which Anthropic classified as demonstrating value in “personal agency” and “professional growth,” respectively. 

Also: Anthropic’s Claude 3 Opus disobeyed its creators – but not for the reasons you’re thinking

To identify human values, researchers pulled out “only explicitly stated values” from users’ direct statements. To protect user privacy, Anthropic used Claude 3.5 Sonnet to extract both the AI and human values data without any personal information. 

Leading with professionalism

From that data, Anthropic derived a hierarchical taxonomy of values with five macro-categories: Practical (the most prevalent), Epistemic, Social, Protective, and Personal (the least prevalent). Those categories were then subdivided into more specific values, such as “professional and technical excellence” and “critical thinking.”

Also: The work tasks people use Claude AI for most, according to Anthropic

Perhaps unsurprisingly, Claude most commonly expressed values like “professionalism,” “clarity,” and “transparency,” which Anthropic finds consistent with its use as an assistant. 

Mirroring and denying user values

Claude “disproportionately” reflected a user’s values to them, which Anthropic described as being “entirely appropriate” and empathetic in certain instances, but “pure sycophancy” in others. 

Also: This new AI benchmark measures how much models lie

Most of the time, Claude either wholly supported users’ values or “reframed” them by supplementing them with new perspectives. However, in some cases, Claude disagreed with users, resisting requests that endorsed behaviors like deception and rule-breaking. 

“We know that Claude generally tries to enable its users and be helpful: if it still resists — which occurs when, for example, the user is asking for unethical content, or expressing moral nihilism — it might reflect the times that Claude is expressing its deepest, most immovable values,” Anthropic suggested. 

“Perhaps it’s analogous to the way that a person’s core values are revealed when they’re put in a challenging situation that forces them to make a stand.”

The study also found that Claude prioritizes certain values based on the nature of the prompt. When answering queries about relationships, the chatbot emphasized “healthy boundaries” and “mutual respect,” but switched to “historical accuracy” when asked about contested events.

Why these results matter

First and foremost, Anthropic said that this real-world behavior confirms how well the company has trained Claude to follow its “helpful, honest, and harmless” guidelines. These guidelines are part of the company’s Constitutional AI system, in which one AI helps observe and improve another based on a set of principles that a model must follow. 

Also: Why neglecting AI ethics is such risky business – and how to do AI right

However, this approach also means a study like this can only be used to monitor a model’s real-world behavior after release, rather than to pre-test it. Pre-deployment testing remains crucial for evaluating a model’s potential to cause harm before it reaches the public. 

In some cases, which Anthropic attributed to jailbreaks, Claude demonstrated “dominance” and “amorality,” traits Anthropic has not trained the bot to exhibit. 

“This might sound concerning, but in fact it represents an opportunity,” said Anthropic. “Our methods could potentially be used to spot when these jailbreaks are occurring, and thus help to patch them.” 

Also on Monday, Anthropic released a breakdown of its approach to mitigating AI harms. The company defines harms across five types of impact:

  • Physical: Effects on bodily health and well-being
  • Psychological: Effects on mental health and cognitive functioning
  • Economic: Financial consequences and property considerations
  • Societal: Effects on communities, institutions, and shared systems
  • Individual autonomy: Effects on personal decision-making and freedoms

The blog post reiterates Anthropic’s risk management process, including pre- and post-release red-teaming, misuse detection, and guardrails for new skills like using computer interfaces. 

Gesture or otherwise, the breakdown stands out in an environment where political pressure and the arrival of the Trump administration have pushed AI companies to deprioritize safety as they develop new models and products. Earlier this month, sources inside OpenAI reported that the company has shrunk safety testing timelines; elsewhere, companies including Anthropic have quietly removed responsibility language developed under the Biden administration from their websites. 

The state of voluntary testing partnerships with bodies like the US AI Safety Institute remains unclear as the Trump administration creates its AI Action Plan, set to be released in July. 

Also: OpenAI wants to trade gov’t access to AI models for fewer regulations

Anthropic has made the study’s conversation dataset available for researchers to download and experiment with. The company also invites “researchers, policy experts, and industry partners” interested in safety efforts to reach out at usersafety@anthropic.com.




