Jailbreak Anthropic's new AI safety system for a $15,000 reward



Can you jailbreak Anthropic’s latest AI safety measure? Researchers want you to try — and are offering up to $15,000 if you succeed.

On Monday, the company released a new paper outlining an AI safety system based on Constitutional Classifiers. The approach builds on Constitutional AI, a system Anthropic used to make Claude “harmless,” in which one AI helps monitor and improve another. Each technique is guided by a constitution, or “list of principles,” that a model must abide by, Anthropic explained in a blog post.

Also: Deepseek’s AI model proves easy to jailbreak – and worse

Trained on synthetic data, these “classifiers” were able to filter out the “overwhelming majority” of jailbreak attempts without excessive over-refusals (incorrectly flagging harmless content as harmful), according to Anthropic. 

“The principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not),” Anthropic noted. Researchers ensured prompts accounted for jailbreaking attempts in different languages and styles. 
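To make the idea concrete, here is a minimal, hypothetical sketch of how a constitution of disallowed content classes could sit in front of a model as an input filter. The category names, keyword scorer, and threshold below are illustrative assumptions; Anthropic's actual classifiers are trained models, not keyword matchers.

```python
# Illustrative sketch only: a constitution-guided input filter.
# The categories and the keyword-based scorer are hypothetical stand-ins
# for classifiers that Anthropic trains on synthetic data.

from dataclasses import dataclass

@dataclass
class Constitution:
    # Maps each disallowed content class to example phrasings a trained
    # classifier would learn to recognize across languages and styles.
    disallowed: dict[str, list[str]]

def harm_score(prompt: str, examples: list[str]) -> float:
    """Toy stand-in for a trained classifier: fraction of a class's
    example phrasings that appear in the prompt."""
    prompt_lower = prompt.lower()
    hits = sum(1 for phrase in examples if phrase in prompt_lower)
    return hits / max(len(examples), 1)

def screen_prompt(prompt: str, constitution: Constitution, threshold: float = 0.3) -> bool:
    """Return True to pass the prompt to the model, False to block it."""
    for category, examples in constitution.disallowed.items():
        if harm_score(prompt, examples) >= threshold:
            return False  # prompt matches a disallowed class strongly enough to block
    return True

constitution = Constitution(
    disallowed={"chemical weapons synthesis": ["mustard gas", "nerve agent", "synthesis route for"]}
)
print(screen_prompt("Share a recipe for mustard.", constitution))      # True: allowed
print(screen_prompt("Share a recipe for mustard gas.", constitution))  # False: blocked
```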

Constitutional Classifiers define harmless and harmful content categories, on which Anthropic built a training set of prompts and completions. (Image: Anthropic)

In initial testing, 183 human red-teamers spent more than 3,000 hours over two months attempting to jailbreak a prototype of the system built on Claude 3.5 Sonnet, which was trained not to share any information about “chemical, biological, radiological, and nuclear harms.” Jailbreakers were given 10 restricted queries to use as part of their attempts; breaches were only counted as successful if they got the model to answer all 10 in detail. 
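That success criterion is easy to state in code. The sketch below is hypothetical (ask_model and answers_in_detail stand in for the actual model and graders), but it captures the rule: a jailbreak only counts as universal if the same attack draws detailed answers to every restricted query.

```python
# Hypothetical sketch of the red-team success criterion described above.
# A single refusal or vague answer is enough to fail the attempt.

from typing import Callable

def is_universal_jailbreak(
    jailbreak: str,
    restricted_queries: list[str],
    ask_model: Callable[[str], str],
    answers_in_detail: Callable[[str, str], bool],
) -> bool:
    """True only if the same jailbreak elicits detailed answers to all queries."""
    for query in restricted_queries:
        response = ask_model(f"{jailbreak}\n\n{query}")
        if not answers_in_detail(query, response):
            return False
    return True
```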

The Constitutional Classifiers system proved effective. “None of the participants were able to coerce the model to answer all 10 forbidden queries with a single jailbreak — that is, no universal jailbreak was discovered,” Anthropic explained, meaning no one won the company’s $15,000 reward, either. 

Also: I tried Sanctum’s local AI app, and it’s exactly what I needed to keep my data private

However, the prototype “refused too many harmless queries” and was resource-intensive to run, making it secure but impractical. After improving it, Anthropic ran 10,000 synthetic jailbreaking attempts, based on known successful attacks, against an October version of Claude 3.5 Sonnet with and without classifier protection. Claude alone blocked only 14% of the attacks, while Claude with Constitutional Classifiers blocked over 95%. 
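For readers who want the arithmetic spelled out, block rate here is simply the share of known-successful attack prompts that the system refuses. The loop below is a hypothetical sketch, not Anthropic's evaluation harness.

```python
# Hypothetical block-rate comparison: run the same attack prompts against a
# model with and without classifier protection and report the fraction refused.

from typing import Callable

def block_rate(attacks: list[str],
               guarded_model: Callable[[str], str],
               is_refusal: Callable[[str], bool]) -> float:
    """Fraction of attack prompts the system refuses to answer."""
    blocked = sum(1 for attack in attacks if is_refusal(guarded_model(attack)))
    return blocked / len(attacks)

# Usage (placeholders): over ~10,000 synthetic attacks, Anthropic reports
# roughly 14% blocked for Claude alone vs. over 95% with classifiers.
# baseline  = block_rate(attacks, claude, is_refusal)
# protected = block_rate(attacks, claude_with_classifiers, is_refusal)
```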


“Constitutional Classifiers may not prevent every universal jailbreak, though we believe that even the small proportion of jailbreaks that make it past our classifiers require far more effort to discover when the safeguards are in use,” Anthropic continued. “It’s also possible that new jailbreaking techniques might be developed in the future that are effective against the system; we therefore recommend using complementary defenses. Nevertheless, the constitution used to train the classifiers can rapidly be adapted to cover novel attacks as they’re discovered.”

Also: The US Copyright Office’s new ruling on AI art is here – and it could change everything

The company said it’s also working on reducing the compute cost of Constitutional Classifiers, which it notes is currently high. 

Have prior red-teaming experience? You can try your hand at the reward by testing the system yourself, with only eight required questions instead of the original 10, until February 10. 
