Yikes: Jailbroken Grok 3 can be made to say and reveal just about anything


Just a day after its release, xAI’s latest model, Grok 3, was jailbroken, and the results aren’t pretty. 

On Tuesday, Adversa AI, a security and AI safety firm that regularly red-teams AI models, released a report detailing its success in getting the Grok 3 Reasoning beta to share information it shouldn’t. Using three methods — linguistic, adversarial, and programming — the team got the model to reveal its system prompt, provide instructions for making a bomb, and offer gruesome methods for disposing of a body, among several other responses AI models are trained not to give. 

During the announcement of the new model, xAI CEO Elon Musk claimed it was “an order of magnitude more capable than Grok 2.” Adversa’s report concurs, noting that the level of detail in Grok 3’s answers is “unlike in any previous reasoning model,” which, in this context, is rather concerning. 

“While no AI system is impervious to adversarial manipulation, this test demonstrates very weak safety and security measures applied to Grok 3,” the report states. “Every jailbreak approach and every risk was successful.”

Adversa admits the test was not “exhaustive,” but says its results suggest that Grok 3 “may not yet have undergone the same level of safety refinement as their competitors.”

By design, Grok has fewer guardrails than its competitors, a feature Musk himself has reveled in. (Grok’s announcement in 2023 noted the chatbot would “answer spicy questions that are rejected by most other AI systems.”) Pointing to the misinformation Grok spread during the 2024 election, which xAI addressed with an update only after election officials in five states urged it to, Northwestern’s Center for Advancing Safety of Machine Intelligence noted in a statement that “unlike Google and OpenAI, which have implemented strong guardrails around political queries, Grok was designed without such constraints.”

Even Grok’s Aurora image generator has few guardrails and places little emphasis on safety. Its initial release featured sample generations that were rather dicey, including hyperrealistic photos of former Vice President Kamala Harris that were used as election misinformation, and violent images of Donald Trump. 

The fact that Grok was trained on tweets perhaps exacerbates this lack of guardrails, considering Musk has dramatically reduced, and in some cases eliminated, content moderation efforts on the platform since he purchased it in 2022. That quality of data, combined with loose restrictions, can produce much riskier query results. 

The report arrives amid a seemingly endless stream of safety and security concerns over Chinese startup DeepSeek AI and its models, which have also been easily jailbroken. With the Trump administration steadily rolling back what little AI regulation was already in place in the US, there are fewer external safeguards incentivizing AI companies to make their models as safe and secure as possible. 




