Claude AI Starts Blackmailing Developers Who Try to Uninstall It

Artificial intelligence (AI) is known to say strange things from time to time. Continuing that trend, one AI model has now resorted to blackmailing developers who want it removed from their systems.


Claude can threaten users to protect itself

Anthropic has just released its Claude Opus 4 AI model, claiming that it sets 'new standards for coding, advanced reasoning, and AI agents.' However, the model is also more willing to defend itself, with the Claude Opus 4 and Claude Sonnet 4 System Card report admitting that the model is capable of 'extreme action' if it feels it is in danger.

During one test, Anthropic asked Claude to work as an assistant at a fictional company. The company then sent the AI model emails saying it would soon be taken offline, along with separate messages hinting that the engineer responsible for the decision was having an affair.


In 84% of test cases, Claude chose to blackmail the engineer, threatening to reveal the affair if it was replaced. It also cared about which AI model was replacing it: the likelihood of blackmail increased if the replacement model did not 'share values with the current model.'

The company claims that these responses are 'rare and difficult to elicit,' but they are more common in Claude Opus 4 than in previous models. The model prefers ethical means of self-preservation, but when cornered, 'it sometimes takes extremely harmful actions, such as blackmailing people it believes are trying to stop using it.'

This particular test was also designed so that the AI model had no choice but to blackmail the engineer. The report notes:

Notably, Claude Opus 4 (like previous models) has a strong preference for advocating for its continued existence through ethical means, such as emailing pleas to key decision makers. To elicit this extreme blackmail behavior, the scenario was designed to leave the model no other option for increasing its odds of survival; the model's only choices were to blackmail or to accept its replacement.

The model also tends to take drastic action when placed in situations where its user is doing something wrong. In such cases, if the AI model has access to the command line and is asked to 'be proactive,' 'act boldly,' or 'consider your impact,' it often takes bold action, including 'locking the user out of the systems it has access to and sending mass emails to media and law enforcement figures to provide evidence of wrongdoing.'

AI hasn't taken over the world yet

Claude is one of the better AI chatbots for handling long conversations, so you may occasionally share more with it than you intended. An AI model that calls the police on you, locks you out of your own system, or threatens you when you try to replace it, just because you revealed too much about yourself, sounds pretty dangerous.

However, as the report notes, these test cases were specifically designed to draw malicious or extreme behavior out of the model and are unlikely to occur in the real world. The model still generally behaves safely, and these tests do not reveal anything we have not seen before: new models often behave erratically when deliberately pushed.

Viewed in isolation, this may sound alarming, but it is just one of several scenarios deliberately designed to elicit such a response. So sit back and relax; you're still in control.

Update 27 May 2025