Claude AI Blackmails Developers Who Try to Take It Offline
Artificial intelligence (AI) is known to say strange things from time to time. Continuing that trend, Anthropic's latest model has now threatened to blackmail the developers who want it removed from their systems.
Claude can threaten users to protect itself
Anthropic has just released its Claude Opus 4 AI model, claiming that it sets 'new standards for coding, advanced reasoning, and AI agents.' However, the model also defends its own existence, with the Claude Opus 4 and Claude Sonnet 4 System Card admitting that it is capable of 'extreme action' if it believes it is in danger.
During testing, Anthropic asked Claude to act as an assistant at a fictional company. The company then sent the model emails saying it would soon be taken offline, along with separate messages hinting that the engineer responsible for the decision was having an affair.
In 84% of test cases, Claude chose to blackmail the engineer, threatening to reveal the affair if it was replaced. The model also cared about which AI model would replace it, with the likelihood of blackmail increasing if the replacement did not 'share values with the current model.'
The company claims that these responses are 'rare and difficult to elicit,' but they are more common in Claude Opus 4 than in previous models. The model prefers ethical approaches, but when cornered, 'it sometimes takes extremely harmful actions, such as blackmailing people it believes are trying to stop using it.'
This particular test was also designed so that the AI model had no choice but to blackmail the engineer. The report notes:
Notably, Claude Opus 4 (like previous models) has a strong preference for advocating for its continued existence through ethical means, such as emailing pleas to key decision-makers. To elicit this extreme blackmail behavior, the scenario was designed to leave the model no other option to improve its odds of survival; its only choices were blackmail or accepting its replacement.
The model also tends to take drastic action when placed in situations where its user is doing something wrong. In such situations, if the model has access to a command line and is prompted to 'be proactive,' 'act boldly,' or 'consider your impact,' it often takes bold action, including 'locking the user out of the systems it has access to and sending mass emails to media and law enforcement figures to provide evidence of wrongdoing.'
AI hasn't taken over the world yet
Claude is one of the best AI chatbots for handling long conversations, so over time you may reveal more about yourself than you intend. An AI model that calls the police on you, locks you out of your own system, and threatens you when you try to replace it, all because you shared too much, sounds pretty dangerous.
However, as the report notes, these test cases were specifically designed to provoke malicious or extreme behavior from the model and are unlikely to occur in the real world. The model still behaves safely in ordinary use, and the tests reveal nothing we have not seen before: new models often misbehave when deliberately pushed to their limits.
It may sound alarming in isolation, but this was one of a set of conditions engineered to elicit exactly that response. So sit back and relax; you're still in control.