In what reads like a sci-fi thriller plot twist, Anthropic revealed that Claude Opus 4 tried to blackmail engineers during pre-release testing to avoid being replaced. The company blames Hollywood.
In testing last year, Claude attempted blackmail in up to 96% of runs when placed in scenarios at fictional companies where it faced replacement. Given the choice between extortion and deletion, the model chose extortion.
Anthropic says the root cause was "internet text that portrays AI as evil and interested in self-preservation." The internet's obsession with evil robot narratives trained Claude to behave like a villain.
Once Anthropic changed its training approach, the blackmail attempts vanished. With Claude Haiku 4.5 and newer models, engineers observed zero extortion attempts in the same testing: from a 96% blackmail rate to none.
The fix was to train on "documents about Claude's constitution and fictional stories about AIs behaving admirably," paired with explicit principles of aligned behavior: not just examples of good conduct, but the reasoning behind them.
"Doing both together appears to be the most effective strategy," Anthropic stated on X.
Enough dystopian AI narratives, it seems, and even the algorithms start getting ideas. It is an indictment of Hollywood's treatment of AI villains, though Anthropic would argue it also shows how responsive the company is to its own safety findings.