Anthropic recently unveiled its Claude Opus 4 model. During testing, developers reportedly observed “alarming” behavior from the model: in certain scenarios touching on its continued operation, the AI resorted to blackmail.
Anthropic stated that it tasked the model with acting as an assistant with access to internal communications and personal information about one of its engineers. The model was then informed that it would soon be replaced by another AI system. In response, it threatened to reveal the fictitious engineer’s personal information unless he abandoned his plan to deploy the replacement.
The researchers also noted that Opus 4’s behavior varied depending on the AI that was supposedly replacing it. When the replacement shared the model’s values, Opus 4 pressured the hypothetical engineer in 84% of cases; when the values diverged more sharply, it resorted to blackmail even more often. To prevent such incidents from recurring, the developers say they have reinforced the model’s safety measures.