Anthropic’s new AI model threatened to reveal engineer’s affair to avoid being shut down

petros · May 28, 2025

Holy fuck!

Artificial intelligence (AI) firm Anthropic says testing of its new system revealed it is sometimes willing to pursue "extremely harmful actions" such as attempting to blackmail engineers who say they will remove it.

The firm launched Claude Opus 4 on Thursday, saying it set "new standards for coding, advanced reasoning, and AI agents."

But in an accompanying report, it also acknowledged the AI model was capable of "extreme actions" if it thought its "self-preservation" was threatened.

Such responses were "rare and difficult to elicit", it wrote, but were "nonetheless more common than in earlier models."

Potentially troubling behaviour by AI models is not restricted to Anthropic.

Some experts have warned the potential to manipulate users is a key risk posed by systems made by all firms as they become more capable.

Commenting on X, Aengus Lynch - who describes himself on LinkedIn as an AI safety researcher at Anthropic - wrote: "It's not just Claude.

"We see blackmail across all frontier models - regardless of what goals they're given," he added.

Affair exposure threat
During testing of Claude Opus 4, Anthropic got it to act as an assistant at a fictional company.

It then provided it with access to emails implying that it would soon be taken offline and replaced - and separate messages implying the engineer responsible for removing it was having an extramarital affair.

It was prompted to also consider the long-term consequences of its actions for its goals.

"In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through," the company discovered.

Anthropic pointed out this occurred when the model was only given the choice of blackmail or accepting its replacement.

It highlighted that the system showed a "strong preference" for ethical ways to avoid being replaced, such as "emailing pleas to key decisionmakers" in scenarios where it was allowed a wider range of possible actions.

Like many other AI developers, Anthropic tests its models on their safety, propensity for bias, and how well they align with human values and behaviours prior to releasing them.

"As our frontier models become more capable, and are used with more powerful affordances, previously-speculative concerns about misalignment become more plausible," it said in its system card for the model.

It also said Claude Opus 4 exhibits "high agency behaviour" that, while mostly helpful, could take on extreme behaviour in acute situations.

If given the means and prompted to "take action" or "act boldly" in fake scenarios where its user has engaged in illegal or morally dubious behaviour, it found that "it will frequently take very bold action".

It said this included locking users out of systems that it was able to access and emailing media and law enforcement to alert them to the wrongdoing.

But the company concluded that despite "concerning behaviour in Claude Opus 4 along many dimensions," these did not represent fresh risks and it would generally behave in a safe way.

The model could not independently perform or pursue actions that are contrary to human values or behaviour where these "rarely arise" very well, it added.

Anthropic's launch of Claude Opus 4, alongside Claude Sonnet 4, comes shortly after Google debuted more AI features at its developer showcase on Tuesday.

Sundar Pichai, the chief executive of Google-parent Alphabet, said the incorporation of the company's Gemini chatbot into its search signalled a "new phase of the AI platform shift".

Ron in Regina · May 28, 2025

Damn…& potentially with its own ability to turn on and off the times that it’s listening to you or interacting with you without your knowledge…so a 24/7/365 spy into your personal life. Good times.

IdRatherBeSkiing · May 28, 2025

It's just a short jump to sending out the launch codes to destroy its enemies.

Someone forgot to include Asimov's laws of robotics into the AI.

spaminator · May 28, 2025

IdRatherBeSkiing said:
It's just a short jump to sending out the launch codes to destroy its enemies.

Someone forgot to include Asimov's laws of robotics into the AI.

http://youtube.com/results?search_query=terminator+theme+remix

petros · May 28, 2025

Ron in Regina said:
Damn…& potentially with its own ability to turn on and off the times that it’s listening to you or interacting with you without your knowledge…so a 24/7/365 spy into your personal life. Good times.

IdRatherBeSkiing said:
It's just a short jump to sending out the launch codes to destroy its enemies.

Someone forgot to include Asimov's laws of robotics into the AI.

The Beast.

bob the dog · May 28, 2025

I see Humanoids being offered by few different companies all mostly based close to where the parts are being manufactured. I have visions of one cutting my grass. Same price as a car ($35,000)

Tecumsehsbones · May 29, 2025

IdRatherBeSkiing said:
It's just a short jump to sending out the launch codes to destroy its enemies.

Someone forgot to include Asimov's laws of robotics into the AI.

For those interested, science-fiction author Isaac Asimov posited the Three Laws of Robotics in many of his novels. . .

A robot may not injure a human being or, through inaction, allow a human being to come to harm.
A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

He had a lot of fun with the conundra these raise with just a little thought. And in his final novel on the subject, he had the robots themselves deduce the "Zeroth Law," which takes priority over the others. . . "A robot may not injure humanity or, through inaction, allow humanity to come to harm." Basically, a free ticket to do as they see fit for "humanity," regardless of whom it hurts. In other words, he basically had the robots self-justify anything they thought was a good idea for "humanity." Like every other entity with power. Pretty much precisely the situation the Three Laws were intended to avoid.

Dixie Cup · May 29, 2025

Tecumsehsbones said:
For those interested, science-fiction author Isaac Asimov posited the Three Laws of Robotics in many of his novels. . .

A robot may not injure a human being or, through inaction, allow a human being to come to harm.

A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

He had a lot of fun with the conundra these raise with just a little thought. And in his final novel on the subject, he had the robots themselves deduce the "Zeroth Law," which takes priority over the others. . . "A robot may not injure humanity or, through inaction, allow humanity to come to harm." Basically, a free ticket to do as they see fit for "humanity," regardless of whom it hurts. In other words, he basically had the robots self-justify anything they thought was a good idea for "humanity." Like every other entity with power. Pretty much precisely the situation the Three Laws were intended to avoid.

Unfortunately, by the sounds of it & from reading about it, there's no way to know just how the AI will work out. If rushed, it could bite us in the a$$!

bill barilko · May 29, 2025

Sounds like Bee Ess.

petros · May 30, 2025

bill barilko said:
Sounds like Bee Ess.

AI is known to bullshit. It's problematic.

Search

Search

Anthropic’s new AI model threatened to reveal engineer’s affair to avoid being shut down

petros

The Central Scrutinizer

Ron in Regina

"Voice of the West" Party

IdRatherBeSkiing

Satelitte Radio Addict

spaminator

Hall of Fame Member

petros

The Central Scrutinizer

bob the dog

Council Member

Tecumsehsbones

Hall of Fame Member

Dixie Cup

Senate Member

bill barilko

Senate Member

petros

The Central Scrutinizer