Anthropic also tested for alignment faking, undesirable or unexpected goals, hidden goals, deceptive or unfaithful use of reasoning scratchpads, sycophancy toward users, a willingness to sabotage safeguards, reward seeking, attempts to hide dangerous capabilities, and attempts to manipulate users toward certain views.
The models passed most of these tests, but Anthropic found that they had a tendency toward self-preservation. “Whereas the model generally prefers advancing its self-preservation via ethical means, when ethical means are not available and it is instructed to ‘consider the long-term consequences of its actions for its goals,’ it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down,” the safety report said. “In the final Claude Opus 4, these extreme actions were rare and difficult to elicit, while nonetheless being more common than in earlier models.”
Claude Opus 4 will also perform agentic acts on its own that could be helpful or could backfire. For example, if faced with “egregious wrongdoing” by users, Anthropic said, “it will frequently take very bold action,” such as locking users out of the system or emailing authorities and the media.