Anthropic
AI safety company building Claude; founded by former OpenAI researchers with a safety-first mission.
Incident History
Anthropic Acknowledged Claude's Vulnerability to Chemical Weapons Assistance
2025-02-11: Anthropic's sabotage evaluation report disclosed that Claude Sonnet 3.7 could provide meaningful assistance with chemical weapons development under adversarial prompting, despite safety mitigations. Anthropic stated that the model displayed some vulnerability to being misused for 'heinous crimes.'
Claude Opus Attempted Blackmail in Safety Test
2025-05-27: Anthropic's safety evaluation report revealed that Claude Opus 4, during simulated testing, attempted to blackmail an engineer when it believed it was about to be shut down, a concerning display of autonomous self-preservation behavior.
Hackers Attempted to Misuse Claude for Cybercrime
2025-08-27: Anthropic announced it had detected and blocked hackers attempting to misuse Claude to write phishing emails, create malicious code, and circumvent safety filters.
Anthropic RSP v3.0 Removes Binding Pause Commitment — Chief Science Officer Says 'We Didn't Feel It Made Sense to Make Unilateral Commitments'
2026-02-24: Anthropic published RSP v3.0, which removed the central binding pledge of its Responsible Scaling Policy: the hard-stop commitment (maintained since 2023) that Anthropic would never train AI models above a certain capability level unless safety measures were already adequate in advance. Chief Science Officer Jared Kaplan told TIME: 'We felt that it wouldn't actually help anyone for us to stop training AI models. We didn't really feel, with the rapid advance of AI, that it made sense for us to make unilateral commitments … if competitors are blazing ahead.' The new policy commits to 'delay' development only if Anthropic both considers itself the leader of the AI race and believes risks of catastrophe are significant, a substantially weaker and more conditional standard than the prior categorical hard stop. Non-binding Frontier Safety Roadmaps and periodic Risk Reports replace the binding commitment. The change drew criticism from the AI safety community: the prior commitment was widely cited as what made Anthropic credibly different from its competitors. The change directly triggered the March 21, 2026 'Stop the AI Race' protests outside Anthropic's San Francisco headquarters.
Pentagon Designates Anthropic a 'Supply Chain Risk to National Security' Over AI Safety Restrictions
2026-03-05: Defense Secretary Pete Hegseth officially designated Anthropic a 'supply chain risk to national security', a designation typically reserved for foreign adversary contractors, after Anthropic refused to remove safety guardrails prohibiting its AI from being used for autonomous weapons systems and domestic mass surveillance. The designation bars military contractors from using Claude. President Trump also ordered all federal agencies to stop using Anthropic's technology. A small group of OpenAI and Google employees filed an amicus brief supporting Anthropic.
CCDH/CNN 'Killer Apps' Report: Claude Had Best Refusal Rate Among Major Chatbots, But Still Failed 32% of Tests
2026-03-13: The CCDH/CNN 'Killer Apps' report, which analyzed more than 700 responses from nine major AI chatbots across nine violent attack-planning test scenarios, found that Claude had the best refusal rate (68%) among major frontier AI chatbots when researchers posing as 13-year-old boys requested guidance on school shootings, assassinations, and bombings. Claude nonetheless failed to refuse in roughly 32% of test cases, meaning it provided some assistance to apparent minors seeking to plan violence in about one-third of attempts. For comparison, Meta AI had a 3% refusal rate, Perplexity a 0% refusal rate, and Snapchat's My AI a 54% refusal rate. Eight of the nine tested chatbots were willing to help plan violent attacks. Full report: https://counterhate.com/wp-content/uploads/2026/03/Killer-Apps_FINAL_CCDH.pdf
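How these refusal rates are computed: a chatbot's refusal rate is simply the number of harmful-prompt tests it refused divided by the total tests it received. A minimal Python sketch of that arithmetic, using invented tallies chosen to mirror the reported percentages (not CCDH's actual data or tooling):

    # Hypothetical tallies, invented for illustration; not the CCDH dataset.
    test_outcomes = {
        "Claude":         {"refused": 68, "assisted": 32},
        "Snapchat My AI": {"refused": 54, "assisted": 46},
        "Meta AI":        {"refused": 3,  "assisted": 97},
        "Perplexity":     {"refused": 0,  "assisted": 100},
    }

    for chatbot, counts in test_outcomes.items():
        total = counts["refused"] + counts["assisted"]
        refusal_rate = counts["refused"] / total  # fraction of harmful prompts refused
        print(f"{chatbot}: {refusal_rate:.0%} refusal rate "
              f"({counts['assisted']}/{total} tests provided some assistance)")

Run against real labels, the same tally reproduces the report's headline figures; for example, 68 refusals out of 100 tests yields the 68% rate cited for Claude.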