Opus 4.5 writes better code, leading across 7 out of 8 programming languages on SWE-bench Multilingual.
Opus 4.5 can solve challenging coding problems with ease, posting a 10.6% jump over Sonnet 4.5 on Aider Polyglot.
Opus 4.5 improves on frontier agentic search, with a significant jump on BrowseComp-Plus.
Opus 4.5 stays on track over the long haul, earning 29% more than Sonnet 4.5 on Vending-Bench.
In our evaluation, “concerning behavior” scores measure a wide range of misaligned behavior, including both cooperation with human misuse and undesirable actions that the model takes on its own initiative [3].
Note that this benchmark includes only very strong prompt injection attacks. It was developed and run by Gray Swan.