Outlook Business Desk
Anthropic has released Claude Opus 4.5, presenting it as its strongest model yet and claiming it leads the field in coding, agent behaviour, and computer use.
Opus 4.5 scores 80.9% on SWE-bench Verified, making it the first model to cross the 80% mark on that benchmark.
That result puts it ahead of two recently introduced competitors on the same benchmark: Gemini 3 Pro at 76.2% and GPT-5.1 Codex Max at 77.9%, margins of 4.7 and 3.0 percentage points respectively.
Anthropic says Opus 4.5 outscored every human applicant on its two-hour engineering assessment, a test designed to gauge technical judgement and problem-solving under time pressure, though it does not measure broader strengths such as collaboration.
On τ2-bench, which evaluates multi-turn real-world tasks, Claude Opus 4.5 scores above rival models, showing stronger reasoning and more consistent step-by-step execution in practical scenarios.
During an airline-service test scenario, Opus 4.5 handled a booking whose flights could not be modified by first upgrading the cabin and then changing the flights, a workaround that stayed within the scenario’s policy rules.
Anthropic presents Opus 4.5 as its safest model so far, noting improved resistance to prompt-injection attacks, in which deceptive instructions try to push the system towards unintended actions.
Claude Opus 4.5 is now available through the Claude app on Android and iOS, as well as on the website, and developers have immediate access to integrate it into their own products.