New benchmarks show Claude Sonnet 4.5 achieving a 50% success rate on multi-step software engineering tasks, with an average time horizon of ~1 hour 53 minutes (CI: 50–235 min).
That's a 66% longer time horizon than Sonnet 4, a statistically significant leap in sustained reasoning and task persistence.
Here's how it stacks up 👇
- 🔹 Sonnet 4.5 > Sonnet 4 by a large margin
- 🔹 Sonnet 4.5 ≈ Opus 4.1 (no significant difference)
- 🔹 Still slightly behind the model with the longest time horizon overall
The takeaway: Claude Sonnet 4.5 marks a clear step forward in long-horizon reasoning, a key capability for real-world software engineering and autonomous agent workflows.

