Claude Sonnet 4.5 shows major gains in long-horizon coding tasks

New benchmarks show Claude Sonnet 4.5 achieving a 50% success rate on multi-step software engineering tasks, with an average time horizon of ~1 hour 53 minutes (CI: 50–235 min).

That’s a 66% improvement in duration compared to Sonnet 4, a statistically significant leap in sustained reasoning and task persistence.
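To put the 66% figure in context, here is a rough back-of-the-envelope sketch in Python. It assumes the 66% is a simple ratio of point estimates, which the post does not confirm, and the helper name `implied_baseline` is made up for illustration. The Sonnet 4 value it prints is back-calculated from the numbers quoted here, not a reported result.

```python
# Figures quoted in the post.
horizon_min = 1 * 60 + 53          # Sonnet 4.5 time horizon: 1 h 53 min
ci_low_min, ci_high_min = 50, 235  # reported confidence interval, in minutes
relative_gain = 0.66               # "66% improvement" over Sonnet 4


def implied_baseline(new_value: float, gain: float) -> float:
    """Return the baseline b such that new_value == b * (1 + gain).

    Assumption: the quoted gain is a plain ratio of point estimates.
    """
    return new_value / (1.0 + gain)


baseline_min = implied_baseline(horizon_min, relative_gain)
ci_ratio = ci_high_min / ci_low_min  # how wide the interval is, as a ratio

print(f"Sonnet 4.5 horizon: {horizon_min} min")
print(f"Implied Sonnet 4 horizon (back-calculated): {baseline_min:.0f} min")
print(f"CI upper bound is {ci_ratio:.1f}x the lower bound")
```

Under that assumption the implied Sonnet 4 horizon comes out at a little over an hour. The interval is also wide, with the upper bound several times the lower, so the point estimate is best read as approximate.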

Here’s how it stacks up 👇

  • 🔹 Sonnet 4.5 > Sonnet 4 by a large margin
  • 🔹 Sonnet 4.5 ≈ Opus 4.1 (no significant difference)
  • 🔹 Still slightly behind the longest-lasting model overall

The takeaway: Claude Sonnet 4.5 marks a clear step forward in long-horizon reasoning, a key benchmark for real-world software engineering and autonomous agent workflows.
