2025

August 19, 2025
in Benchmarks, mini-swe-agent
4 min read

mini-SWE-agent roulette mode: Randomly switching between models at every step can boost performance

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately.

August 8, 2025
in Benchmarks, mini-swe-agent
11 min read

GPT-5 on SWE-bench with `mini`: Cost & performance deep-dive

This blog post covers the results of running mini-SWE-agent with GPT-5, GPT-5-mini, and GPT-5-nano. Results will be added to the SWE-bench (bash-only) leaderboard shortly.

GPT-5 is as good as Sonnet 4, but quite a bit cheaper
GPT-5-mini gives up only a little performance (5 percentage points) and is incredibly cheap
GPT-5-nano is even cheaper: I would say you pay half for half the performance
You can reproduce our numbers for just $18 (with GPT-5-mini) using the command at the bottom!
