[SWE-bench Verified] Detecting cheating in submissions

2025-11-19 • by John Yang

How similar are agent solutions to the ground truth?

[mini-SWE-agent] Roulette mode!

2025-08-19 • by Kilian Lieret

Randomly switching between models at every step can boost performance

[mini-SWE-agent] GPT-5

2025-08-08 • by Kilian Lieret

Cost & performance deep-dive