SWE-bench Bash Only
Evaluating language models in minimal bash environments (mini-SWE-agent) for fair LM comparison.
Overview
The original SWE-bench benchmark aims to evaluate arbitrary systems on their ability to resolve GitHub issues. Currently, top-performing systems span a wide variety of AI scaffolds, from simple LM agent loops to RAG systems to multi-rollout, review-style systems. Each of these systems is a perfectly valid solution to the problem of solving GitHub issues.
However, when we first created SWE-bench, we were primarily interested in evaluating LMs. To make apples-to-apples comparisons between LMs easier, we've introduced the SWE-bench Bash Only leaderboard. In this setting, we use our mini-SWE-agent package to evaluate LMs in a minimal bash environment: no tools, no special scaffold structure, just a simple ReAct agent loop. Results on SWE-bench Bash Only represent state-of-the-art LM performance when given just a bash shell and a problem.
Leaderboard
SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Verified is a human-filtered subset [Post].
SWE-bench Multimodal features issues with visual elements [Post].
Each entry reports the % Resolved metric, the percentage of instances solved (out of 2294 Full, 500 Verified & Bash Only, 300 Lite, 517 Multimodal).
Citation
If you use SWE-bench Bash Only in your research, please cite our paper:
@inproceedings{jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}