SWE-bench Bash Only
Evaluating language models in minimal bash environments (mini-SWE-agent) for fair LM comparison.
Overview
The original SWE-bench benchmark aims to evaluate arbitrary systems on their ability to resolve GitHub issues. The top-performing systems currently span a wide variety of AI scaffolds: simple LM agent loops, RAG systems, and multi-rollout and review-style systems. Each of these is a perfectly valid solution to the problem of resolving GitHub issues.
However, when we first created SWE-bench, we were primarily interested in evaluating LMs themselves. To make apples-to-apples comparisons of LMs easier, we've introduced the SWE-bench Bash Only leaderboard. In this setting, we use our mini-SWE-agent package to evaluate LMs in a minimal bash environment: no tools, no special scaffold structure, just a simple ReAct agent loop. Results on SWE-bench Bash Only represent state-of-the-art LM performance when given nothing more than a bash shell and a problem statement.
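To make the setting concrete, the sketch below shows roughly what such a minimal bash agent loop looks like. It is illustrative only and not the actual mini-SWE-agent implementation: the `query_lm` callable, the one-fenced-command-per-turn convention, and the termination check are assumptions made for this example.

```python
import re
import subprocess

def run_episode(query_lm, task_prompt, max_steps=50):
    """Illustrative ReAct-style bash loop (not the real mini-SWE-agent code).

    The LM proposes one shell command per turn inside a fenced code block,
    the command is executed, and its output becomes the next observation.
    `query_lm` is a hypothetical callable mapping messages -> assistant text.
    """
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_steps):
        reply = query_lm(messages)
        messages.append({"role": "assistant", "content": reply})
        # Termination convention is an assumption for this sketch.
        if "DONE" in reply:
            break
        # Take the first fenced block in the reply as the bash command to run.
        match = re.search(r"```(?:bash)?\n(.*?)```", reply, re.DOTALL)
        if match is None:
            messages.append({"role": "user",
                             "content": "Please reply with exactly one bash code block."})
            continue
        try:
            result = subprocess.run(match.group(1), shell=True,
                                    capture_output=True, text=True, timeout=60)
            observation = f"returncode: {result.returncode}\n{result.stdout}{result.stderr}"
        except subprocess.TimeoutExpired:
            observation = "command timed out after 60 seconds"
        messages.append({"role": "user", "content": observation})
    return messages
```

The key point is that the environment exposes nothing beyond a shell: every action the LM takes is a single bash command, and the only feedback it receives is that command's output.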
Details
- We use this configuration for all models.
- The LM temperature is set to 0.0 whenever the temperature parameter is supported (see the sketch after this list).
- This guide shows how to run the evaluation yourself.
- Small changes to the setup and configuration are captured by the version number in the leaderboard. Version numbers correspond to tags in the mini-SWE-agent repository; because that repository contains other components as well, a new version number does not necessarily mean that anything relevant to the bash-only leaderboard setting has changed. We do not tune the configuration or setup to chase higher scores; we only make general fixes to the framework and clarifications to the prompt, to keep the evaluation setup maximally fair to the LMs.
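As a concrete illustration of the temperature rule above, the helper below sketches how sampling settings might be pinned per model. The function name, the `supports_temperature` flag, and the kwargs layout are assumptions for this example, not the actual mini-SWE-agent configuration schema.

```python
def build_sampling_kwargs(model_name: str, supports_temperature: bool) -> dict:
    """Return LM sampling kwargs: temperature 0.0 whenever the provider
    exposes a temperature parameter, otherwise provider defaults.
    Field names here are illustrative, not a real configuration schema."""
    kwargs = {"model": model_name}
    if supports_temperature:
        kwargs["temperature"] = 0.0  # as-deterministic-as-possible decoding
    return kwargs

print(build_sampling_kwargs("some-model", supports_temperature=True))
# {'model': 'some-model', 'temperature': 0.0}
```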
Leaderboard
SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Verified is a human-filtered subset [Post].
SWE-bench Multimodal features issues with visual elements [Post].
Each entry reports the % Resolved metric: the percentage of instances solved (out of 2294 for Full, 500 for Verified and Bash Only, 300 for Lite, and 517 for Multimodal).
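For clarity, % Resolved is simply the number of solved instances divided by the dataset size. The solved count in the snippet below is made up for illustration and is not a leaderboard result.

```python
# Dataset sizes from the leaderboard description above.
DATASET_SIZES = {"Full": 2294, "Verified": 500, "Bash Only": 500,
                 "Lite": 300, "Multimodal": 517}

def pct_resolved(num_solved: int, split: str) -> float:
    """Compute % Resolved for a given split."""
    return 100.0 * num_solved / DATASET_SIZES[split]

print(f"{pct_resolved(350, 'Bash Only'):.1f}% Resolved")  # 70.0% Resolved
```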
Citation
If you use SWE-bench Bash Only in your research, please cite our paper:
@inproceedings{jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}