SWE-bench Verified
A human-validated subset of 500 SWE-bench instances for reliable evaluation of coding agents and language models.
Overview
SWE-bench Verified is a human-filtered subset of 500 instances from SWE-bench, created in collaboration with OpenAI. Human annotators reviewed each instance to ensure the problem descriptions are clear, the test patches are correct, and the tasks are solvable given the available information. Read more in the OpenAI blog post.
The Verified leaderboard features results from a wide variety of AI coding systems, ranging from simple LM agent loops to RAG systems to multi-rollout and review-style systems.
Bash Only: Comparing Language Models
While the full leaderboard compares arbitrary systems, we are also interested in evaluating language models directly. To make an apples-to-apples comparison of LMs easier, we evaluate all LMs using mini-SWE-agent in a minimal bash environment. No tools, no special scaffold structure; just a simple ReAct agent loop. These results represent state-of-the-art LM performance when given just a bash shell and a problem statement.
On the leaderboard, use the Agent dropdown to select between the mini-SWE-agent results and the full leaderboard with all agents.
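The ReAct loop described above can be sketched as follows. This is a minimal illustration, not the actual mini-SWE-agent code: `query_model` is a hypothetical stand-in for a language model API call, and the completion marker string is likewise an assumption for the sketch.

```python
import subprocess

def query_model(history):
    # Hypothetical LM call; stubbed here so the sketch is runnable.
    # A real agent would send `history` to a language model API and
    # receive the next bash command as a reply.
    return "echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"

def run_agent(task, max_steps=10):
    """Minimal bash-only ReAct loop: the model proposes one shell
    command per turn and observes its output on the next turn."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = query_model(history)  # model's next bash command
        history.append({"role": "assistant", "content": action})
        if "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT" in action:
            return history  # the agent signals it is done
        result = subprocess.run(action, shell=True,
                                capture_output=True, text=True)
        observation = result.stdout + result.stderr
        history.append({"role": "user", "content": observation})
    return history

history = run_agent("Fix the failing test in the repository.")
```

The whole scaffold is just this alternation of command and observation; everything else (repository checkout, test execution, patch extraction) happens inside the shell environment itself.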
More details on the bash-only setup:
- We use this configuration for all models.
- The release number in the leaderboard corresponds to the version of the mini-SWE-agent used to run the evaluation.
- Results of release 1.x and 2.x are not necessarily comparable to each other, as 2.x uses tool calling to invoke actions, whereas 1.x parses actions from the model's output string. Read more about the changes in the mini-SWE-agent v2 migration guide.
- For all results of release 1.x and earlier, the LM temperature is set to 0.0 if the temperature parameter is supported. For all results of release 2.x and later, the temperature parameter is not set.
- Beyond the notes above, small changes to the setup and configuration are captured by the version number in the leaderboard. Version numbers correspond to tags in the mini-SWE-agent repository; since that repository contains other components as well, a new version number does not necessarily mean that anything relevant to the bash-only setting has changed. We do not tune the configuration and setup to chase higher scores. Instead, we only make general fixes to the framework and clarifications to the prompt, to keep the evaluation maximally fair to the LMs. Generally, a minor or patch release should amount to only a minor change for the purposes of the bash-only leaderboard.
- This guide shows how to run the evaluation yourself.
Citation
If you use SWE-bench Verified in your research, please cite our paper:
@inproceedings{
jimenez2024swebench,
title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=VTF8yNQM66}
}