Leaderboard
| Model | % Resolved | Date | Logs | Trajs | Site | Verified? | Open? |
|---|---|---|---|---|---|---|---|
| 🥇 Factory Code Droid | 19.27 | 2024-06-17 | - | | | ✘ | ✘ |
| 🥈 AutoCodeRover (v20240620) + GPT 4o (2024-05-13) | 18.83 | 2024-06-28 | - | | | ✘ | ✘ |
| 🥉 AppMap Navie + GPT 4o (2024-05-13) | 14.60 | 2024-06-15 | - | | | ✓ | ✓ |
| Amazon Q Developer Agent (v20240430-dev) | 13.82 | 2024-05-09 | - | | | ✘ | ✘ |
| SWE-agent + GPT 4 (1106) | 12.47 | 2024-04-02 | | | | ✓ | ✓ |
| SWE-agent + Claude 3 Opus | 10.51 | 2024-04-02 | - | | | ✓ | ✓ |
| RAG + Claude 3 Opus | 3.79 | 2024-04-02 | - | | | ✓ | ✓ |
| RAG + Claude 2 | 1.96 | 2023-10-10 | - | - | | ✓ | ✓ |
| RAG + GPT 4 (1106) | 1.31 | 2024-04-02 | - | - | | ✓ | ✓ |
| RAG + SWE-Llama 13B | 0.70 | 2023-10-10 | - | - | | ✓ | ✓ |
| RAG + SWE-Llama 7B | 0.70 | 2023-10-10 | - | - | | ✓ | ✓ |
| RAG + ChatGPT 3.5 | 0.17 | 2023-10-10 | - | - | | ✓ | ✓ |
- The % Resolved metric refers to the percentage of SWE-bench instances (2,294 total) that were resolved by the model.
- "Verified" indicates that we, the SWE-bench team, received access to the system and were able to reproduce the patch generations.
- "Open" refers to submissions that have open-source code. This does not necessarily mean the underlying model is open-source.
- The leaderboard is updated once a week, on Mondays.
- If you would like to submit your model to the leaderboard, please check the submission page.
- All submissions are Pass@1, do not use `hints_text`, and are run in the unassisted setting.
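To make the metric concrete, here is a small illustrative sketch. The helper name is ours, not part of any SWE-bench tooling, and the count of 442 is simply back-derived from the top score (19.27% of 2,294 instances):

```python
# Illustrative sketch of the % Resolved metric: the share of benchmark
# instances whose generated patch passes the evaluation tests.
# `percent_resolved` is a hypothetical helper, not part of the SWE-bench API.

TOTAL_INSTANCES = 2294  # size of the full SWE-bench test set


def percent_resolved(num_resolved: int, total: int = TOTAL_INSTANCES) -> float:
    """Percentage of instances resolved, rounded to two decimals."""
    return round(100 * num_resolved / total, 2)


# e.g. resolving 442 of the 2,294 instances yields the top score above:
print(percent_resolved(442))  # 19.27
```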
Leaderboard (Lite)
SWE-bench Lite is a subset of SWE-bench that's been curated to make evaluation less costly and more accessible. If you'd like to learn more, please read our blog post.
| Model | % Resolved | Date | Logs | Trajs | Site | Verified? | Open? |
|---|---|---|---|---|---|---|---|
| 🥇 CodeStory Aide + Mixed Models | 43.00 | 2024-07-02 | - | | | ✘ | ✘ |
| 🥈 AbanteAI MentatBot + GPT 4o (2024-05-13) | 38.00 | 2024-06-27 | - | | | ✘ | ✘ |
| 🥉 Alibaba Lingma Agent | 33.00 | 2024-06-22 | - | | | ✘ | ✘ |
| Factory Code Droid | 31.33 | 2024-06-17 | - | | | ✘ | ✘ |
| AutoCodeRover (v20240620) + GPT 4o (2024-05-13) | 30.67 | 2024-06-21 | - | | | ✘ | ✘ |
| CodeR + GPT 4 (1106) | 28.33 | 2024-06-04 | - | | | ✘ | ✘ |
| MASAI + GPT 4o (2024-05-13) | 28.00 | 2024-06-12 | - | | | ✘ | ✘ |
| SIMA + GPT 4o (2024-05-13) | 27.67 | 2024-07-06 | - | | | ✘ | ✘ |
| Agentless + GPT 4o (2024-05-13) | 27.33 | 2024-06-30 | - | | | ✘ | ✓ |
| IBM Research Agent-101 | 26.67 | 2024-06-12 | - | | | ✘ | ✘ |
| Moatless Tools + Claude 3.5 Sonnet | 26.67 | 2024-06-23 | | | | ✓ | ✓ |
| Aider + GPT 4o & Claude 3 Opus | 26.33 | 2024-05-23 | - | | | ✘ | ✓ |
| Bytedance MarsCode Agent + GPT 4o (2024-05-13) | 25.33 | 2024-06-12 | - | | | ✘ | ✘ |
| Moatless Tools + GPT 4o (2024-05-13) | 24.67 | 2024-06-17 | | | | ✓ | ✓ |
| OpenCSG StarShip CodeGenAgent + GPT 4 (0613) | 23.67 | 2024-05-24 | - | | | ✘ | ✘ |
| AppMap Navie + GPT 4o (2024-05-13) | 21.67 | 2024-06-15 | - | | | ✓ | ✓ |
| Amazon Q Developer Agent (v20240430-dev) | 20.33 | 2024-05-09 | - | | | ✘ | ✘ |
| AutoCodeRover (v20240408) + GPT 4 (0125) | 19.00 | 2024-05-30 | - | | | ✘ | ✓ |
| SWE-agent + GPT 4 (1106) | 18.00 | 2024-04-02 | | | | ✓ | ✓ |
| SWE-agent + GPT 4o (2024-05-13) | 17.00 | 2024-06-03 | | | | ✓ | ✓ |
| SWE-agent + Claude 3 Opus | 11.67 | 2024-04-02 | - | | | ✓ | ✓ |
| RAG + Claude 3 Opus | 4.33 | 2024-04-02 | - | | | ✓ | ✓ |
| RAG + Claude 2 | 3.00 | 2023-10-10 | - | - | | ✓ | ✓ |
| RAG + GPT 4 (1106) | 2.67 | 2024-04-02 | - | - | | ✓ | ✓ |
| RAG + SWE-Llama 7B | 1.33 | 2023-10-10 | - | - | | ✓ | ✓ |
| RAG + SWE-Llama 13B | 1.00 | 2023-10-10 | - | - | | ✓ | ✓ |
| RAG + ChatGPT 3.5 | 0.33 | 2023-10-10 | - | - | | ✓ | ✓ |
The % Resolved metric is out of 300 instances for SWE-bench Lite.
Resources
You can download the SWE-bench task instances from HuggingFace or directly as a JSON file (development, test sets). For convenience, we also provide five pre-processed datasets at different retrieval settings ("Oracle", 13K, 27K, 40K, 50K "Llama") for fine-tuning your own model for evaluation on SWE-bench. We recommend using the 13K, 27K, or 40K datasets for evaluation. The 50K "Llama" dataset is provided for reproducing the results of the SWE-bench paper.
SWE-bench Lite is also available for download from HuggingFace.
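As a minimal sketch, both datasets can be loaded with the Hugging Face `datasets` library. The dataset identifiers below match the HuggingFace pages referenced above; the `swe_bench_id` helper is ours, not part of any official SWE-bench tooling:

```python
# Sketch (ours, not official tooling): resolve the HuggingFace dataset id
# for the full benchmark or the Lite subset.

def swe_bench_id(lite: bool = False) -> str:
    """HuggingFace dataset id for SWE-bench or SWE-bench Lite."""
    return "princeton-nlp/SWE-bench_Lite" if lite else "princeton-nlp/SWE-bench"


# With `pip install datasets` and network access, the test split can then
# be loaded as:
#   from datasets import load_dataset
#   test_set = load_dataset(swe_bench_id(lite=True), split="test")
print(swe_bench_id(lite=True))  # princeton-nlp/SWE-bench_Lite
```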
We also provide the full SWE-Llama model weights in 13B and 7B parameter sizes, along with their PEFT LoRA weights.
About
![](img/teaser.png)
SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. Read more about SWE-bench in our paper! Citation:
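The resolution criterion can be sketched as follows. Each task instance carries two test sets: FAIL_TO_PASS (tests that fail before the fix and must pass after the model's patch) and PASS_TO_PASS (tests that must keep passing). The function below is our own illustrative sketch, not the benchmark's actual evaluation harness:

```python
# Illustrative sketch of SWE-bench's resolution criterion (not the real
# harness): an instance is "resolved" when, after applying the model's patch,
# every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes.

def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """`results` maps test id -> passed? after applying the candidate patch."""
    return (all(results.get(t, False) for t in fail_to_pass)
            and all(results.get(t, False) for t in pass_to_pass))


# Example: the patch makes the issue's failing test pass without breaking
# the existing suite, so the instance counts as resolved.
after_patch = {"test_issue_regression": True, "test_existing_behavior": True}
print(is_resolved(after_patch,
                  ["test_issue_regression"],
                  ["test_existing_behavior"]))  # True
```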
@inproceedings{
jimenez2024swebench,
title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=VTF8yNQM66}
}
Disclaimer: SWE-bench is for research purposes only. Models trained and evaluated on SWE-bench can produce unexpected results. We are not responsible for any damages caused by the use of SWE-bench, including, but not limited to, any loss of profit, data, or use of data.
Correspondence to: carlosej@princeton.edu, johnby@stanford.edu