SWE-bench

Can Language Models Resolve Real-World GitHub Issues?

ICLR 2024

Carlos E. Jimenez*, John Yang*,
Alexander Wettig, Shunyu Yao, Kexin Pei,
Ofir Press, Karthik Narasimhan

News

📣 [10/2024] Introducing SWE-bench Multimodal! Can AI systems "see" bugs and fix them? 👀 💻 [Link]

📣 [08/2024] SWE-bench x OpenAI = SWE-bench Verified, a human-validated subset of 500 problems reviewed by software engineers! [Report]

📣 [06/2024] We've Docker-ized SWE-bench for easier, containerized, reproducible evaluation. [Report]

📣 [03/2024] Check out our latest work, SWE-agent, which achieves a 12.47% resolve rate on SWE-bench! [Link]

📣 [03/2024] We've released SWE-bench Lite! Running all of SWE-bench can take time. This subset makes it easier! [Report]

Leaderboard

Full test set (2294 instances)

| Model | % Resolved | Date | Logs | Trajs | Site |
|---|---|---|---|---|---|
| 🥇 Honeycomb | 22.06 | 2024-08-20 | 🔗 | 🔗 | 🔗 |
| 🥈 Amazon Q Developer Agent (v20240719-dev) | 19.75 | 2024-07-21 | 🔗 | 🔗 | 🔗 |
| 🥉 Factory Code Droid | 19.27 | 2024-06-17 | 🔗 | - | 🔗 |
| AutoCodeRover (v20240620) + GPT 4o (2024-05-13) | 18.83 | 2024-06-28 | 🔗 | - | 🔗 |
| 🤠 ✅ SWE-agent + Claude 3.5 Sonnet | 18.13 | 2024-06-20 | 🔗 | 🔗 | - |
| 🤠 ✅ AppMap Navie + GPT 4o (2024-05-13) | 14.60 | 2024-06-15 | 🔗 | - | 🔗 |
| Amazon Q Developer Agent (v20240430-dev) | 13.82 | 2024-05-09 | 🔗 | - | 🔗 |
| 🤠 ✅ SWE-agent + GPT 4 (1106) | 12.47 | 2024-04-02 | 🔗 | 🔗 | 🔗 |
| 🤠 ✅ SWE-agent + GPT 4o (2024-05-13) | 11.99 | 2024-07-28 | 🔗 | 🔗 | 🔗 |
| 🤠 ✅ SWE-agent + Claude 3 Opus | 10.51 | 2024-04-02 | 🔗 | 🔗 | - |
| 🤠 ✅ RAG + Claude 3 Opus | 3.79 | 2024-04-02 | 🔗 | - | 🔗 |
| 🤠 ✅ RAG + Claude 2 | 1.96 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + GPT 4 (1106) | 1.31 | 2024-04-02 | 🔗 | - | - |
| 🤠 ✅ RAG + SWE-Llama 13B | 0.70 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + SWE-Llama 7B | 0.70 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + ChatGPT 3.5 | 0.17 | 2023-10-10 | 🔗 | - | - |

Verified (500 instances)

| Model | % Resolved | Date | Logs | Trajs | Site |
|---|---|---|---|---|---|
| 🥇 Gru(2024-08-24) | 45.20 | 2024-08-24 | 🔗 | 🔗 | 🔗 |
| 🥈 Honeycomb | 40.60 | 2024-08-20 | 🔗 | 🔗 | 🔗 |
| 🥉 Amazon Q Developer Agent (v20240719-dev) | 38.80 | 2024-07-21 | 🔗 | 🔗 | 🔗 |
| AutoCodeRover (v20240620) + GPT 4o (2024-05-13) | 38.40 | 2024-06-28 | 🔗 | - | 🔗 |
| Factory Code Droid | 37.00 | 2024-06-17 | 🔗 | - | 🔗 |
| 🤠 ✅ SWE-agent + Claude 3.5 Sonnet | 33.60 | 2024-06-20 | 🔗 | 🔗 | - |
| 🤠 ✅ AppMap Navie + GPT 4o (2024-05-13) | 26.20 | 2024-06-15 | 🔗 | - | 🔗 |
| Amazon Q Developer Agent (v20240430-dev) | 25.60 | 2024-05-09 | 🔗 | - | 🔗 |
| EPAM AI/Run Developer Agent + GPT4o | 24.00 | 2024-08-20 | 🔗 | 🔗 | 🔗 |
| 🤠 ✅ SWE-agent + GPT 4o (2024-05-13) | 23.20 | 2024-07-28 | 🔗 | 🔗 | 🔗 |
| 🤠 ✅ SWE-agent + GPT 4 (1106) | 22.40 | 2024-04-02 | 🔗 | 🔗 | 🔗 |
| 🤠 ✅ SWE-agent + Claude 3 Opus | 18.20 | 2024-04-02 | 🔗 | 🔗 | - |
| 🤠 ✅ RAG + Claude 3 Opus | 7.00 | 2024-04-02 | 🔗 | - | 🔗 |
| 🤠 ✅ RAG + Claude 2 | 4.40 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + GPT 4 (1106) | 2.80 | 2024-04-02 | 🔗 | - | - |
| 🤠 ✅ RAG + SWE-Llama 7B | 1.40 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + SWE-Llama 13B | 1.20 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + ChatGPT 3.5 | 0.40 | 2023-10-10 | 🔗 | - | - |

Lite (300 instances)

| Model | % Resolved | Date | Logs | Trajs | Site |
|---|---|---|---|---|---|
| 🥇 CodeStory Aide + Mixed Models | 43.00 | 2024-07-02 | 🔗 | - | 🔗 |
| 🥈 Honeycomb | 38.33 | 2024-08-20 | 🔗 | 🔗 | 🔗 |
| 🥉 AbanteAI MentatBot + GPT 4o (2024-05-13) | 38.00 | 2024-06-27 | 🔗 | - | 🔗 |
| Gru(2024-08-11) | 35.67 | 2024-08-11 | 🔗 | 🔗 | 🔗 |
| Isoform | 35.00 | 2024-08-29 | 🔗 | 🔗 | 🔗 |
| SuperCoder2.0 | 34.00 | 2024-08-06 | 🔗 | 🔗 | 🔗 |
| Bytedance MarsCode Agent + GPT 4o (2024-05-13) | 34.00 | 2024-07-23 | 🔗 | - | 🔗 |
| Alibaba Lingma Agent | 33.00 | 2024-06-22 | 🔗 | 🔗 | 🔗 |
| Factory Code Droid | 31.33 | 2024-06-17 | 🔗 | - | 🔗 |
| 🤠 AutoCodeRover (v20240620) + GPT 4o (2024-05-13) | 30.67 | 2024-06-21 | 🔗 | 🔗 | 🔗 |
| Amazon Q Developer Agent (v20240719-dev) | 29.67 | 2024-07-21 | 🔗 | 🔗 | 🔗 |
| 🤠 Agentless + RepoGraph + GPT-4o | 29.67 | 2024-08-08 | 🔗 | 🔗 | 🔗 |
| CodeR + GPT 4 (1106) | 28.33 | 2024-06-04 | 🔗 | - | 🔗 |
| MASAI + GPT 4o (2024-05-13) | 28.00 | 2024-06-12 | 🔗 | - | 🔗 |
| SIMA + GPT 4o (2024-05-13) | 27.67 | 2024-07-06 | 🔗 | 🔗 | 🔗 |
| 🤠 Agentless + GPT 4o (2024-05-13) | 27.33 | 2024-06-30 | 🔗 | - | 🔗 |
| 🤠 ✅ Moatless Tools + Claude 3.5 Sonnet | 26.67 | 2024-06-23 | 🔗 | 🔗 | 🔗 |
| 🤠 ✅ OpenDevin + CodeAct v1.8 | 26.67 | 2024-07-25 | 🔗 | 🔗 | 🔗 |
| IBM Research Agent-101 | 26.67 | 2024-06-12 | 🔗 | - | 🔗 |
| 🤠 Aider + GPT 4o & Claude 3 Opus | 26.33 | 2024-05-23 | 🔗 | - | 🔗 |
| 🤠 ✅ Moatless Tools + GPT 4o (2024-05-13) | 24.67 | 2024-06-17 | 🔗 | 🔗 | 🔗 |
| OpenCSG StarShip CodeGenAgent + GPT 4 (0613) | 23.67 | 2024-05-24 | 🔗 | - | 🔗 |
| 🤠 ✅ SWE-agent + Claude 3.5 Sonnet | 23.00 | 2024-06-20 | 🔗 | 🔗 | - |
| 🤠 ✅ AppMap Navie + GPT 4o (2024-05-13) | 21.67 | 2024-06-15 | 🔗 | - | 🔗 |
| Bytedance AutoSE (based on SWE-Agent) + GPT4/GPT4o Mixed (20240828) | 21.67 | 2024-08-28 | 🔗 | 🔗 | - |
| Amazon Q Developer Agent (v20240430-dev) | 20.33 | 2024-05-09 | 🔗 | - | 🔗 |
| 🤠 AutoCodeRover (v20240408) + GPT 4 (0125) | 19.00 | 2024-05-30 | 🔗 | - | 🔗 |
| 🤠 ✅ SWE-agent + GPT 4o (2024-05-13) | 18.33 | 2024-07-28 | 🔗 | 🔗 | 🔗 |
| 🤠 ✅ SWE-agent + GPT 4 (1106) | 18.00 | 2024-04-02 | 🔗 | 🔗 | 🔗 |
| 🤠 ✅ SWE-agent + Claude 3 Opus | 11.67 | 2024-04-02 | 🔗 | 🔗 | - |
| 🤠 ✅ RAG + Claude 3 Opus | 4.33 | 2024-04-02 | 🔗 | - | 🔗 |
| 🤠 ✅ RAG + Claude 2 | 3.00 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + GPT 4 (1106) | 2.67 | 2024-04-02 | 🔗 | - | - |
| 🤠 ✅ RAG + SWE-Llama 7B | 1.33 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + SWE-Llama 13B | 1.00 | 2023-10-10 | 🔗 | - | - |
| 🤠 ✅ RAG + ChatGPT 3.5 | 0.33 | 2023-10-10 | 🔗 | - | - |

SWE-bench Lite is a subset of SWE-bench that's been curated to make evaluation less costly and more accessible [Post].
SWE-bench Verified is a subset filtered by human annotators and deemed to have a resolution-rate ceiling of 100% [Post].

- The % Resolved metric refers to the percentage of SWE-bench instances (2294 for test, 500 for verified, 300 for lite) that were resolved by the model.
- ✅ Checked indicates that we, the SWE-bench team, received access to the system and were able to reproduce the patch generations.
- 🤠 Open refers to submissions that have open-source code. This does not necessarily mean the underlying model is open-source.
- The leaderboard is updated once a week on Monday.
- If you would like to submit your model to the leaderboard, please check the submission page.
- All submissions are Pass@1, do not use hints_text, and are in the unassisted setting.
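
As a rough illustration of the % Resolved metric described above, the sketch below divides a resolved-instance count by the size of the chosen split, using the instance counts stated in the notes (2294, 500, 300). The function name `percent_resolved`, the `SPLIT_SIZES` dictionary, and the two-decimal rounding are assumptions made for this example, not part of the SWE-bench tooling.

```python
# Instance counts per split, as stated in the leaderboard notes.
SPLIT_SIZES = {"test": 2294, "verified": 500, "lite": 300}


def percent_resolved(num_resolved: int, split: str) -> float:
    """Return the share of instances resolved, as a percentage.

    Hypothetical helper: rounding to two decimals mirrors how the
    leaderboard displays values, but is an assumption of this sketch.
    """
    total = SPLIT_SIZES[split]
    if not 0 <= num_resolved <= total:
        raise ValueError(f"num_resolved must be between 0 and {total}")
    return round(100.0 * num_resolved / total, 2)


# Illustrative counts only, not figures reported by any submission.
print(percent_resolved(1, "lite"))        # 1 of 300 instances
print(percent_resolved(250, "verified"))  # 250 of 500 instances
```

Because the splits differ in size, a single resolved instance moves the score by about 0.33 points on Lite, 0.2 points on Verified, and about 0.04 points on the full test set.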

Resources

You can download the SWE-bench task instances from HuggingFace or directly as a JSON file (development, test sets). To help you fine-tune your own model for evaluation on SWE-bench, we provide five pre-processed datasets at different retrieval settings ("Oracle", 13K, 27K, 40K, 50K "Llama"). We recommend using the 13K, 27K, or 40K datasets for evaluation. The 50K "Llama" dataset is provided for reproducing the results of the SWE-bench paper.

SWE-bench Lite is also available for download from HuggingFace.

SWE-bench Verified can be downloaded from HuggingFace.

We also provide the full SWE-Llama model weights at 13B and 7B parameters, along with their PEFT LoRA weights.

About

SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. Read more about SWE-bench in our paper!
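
One way to picture unit test verification is sketched below. It assumes each task instance lists the tests that the reference PR turns from failing to passing (`fail_to_pass`) and the tests that must keep passing (`pass_to_pass`), and that a model's patch counts as resolved only if all of them pass after it is applied. The function name `is_resolved`, the `"PASSED"` status string, and the shape of the `results` mapping are hypothetical choices for this example, not the actual evaluation harness.

```python
from typing import Iterable, Mapping


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Mapping[str, str],
) -> bool:
    """Decide whether a patched repository resolves an instance.

    `results` maps test names to a status string recorded after the
    model's patch was applied (hypothetical format for this sketch).
    A test missing from `results` is treated as not passing.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name) == "PASSED" for name in required)


# Hypothetical test names and outcomes, for illustration only.
outcomes = {"test_fix": "PASSED", "test_existing": "PASSED"}
print(is_resolved(["test_fix"], ["test_existing"], outcomes))
```

Checking the previously passing tests as well as the newly fixed ones keeps a patch from being counted as a success when it addresses the issue but breaks existing behavior.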

Citation

@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}

Disclaimer: SWE-bench is for research purposes only. Models trained and evaluated on SWE-bench can produce unexpected results. We are not responsible for any damages caused by the use of SWE-bench, including but not limited to, any loss of profit, data, or use of data.

Usage: If you would like to use this website template for your own leaderboard, please send Carlos & John an email requesting permission. If granted, please make sure to acknowledge the SWE-bench team and link to this leaderboard on the home page of the website.

Correspondence to: carlosej@princeton.edu, johnby@stanford.edu