Leaderboard
Unassisted

Model | % Resolved | Date
---|---|---
Claude 2 + BM25 Retrieval | 1.96 | 2023-10-10
SWE-Llama 13B + BM25 Retrieval | 0.70 | 2023-10-10
SWE-Llama 7B + BM25 Retrieval | 0.70 | 2023-10-10
ChatGPT-3.5 + BM25 Retrieval | 0.20 | 2023-10-10
GPT-4 + BM25 Retrieval* | 0.00 | 2023-10-10
Assisted

Model | % Resolved | Date
---|---|---
Claude 2 | 4.80 | 2023-10-10
SWE-Llama 13B | 3.97 | 2023-10-10
SWE-Llama 7B | 3.01 | 2023-10-10
GPT-4* | 1.74 | 2023-10-10
ChatGPT-3.5 | 0.52 | 2023-10-10
*GPT-4 is evaluated on a random 25% subset of the dataset.
The % Resolved metric refers to the percentage of the 2,294 SWE-bench instances
that the model resolved.
For the Unassisted leaderboard, we only consider systems that receive no assistance
in finding the relevant files in the repository.
For the Assisted leaderboard, we consider models that generate patches under the "oracle"
retrieval setting, in which systems are provided the list of files that were modified
in the reference pull request.
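As a concrete illustration (not part of the official SWE-bench tooling), the "oracle" file list can be recovered from the reference pull request's patch by reading its `diff --git` headers. The helper below is a minimal sketch assuming standard `git diff` output; the function name and toy patch are made up for the example.

```python
import re

def modified_files(patch: str) -> list[str]:
    """Return the paths of files modified in a unified diff.

    Illustrative sketch only: assumes standard `diff --git a/... b/...`
    headers as produced by `git diff`.
    """
    return re.findall(r"^diff --git a/(\S+) b/", patch, flags=re.MULTILINE)

# Toy example of building the "oracle" file list for one instance.
patch = (
    "diff --git a/astropy/io/fits/card.py b/astropy/io/fits/card.py\n"
    "--- a/astropy/io/fits/card.py\n"
    "+++ b/astropy/io/fits/card.py\n"
)
print(modified_files(patch))  # ['astropy/io/fits/card.py']
```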
Resources
You can download the SWE-bench task instances from HuggingFace or directly as JSON files (development and test sets). For convenience, to fine-tune your own model for evaluation on SWE-bench, we provide five pre-processed datasets with different retrieval settings ("Oracle", 13K, 27K, 40K, and 50K "Llama"). We recommend using the 13K, 27K, or 40K datasets for evaluation; the 50K "Llama" dataset is provided for reproducing the results of the SWE-bench paper.
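For instance, the task instances and a pre-processed retrieval split can be loaded with the `datasets` library. The hub identifiers below are assumptions based on the project's naming and should be checked against the HuggingFace listing.

```python
from datasets import load_dataset

# Hub identifiers are assumed from the project's naming; verify them
# against the HuggingFace listing before use.
swebench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swebench))               # 2294 task instances
print(swebench[0]["instance_id"])  # one issue/pull-request pair identifier

# One of the five pre-processed retrieval settings ("Oracle" here).
oracle = load_dataset("princeton-nlp/SWE-bench_oracle", split="test")
```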
We also provide the full SWE-Llama model weights at the 13B and 7B parameter scales, along with their PEFT LoRA adapter weights.
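A minimal sketch for loading the released weights with `transformers` is shown below; the repository name is an assumption following the project's naming, so confirm the exact identifiers on the hub.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed hub identifier for the 13B full weights; the 7B weights and the
# PEFT LoRA adapters are published under analogous names.
MODEL = "princeton-nlp/SWE-Llama-13b"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # requires `accelerate` for automatic placement
)
```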
About
Citation:
@misc{jimenez2023swebench,
title={SWE-bench: Can Language Models Resolve Real-World GitHub Issues?},
author={Carlos E. Jimenez and John Yang
and Alexander Wettig and Shunyu Yao
and Kexin Pei and Ofir Press and Karthik Narasimhan},
year={2023},
eprint={2310.06770},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
SWE-bench is a dataset that tests systems’ ability to solve GitHub
issues automatically. The dataset collects 2,294 Issue-Pull Request
pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. Read more about SWE-bench in our paper!
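The sketch below illustrates the evaluation idea only and is not the official harness: apply the model's patch to the repository at the issue's base commit, then run the tests that the reference pull request made pass. The `fail_to_pass` test list and directory layout are assumptions for the example.

```python
import subprocess

def resolves_instance(repo_dir: str, model_patch: str, fail_to_pass: list[str]) -> bool:
    """Conceptual sketch of SWE-bench-style evaluation (not the official harness)."""
    # Apply the model-generated patch to a checkout at the issue's base commit.
    subprocess.run(["git", "apply", "-"], input=model_patch, text=True,
                   cwd=repo_dir, check=True)
    # Run the tests the reference PR made pass; exit code 0 means resolved.
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return result.returncode == 0
```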
Disclaimer: SWE-bench is for research purposes only. Models
trained and evaluated on SWE-bench can produce unexpected results.
We are not responsible for any damages caused by the use of
SWE-bench, including, but not limited to, any loss of profit, data,
or use of data.