All official submissions to the SWE-bench leaderboard are maintained at SWE-bench/experiments.
Submit to SWE-bench Leaderboard
If you are interested in submitting your model to the SWE-bench Leaderboard, please do the following:
- Fork this repository.
- Under the split that you evaluate on (evaluation/lite/ or evaluation/test), create a new folder with the submission date and the model name (e.g. 20240415_sweagent_gpt4).
- Within the folder, please include the following files:
  - all_preds.jsonl: A JSONL file containing the predictions for the task instances in the split.
  - results.json: A JSON file containing the results of the evaluation, generated with get_model_report.
  - logs/: A folder containing the execution logs for the model run.
  - trajs/: (For Agent-Based Approaches) A folder containing the trajectories for the model run, such as for SWE-agent.
  - README.md: (Recommended) Include anything you'd like to share about your model here!
- Create a pull request to this repository with the new folder.
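Before opening the pull request, it may help to check the folder layout locally. The following is a minimal sketch, not part of the official tooling: the function name check_submission is hypothetical, and the date-prefix pattern is an assumption inferred from the example folder name above.

```python
import re
from pathlib import Path

# Required entries, taken from the file list above.
REQUIRED_FILES = ["all_preds.jsonl", "results.json"]
REQUIRED_DIRS = ["logs"]

# Assumption: folder names look like <YYYYMMDD>_<model name>,
# inferred from the example 20240415_sweagent_gpt4.
NAME_PATTERN = re.compile(r"^\d{8}_.+$")


def check_submission(folder):
    """Return a list of human-readable problems found in a submission folder."""
    folder = Path(folder)
    problems = []
    if not NAME_PATTERN.match(folder.name):
        problems.append(
            f"folder name {folder.name!r} does not look like <date>_<model>"
        )
    for name in REQUIRED_FILES:
        if not (folder / name).is_file():
            problems.append(f"missing file: {name}")
    for name in REQUIRED_DIRS:
        if not (folder / name).is_dir():
            problems.append(f"missing folder: {name}/")
    return problems
```

An empty list means the required entries are present. The sketch does not inspect trajs/ or README.md, since those are only needed for agent-based approaches or recommended, and it does not validate the contents of any file.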
You can refer to this tutorial for a quick overview of how to evaluate your model on SWE-bench.
Submission Guidelines
Please note that we consider an eligible submission to the SWE-bench [Lite] leaderboard to satisfy these criteria:
- The use of the hints_text field is not allowed. See our explanation here.
- The result should be pass@1. There should be one execution log per task instance for all 2294 task instances.
- The result should not be in the "Oracle" retrieval setting. The agent cannot be told the correct files to edit, where "correct" refers to the files modified by the reference solution patch.
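One way to sanity-check the pass@1 criterion locally is to confirm that the predictions file holds at most one entry per task instance. The sketch below is illustrative only: it assumes each JSONL record carries an instance_id key, and the helper names and the list of expected instance IDs are hypothetical inputs you would supply yourself.

```python
import json
from collections import Counter


def prediction_counts(jsonl_lines):
    """Count predictions per instance.

    Assumes each non-empty line is a JSON object with an 'instance_id' key.
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        counts[json.loads(line)["instance_id"]] += 1
    return counts


def pass_at_1_problems(jsonl_lines, expected_ids):
    """Report instances predicted more than once and expected instances absent."""
    counts = prediction_counts(jsonl_lines)
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    missing = sorted(set(expected_ids) - set(counts))
    return {"duplicated": duplicated, "missing": missing}
```

For example, calling pass_at_1_problems(open("all_preds.jsonl"), expected_ids) with the instance IDs of the split you evaluated on lists any duplicated or absent instances. The same comparison could be applied to the names of the files under logs/, though how log files map to instance IDs is not specified here.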
Verify Your Results
The Verified check ✓ indicates that we (the SWE-bench team) received access to the model and were able to reproduce the patch generations.
If you are interested in receiving the "verified" checkmark ✓ on your submission, please do the following:
- Create an issue
- In the issue, provide us instructions on how to run your model on SWE-bench.
- We will run your model on a random subset of SWE-bench and verify the results.