All official submissions to the SWE-bench leaderboard are maintained at SWE-bench/experiments.

Submit to SWE-bench Leaderboard

If you are interested in submitting your model to the SWE-bench Leaderboard, please do the following:

  1. Fork this repository.
  2. Under the split that you evaluate on (evaluation/lite/ or evaluation/test/), create a new folder named with the submission date and the model name (e.g. 20240415_sweagent_gpt4).
  3. Within the folder, please include the following files:
    • all_preds.jsonl: A JSONL file containing the predictions for the task instances in the split.
    • results.json: A JSON file containing the results of the evaluation, generated with get_model_report.
    • logs/: A folder containing the execution logs for the model run.
    • trajs/: (For agent-based approaches) A folder containing the trajectories produced by the model run (e.g., by SWE-agent).
    • (Recommended) Include anything you'd like to share about your model here!
  4. Create a pull request to this repository with the new folder.
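The steps above can be sketched on the command line. This is a minimal illustration, assuming a locally cloned fork and using the example submission name from step 2 (the `touch` lines stand in for copying your real artifacts):

```shell
# Create the submission folder under the split you evaluated on
# (here: the lite split, with the example date/model name from step 2).
SUBMISSION=evaluation/lite/20240415_sweagent_gpt4
mkdir -p "$SUBMISSION/logs"
mkdir -p "$SUBMISSION/trajs"   # only needed for agent-based approaches

# Placeholders for the required files; replace with your actual
# predictions and evaluation report.
touch "$SUBMISSION/all_preds.jsonl"
touch "$SUBMISSION/results.json"
```

After populating the folder with your real files, commit it and open the pull request from your fork.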

You can refer to this tutorial for a quick overview of how to evaluate your model on SWE-bench.

Submission Guidelines

Please note that a submission to the SWE-bench [Lite] leaderboard is eligible only if it satisfies the following criteria:

  1. The use of the hints_text field is not allowed. See our explanation here.
  2. The result should be pass@1: there should be exactly one execution log per task instance, for all 2294 task instances.
  3. The result should not use the "Oracle" retrieval setting: the agent must not be told the correct files to edit, where "correct" means the files modified by the reference solution patch.
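Criterion 2 can be checked mechanically before opening a pull request. Below is a minimal sketch that flags duplicate predictions (which would violate pass@1) and instance-count mismatches; it assumes each line of all_preds.jsonl is a JSON object with an "instance_id" field, as in SWE-bench prediction files, and the sample data here is purely illustrative:

```python
import json
from collections import Counter

def check_pass_at_1(jsonl_text: str, expected_total: int) -> list:
    """Return a list of problems found in a predictions JSONL string.

    Assumes each non-empty line is a JSON object carrying an
    "instance_id" field identifying the task instance it predicts for.
    """
    ids = [json.loads(line)["instance_id"]
           for line in jsonl_text.splitlines() if line.strip()]
    problems = []
    # pass@1 means at most one prediction per task instance.
    dupes = sorted(i for i, n in Counter(ids).items() if n > 1)
    if dupes:
        problems.append("duplicate predictions (not pass@1): %s" % dupes)
    # Every task instance in the split should be covered.
    if len(set(ids)) != expected_total:
        problems.append("expected %d instances, found %d"
                        % (expected_total, len(set(ids))))
    return problems

# Toy example: one duplicated instance, and 2 unique ids where 3 are expected.
sample = "\n".join(json.dumps({"instance_id": i, "model_patch": "..."})
                   for i in ["django__django-1", "django__django-1",
                             "astropy__astropy-2"])
print(check_pass_at_1(sample, expected_total=3))
```

For a real submission to the full test split, `expected_total` would be 2294 (per criterion 2), and `jsonl_text` would be read from your all_preds.jsonl.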

Verify Your Results

The Verified check ✓ indicates that we (the SWE-bench team) received access to the model and were able to reproduce the patch generations.

If you are interested in receiving the "Verified" checkmark ✓ on your submission, please do the following:

  1. Create an issue.
  2. In the issue, provide instructions for running your model on SWE-bench.
  3. We will run your model on a random subset of SWE-bench and verify the results.