Overview

SWE-bench Multilingual extends the SWE-bench benchmark to evaluate language models across 9 programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust. The dataset consists of 300 curated tasks derived from real-world GitHub pull requests across 42 repositories.

This leaderboard uses a standardized evaluation environment to enable fair comparison of different language models. For detailed information about the benchmark construction and evaluation methodology, see the full documentation.

  • The benchmark includes tasks from popular repositories spanning web frameworks, data processing tools, core utilities, and common libraries.
  • Language distribution: C (30), C++ (12), Go (42), Java (43), JS/TS (43), PHP (43), Ruby (44), Rust (43).
  • Tasks are validated to ensure they have well-defined problems and unambiguous test criteria.
  • See the detailed documentation page for full methodology and statistics.
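
As a quick sanity check on the language distribution listed above, the per-language counts can be tallied and converted to shares of the dataset. This is only an illustrative sketch: the dictionary name `TASKS_PER_LANGUAGE` is made up here, and the counts are copied from the list above, with JavaScript and TypeScript kept as the combined JS/TS group.

```python
# Task counts per language group, copied from the distribution listed above.
# JavaScript and TypeScript are reported together as "JS/TS".
TASKS_PER_LANGUAGE = {
    "C": 30,
    "C++": 12,
    "Go": 42,
    "Java": 43,
    "JS/TS": 43,
    "PHP": 43,
    "Ruby": 44,
    "Rust": 43,
}

# The counts should add up to the 300 tasks in the dataset.
total = sum(TASKS_PER_LANGUAGE.values())

# Print each group's task count and its share of the dataset.
for language, count in TASKS_PER_LANGUAGE.items():
    share = 100.0 * count / total
    print(f"{language:<6}{count:>4}{share:>7.1f}%")

print(f"Total: {total}")
```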

Leaderboard

SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models [Post].
SWE-bench Multilingual features 300 tasks across 9 programming languages [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Verified is a human-filtered subset [Post].
SWE-bench Multimodal features issues with visual elements [Post].

Each entry reports the % Resolved metric: the percentage of instances solved out of the total for that dataset (2294 for Full, 500 for Verified and Bash Only, 300 for Lite and Multilingual, 517 for Multimodal).
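
The arithmetic behind % Resolved can be sketched as follows. This is a minimal illustration, not the official evaluation harness: the helper `percent_resolved` and the `SPLIT_SIZES` table are hypothetical names, and the example count of resolved instances is made up. Only the dataset sizes come from the text above.

```python
# Dataset sizes as quoted in the text above.
SPLIT_SIZES = {
    "Full": 2294,
    "Verified": 500,
    "Bash Only": 500,
    "Lite": 300,
    "Multilingual": 300,
    "Multimodal": 517,
}


def percent_resolved(num_resolved: int, split: str) -> float:
    """Return the percentage of instances resolved in the given split."""
    total = SPLIT_SIZES[split]
    if not 0 <= num_resolved <= total:
        raise ValueError(f"num_resolved must be between 0 and {total}")
    return 100.0 * num_resolved / total


# Illustrative numbers only: 150 of the 300 Multilingual tasks resolved.
print(f"{percent_resolved(150, 'Multilingual'):.2f}%")  # prints "50.00%"
```

Because the denominator differs per dataset, the same number of resolved instances yields different scores on different leaderboards, so scores are only comparable within one dataset.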


Citation

If you use SWE-bench Multilingual in your research, please cite our paper:

@inproceedings{jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}