Overview
SWE-bench Multilingual extends the SWE-bench benchmark to evaluate language models across nine programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust. The dataset comprises 300 curated tasks derived from real-world GitHub pull requests across 42 repositories.
This leaderboard uses a standardized evaluation environment to enable fair comparison of different language models. For detailed information about the benchmark construction and evaluation methodology, see the full documentation.
- The benchmark includes tasks from popular repositories spanning web frameworks, data processing tools, core utilities, and common libraries.
- Language distribution: C (30), C++ (12), Go (42), Java (43), JS/TS (43), PHP (43), Ruby (44), Rust (43).
- Tasks are validated to ensure they have well-defined problems and unambiguous test criteria.
- See the detailed documentation page for full methodology and statistics.
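As a quick sanity check on the distribution above, the per-language counts sum to the 300 total tasks. A minimal sketch (the dictionary below just restates the numbers from the list):

```python
# Per-language task counts from the distribution listed above.
distribution = {
    "C": 30, "C++": 12, "Go": 42, "Java": 43,
    "JS/TS": 43, "PHP": 43, "Ruby": 44, "Rust": 43,
}

total = sum(distribution.values())
print(total)  # 300, matching the stated dataset size
```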
Leaderboard
SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models [Post].
SWE-bench Multilingual features 300 tasks across 9 programming languages [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Verified is a human-filtered subset [Post].
SWE-bench Multimodal features issues with visual elements [Post].
Each entry reports the % Resolved metric: the percentage of instances solved out of the dataset total (2,294 for Full; 500 for Verified and Bash Only; 300 for Lite and Multilingual; 517 for Multimodal).
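The metric itself is a simple ratio of resolved instances to dataset size. A minimal sketch (the function name is hypothetical, not part of the SWE-bench harness):

```python
def percent_resolved(num_resolved: int, dataset_size: int) -> float:
    """Hypothetical helper: percentage of benchmark instances whose
    gold tests pass after applying the model's patch."""
    return 100.0 * num_resolved / dataset_size

# Example: resolving 150 of the 300 Multilingual tasks scores 50.0.
print(percent_resolved(150, 300))  # 50.0
```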
Citation
If you use SWE-bench Multilingual in your research, please cite our paper:
@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}