Overview
SWE-bench Multilingual extends the SWE-bench benchmark to evaluate language models across nine programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust. The dataset comprises 300 curated tasks derived from real-world GitHub pull requests across 42 repositories.
This leaderboard uses a standardized evaluation environment to enable fair comparison of different language models. For detailed information about the benchmark construction and evaluation methodology, see the full documentation.
- The benchmark includes tasks from popular repositories spanning web frameworks, data processing tools, core utilities, and common libraries.
- Language distribution: C (30), C++ (12), Go (42), Java (43), JS/TS (43), PHP (43), Ruby (44), Rust (43).
- Tasks are validated to ensure they have well-defined problems and unambiguous test criteria.
- See the detailed documentation page for full methodology and statistics.
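As a quick sanity check on the distribution above, the per-language counts sum to the 300 total tasks. A minimal sketch (the dictionary below just restates the numbers from the list):

```python
# Per-language task counts from the distribution listed above.
distribution = {
    "C": 30, "C++": 12, "Go": 42, "Java": 43,
    "JS/TS": 43, "PHP": 43, "Ruby": 44, "Rust": 43,
}

total = sum(distribution.values())
print(total)  # 300, matching the stated dataset size
```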
Leaderboard
SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models [Post].
SWE-bench Multilingual features 300 tasks across 9 programming languages [Post].
SWE-bench Lite is a subset curated for less costly evaluation [Post].
SWE-bench Verified is a human-filtered subset [Post].
SWE-bench Multimodal features issues with visual elements [Post].
Each entry reports the % Resolved metric: the percentage of instances solved out of the dataset total (2,294 for Full; 500 for Verified and Bash Only; 300 for Lite and Multilingual; 517 for Multimodal).
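The metric itself is a simple ratio of resolved instances to dataset size. A minimal sketch (the function name is hypothetical, not part of the SWE-bench harness):

```python
def percent_resolved(num_resolved: int, dataset_size: int) -> float:
    """Hypothetical helper: percentage of benchmark instances whose
    gold tests pass after applying the model's patch."""
    return 100.0 * num_resolved / dataset_size

# Example: resolving 150 of the 300 Multilingual tasks scores 50.0.
print(percent_resolved(150, 300))  # 50.0
```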
Citation
If you use SWE-bench Multilingual in your research, please cite our paper:
@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}