
SWE-bench Leaderboards
Aug 3, 2025 · SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models [Post]. SWE-bench Lite is a subset curated for less costly evaluation …
Overview - SWE-bench
You can find the full leaderboard at swebench.com! 📋 Overview SWE-bench provides:
- Real-world GitHub issues - Evaluate LLMs on actual software engineering tasks
- Reproducible evaluation - Docker …
SWE-bench
SWE-bench tests AI systems' ability to solve GitHub issues. We collect 2,294 task instances by crawling Pull Requests and Issues from 12 popular Python repositories. Each instance is based on a pull …
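Each crawled task instance described above is a record pairing an issue with the pull request that resolved it. The sketch below shows one plausible shape; the field names follow the public dataset schema, but they are not stated in the snippet above and the values are placeholders:

```python
# Sketch of a SWE-bench task instance record. Field names mirror the
# public dataset schema; the concrete values are hypothetical placeholders.
instance = {
    "instance_id": "astropy__astropy-12907",  # "<owner>__<repo>-<PR number>"
    "repo": "astropy/astropy",                # one of the 12 crawled Python repos
    "base_commit": "<commit the candidate patch is applied to>",
    "problem_statement": "<text of the linked GitHub issue>",
    "patch": "<gold diff from the pull request>",
    "test_patch": "<tests added by the pull request>",
}

def is_resolved(fail_to_pass_results):
    """An instance counts as resolved when every previously failing test passes."""
    return all(fail_to_pass_results.values())

print(is_resolved({"test_a": True, "test_b": True}))  # → True
```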
FAQ - SWE-bench
You can also set --cache_level=env and --clean=True when running swebench.harness.run_evaluation so that instance images are removed dynamically after they are used.
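Combined with the harness's other options, a minimal invocation might look like the sketch below. Only --cache_level and --clean come from the FAQ entry above; the remaining flags and file names are illustrative assumptions:

```shell
# Sketch of running the evaluation harness with aggressive image cleanup.
# --cache_level / --clean are from the FAQ above; --predictions_path,
# --run_id, and the file name are hypothetical placeholders.
python -m swebench.harness.run_evaluation \
    --predictions_path preds.jsonl \
    --run_id cleanup-demo \
    --cache_level=env \
    --clean True
```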
SWE-bench Multilingual
Originally posted as a blog post on Kabir's website. Summary This post introduces SWE-bench Multilingual, a new benchmark in the SWE-bench family designed to evaluate the software …
SWE-bench Results Viewer
Select the split & model below to get automated analyses of the model's performance on the SWE-bench split.
SWE-bench Multimodal
Citation If you use SWE-bench Multimodal in your research, please cite our paper:
Installation - SWE-bench
This will install the package in development mode, allowing you to make changes to the code if needed.
Install dependencies for dataset generation or RAG inference
To install the dependencies for dataset …
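The development-mode install mentioned above is typically a clone followed by an editable pip install; a minimal sketch, assuming the repository URL (which the snippet does not state):

```shell
# Sketch: editable ("development mode") install of SWE-bench.
# The clone URL is an assumption; -e makes the install editable,
# so local code changes take effect without reinstalling.
git clone https://github.com/SWE-bench/SWE-bench.git
cd SWE-bench
pip install -e .
```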
SWE-bench Lite
- Removed instances that create or remove files
- Removed instances that contain tests with error message checks
- Finally, sampled 300 test instances and 23 development instances from the …