SWE-bench is a widely used benchmark for measuring AI coding agents. This guide breaks down methodology, variants, scoring mechanics, and what leaderboard results mean for production engineering.

Imagine you're testing an AI coding agent on a real shipping-platform repository. You wouldn't only ask it to write a toy sorting function. You'd want to see how it works inside a legacy codebase, finds a billing-idempotency bug, and writes a fix without breaking auth or billing.
That's what SWE-bench (Software Engineering Benchmark) tries to measure for AI coding agents. Model releases often quote "X% on SWE-bench" as proof of coding progress. Under that headline sits a benchmark with variants, hidden tests, scaffolds, sampling policies, and contamination caveats. By May 2026, the most-quoted variant, SWE-bench Verified, had largely saturated through contamination, and frontier labs were already pointing readers toward a harder, contamination-resistant successor, SWE-bench Pro. Understanding both is now part of reading any coding score honestly.
This article walks through what SWE-bench tests, how it grades agents, and how to read the leaderboard without overreading it. We'll trace a representative repository task step by step so you can see what "repository-level engineering" means in practice.
Previous coding benchmarks like HumanEval [1] and Mostly Basic Programming Problems (MBPP) [2] test whether a model can write a single function in isolation. The prompt gives you the signature, the docstring, and a few test cases. The model has to fill in the body.
SWE-bench tests something fundamentally different. It hands the agent a messy GitHub issue description written by a real user and a checkout of an entire open-source repository. The agent has to figure out which files matter, understand the existing code, write a patch, and prove the fix works without breaking anything else.
Writing a sorting function is a different skill from exploring a 500-file Django codebase, identifying the three files that need changes, and producing a patch that resolves the issue without regressions.
That difference is the core reason SWE-bench became harder to interpret than earlier code benchmarks:
In traditional benchmarks, the context is fully self-contained. The AI receives the exact inputs, outputs, and edge cases. In contrast, SWE-bench requires the agent to build its own context by searching through thousands of lines of code to understand the system architecture before writing a single line of implementation.
SWE-bench doesn't only test "can the model write code?" It tests whether a system can work through, patch, and verify code inside a repository.
| Dimension | HumanEval / MBPP | SWE-bench |
|---|---|---|
| Scope | Single function | Full repository |
| Context | Self-contained spec | Issue description + codebase |
| Navigation | None | Must find relevant files |
| Testing | Simple I/O checks | Full test suite (existing + new) |
| Realism | Synthetic | Real-world pull requests |
Before we dive into the scoring mechanics, here's a mental model that will help you understand why this benchmark is so hard.
Imagine a warehouse agent (the LLM) inside a large fulfillment center (the repository). The agent only has a small scanner view (the context window). To find a specific mislabeled parcel (the bug), it can't see the whole warehouse at once, so it must navigate with tools (find, grep, search tools).
The agent can't memorize every bin. It needs tools. In SWE-bench, the Agent-Computer Interface (ACI) is the set of tools the agent uses to interact with the repository. Early agents used basic shell commands, but systems like SWE-agent [3] provide specialized commands for code traversal and editing.
The ACI is the keyboard and monitor for the LLM. It includes tools for file searching, file editing, and running environment-specific tests. The design of that interface can matter as much as the base model. A strong model in a weak interface can underperform a weaker model wrapped in a disciplined search, edit, and test loop.
When you evaluate an AI coding tool, look at its tool set, not just its model name. Does it give the model precise line-level editing, or does it force the model to rewrite entire files? That design choice changes the result you are measuring.
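To make the contrast between line-level editing and whole-file rewrites concrete, here is a minimal sketch of a line-range edit tool. The function name `replace_lines` and its 1-indexed inclusive range are hypothetical choices for illustration; they are not SWE-agent's actual command set.

```python
def replace_lines(source: str, start: int, end: int, replacement: str) -> str:
    """Replace lines start..end (1-indexed, inclusive) and leave the rest untouched."""
    lines = source.splitlines(keepends=True)
    if not (1 <= start <= end <= len(lines)):
        raise ValueError(f"line range {start}-{end} is outside 1-{len(lines)}")
    # Keep the file line-oriented even if the caller omits the trailing newline.
    if replacement and not replacement.endswith("\n"):
        replacement += "\n"
    return "".join(lines[: start - 1]) + replacement + "".join(lines[end:])


original = "a = 1\nb = 2\nc = 3\n"
edited = replace_lines(original, 2, 2, "b = 20")
print(edited)
```

A tool shaped like this forces the model to name the exact lines it wants to change, so an out-of-range or careless edit fails loudly instead of silently rewriting the file.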
Let's trace how an agent might approach a representative scikit-learn-style task. The steps below are schematic: they show the repository workflow that SWE-bench requires, not a literal reproduction of one benchmark instance or its exact upstream patch.
Representative issue: In LogisticRegression, the max_iter parameter is ignored when solver='liblinear'.
In a benchmark task, the agent receives an issue statement and a repository checkout at the commit just before the fix. Here's what a successful trace looks like.
The agent searches the repository for "LogisticRegression" and "liblinear" to find where the solver logic lives. It might run something like:

```python
import subprocess

# Agent runs a grep-like search inside the repo
result = subprocess.run(
    ["grep", "-r", "liblinear", "--include=*.py", "."],
    capture_output=True,
    text=True,
)
print(result.stdout[:500])
```

The output points toward sklearn/linear_model/_logistic.py. The agent also finds test files that mention LogisticRegression, which helps it understand how the class is expected to behave.
A careful agent doesn't try to fix the bug immediately. It writes a tiny reproduction script to confirm the bug exists. If you can't break it, you can't prove you fixed it.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=4, random_state=42)

# In the pinned benchmark checkout, this exposes the reported behavior.
model = LogisticRegression(solver='liblinear', max_iter=1)
model.fit(X, y)

# If max_iter is ignored, this won't raise the expected warning
print("Fit completed without respecting max_iter")
```

This script isn't meant to be copied into an arbitrary modern scikit-learn install and treated as proof of a live bug. In a real SWE-bench run, the agent executes its reproduction inside the pinned repository checkout and environment for that benchmark instance. The important habit is the same: reproduce the issue before patching it.
The agent reads _logistic.py and traces how the liblinear solver path handles parameters. In a task like this, the important localization step is to find the narrow branch where user-provided solver settings are passed into the lower-level training routine.
This step is where most agents win or lose. An agent that picks the right files on the first try can solve the issue in a few iterations. One that starts in the wrong part of the codebase burns through its token budget exploring dead ends.
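One simple way to picture localization is keyword-based ranking of candidate files. The sketch below is an illustrative assumption, not how any particular agent works: the function name, the scoring rule, and the abbreviated file contents are all hypothetical.

```python
import re
from collections import Counter


def rank_files_by_issue_terms(issue_text: str, files: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank files by how often identifier-like terms from the issue appear in them."""
    # Terms of four or more characters, to skip filler words such as "is".
    terms = {t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]{3,}", issue_text)}
    scores = Counter()
    for path, content in files.items():
        words = Counter(w.lower() for w in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", content))
        scores[path] = sum(words[t] for t in terms)
    ranked = sorted(files, key=lambda p: (-scores[p], p))
    return [p for p in ranked if scores[p] > 0][:top_k]


# Hypothetical, heavily abbreviated repository contents.
repo = {
    "sklearn/linear_model/_logistic.py": "def fit(solver, max_iter): liblinear max_iter solver",
    "sklearn/svm/_base.py": "def _fit_liblinear(max_iter): liblinear",
    "sklearn/cluster/_kmeans.py": "def fit(n_clusters): pass",
}
issue = "LogisticRegression ignores max_iter when solver is liblinear"
ranked_paths = rank_files_by_issue_terms(issue, repo)
print(ranked_paths)
```

Real systems usually combine signals like this with directory structure, embeddings, or model judgment, but even a crude ranking shows why the first file an agent opens matters so much for its token budget.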
The agent generates a small patch. Strong fixes change only the lines that matter. In this case, the agent passes the user-provided max_iter through the liblinear branch instead of a default value.

```diff
--- a/sklearn/linear_model/_logistic.py
+++ b/sklearn/linear_model/_logistic.py
@@ -...,... @@
 if solver == 'liblinear':
-    train_liblinear(..., max_iter=default_iter, ...)
+    train_liblinear(..., max_iter=max_iter, ...)
```

This diff is schematic, not the exact upstream patch. Key idea: the agent should pass the existing user-provided parameter through the solver path instead of inventing a new behavior.
Over-editing is a common failure mode. Some agents rewrite 200 lines to fix a 1-line bug, accidentally breaking unrelated features. The scoring harness catches those regressions through PASS_TO_PASS tests.
The agent reruns the reproduction script and confirms the expected behavior now appears. It also runs the existing scikit-learn tests for LogisticRegression to make sure nothing else broke.
Only after this local verification does the agent submit the patch to the benchmark harness. The harness then applies the hidden test patch and runs the official grading suite.
When an agent submits a patch, the evaluation harness applies that patch to the repository, applies the hidden test patch from the original pull request, and then runs the grading tests inside a Dockerized environment [4][5].
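The harness checks two test categories:
- FAIL_TO_PASS: tests that fail before the fix and must pass once the patch is applied.
- PASS_TO_PASS: tests that already pass before the fix and must keep passing afterward.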
Using our LogisticRegression example, the hidden FAIL_TO_PASS test might create a model with solver='liblinear' and max_iter=1, then assert that the solver respects the iteration limit. The PASS_TO_PASS tests run existing logistic-regression tests to make sure the patch didn't break other solver behavior.
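The relationship between the FAIL_TO_PASS and PASS_TO_PASS sets can be sketched from two test runs, one before and one after a reference fix. This is a simplified illustration under that assumption, not the official harness code, and the test names are hypothetical.

```python
def split_grading_tests(before: dict[str, bool], after: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Derive the two grading sets from results before and after the reference fix.

    Both arguments map a test name to True if that test passed.
    """
    fail_to_pass = sorted(t for t, ok in after.items() if ok and not before.get(t, False))
    pass_to_pass = sorted(t for t, ok in after.items() if ok and before.get(t, False))
    return fail_to_pass, pass_to_pass


# Hypothetical test names for the LogisticRegression example.
before_fix = {
    "test_liblinear_respects_max_iter": False,
    "test_lbfgs_converges": True,
    "test_predict_proba_shape": True,
}
after_fix = {name: True for name in before_fix}

fail_to_pass, pass_to_pass = split_grading_tests(before_fix, after_fix)
print("FAIL_TO_PASS:", fail_to_pass)
print("PASS_TO_PASS:", pass_to_pass)
```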
The grading path is strict because the benchmark cares about both fixing the target issue and preserving existing behavior.
Conceptually, you can bucket outcomes as FULL, PARTIAL, or NO. The public leaderboard reports % Resolved, which counts only cases where the issue is fixed without regressions [5].
If FAIL_TO_PASS isn't fully green, the fix is incomplete. If PASS_TO_PASS breaks, the patch regresses existing behavior. On the public leaderboard, partials count as zero toward the headline metric.
Here's a toy mirror of that scoring logic. It's intentionally tiny, but it captures the rule readers need: resolved means all fix tests and all regression tests pass.

```python
def classify_swebench_result(fail_to_pass, pass_to_pass):
    fixed_target_issue = all(fail_to_pass)
    preserved_existing_behavior = all(pass_to_pass)

    if fixed_target_issue and preserved_existing_behavior:
        return "FULL"
    if any(fail_to_pass) and preserved_existing_behavior:
        return "PARTIAL"
    return "NO"


for fail_to_pass, pass_to_pass in [
    ([True, True], [True, True]),
    ([True, False], [True, True]),
    ([True, True], [True, False]),
]:
    print(classify_swebench_result(fail_to_pass, pass_to_pass))
```

```text
FULL
PARTIAL
NO
```

On the public % Resolved leaderboard, an agent that almost works but misses the final grading criteria contributes the same headline score as one that produces nonsense: zero resolved instances [4][5].
The benchmark's native question is binary: did the system resolve the issue or not? If a paper submits one patch per task, % Resolved is effectively a single-shot score. If it samples several candidate patches and keeps the best one, authors may report Pass@k instead. Check which setup was used. A single-shot score and a best-of-k score are different capability claims, even if the headline percentages look similar.
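To see how far apart the two claims can be, here is a sketch of the unbiased pass@k estimator introduced with HumanEval [1]. Whether a given SWE-bench report uses this exact estimator is an assumption; some reports simply keep the best of k samples. The sample counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled patches, c of which resolved the task."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one resolving patch
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 sampled patches, 3 of which resolve the issue.
print(round(pass_at_k(10, 3, 1), 3))  # single-shot view: 0.3
print(round(pass_at_k(10, 3, 5), 3))  # best-of-5 view: 0.917
```

The same underlying system looks like a 30% solver or a 92% solver on this task depending on the sampling policy, which is why the setup has to be stated next to the number.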
The original SWE-bench dataset contains 2,294 task instances drawn from 12 popular Python repositories [4]. Not all tasks are equally useful for evaluation. Some require obscure setup steps. Others have ambiguous issue descriptions where even a skilled human developer couldn't determine the correct fix.
The main variants are best treated as related benchmarks rather than rows on one universal scoreboard. They differ in how aggressively they trade off realism, noise, language coverage, modality, and evaluation cost, so their scores are not interchangeable.
SWE-bench Lite is a 300-task test subset designed to be faster and cheaper to run [6]. The authors remove instances with images, external hyperlinks, references to specific commit SHAs, or references to other pull requests and issues. They also remove instances with fewer than 40 words in the problem statement, fixes that edit more than one file, gold patches with more than three edit hunks, cases that create or remove files, and tests that depend on error-message string checks.
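A few of these filters are mechanical enough to sketch. The function below covers only three of them (word count, single-file edits, at most three hunks) and assumes the gold patch is in git's unified diff format; the function name and the sample patch are hypothetical, and this is not the authors' filtering script.

```python
import re


def passes_lite_style_filters(problem_statement: str, gold_patch: str) -> bool:
    """Apply three Lite-style filters to one task instance."""
    enough_words = len(problem_statement.split()) >= 40
    files_edited = len(re.findall(r"^diff --git ", gold_patch, flags=re.MULTILINE))
    hunks = len(re.findall(r"^@@ ", gold_patch, flags=re.MULTILINE))
    return enough_words and files_edited == 1 and hunks <= 3


# Hypothetical single-file, single-hunk patch.
sample_patch = (
    "diff --git a/pkg/mod.py b/pkg/mod.py\n"
    "--- a/pkg/mod.py\n"
    "+++ b/pkg/mod.py\n"
    "@@ -1,2 +1,2 @@\n"
    "-x = 1\n"
    "+x = 2\n"
)
long_statement = " ".join(["word"] * 40)

print(passes_lite_style_filters(long_statement, sample_patch))   # True
print(passes_lite_style_filters("too short", sample_patch))      # False
```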
Lite is a common choice when teams are iterating on a new agent design. It's small enough to run faster than Full, yet hard enough to separate weak scaffolds from useful ones.
SWE-bench Verified is a 500-instance human-filtered subset created in collaboration with OpenAI [7]. Human annotators reviewed each instance to screen out underspecified issue descriptions, broken or overly narrow FAIL_TO_PASS tests, and other major issues. That made Verified much cleaner than the original public test set, which is why it quickly became a common public comparison benchmark.
That said, cleaner doesn't mean perfect. On February 23, 2026, OpenAI said SWE-bench Verified was increasingly contaminated and recommended SWE-bench Pro for frontier coding capability measurement [8]. Verified remains useful, but small leaderboard deltas need caution.
SWE-bench Multilingual adds a 300-task view across 9 programming languages on the official leaderboard [5]. It matters because classic Full, Lite, and Verified are Python-heavy. If your team works in Java, TypeScript, Go, Rust, or mixed stacks, a Python-only score is an incomplete signal.
The main interpretation rule stays the same: don't compare a Multilingual score directly against a Verified or Lite score. Different task pools measure different failure modes.
SWE-bench Multimodal (SWE-bench M) extends the family into visually grounded JavaScript tasks. The public leaderboard reports 517 multimodal issues, and the paper frames the benchmark as image-grounded JavaScript repair across 17 libraries [9][5].
Unlike standard SWE-bench where the bug is usually described in text, multimodal tasks may require the agent to look at a rendered screenshot and figure out what code produced the visible problem. For example, an issue might say "the tooltip is misaligned with the hovered element" and attach a screenshot of the bug. The agent must process the image, trace it back to the relevant JavaScript or CSS code, and write a fix.
Visual bugs often lack clear error logs or stack traces. In the paper, SWE-agent resolved 12% of the SWE-bench M task instances versus 6% for the next-best baseline, which shows how much harder visual grounding still is than text-only Python bug fixing [9].
Strong SWE-bench systems aren't just LLMs generating code. They are multi-step systems that combine exploration, reasoning, and iteration. This orchestration matters because large context windows don't guarantee reliable use of every relevant file or detail.
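The typical loop is closer to disciplined repository work than one-shot code completion: search the repository, reproduce the issue, localize the faulty code, write a minimal patch, verify it locally, and submit.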
Only after submission does the benchmark harness apply the hidden test patch and compute FAIL_TO_PASS and PASS_TO_PASS outcomes [5].
SWE-agent [3] was one of the first purpose-built agents for this benchmark. It introduced the concept of a specialized Agent-Computer Interface (ACI) with custom shell commands designed for code traversal and editing. Devin [10] later demonstrated that giving an agent a full development environment (terminal, browser, editor) could push performance even higher.
Not every approach needs a complex agent loop. The "Agentless" approach showed that a simple pipeline (localize files, generate a patch, then re-rank candidates) without iterative tool use can achieve competitive results at low cost.[11] Unlike agentic loops that rely on the LLM to autonomously decide the next action or execute shell commands, this method removes that overhead.
By hardcoding the workflow into discrete, deterministic phases, Agentless achieved a 32% solve rate on SWE-bench Lite at approximately $0.70 per task. This raises a benchmark question: does SWE-bench reward autonomous engineering skill, or does it primarily reward the ability to make a small, targeted fix once external scaffolding has narrowed the context?
The Agentless result suggests that for many SWE-bench tasks, the bottleneck is finding the right files, not code generation alone. Once localized, simpler pipelines can often write a viable fix.
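The shape of a fixed, non-agentic pipeline can be sketched in a few lines. This is an illustration of the localize, generate, re-rank idea only: the function names are hypothetical, the model calls are replaced by offline stand-ins, and the majority-vote re-ranking step is an assumption for this sketch rather than a reproduction of the Agentless implementation.

```python
from collections import Counter
from typing import Callable


def run_fixed_pipeline(
    issue: str,
    files: dict[str, str],
    localize: Callable[[str, dict[str, str]], list[str]],
    generate_patch: Callable[[str, dict[str, str], int], str],
    n_samples: int = 4,
) -> str:
    """Localize, sample candidate patches, then pick the most frequent candidate."""
    targets = localize(issue, files)
    context = {path: files[path] for path in targets}
    candidates = [generate_patch(issue, context, i).strip() for i in range(n_samples)]
    winner, _votes = Counter(candidates).most_common(1)[0]
    return winner


# Hypothetical stand-ins for model calls, so the sketch runs offline.
def fake_localize(issue, files):
    return [path for path in files if "logistic" in path]


def fake_generate_patch(issue, context, sample_index):
    return "patch-B" if sample_index == 2 else "patch-A"


repo = {"linear_model/_logistic.py": "...", "cluster/_kmeans.py": "..."}
chosen = run_fixed_pipeline("max_iter ignored for liblinear", repo, fake_localize, fake_generate_patch)
print(chosen)
```

Note that the model never decides what to do next here. The control flow is fixed, and only the content of each phase comes from the model.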
The official site now separates arbitrary agent systems from mini-SWE-agent views meant to compare models with a more standardized scaffold [5]. That split exists because a "SWE-bench score" is usually a joint score of the base model, the scaffold and tools around it, and the sampling policy.
In practice, the agent loop can matter as much as the model. A strong model in a weak scaffold can underperform a weaker model wrapped in a disciplined search, edit, and test loop.
When you see a headline like "Agent X achieves 50% on SWE-bench," here's what you should consider:
| Factor | What to Watch For |
|---|---|
| Which variant? | Full, Lite, Verified, Multimodal, and Multilingual are different benchmarks |
| Which leaderboard mode? | The full-agent leaderboard and bash-only mini-SWE-agent leaderboard answer different questions |
| Which release/config? | Even mini-SWE-agent release versions are not always directly comparable [5] |
| Sampling policy | Single-shot and best-of-k results are not the same claim |
| Cost per solved task | Accuracy without an economics story is hard to operationalize |
| Contamination risk | Public Verified issues now appear in training data; prefer Pro for frontier signal |
It's also useful to separate general-purpose models from purpose-built agentic systems. The official Verified page notes that different mini-SWE-agent release lines aren't directly comparable because the agent interaction changed over time [5]. Small deltas at the top of a leaderboard often say as much about scaffold details as they do about raw model ability.
Fair leaderboard comparisons should disclose whether the system attempted the whole split. If a report doesn't say how many tasks were attempted, treat the headline number skeptically.
Taking leaderboard numbers at face value is risky. Cost and latency can make some high-scoring approaches impractical for real-world use. A 40% score at $0.10 per task might be more useful than a 60% score at $5.00 per task.
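The arithmetic behind that comparison is worth making explicit. The sketch below assumes every task is attempted exactly once and that the per-task cost is the same for solved and unsolved tasks; the two agents are the hypothetical ones from the paragraph above.

```python
def cost_per_resolved_task(cost_per_attempt: float, resolve_rate: float) -> float:
    """Expected spend to obtain one resolved task, assuming one attempt per task."""
    if not 0 < resolve_rate <= 1:
        raise ValueError("resolve_rate must be in (0, 1]")
    return cost_per_attempt / resolve_rate


for label, cost, rate in [("cheap agent", 0.10, 0.40), ("expensive agent", 5.00, 0.60)]:
    print(f"{label}: ${cost_per_resolved_task(cost, rate):.2f} per resolved task")
```

Under these assumptions the cheaper agent costs about $0.25 per resolved task and the more accurate one about $8.33, so the 20-point accuracy gain comes at roughly 33 times the cost per fix.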
Here's a concrete exercise to test your understanding. Imagine an agent receives this issue:
Issue: mean() on an empty list should raise StatisticsError, but it currently returns 0.0.
The agent proposes this patch:

```diff
--- a/statistics.py
+++ b/statistics.py
@@ -...,... @@
 def mean(data):
+    if not data:
+        return 0.0
     n = len(data)
     return sum(data) / n
```

The patch does the exact opposite of what the issue asks. It explicitly returns 0.0 for empty data instead of raising the expected exception. This is a "plausible but incorrect" patch. It looks like it handles the edge case, but it violates the specification.
In real SWE-bench evaluations, agents sometimes pass visible tests but fail hidden tests because they implemented a behavior that looks reasonable but doesn't match the issue requirements. When you review agent output, compare the patch against the original issue description, not just whether the code "seems" correct.
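A small runnable version of the exercise shows how a hidden test separates the two behaviors. The function names and the hidden check are hypothetical stand-ins; only `StatisticsError` comes from Python's standard library.

```python
from statistics import StatisticsError


def mean_plausible_but_wrong(data):
    if not data:
        return 0.0  # contradicts the issue: empty input should raise
    return sum(data) / len(data)


def mean_matching_issue(data):
    if not data:
        raise StatisticsError("mean requires at least one data point")
    return sum(data) / len(data)


def hidden_test_passes(mean_fn) -> bool:
    """Hypothetical FAIL_TO_PASS check: empty input must raise StatisticsError."""
    try:
        mean_fn([])
    except StatisticsError:
        return True
    return False


print(hidden_test_passes(mean_plausible_but_wrong))  # False
print(hidden_test_passes(mean_matching_issue))       # True
```

Both functions return the same value for non-empty input, so a visible test that only checks `mean([1, 2, 3])` would not tell them apart.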
Symptom: The agent guesses file names like src/core.py instead of checking what actually exists.
Cause: The model is trained to be helpful and confident. Without explicit tool use, it will invent paths that sound plausible.
Fix: Force the agent to run ls or find before editing. The tool output grounds the model in reality.
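One way to enforce that grounding is a guard that rejects edits to files that don't exist. This is a minimal sketch; the function name is hypothetical and a real scaffold would likely fold the check into its edit tool.

```python
import tempfile
from pathlib import Path


def missing_edit_targets(repo_root: str, proposed_paths: list[str]) -> list[str]:
    """Return proposed paths that do not exist as files under repo_root."""
    root = Path(repo_root)
    return [p for p in proposed_paths if not (root / p).is_file()]


# Build a throwaway repository layout to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pkg").mkdir()
    (Path(tmp) / "pkg" / "models.py").write_text("x = 1\n")
    missing = missing_edit_targets(tmp, ["pkg/models.py", "src/core.py"])
    print(missing)  # ['src/core.py']
```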
Symptom: The agent writes a fix for a bug it never confirmed existed.
Cause: Skipping the reproduction step feels efficient, but it means the agent can't verify its own work.
Fix: Require a reproduction script before any patch generation. If you can't break it, you can't prove you fixed it.
Symptom: The agent rewrites 200 lines to fix a 1-line bug, accidentally breaking unrelated features.
Cause: Large file rewrites are easier for an LLM to generate than precise line-level edits.
Fix: Use a search-and-replace or unified-diff tool that constrains the edit to specific lines. The Agentless paper [11] shows that constrained localization and repair can be competitive with more complex agent loops.
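A complementary guard is an edit budget that flags oversized patches before they are submitted. The sketch below uses Python's `difflib.ndiff` to count added and removed lines; the function names and the budget value are hypothetical.

```python
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count added plus removed lines between two versions of a file."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def within_edit_budget(before: str, after: str, max_changed_lines: int = 10) -> bool:
    """Return True when the edit touches no more lines than the budget allows."""
    return changed_line_count(before, after) <= max_changed_lines


before = "def mean(data):\n    n = len(data)\n    return sum(data) / n\n"
after = (
    "def mean(data):\n"
    "    if not data:\n"
    "        raise ValueError('empty')\n"
    "    n = len(data)\n"
    "    return sum(data) / n\n"
)

print(changed_line_count(before, after))                        # 2
print(within_edit_budget(before, after, max_changed_lines=10))  # True
```

A budget like this is a heuristic, not a correctness check. Some legitimate fixes are large, so treat a violation as a prompt for review rather than an automatic rejection.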
Symptom: The agent stuffs the entire repository into the prompt and then loses track of the actual bug.
Cause: The context window looks large, but long inputs don't guarantee reliable retrieval. Lost-in-the-Middle research shows that models often miss details in the middle of long contexts.[12]
Fix: Use targeted retrieval. Feed the model only the files that search tools identify as relevant, plus a summary of the repository structure.
Understanding the limitations is as important as understanding what the benchmark measures:
Greenfield development: SWE-bench is entirely about fixing bugs and adding features to existing codebases. It doesn't test the ability to architect a new system from scratch.
Ambiguous requirements: The "gold" solution is always the original developer's pull request. In practice, many bugs have multiple valid fixes, and the right fix may depend on context that isn't captured in the issue description.
Long-running tasks: Most SWE-bench tasks are still localized bug fixes or feature adjustments. It doesn't test multi-day refactoring projects or large-scale migrations.
Collaboration: Real software engineering involves code review, discussion, and iteration with teammates. SWE-bench tests solo problem-solving in isolation.
Language diversity: The classic Full, Lite, and Verified leaderboards all come from 12 Python repositories. Newer family members like Multimodal and Multilingual widen coverage, but many headline SWE-bench numbers still reflect Python-heavy repository maintenance [4][5].
Performance and scalability reasoning: The benchmark tests correctness (do the tests pass?), not whether the fix is performant, secure, or maintainable. An agent could introduce a correct but inefficient solution where a faster fix exists, and the benchmark wouldn't penalize it.
Domain-specific codebases: The benchmark repositories (Django, scikit-learn, matplotlib) are well-maintained, well-documented open-source projects. Performance on a legacy enterprise codebase with sparse documentation and tangled dependencies would likely be significantly worse.
These missing elements highlight the difference between an AI coding assistant and a full engineering workflow. A coding assistant can excel at targeted fixes, while production engineering also requires system design, requirement negotiation, review, and long-term maintainability.
A high SWE-bench score doesn't mean an agent can replace a human software engineer. It proves the agent is skilled at a very specific subset of engineering tasks: self-contained bug fixes in well-tested repositories.
SWE-bench still matters. It remains one of the most useful public execution-based benchmarks for repository-level bug fixing. But the benchmark is now mature enough that its limitations are part of the story.
OpenAI's February 23, 2026 audit makes the case bluntly [8]. Its Frontier Evals team re-examined 138 problems (about 28% of the 500-task set) that its models could not solve consistently across many runs. Two findings stand out. First, 59.4% of those hardest unsolved problems had flawed tests: narrow tests that reject functionally correct fixes, or wide tests that check behavior the issue never described. Second, frontier models including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview could reproduce the original gold patch verbatim from the task ID alone, which is direct evidence of training-data contamination. OpenAI concluded that gains on Verified increasingly reflect benchmark exposure rather than real engineering ability, stopped reporting Verified scores, and now recommends the public split of SWE-bench Pro instead.
SWE-bench Pro is the Scale AI successor built to address those problems [13]. It expands to 1,865 long-horizon tasks across 41 repositories in four languages (Python, Go, TypeScript, JavaScript), split into a 731-task public set, an 858-task held-out set, and a 276-task commercial set drawn from private startup codebases. Two design choices resist contamination: the public and held-out repositories use strong copyleft (GPL) licenses as a legal deterrent against inclusion in training data, and the commercial set uses proprietary code that was never publicly available.
The tasks are also genuinely harder. They are long-horizon changes that touch multiple files, closer to a day of real engineering work than a localized one-line fix. The effect on scores is dramatic and is the clearest single argument that Verified had saturated through contamination: models that clear 70 to 80%+ on Verified land far lower on Pro. When Scale AI introduced the public set, top models resolved roughly 23% of it [13]; standardized leaderboard scores have since climbed but still sit well below Verified levels, and the same model can drop by tens of points moving from Verified to Pro on comparable task types.
When you read Pro numbers, separate the two leaderboard modes the same way you do for classic SWE-bench. Scale AI's standardized SEAL leaderboard runs every model through identical tooling so you compare raw model capability, while self-reported agent systems bring their own scaffolding and report higher numbers that are not directly comparable. Treat any specific percentage as a point-in-time snapshot, because the frontier moves fast.
The practical takeaway is simple. Use public SWE-bench scores to understand broad competence on repository maintenance. Don't use a 2-point delta at the top of a public Verified leaderboard as proof that one model will outperform another in your private codebase, and prefer contamination-resistant evaluations like SWE-bench Pro when you want a frontier-capability signal.
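Sampling noise alone gives another reason to be careful with small deltas. The rough sketch below treats each task as an independent pass/fail trial and uses a normal approximation, both of which are simplifying assumptions; the resolved counts are hypothetical.

```python
from math import sqrt


def resolve_rate_margin(resolved: int, total: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in percentage points, for a resolve rate."""
    p = resolved / total
    return 100 * z * sqrt(p * (1 - p) / total)


print(round(resolve_rate_margin(375, 500), 1))  # 75% on 500 tasks: about 3.8 points
print(round(resolve_rate_margin(30, 50), 1))    # 60% on 50 tasks: about 13.6 points
```

Under these assumptions, a 2-point gap on a 500-task benchmark sits inside the margin of error of either score, and a small private evaluation set needs a much larger gap before it tells two tools apart.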
If you're evaluating AI coding tools for your team, SWE-bench scores give you a useful signal, but they're not the full picture. Here's a practical framework:
| SWE-bench correlates well with... | SWE-bench says less about... |
|---|---|
| Bug fixing in well-tested Python codebases | Performance in your specific tech stack |
| Feature additions with clear specifications | Handling of proprietary codebases or internal APIs |
| Code exploration in large repositories | Speed and cost efficiency for your use case |
| Following existing patterns and conventions | Quality of code explanations and documentation |
Take SWE-bench scores as a first filter to narrow your shortlist, then run your own evaluation on a dataset of 20 to 50 real issues from your codebase. This gives you a signal that no public benchmark can provide, because it accounts for your specific language, frameworks, coding conventions, and issue types.
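When reading SWE-bench results, ask the questions from the checklist above: which variant, which leaderboard mode, which sampling policy, at what cost per solved task, and with what contamination risk?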
You now understand how SWE-bench works, how it grades agents, and how to read the leaderboard without overreading it. The next step is to learn the Agent-Computer Interface (ACI) in detail: how tool design, prompt scaffolding, and retry policies shape what an agent can do in a repository. If you're building agents, that topic will help you design systems that score well on benchmarks and work reliably in production.
If you're more interested in evaluation, study contamination and benchmark hygiene: how to tell whether a model's score reflects task-solving behavior or memorization of training data. The 2026 shift from SWE-bench Verified to SWE-bench Pro [13] is the canonical case study: Pro uses copyleft licensing and held-out proprietary code as deliberate defenses against training-data leakage. That knowledge matters for anyone who runs or interprets AI evaluations in production.
Either way, the key skill you've built here is reading a benchmark like an engineer, not like a headline scanner. That's the difference between someone who trusts a percentage and someone who asks what the percentage actually measures.
[1] Evaluating Large Language Models Trained on Code (HumanEval). Chen, M., et al. · 2021 · arXiv preprint
[2] Program Synthesis with Large Language Models. Austin, J., et al. · 2021
[3] SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. Yang et al. · 2024
[4] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez, C. E., et al. · 2024 · ICLR 2024
[5] SWE-bench Official Leaderboards. SWE-bench Team · 2026
[6] SWE-bench Lite: A Filtered Subset for Practical Evaluation. Jimenez et al. · 2024
[7] Introducing SWE-bench Verified. OpenAI · 2024
[8] Why SWE-bench Verified no longer measures frontier coding capabilities. OpenAI · 2026
[9] SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? Yang, J., et al. · 2025 · ICLR 2025
[10] Introducing Devin, the first AI software engineer. Cognition Labs · 2024
[11] Agentless: Demystifying LLM-based Software Engineering Agents. Xia et al. · 2024
[12] Lost in the Middle: How Language Models Use Long Contexts. Liu, N.F., et al. · 2023 · TACL 2023
[13] SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Scale AI · 2025