SWE-bench has become the gold standard for measuring AI coding agents, but what does it actually test? We break down the benchmark methodology, its variants, scoring mechanics, and what the leaderboard results really mean for production engineering.
Imagine you are hiring a new software engineer. You wouldn't just ask them to write a sorting algorithm on a whiteboard; you would want to see how they explore a massive legacy codebase, figure out where a bug is hiding, and write a fix without breaking anything else. That is exactly what SWE-bench tries to do for AI coding agents. Every week, a new AI claims to have "solved X% of SWE-bench." The leaderboard has become the de facto scoreboard for the AI coding agent arms race. But beneath the headline numbers lies a surprisingly complex benchmark that most people misunderstand.
SWE-bench evaluates whether an AI system can resolve real GitHub issues from popular open-source Python repositories[1]. Not toy problems. Not LeetCode puzzles. Actual bug reports and feature requests pulled from projects like Django, Flask, scikit-learn, sympy, and matplotlib.
Each task in SWE-bench consists of:

- A problem statement: the text of the GitHub issue describing the bug or feature request
- A repository snapshot: the full codebase at the commit just before the real fix was merged
- A test patch: developer-written tests that fail before the fix and pass after it (hidden from the agent during the attempt)
The following Python code demonstrates how to load a specific task from the SWE-bench dataset. It takes the dataset name and split as inputs, and returns a dictionary containing the task details. We extract and print the unique task ID and the first 100 characters of the problem statement (the GitHub issue description) to show what the agent receives:
```python
from datasets import load_dataset

# Load a task from SWE-bench Lite
ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
task = ds[0]

print(f"Task ID: {task['instance_id']}")
# Output: django__django-11001

print(f"Problem Statement:\n{task['problem_statement'][:100]}...")
# Output: SQL compiler: distinct() and order_by() crash when...
```
The AI agent receives the issue description and the full repository. Its job: produce a code patch that makes the failing tests pass without breaking existing ones.
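Beyond the issue text, each record carries the machine-readable fields the evaluation harness uses. The field names below come from the dataset schema; the values are abbreviated or invented here purely for illustration:

```python
import json

# A SWE-bench record, sketched as a plain dict (values illustrative):
task = {
    "instance_id": "django__django-11001",
    "repo": "django/django",
    "base_commit": "abc1234",                  # snapshot the agent starts from
    "problem_statement": "SQL compiler: ...",  # the GitHub issue body
    # Test lists are stored as JSON-encoded strings in the dataset:
    "FAIL_TO_PASS": json.dumps(["test_order_by_multiline_sql"]),
    "PASS_TO_PASS": json.dumps(["test_distinct", "test_order_by"]),
}

# The tests the agent's patch must flip from failing to passing:
print(json.loads(task["FAIL_TO_PASS"]))
```

The `FAIL_TO_PASS` and `PASS_TO_PASS` lists are what scoring ultimately checks; the agent never sees the gold patch itself.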
Previous coding benchmarks like HumanEval and MBPP test whether a model can write a single function in isolation. SWE-bench tests something fundamentally different.
Writing a sorting function is a different skill from navigating a 500-file Django codebase, identifying the three files that need changes, and producing a patch that resolves the issue without regressions.
💡 Key insight: SWE-bench doesn't just test "can the model write code?" It tests "can the model engineer code?" The gap between those two questions is enormous.
| Dimension | HumanEval / MBPP | SWE-bench |
|---|---|---|
| Scope | Single function | Full repository |
| Context | Self-contained spec | Issue description + codebase |
| Navigation | None | Must find relevant files |
| Testing | Simple I/O checks | Full test suite (existing + new) |
| Realism | Synthetic | Real-world pull requests |
The original SWE-bench dataset contains 2,294 task instances drawn from 12 popular Python repositories. The distribution is intentionally skewed toward large, well-maintained projects: Django accounts for roughly 30% of all tasks, followed by scikit-learn and sympy. Each task corresponds to an actual pull request that was merged in the repository's history, meaning the test patches are written by real developers solving real problems.
Over time, the community realized that not all tasks are equally useful for evaluation. Some tasks require obscure setup steps that make them impractical. Others have ambiguous issue descriptions where even a skilled human developer couldn't determine the correct fix. This led to several curated subsets, each designed for a specific evaluation goal.
A 300-task subset designed to be more practical for evaluation[2]. The filtering removes, among other things:

- tasks whose issue descriptions contain images or links to external resources
- tasks with very short or uninformative problem statements
- tasks whose gold patch touches more than a single file
- tasks with fragile or complicated environment setups
OpenAI worked with professional software developers to screen the full SWE-bench dataset and create a 500-task subset with higher quality[3]. Annotators flagged and removed tasks with:

- issue descriptions too underspecified for even a skilled human to infer the intended fix
- tests so tightly coupled to the gold patch that valid alternative solutions would fail
- environments that were flaky or could not be set up reliably
⚠️ Common mistake: Comparing scores across variants. A 50% on Lite is a very different achievement than 50% on the full set. Always check which subset is being reported before drawing conclusions.
An emerging variant that includes issues with screenshots, error traces, and visual artifacts. This tests whether agents can reason about UI bugs, chart rendering issues, and visual regressions in addition to pure code logic.
As web development and data science increasingly rely on visual output, pure text-based reasoning is often insufficient for debugging. For example, an issue might say "the chart legend overlaps the x-axis" and provide an image of the bug. To solve this, an agent must process the image, correlate the visual overlap with the corresponding matplotlib or CSS code, and write a fix that addresses the layout issue.
💡 Key insight: Visual bugs often lack clear error logs or stack traces. The agent must independently verify what the rendered UI looks like to confirm if its CSS or visualization code changes actually solved the problem.
This multimodal capability represents the next frontier for AI coding agents, moving them closer to how human engineers naturally debug frontend and graphical issues by looking at both the code and the rendered result side-by-side.
A task is counted as resolved if and only if all three conditions are met: the patch applies cleanly, the new tests (the FAIL_TO_PASS set) pass, and no existing tests (the PASS_TO_PASS set) break.
There's no partial credit. Either the patch fully resolves the issue or it doesn't. The final score is the percentage of tasks resolved out of the total dataset.
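The all-or-nothing rule is simple enough to state in a few lines. A minimal sketch, using the dataset's FAIL_TO_PASS / PASS_TO_PASS terminology (the function name and boolean-list inputs are illustrative, not the actual harness API):

```python
def score_task(patch_applies: bool,
               fail_to_pass_results: list[bool],
               pass_to_pass_results: list[bool]) -> int:
    """Binary SWE-bench scoring: 1 if the task is resolved, else 0.

    fail_to_pass_results: outcomes of the tests added by the issue's fix
    pass_to_pass_results: outcomes of the pre-existing regression tests
    """
    resolved = (patch_applies
                and all(fail_to_pass_results)
                and all(pass_to_pass_results))
    return 1 if resolved else 0

# No partial credit: fixing 9 of 10 new tests scores the same as garbage output.
print(score_task(True, [True] * 9 + [False], [True] * 50))  # 0
print(score_task(True, [True] * 10, [True] * 50))           # 1
```

The final benchmark score is just the mean of this 0/1 value over all tasks in the chosen subset.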
The evaluation pipeline runs in a Docker container with the exact environment specified by the task. This matters because many issues involve specific Python versions, dependency configurations, or database setups. An agent's patch might work on Python 3.11 but fail on the task's required Python 3.9. The containerized evaluation catches these issues.
Each task has a "gold" patch: the actual fix that was merged in the original pull request. But in practice, many bugs have multiple valid fixes. The scoring system handles this by evaluating against the test suite, not against the gold patch itself. If your patch is completely different from the original but passes all tests, it counts as resolved. However, if the gold patch is overly specific (e.g., fixing a string to match a particular format), alternative correct solutions may fail.
🔬 Research insight: This binary scoring has important implications. An agent that produces a patch fixing 90% of the failing tests but breaking one existing test gets the same score as an agent that produces garbage output: zero for that task.[1]
The highest-performing agents on SWE-bench aren't just Large Language Models (LLMs) generating code. They're sophisticated multi-step systems that combine exploration, reasoning, and iteration. The typical workflow takes an issue as input and produces a verified patch as output through a continuous loop of code understanding and testing:
Step 1: Repository Exploration. The agent explores the repository structure, identifying relevant directories, reading README files, and building a mental model of the codebase architecture.
Step 2: File Localization. Using the issue description as a guide, the agent searches for relevant files. This might involve grep searches for error messages, following import chains, or using the project's test structure to find the right module.
Step 3: Code Understanding. The agent reads and reasons about the relevant source code, understanding how the current implementation works and where the bug or missing feature is located.
🎯 Production tip: The file localization step is where most agents win or lose. An agent that picks the right files on the first try can solve the issue in a few iterations. One that starts in the wrong part of the codebase burns through its token budget exploring dead ends.
Step 4: Patch Generation. Based on its understanding, the agent generates a diff patch. The best agents iterate multiple times, refining their patches based on test feedback.
Step 5: Verification. Before submitting, the agent runs the test suite to verify its patch. If tests fail, it loops back to refine the solution.
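The five steps above can be compressed into a small driver loop. This is a sketch with stand-in callables for the LLM and tool calls; every name here is illustrative, not an actual framework API:

```python
from typing import Callable, Optional

def agent_loop(issue: str,
               localize: Callable[[str], list],
               generate_patch: Callable[[str, list], str],
               run_tests: Callable[[str], bool],
               max_iters: int = 3) -> Optional[str]:
    """Explore and localize once, then iterate patching against test feedback."""
    files = localize(issue)                 # Steps 1-2: find the relevant files
    for _ in range(max_iters):              # Steps 3-5: understand, patch, verify
        patch = generate_patch(issue, files)
        if run_tests(patch):                # submit only a verified patch
            return patch
    return None                             # iteration/token budget exhausted

# Toy run with hard-coded stand-ins:
patch = agent_loop(
    "distinct() and order_by() crash",
    localize=lambda issue: ["django/db/models/sql/compiler.py"],
    generate_patch=lambda issue, files: "diff --git ...",
    run_tests=lambda p: True,
)
```

Real agents spend most of their budget inside `localize` and `run_tests`; the loop structure itself is this simple.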
SWE-agent[4] was one of the first purpose-built agents for this benchmark. It introduced the concept of a specialized Agent-Computer Interface (ACI) with custom shell commands designed for code navigation and editing. Devin[5] later demonstrated that giving an agent a full development environment (terminal, browser, editor) could push performance even higher.
Not every approach needs a complex agent loop. The "Agentless" approach showed that a simple pipeline (localize files → generate a patch → re-rank candidates) without any iterative tool use can achieve competitive results at a fraction of the cost.[6] Unlike traditional agentic loops that rely on the LLM to autonomously decide the next action or execute shell commands, this method strips away the overhead.
By hardcoding the workflow into discrete, deterministic phases, Agentless achieved a 32% solve rate on SWE-bench Lite at just $0.70 per task. This raises an important question about the benchmark: does SWE-bench reward genuine autonomous engineering skill, or does it primarily reward the ability to make a small, targeted fix when the problem context is already narrowed down by external scaffolding?
💡 Key insight: The success of the Agentless approach suggests that for many SWE-bench tasks, the primary bottleneck is finding the right files, not the actual code generation. Once localized, even simpler pipelines can often write the correct fix.
This finding has profoundly influenced subsequent agent designs, demonstrating that much of the performance previously attributed to complex agentic reasoning was actually due to the underlying model's code generation capability when given the right context.
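The fixed three-phase structure can be sketched as follows; `llm` and `rank` are stand-ins for model calls and test-based candidate scoring, and all names are illustrative rather than taken from the Agentless codebase:

```python
from itertools import count

def agentless_pipeline(issue: str, repo_files: list, llm, rank,
                       n_samples: int = 4) -> str:
    """Sketch of a localize → generate → re-rank pipeline, Agentless-style.
    Each phase runs exactly once, in a fixed order: no agent loop anywhere."""
    # Phase 1: one-shot localization, no iterative exploration
    files = llm("localize", issue, repo_files)
    # Phase 2: sample several candidate patches for the located files
    candidates = [llm("patch", issue, files) for _ in range(n_samples)]
    # Phase 3: deterministic re-ranking (e.g. regression-test votes)
    return max(candidates, key=rank)

# Toy run: four candidate patches, the re-ranker prefers "fix-3"
ids = count(1)
def toy_llm(mode, *args):
    return ["utils.py"] if mode == "localize" else f"fix-{next(ids)}"

best = agentless_pipeline("crash on empty input", ["utils.py"], toy_llm,
                          rank=lambda p: p == "fix-3")
print(best)  # fix-3
```

Because the workflow is hardcoded, cost is bounded and predictable: the number of LLM calls is fixed in advance rather than decided by the model.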
By late 2025, the most successful SWE-bench agents share several design choices:

- an explicit localization phase before any editing begins
- running the repository's test suite and iterating on failures before submitting
- generating multiple candidate patches and selecting among them
- executing inside a sandboxed environment that mirrors the task's dependency setup
When you see a headline like "Agent X achieves 50% on SWE-bench," here's what you should consider:
| Factor | What to Watch For |
|---|---|
| Which variant? | Lite, Verified, and Full scores are NOT comparable |
| Cost per task | Top agents often make 50-100+ LLM calls, costing $1-5 per attempt |
| Submission method | Some evaluations let agents skip hard tasks, inflating pass rates |
| Contamination risk | Some benchmark issues may now appear in model training data |
The standard evaluation protocol requires attempting all tasks. If an evaluation report doesn't specify this, treat the numbers skeptically.
⚠️ Common mistake: Taking leaderboard numbers at face value. The cost and latency required to achieve top scores make many approaches impractical for real-world use. A 40% score achieved cheaply can be worth more in practice than a higher score achieved at $5.00 per task.
Understanding the limitations is as important as understanding what the benchmark measures:
Greenfield development: SWE-bench is entirely about fixing bugs and adding features to existing codebases. It doesn't test the ability to architect a new system from scratch.
Ambiguous requirements: The "gold" solution is always the original developer's pull request. In practice, many bugs have multiple valid fixes, and the "best" fix depends on context that isn't captured in the issue description.
Long-running tasks: Most SWE-bench tasks can be solved with relatively small patches (median of roughly 20 lines changed). It doesn't test multi-day refactoring projects or large-scale migrations.
Collaboration: Real software engineering involves code review, discussion, and iteration with teammates. SWE-bench tests solo problem-solving in isolation.
Language diversity: The benchmark is exclusively Python. Performance may not generalize to other languages, especially those with very different characteristics (statically typed, functional, systems languages).
Performance and scalability reasoning: The benchmark tests correctness (do the tests pass?), not whether the fix is performant, secure, or maintainable. An agent could introduce a correct but O(n²) solution where an O(n) fix exists, and the benchmark wouldn't penalize it.
Domain-specific codebases: The benchmark repositories (Django, scikit-learn, matplotlib) are well-maintained, well-documented open-source projects. Performance on a legacy enterprise codebase with sparse documentation and tangled dependencies would likely be significantly worse.
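The correctness-only scoring behind the performance limitation above is easy to make concrete. Both functions below would pass an identical functional test suite, so a SWE-bench-style evaluation cannot tell them apart (the example is mine, not from the benchmark):

```python
def has_duplicates_quadratic(xs: list) -> bool:
    # O(n^2): correct, and indistinguishable from the fast fix under
    # correctness-only tests
    return any(x == y for i, x in enumerate(xs) for y in xs[i + 1:])

def has_duplicates_linear(xs: list) -> bool:
    # O(n): the fix a reviewer would actually want
    return len(set(xs)) != len(xs)

for case in ([], [1, 2, 3], [1, 2, 2]):
    assert has_duplicates_quadratic(case) == has_duplicates_linear(case)
```

Catching this kind of difference requires performance tests or human review, neither of which the benchmark includes.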
Anthropic's Claude models have shown particularly strong performance on SWE-bench[7], which has sparked an interesting debate: is SWE-bench measuring general software engineering ability, or is it measuring something more specific about how well a model handles Python codebases with good test coverage?
The distinction matters. A model that excels at SWE-bench might still struggle with:

- languages other than Python, especially statically typed or systems languages
- legacy codebases with sparse tests and documentation
- greenfield architecture and design work
- issues whose requirements are ambiguous or under-specified
This doesn't diminish the benchmark's value. It means you need to understand what it measures and extrapolate carefully.
💡 Key insight: SWE-bench success correlates strongly with "fixing well-tested Python code." That's a valuable skill, but it's a subset of what production software engineering actually looks like.
The leaderboard moved rapidly through 2025. Here are the SWE-bench Verified scores from independent evaluation by NIST's Center for AI Standards and Innovation (CAISI, formerly the U.S. AI Safety Institute)[8], as of late 2025:
| Model | SWE-bench Verified (CAISI) | SWE-bench Verified (Self-Reported) |
|---|---|---|
| GPT-5 | 63.0% | 74.9% |
| Claude Opus 4 | 66.7% | 72.5% |
| DeepSeek V3.1 | 54.8% | 66.0% |
| DeepSeek R1-0528 | 44.6% | 57.6% |
| OpenAI gpt-oss | 42.6% | 62.4% |
| DeepSeek R1 | 25.4% | 49.2% |
A few things jump out from this data:
The self-reported gap is real. Every model performs worse under independent evaluation than in the lab's own benchmarks. This isn't necessarily dishonest; lab evaluations often use optimized prompting, higher token budgets, or agent scaffolding. But it means you should always look for independent verification.
Reasoning models don't automatically win. DeepSeek R1, a purpose-built reasoning model, scored lower than the general-purpose V3.1 on SWE-bench. This makes sense: SWE-bench tasks reward practical code navigation and pattern matching more than deep chain-of-thought reasoning. Knowing when to reason deeply and when to just apply the obvious fix is itself a skill.
The 60-70% barrier. The top models cluster in the 60-70% range on Verified (independently measured). The remaining tasks likely require capabilities that current models lack: understanding complex cross-module interactions, reasoning about performance implications, or handling tasks where the issue description provides minimal guidance.
What the leaderboard doesn't show is the economics. The Aider project tested DeepSeek V3.1 on 225 programming tasks and found it achieves a 71.6% pass rate, comparable to Claude Opus, at 1/68th the cost per task[9]. For enterprise teams running thousands of evaluations, this cost difference is transformative.
🔬 Research insight: A 68x cost reduction with roughly equivalent quality is a stronger signal than a 5% accuracy improvement at the same price. When evaluating AI coding tools, always ask about cost-per-solved-task, not just solve rate.
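Cost-per-solved-task is trivial to compute and worth doing for any pair of leaderboard entries. The prices below are illustrative stand-ins, not the actual Aider figures:

```python
def cost_per_solved_task(solve_rate: float, cost_per_attempt: float) -> float:
    """Average spend to obtain one resolved task when every task is attempted."""
    return cost_per_attempt / solve_rate

# Illustrative prices in the spirit of the Aider comparison (not exact figures):
expensive = cost_per_solved_task(0.72, 3.40)   # premium model
cheap = cost_per_solved_task(0.716, 0.05)      # budget model, near-equal quality
print(f"{expensive / cheap:.0f}x cheaper per solved task")  # 68x cheaper ...
```

Note that the formula assumes all tasks are attempted; if an evaluation skips hard tasks, both the solve rate and the implied cost are distorted.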
If you're evaluating AI coding tools for your team, SWE-bench scores give you a useful signal, but they're not the full picture. Here's a practical framework:
| SWE-bench correlates well with... | SWE-bench says less about... |
|---|---|
| Bug fixing in well-tested Python codebases | Performance in your specific tech stack |
| Feature additions with clear specifications | Handling of proprietary codebases or internal APIs |
| Code navigation in large repositories | Speed and cost efficiency for your use case |
| Following existing patterns and conventions | Quality of code explanations and documentation |
🎯 Production tip: Take SWE-bench scores as a first filter to narrow your shortlist, then run your own evaluation on a dataset of 20-50 real issues from your codebase. This gives you a signal that no public benchmark can provide, because it accounts for your specific language, frameworks, coding conventions, and the types of issues your team actually encounters.
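A private evaluation in the SWE-bench style needs surprisingly little scaffolding. A sketch under stated assumptions: `agent` and `run_tests` are stand-ins you would wire to your tool and your sandboxed test runner, and every name here is hypothetical:

```python
from typing import Callable, Optional

def evaluate_agent(agent: Callable[[str], Optional[str]],
                   tasks: list,
                   run_tests: Callable[[dict, str], bool]) -> float:
    """Binary, SWE-bench-style scoring over a list of internal issues.
    Each task dict holds an issue description plus whatever `run_tests`
    needs to apply the patch in a sandbox and run that task's tests."""
    resolved = sum(
        1 for task in tasks
        if (patch := agent(task["issue"])) is not None
        and run_tests(task, patch)          # all-or-nothing, no partial credit
    )
    return resolved / len(tasks)

# Toy usage: an "agent" that only fixes issues mentioning "typo"
tasks = [{"issue": "fix typo in docs"}, {"issue": "rewrite the scheduler"}]
rate = evaluate_agent(
    agent=lambda issue: "patch" if "typo" in issue else None,
    tasks=tasks,
    run_tests=lambda task, patch: True,
)
print(rate)  # 0.5
```

Tracking per-task cost alongside this solve rate gives you the same accuracy-plus-economics view the independent leaderboards now report.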
When reading SWE-bench results, ask these questions:

- Which variant was used: Lite, Verified, or the full set?
- Is the score independently verified or self-reported?
- Were all tasks attempted, or were hard tasks skipped?
- What did each attempt cost in tokens, dollars, and wall-clock time?
The benchmark is a valuable tool for tracking progress in AI coding capabilities. The gap between "resolves GitHub issues in popular Python repos" and "replaces a software engineer" remains vast. Use the scores as one data point among many when making tool adoption decisions.
The benchmarking space is evolving in several directions as of early 2026:
Multi-language expansion. The Python-only limitation is being addressed. Community efforts to build SWE-bench equivalents for JavaScript/TypeScript, Java, and Rust are underway. Early results suggest that model performance varies significantly across languages, with TypeScript results generally tracking close to Python while systems languages show larger drops.
Agentic benchmarks. New benchmarks like Terminal-Bench and τ-Bench (Tau-Bench) evaluate agents in more realistic settings: interacting with APIs, managing stateful sessions, and handling ambiguous instructions that require clarification. These complement SWE-bench by testing the broader set of skills that production coding agents need.
The multimodal frontier. SWE-bench Multimodal adds visual artifacts (screenshots, chart renderings, UI mockups) to issue descriptions. This tests whether agents can reason about what code produces, not just what code is. This matters increasingly as frontend and full-stack AI coding becomes more practical.
⚠️ Warning: As SWE-bench issues have been publicly available for over two years and are widely discussed in AI research, the risk of test set contamination grows. Models trained after mid-2024 may have seen solutions to some benchmark tasks during pre-training. The community is exploring dynamic benchmark generation and held-out test sets to address this.
Cost-normalized evaluation. Independent evaluation efforts from NIST's Center for AI Standards and Innovation (CAISI) and platforms like Artificial Analysis now report both accuracy and per-task cost. This is arguably the most important trend: an agent that solves 55% of tasks cheaply can be more deployable than a slightly more accurate one at $5 per attempt. As coding agents move from demos to production, cost efficiency is becoming a first-class evaluation metric alongside raw accuracy.
Real-time, streaming benchmarks. Rather than evaluating agents on historical issues, some groups are experimenting with live evaluation: agents receive newly filed issues in real time and must resolve them before human developers do. This eliminates contamination risk entirely and tests the agent under realistic time pressure. Early experiments suggest that agent performance drops 10-15% compared to the standard (offline, no time limit) setting.
The most honest reading of the current state: AI coding agents can now solve the majority of well-specified bugs in well-tested Python codebases. That's genuinely impressive, and commercially useful. But the remaining 30-40% of tasks, and the vast space of engineering work that SWE-bench doesn't cover at all, represent the harder frontier that the field is now confronting.
[1] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez et al. · 2024 · ICLR 2024
[2] SWE-bench Lite: A Filtered Subset for Practical Evaluation
Jimenez et al. · 2024
[3] SWE-bench Verified: Improving the Quality and Reliability of SWE-bench
OpenAI · 2024
[4] SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Yang et al. · 2024
[5] Introducing Devin, the first AI software engineer
Cognition Labs · 2024
[6] Agentless: Demystifying LLM-based Software Engineering Agents
Xia et al. · 2024
[7] Claude's performance on SWE-bench
Anthropic · 2025
[8] Evaluation of DeepSeek AI Models
NIST CAISI · 2025
[9] Aider LLM Leaderboards
Dev.to Community · 2025