
Understanding SWE-bench

SWE-bench has become the gold standard for measuring AI coding agents, but what does it actually test? We break down the benchmark methodology, its variants, scoring mechanics, and what the leaderboard results really mean for production engineering.

LeetLLM Team · February 17, 2026 · 12 min read

What Does SWE-bench Actually Evaluate?

Imagine you are hiring a new software engineer. You wouldn't just ask them to write a sorting algorithm on a whiteboard; you would want to see how they explore a massive legacy codebase, figure out where a bug is hiding, and write a fix without breaking anything else. That is exactly what SWE-bench tries to do for AI coding agents. Every week, a new AI claims to have "solved X% of SWE-bench." The leaderboard has become the de facto scoreboard for the AI coding agent arms race. But beneath the headline numbers lies a surprisingly complex benchmark that most people misunderstand.

The Core Idea

SWE-bench evaluates whether an AI system can resolve real GitHub issues from popular open-source Python repositories[1]. Not toy problems. Not LeetCode puzzles. Actual bug reports and feature requests pulled from projects like Django, Flask, scikit-learn, sympy, and matplotlib.

Each task in SWE-bench consists of:

  1. A GitHub issue description (the natural language specification)
  2. A codebase snapshot at the commit just before the fix was merged
  3. A test patch that verifies whether the fix is correct

The following Python code demonstrates how to load a specific task from the SWE-bench dataset. It takes the dataset name and split as inputs, and returns a dictionary containing the task details. We extract and print the unique task ID and the first 100 characters of the problem statement (the GitHub issue description) to show what the agent receives:

```python
from datasets import load_dataset

# Load a task from SWE-bench Lite
ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
task = ds[0]

print(f"Task ID: {task['instance_id']}")
# Output: django__django-11001

print(f"Problem Statement:\n{task['problem_statement'][:100]}...")
# Output: SQL compiler: distinct() and order_by() crash when...
```

The AI agent receives the issue description and the full repository. Its job: produce a code patch that makes the failing tests pass without breaking existing ones.

Why SWE-bench Hits Different

Previous coding benchmarks like HumanEval and MBPP test whether a model can write a single function in isolation. SWE-bench tests something fundamentally different.

Comparison of coding benchmarks: SWE-bench, HumanEval, MBPP, and LiveCodeBench across difficulty, real-world relevance, and evaluation method.

Writing a sorting function is a different skill from navigating a 500-file Django codebase, identifying the three files that need changes, and producing a patch that resolves the issue without regressions.

💡 Key insight: SWE-bench doesn't just test "can the model write code?" It tests "can the model engineer code?" The gap between those two questions is enormous.

| Dimension | HumanEval / MBPP | SWE-bench |
| --- | --- | --- |
| Scope | Single function | Full repository |
| Context | Self-contained spec | Issue description + codebase |
| Navigation | None | Must find relevant files |
| Testing | Simple I/O checks | Full test suite (existing + new) |
| Realism | Synthetic | Real-world pull requests |

The Benchmark Variants

The original SWE-bench dataset contains 2,294 task instances drawn from 12 popular Python repositories. The distribution is intentionally skewed toward large, well-maintained projects: Django accounts for roughly 30% of all tasks, followed by scikit-learn and sympy. Each task corresponds to an actual pull request that was merged in the repository's history, meaning the test patches are written by real developers solving real problems.

Over time, the community realized that not all tasks are equally useful for evaluation. Some tasks require obscure setup steps that make them impractical. Others have ambiguous issue descriptions where even a skilled human developer couldn't determine the correct fix. This led to several curated subsets, each designed for a specific evaluation goal.

SWE-bench variants: Lite (300 curated tasks), Full (2,294 tasks), and Verified (500 human-validated tasks), with difficulty distributions for each.

SWE-bench Lite

A 300-task subset designed to be more practical for evaluation[2]. The filtering criteria:

  • Removes tasks requiring changes to test files
  • Removes tasks with overly complex setup requirements
  • Removes tasks where the issue description is too vague
  • Focuses on self-contained bugs that a human developer could reasonably debug from the issue description alone

SWE-bench Verified

OpenAI worked with human software developers to review the full SWE-bench test set and create a 500-task subset with higher quality[3]:

  • Each task was verified by at least one human developer
  • Ambiguous or poorly-specified tasks were removed
  • Tasks where the "gold" patch might not be the only valid fix were flagged
  • The result is a more reliable benchmark with less noise in the scores

āš ļø Common mistake: Comparing scores across variants. A 50% on Lite is a very different achievement than 50% on the full set. Always check which subset is being reported before drawing conclusions.

SWE-bench Multimodal

An emerging variant that includes issues with screenshots, error traces, and visual artifacts. This tests whether agents can reason about UI bugs, chart rendering issues, and visual regressions in addition to pure code logic.

As web development and data science increasingly rely on visual output, pure text-based reasoning is often insufficient for debugging. For example, an issue might say "the chart legend overlaps the x-axis" and provide an image of the bug. To solve this, an agent must process the image, correlate the visual overlap with the corresponding matplotlib or CSS code, and write a fix that addresses the layout issue.

💡 Key insight: Visual bugs often lack clear error logs or stack traces. The agent must independently verify what the rendered UI looks like to confirm if its CSS or visualization code changes actually solved the problem.

This multimodal capability represents the next frontier for AI coding agents, moving them closer to how human engineers naturally debug frontend and graphical issues by looking at both the code and the rendered result side-by-side.

How Scoring Works

A task is counted as resolved if and only if all three conditions are met: the patch applies cleanly, the new tests pass, and no existing tests break.

SWE-bench scoring pipeline: a model receives a GitHub issue, generates a patch, the patch is applied to the repo, and tests determine pass/fail.

There's no partial credit. Either the patch fully resolves the issue or it doesn't. The final score is the percentage of tasks resolved out of the total dataset.
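That resolved criterion is simple enough to state directly in code. A minimal sketch follows; the FAIL_TO_PASS / PASS_TO_PASS terminology mirrors the SWE-bench harness, but these functions are illustrative, not the official implementation:

```python
def is_resolved(patch_applied: bool,
                fail_to_pass: list[bool],
                pass_to_pass: list[bool]) -> bool:
    """A task counts as resolved only if all three conditions hold:
    the patch applies cleanly, every new test passes (FAIL_TO_PASS),
    and no previously passing test breaks (PASS_TO_PASS)."""
    return patch_applied and all(fail_to_pass) and all(pass_to_pass)


def solve_rate(task_results: list[bool]) -> float:
    """Final score: fraction of tasks resolved out of the total dataset."""
    return sum(task_results) / len(task_results)
```

Note the binary nature: a patch that fixes every new test but breaks a single existing one scores exactly the same as no patch at all.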

The evaluation pipeline runs in a Docker container with the exact environment specified by the task. This matters because many issues involve specific Python versions, dependency configurations, or database setups. An agent's patch might work on Python 3.11 but fail on the task's required Python 3.9. The containerized evaluation catches these issues.

The "Gold Patch" Problem

Each task has a "gold" patch: the actual fix that was merged in the original pull request. But in practice, many bugs have multiple valid fixes. The scoring system handles this by evaluating against the test suite, not against the gold patch itself. If your patch is completely different from the original but passes all tests, it counts as resolved. However, if the gold patch is overly specific (e.g., fixing a string to match a particular format), alternative correct solutions may fail.

🔬 Research insight: This binary scoring has important implications. An agent that produces a patch fixing 90% of the failing tests but breaking one existing test gets the same score as an agent that produces garbage output: zero for that task.[1]

What Top Agents Actually Do

The highest-performing agents on SWE-bench aren't just Large Language Models (LLMs) generating code. They're sophisticated multi-step systems that combine exploration, reasoning, and iteration. The following diagram illustrates the typical workflow of these agents, taking an issue as input and producing a verified patch as output through a continuous loop of code understanding and testing:

Agent workflow diagram: a GitHub issue goes in; exploration, localization, patch generation, and test verification loop until a verified patch comes out.

Step 1: Repository Exploration. The agent explores the repository structure, identifying relevant directories, reading README files, and building a mental model of the codebase architecture.

Step 2: File Localization. Using the issue description as a guide, the agent searches for relevant files. This might involve grep searches for error messages, following import chains, or using the project's test structure to find the right module.
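Even a crude version of this search illustrates the idea. The sketch below ranks files by keyword overlap with the issue text; `localize_files` is a hypothetical helper for illustration only, as real agents use grep, import chains, embeddings, or code-search indices:

```python
import os
import re
from collections import Counter


def localize_files(repo_dir: str, issue_text: str, top_k: int = 3) -> list:
    """Rank Python files by how often they mention keywords from the
    issue text. Naive keyword-overlap localization, for illustration."""
    keywords = set(re.findall(r"[a-zA-Z_]{4,}", issue_text.lower()))
    scored = []
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                tokens = Counter(re.findall(r"[a-zA-Z_]{4,}", f.read().lower()))
            # Score = total occurrences of issue keywords in this file
            scored.append((sum(tokens[k] for k in keywords), path))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [path for _score, path in scored[:top_k]]
```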

Step 3: Code Understanding. The agent reads and reasons about the relevant source code, understanding how the current implementation works and where the bug or missing feature is located.

🎯 Production tip: The file localization step is where most agents win or lose. An agent that picks the right files on the first try can solve the issue in a few iterations. One that starts in the wrong part of the codebase burns through its token budget exploring dead ends.

Step 4: Patch Generation. Based on its understanding, the agent generates a diff patch. The best agents iterate multiple times, refining their patches based on test feedback.

Step 5: Verification. Before submitting, the agent runs the test suite to verify its patch. If tests fail, it loops back to refine the solution.
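The generate-verify loop in steps 4 and 5 can be sketched as a simple bounded iteration, with `generate` and `run_tests` as stand-ins for the model call and the test harness:

```python
def refine_patch(generate, run_tests, max_iters: int = 5):
    """Iterate: generate a patch, run the tests, feed failures back.
    Bounded at max_iters to cap token and compute cost."""
    feedback = None
    for _ in range(max_iters):
        patch = generate(feedback)          # model call, conditioned on feedback
        passed, feedback = run_tests(patch)  # test harness: (bool, error output)
        if passed:
            return patch
    return None  # budget exhausted without a passing patch
```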

SWE-agent[4] was one of the first purpose-built agents for this benchmark. It introduced the concept of a specialized Agent-Computer Interface (ACI) with custom shell commands designed for code navigation and editing. Devin[5] later demonstrated that giving an agent a full development environment (terminal, browser, editor) could push performance even higher.

The Agentless Alternative

Not every approach needs a complex agent loop. The "Agentless" approach showed that a simple pipeline (localize files → generate a patch → re-rank candidates) without any iterative tool use can achieve competitive results at a fraction of the cost.[6] Unlike traditional agentic loops that rely on the LLM to autonomously decide the next action or execute shell commands, this method strips away the overhead.

By hardcoding the workflow into discrete, deterministic phases, Agentless achieved a 32% solve rate on SWE-bench Lite at just $0.70 per task. This raises an important question about the benchmark: does SWE-bench reward genuine autonomous engineering skill, or does it primarily reward the ability to make a small, targeted fix when the problem context is already narrowed down by external scaffolding?

💡 Key insight: The success of the Agentless approach suggests that for many SWE-bench tasks, the primary bottleneck is finding the right files, not the actual code generation. Once localized, even simpler pipelines can often write the correct fix.

This finding has profoundly influenced subsequent agent designs, demonstrating that much of the performance previously attributed to complex agentic reasoning was actually due to the underlying model's code generation capability when given the right context.
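The Agentless recipe reduces to a fixed three-phase function with no agent loop. In this sketch, `localize`, `propose`, and `rerank` are placeholders for the retrieval, model-sampling, and candidate-scoring steps, not real APIs from the Agentless codebase:

```python
def agentless_pipeline(issue, repo, localize, propose, rerank, n_candidates=4):
    """Fixed, deterministic three-phase pipeline: no autonomous tool use.
    1. localize: narrow the repo to a few candidate files
    2. propose:  sample several candidate patches from the model
    3. rerank:   score the candidates (e.g., via regression tests) and pick one
    """
    files = localize(repo, issue)
    candidates = [propose(issue, files) for _ in range(n_candidates)]
    return rerank(candidates)
```

Because the control flow is hardcoded, the model is never asked to decide what to do next, which is exactly the overhead the Agentless authors argue is often unnecessary.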

Modern Agent Architecture Patterns (2025)

By late 2025, the most successful SWE-bench agents share several design choices:

  • File localization as a separate, optimized step. Rather than having the agent explore freely, dedicated retrieval models or code search indices narrow down relevant files before the main LLM ever sees the code.
  • Test-driven iteration. The agent writes a patch, runs the tests, reads the error output, and refines. The best agents cap this at 3-5 iterations to avoid cost blowup.
  • Smaller models for scaffolding. Many top systems use a cheaper model (like GPT-4o-mini or DeepSeek-V3) for file exploration and a more capable model (like Claude Opus 4 or GPT-5) for the final patch generation. This keeps costs manageable while maintaining quality on the hardest step.

The Leaderboard Reality Check

When you see a headline like "Agent X achieves 50% on SWE-bench," here's what you should consider:

| Factor | What to Watch For |
| --- | --- |
| Which variant? | Lite, Verified, and Full scores are NOT comparable |
| Cost per task | Top agents often make 50-100+ LLM calls, costing $1-5 per attempt |
| Submission method | Some evaluations let agents skip hard tasks, inflating pass rates |
| Contamination risk | Some benchmark issues may now appear in model training data |

The standard evaluation protocol requires attempting all tasks. If an evaluation report doesn't specify this, treat the numbers skeptically.
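It's easy to quantify how much skipping inflates a score; a toy illustration with made-up numbers:

```python
def pass_rates(resolved: int, attempted: int, total: int) -> tuple:
    """Honest rate over all tasks vs. the inflated rate computed
    only over the tasks the agent chose to attempt."""
    return resolved / total, resolved / attempted


# Made-up numbers: an agent quietly skips 100 hard tasks
honest, inflated = pass_rates(resolved=120, attempted=200, total=300)
# honest = 0.40, inflated = 0.60: same agent, very different headline
```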

āš ļø Common mistake: Taking leaderboard numbers at face value. The cost and latency required to achieve top scores make many approaches impractical for real-world use. A 40% score at 0.10pertaskmightbemoreusefulthana600.10 per task might be more useful than a 60% score at 0.10pertaskmightbemoreusefulthana605.00 per task.

What SWE-bench Doesn't Test

Understanding the limitations is as important as understanding what the benchmark measures:

  • Greenfield development: SWE-bench is entirely about fixing bugs and adding features to existing codebases. It doesn't test the ability to architect a new system from scratch.

  • Ambiguous requirements: The "gold" solution is always the original developer's pull request. In practice, many bugs have multiple valid fixes, and the "best" fix depends on context that isn't captured in the issue description.

  • Long-running tasks: Most SWE-bench tasks can be solved with relatively small patches (median of roughly 20 lines changed). It doesn't test multi-day refactoring projects or large-scale migrations.

  • Collaboration: Real software engineering involves code review, discussion, and iteration with teammates. SWE-bench tests solo problem-solving in isolation.

  • Language diversity: The benchmark is exclusively Python. Performance may not generalize to other languages, especially those with very different characteristics (statically typed, functional, systems languages).

  • Performance and scalability reasoning: The benchmark tests correctness (do the tests pass?), not whether the fix is performant, secure, or maintainable. An agent could introduce a correct but O(n²) solution where an O(n) fix exists, and the benchmark wouldn't penalize it.

  • Domain-specific codebases: The benchmark repositories (Django, scikit-learn, matplotlib) are well-maintained, well-documented open-source projects. Performance on a legacy enterprise codebase with sparse documentation and tangled dependencies would likely be significantly worse.

The Claude Effect

Anthropic's Claude models have shown particularly strong performance on SWE-bench[7], which has sparked an interesting debate: is SWE-bench measuring general software engineering ability, or is it measuring something more specific about how well a model handles Python codebases with good test coverage?

The distinction matters. A model that excels at SWE-bench might still struggle with:

  • Debugging production incidents from logs and metrics
  • Refactoring brittle code without test coverage
  • Making architectural decisions with long-term tradeoffs
  • Writing code in unfamiliar frameworks or languages
  • Debugging issues that span multiple services or processes
  • Handling requirements that evolve during the implementation

This doesn't diminish the benchmark's value. It means you need to understand what it measures and extrapolate carefully.

💡 Key insight: SWE-bench success correlates strongly with "fixing well-tested Python code." That's a valuable skill, but it's a subset of what production software engineering actually looks like.

State of the Art (Early 2026)

The leaderboard moved rapidly through 2025. Here are the SWE-bench Verified scores from independent evaluation by NIST's Center for AI Standards and Innovation (CAISI)[8], as of late 2025:

| Model | SWE-bench Verified (CAISI) | SWE-bench Verified (Self-Reported) |
| --- | --- | --- |
| GPT-5 | 63.0% | 74.9% |
| Claude Opus 4 | 66.7% | 72.5% |
| DeepSeek V3.1 | 54.8% | 66.0% |
| DeepSeek R1-0528 | 44.6% | 57.6% |
| OpenAI gpt-oss | 42.6% | 62.4% |
| DeepSeek R1 | 25.4% | 49.2% |

A few things jump out from this data:

  • The self-reported gap is real. Every model performs worse under independent evaluation than in the lab's own benchmarks. This isn't necessarily dishonest; lab evaluations often use optimized prompting, higher token budgets, or agent scaffolding. But it means you should always look for independent verification.

  • Reasoning models don't automatically win. DeepSeek R1, a purpose-built reasoning model, scored lower than the general-purpose V3.1 on SWE-bench. This makes sense: SWE-bench tasks reward practical code navigation and pattern matching more than deep chain-of-thought reasoning. Knowing when to reason deeply and when to just apply the obvious fix is itself a skill.

  • The 60-70% barrier. The top models cluster in the 60-70% range on Verified (independently measured). The remaining tasks likely require capabilities that current models lack: understanding complex cross-module interactions, reasoning about performance implications, or handling tasks where the issue description provides minimal guidance.

The Cost-Performance Tradeoff

What the leaderboard doesn't show is the economics. The Aider project tested DeepSeek V3.1 on 225 programming tasks and found it achieves a 71.6% pass rate, comparable to Claude Opus, at 1/68th the cost per task[9]. For enterprise teams running thousands of evaluations, this cost difference is transformative.

🔬 Research insight: A 68x cost reduction with roughly equivalent quality is a stronger signal than a 5% accuracy improvement at the same price. When evaluating AI coding tools, always ask about cost-per-solved-task, not just solve rate.
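Cost-per-solved-task is simply cost per attempt divided by solve rate, since you pay for failed attempts too. A quick sketch with hypothetical numbers:

```python
def cost_per_solved(cost_per_attempt: float, solve_rate: float) -> float:
    """Expected spend per successfully resolved task,
    assuming every task is attempted exactly once."""
    return cost_per_attempt / solve_rate


# Hypothetical comparison: cheaper-but-weaker vs. pricier-but-stronger
cheap = cost_per_solved(cost_per_attempt=0.10, solve_rate=0.40)   # $0.25 per solve
strong = cost_per_solved(cost_per_attempt=5.00, solve_rate=0.60)  # ~$8.33 per solve
```

On this metric, the cheaper agent wins by more than 30x despite solving fewer tasks.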

What This Means for Engineers

If you're evaluating AI coding tools for your team, SWE-bench scores give you a useful signal, but they're not the full picture. Here's a practical framework:

| SWE-bench correlates well with... | SWE-bench says less about... |
| --- | --- |
| Bug fixing in well-tested Python codebases | Performance in your specific tech stack |
| Feature additions with clear specifications | Handling of proprietary codebases or internal APIs |
| Code navigation in large repositories | Speed and cost efficiency for your use case |
| Following existing patterns and conventions | Quality of code explanations and documentation |

🎯 Production tip: Take SWE-bench scores as a first filter to narrow your shortlist, then run your own evaluation on a dataset of 20-50 real issues from your codebase. This gives you a signal that no public benchmark can provide, because it accounts for your specific language, frameworks, coding conventions, and the types of issues your team actually encounters.

When reading SWE-bench results, ask these questions:

  1. Which variant was used? Verified, Lite, and Full are different benchmarks with different difficulty curves.
  2. Was it independently evaluated? Self-reported scores run 10-20 percentage points higher than independent evaluations.
  3. What was the cost per task? The economic viability of an agent matters as much as its accuracy.
  4. What scaffolding was used? A bare model versus a purpose-built agent with search tools, test runners, and retry logic will give very different results from the same underlying LLM.

The benchmark is a valuable tool for tracking progress in AI coding capabilities. The gap between "resolves GitHub issues in popular Python repos" and "replaces a software engineer" remains vast. Use the scores as one data point among many when making tool adoption decisions.

Key Takeaways

  • SWE-bench is valuable because it evaluates real repository maintenance tasks, not toy coding prompts.
  • Variant, cost, and evaluation protocol matter as much as the headline solve rate.
  • Independent evaluations are often 10-20 points lower than self-reported results, so source quality matters.
  • The benchmark is strongest for well-tested Python bug-fixing workflows and weaker for broader engineering work.
  • The right workflow is benchmark shortlist first, then run your own internal eval on real issues from your codebase.

Looking Forward

The benchmarking space is evolving in several directions as of early 2026:

  • Multi-language expansion. The Python-only limitation is being addressed. Community efforts to build SWE-bench equivalents for JavaScript/TypeScript, Java, and Rust are underway. Early results suggest that model performance varies significantly across languages, with TypeScript results generally tracking close to Python while systems languages show larger drops.

  • Agentic benchmarks. New benchmarks like Terminal-Bench and τ²-Bench (Tau-Bench) evaluate agents in more realistic settings: interacting with APIs, managing stateful sessions, and handling ambiguous instructions that require clarification. These complement SWE-bench by testing the broader set of skills that production coding agents need.

  • The multimodal frontier. SWE-bench Multimodal adds visual artifacts (screenshots, chart renderings, UI mockups) to issue descriptions. This tests whether agents can reason about what code produces, not just what code is. This matters increasingly as frontend and full-stack AI coding becomes more practical.

āš ļø Warning: As SWE-bench issues have been publicly available for over two years and are widely discussed in AI research, the risk of test set contamination grows. Models trained after mid-2024 may have seen solutions to some benchmark tasks during pre-training. The community is exploring dynamic benchmark generation and held-out test sets to address this.

  • Cost-normalized evaluation. Independent evaluation efforts from NIST's Center for AI Standards and Innovation (CAISI) and platforms like Artificial Analysis now report both accuracy and per-task cost. This is arguably the most important trend: an agent that solves 55% of tasks at $0.05 per attempt can deliver more value than one solving 65% at $5.00 per attempt. As coding agents move from demos to production, cost efficiency is becoming a first-class evaluation metric alongside raw accuracy.

  • Real-time, streaming benchmarks. Rather than evaluating agents on historical issues, some groups are experimenting with live evaluation: agents receive newly filed issues in real time and must resolve them before human developers do. This eliminates contamination risk entirely and tests the agent under realistic time pressure. Early experiments suggest that agent performance drops 10-15% compared to the standard (offline, no time limit) setting.

The most honest reading of the current state: AI coding agents can now solve the majority of well-specified bugs in well-tested Python codebases. That's genuinely impressive, and commercially useful. But the remaining 30-40% of tasks, and the vast space of engineering work that SWE-bench doesn't cover at all, represent the harder frontier that the field is now confronting.

References

[1] Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
[2] Jimenez et al. SWE-bench Lite: A Filtered Subset for Practical Evaluation. 2024.
[3] OpenAI. SWE-bench Verified: Improving the Quality and Reliability of SWE-bench. 2025.
[4] Yang et al. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. 2024.
[5] Cognition Labs. Introducing Devin, the first AI software engineer. 2024.
[6] Xia et al. Agentless: Demystifying LLM-based Software Engineering Agents. 2024.
[7] Anthropic. Claude's performance on SWE-bench. 2025.
[8] NIST CAISI. Evaluation of DeepSeek AI Models. 2025.
[9] Aider LLM Leaderboards. Dev.to Community, 2025.