📊 Benchmarks · 📐 Evaluation · 🧪 SWE-bench · 🤖 Agents · 🏊 Deep Dive

Understanding SWE-bench

SWE-bench has become the gold standard for measuring AI coding agents, but what does it test? We break down the benchmark methodology, its variants, scoring mechanics, and what the leaderboard results really mean for production engineering.

LeetLLM Team · February 17, 2026 · 18 min read

Imagine you're hiring a new software engineer. You wouldn't just ask them to write a sorting algorithm on a whiteboard; you'd want to see how they work within a massive legacy codebase, figure out where a bug is hiding, and write a fix without breaking anything else. That's exactly what SWE-bench (Software Engineering Benchmark) tries to do for AI coding agents. Every week, a new AI claims to have "solved X% of SWE-bench." The leaderboard has become the de facto scoreboard for the AI coding agent arms race. But beneath the headline numbers lies a surprisingly complex benchmark that most people misunderstand.

The core idea

SWE-bench evaluates whether an AI system can resolve real GitHub issues from popular open-source Python repositories [1]. Not toy problems, not LeetCode puzzles: actual bug reports and feature requests pulled from projects like Django, Flask, scikit-learn, sympy, and matplotlib.

Unlike synthetic benchmarks where the problem is mathematically defined, these tasks represent the messy reality of production software. The issues are written by users who might not fully understand the root cause, requiring the agent to interpret vague symptoms and translate them into code changes.

💡 Key insight: The shift from synthetic tasks to real-world issues is what makes SWE-bench so challenging. The agent must bridge the gap between a user's natural language complaint and the underlying codebase logic.

Each task in SWE-bench consists of:

  1. A GitHub issue description (the natural language specification)
  2. A codebase snapshot at the commit just before the fix was merged
  3. A test patch that verifies whether the fix is correct

The following Python code demonstrates how to load a specific task from the SWE-bench dataset. It takes the dataset name and split as inputs, and returns a dictionary containing the task details. We extract and print the unique task ID and the first 100 characters of the problem statement (the GitHub issue description) to show what the agent receives:

```python
from datasets import load_dataset

# Load a task from SWE-bench Lite
ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
task = ds[0]

print(f"Task ID: {task['instance_id']}")
# Output: django__django-11001

print(f"Problem Statement:\n{task['problem_statement'][:100]}...")
# Output: SQL compiler: distinct() and order_by() crash when...
```

The AI agent receives the issue description and the full repository. Its job is to produce a code patch that makes the failing tests pass without breaking existing ones.

Why SWE-bench is fundamentally different

Previous coding benchmarks like HumanEval [2] and MBPP [3] test whether a model can write a single function in isolation. SWE-bench tests something fundamentally different.

[Figure: Comparison of coding benchmarks: SWE-bench, HumanEval, MBPP, and LiveCodeBench across difficulty, real-world relevance, and evaluation method.]

Writing a sorting function is a different skill from exploring a 500-file Django codebase, identifying the three files that need changes, and producing a patch that resolves the issue without regressions.

In traditional benchmarks, the context is fully self-contained. The AI receives the exact inputs, outputs, and edge cases. In contrast, SWE-bench requires the agent to build its own context by searching through thousands of lines of code to understand the system architecture before writing a single line of implementation.

💡 Key insight: SWE-bench doesn't just test "can the model write code?" It tests "can the model engineer code?" The gap between those two questions is enormous.

| Dimension | HumanEval / MBPP | SWE-bench |
| --- | --- | --- |
| Scope | Single function | Full repository |
| Context | Self-contained spec | Issue description + codebase |
| Navigation | None | Must find relevant files |
| Testing | Simple I/O checks | Full test suite (existing + new) |
| Realism | Synthetic | Real-world pull requests |

The benchmark variants

The original SWE-bench dataset contains 2,294 task instances drawn from 12 popular Python repositories [1]. The distribution spans large, well-maintained projects: Django, scikit-learn, matplotlib, sympy, and others each contribute significant portions. Each task corresponds to an actual pull request that was merged in the repository's history, meaning the test patches are written by real developers solving real problems.

Over time, the community realized that not all tasks are equally useful for evaluation. Some tasks require obscure setup steps that make them impractical. Others have ambiguous issue descriptions where even a skilled human developer couldn't determine the correct fix. This led to several curated subsets, each designed for a specific evaluation goal.

These variations highlight an important challenge in evaluating AI models: balancing realism with reproducibility. While the full dataset offers a comprehensive view of real-world complexity, the curated subsets provide a cleaner, faster signal for model developers and engineering teams.

[Figure: SWE-bench variants: Full (2,294 tasks, unfiltered), Verified (500 tasks, human-validated), and Lite (300 tasks, auto-filtered), showing how curation reduces task count from the full dataset.]

The three variants differ primarily in their filtering criteria. SWE-bench Full includes every merged PR that has a clear test patch, regardless of how obscure the bug or how vague the issue description. SWE-bench Lite applies automatic filters to keep only self-contained bugs with clear specifications. SWE-bench Verified goes further, having each task manually reviewed by a professional engineer to confirm the issue is unambiguous and the tests reliably verify the fix.

SWE-bench Lite

A 300-task subset designed to be more practical for evaluation [4]. The filtering criteria:

  • Removes tasks requiring changes to test files
  • Removes tasks with overly complex setup requirements
  • Removes tasks where the issue description is too vague
  • Focuses on self-contained bugs that a human developer could reasonably debug from the issue description alone

SWE-bench Verified

A 500-task subset that was human-reviewed by professional software engineers to ensure quality and clarity [1]. Unlike the automatically-filtered Lite subset, Verified tasks were manually inspected to confirm that:

  • The issue description is clear and unambiguous
  • The unit tests reliably verify the fix (no flaky tests)
  • A skilled human developer could determine the correct solution from the description alone

This curation makes SWE-bench Verified the preferred benchmark for comparing leading agents. The leaderboard results you'll see reported from independent evaluators like NIST CAISI use this subset because its rigorous filtering produces more consistent, trustworthy scores.

SWE-bench Multimodal

SWE-bench Multimodal extends the benchmark to JavaScript, covering 617 task instances across 17 libraries spanning web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping [5]. Each task includes at least one image alongside the issue description, testing whether agents can reason about UI bugs, chart rendering issues, and visual regressions in addition to pure code logic.

Unlike standard SWE-bench where the bug is fully described in text, multimodal tasks require the agent to look at a rendered screenshot and figure out what code produced the visible problem. For example, an issue might say "the tooltip is misaligned with the hovered element" and attach a screenshot of the bug. The agent must process the image, trace it back to the relevant JavaScript or CSS code, and write a fix.

💡 Key insight: Visual bugs often lack clear error logs or stack traces. The agent must independently verify what the rendered output looks like to confirm if its code changes actually solved the problem. Top systems currently resolve only about 12% of multimodal tasks, showing how far agents still need to go on this dimension [5].

How scoring works

When an agent submits a patch, the evaluation harness applies it to the repository and runs two separate test suites:

  • FAIL_TO_PASS (FT): Tests added in the original pull request that verify the fix. These fail before the patch is applied and should pass after.
  • PASS_TO_PASS (PT): Existing tests that were already passing in the repository. These should remain passing to confirm no regressions.

A task is resolved only if both suites pass. FAIL_TO_PASS failing means the fix is incomplete. PASS_TO_PASS failing means the fix broke something else. There's no partial credit: one failing test in either suite is enough to fail the task.
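The resolution rule can be sketched as a simple predicate. This is a minimal illustration, not the real harness (which parses test-runner output inside Docker); the function and test names here are hypothetical:

```python
def is_resolved(fail_to_pass_results: dict, pass_to_pass_results: dict) -> bool:
    """A task counts as resolved only if every FAIL_TO_PASS test
    now passes AND every PASS_TO_PASS test still passes."""
    fix_verified = all(fail_to_pass_results.values())    # fix is complete
    no_regressions = all(pass_to_pass_results.values())  # nothing else broke
    return fix_verified and no_regressions

# One failing test in either suite fails the whole task: no partial credit.
ft = {"test_distinct_order_by": True}
pt = {"test_existing_query": True, "test_aggregation": False}
print(is_resolved(ft, pt))  # False: the patch introduced a regression
```

The `and` in the return statement is the entire scoring philosophy: a correct fix that breaks one unrelated test scores the same as no fix at all.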

[Figure: SWE-bench evaluation pipeline: an agent receives a GitHub issue, generates a patch, applies it to the repository, runs both the fix verification tests (FAIL_TO_PASS) and regression tests (PASS_TO_PASS), and receives a binary pass or fail result with no partial credit.]

The final score is the percentage of tasks resolved out of the total dataset. The Docker container ensures consistent package versions, Python interpreter, and environment configuration so results are reproducible.

The Pass@1 metric

The standard evaluation metric for SWE-bench is Pass@1: did the agent solve the task on its first attempt? This is a strict measure of accuracy without retries. Some evaluations report Pass@k (allowing k attempts), but Pass@1 is the gold standard because it reflects realistic scenarios where you want the agent to get it right the first time.

💡 Key insight: The gap between Pass@1 and Pass@k highlights how much agents rely on trial and error. An agent scoring 60% on Pass@5 might drop to 40% on Pass@1, revealing that a significant portion of its "success" comes from iteration rather than first-shot accuracy.
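The Pass@1 vs Pass@k relationship can be made concrete with the standard unbiased pass@k estimator introduced in the HumanEval paper [2]: given n sampled attempts per task of which c are correct, it estimates the probability that at least one of k attempts solves the task:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the probability that at least one of k sampled attempts is correct."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# A hypothetical agent that solves 2 of every 5 attempts:
print(round(pass_at_k(5, 2, 1), 2))  # Pass@1 = 0.4
print(round(pass_at_k(5, 2, 5), 2))  # Pass@5 = 1.0
```

The toy numbers show how dramatically retries can flatter a score: the same underlying agent reads as 40% or 100% depending on how many attempts the protocol allows.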

⚠️ Common mistake: The strictness of this scoring system is both its greatest strength and its primary limitation. On one hand, it guarantees that any reported "resolve" translates to a genuinely working fix in a production-like environment. On the other hand, it masks incremental progress. An agent might write 95% of the correct logic but fail due to a single missed edge case, resulting in the same score as an agent that hallucinated complete nonsense.

The "gold patch" problem

Each task has a "gold" patch: the actual fix that was merged in the original pull request. But the scoring system doesn't compare your patch to the gold patch (it evaluates your patch against the test suite). If your patch is completely different from the original but makes the same tests pass, it counts as resolved. This is by design: there are often multiple valid ways to fix a bug.

The complication is that the gold patch sometimes contains more than just the bug fix. It might include refactoring, style changes, or cleanup that are incidental to the core solution. If those incidental changes affect the code paths that the tests check, an agent with a correct bug fix but without the refactoring can still fail.

🔬 Research insight: This binary scoring has important implications. An agent that produces a patch fixing 90% of the failing tests but breaking one existing test gets the same score as an agent that produces garbage output: zero for that task. [1]

How top agents operate

The highest-performing agents on SWE-bench aren't just Large Language Models (LLMs) generating code. They are sophisticated multi-step systems that combine exploration, reasoning, and iteration. This agentic orchestration is critical because the context window of even the largest models can't effectively process an entire repository at once while maintaining the sharp focus needed to spot subtle bugs. By breaking the problem down into manageable phases, agents emulate the systematic approach of a human developer.

This multi-step approach also allows agents to correct course. If a generated patch fails the test suite, the agent can analyze the failure logs, revisit the source code, and try a different strategy, exactly as a human would.

The following diagram illustrates the typical workflow of these agents. It takes an issue as input and produces a verified patch as output. The agent operates in a continuous loop of code understanding and testing until a working solution is found:

[Diagram: the agent workflow, from GitHub issue to verified patch]

Step 1: Repository exploration

The agent explores the repository structure, identifying relevant directories, reading README files, and building a mental model of the codebase architecture.

Step 2: File localization

Using the issue description as a guide, the agent searches for relevant files. This might involve grep searches for error messages, following import chains, or using the project's test structure to find the right module.

Step 3: Code understanding

The agent reads and reasons about the relevant source code, understanding how the current implementation works and where the bug or missing feature is located.

🎯 Production tip: The file localization step is where most agents win or lose. An agent that picks the right files on the first try can solve the issue in a few iterations. One that starts in the wrong part of the codebase burns through its token budget exploring dead ends.

Step 4: Patch generation

Based on its understanding, the agent generates a diff patch. The best agents iterate multiple times, refining their patches based on test feedback.

Step 5: Verification

Before submitting, the agent runs the test suite to verify its patch. If tests fail, it loops back to refine the solution.
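The five steps above can be sketched as a bounded, test-driven loop. This is a simplified illustration under stated assumptions: the three callables stand in for whatever retrieval, LLM, and sandbox machinery a real agent uses, and are not part of any actual framework's API:

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    all_passed: bool
    error_log: str = ""

def solve_issue(issue, localize_files, generate_patch, run_tests, max_iterations=5):
    """Sketch of the localize -> patch -> verify loop.
    Helper callables are placeholders for real agent components."""
    files = localize_files(issue)                # Steps 1-3: find and read relevant code
    feedback = None
    for _ in range(max_iterations):              # Steps 4-5: iterate on test feedback
        patch = generate_patch(issue, files, feedback)
        result = run_tests(patch)
        if result.all_passed:
            return patch                         # verified fix: submit it
        feedback = result.error_log              # refine using the failure output
    return None                                  # budget exhausted, give up

# Toy demo: an "agent" whose second attempt passes the tests.
attempts = iter([TestResult(False, "AssertionError"), TestResult(True)])
patch = solve_issue(
    issue="tooltip misaligned",
    localize_files=lambda issue: ["ui/tooltip.py"],
    generate_patch=lambda issue, files, fb: f"fix v{'2' if fb else '1'}",
    run_tests=lambda p: next(attempts),
)
print(patch)  # "fix v2": the first attempt failed, and feedback drove a retry
```

The `max_iterations` cap mirrors the cost discipline described above: without it, a struggling agent burns tokens indefinitely on dead ends.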

SWE-agent [6] was one of the first purpose-built agents for this benchmark. It introduced the concept of a specialized Agent-Computer Interface (ACI) with custom shell commands designed for code traversal and editing. Devin [7] later demonstrated that giving an agent a full development environment (terminal, browser, editor) could push performance even higher.

The Agentless alternative

Not every approach needs a complex agent loop. The "Agentless" approach showed that a simple pipeline (localize files, generate a patch, then re-rank candidates) without any iterative tool use can achieve competitive results at a fraction of the cost [8]. Unlike traditional agentic loops that rely on the LLM to autonomously decide the next action or execute shell commands, this method strips away the overhead.

By hardcoding the workflow into discrete, deterministic phases, Agentless achieved a 32% solve rate on SWE-bench Lite at approximately $0.70 per task. This raises an important question about the benchmark: does SWE-bench reward genuine autonomous engineering skill, or does it primarily reward the ability to make a small, targeted fix when the problem context is already narrowed down by external scaffolding?
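Under stated assumptions (the phase callables here are hypothetical stand-ins; the real Agentless system uses LLM-based localization and repair prompts), the hardcoded three-phase structure looks like:

```python
def agentless_pipeline(issue, repo_files, localize, repair, rank, n_candidates=4):
    """Sketch of an Agentless-style fixed pipeline: no agent loop,
    no autonomous tool use, just three deterministic phases."""
    # Phase 1: localization - narrow the repo to a few suspect files
    suspect_files = localize(issue, repo_files)
    # Phase 2: repair - sample several candidate patches for those locations
    candidates = [repair(issue, suspect_files) for _ in range(n_candidates)]
    # Phase 3: re-ranking - pick the most promising candidate
    # (e.g. the one passing regression tests, or a majority vote)
    return rank(candidates)

# Toy demo with stand-in phases:
best = agentless_pipeline(
    issue="off-by-one in pagination",
    repo_files=["views.py", "models.py"],
    localize=lambda issue, files: files[:1],
    repair=lambda issue, files: f"patch touching {files[0]}",
    rank=lambda cands: max(cands, key=len),
)
print(best)
```

Because the control flow is fixed, every LLM call has a known purpose and a predictable cost, which is exactly what makes the per-task economics so much better than open-ended agent loops.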

💡 Key insight: The success of the Agentless approach suggests that for many SWE-bench tasks, the primary bottleneck is finding the right files, not the actual code generation. Once localized, even simpler pipelines can often write the correct fix.

This finding has profoundly influenced subsequent agent designs, demonstrating that much of the performance previously attributed to complex agentic reasoning was due to the underlying model's code generation capability when given the right context.

Modern agent architecture patterns (2025)

By late 2025, the most successful SWE-bench agents share several design choices:

  • File localization as a separate, optimized step. Rather than having the agent explore freely, dedicated retrieval models or code search indices narrow down relevant files before the main LLM ever sees the code.
  • Test-driven iteration. The agent writes a patch, runs the tests, reads the error output, and refines. The best agents cap this at 3-5 iterations to avoid cost blowup.
  • Smaller models for scaffolding. Many top systems use a cheaper model (like Gemini 3 Flash or MiniMax M2.5) for file exploration and a more capable model (like Claude Opus 4.6 or GPT-5.4) for the final patch generation. This keeps costs manageable while maintaining quality on the hardest step.

The leaderboard reality check

When you see a headline like "Agent X achieves 50% on SWE-bench," here's what you should consider:

| Factor | What to Watch For |
| --- | --- |
| Which variant? | Lite, Verified, and Full scores are NOT comparable |
| Cost per task | Top agents often make 50-100+ LLM calls, costing $1-5 per attempt |
| Submission method | Some evaluations let agents skip hard tasks, inflating pass rates |
| Contamination risk | Some benchmark issues may now appear in model training data |

It's also crucial to differentiate between general-purpose models and purpose-built agentic systems. A model evaluated "bare" (using just a simple prompt to generate a patch) will score much lower than the exact same model wrapped in a complex agent loop with access to tools, search, and a terminal.

The standard evaluation protocol requires attempting all tasks. If an evaluation report doesn't specify this, treat the numbers skeptically.

⚠️ Common mistake: Taking leaderboard numbers at face value. The cost and latency required to achieve top scores make many approaches impractical for real-world use. A 40% score at $0.10 per task might be more useful than a 60% score at $5.00 per task.

What SWE-bench doesn't test

Understanding the limitations is as important as understanding what the benchmark measures:

  • Greenfield development: SWE-bench is entirely about fixing bugs and adding features to existing codebases. It doesn't test the ability to architect a new system from scratch.

  • Ambiguous requirements: The "gold" solution is always the original developer's pull request. In practice, many bugs have multiple valid fixes, and the "best" fix depends on context that isn't captured in the issue description.

  • Long-running tasks: Most SWE-bench tasks can be solved with relatively small patches (median of roughly 20 lines changed). It doesn't test multi-day refactoring projects or large-scale migrations.

  • Collaboration: Real software engineering involves code review, discussion, and iteration with teammates. SWE-bench tests solo problem-solving in isolation.

  • Language diversity: The core benchmark is exclusively Python (the Multimodal variant adds JavaScript). Performance might not generalize to other languages, especially those with very different characteristics (statically typed, functional, systems languages).

  • Performance and scalability reasoning: The benchmark tests correctness (do the tests pass?), not whether the fix is performant, secure, or maintainable. An agent could introduce a correct but O(n²) solution where an O(n) fix exists, and the benchmark wouldn't penalize it.

  • Domain-specific codebases: The benchmark repositories (Django, scikit-learn, matplotlib) are well-maintained, well-documented open-source projects. Performance on a legacy enterprise codebase with sparse documentation and tangled dependencies would likely be significantly worse.

These missing elements highlight the difference between an AI coding assistant and an autonomous engineer. While an assistant can excel at targeted fixes, an engineer must orchestrate entire systems, negotiate ambiguous requirements, and ensure long-term maintainability.

⚠️ Common mistake: A high SWE-bench score doesn't mean an agent can replace a human software engineer. It proves the agent is skilled at a very specific subset of engineering tasks (self-contained bug fixes in well-tested Python codebases).

Understanding these blind spots helps calibrate expectations. A high SWE-bench score proves that a model has strong foundational reasoning for code exploration and logic patching, but it doesn't guarantee success in a chaotic, undocumented corporate environment where requirements change daily.

The Claude effect

Anthropic's Claude models have shown particularly strong performance on SWE-bench, which has sparked an interesting debate: is SWE-bench measuring general software engineering ability, or is it measuring something more specific about how well a model handles Python codebases with good test coverage?

Early evaluations of Claude Opus found that it resolved 66.7% of SWE-bench Verified tasks under independent evaluation conditions [9]. As both the models and agent scaffolding have improved, the leaderboard has climbed significantly. Current results using the open-source mini-SWE-agent scaffold show Claude Opus 4 models resolving 76.8% of SWE-bench Verified tasks, with Gemini 3 Flash (high reasoning) close behind at 75.8%.

The key variable is the agent scaffolding. The same model wrapped in a bare prompt gets a fraction of the score as the same model with a purpose-built exploration loop, test runner, and retry logic. This means that a model's "SWE-bench score" is really a joint score of the model's reasoning ability and the quality of the agent system built around it.

🎯 Production tip: When comparing SWE-bench results, always ask what scaffolding was used. The gap between a bare model and a well-tuned agent loop can be 30+ percentage points for the same underlying model.

SWE-bench success correlates strongly with "fixing well-tested Python code when given good tooling." That's a valuable skill, but it's a subset of what production software engineering looks like.

State of the art (early 2026)

The leaderboard has moved rapidly through 2025 and into 2026. Early 2024 evaluations showed Claude Opus resolving 66.7% of SWE-bench Verified tasks under independent conditions [9]. By early 2026, the top of the leaderboard sits around 76-77% on the same Verified subset, using the open-source mini-SWE-agent scaffold.

Looking across the range of reported scores, several patterns emerge:

  • Scaffolding dominates the score more than the base model. A mediocre model with a well-tuned agent loop often outperforms a superior model running bare. The agent framework handles file localization, test execution, and iterative refinement in ways that amplify the base model's strengths.

  • The gap between independent and self-reported results is real and persistent. Most models perform worse under independent evaluation than in provider benchmarks. Lab evaluations often use optimized prompting, higher token budgets, or purpose-built agent scaffolding. Always look for independent verification when comparing model claims.

  • Reasoning models show mixed results. Some purpose-built reasoning models score lower than general-purpose counterparts under independent evaluation. This suggests that SWE-bench tasks reward practical code exploration and pattern matching more than deep chain-of-thought reasoning. Knowing when to reason deeply and when to just apply the obvious fix is itself a skill.

The cost-performance tradeoff

What the leaderboard doesn't show is the economics. Independent evaluations show that the most capable model is not always the most cost-effective one. Efficient models often achieve competitive pass rates at a small fraction of the cost of frontier alternatives. For enterprise teams running thousands of evaluations, this cost difference is transformative.

High inference costs have historically gated the adoption of AI agents in continuous integration pipelines. If every automated PR review costs $5, it's difficult to justify deploying it at scale. However, when the cost drops to mere cents per task, it becomes feasible to attach an autonomous agent to every single GitHub issue or pull request without manual approval.

🔬 Research insight: A large cost reduction with roughly equivalent quality is often a stronger signal than a modest accuracy improvement at the same price. When evaluating AI coding tools, always ask about cost-per-solved-task, not just solve rate.
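Cost-per-solved-task is a one-line calculation. Using hypothetical figures like those cited earlier in the leaderboard discussion:

```python
def cost_per_solved_task(solve_rate: float, cost_per_attempt: float) -> float:
    """Expected dollars per successfully resolved task: every attempt
    costs money, but only solve_rate of the attempts pay off."""
    return cost_per_attempt / solve_rate

# Two hypothetical agents:
cheap = cost_per_solved_task(0.40, 0.10)   # 40% solve rate at $0.10/attempt
pricey = cost_per_solved_task(0.60, 5.00)  # 60% solve rate at $5.00/attempt
print(f"cheap agent:  ${cheap:.2f} per solved task")   # $0.25
print(f"pricey agent: ${pricey:.2f} per solved task")  # $8.33
```

On these toy numbers, the "weaker" agent is over 30x cheaper per resolved issue, which is why solve rate alone is a misleading basis for tool selection.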

What this means for engineers

If you're evaluating AI coding tools for your team, SWE-bench scores give you a useful signal, but they're not the full picture. Here's a practical framework:

| SWE-bench correlates well with... | SWE-bench says less about... |
| --- | --- |
| Bug fixing in well-tested Python codebases | Performance in your specific tech stack |
| Feature additions with clear specifications | Handling of proprietary codebases or internal APIs |
| Code exploration in large repositories | Speed and cost efficiency for your use case |
| Following existing patterns and conventions | Quality of code explanations and documentation |

🎯 Production tip: Take SWE-bench scores as a first filter to narrow your shortlist, then run your own evaluation on a dataset of 20-50 real issues from your codebase. This gives you a signal that no public benchmark can provide, because it accounts for your specific language, frameworks, coding conventions, and the types of issues your team encounters.

When reading SWE-bench results, ask these questions:

  1. Which variant was used? Verified, Lite, and Full are different benchmarks with different difficulty curves.
  2. Was it independently evaluated? Self-reported scores run 10-20 percentage points higher than independent evaluations.
  3. What was the cost per task? The economic viability of an agent matters as much as its accuracy.
  4. What scaffolding was used? A bare model versus a purpose-built agent with search tools, test runners, and retry logic will give very different results from the same underlying LLM.

The benchmark is a valuable tool for tracking progress in AI coding capabilities. The gap between "resolves GitHub issues in popular Python repos" and "replaces a software engineer" remains vast. Use the scores as one data point among many when making tool adoption decisions.

Summary

SWE-bench has fundamentally changed how we measure AI coding agents. By moving away from isolated, synthetic tasks and toward real-world repository maintenance, it provides a much more realistic assessment of an agent's ability to explore, understand, and modify complex codebases.

However, as we've seen, a high score on a leaderboard doesn't automatically translate to a drop-in replacement for a human engineer. The benchmark primarily tests a specific set of skills, largely focused on fixing localized bugs in well-documented, well-tested Python projects, and leaves out crucial aspects of software engineering like system design, collaboration, and performance optimization.

When evaluating AI coding tools for your own team, keep these core principles in mind:

  • SWE-bench is valuable because it evaluates real repository maintenance tasks, not toy coding prompts.
  • Variant, cost, and evaluation protocol matter as much as the headline solve rate.
  • Independent evaluations are often 10-20 points lower than self-reported results, so source quality matters.
  • The benchmark is strongest for well-tested Python bug-fixing workflows and weaker for broader engineering work.
  • The right workflow is benchmark shortlist first, then run your own internal eval on real issues from your codebase.

Looking ahead, the benchmarking space is evolving in several directions as of early 2026:

  • Multi-language expansion. The Python-only limitation is being addressed. Community efforts to build SWE-bench equivalents for JavaScript/TypeScript, Java, and Rust are underway. Early results suggest that model performance varies significantly across languages, with TypeScript results generally tracking close to Python while systems languages show larger drops.

  • Agentic benchmarks. New benchmarks like Tau-Bench [10] evaluate agents in more realistic settings: interacting with APIs, managing stateful sessions, and handling ambiguous instructions that require clarification. These complement SWE-bench by testing the broader set of skills that production coding agents need.

  • The multimodal frontier. SWE-bench Multimodal adds visual artifacts (screenshots, chart renderings, UI mockups) to issue descriptions. This tests whether agents can reason about what code produces, not just what code is. This matters increasingly as frontend and full-stack AI coding becomes more practical.

⚠️ Common mistake: As SWE-bench issues have been publicly available for over two years and are widely discussed in AI research, the risk of test set contamination grows. Models trained after mid-2024 might have seen solutions to some benchmark tasks during pre-training. The community is exploring dynamic benchmark generation and held-out test sets to address this.

  • Cost-normalized evaluation. Some independent evaluation efforts now report accuracy and per-task cost side by side. This is arguably the most important trend: an agent that solves 55% of tasks at $0.05 per attempt can deliver more value than one solving 65% at $5 per attempt. As coding agents move from demos to production, cost efficiency is becoming a first-class evaluation metric alongside raw accuracy.

  • Real-time, streaming benchmarks. Rather than evaluating agents on historical issues, some groups are experimenting with live evaluation: agents receive newly filed issues in real time and must resolve them before human developers do. This eliminates contamination risk entirely and tests the agent under realistic time pressure. Early experiments suggest that agent performance drops 10-15% compared to the standard (offline, no time limit) setting.

As these new evaluation frameworks mature, we expect to see a more detailed understanding of AI capabilities. Rather than a single monolithic leaderboard, engineering teams will likely rely on a dashboard of metrics tailored to their specific technology stack and workflows.

The most honest reading of the current state: AI coding agents can now solve the majority of well-specified bugs in well-tested Python codebases. That's genuinely impressive, and commercially useful. But the remaining 30-40% of tasks, and the vast space of engineering work that SWE-bench doesn't cover at all, represent the harder frontier that the field is now confronting.

References

[1] Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.

[2] Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv preprint.

[3] Austin, J., et al. (2021). Program Synthesis with Large Language Models. arXiv preprint.

[4] Jimenez, C. E., et al. (2024). SWE-bench Lite: A Filtered Subset for Practical Evaluation.

[5] Yang, J., et al. (2025). SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? ICLR 2025.

[6] Yang, J., et al. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.

[7] Cognition Labs (2024). Introducing Devin, the first AI software engineer.

[8] Xia, C. S., et al. (2024). Agentless: Demystifying LLM-based Software Engineering Agents.

[9] Anthropic (2025). Claude's performance on SWE-bench.

[10] Yao, S., et al. (2024). Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint.
