EasyEvaluationPython 3

Eval Failure Aggregator

Aggregate eval run rows into stable pass-rate and error diagnostics.

25m1 sample tests1 hidden tests

Summarize model evaluation runs into pass rate, failing cases, and top errors.

Requirements

  • Define summarize_runs(runs).
  • Each run is a dict with case_id, passed, optional category, and optional error.
  • Return a dict with:
    • total
    • passed
    • pass_rate
    • failing_cases
    • top_errors
  • failing_cases contains unique failed case IDs sorted alphabetically.
  • top_errors contains (error, count) pairs for failed rows with a non-empty error.
  • Sort top errors by count descending, then error alphabetically.
  • For no runs, pass_rate is 0.

Example

python
1runs = [ 2 {"case_id": "a", "passed": True}, 3 {"case_id": "b", "passed": False, "error": "timeout"}, 4] 5summary = summarize_runs(runs) 6assert summary["pass_rate"] == 0.5 7assert summary["failing_cases"] == ["b"]

Constraints

  • Do not mutate input runs.
  • Count each run toward total, even repeated case IDs.
  • Count unique case IDs only in failing_cases.

Editor
Results
Run sample tests or submit all tests.