Easy · Evaluation · Python 3
Eval Failure Aggregator
Aggregate eval run rows into stable pass-rate and error diagnostics.
25m · 1 sample test · 1 hidden test
Summarize model evaluation runs into pass rate, failing cases, and top errors.
Requirements
- Define `summarize_runs(runs)`.
- Each run is a dict with `case_id`, `passed`, an optional `category`, and an optional `error`.
- Return a dict with the keys `total`, `passed`, `pass_rate`, `failing_cases`, and `top_errors`.
- `failing_cases` contains the unique failed case IDs, sorted alphabetically.
- `top_errors` contains `(error, count)` pairs for failed rows with a non-empty error.
- Sort `top_errors` by count descending, then by error alphabetically.
- For no runs, `pass_rate` is `0`.
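The requirements above can be met with one pass to count passes and one pass over the failed rows. The following is a minimal sketch, not an official solution. It assumes `passed` is always present and boolean, treats a missing, `None`, or empty-string `error` as "no error", and returns `top_errors` as a list of `(error, count)` tuples. Using `collections.Counter` and a `(-count, error)` sort key are choices of this sketch, not part of the spec.

```python
from collections import Counter
from typing import Any


def summarize_runs(runs: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize eval runs into pass rate, failing cases, and top errors.

    Sketch assumptions: `passed` is a bool on every run, and a "non-empty
    error" is any truthy `error` value (missing, None, and "" are skipped).
    """
    total = len(runs)  # every row counts, even repeated case IDs
    passed = sum(1 for run in runs if run["passed"])
    # Guard against division by zero: the spec says pass_rate is 0 for no runs.
    pass_rate = passed / total if total else 0

    # Only read from `runs`; never sort or modify it in place.
    failed_rows = [run for run in runs if not run["passed"]]

    # A set drops repeated case IDs; sorted() gives alphabetical order.
    failing_cases = sorted({run["case_id"] for run in failed_rows})

    # Count errors only on failed rows that carry a non-empty error.
    error_counts = Counter(run["error"] for run in failed_rows if run.get("error"))
    # Count descending, then error text ascending, so ties are deterministic.
    top_errors = sorted(error_counts.items(), key=lambda item: (-item[1], item[0]))

    return {
        "total": total,
        "passed": passed,
        "pass_rate": pass_rate,
        "failing_cases": failing_cases,
        "top_errors": top_errors,
    }
```

For the two-run input `[{"case_id": "a", "passed": True}, {"case_id": "b", "passed": False, "error": "timeout"}]`, this sketch returns `total == 2`, `passed == 1`, `pass_rate == 0.5`, `failing_cases == ["b"]`, and `top_errors == [("timeout", 1)]`. Negating the count in the sort key lets a single ascending `sorted` call handle both ordering rules.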
Example
```python
runs = [
    {"case_id": "a", "passed": True},
    {"case_id": "b", "passed": False, "error": "timeout"},
]
summary = summarize_runs(runs)
assert summary["pass_rate"] == 0.5
assert summary["failing_cases"] == ["b"]
```

Constraints
- Do not mutate input runs.
- Count each run toward `total`, even when case IDs repeat.
- Deduplicate case IDs only in `failing_cases`.
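These constraints are easy to break silently, for example by sorting or popping from `runs` in place, or by deduplicating rows before computing `total`. A small checker can guard against that. The names `check_constraints` and `FIXTURE` are hypothetical and belong to this sketch only; it deep-copies a fixture with a repeated case ID, calls the function under test, and checks just the behaviours this section names.

```python
import copy
from typing import Any, Callable

# Hypothetical fixture: case "b" fails twice, so its ID repeats in the input.
FIXTURE: list[dict[str, Any]] = [
    {"case_id": "a", "passed": True},
    {"case_id": "b", "passed": False, "error": "timeout"},
    {"case_id": "b", "passed": False, "error": "timeout"},
]


def check_constraints(summarize: Callable[[list], dict]) -> bool:
    """Return True if `summarize` honours the Constraints section on FIXTURE."""
    runs = copy.deepcopy(FIXTURE)  # fresh copy so FIXTURE itself stays intact
    snapshot = copy.deepcopy(runs)  # reference state to detect mutation
    summary = summarize(runs)

    # Do not mutate input runs: the list and every nested row must be unchanged.
    if runs != snapshot:
        return False
    # Every run counts toward total, including both "b" rows.
    if summary.get("total") != len(FIXTURE):
        return False
    # failing_cases lists each failed case ID exactly once.
    if summary.get("failing_cases") != ["b"]:
        return False
    return True
```

With an implementation of `summarize_runs` in scope, `check_constraints(summarize_runs)` should return `True`. Comparing against a deep-copied snapshot catches changes to individual row dicts as well as changes to the list itself.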