Medium · Retrieval · Python 3

RAG Chunk Selector

Rank retrieval chunks while deduplicating repeated text and limiting per-document dominance.

30m · 3 sample tests · 5 hidden tests

Select retrieval chunks while avoiding duplicate text and keeping any single document from dominating the results.

Requirements

Define select_chunks(chunks, k, max_per_doc=2).
Each chunk is a dict with id, doc_id, text, and score.
Return the IDs of the selected chunks, at most k of them, in ranked order.
Higher score wins.
Ties sort by id alphabetically.
Skip a chunk if its normalized text matches that of an already selected chunk.
Normalized text is the text lowercased, with each run of whitespace collapsed.
Select at most max_per_doc chunks per document.
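
The requirements above can be sketched as a single greedy pass over the ranked chunks. This is one possible reading, not a reference solution. It assumes that k caps the length of the result, that IDs come back in ranked order, that normalization also trims leading and trailing whitespace, and that a chunk skipped because of the per-document cap does not count as "selected" for deduplication. The helper name normalize is hypothetical, and the type hints assume Python 3.9 or later.

```python
from typing import Any


def normalize(text: str) -> str:
    # Lowercase, then collapse every whitespace run to a single space.
    # str.split() with no arguments also drops leading/trailing whitespace
    # (an assumption; the statement only says "whitespace collapsed").
    return " ".join(text.lower().split())


def select_chunks(
    chunks: list[dict[str, Any]], k: int, max_per_doc: int = 2
) -> list[str]:
    # sorted() returns a new list, so the caller's list is left untouched.
    # Sort by score descending, then by id ascending to break ties.
    ranked = sorted(chunks, key=lambda c: (-c["score"], c["id"]))

    seen_texts: set[str] = set()   # normalized texts already selected
    per_doc: dict[str, int] = {}   # doc_id -> number of selected chunks
    selected: list[str] = []

    for chunk in ranked:
        if len(selected) >= k:
            break
        key = normalize(chunk["text"])
        if key in seen_texts:
            continue  # duplicate of an already selected text
        if per_doc.get(chunk["doc_id"], 0) >= max_per_doc:
            continue  # this document has reached its cap
        seen_texts.add(key)
        per_doc[chunk["doc_id"]] = per_doc.get(chunk["doc_id"], 0) + 1
        selected.append(chunk["id"])

    return selected
```

Ranking before filtering means a duplicate or over-cap chunk never displaces a higher-scoring one, and the (-score, id) sort key keeps the ordering deterministic.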

Example

chunks = [
    {"id": "a", "doc_id": "d1", "text": "Hello world", "score": 0.9},
    {"id": "b", "doc_id": "d2", "text": "hello   world", "score": 0.8},
]
assert select_chunks(chunks, 2) == ["a"]

Constraints

Don't mutate input chunks.
Use deterministic ordering.
Empty or whitespace-only text can still be selected once.
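
One way to read the last constraint: if normalization also trims leading and trailing whitespace (an assumption, since the statement only says "collapsed"), then empty and whitespace-only texts all reduce to the same key. The first such chunk can be selected, and later ones are treated as duplicates. The helper name normalize below is hypothetical.

```python
def normalize(text: str) -> str:
    # Lowercase, collapse whitespace runs, and trim both ends.
    return " ".join(text.lower().split())


blank_variants = ["", "   ", "\n\t "]
keys = {normalize(t) for t in blank_variants}
print(keys)  # {''}
```

Without trimming, "   " would collapse to a single space and no longer match "", so two blank chunks could both be selected.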
