Medium · Parsing · Python 3

Longest-match Tokenizer

Tokenize text by greedily choosing the longest vocabulary token at each position.

35m · 3 sample tests · 5 hidden tests

Implement tokenize(text, vocab) for a simplified tokenizer. At each position, after skipping any whitespace, choose the longest token from vocab that is a prefix of the remaining text.

Requirements

Return tokens in order.
Ignore ASCII whitespace between tokens.
Prefer longest match when multiple tokens match.
Raise ValueError with the failing position when no token matches.
Avoid repeated full-string slicing in the hot loop.
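
One possible way to meet these requirements is a trie walk, sketched below. This is an illustration under stated assumptions, not a reference solution: the helper build_trie and the sentinel _END are hypothetical names, "ASCII whitespace" is taken to mean the characters in string.whitespace, and whitespace is skipped only at the start of a token. The walk advances an index instead of slicing text, and slices once per emitted token.

```python
import string

_END = object()  # hypothetical sentinel key marking the end of a vocabulary token
_WHITESPACE = frozenset(string.whitespace)  # assumed definition of "ASCII whitespace"


def build_trie(vocab):
    """Build a nested-dict trie; a node holding _END completes a token."""
    root = {}
    for token in vocab:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def tokenize(text, vocab):
    trie = build_trie(vocab)
    tokens = []
    pos, n = 0, len(text)
    while pos < n:
        # Skip whitespace between tokens.
        if text[pos] in _WHITESPACE:
            pos += 1
            continue
        # Walk the trie by index, remembering the end of the longest match so far.
        node = trie
        best_end = -1
        i = pos
        while i < n:
            node = node.get(text[i])
            if node is None:
                break
            i += 1
            if _END in node:
                best_end = i
        if best_end < 0:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append(text[pos:best_end])  # one slice per emitted token
        pos = best_end
    return tokens
```

With this sketch, each character is visited at most once per trie walk, so the work at each start position is bounded by the length of the longest vocabulary token.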

Example

python

vocab = {"the", "there", "for", "fore"}
assert tokenize("therefore", vocab) == ["there", "fore"]

Constraints

vocab contains non-empty strings.
Input text is small enough for a trie or length-sorted scan.
Use standard-library Python only.
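
The length-sorted scan mentioned above could look like the following sketch. It assumes the function name tokenize_sorted (hypothetical, chosen to distinguish it from a trie-based tokenize) and again treats string.whitespace as the ASCII whitespace set. It relies on str.startswith(prefix, start), which compares in place at an offset and so avoids slicing text in the loop.

```python
import string

_WHITESPACE = frozenset(string.whitespace)  # assumed definition of "ASCII whitespace"


def tokenize_sorted(text, vocab):
    # Sort once, longest first, so the first hit is the longest match.
    candidates = sorted(vocab, key=len, reverse=True)
    tokens = []
    pos, n = 0, len(text)
    while pos < n:
        # Skip whitespace between tokens.
        if text[pos] in _WHITESPACE:
            pos += 1
            continue
        for token in candidates:
            # startswith with a start offset checks the match without slicing.
            if text.startswith(token, pos):
                tokens.append(token)
                pos += len(token)
                break
        else:
            # No candidate matched at this position.
            raise ValueError(f"no token matches at position {pos}")
    return tokens
```

This version tries every vocabulary entry at each position in the worst case, which is acceptable only for small inputs. Because matching is greedy and never backtracks, tokenize_sorted("abcd", {"ab", "abc", "cd"}) takes "abc" first and then raises ValueError at position 3, even though "ab" followed by "cd" would cover the text.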
