Medium · Parsing · Python 3
Longest-match Tokenizer
Tokenize text by greedily choosing the longest vocabulary token at each position.
35m · 1 sample test · 2 hidden tests
Implement tokenize(text, vocab) for a simplified tokenizer. At each character position, choose the longest token from vocab that is a prefix of the remaining text, then continue from the end of that token.
Requirements
- Return tokens in order.
- Ignore ASCII whitespace between tokens.
- Prefer longest match when multiple tokens match.
- Raise ValueError with the failing position when no token matches.
- Avoid repeated full-string slicing in the hot loop.
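One way to meet these requirements is a length-sorted scan, sketched below. It is a minimal sketch under stated assumptions, not a reference solution: "ASCII whitespace" is taken to mean the characters in string.whitespace, vocabulary tokens are assumed not to contain whitespace, and the wording of the ValueError message is made up here, since the task only says the failing position must be reported. The scan uses str.startswith with a start offset, which compares in place and so avoids building text[pos:] on every iteration.

```python
import string

# Assumption: "ASCII whitespace" means space, tab, newline, CR, VT and FF.
ASCII_WHITESPACE = frozenset(string.whitespace)


def tokenize(text, vocab):
    """Greedy longest-match tokenization using a length-sorted scan."""
    # Sort once, longest first, so the first hit at a position is the longest.
    candidates = sorted(vocab, key=len, reverse=True)
    tokens = []
    pos, n = 0, len(text)
    while pos < n:
        # Skip whitespace that sits between tokens.
        if text[pos] in ASCII_WHITESPACE:
            pos += 1
            continue
        for tok in candidates:
            # startswith with an offset checks in place; no text[pos:] copy.
            if text.startswith(tok, pos):
                tokens.append(tok)
                pos += len(tok)
                break
        else:
            # No candidate matched here. The message format is an assumption.
            raise ValueError(f"no token matches at position {pos}")
    return tokens
```

Each position may try every vocabulary entry, so the cost grows with the vocabulary size. That is acceptable for small inputs, but a trie scales better for large vocabularies.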
Example
python
vocab = {"the", "there", "for", "fore"}
assert tokenize("therefore", vocab) == ["there", "fore"]
Constraints
- vocab contains non-empty strings.
- Input text is small enough for a trie or length-sorted scan.
- Use standard-library Python only.
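The trie option mentioned in the constraints could look roughly like the sketch below. The names build_trie, tokenize_trie and the _END sentinel are hypothetical, chosen only for illustration. The whitespace set and the error message format are the same assumptions as for any solution here, since the task does not pin them down. The walk remembers the end of the last complete token it passed, so it can fall back to a shorter match when a longer path through the trie dead-ends.

```python
import string

# Assumption: "ASCII whitespace" means the characters in string.whitespace.
ASCII_WHITESPACE = frozenset(string.whitespace)
# Sentinel key marking "a token ends at this node". The empty string is safe
# because every real edge in the trie is a single character.
_END = ""


def build_trie(vocab):
    """Build a nested-dict trie with one level per character."""
    root = {}
    for tok in vocab:
        node = root
        for ch in tok:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def tokenize_trie(text, vocab):
    """Greedy longest-match tokenization by walking a trie from each position."""
    root = build_trie(vocab)
    tokens = []
    pos, n = 0, len(text)
    while pos < n:
        if text[pos] in ASCII_WHITESPACE:
            pos += 1
            continue
        node, i, last_end = root, pos, -1
        # Walk as far as the trie allows, recording the latest complete token.
        while i < n and text[i] in node:
            node = node[text[i]]
            i += 1
            if _END in node:
                last_end = i
        if last_end < 0:
            # The message format is an assumption, not part of the task.
            raise ValueError(f"no token matches at position {pos}")
        # Slice out only the matched token, never the whole remaining text.
        tokens.append(text[pos:last_end])
        pos = last_end
    return tokens
```

The work per position is bounded by the length of the longest token rather than by the number of vocabulary entries. The cost is building the trie up front.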
Sample Tests
uses longest match over shorter prefixes
from solution import tokenize
vocab = {"the", "there", "for", "fore"}
assert tokenize("therefore", vocab) == ["there", "fore"]