Medium · Parsing · Python 3

Longest-match Tokenizer

Tokenize text by greedily choosing the longest vocabulary token at each position.

35m · 1 sample test · 2 hidden tests

Implement tokenize(text, vocab) for a simplified tokenizer. Starting at the beginning of text, choose the longest token from vocab that is a prefix of the remaining text, emit it, and continue from the position just after it.

Requirements

  • Return tokens in order.
  • Ignore ASCII whitespace between tokens.
  • Prefer longest match when multiple tokens match.
  • Raise ValueError with the failing position when no token matches.
  • Avoid repeated full-string slicing in the hot loop.
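The requirements above can be sketched with a character trie. This is one possible approach under stated assumptions, not a reference solution: the helper `_build_trie`, the `_END` sentinel, the whitespace set, and the error-message wording are all illustrative choices that the problem statement does not specify. The sketch also assumes whitespace is skipped before each match attempt, so a vocabulary token that starts with whitespace would never match.

```python
# Sentinel key marking "a token ends here". The empty string cannot
# collide with a real edge because edges are single characters.
_END = ""

# Assumption: "ASCII whitespace" means space, tab, LF, CR, VT, and FF.
_ASCII_WHITESPACE = " \t\n\r\x0b\x0c"


def _build_trie(vocab):
    """Build a nested-dict trie from the vocabulary (hypothetical helper)."""
    root = {}
    for token in vocab:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def tokenize(text, vocab):
    trie = _build_trie(vocab)
    tokens = []
    i, n = 0, len(text)
    while i < n:
        # Skip whitespace between tokens.
        if text[i] in _ASCII_WHITESPACE:
            i += 1
            continue
        # Walk the trie by index, remembering the last position
        # at which a complete token ended.
        node = trie
        j = i
        best_end = -1
        while j < n:
            node = node.get(text[j])
            if node is None:
                break
            j += 1
            if _END in node:
                best_end = j
        if best_end < 0:
            raise ValueError(f"no token matches at position {i}")
        # One slice per emitted token, rather than one per candidate.
        tokens.append(text[i:best_end])
        i = best_end
    return tokens
```

Walking the trie by index means the only slice taken is the one that produces each emitted token. Note that the match is greedy with no backtracking: with the vocabulary from the example below, "thereof" yields "there" and then fails at position 5, even though no other split would succeed either.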

Example

python
vocab = {"the", "there", "for", "fore"}
assert tokenize("therefore", vocab) == ["there", "fore"]

Constraints

  • vocab contains non-empty strings.
  • Input text is small enough for a trie or length-sorted scan.
  • Use standard-library Python only.
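The length-sorted scan mentioned in the constraints can be sketched as follows. The function name `tokenize_by_length`, the whitespace set, and the error-message wording are assumptions made for illustration. The sketch relies on the standard `str.startswith(prefix, start)` form, which compares in place at an offset and so avoids slicing the remaining text for every candidate.

```python
# Assumption: "ASCII whitespace" means space, tab, LF, CR, VT, and FF.
_WS = " \t\n\r\x0b\x0c"


def tokenize_by_length(text, vocab):
    """Longest-match tokenizer using a length-sorted candidate list."""
    # Longest tokens first, so the first hit is the longest match.
    candidates = sorted(vocab, key=len, reverse=True)
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i] in _WS:
            i += 1
            continue
        for token in candidates:
            # startswith with a start offset does not copy the text.
            if text.startswith(token, i):
                tokens.append(token)
                i += len(token)
                break
        else:
            # The loop finished without a break: nothing matched here.
            raise ValueError(f"no token matches at position {i}")
    return tokens
```

This version checks every vocabulary entry at each position in the worst case, so it suits small vocabularies. A trie does work proportional to the length of the longest match instead, which tends to scale better as the vocabulary grows.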

Sample Tests
uses longest match over shorter prefixes
from solution import tokenize

vocab = {"the", "there", "for", "fore"}
assert tokenize("therefore", vocab) == ["there", "fore"]