MediumGraph TraversalPython 3

Same-host Crawler

Traverse a web graph in breadth-first order while enforcing same-host and duplicate-visit invariants.

30m1 sample tests1 hidden tests

Same-host Crawler

Implement crawl(start_url, get_links) for a small web graph. The crawler starts from start_url, calls get_links(url) to discover outgoing links, and returns only URLs on the same host.

Requirements

  • Return URLs in deterministic breadth-first order.
  • Do not visit the same URL twice.
  • Resolve relative links against the current URL.
  • Ignore malformed URLs, non-HTTP(S) URLs, and off-host URLs.
  • Keep the implementation single-threaded for the base problem.

Example

python
1graph = { 2 "https://docs.example.com/": ["/a", "/b", "https://other.example.com/x"], 3 "https://docs.example.com/a": ["/c"], 4 "https://docs.example.com/b": ["/c"], 5} 6 7def get_links(url): 8 return graph.get(url, []) 9 10assert crawl("https://docs.example.com/", get_links) == [ 11 "https://docs.example.com/", 12 "https://docs.example.com/a", 13 "https://docs.example.com/b", 14 "https://docs.example.com/c", 15]

Constraints

  • Assume get_links is synchronous.
  • Use standard-library Python only.
  • Treat the start URL host as the only allowed host.

Editor
1
2
3
4
5
6
Sample Tests
filters off-host links and keeps BFS order
from solution import crawl

graph = {
    "https://docs.example.com/": ["/a", "https://docs.example.com/b", "https://other.example.com/x"],
    "https://docs.example.com/a": ["/c"],
    "https://docs.example.com/b": ["/c"],
    "https://docs.example.com/c": [],
}

def get_links(url):
    return graph.get(url, [])

assert crawl("https://docs.example.com/", get_links) == [
    "https://docs.example.com/",
    "https://docs.example.com/a",
    "https://docs.example.com/b",
    "https://docs.example.com/c",
]
Results
Run sample tests or submit all tests.