Medium · Graph Traversal · Python 3

Same-host Crawler

Traverse a web graph in breadth-first order while enforcing same-host and duplicate-visit invariants.

30m · 3 sample tests · 5 hidden tests

Implement crawl(start_url, get_links) for a small web graph. The crawler starts from start_url, calls get_links(url) to discover each page's outgoing links, and returns the list of visited URLs, all of which must be on the same host as start_url.

Requirements

Return URLs in deterministic breadth-first order.
Don't visit the same URL twice.
Resolve relative links against the current URL.
Ignore malformed URLs, non-HTTP(S) URLs, and off-host URLs.
Keep the implementation single-threaded for the base problem.
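The link-resolution and filtering requirements above can be sketched as a single helper. This is a minimal sketch, not a reference solution: the name resolve_same_host and its parameters are hypothetical, and it assumes "same host" means an equal hostname (ports and fragments are not given special treatment, since the problem statement does not mention them).

```python
from urllib.parse import urljoin, urlsplit


def resolve_same_host(current_url, link, allowed_host):
    """Return an absolute same-host HTTP(S) URL for `link`, or None to skip it.

    Hypothetical helper: `allowed_host` is assumed to be the lowercase
    hostname of the start URL, e.g. urlsplit(start_url).hostname.
    """
    if not isinstance(link, str):
        return None  # treat non-string entries as malformed
    try:
        # Resolve relative links ("/a", "c", "../x") against the current URL.
        absolute = urljoin(current_url, link)
        parts = urlsplit(absolute)
        host = parts.hostname
    except ValueError:
        return None  # urllib rejects some malformed URLs, e.g. "http://[bad"
    if parts.scheme not in ("http", "https"):
        return None  # mailto:, javascript:, ftp:, ...
    if host != allowed_host:
        return None  # off-host link
    return absolute
```

Returning None for every rejected link keeps the caller's loop to one check per link, whatever the reason for rejection.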

Example

graph = {
    "https://docs.example.com/": ["/a", "/b", "https://other.example.com/x"],
    "https://docs.example.com/a": ["/c"],
    "https://docs.example.com/b": ["/c"],
}

def get_links(url):
    return graph.get(url, [])

assert crawl("https://docs.example.com/", get_links) == [
    "https://docs.example.com/",
    "https://docs.example.com/a",
    "https://docs.example.com/b",
    "https://docs.example.com/c",
]
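Beyond comparing against one expected list, the same-host and duplicate-visit invariants can be checked directly on any output. The checker below is a hedged sketch for local testing; check_invariants is a hypothetical name, and it makes no claim about how the hidden tests are written.

```python
from urllib.parse import urlsplit


def check_invariants(result, start_url):
    """Return a list of violation messages; an empty list means all checks passed."""
    problems = []
    allowed_host = urlsplit(start_url).hostname
    # Breadth-first order always begins with the start URL itself.
    if not result or result[0] != start_url:
        problems.append("first URL must be the start URL")
    # Duplicate-visit invariant: every URL appears at most once.
    if len(result) != len(set(result)):
        problems.append("duplicate URL in output")
    # Same-host invariant: only HTTP(S) URLs on the start URL's host.
    for url in result:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            problems.append(f"non-HTTP(S) URL: {url}")
        elif parts.hostname != allowed_host:
            problems.append(f"off-host URL: {url}")
    return problems
```

Calling check_invariants(crawl(start, get_links), start) on a graph with cycles or off-host links gives a quick signal before submitting.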

Constraints

Assume get_links is synchronous.
Use standard-library Python only.
Treat the start URL host as the only allowed host.
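Under these constraints, one possible shape for crawl is a queue-based breadth-first search using only the standard library. This is a sketch under stated assumptions rather than the official solution: it assumes links are followed in the order get_links returns them, that "same host" means an equal hostname, and that URLs are compared as resolved strings with no further normalization.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def crawl(start_url, get_links):
    """Breadth-first, single-threaded crawl restricted to the start URL's host."""
    allowed_host = urlsplit(start_url).hostname
    visited = {start_url}        # URLs already queued, so nothing is visited twice
    order = []                   # URLs in the order they are visited
    queue = deque([start_url])   # FIFO queue gives breadth-first order

    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if not isinstance(link, str):
                continue  # assumption: non-string entries count as malformed
            try:
                absolute = urljoin(url, link)  # resolve relative links
                parts = urlsplit(absolute)
                host = parts.hostname
            except ValueError:
                continue  # malformed URL rejected by urllib
            if parts.scheme not in ("http", "https") or host != allowed_host:
                continue  # non-HTTP(S) or off-host
            if absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)
    return order
```

Marking a URL as visited when it is queued, rather than when it is dequeued, is what keeps a page linked from two parents (such as /c in the example) from being queued twice.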
