MediumGraph TraversalPython 3
Same-host Crawler
Traverse a web graph in breadth-first order while enforcing same-host and duplicate-visit invariants.
30m1 sample tests1 hidden tests
Same-host Crawler
Implement crawl(start_url, get_links) for a small web graph. The crawler starts from start_url, calls get_links(url) to discover outgoing links, and returns only URLs on the same host.
Requirements
- Return URLs in deterministic breadth-first order.
- Do not visit the same URL twice.
- Resolve relative links against the current URL.
- Ignore malformed URLs, non-HTTP(S) URLs, and off-host URLs.
- Keep the implementation single-threaded for the base problem.
Example
python
1graph = {
2 "https://docs.example.com/": ["/a", "/b", "https://other.example.com/x"],
3 "https://docs.example.com/a": ["/c"],
4 "https://docs.example.com/b": ["/c"],
5}
6
7def get_links(url):
8 return graph.get(url, [])
9
10assert crawl("https://docs.example.com/", get_links) == [
11 "https://docs.example.com/",
12 "https://docs.example.com/a",
13 "https://docs.example.com/b",
14 "https://docs.example.com/c",
15]Constraints
- Assume
get_linksis synchronous. - Use standard-library Python only.
- Treat the start URL host as the only allowed host.
Editor
1 2 3 4 5 6
Sample Tests
filters off-host links and keeps BFS order
from solution import crawl
graph = {
"https://docs.example.com/": ["/a", "https://docs.example.com/b", "https://other.example.com/x"],
"https://docs.example.com/a": ["/c"],
"https://docs.example.com/b": ["/c"],
"https://docs.example.com/c": [],
}
def get_links(url):
return graph.get(url, [])
assert crawl("https://docs.example.com/", get_links) == [
"https://docs.example.com/",
"https://docs.example.com/a",
"https://docs.example.com/b",
"https://docs.example.com/c",
]Results
Run sample tests or submit all tests.