Crawl a WordPress site with a calendar plugin and suddenly you're hitting /events/2024/01/01, /events/2024/01/02, /events/2024/01/03... infinitely. Or take an e-commerce site with faceted search: ?color=red&size=m&sort=price. Thousands of "pages" that are really just filter combinations.
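To see how fast faceted search explodes, here's a tiny sketch with made-up facet values (the `/shop` path and the counts are illustrative, not from any real site):

```python
from itertools import product

# Illustrative numbers: even a modest set of facet values yields
# dozens of distinct URLs that all point at the same catalog.
colors = ["red", "blue", "green", "black"]
sizes = ["s", "m", "l", "xl"]
sorts = ["price", "name", "newest"]

urls = [f"/shop?color={c}&size={s}&sort={o}"
        for c, s, o in product(colors, sizes, sorts)]

print(len(urls))  # 4 * 4 * 3 = 48 filter combinations
```

Add a few more facets and the product runs into the thousands, with no new content behind any of them.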
What I learned
Coverage isn't the goal. Representative coverage is. I needed enough pages to understand the site's structure and quality, not an exhaustive inventory.
The rules I built
Four constraints keep the crawler productive without melting down:
- 30-page hard cap: enough to sample the site's patterns, not enough to run away
- Priority queue: nav links first, footer links second, body links last
- Exclusion patterns: calendar paths, ?page=N pagination, and faceted parameters are filtered out automatically
- Deduplication: same form appearing on 15 pages? I count it once
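The four rules above can be sketched together in a few dozen lines. This is a minimal illustration, not the original implementation: the `fetch` and `extract_links` callables, the regex patterns, and the priority values are all assumptions standing in for whatever the real crawler uses.

```python
import re

MAX_PAGES = 30  # hard cap: enough to sample patterns

# Exclusion patterns for known traps (illustrative regexes).
EXCLUDE = [
    re.compile(r"/events/\d{4}/\d{2}(/\d{2})?"),  # calendar plugin paths
    re.compile(r"[?&]page=\d+"),                  # ?page=N pagination
    re.compile(r"[?&](color|size|sort)="),        # faceted-search filters
]

# Lower value = crawled sooner: nav first, footer second, body last.
PRIORITY = {"nav": 0, "footer": 1, "body": 2}

def crawl(start_url, fetch, extract_links):
    """Sketch of the bounded crawl.

    fetch(url) -> page content; extract_links(page) -> [(url, region)],
    where region is "nav", "footer", or "body". Both are hypothetical
    hooks supplied by the caller.
    """
    queue = [(PRIORITY["nav"], start_url)]  # (priority, url)
    seen = set()
    crawled = []
    while queue and len(crawled) < MAX_PAGES:   # 30-page hard cap
        queue.sort()                  # queue stays tiny, so sorting is cheap
        _, url = queue.pop(0)
        if url in seen or any(p.search(url) for p in EXCLUDE):
            continue                  # dedup + trap filtering
        seen.add(url)
        page = fetch(url)
        crawled.append(url)
        for link, region in extract_links(page):
            queue.append((PRIORITY.get(region, 2), link))
    return crawled
```

URL-level dedup is shown here; deduplicating repeated content such as a form that appears on 15 pages would additionally fingerprint the extracted element (e.g. hash its fields) and count each fingerprint once.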
The tradeoff I accepted
I might miss a deeply nested page, but these limits prevent timeouts, crashes, and inconsistent data. Speed and reliability beat theoretical completeness.