crawler/cmd/extract-html
primal 091fa8490b Filter text/html extraction by feed-like URL patterns
Reduces the extraction set from ~2B URLs to ~2-3M by keeping only URLs that
contain one of: rss, feed, atom, xml, syndication, frontpage, newest, etc.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:30:40 -05:00
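
The filter described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code in crawler/cmd/extract-html: the names feedPatterns and isFeedLike are hypothetical, the pattern list covers only the substrings the commit spells out (it ends in "etc.", so the real list is longer), and case-insensitive substring matching is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// feedPatterns holds only the substrings named in the commit message.
// The commit's list ends with "etc.", so the actual list is not known here.
var feedPatterns = []string{
	"rss", "feed", "atom", "xml", "syndication", "frontpage", "newest",
}

// isFeedLike reports whether rawURL contains any of the patterns.
// Lowercasing before matching is an assumption, not confirmed by the source.
func isFeedLike(rawURL string) bool {
	lower := strings.ToLower(rawURL)
	for _, p := range feedPatterns {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	// Example URLs are made up for illustration.
	urls := []string{
		"https://example.com/blog/feed/",
		"https://example.com/about",
		"https://example.com/index.RSS",
		"https://example.com/products/42",
	}
	for _, u := range urls {
		fmt.Printf("%-40s %v\n", u, isFeedLike(u))
	}
}
```

Plain substring matching is cheap enough to run over billions of URLs, but it over-matches: a URL containing "anatomy" passes because of "atom". That trade-off is consistent with a coarse pre-filter whose output is checked later, though the source does not say how false positives are handled.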