Reduces the candidate set from ~2B URLs to ~2-3M by keeping only URLs that contain:
rss, feed, atom, xml, syndication, frontpage, newest, etc.
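
A minimal sketch of the keyword filter described above, not the actual
implementation. The names looksLikeFeedURL and feedKeywords are hypothetical,
the keyword list covers only the terms named here (the "etc." is not filled
in), and case-insensitive substring matching is an assumption:

```go
package main

import "strings"

// feedKeywords holds only the keywords named in the commit message.
// The real list is longer ("etc.") and is not reproduced here.
var feedKeywords = []string{
	"rss", "feed", "atom", "xml", "syndication", "frontpage", "newest",
}

// looksLikeFeedURL reports whether rawURL contains any feed keyword.
// Matching on the lowercased URL is an assumption of this sketch.
func looksLikeFeedURL(rawURL string) bool {
	lower := strings.ToLower(rawURL)
	for _, kw := range feedKeywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}
```

A plain substring match is cheap enough to run over billions of URLs but will
also keep false positives (for example any path containing "xml"), which fits
the ~2-3M result being a candidate set rather than a list of confirmed feeds.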
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- New 'html' type for feeds served with a text/html MIME type
- feed_check content-sniffs html feeds and updates type to rss/atom/json
- If content-sniff returns unknown, marks feed as IGNORE
- Added cmd/extract-html tool to query local parquet files for the text/html MIME type
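
The content-sniffing step for 'html' feeds could look roughly like the sketch
below. This is an illustration under stated assumptions, not the code in
feed_check: sniffFeedType, resolveHTMLFeed, the 4096-byte sniff window, and
the marker strings are all hypothetical.

```go
package main

import "bytes"

// sniffLimit caps how much of the body is inspected (assumed value).
const sniffLimit = 4096

// sniffFeedType guesses the feed format from the start of the body.
// It returns "rss", "atom", "json", or "unknown".
func sniffFeedType(body []byte) string {
	head := body
	if len(head) > sniffLimit {
		head = head[:sniffLimit]
	}
	head = bytes.ToLower(bytes.TrimSpace(head))

	switch {
	// JSON Feed documents carry a version URL on jsonfeed.org.
	case bytes.HasPrefix(head, []byte("{")) &&
		bytes.Contains(head, []byte("jsonfeed.org/version")):
		return "json"
	// RSS 2.0 uses an <rss> root element; RSS 1.0 uses <rdf:RDF>.
	case bytes.Contains(head, []byte("<rss")) ||
		bytes.Contains(head, []byte("<rdf:rdf")):
		return "rss"
	// Atom uses a <feed> root element.
	case bytes.Contains(head, []byte("<feed")):
		return "atom"
	default:
		return "unknown"
	}
}

// resolveHTMLFeed returns the updated type for an 'html' feed and whether
// the feed should be marked IGNORE because sniffing found nothing.
func resolveHTMLFeed(body []byte) (feedType string, ignore bool) {
	t := sniffFeedType(body)
	if t == "unknown" {
		return "html", true
	}
	return t, false
}
```

Under these assumptions, a body starting with `<rss version="2.0">` resolves
to type rss, while an ordinary HTML page keeps type html and is flagged to be
ignored.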
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>