Commit Graph

2 Commits

Author SHA1 Message Date
primal
091fa8490b Filter text/html extraction by feed-like URL patterns
Reduces from ~2B URLs to ~2-3M by filtering for URLs containing:
rss, feed, atom, xml, syndication, frontpage, newest, etc.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:30:40 -05:00
primal
61ca7a4c7a Add html feed type for content-sniffed feeds
- New 'html' type for feeds served with text/html MIME
- feed_check content-sniffs html feeds and updates type to rss/atom/json
- If content-sniff returns unknown, marks feed as IGNORE
- Added cmd/extract-html tool to query local parquet files for text/html

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:18:26 -05:00