Batch processes existing item descriptions to strip HTML tags,
decode HTML entities, and truncate to 300 characters. Runs in
batches of 1000 with progress output.
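
A minimal sketch of the per-row cleanup, assuming a hypothetical
cleanDescription helper (the commit doesn't show the implementation;
the batching and progress loop around it is omitted here):

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// cleanDescription strips HTML tags with a small state machine,
// decodes entities via html.UnescapeString, and truncates to 300
// characters (runes, so multi-byte text isn't split mid-character).
// Tag stripping is naive: a '>' inside an attribute value would end
// the tag early. Good enough for a sketch.
func cleanDescription(s string) string {
	var b strings.Builder
	inTag := false
	for _, r := range s {
		switch {
		case r == '<':
			inTag = true
		case r == '>':
			inTag = false
		case !inTag:
			b.WriteRune(r)
		}
	}
	out := strings.TrimSpace(html.UnescapeString(b.String()))
	if runes := []rune(out); len(runes) > 300 {
		out = string(runes[:300])
	}
	return out
}

func main() {
	fmt.Println(cleanDescription("<p>Hello &amp; welcome</p>")) // Hello & welcome
}
```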
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove classifyFeed and classifyFeedByTitle functions
- Remove Category from Feed struct
- Remove category from all SQL queries and scans
- Add migration to drop the category column from the database (sketch below)
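
The drop might look like a plain ALTER TABLE run through pgx; a
sketch, assuming the table is named feeds and no dedicated migration
framework:

```go
package migrations

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// DropCategory removes the category column. IF EXISTS makes the
// migration idempotent, so re-running it against an already-migrated
// database is a no-op.
func DropCategory(ctx context.Context, pool *pgxpool.Pool) error {
	_, err := pool.Exec(ctx, `ALTER TABLE feeds DROP COLUMN IF EXISTS category`)
	return err
}
```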
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove domain_host and domain_tld columns from feeds schema
- Add migrations to drop columns and related index/FK constraint
- Update all feed queries and structs to not include these columns
- Use URL pattern search instead of domain columns for GetFeedsByHost (see the sketch after this list)
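
A sketch of the pattern-based lookup, with assumed table and column
names:

```go
package feeds

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// GetFeedsByHost falls back to a LIKE search on the stored URL now
// that domain_host is gone. Note: this exact pattern misses bare
// "https://host" URLs without a trailing slash and also matches
// subdomain-less substrings only; the real query may differ.
func GetFeedsByHost(ctx context.Context, pool *pgxpool.Pool, host string) ([]string, error) {
	rows, err := pool.Query(ctx,
		`SELECT url FROM feeds WHERE url LIKE $1`,
		"%://"+host+"/%")
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var urls []string
	for rows.Next() {
		var u string
		if err := rows.Scan(&u); err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	return urls, rows.Err()
}
```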
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove DiscoveredAt field from Feed and Item structs
- Remove the field from all SQL queries
- Remove it from schema definitions
- Add migrations to drop the columns
- Remove unused 'now' variable declarations
The column wasn't providing value: all feeds had the same timestamp
from the bulk import, and items weren't using it in any logic.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Standalone tool that uses a pgx connection pool to import feeds from TSV.
Handles special characters in the password via the key=value connection
string format.
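
A sketch of the connection handling, assuming hypothetical PG*
environment variables. The key=value DSN form avoids the
percent-encoding a URL-style DSN would require for the password; only
single quotes and backslashes need escaping:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/jackc/pgx/v5/pgxpool"
)

// quoteDSNValue wraps a value for the key=value DSN format:
// backslashes and single quotes are backslash-escaped, and the whole
// value is single-quoted so spaces and URL-hostile characters pass
// through literally.
func quoteDSNValue(v string) string {
	v = strings.ReplaceAll(v, `\`, `\\`)
	v = strings.ReplaceAll(v, `'`, `\'`)
	return "'" + v + "'"
}

func main() {
	ctx := context.Background()
	// Hypothetical configuration; the real tool's flags aren't shown.
	dsn := fmt.Sprintf("host=%s dbname=%s user=%s password=%s",
		os.Getenv("PGHOST"), os.Getenv("PGDATABASE"),
		os.Getenv("PGUSER"), quoteDSNValue(os.Getenv("PGPASSWORD")))
	pool, err := pgxpool.New(ctx, dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
	// ... read the TSV and insert the feeds ...
}
```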
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reduces the URL list from ~2B to ~2-3M by keeping only URLs containing:
rss, feed, atom, xml, syndication, frontpage, newest, etc.
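
A sketch of the filter as a streaming predicate; the keyword list is
taken from the message, and "etc." implies more terms than shown:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

var keywords = []string{"rss", "feed", "atom", "xml", "syndication", "frontpage", "newest"}

func looksLikeFeedURL(u string) bool {
	u = strings.ToLower(u)
	for _, k := range keywords {
		if strings.Contains(u, k) {
			return true
		}
	}
	return false
}

func main() {
	// Stream stdin to stdout, keeping only candidate feed URLs, so
	// the ~2B-line input never has to fit in memory.
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long lines
	for sc.Scan() {
		if looksLikeFeedURL(sc.Text()) {
			fmt.Println(sc.Text())
		}
	}
}
```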
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- New 'html' type for feeds served with a text/html MIME type
- feed_check content-sniffs html feeds and updates the type to rss/atom/json (sketch after this list)
- If the content sniff returns unknown, marks the feed as IGNORE
- Add cmd/extract-html tool to query local parquet files for text/html feeds
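
A sketch of the content sniff with a hypothetical sniffFeedType
helper; the actual feed_check logic isn't shown and likely handles
more edge cases (comments, DOCTYPEs) than this:

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffFeedType inspects the body of a feed served as text/html and
// guesses the real format from its root element. Returns "" when
// unknown, in which case the caller marks the feed IGNORE.
func sniffFeedType(body []byte) string {
	b := bytes.TrimSpace(body)
	// Skip an XML declaration like <?xml version="1.0"?> if present.
	if bytes.HasPrefix(b, []byte("<?xml")) {
		if i := bytes.IndexByte(b, '>'); i >= 0 {
			b = bytes.TrimSpace(b[i+1:])
		}
	}
	switch {
	case bytes.HasPrefix(b, []byte("<rss")) || bytes.HasPrefix(b, []byte("<rdf:RDF")):
		return "rss"
	case bytes.HasPrefix(b, []byte("<feed")):
		return "atom"
	case bytes.HasPrefix(b, []byte("{")):
		return "json" // JSON Feed documents start with an object
	default:
		return "" // unknown -> IGNORE
	}
}

func main() {
	fmt.Println(sniffFeedType([]byte(`<?xml version="1.0"?><rss version="2.0"></rss>`))) // rss
}
```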
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>