Batch processes existing item descriptions to strip HTML tags,
decode HTML entities, and truncate to 300 characters. Processes
in batches of 1000 with progress output.
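
A minimal sketch of the cleaning step described above, assuming tags are removed with a simple regular expression and the 300-character limit counts runes rather than bytes. The names `cleanDescription`, `tagPattern`, and `maxDescriptionLen`, and the whitespace collapsing, are illustrative assumptions and not the project's actual code.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// tagPattern matches anything that looks like an HTML tag.
// Assumption: a regex is good enough here; a real HTML tokenizer would be stricter.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// maxDescriptionLen is the 300-character limit from the commit message.
const maxDescriptionLen = 300

// cleanDescription strips tags, decodes entities, collapses whitespace
// (an assumption), and truncates to maxDescriptionLen runes.
func cleanDescription(raw string) string {
	text := tagPattern.ReplaceAllString(raw, " ")
	text = html.UnescapeString(text)
	text = strings.Join(strings.Fields(text), " ")
	runes := []rune(text)
	if len(runes) > maxDescriptionLen {
		text = string(runes[:maxDescriptionLen])
	}
	return text
}

func main() {
	fmt.Println(cleanDescription("<p>Tom &amp; Jerry</p>"))
}
```

Tags are stripped before entities are decoded, matching the order in the message, so escaped markup such as `&lt;b&gt;` survives as literal text instead of being removed.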
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Clean descriptions when parsing feeds rather than at publish time.
Descriptions are now stored as plain text, max 300 chars.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The enclosure length was never used when publishing to the PDS.
Added migration to drop the column.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This field was never populated by any feed parser (RSS, Atom, JSON Feed)
and was always NULL. Added migration to drop the column.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove SiteURL from Feed struct
- Remove site_url from all SQL queries and scans
- Remove SiteURL parsing from RSS/Atom/JSON feed parsers
- Add migration to drop site_url column
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove classifyFeed and classifyFeedByTitle functions
- Remove Category from Feed struct
- Remove category from all SQL queries and scans
- Add migration to drop category column from database
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove domains table from schema
- Add migration to DROP TABLE domains
- Remove domain.go (Domain struct and related functions)
- Update stats output to only show feed count
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove domain_host and domain_tld columns from feeds schema
- Add migrations to drop columns and related index/FK constraint
- Update all feed queries and structs to not include these columns
- Use URL pattern search instead of domain columns for GetFeedsByHost
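
One way the URL pattern search could look, as a sketch only. It assumes feed URLs are stored with a scheme in a `url` column and that the query uses `LIKE`. The helper names `escapeLike` and `feedURLPattern` and the query text are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// feedsByHostQuery is a hypothetical query; the column name is an assumption.
const feedsByHostQuery = `SELECT url FROM feeds WHERE url LIKE $1 ESCAPE '\'`

// escapeLike escapes the LIKE wildcards so a host is matched literally.
func escapeLike(s string) string {
	return strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`).Replace(s)
}

// feedURLPattern builds a LIKE pattern matching any URL on the given host.
func feedURLPattern(host string) string {
	return "%://" + escapeLike(strings.ToLower(host)) + "/%"
}

func main() {
	fmt.Println(feedsByHostQuery)
	fmt.Println(feedURLPattern("npr.org"))
}
```

A pattern of this shape does not match subdomains or URLs without a path, so a real implementation may need additional patterns.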
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Drop columns from schema, add migration
- Remove date stats calculation from parsers
- Update all feed queries to exclude these columns
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Drop last_build_date from schema and add migration
- Remove parsing of RSS lastBuildDate and Atom updated date
- Update all feed SQL queries to exclude the column
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove DiscoveredAt field from Feed and Item structs
- Remove from all SQL queries
- Remove from schema definitions
- Add migrations to drop the columns
- Remove unused 'now' variable declarations
The column wasn't providing value - all feeds had the same timestamp
from bulk import, and items weren't using it for any logic.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove SourceURL field from Feed struct
- Remove from all SQL queries
- Remove from schema definition
- Add migration to drop the column
The column was never populated (0 entries) and the feature
to track where feeds were discovered from was never implemented.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove NextCheckAt field from Feed struct
- Remove from all SQL queries (saveFeed, getFeed, scanFeeds, etc.)
- Remove from schema definition
- Add migration to drop the column if it exists
- Update calculateNextCheck to use MissCount field
- Update tld.go to use status column instead of feed_health
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Standalone tool that uses a pgx connection pool to import feeds from TSV.
Handles special characters in the password by using the key=value connection string format.
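
A sketch of how a key=value connection string can carry a password with special characters, assuming the libpq-style quoting rules (single quotes around the value, backslash-escaped quotes and backslashes). `quoteValue`, `buildDSN`, and the example credentials are hypothetical. The resulting string would then be handed to the pgx pool constructor.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteValue wraps a value in single quotes and escapes backslashes and quotes,
// so spaces and punctuation in the value are preserved.
func quoteValue(v string) string {
	return "'" + strings.NewReplacer(`\`, `\\`, `'`, `\'`).Replace(v) + "'"
}

// buildDSN assembles a keyword/value connection string.
func buildDSN(host string, port int, user, password, dbname string) string {
	return fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s",
		quoteValue(host), port, quoteValue(user), quoteValue(password), quoteValue(dbname))
}

func main() {
	// Example values are made up for illustration.
	fmt.Println(buildDSN("localhost", 5432, "crawler", `p@ss w'rd`, "feeds"))
}
```

Unlike the URL form, this format needs no percent-encoding, so characters such as `@`, `/`, or `:` in the password cause no ambiguity.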
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reduces the candidate set from ~2B URLs to ~2-3M by keeping only URLs containing:
rss, feed, atom, xml, syndication, frontpage, newest, etc.
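
The filter can be sketched as a case-insensitive substring check. The keyword list below contains only the terms named above; the full list is not given in this message, and `looksLikeFeedURL` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// feedKeywords holds only the keywords listed in the commit message;
// the real list is longer ("etc.") and is not reproduced here.
var feedKeywords = []string{"rss", "feed", "atom", "xml", "syndication", "frontpage", "newest"}

// looksLikeFeedURL reports whether the URL contains any feed keyword.
func looksLikeFeedURL(url string) bool {
	lower := strings.ToLower(url)
	for _, kw := range feedKeywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksLikeFeedURL("https://example.com/rss.xml"))
	fmt.Println(looksLikeFeedURL("https://example.com/about"))
}
```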
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- New 'html' type for feeds served with text/html MIME
- feed_check content-sniffs html feeds and updates type to rss/atom/json
- If content-sniff returns unknown, marks feed as IGNORE
- Added cmd/extract-html tool to query local parquet files for text/html
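
A rough sketch of content sniffing for feeds served as text/html. This is not the project's detectFeedType. `sniffFeedType`, the 1024-byte window, and the specific markers checked are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// sniffFeedType guesses the feed type from the start of the response body.
// It returns "rss", "atom", "json", or "unknown".
func sniffFeedType(body string) string {
	if len(body) > 1024 {
		body = body[:1024] // only inspect the head of the document
	}
	head := strings.ToLower(strings.TrimSpace(strings.TrimPrefix(body, "\uFEFF")))
	switch {
	case strings.HasPrefix(head, "{") && strings.Contains(head, "jsonfeed.org"):
		return "json"
	case strings.Contains(head, "<rss") || strings.Contains(head, "<rdf:rdf"):
		return "rss"
	case strings.Contains(head, "<feed"):
		return "atom"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(sniffFeedType(`<?xml version="1.0"?><rss version="2.0"></rss>`))
	fmt.Println(sniffFeedType(`<html><body>not a feed</body></html>`))
}
```

A caller following the behavior described above would update the stored type on a match and mark the feed as IGNORE when the result is "unknown".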
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Many sites (e.g., news.ycombinator.com) serve RSS with Content-Type: text/html.
These were being filtered out. Now we include text/html and rely on
detectFeedType content sniffing during feed_check to identify real feeds.
False positives will have type=unknown and increment MissCount but
will not cause harm beyond consuming check cycles.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename feeds.status -> feeds.feed_health in all SQL
- Rename Feed.Status -> Feed.FeedHealth in Go code
- Update index to use feed_health column
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add BulkImportFeedsFromTSV() for fast direct insertion from TSV
- Skip HTTP verification, insert feeds with status='hold'
- Check for pending TSV files on startup and auto-import
- Imported 4.7M feeds in ~26 minutes
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The infra-dns container has UDP connectivity issues reaching upstream DNS.
The system resolver works (verified with a wget test).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- StartDomainCheckLoop: DNS verification for unchecked domains (1000 workers)
- StartFeedCrawlLoop: Feed discovery on DNS-verified domains (100 workers)
This fixes starvation where 104M unchecked domains blocked 1.2M
DNS-verified domains from ever being crawled for feeds.
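
The split into two independent loops can be sketched with two worker pools, each draining its own queue, so a large backlog in one cannot starve the other. `runWorkers`, the channel names, and the tiny worker counts are illustrative assumptions. The real loops read from the database and use 1000 and 100 workers.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers starts the given number of goroutines that drain jobs,
// and returns once the channel is closed and all jobs are handled.
func runWorkers[T any](workers int, jobs <-chan T, handle func(T)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				handle(job)
			}
		}()
	}
	wg.Wait()
}

func main() {
	dnsJobs := make(chan string, 3)   // unchecked domains
	crawlJobs := make(chan string, 2) // DNS-verified domains
	for _, d := range []string{"a.example", "b.example", "c.example"} {
		dnsJobs <- d
	}
	for _, d := range []string{"x.example", "y.example"} {
		crawlJobs <- d
	}
	close(dnsJobs)
	close(crawlJobs)

	// Each loop has its own pool, so neither queue blocks the other.
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		runWorkers(2, dnsJobs, func(d string) { fmt.Println("dns check:", d) })
	}()
	go func() {
		defer wg.Done()
		runWorkers(1, crawlJobs, func(d string) { fmt.Println("feed crawl:", d) })
	}()
	wg.Wait()
}
```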
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Items now have a status column ('pass' or 'fail', default 'pass') to
control publishing eligibility. Includes migration for existing databases.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Publishing functionality has been moved to the standalone publish service.
Removed:
- publisher.go, pds_auth.go, pds_records.go, image.go, handle.go
- StartPublishLoop and related functions from crawler.go
- Publish loop invocation from main.go
Updated CLAUDE.md to reflect the new architecture.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add dpop_authserver_nonce, dpop_pds_nonce, pds_url, authserver_iss columns
- These columns are required by GetSession query but were missing from schema
- Add migrations to create columns on existing tables
- Add debug logging for OAuth flow troubleshooting
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update schema to create oauth_sessions instead of sessions
- Add migration to rename existing sessions table
- Add token_expiry column for OAuth library compatibility
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove ID field from Item struct
- Remove ID field from SearchItem struct
- Update all SQL queries to not select id column
- Change MarkItemPublished to use feedURL/guid instead of id
- Update shortener to use item_guid instead of item_id
- Add migration to convert item_id to item_guid in short_urls table
- Update API endpoints to use feedUrl/guid instead of itemId
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace source_host column with proper FK to domains table using
composite key (domain_host, domain_tld). This enables JOIN queries
instead of string concatenation for domain lookups.
Changes:
- Update Feed struct: SourceHost/TLD → DomainHost/DomainTLD
- Update all SQL queries to use domain_host/domain_tld columns
- Add column aliases (as source_host) for API backwards compatibility
- Update trigram index from source_host to domain_host
- Add getDomainHost() helper for extracting host from domain
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add shutdownCh channel to signal goroutines to stop
- Check IsShuttingDown() in all main loops
- Wait 2 seconds for goroutines to finish before closing DB
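
The shutdown steps above can be sketched as follows. `shutdownCh` and `IsShuttingDown` come from the message. The `Crawler` type, `NewCrawler`, `Shutdown`, `runLoop`, and the use of `sync.Once` are assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// gracePeriod is the 2-second wait mentioned in the commit message.
const gracePeriod = 2 * time.Second

// Crawler is a hypothetical holder for the shutdown channel.
type Crawler struct {
	shutdownCh   chan struct{}
	shutdownOnce sync.Once
}

func NewCrawler() *Crawler {
	return &Crawler{shutdownCh: make(chan struct{})}
}

// IsShuttingDown reports whether shutdown has been signalled, without blocking.
func (c *Crawler) IsShuttingDown() bool {
	select {
	case <-c.shutdownCh:
		return true
	default:
		return false
	}
}

// Shutdown closes shutdownCh exactly once; closing broadcasts to all loops.
func (c *Crawler) Shutdown() {
	c.shutdownOnce.Do(func() { close(c.shutdownCh) })
}

// runLoop stands in for a main loop that checks the flag on every iteration.
func (c *Crawler) runLoop(name string) {
	for !c.IsShuttingDown() {
		time.Sleep(10 * time.Millisecond) // placeholder for one unit of work
	}
	fmt.Println(name, "stopped")
}

func main() {
	c := NewCrawler()
	go c.runLoop("feed check")
	time.Sleep(50 * time.Millisecond)
	c.Shutdown()
	time.Sleep(gracePeriod) // give goroutines time to finish before closing the DB
	fmt.Println("closing database")
}
```

A fixed sleep is simple but does not guarantee that every goroutine has finished. A `sync.WaitGroup` with a timeout would be a stricter alternative.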
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When searching for "npr.org" and viewing the .org TLD, use the host part
("npr") for matching instead of the full pattern ("npr.org").
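
A small sketch of the matching rule, assuming the search term and the TLD are available as plain strings. `hostPartForTLD` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// hostPartForTLD strips the ".<tld>" suffix from the search term when present,
// so "npr.org" viewed under the "org" TLD matches on "npr".
func hostPartForTLD(query, tld string) string {
	q := strings.ToLower(strings.TrimSpace(query))
	suffix := "." + strings.ToLower(strings.TrimPrefix(tld, "."))
	return strings.TrimSuffix(q, suffix)
}

func main() {
	fmt.Println(hostPartForTLD("npr.org", "org"))
	fmt.Println(hostPartForTLD("npr.org", "com"))
}
```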
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>