138 Commits

Author SHA1 Message Date
primal
d4a1928fa6 Increase feed check parallelism for 1Gbps bandwidth
- Workers: 1000 -> 4000
- Work channel buffer: 1000 -> 4000
- Fetch batch size: 1000 -> 4000
- MaxIdleConns: 100 -> 2000

Should improve throughput from ~15 feeds/sec to ~50-60 feeds/sec.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 12:11:58 -05:00
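The settings changed in d4a1928fa6 interact roughly as in the sketch below. It is a minimal illustration, not the repository's code: `newClient`, `runPool`, and the constant names are hypothetical, and only the four numeric values come from the commit message.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

// Values from the commit message; the constant names are hypothetical.
const (
	numWorkers   = 4000 // concurrent feed checkers
	workBuffer   = 4000 // capacity of the work channel
	maxIdleConns = 2000 // idle connections kept across all hosts
)

// newClient builds an HTTP client with a larger idle-connection pool.
// The timeouts are assumptions, not values from the commit.
func newClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:    maxIdleConns,
			IdleConnTimeout: 90 * time.Second,
		},
	}
}

// runPool fans urls out to `workers` goroutines over a buffered channel
// and returns how many checks completed without error.
func runPool(urls []string, workers, buffer int, check func(string) error) int64 {
	work := make(chan string, buffer)
	var ok int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range work {
				if check(u) == nil {
					atomic.AddInt64(&ok, 1)
				}
			}
		}()
	}
	for _, u := range urls {
		work <- u // blocks only when the buffer is full
	}
	close(work)
	wg.Wait()
	return ok
}

func main() {
	client := newClient()
	_ = client // a real check function would call client.Get(u)

	urls := make([]string, 10000)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://example.com/feed/%d", i)
	}
	n := runPool(urls, numWorkers, workBuffer, func(string) error { return nil })
	fmt.Println("checked:", n)
}
```

Sizing the channel buffer to match the worker count lets the producer stay one full batch ahead of the workers. The throughput figures in the commit message are the author's estimate and are not reproduced by this sketch.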
primal
1f90b7d6a0 Simplify launch script
Remove local cache busting logic, delegate to parent launch script.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 23:49:21 -05:00
primal
b515976fef Add clean-descriptions migration tool
Batch processes existing item descriptions to strip HTML tags,
decode HTML entities, and truncate to 300 characters. Processes
in batches of 1000 with progress output.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 23:43:58 -05:00
primal
253e04a749 Strip HTML and truncate descriptions on input
Clean descriptions when parsing feeds rather than at publish time.
Descriptions are now stored as plain text, max 300 chars.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:59:21 -05:00
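The cleaning step described in 253e04a749 (strip HTML, store plain text, cap at 300 characters) could look roughly like the sketch below. `cleanDescription` and `tagPattern` are hypothetical names, the regexp-based tag stripping is a simplification, and the whitespace collapsing is an added assumption that the commit does not mention.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// tagPattern matches anything that looks like an HTML tag. A regexp is a
// simplification; an HTML tokenizer copes better with malformed markup.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// cleanDescription strips tags, decodes HTML entities, collapses runs of
// whitespace (an assumption), and truncates to at most max characters.
// Truncation counts runes so multi-byte characters are never split.
func cleanDescription(raw string, max int) string {
	text := tagPattern.ReplaceAllString(raw, " ")
	text = html.UnescapeString(text)
	text = strings.Join(strings.Fields(text), " ")
	runes := []rune(text)
	if len(runes) > max {
		text = string(runes[:max])
	}
	return text
}

func main() {
	// Prints: Tom & Jerry return
	fmt.Println(cleanDescription("<p>Tom &amp; Jerry<br/>return</p>", 300))
}
```

Running the same function at parse time and in a one-off batch migration would keep new and existing rows consistent, which is what the pairing of this commit with b515976fef suggests.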
primal
70828bf05d Remove unused enclosure_length from items table
The enclosure length was never used when publishing to the PDS.
Added migration to drop the column.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:56:45 -05:00
primal
3af9e65937 Remove unused updated_at column from items table
This field was never populated by any feed parser (RSS, Atom, JSON Feed)
and was always NULL. Added migration to drop the column.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 22:44:33 -05:00
primal
018c059924 Remove GUID from items, use Link as primary key
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:34:39 -05:00
primal
6314b934c1 Remove content column from items table - only description used for posts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:14:29 -05:00
primal
8cf25a55dc Remove item_count column from feeds table - compute dynamically from items
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:00:23 -05:00
primal
f4c6a9d814 Remove last_error_at from feeds table and queries
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 20:53:52 -05:00
primal
be969c11db Remove site_url field from feeds
- Remove SiteURL from Feed struct
- Remove site_url from all SQL queries and scans
- Remove SiteURL parsing from RSS/Atom/JSON feed parsers
- Add migration to drop site_url column

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 20:44:35 -05:00
primal
288379804d Remove category field from feeds
- Remove classifyFeed and classifyFeedByTitle functions
- Remove Category from Feed struct
- Remove category from all SQL queries and scans
- Add migration to drop category column from database

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 20:37:26 -05:00
primal
94369232f8 Remove domains table - feeds imported directly from CDX
- Remove domains table from schema
- Add migration to DROP TABLE domains
- Remove domain.go (Domain struct and related functions)
- Update stats output to only show feed count

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 20:17:52 -05:00
primal
82b40b9155 Remove domain_host/domain_tld columns from feeds table
- Remove domain_host and domain_tld columns from feeds schema
- Add migrations to drop columns and related index/FK constraint
- Update all feed queries and structs to not include these columns
- Use URL pattern search instead of domain columns for GetFeedsByHost

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 20:07:03 -05:00
primal
037b453a68 Remove oldest_item_date and newest_item_date columns
- Drop columns from schema, add migration
- Remove date stats calculation from parsers
- Update all feed queries to exclude these columns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 19:28:20 -05:00
primal
fec53f913c Remove last_build_date column from feeds schema
- Drop last_build_date from schema and add migration
- Remove parsing of RSS lastBuildDate and Atom updated date
- Update all feed SQL queries to exclude the column

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 19:16:30 -05:00
primal
2c3fa5e104 Remove discovered_at column from feeds and items tables
- Remove DiscoveredAt field from Feed and Item structs
- Remove from all SQL queries
- Remove from schema definitions
- Add migrations to drop the columns
- Remove unused 'now' variable declarations

The column wasn't providing value - all feeds had the same timestamp
from bulk import, and items weren't using it for any logic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 19:07:20 -05:00
primal
0428ff0241 Remove unused source_url column from feeds table
- Remove SourceURL field from Feed struct
- Remove from all SQL queries
- Remove from schema definition
- Add migration to drop the column

The column was never populated (0 entries) and the feature
to track where feeds were discovered from was never implemented.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 16:16:53 -05:00
primal
e56bcd456d Remove next_check_at column from feeds table
- Remove NextCheckAt field from Feed struct
- Remove from all SQL queries (saveFeed, getFeed, scanFeeds, etc.)
- Remove from schema definition
- Add migration to drop the column if it exists
- Update calculateNextCheck to use MissCount field
- Update tld.go to use status column instead of feed_health

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 16:07:04 -05:00
primal
1b0ff1b507 Add import-tsv tool for bulk importing TSV feed files
Standalone tool that uses pgx connection pool to import feeds from TSV.
Handles special characters in password via key=value connection string format.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 15:25:29 -05:00
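The key=value format mentioned in 1b0ff1b507 avoids the percent-encoding that a URL-style connection string needs for characters such as `@`, `:` or `/` in a password. The sketch below builds such a string with the standard library only. `quoteValue` and `buildDSN` are hypothetical helpers, and the quoting follows the libpq keyword/value rules (single-quote the value, backslash-escape `'` and `\`), which pgx is assumed to accept here.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteValue wraps a value in single quotes and backslash-escapes any
// backslashes and single quotes inside it.
func quoteValue(v string) string {
	v = strings.ReplaceAll(v, `\`, `\\`)
	v = strings.ReplaceAll(v, `'`, `\'`)
	return "'" + v + "'"
}

// buildDSN assembles a keyword/value connection string. All parameter
// values in main are placeholders, not the project's real settings.
func buildDSN(host string, port int, user, password, dbname string) string {
	return fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s",
		quoteValue(host), port, quoteValue(user), quoteValue(password), quoteValue(dbname))
}

func main() {
	fmt.Println(buildDSN("localhost", 5432, "crawler", `p@ss:w/rd#'1`, "feeds"))
}
```

The resulting string would then be handed to the pool constructor in place of a `postgres://` URL.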
primal
091fa8490b Filter text/html extraction by feed-like URL patterns
Reduces from ~2B URLs to ~2-3M by filtering for URLs containing:
rss, feed, atom, xml, syndication, frontpage, newest, etc.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:30:40 -05:00
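A substring filter of the kind described in 091fa8490b might be sketched as follows. `looksLikeFeedURL` and `feedHints` are hypothetical names. The list holds only the patterns named in the commit message, since the full list is elided there with "etc.", and the case-insensitive match is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// feedHints contains only the substrings named in the commit message;
// the original list is longer and is not reproduced here.
var feedHints = []string{"rss", "feed", "atom", "xml", "syndication", "frontpage", "newest"}

// looksLikeFeedURL reports whether the URL contains any feed-like hint.
// Matching is case-insensitive, which is an assumption of this sketch.
func looksLikeFeedURL(rawURL string) bool {
	lower := strings.ToLower(rawURL)
	for _, hint := range feedHints {
		if strings.Contains(lower, hint) {
			return true
		}
	}
	return false
}

func main() {
	for _, u := range []string{
		"https://example.com/blog/rss",
		"https://news.ycombinator.com/newest",
		"https://example.com/about",
	} {
		fmt.Printf("%-40s %v\n", u, looksLikeFeedURL(u))
	}
}
```

A plain substring match is deliberately loose: it admits false positives, which a later content-sniffing pass can weed out, in exchange for a cheap scan over a very large URL set.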
primal
61ca7a4c7a Add html feed type for content-sniffed feeds
- New 'html' type for feeds served with text/html MIME
- feed_check content-sniffs html feeds and updates type to rss/atom/json
- If content-sniff returns unknown, marks feed as IGNORE
- Added cmd/extract-html tool to query local parquet files for text/html

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:18:26 -05:00
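Content sniffing as described in 61ca7a4c7a can be approximated by inspecting the start of the response body. The sketch below is a heuristic written for illustration. `sniffFeedType` is a hypothetical stand-in and is not the repository's `detectFeedType`, whose actual rules are not shown in this log.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffFeedType guesses the feed format from the first kilobyte of a body.
// It returns "rss", "atom", "json", or "unknown".
func sniffFeedType(body []byte) string {
	head := body
	if len(head) > 1024 {
		head = head[:1024]
	}
	// Drop a UTF-8 byte order mark and surrounding whitespace.
	head = bytes.TrimSpace(bytes.TrimPrefix(head, []byte("\xef\xbb\xbf")))
	lower := bytes.ToLower(head)

	switch {
	case bytes.HasPrefix(lower, []byte("{")) &&
		bytes.Contains(lower, []byte("jsonfeed.org/version")):
		return "json"
	case bytes.Contains(lower, []byte("<rss")) ||
		bytes.Contains(lower, []byte("<rdf:rdf")):
		return "rss"
	case bytes.Contains(lower, []byte("<feed")):
		return "atom"
	default:
		return "unknown"
	}
}

func main() {
	samples := [][]byte{
		[]byte(`<?xml version="1.0"?><rss version="2.0"><channel/></rss>`),
		[]byte(`<feed xmlns="http://www.w3.org/2005/Atom"></feed>`),
		[]byte(`{"version": "https://jsonfeed.org/version/1.1", "items": []}`),
		[]byte(`<!DOCTYPE html><html><body>Hello</body></html>`),
	}
	for _, s := range samples {
		fmt.Println(sniffFeedType(s))
	}
}
```

Under the scheme in the commit, a feed stored with type `html` would be updated to whichever concrete type the sniff returns, and an `unknown` result would lead to the feed being marked IGNORE.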
primal
5c43fc693b Add text/html to CDX filter for feeds with wrong MIME type
Many sites (e.g., news.ycombinator.com) serve RSS with Content-Type: text/html.
These were being filtered out. Now we include text/html and rely on
detectFeedType content sniffing during feed_check to identify real feeds.

False positives will have type=unknown and increment MissCount but
will not cause harm beyond consuming check cycles.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:11:57 -05:00
primal
12bd68000d Rename status column to feed_health
- Rename feeds.status -> feeds.feed_health in all SQL
- Rename Feed.Status -> Feed.FeedHealth in Go code
- Update index to use feed_health column

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 10:58:33 -05:00
primal
02378950f4 Add bulk import for CDX feeds
- Add BulkImportFeedsFromTSV() for fast direct insertion from TSV
- Skip HTTP verification, insert feeds with status='hold'
- Check for pending TSV files on startup and auto-import
- Imported 4.7M feeds in ~26 minutes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 10:42:52 -05:00
primal
3b1b12ff70 Simplify crawler: use CDX for feed discovery, remove unused loops
- Replace StartCDXImportLoop with StartCDXMonthlyLoop (runs on 1st of month)
- Enable StartFeedCheckLoop, StartCleanupLoop, StartMaintenanceLoop
- Remove domain check/crawl loops (CDX provides verified feeds)
- Remove vertices.txt import functions (CDX is now sole feed source)
- Remove HTML extraction functions (extractFeedLinks, extractAnchorFeeds, etc.)
- Remove unused helpers (shouldCrawl, makeAbsoluteURL, GetFeedCountByHost)
- Simplify Crawler struct (remove MaxDepth, visited, domain counters)
- Add CDX progress tracking to database and dashboard
- Net removal: ~720 lines of unused code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 22:53:23 -05:00
primal
07621a7059 Switch back to infra-dns for DNS lookups
infra-dns now configured with Charter DNS servers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 21:02:28 -05:00
primal
e6761954c0 Use system DNS resolver instead of custom infra-dns
infra-dns container has UDP connectivity issues to upstream DNS.
System resolver works (proven via wget test).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 20:55:57 -05:00
primal
f2bb1e72d2 Split domain processing into separate check and crawl loops
- StartDomainCheckLoop: DNS verification for unchecked domains (1000 workers)
- StartFeedCrawlLoop: Feed discovery on DNS-verified domains (100 workers)

This fixes starvation where 104M unchecked domains blocked 1.2M
DNS-verified domains from ever being crawled for feeds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 20:35:46 -05:00
primal
26de5d3753 Add status column to items table
Items now have a status column ('pass' or 'fail', default 'pass') to
control publishing eligibility. Includes migration for existing databases.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:46:33 -05:00
primal
6eaa39f9db Remove publishing code - now handled by publish service
Publishing functionality has been moved to the standalone publish service.
Removed:
- publisher.go, pds_auth.go, pds_records.go, image.go, handle.go
- StartPublishLoop and related functions from crawler.go
- Publish loop invocation from main.go

Updated CLAUDE.md to reflect the new architecture.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:40:49 -05:00
primal
7b50f5c008 Update shared references to commons
2026-02-02 15:19:48 -05:00
primal
bd76ea1108 Trim shortener.go - keep only URL creation, remove click tracking
Click tracking now handled by tracker service.
Reduced from 250 to 79 lines.
2026-02-02 13:28:10 -05:00
primal
aea101a5e7 Update short URLs to use news.1440.news
2026-02-02 13:23:24 -05:00
primal
ec53ad59db Phase 5: Remove dashboard code from crawler
Removed dashboard-related files (now in standalone dashboard/ service):
- api_domains.go, api_feeds.go, api_publish.go, api_search.go
- dashboard.go, templates.go
- oauth.go, oauth_handlers.go, oauth_middleware.go, oauth_session.go
- routes.go
- static/dashboard.css, static/dashboard.js

Updated crawler.go:
- Removed cachedStats, cachedAllDomains, statsMu fields
- Removed StartStatsLoop function

Updated main.go:
- Removed dashboard startup
- Removed stats loop and UpdateStats calls

The crawler now runs independently without dashboard.
Use the standalone dashboard/ service for web interface.
2026-02-02 13:08:48 -05:00
primal
fa82d8b765 Move plan to dedicated plans/ directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 12:40:48 -05:00
primal
98bee87c05 Add dashboard separation plan
Tracking file-by-file migration of dashboard code from app/ to dashboard/ service.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 12:39:08 -05:00
primal
bce9369cb8 Fix OAuth session storage - add missing database columns
- Add dpop_authserver_nonce, dpop_pds_nonce, pds_url, authserver_iss columns
- These columns are required by GetSession query but were missing from schema
- Add migrations to create columns on existing tables
- Add debug logging for OAuth flow troubleshooting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 00:44:19 -05:00
primal
86d669e08e Make oauth_sessions.access_token nullable
Session is created before tokens are obtained during OAuth flow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 00:35:53 -05:00
primal
265975c7c5 Rename sessions table to oauth_sessions for consistency
- Update schema to create oauth_sessions instead of sessions
- Add migration to rename existing sessions table
- Add token_expiry column for OAuth library compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 00:34:13 -05:00
primal
615aa6ef5d Fix TLD sync to use domain_tld column for feeds table
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 23:52:29 -05:00
primal
3f277ec165 Remove item ID column references - items now use composite PK (guid, feed_url)
- Remove ID field from Item struct
- Remove ID field from SearchItem struct
- Update all SQL queries to not select id column
- Change MarkItemPublished to use feedURL/guid instead of id
- Update shortener to use item_guid instead of item_id
- Add migration to convert item_id to item_guid in short_urls table
- Update API endpoints to use feedUrl/guid instead of itemId

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 23:51:44 -05:00
primal
7ec4207173 Migrate to normalized FK schema (domain_host, domain_tld)
Replace source_host column with proper FK to domains table using
composite key (domain_host, domain_tld). This enables JOIN queries
instead of string concatenation for domain lookups.

Changes:
- Update Feed struct: SourceHost/TLD → DomainHost/DomainTLD
- Update all SQL queries to use domain_host/domain_tld columns
- Add column aliases (as source_host) for API backwards compatibility
- Update trigram index from source_host to domain_host
- Add getDomainHost() helper for extracting host from domain

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 22:36:25 -05:00
primal
e7f6be2203 Add internal crawl endpoint without auth
For triggering priority crawls from internal network.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:59:39 -05:00
primal
edf54ca212 Add graceful shutdown for goroutines
- Add shutdownCh channel to signal goroutines to stop
- Check IsShuttingDown() in all main loops
- Wait 2 seconds for goroutines to finish before closing DB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:23:57 -05:00
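The shutdown pattern in edf54ca212 (a shared channel, an `IsShuttingDown()` check in each loop, a bounded wait before closing the database) could be sketched as below. Only `shutdownCh` and `IsShuttingDown` are names from the commit. The `Crawler` fields, `StartLoop`, and `Shutdown` are hypothetical, and using a `sync.WaitGroup` with a timeout rather than a fixed two-second sleep is a variation chosen for this sketch.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Crawler holds the shutdown signal shared by all background loops.
type Crawler struct {
	shutdownCh chan struct{}
	once       sync.Once
	wg         sync.WaitGroup
}

func NewCrawler() *Crawler {
	return &Crawler{shutdownCh: make(chan struct{})}
}

// IsShuttingDown reports whether shutdown has been signalled.
func (c *Crawler) IsShuttingDown() bool {
	select {
	case <-c.shutdownCh:
		return true
	default:
		return false
	}
}

// StartLoop runs step repeatedly until shutdown is signalled. The sleep
// between iterations is interrupted as soon as the channel is closed.
func (c *Crawler) StartLoop(interval time.Duration, step func()) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		for !c.IsShuttingDown() {
			step()
			select {
			case <-c.shutdownCh:
			case <-time.After(interval):
			}
		}
	}()
}

// Shutdown signals every loop and waits up to grace for them to exit.
// It returns true if all loops finished within the grace period.
func (c *Crawler) Shutdown(grace time.Duration) bool {
	c.once.Do(func() { close(c.shutdownCh) })
	done := make(chan struct{})
	go func() {
		c.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(grace):
		return false
	}
}

func main() {
	c := NewCrawler()
	c.StartLoop(10*time.Millisecond, func() {})
	time.Sleep(50 * time.Millisecond)
	fmt.Println("clean shutdown:", c.Shutdown(2*time.Second))
	// The database connection would be closed here, after the loops stop.
}
```

Closing the channel broadcasts to every goroutine at once, and the `sync.Once` makes a second call to `Shutdown` safe.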
primal
81146fd572 Fix domain search when pattern looks like domain
When searching for "npr.org" and viewing the .org TLD, use the host part
("npr") for matching instead of the full pattern ("npr.org").

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:19:21 -05:00
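The fix in 81146fd572 amounts to trimming the TLD suffix from the search pattern when the pattern already ends with the TLD being viewed. A minimal sketch, assuming the TLD is passed without a leading dot and that matching is case-insensitive; `hostPart` is a hypothetical name:

```go
package main

import (
	"fmt"
	"strings"
)

// hostPart returns the pattern with a trailing ".<tld>" removed, so that a
// search for "npr.org" inside the "org" TLD matches on "npr". Patterns
// that do not end with the TLD are returned lowercased but otherwise
// unchanged.
func hostPart(pattern, tld string) string {
	p := strings.ToLower(pattern)
	suffix := "." + strings.ToLower(tld)
	return strings.TrimSuffix(p, suffix)
}

func main() {
	fmt.Println(hostPart("npr.org", "org")) // npr
	fmt.Println(hostPart("npr", "org"))     // npr
	fmt.Println(hostPart("npr.com", "org")) // npr.com
}
```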
primal
7011b126fe Fix tld_enum comparison - cast to text instead of LOWER()
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:13:21 -05:00
primal
f2978e7ab5 Clean up debug logging
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:11:49 -05:00
primal
8a9001c02c Restore working codebase with all methods
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:08:53 -05:00
primal
211812363a Add TLD sync loop for IANA TLD updates
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:07:43 -05:00