crawler

Author	SHA1	Message	Date
primal	d4a1928fa6	Increase feed check parallelism for 1Gbps bandwidth - Workers: 1000 -> 4000 - Work channel buffer: 1000 -> 4000 - Fetch batch size: 1000 -> 4000 - MaxIdleConns: 100 -> 2000 Should improve throughput from ~15 feeds/sec to ~50-60 feeds/sec. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 12:11:58 -05:00
primal	1f90b7d6a0	Simplify launch script Remove local cache busting logic, delegate to parent launch script. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 23:49:21 -05:00
primal	b515976fef	Add clean-descriptions migration tool Batch processes existing item descriptions to strip HTML tags, decode HTML entities, and truncate to 300 characters. Processes in batches of 1000 with progress output. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 23:43:58 -05:00
primal	253e04a749	Strip HTML and truncate descriptions on input Clean descriptions when parsing feeds rather than at publish time. Descriptions are now stored as plain text, max 300 chars. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 22:59:21 -05:00
primal	70828bf05d	Remove unused enclosure_length from items table The enclosure length was never used when publishing to the PDS. Added migration to drop the column. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 22:56:45 -05:00
primal	3af9e65937	Remove unused updated_at column from items table This field was never populated by any feed parser (RSS, Atom, JSON Feed) and was always NULL. Added migration to drop the column. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 22:44:33 -05:00
primal	018c059924	Remove GUID from items, use Link as primary key Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 21:34:39 -05:00
primal	6314b934c1	Remove content column from items table - only description used for posts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 21:14:29 -05:00
primal	8cf25a55dc	Remove item_count column from feeds table - compute dynamically from items Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 21:00:23 -05:00
primal	f4c6a9d814	Remove last_error_at from feeds table and queries Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 20:53:52 -05:00
primal	be969c11db	Remove site_url field from feeds - Remove SiteURL from Feed struct - Remove site_url from all SQL queries and scans - Remove SiteURL parsing from RSS/Atom/JSON feed parsers - Add migration to drop site_url column Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 20:44:35 -05:00
primal	288379804d	Remove category field from feeds - Remove classifyFeed and classifyFeedByTitle functions - Remove Category from Feed struct - Remove category from all SQL queries and scans - Add migration to drop category column from database Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 20:37:26 -05:00
primal	94369232f8	Remove domains table - feeds imported directly from CDX - Remove domains table from schema - Add migration to DROP TABLE domains - Remove domain.go (Domain struct and related functions) - Update stats output to only show feed count Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 20:17:52 -05:00
primal	82b40b9155	Remove domain_host/domain_tld columns from feeds table - Remove domain_host and domain_tld columns from feeds schema - Add migrations to drop columns and related index/FK constraint - Update all feed queries and structs to not include these columns - Use URL pattern search instead of domain columns for GetFeedsByHost Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 20:07:03 -05:00
primal	037b453a68	Remove oldest_item_date and newest_item_date columns - Drop columns from schema, add migration - Remove date stats calculation from parsers - Update all feed queries to exclude these columns Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 19:28:20 -05:00
primal	fec53f913c	Remove last_build_date column from feeds schema - Drop last_build_date from schema and add migration - Remove parsing of RSS lastBuildDate and Atom updated date - Update all feed SQL queries to exclude the column Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 19:16:30 -05:00
primal	2c3fa5e104	Remove discovered_at column from feeds and items tables - Remove DiscoveredAt field from Feed and Item structs - Remove from all SQL queries - Remove from schema definitions - Add migrations to drop the columns - Remove unused 'now' variable declarations The column wasn't providing value - all feeds had the same timestamp from bulk import, and items weren't using it for any logic. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 19:07:20 -05:00
primal	0428ff0241	Remove unused source_url column from feeds table - Remove SourceURL field from Feed struct - Remove from all SQL queries - Remove from schema definition - Add migration to drop the column The column was never populated (0 entries) and the feature to track where feeds were discovered from was never implemented. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 16:16:53 -05:00
primal	e56bcd456d	Remove next_check_at column from feeds table - Remove NextCheckAt field from Feed struct - Remove from all SQL queries (saveFeed, getFeed, scanFeeds, etc.) - Remove from schema definition - Add migration to drop the column if it exists - Update calculateNextCheck to use MissCount field - Update tld.go to use status column instead of feed_health Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 16:07:04 -05:00
primal	1b0ff1b507	Add import-tsv tool for bulk importing TSV feed files Standalone tool that uses pgx connection pool to import feeds from TSV. Handles special characters in password via key=value connection string format. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 15:25:29 -05:00
primal	091fa8490b	Filter text/html extraction by feed-like URL patterns Reduces from ~2B URLs to ~2-3M by filtering for URLs containing: rss, feed, atom, xml, syndication, frontpage, newest, etc. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 14:30:40 -05:00
primal	61ca7a4c7a	Add html feed type for content-sniffed feeds - New 'html' type for feeds served with text/html MIME - feed_check content-sniffs html feeds and updates type to rss/atom/json - If content-sniff returns unknown, marks feed as IGNORE - Added cmd/extract-html tool to query local parquet files for text/html Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 14:18:26 -05:00
primal	5c43fc693b	Add text/html to CDX filter for feeds with wrong MIME type Many sites (e.g., news.ycombinator.com) serve RSS with Content-Type: text/html. These were being filtered out. Now we include text/html and rely on detectFeedType content sniffing during feed_check to identify real feeds. False positives will have type=unknown and increment MissCount but will not cause harm beyond consuming check cycles. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 14:11:57 -05:00
primal	12bd68000d	Rename status column to feed_health - Rename feeds.status -> feeds.feed_health in all SQL - Rename Feed.Status -> Feed.FeedHealth in Go code - Update index to use feed_health column Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 10:58:33 -05:00
primal	02378950f4	Add bulk import for CDX feeds - Add BulkImportFeedsFromTSV() for fast direct insertion from TSV - Skip HTTP verification, insert feeds with status='hold' - Check for pending TSV files on startup and auto-import - Imported 4.7M feeds in ~26 minutes Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 10:42:52 -05:00
primal	3b1b12ff70	Simplify crawler: use CDX for feed discovery, remove unused loops - Replace StartCDXImportLoop with StartCDXMonthlyLoop (runs on 1st of month) - Enable StartFeedCheckLoop, StartCleanupLoop, StartMaintenanceLoop - Remove domain check/crawl loops (CDX provides verified feeds) - Remove vertices.txt import functions (CDX is now sole feed source) - Remove HTML extraction functions (extractFeedLinks, extractAnchorFeeds, etc.) - Remove unused helpers (shouldCrawl, makeAbsoluteURL, GetFeedCountByHost) - Simplify Crawler struct (remove MaxDepth, visited, domain counters) - Add CDX progress tracking to database and dashboard - Net removal: ~720 lines of unused code Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 22:53:23 -05:00
primal	07621a7059	Switch back to infra-dns for DNS lookups infra-dns now configured with Charter DNS servers. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 21:02:28 -05:00
primal	e6761954c0	Use system DNS resolver instead of custom infra-dns infra-dns container has UDP connectivity issues to upstream DNS. System resolver works (proven via wget test). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 20:55:57 -05:00
primal	f2bb1e72d2	Split domain processing into separate check and crawl loops - StartDomainCheckLoop: DNS verification for unchecked domains (1000 workers) - StartFeedCrawlLoop: Feed discovery on DNS-verified domains (100 workers) This fixes starvation where 104M unchecked domains blocked 1.2M DNS-verified domains from ever being crawled for feeds. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 20:35:46 -05:00
primal	26de5d3753	Add status column to items table Items now have a status column ('pass' or 'fail', default 'pass') to control publishing eligibility. Includes migration for existing databases. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 15:46:33 -05:00
primal	6eaa39f9db	Remove publishing code - now handled by publish service Publishing functionality has been moved to the standalone publish service. Removed: - publisher.go, pds_auth.go, pds_records.go, image.go, handle.go - StartPublishLoop and related functions from crawler.go - Publish loop invocation from main.go Updated CLAUDE.md to reflect the new architecture. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 15:40:49 -05:00
primal	7b50f5c008	Update shared references to commons	2026-02-02 15:19:48 -05:00
primal	bd76ea1108	Trim shortener.go - keep only URL creation, remove click tracking Click tracking now handled by tracker service. Reduced from 250 to 79 lines.	2026-02-02 13:28:10 -05:00
primal	aea101a5e7	Update short URLs to use news.1440.news	2026-02-02 13:23:24 -05:00
primal	ec53ad59db	Phase 5: Remove dashboard code from crawler Removed dashboard-related files (now in standalone dashboard/ service): - api_domains.go, api_feeds.go, api_publish.go, api_search.go - dashboard.go, templates.go - oauth.go, oauth_handlers.go, oauth_middleware.go, oauth_session.go - routes.go - static/dashboard.css, static/dashboard.js Updated crawler.go: - Removed cachedStats, cachedAllDomains, statsMu fields - Removed StartStatsLoop function Updated main.go: - Removed dashboard startup - Removed stats loop and UpdateStats calls The crawler now runs independently without dashboard. Use the standalone dashboard/ service for web interface.	2026-02-02 13:08:48 -05:00
primal	fa82d8b765	Move plan to dedicated plans/ directory Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 12:40:48 -05:00
primal	98bee87c05	Add dashboard separation plan Tracking file-by-file migration of dashboard code from app/ to dashboard/ service. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 12:39:08 -05:00
primal	bce9369cb8	Fix OAuth session storage - add missing database columns - Add dpop_authserver_nonce, dpop_pds_nonce, pds_url, authserver_iss columns - These columns are required by GetSession query but were missing from schema - Add migrations to create columns on existing tables - Add debug logging for OAuth flow troubleshooting Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 00:44:19 -05:00
primal	86d669e08e	Make oauth_sessions.access_token nullable Session is created before tokens are obtained during OAuth flow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 00:35:53 -05:00
primal	265975c7c5	Rename sessions table to oauth_sessions for consistency - Update schema to create oauth_sessions instead of sessions - Add migration to rename existing sessions table - Add token_expiry column for OAuth library compatibility Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 00:34:13 -05:00
primal	615aa6ef5d	Fix TLD sync to use domain_tld column for feeds table Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 23:52:29 -05:00
primal	3f277ec165	Remove item ID column references - items now use composite PK (guid, feed_url) - Remove ID field from Item struct - Remove ID field from SearchItem struct - Update all SQL queries to not select id column - Change MarkItemPublished to use feedURL/guid instead of id - Update shortener to use item_guid instead of item_id - Add migration to convert item_id to item_guid in short_urls table - Update API endpoints to use feedUrl/guid instead of itemId Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 23:51:44 -05:00
primal	7ec4207173	Migrate to normalized FK schema (domain_host, domain_tld) Replace source_host column with proper FK to domains table using composite key (domain_host, domain_tld). This enables JOIN queries instead of string concatenation for domain lookups. Changes: - Update Feed struct: SourceHost/TLD → DomainHost/DomainTLD - Update all SQL queries to use domain_host/domain_tld columns - Add column aliases (as source_host) for API backwards compatibility - Update trigram index from source_host to domain_host - Add getDomainHost() helper for extracting host from domain Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 22:36:25 -05:00
primal	e7f6be2203	Add internal crawl endpoint without auth For triggering priority crawls from internal network. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:59:39 -05:00
primal	edf54ca212	Add graceful shutdown for goroutines - Add shutdownCh channel to signal goroutines to stop - Check IsShuttingDown() in all main loops - Wait 2 seconds for goroutines to finish before closing DB Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:23:57 -05:00
primal	81146fd572	Fix domain search when pattern looks like domain When searching for "npr.org" and viewing the .org TLD, use the host part ("npr") for matching instead of the full pattern ("npr.org"). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:19:21 -05:00
primal	7011b126fe	Fix tld_enum comparison - cast to text instead of LOWER() Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:13:21 -05:00
primal	f2978e7ab5	Clean up debug logging Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:11:49 -05:00
primal	8a9001c02c	Restore working codebase with all methods Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:08:53 -05:00
primal	211812363a	Add TLD sync loop for IANA TLD updates Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:07:43 -05:00

138 Commits