From 93ab1f8117c17d0e3c8e812625b649dcadfd2212 Mon Sep 17 00:00:00 2001
From: primal
Date: Mon, 26 Jan 2026 13:47:19 -0500
Subject: [PATCH] Update CLAUDE.md to reflect current multi-file architecture

The codebase evolved from a single-file app to a multi-file structure with
SQLite persistence, a dashboard, and concurrent processing loops. Updated the
documentation to accurately describe the current architecture.

Co-Authored-By: Claude Opus 4.5
---
 CLAUDE.md | 97 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 61 insertions(+), 36 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 4c4dd88..eaf5ae7 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -2,50 +2,75 @@
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

-## Project Overview
-
-1440.news is a Go-based web feed crawler that discovers and catalogs RSS and Atom feeds from websites. It processes hosts from Common Crawl data (vertices.txt.gz) and outputs discovered feeds organized by TLD into `.feed` files.
-
-## Build & Run Commands
+## Build & Run

 ```bash
-# Build
-go build -o 1440.news main.go
-
-# Run (requires vertices.txt.gz in the working directory)
-./1440.news
-
-# Format code
-go fmt ./...
-
-# Static analysis
-go vet ./...
+go build -o 1440.news .     # Build
+./1440.news                 # Run (starts dashboard at http://localhost:4321)
+go fmt ./...                # Format
+go vet ./...                # Static analysis
 ```

+Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.
+
 ## Architecture

-**Single-file application** (`main.go`, ~656 lines) with these key components:
+Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in SQLite, and provides a web dashboard.

-- `Crawler` struct - Core engine managing HTTP client, concurrency, and state
-- `Feed` struct - Simple URL + Type (rss/atom) structure
-- RSS/Atom parsing structs for XML deserialization
+### Concurrent Loops (main.go)

-**Concurrency model:**
-- Worker pool pattern with `runtime.NumCPU() - 1` goroutines
-- `sync.Map` for thread-safe global URL deduplication
-- `sync.Mutex` for feed collection and TLD file operations
+The application runs five independent goroutine loops:
+- **Import loop** - Reads `vertices.txt.gz` and inserts domains into the DB in 10k batches
+- **Crawl loop** - Worker pool that processes unchecked domains and discovers feeds
+- **Check loop** - Worker pool that re-checks known feeds for updates (conditional HTTP)
+- **Stats loop** - Updates cached dashboard statistics every minute
+- **Cleanup loop** - Removes items older than 12 months (weekly)

-**Key functions:**
-- `CrawlHosts()` - Main entry point, coordinates worker pool
-- `crawlHost()` - Processes a single host (tries HTTPS then HTTP)
-- `crawlPage()` - Recursive page crawler with depth/page limits
-- `extractFeedLinks()` - Finds `<link>` feed references
-- `extractAnchorFeeds()` - Finds anchor tags with rss/atom/feed in href
+### File Structure

-**Configuration (hardcoded in `NewCrawler()`):**
-- MaxDepth: 10, MaxPagesPerHost: 10, Timeout: 10s
-- UserAgent: "FeedCrawler/1.0"
-- Max redirects: 10
+| File | Purpose |
+|------|---------|
+| `crawler.go` | Crawler struct, worker pools, page fetching, recursive crawl logic |
+| `domain.go` | Domain struct, DB operations, vertices file import |
+| `feed.go` | Feed/Item structs, DB operations, feed checking with HTTP caching |
+| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
+| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
+| `util.go` | URL normalization, host utilities, TLD extraction |
+| `db.go` | SQLite schema (domains, feeds, items tables with FTS5) |
+| `dashboard.go` | HTTP server, JSON APIs, HTML template |

-**Input:** Common Crawl vertices file (gzipped TSV with reverse domain notation)
-**Output:** TLD-specific `.feed` files (e.g., `com.feed`, `org.feed`) containing sorted, deduplicated feed URLs
+### Database Schema
+
+SQLite with WAL mode at `feeds/feeds.db`:
+- **domains** - Hosts to crawl (status: unchecked/checked/error)
+- **feeds** - Discovered RSS/Atom feeds with metadata and cache headers
+- **items** - Individual feed entries (unique on guid + feedUrl)
+- **feeds_fts / items_fts** - FTS5 virtual tables for search
+
+### Crawl Logic
+
+1. Domain picked from `unchecked` status (random order)
+2. Try HTTPS, fall back to HTTP
+3. Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
+4. Extract `<link>` feed references and anchor hrefs containing rss/atom/feed
+5. Parse discovered feeds for metadata, save with nextCrawlAt
+
+### Feed Checking
+
+Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: a 100s base plus another 100s for each consecutive check that found no changes. Respects RSS `<ttl>` and Syndication namespace hints.
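+
+A minimal sketch of the check cycle above (illustrative only; these helper names are hypothetical, and the real logic lives in `feed.go` and `parser.go`):
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"time"
+)
+
+// nextCheckDelay is one reading of the backoff rule above: a 100s base plus
+// another 100s for each consecutive check that found no new items.
+func nextCheckDelay(consecutiveNoChange int) time.Duration {
+	return time.Duration(100*(1+consecutiveNoChange)) * time.Second
+}
+
+// setConditionalHeaders attaches the validators saved from the previous fetch
+// so an unchanged feed can answer 304 Not Modified instead of a full body.
+func setConditionalHeaders(req *http.Request, etag, lastModified string) {
+	if etag != "" {
+		req.Header.Set("If-None-Match", etag)
+	}
+	if lastModified != "" {
+		req.Header.Set("If-Modified-Since", lastModified)
+	}
+}
+
+func main() {
+	// Three consecutive unchanged checks push the next attempt out to 400s.
+	fmt.Println(nextCheckDelay(3)) // 6m40s
+}
+```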
+
+## AT Protocol Integration (Planned)
+
+Domain: 1440.news
+
+User structure (sketched below):
+- `wehrv.1440.news` - Owner/admin account
+- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
+- `{category}.{domain}.1440.news` - Category-specific feeds (future)
+
+Phases:
+1. Local PDS setup
+2. Account management
+3. Auto-create domain users
+4. Post articles to accounts
+5. Category detection
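+
+A minimal sketch of the planned handle scheme (nothing in the repository implements this yet; the helpers and the `markets` category below are hypothetical):
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// sourceHandle maps a source domain to its planned catch-all account handle,
+// e.g. "wsj.com" -> "wsj.com.1440.news".
+func sourceHandle(domain string) string {
+	return strings.ToLower(strings.TrimSuffix(domain, ".")) + ".1440.news"
+}
+
+// categoryHandle prefixes a category label for the planned per-category feeds,
+// e.g. ("markets", "wsj.com") -> "markets.wsj.com.1440.news".
+func categoryHandle(category, domain string) string {
+	return strings.ToLower(category) + "." + sourceHandle(domain)
+}
+
+func main() {
+	fmt.Println(sourceHandle("wsj.com"))              // wsj.com.1440.news
+	fmt.Println(categoryHandle("markets", "wsj.com")) // markets.wsj.com.1440.news
+}
+```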