From 93ab1f8117c17d0e3c8e812625b649dcadfd2212 Mon Sep 17 00:00:00 2001
From: primal
Date: Mon, 26 Jan 2026 13:47:19 -0500
Subject: [PATCH] Update CLAUDE.md to reflect current multi-file architecture

The codebase evolved from a single-file app to a multi-file structure with
SQLite persistence, a dashboard, and concurrent processing loops. Updated the
documentation to accurately describe the current architecture.

Co-Authored-By: Claude Opus 4.5
---
 CLAUDE.md | 97 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 61 insertions(+), 36 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 4c4dd88..eaf5ae7 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -2,50 +2,75 @@
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

-## Project Overview
-
-1440.news is a Go-based web feed crawler that discovers and catalogs RSS and Atom feeds from websites. It processes hosts from Common Crawl data (vertices.txt.gz) and outputs discovered feeds organized by TLD into `.feed` files.
-
-## Build & Run Commands
+## Build & Run

 ```bash
-# Build
-go build -o 1440.news main.go
-
-# Run (requires vertices.txt.gz in the working directory)
-./1440.news
-
-# Format code
-go fmt ./...
-
-# Static analysis
-go vet ./...
+go build -o 1440.news .     # Build
+./1440.news                 # Run (starts dashboard at http://localhost:4321)
+go fmt ./...                # Format
+go vet ./...                # Static analysis
 ```

+Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.
+
 ## Architecture

-**Single-file application** (`main.go`, ~656 lines) with these key components:
+Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in SQLite, and provides a web dashboard.

-- `Crawler` struct - Core engine managing HTTP client, concurrency, and state
-- `Feed` struct - Simple URL + Type (rss/atom) structure
-- RSS/Atom parsing structs for XML deserialization
+### Concurrent Loops (main.go)

-**Concurrency model:**
-- Worker pool pattern with `runtime.NumCPU() - 1` goroutines
-- `sync.Map` for thread-safe global URL deduplication
-- `sync.Mutex` for feed collection and TLD file operations
+The application runs five independent goroutine loops:
+- **Import loop** - Reads `vertices.txt.gz` and inserts domains into the DB in 10k batches
+- **Crawl loop** - Worker pool that processes unchecked domains and discovers feeds
+- **Check loop** - Worker pool that re-checks known feeds for updates (conditional HTTP)
+- **Stats loop** - Updates cached dashboard statistics every minute
+- **Cleanup loop** - Removes items older than 12 months (weekly)

-**Key functions:**
-- `CrawlHosts()` - Main entry point, coordinates worker pool
-- `crawlHost()` - Processes a single host (tries HTTPS then HTTP)
-- `crawlPage()` - Recursive page crawler with depth/page limits
-- `extractFeedLinks()` - Finds `<link>` feed references
-- `extractAnchorFeeds()` - Finds anchor tags with rss/atom/feed in href
+### File Structure

-**Configuration (hardcoded in `NewCrawler()`):**
-- MaxDepth: 10, MaxPagesPerHost: 10, Timeout: 10s
-- UserAgent: "FeedCrawler/1.0"
-- Max redirects: 10
+| File | Purpose |
+|------|---------|
+| `crawler.go` | Crawler struct, worker pools, page fetching, recursive crawl logic |
+| `domain.go` | Domain struct, DB operations, vertices file import |
+| `feed.go` | Feed/Item structs, DB operations, feed checking with HTTP caching |
+| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
+| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
+| `util.go` | URL normalization, host utilities, TLD extraction |
+| `db.go` | SQLite schema (domains, feeds, items tables with FTS5) |
+| `dashboard.go` | HTTP server, JSON APIs, HTML template |

-**Input:** Common Crawl vertices file (gzipped TSV with reverse domain notation)
-**Output:** TLD-specific `.feed` files (e.g., `com.feed`, `org.feed`) containing sorted, deduplicated feed URLs
+### Database Schema
+
+SQLite with WAL mode at `feeds/feeds.db`:
+- **domains** - Hosts to crawl (status: unchecked/checked/error)
+- **feeds** - Discovered RSS/Atom feeds with metadata and cache headers
+- **items** - Individual feed entries (unique on guid + feedUrl)
+- **feeds_fts / items_fts** - FTS5 virtual tables for search
+
+### Crawl Logic
+
+1. Domain picked from `unchecked` status (random order)
+2. Try HTTPS, fall back to HTTP
+3. Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
+4. Extract `<link>` feed references and anchor hrefs containing rss/atom/feed
+5. Parse discovered feeds for metadata, save with nextCrawlAt
+
+### Feed Checking
+
+Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: a 100s base plus another 100s for each consecutive check that found no changes. Respects RSS `<ttl>` and Syndication namespace hints.
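+
+A minimal sketch of the check cycle above (illustrative only; these helper names are hypothetical, and the real logic lives in `feed.go` and `parser.go`):
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"time"
+)
+
+// nextCheckDelay is one reading of the backoff rule above: a 100s base plus
+// another 100s for each consecutive check that found no new items.
+func nextCheckDelay(consecutiveNoChange int) time.Duration {
+	return time.Duration(100*(1+consecutiveNoChange)) * time.Second
+}
+
+// setConditionalHeaders attaches the validators saved from the previous fetch
+// so an unchanged feed can answer 304 Not Modified instead of a full body.
+func setConditionalHeaders(req *http.Request, etag, lastModified string) {
+	if etag != "" {
+		req.Header.Set("If-None-Match", etag)
+	}
+	if lastModified != "" {
+		req.Header.Set("If-Modified-Since", lastModified)
+	}
+}
+
+func main() {
+	// Three consecutive unchanged checks push the next attempt out to 400s.
+	fmt.Println(nextCheckDelay(3)) // 6m40s
+}
+```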
+
+## AT Protocol Integration (Planned)
+
+Domain: 1440.news
+
+User structure (sketched below):
+- `wehrv.1440.news` - Owner/admin account
+- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
+- `{category}.{domain}.1440.news` - Category-specific feeds (future)
+
+Phases:
+1. Local PDS setup
+2. Account management
+3. Auto-create domain users
+4. Post articles to accounts
+5. Category detection
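+
+A minimal sketch of the planned handle scheme (nothing in the repository implements this yet; the helpers and the `markets` category below are hypothetical):
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// sourceHandle maps a source domain to its planned catch-all account handle,
+// e.g. "wsj.com" -> "wsj.com.1440.news".
+func sourceHandle(domain string) string {
+	return strings.ToLower(strings.TrimSuffix(domain, ".")) + ".1440.news"
+}
+
+// categoryHandle prefixes a category label for the planned per-category feeds,
+// e.g. ("markets", "wsj.com") -> "markets.wsj.com.1440.news".
+func categoryHandle(category, domain string) string {
+	return strings.ToLower(category) + "." + sourceHandle(domain)
+}
+
+func main() {
+	fmt.Println(sourceHandle("wsj.com"))              // wsj.com.1440.news
+	fmt.Println(categoryHandle("markets", "wsj.com")) // markets.wsj.com.1440.news
+}
+```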