# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Build & Run

```sh
go build -o 1440.news .   # Build
./1440.news               # Run (starts dashboard at http://localhost:4321)
go fmt ./...              # Format
go vet ./...              # Static analysis
```
Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.
## Architecture
Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in SQLite, and provides a web dashboard.
### Concurrent Loops (main.go)
The application runs five independent goroutine loops:
- Import loop - Reads `vertices.txt.gz` and inserts domains into the DB in 10k batches
- Crawl loop - Worker pool processes unchecked domains, discovers feeds
- Check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
- Stats loop - Updates cached dashboard statistics every minute
- Cleanup loop - Removes items older than 12 months (weekly)
### File Structure

| File | Purpose |
|---|---|
| `crawler.go` | `Crawler` struct, worker pools, page fetching, recursive crawl logic |
| `domain.go` | `Domain` struct, DB operations, vertices file import |
| `feed.go` | `Feed`/`Item` structs, DB operations, feed checking with HTTP caching |
| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
| `util.go` | URL normalization, host utilities, TLD extraction |
| `db.go` | SQLite schema (domains, feeds, items tables with FTS5) |
| `dashboard.go` | HTTP server, JSON APIs, HTML template |
### Database Schema

SQLite with WAL mode at `feeds/feeds.db`:

- `domains` - Hosts to crawl (status: unchecked/checked/error)
- `feeds` - Discovered RSS/Atom feeds with metadata and cache headers
- `items` - Individual feed entries (guid + feedUrl unique)
- `feeds_fts` / `items_fts` - FTS5 virtual tables for search
### Crawl Logic

- Domain picked from `unchecked` status (random order)
- Try HTTPS, fall back to HTTP
- Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
- Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed
- Parse discovered feeds for metadata, save with `nextCrawlAt`
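The anchor-href heuristic above can be sketched as follows (function name and exact keyword list are assumptions; the real detection lives in html.go):

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeFeed reports whether an href likely points at an RSS/Atom feed,
// using the keyword heuristic described above.
func looksLikeFeed(href string) bool {
	h := strings.ToLower(href)
	for _, kw := range []string{"rss", "atom", "feed"} {
		if strings.Contains(h, kw) {
			return true
		}
	}
	return false
}

func main() {
	for _, href := range []string{"/feed.xml", "/RSS", "/about"} {
		fmt.Println(href, looksLikeFeed(href))
	}
}
```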
### Feed Checking

Uses conditional HTTP (`ETag`, `If-Modified-Since`). Adaptive backoff: base 100s plus 100s per consecutive no-change check. Respects RSS `<ttl>` and Syndication namespace hints.
## AT Protocol Integration (Planned)

Domain: `1440.news`

User structure:

- `wehrv.1440.news` - Owner/admin account
- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
- `{category}.{domain}.1440.news` - Category-specific feeds (future)
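A sketch of deriving the planned catch-all handle from a source domain (the helper is hypothetical, since this feature is planned rather than implemented, and real AT Protocol handles also require DNS/PDS setup not shown here):

```go
package main

import (
	"fmt"
	"strings"
)

// catchAllHandle maps a source domain to its planned 1440.news handle,
// e.g. wsj.com -> wsj.com.1440.news. (Hypothetical helper.)
func catchAllHandle(domain string) string {
	d := strings.ToLower(strings.TrimSuffix(domain, "."))
	return d + ".1440.news"
}

func main() {
	fmt.Println(catchAllHandle("wsj.com")) // wsj.com.1440.news
}
```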
Phases:

1. Local PDS setup
2. Account management
3. Auto-create domain users
4. Post articles to accounts
5. Category detection