CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Build & Run

go build -o 1440.news .    # Build
./1440.news                 # Run (starts dashboard at http://localhost:4321)
go fmt ./...                # Format
go vet ./...                # Static analysis

Requires vertices.txt.gz (Common Crawl domain list) in the working directory.

Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in SQLite, and provides a web dashboard.

Concurrent Loops (main.go)

The application runs five independent goroutine loops (see the sketch after this list):

  • Import loop - Reads vertices.txt.gz and inserts domains into DB in 10k batches
  • Crawl loop - Worker pool processes unchecked domains, discovers feeds
  • Check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
  • Stats loop - Updates cached dashboard statistics every minute
  • Cleanup loop - Removes items older than 12 months (weekly)
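
A minimal sketch of how these loops might be wired up in main.go; all function names here are illustrative stand-ins, not the actual identifiers in the codebase:

package main

import (
    "context"
    "time"
)

// Illustrative stand-ins for the real loop bodies; names are assumed.
func importDomains(ctx context.Context)   {}
func crawlDomains(ctx context.Context)    {}
func checkFeeds(ctx context.Context)      {}
func refreshStats(ctx context.Context)    {}
func cleanupOldItems(ctx context.Context) {}

func main() {
    ctx := context.Background()

    go importDomains(ctx) // import loop: stream vertices.txt.gz into the DB
    go crawlDomains(ctx)  // crawl loop: worker pool over unchecked domains
    go checkFeeds(ctx)    // check loop: worker pool re-checking known feeds

    // Stats loop: refresh cached dashboard statistics every minute.
    go func() {
        for range time.Tick(time.Minute) {
            refreshStats(ctx)
        }
    }()

    // Cleanup loop: weekly purge of items older than 12 months.
    go func() {
        for range time.Tick(7 * 24 * time.Hour) {
            cleanupOldItems(ctx)
        }
    }()

    select {} // block forever; the real main also serves the dashboard
}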

File Structure

File          Purpose
crawler.go    Crawler struct, worker pools, page fetching, recursive crawl logic
domain.go     Domain struct, DB operations, vertices file import
feed.go       Feed/Item structs, DB operations, feed checking with HTTP caching
parser.go     RSS/Atom XML parsing, date parsing, next-crawl calculation
html.go       HTML parsing: feed link extraction, anchor feed detection
util.go       URL normalization, host utilities, TLD extraction
db.go         SQLite schema (domains, feeds, items tables with FTS5)
dashboard.go  HTTP server, JSON APIs, HTML template

Database Schema

SQLite with WAL mode at feeds/feeds.db (a schema sketch follows the list):

  • domains - Hosts to crawl (status: unchecked/checked/error)
  • feeds - Discovered RSS/Atom feeds with metadata and cache headers
  • items - Individual feed entries (guid + feedUrl unique)
  • feeds_fts / items_fts - FTS5 virtual tables for search
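
A sketch of what db.go plausibly sets up, assuming the mattn/go-sqlite3 driver. Only the table names, the status values, the guid+feedUrl uniqueness, and the FTS5 mirrors come from this document; the driver choice and every other column are assumptions:

package main

import (
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // driver choice is an assumption
)

// openDB sketches the schema described above; columns beyond those
// named in this document are illustrative.
func openDB() (*sql.DB, error) {
    db, err := sql.Open("sqlite3", "file:feeds/feeds.db?_journal_mode=WAL")
    if err != nil {
        return nil, err
    }
    _, err = db.Exec(`
        CREATE TABLE IF NOT EXISTS domains (
            host   TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'unchecked'  -- unchecked/checked/error
        );
        CREATE TABLE IF NOT EXISTS feeds (
            url          TEXT PRIMARY KEY,
            title        TEXT,
            etag         TEXT,     -- cache headers for conditional HTTP
            lastModified TEXT,
            nextCrawlAt  INTEGER
        );
        CREATE TABLE IF NOT EXISTS items (
            guid    TEXT NOT NULL,
            feedUrl TEXT NOT NULL,
            title   TEXT,
            UNIQUE (guid, feedUrl)  -- guid + feedUrl unique
        );
        -- FTS5 requires the sqlite_fts5 build tag with this driver.
        CREATE VIRTUAL TABLE IF NOT EXISTS feeds_fts USING fts5(title);
        CREATE VIRTUAL TABLE IF NOT EXISTS items_fts USING fts5(title);
    `)
    return db, err
}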

Crawl Logic

  1. Pick a domain with unchecked status (random order)
  2. Try HTTPS first; fall back to HTTP
  3. Crawl recursively up to MaxDepth=10 and MaxPagesPerHost=10
  4. Extract <link rel="alternate"> and anchor hrefs containing rss/atom/feed
  5. Parse discovered feeds for metadata; save with nextCrawlAt (sketched below)
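
A condensed sketch of steps 1-5. MaxDepth and MaxPagesPerHost match the limits above; the function names and single-goroutine structure are illustrative, not crawler.go's actual code:

package main

import "net/http"

const (
    MaxDepth        = 10 // limits from this document
    MaxPagesPerHost = 10
)

// extractLinks stands in for the real html.go logic: <link rel="alternate">
// tags and anchors whose href contains rss/atom/feed.
func extractLinks(pageURL string) []string { return nil }

// crawlDomain is an illustrative walk through steps 2-4.
func crawlDomain(host string) {
    base := "https://" + host
    if resp, err := http.Get(base); err != nil {
        base = "http://" + host // step 2: fall back to HTTP
    } else {
        resp.Body.Close()
    }

    pages := 0
    var crawl func(pageURL string, depth int)
    crawl = func(pageURL string, depth int) {
        if depth > MaxDepth || pages >= MaxPagesPerHost {
            return
        }
        pages++
        // Step 5 (parsing feeds, saving with nextCrawlAt) is elided here.
        for _, link := range extractLinks(pageURL) {
            crawl(link, depth+1)
        }
    }
    crawl(base, 0)
}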

Feed Checking

Uses conditional HTTP requests (ETag, If-Modified-Since). Adaptive backoff: a base interval of 100s plus 100s for each consecutive check that returns no changes. Respects RSS <ttl> and Syndication namespace hints.
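
A sketch of a single conditional check implementing the rules above; the Feed struct and its field names are illustrative, not feed.go's actual types:

package main

import (
    "net/http"
    "time"
)

// Feed holds the per-feed state needed for conditional HTTP and backoff.
type Feed struct {
    URL          string
    ETag         string
    LastModified string
    NoChangeRuns int // consecutive checks that returned 304
}

// checkFeed does one conditional fetch and returns the next check time.
func checkFeed(f *Feed) (time.Time, error) {
    req, err := http.NewRequest("GET", f.URL, nil)
    if err != nil {
        return time.Time{}, err
    }
    if f.ETag != "" {
        req.Header.Set("If-None-Match", f.ETag)
    }
    if f.LastModified != "" {
        req.Header.Set("If-Modified-Since", f.LastModified)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return time.Time{}, err
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusNotModified {
        f.NoChangeRuns++ // unchanged: back off further
    } else {
        f.NoChangeRuns = 0 // changed: reset the backoff
        f.ETag = resp.Header.Get("ETag")
        f.LastModified = resp.Header.Get("Last-Modified")
        // Parsing the body and upserting items is elided here.
    }

    // Base 100s + 100s per consecutive no-change; <ttl> and Syndication
    // hints would override this where present.
    backoff := time.Duration(100+100*f.NoChangeRuns) * time.Second
    return time.Now().Add(backoff), nil
}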

AT Protocol Integration (Planned)

Domain: 1440.news

User structure (sketched after the list):

  • wehrv.1440.news - Owner/admin account
  • {domain}.1440.news - Catch-all feed per source (e.g., wsj.com.1440.news)
  • {category}.{domain}.1440.news - Category-specific feeds (future)
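
Since this phase is still planned, a trivial illustration of the intended catch-all handle mapping (the function is hypothetical, not implemented):

// handleFor maps a source domain to its planned catch-all handle,
// e.g. handleFor("wsj.com") == "wsj.com.1440.news".
func handleFor(domain string) string {
    return domain + ".1440.news"
}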

Phases:

  1. Local PDS setup
  2. Account management
  3. Auto-create domain users
  4. Post articles to accounts
  5. Category detection