CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Build & Run

go build -o 1440.news .    # Build
./1440.news                 # Run (starts dashboard at http://localhost:4321)
go fmt ./...                # Format
go vet ./...                # Static analysis

Requires vertices.txt.gz (Common Crawl domain list) in the working directory.

Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in SQLite, and provides a web dashboard.

Concurrent Loops (main.go)

The application runs five independent goroutine loops (see the sketch after this list):

  • Import loop - Reads vertices.txt.gz and inserts domains into DB in 10k batches
  • Crawl loop - Worker pool processes unchecked domains, discovers feeds
  • Check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
  • Stats loop - Updates cached dashboard statistics every minute
  • Cleanup loop - Removes items older than 12 months (weekly)
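
A minimal sketch of how these loops might be wired up in main.go; all function names here are illustrative stand-ins, not the actual identifiers in the codebase:

package main

import (
    "context"
    "time"
)

// Illustrative stand-ins for the real loop bodies; names are assumed.
func importDomains(ctx context.Context)   {}
func crawlDomains(ctx context.Context)    {}
func checkFeeds(ctx context.Context)      {}
func refreshStats(ctx context.Context)    {}
func cleanupOldItems(ctx context.Context) {}

func main() {
    ctx := context.Background()

    go importDomains(ctx) // import loop: stream vertices.txt.gz into the DB
    go crawlDomains(ctx)  // crawl loop: worker pool over unchecked domains
    go checkFeeds(ctx)    // check loop: worker pool re-checking known feeds

    // Stats loop: refresh cached dashboard statistics every minute.
    go func() {
        for range time.Tick(time.Minute) {
            refreshStats(ctx)
        }
    }()

    // Cleanup loop: weekly purge of items older than 12 months.
    go func() {
        for range time.Tick(7 * 24 * time.Hour) {
            cleanupOldItems(ctx)
        }
    }()

    select {} // block forever; the real main also serves the dashboard
}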

File Structure

File          Purpose
crawler.go    Crawler struct, worker pools, page fetching, recursive crawl logic
domain.go     Domain struct, DB operations, vertices file import
feed.go       Feed/Item structs, DB operations, feed checking with HTTP caching
parser.go     RSS/Atom XML parsing, date parsing, next-crawl calculation
html.go       HTML parsing: feed link extraction, anchor feed detection
util.go       URL normalization, host utilities, TLD extraction
db.go         SQLite schema (domains, feeds, items tables with FTS5)
dashboard.go  HTTP server, JSON APIs, HTML template

Database Schema

SQLite with WAL mode at feeds/feeds.db (a schema sketch follows the list):

  • domains - Hosts to crawl (status: unchecked/checked/error)
  • feeds - Discovered RSS/Atom feeds with metadata and cache headers
  • items - Individual feed entries (guid + feedUrl unique)
  • feeds_fts / items_fts - FTS5 virtual tables for search
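
A sketch of what db.go plausibly sets up, assuming the mattn/go-sqlite3 driver. Only the table names, the status values, the guid+feedUrl uniqueness, and the FTS5 mirrors come from this document; the driver choice and every other column are assumptions:

package main

import (
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // driver choice is an assumption
)

// openDB sketches the schema described above; columns beyond those
// named in this document are illustrative.
func openDB() (*sql.DB, error) {
    db, err := sql.Open("sqlite3", "file:feeds/feeds.db?_journal_mode=WAL")
    if err != nil {
        return nil, err
    }
    _, err = db.Exec(`
        CREATE TABLE IF NOT EXISTS domains (
            host   TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'unchecked'  -- unchecked/checked/error
        );
        CREATE TABLE IF NOT EXISTS feeds (
            url          TEXT PRIMARY KEY,
            title        TEXT,
            etag         TEXT,     -- cache headers for conditional HTTP
            lastModified TEXT,
            nextCrawlAt  INTEGER
        );
        CREATE TABLE IF NOT EXISTS items (
            guid    TEXT NOT NULL,
            feedUrl TEXT NOT NULL,
            title   TEXT,
            UNIQUE (guid, feedUrl)  -- guid + feedUrl unique
        );
        -- FTS5 requires the sqlite_fts5 build tag with this driver.
        CREATE VIRTUAL TABLE IF NOT EXISTS feeds_fts USING fts5(title);
        CREATE VIRTUAL TABLE IF NOT EXISTS items_fts USING fts5(title);
    `)
    return db, err
}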

Crawl Logic

  1. Pick a domain with unchecked status (random order)
  2. Try HTTPS first; fall back to HTTP
  3. Crawl recursively up to MaxDepth=10 and MaxPagesPerHost=10
  4. Extract <link rel="alternate"> and anchor hrefs containing rss/atom/feed
  5. Parse discovered feeds for metadata; save with nextCrawlAt (sketched below)
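
A condensed sketch of steps 1-5. MaxDepth and MaxPagesPerHost match the limits above; the function names and single-goroutine structure are illustrative, not crawler.go's actual code:

package main

import "net/http"

const (
    MaxDepth        = 10 // limits from this document
    MaxPagesPerHost = 10
)

// extractLinks stands in for the real html.go logic: <link rel="alternate">
// tags and anchors whose href contains rss/atom/feed.
func extractLinks(pageURL string) []string { return nil }

// crawlDomain is an illustrative walk through steps 2-4.
func crawlDomain(host string) {
    base := "https://" + host
    if resp, err := http.Get(base); err != nil {
        base = "http://" + host // step 2: fall back to HTTP
    } else {
        resp.Body.Close()
    }

    pages := 0
    var crawl func(pageURL string, depth int)
    crawl = func(pageURL string, depth int) {
        if depth > MaxDepth || pages >= MaxPagesPerHost {
            return
        }
        pages++
        // Step 5 (parsing feeds, saving with nextCrawlAt) is elided here.
        for _, link := range extractLinks(pageURL) {
            crawl(link, depth+1)
        }
    }
    crawl(base, 0)
}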

Feed Checking

Uses conditional HTTP requests (ETag, If-Modified-Since). Adaptive backoff: a base interval of 100s plus 100s for each consecutive check that returns no changes. Respects RSS <ttl> and Syndication namespace hints.
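
A sketch of a single conditional check implementing the rules above; the Feed struct and its field names are illustrative, not feed.go's actual types:

package main

import (
    "net/http"
    "time"
)

// Feed holds the per-feed state needed for conditional HTTP and backoff.
type Feed struct {
    URL          string
    ETag         string
    LastModified string
    NoChangeRuns int // consecutive checks that returned 304
}

// checkFeed does one conditional fetch and returns the next check time.
func checkFeed(f *Feed) (time.Time, error) {
    req, err := http.NewRequest("GET", f.URL, nil)
    if err != nil {
        return time.Time{}, err
    }
    if f.ETag != "" {
        req.Header.Set("If-None-Match", f.ETag)
    }
    if f.LastModified != "" {
        req.Header.Set("If-Modified-Since", f.LastModified)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return time.Time{}, err
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusNotModified {
        f.NoChangeRuns++ // unchanged: back off further
    } else {
        f.NoChangeRuns = 0 // changed: reset the backoff
        f.ETag = resp.Header.Get("ETag")
        f.LastModified = resp.Header.Get("Last-Modified")
        // Parsing the body and upserting items is elided here.
    }

    // Base 100s + 100s per consecutive no-change; <ttl> and Syndication
    // hints would override this where present.
    backoff := time.Duration(100+100*f.NoChangeRuns) * time.Second
    return time.Now().Add(backoff), nil
}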

AT Protocol Integration (Planned)

Domain: 1440.news

User structure (sketched after the list):

  • wehrv.1440.news - Owner/admin account
  • {domain}.1440.news - Catch-all feed per source (e.g., wsj.com.1440.news)
  • {category}.{domain}.1440.news - Category-specific feeds (future)
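
Since this phase is still planned, a trivial illustration of the intended catch-all handle mapping (the function is hypothetical, not implemented):

// handleFor maps a source domain to its planned catch-all handle,
// e.g. handleFor("wsj.com") == "wsj.com.1440.news".
func handleFor(domain string) string {
    return domain + ".1440.news"
}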

Phases:

  1. Local PDS setup
  2. Account management
  3. Auto-create domain users
  4. Post articles to accounts
  5. Category detection