CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Build & Run

go build -o 1440.news .    # Build
./1440.news                 # Run (starts dashboard at http://localhost:4321)
go fmt ./...                # Format
go vet ./...                # Static analysis

Database Setup

Requires PostgreSQL. Start the database first:

cd ../postgres && docker compose up -d

Environment Variables

Set via environment or create a .env file:

# Database connection (individual vars)
DB_HOST=atproto-postgres    # Default: atproto-postgres
DB_PORT=5432                # Default: 5432
DB_USER=news_1440           # Default: news_1440
DB_PASSWORD=<password>      # Or use DB_PASSWORD_FILE
DB_NAME=news_1440           # Default: news_1440

# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable

For Docker, use DB_PASSWORD_FILE=/run/secrets/db_password with Docker secrets.
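
A minimal sketch of the Docker-secret lookup (resolvePassword is a hypothetical name; the actual lookup in this codebase may differ):

package main

import (
	"os"
	"strings"
)

// resolvePassword prefers DB_PASSWORD_FILE (a path to a Docker secret)
// over the plain DB_PASSWORD environment variable.
func resolvePassword() (string, error) {
	if path := os.Getenv("DB_PASSWORD_FILE"); path != "" {
		data, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		return strings.TrimSpace(string(data)), nil
	}
	return os.Getenv("DB_PASSWORD"), nil
}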

Requires vertices.txt.gz (Common Crawl domain list) in the working directory.

Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.

Concurrent Loops (main.go)

The application runs six independent goroutine loops (a minimal launch sketch follows this list):

  • Import loop - Reads vertices.txt.gz and inserts domains into DB in 10k batches
  • Crawl loop - Worker pool processes unchecked domains, discovers feeds
  • Check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
  • Stats loop - Updates cached dashboard statistics every minute
  • Cleanup loop - Removes items older than 12 months (weekly)
  • Publish loop - Autopublishes items from approved feeds to AT Protocol PDS
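
A minimal sketch of one such loop (runLoop and the ticker shape are illustrative, not the exact code in main.go):

package main

import (
	"context"
	"time"
)

// runLoop calls fn immediately, then on a fixed interval until ctx is cancelled.
func runLoop(ctx context.Context, interval time.Duration, fn func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		fn(ctx)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

Each loop would then be started with its own interval, e.g. go runLoop(ctx, time.Minute, updateStats) for the stats loop (updateStats stands in for the real stats function).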

File Structure

File          Purpose
crawler.go    Crawler struct, worker pools, page fetching, recursive crawl logic
domain.go     Domain struct, DB operations, vertices file import
feed.go       Feed/Item structs, DB operations, feed checking with HTTP caching
parser.go     RSS/Atom XML parsing, date parsing, next-crawl calculation
html.go       HTML parsing: feed link extraction, anchor feed detection
util.go       URL normalization, host utilities, TLD extraction
db.go         PostgreSQL schema (domains, feeds, items tables with tsvector FTS)
dashboard.go  HTTP server, JSON APIs, HTML template
publisher.go  AT Protocol PDS integration for posting items

Database Schema

PostgreSQL with pgx driver, using connection pooling:

  • domains - Hosts to crawl (status: unchecked/checked/error)
  • feeds - Discovered RSS/Atom feeds with metadata and cache headers
  • items - Individual feed entries (guid + feed_url unique)
  • Full-text search via GENERATED search_vector tsvector columns (GIN indexed)

Column naming: snake_case (e.g., source_host, pub_date, item_count)
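
An illustrative DDL fragment consistent with the notes above (only the guid/feed_url uniqueness and the generated search_vector are documented; the remaining columns are assumptions, and db.go is authoritative):

package main

// createItems is an illustrative subset of the items table schema.
const createItems = `
CREATE TABLE IF NOT EXISTS items (
    feed_url      TEXT NOT NULL,
    guid          TEXT NOT NULL,
    title         TEXT,
    description   TEXT,
    pub_date      TIMESTAMPTZ,
    search_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
    ) STORED,
    UNIQUE (guid, feed_url)
);
CREATE INDEX IF NOT EXISTS items_search_idx ON items USING GIN (search_vector);`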

Crawl Logic

  1. A domain with unchecked status is picked (random order)
  2. Try HTTPS, fall back to HTTP (see the sketch after this list)
  3. Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
  4. Extract <link rel="alternate"> and anchor hrefs containing rss/atom/feed
  5. Parse discovered feeds for metadata, save with next_crawl_at
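
A minimal sketch of the HTTPS-first fetch from step 2 (fetchPage and the timeout are illustrative; crawler.go holds the real logic):

package main

import (
	"net/http"
	"time"
)

var client = &http.Client{Timeout: 30 * time.Second}

// fetchPage tries https:// first and falls back to http:// on error.
func fetchPage(host, path string) (*http.Response, error) {
	resp, err := client.Get("https://" + host + path)
	if err == nil {
		return resp, nil
	}
	return client.Get("http://" + host + path)
}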

Feed Checking

Uses conditional HTTP requests (ETag, If-Modified-Since). Adaptive backoff: a 100s base plus 100s per consecutive unchanged check. Respects RSS <ttl> and Syndication namespace hints.
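
A sketch of both pieces: the conditional GET and the backoff formula (function names are illustrative; feed.go may structure this differently):

package main

import (
	"net/http"
	"time"
)

// nextCheckDelay implements the backoff described above:
// a 100s base plus 100s per consecutive unchanged check.
func nextCheckDelay(consecutiveNoChange int) time.Duration {
	return time.Duration(100+100*consecutiveNoChange) * time.Second
}

// conditionalGet sends cached validators; a 304 response means the feed is unchanged.
func conditionalGet(url, etag, lastModified string) (*http.Response, bool, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, false, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, false, err
	}
	return resp, resp.StatusCode == http.StatusNotModified, nil
}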

Publishing

Feeds with publish_status = 'pass' have their items automatically posted to AT Protocol. Status values: held (default), pass (approved), deny (rejected).
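
An illustrative query the publish loop might run (the feeds.url join key and the posted_at column are assumptions; publisher.go may track posted state differently):

package main

// selectPublishable picks unposted items from approved feeds.
const selectPublishable = `
SELECT i.guid, i.feed_url, i.title
FROM items i
JOIN feeds f ON f.url = i.feed_url
WHERE f.publish_status = 'pass'
  AND i.posted_at IS NULL
ORDER BY i.pub_date
LIMIT $1`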

AT Protocol Integration

Domain: 1440.news

User structure:

  • wehrv.1440.news - Owner/admin account
  • {domain}.1440.news - Catch-all feed per source (e.g., wsj.com.1440.news; see the sketch after this list)
  • {category}.{domain}.1440.news - Category-specific feeds (future)
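
A sketch of the catch-all handle mapping (catchAllHandle is a hypothetical helper; stripping a leading www. mirrors the host normalization in util.go):

package main

import "strings"

// catchAllHandle maps a source host to its per-source account handle,
// e.g. "www.wsj.com" -> "wsj.com.1440.news".
func catchAllHandle(host string) string {
	host = strings.TrimPrefix(strings.ToLower(host), "www.")
	return host + ".1440.news"
}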

PDS configuration in pds.env:

PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>