CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Note: Always run applications in containers via docker compose up -d --build when possible. This ensures proper networking between services (database, traefik, etc.) and matches the production environment.

Build & Run

go build -o 1440.news .    # Build
./1440.news                 # Run (starts dashboard at http://localhost:4321)
go fmt ./...                # Format
go vet ./...                # Static analysis

Database Setup

Requires PostgreSQL. Start the database first:

cd ../postgres && docker compose up -d

Environment Variables

Set via environment or create a .env file:

# Database connection (individual vars)
DB_HOST=atproto-postgres    # Default: atproto-postgres
DB_PORT=5432                # Default: 5432
DB_USER=news_1440           # Default: news_1440
DB_PASSWORD=<password>      # Or use DB_PASSWORD_FILE
DB_NAME=news_1440           # Default: news_1440

# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable

For Docker, use DB_PASSWORD_FILE=/run/secrets/db_password with Docker secrets.
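
A minimal sketch of how that precedence could be resolved in Go, assuming a hypothetical loadDBPassword helper (not the actual code): when DB_PASSWORD_FILE is set, the password is read from the secrets file; otherwise DB_PASSWORD is used directly.

import (
	"os"
	"strings"
)

// loadDBPassword is an illustrative helper: prefer the Docker secrets file
// (DB_PASSWORD_FILE, e.g. /run/secrets/db_password) over the plain env var.
func loadDBPassword() (string, error) {
	if path := os.Getenv("DB_PASSWORD_FILE"); path != "" {
		b, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		return strings.TrimSpace(string(b)), nil
	}
	return os.Getenv("DB_PASSWORD"), nil
}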

Requires vertices.txt.gz (Common Crawl domain list) in the working directory.

Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.

Concurrent Loops (main.go)

The application runs six independent goroutine loops:

  • Import loop - Reads vertices.txt.gz and inserts domains into DB in batches of 100 (status='pass')
  • Crawl loop - Worker pool crawls approved domains for feed discovery
  • Feed check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
  • Stats loop - Updates cached dashboard statistics every minute
  • Cleanup loop - Removes items older than 12 months (weekly)
  • Publish loop - Autopublishes items from approved feeds to AT Protocol PDS
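
Each of the loops above follows the same ticker-driven pattern; a rough sketch (function names and intervals here are illustrative, not the actual signatures in main.go):

import (
	"context"
	"time"
)

// runLoop illustrates the pattern: each of the six loops runs in its own
// goroutine and repeats its work on a fixed interval until the context ends.
func runLoop(ctx context.Context, interval time.Duration, work func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		work(ctx)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// Example wiring (placeholder intervals and function names):
//   go runLoop(ctx, time.Minute, statsLoop)      // stats refresh every minute
//   go runLoop(ctx, 7*24*time.Hour, cleanupLoop) // weekly cleanup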

File Structure

File                 Purpose
crawler.go           Crawler struct, worker pools, page fetching, recursive crawl logic
domain.go            Domain struct, DB operations, vertices file import
feed.go              Feed/Item structs, DB operations, feed checking with HTTP caching
parser.go            RSS/Atom XML parsing, date parsing, next-crawl calculation
html.go              HTML parsing: feed link extraction, anchor feed detection
util.go              URL normalization, host utilities, TLD extraction
db.go                PostgreSQL schema (domains, feeds, items tables with tsvector FTS)
dashboard.go         HTTP server, JSON APIs, HTML template
publisher.go         AT Protocol PDS integration for posting items
oauth.go             OAuth 2.0 client wrapper for AT Protocol authentication
oauth_session.go     Session management with AES-256-GCM encrypted cookies
oauth_middleware.go  RequireAuth middleware for protecting routes
oauth_handlers.go    OAuth HTTP endpoints (login, callback, logout, metadata)
routes.go            HTTP route registration with auth middleware

Database Schema

PostgreSQL with pgx driver, using connection pooling:

  • domains - Hosts to crawl (status: hold/pass/skip)
  • feeds - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
  • items - Individual feed entries (guid + feed_url unique)
  • search_vector - GENERATED tsvector columns for full-text search (GIN indexed)

Column naming: snake_case (e.g., source_host, pub_date, item_count)
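
For reference, a GENERATED tsvector column with a GIN index looks roughly like the DDL below (illustrative, embedded as a Go constant; the exact column names and schema live in db.go):

// Illustrative full-text search DDL; column names are examples only.
const itemsSearchDDL = `
ALTER TABLE items
    ADD COLUMN IF NOT EXISTS search_vector tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
    ) STORED;

CREATE INDEX IF NOT EXISTS items_search_idx ON items USING GIN (search_vector);
`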

Crawl Logic

  1. Domains import as pass by default (auto-crawled)
  2. Crawl loop picks up domains where last_crawled_at IS NULL
  3. Full recursive crawl (HTTPS first, falling back to HTTP) up to MaxDepth=10, MaxPagesPerHost=10
  4. Extract <link rel="alternate"> and anchor hrefs containing rss/atom/feed
  5. Parse discovered feeds for metadata, save with next_crawl_at
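
A condensed sketch of the depth- and page-limited crawl described above (function and constant names are illustrative; the real logic lives in crawler.go and html.go):

// crawlHost illustrates the bounded recursive crawl: visit at most
// maxPagesPerHost pages, no deeper than maxDepth, collecting candidate
// feed URLs discovered on each page.
const (
	maxDepth        = 10
	maxPagesPerHost = 10
)

// fetch is assumed to download and parse one page, returning feed URLs and
// same-host links found on it.
func crawlHost(startURL string, fetch func(string) (feeds, links []string, err error)) []string {
	visited := map[string]bool{}
	var found []string
	var walk func(u string, depth int)
	walk = func(u string, depth int) {
		if depth > maxDepth || len(visited) >= maxPagesPerHost || visited[u] {
			return
		}
		visited[u] = true
		feeds, links, err := fetch(u)
		if err != nil {
			return
		}
		found = append(found, feeds...)
		for _, l := range links {
			walk(l, depth+1)
		}
	}
	walk(startURL, 0)
	return found
}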

Feed Checking

Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 100s per consecutive no-change. Respects RSS <ttl> and Syndication namespace hints.
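
A minimal sketch of a conditional fetch plus the backoff rule (helper names are illustrative, not the actual code in feed.go):

import (
	"net/http"
	"time"
)

// checkFeed sends the cached ETag and Last-Modified values and treats
// 304 Not Modified as "no change". The caller closes resp.Body.
func checkFeed(client *http.Client, url, etag, lastModified string) (changed bool, resp *http.Response, err error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, nil, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	resp, err = client.Do(req)
	if err != nil {
		return false, nil, err
	}
	return resp.StatusCode != http.StatusNotModified, resp, nil
}

// nextCheckDelay applies the adaptive backoff described above:
// base 100s plus 100s per consecutive unchanged check.
func nextCheckDelay(consecutiveNoChange int) time.Duration {
	return time.Duration(100+100*consecutiveNoChange) * time.Second
}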

Publishing

Feeds with publish_status = 'pass' have their items automatically posted to AT Protocol. Status values: hold (default/pending review), pass (approved), skip (rejected).
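
The selection step can be pictured as a single query over the two status columns (illustrative SQL embedded as a Go constant; the column and table details are assumptions, the real query lives in publisher.go):

// Illustrative publish-loop query: pick not-yet-published items whose feed
// has publish_status = 'pass'. Join and column names are examples only.
const selectPublishableSQL = `
SELECT i.guid, i.feed_url, i.title, i.link
FROM items i
JOIN feeds f ON f.url = i.feed_url
WHERE f.publish_status = 'pass'
  AND i.published_at IS NULL
ORDER BY i.pub_date
LIMIT 100;
`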

Domain Processing

Domain status values:

  • pass (default on import) - Domain is crawled and checked automatically
  • hold (manual) - Pauses crawling, keeps existing feeds and items
  • skip (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
  • drop (manual, via button) - Permanently deletes all feeds, items, and PDS accounts (requires skip first)

Note: Errors during check/crawl are recorded in last_error but do not change the domain status.

Skip vs Drop:

  • skip is reversible - use "un-skip" to restore accounts and resume publishing
  • drop is permanent - all data is deleted and cannot be recovered

Auto-skip patterns (imported as skip): bare TLDs, domains starting with a digit, domains starting with letter-dash. Non-English feeds are auto-skipped.

AT Protocol Integration

Domain: 1440.news

User structure:

  • wehrv.1440.news - Owner/admin account
  • {domain}.1440.news - Catch-all feed per source (e.g., wsj.com.1440.news)
  • {category}.{domain}.1440.news - Category-specific feeds (future)

PDS configuration in pds.env:

PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>

Dashboard Authentication

The dashboard is protected by AT Protocol OAuth 2.0. Only the @1440.news handle can access it.

OAuth Setup

  1. Generate configuration:

    go run ./cmd/genkey
    
  2. Create oauth.env with the generated values:

    OAUTH_COOKIE_SECRET=<generated_hex_string>
    OAUTH_PRIVATE_JWK=<generated_jwk_json>
    
  3. Optionally set the base URL (defaults to https://app.1440.news):

    OAUTH_BASE_URL=https://app.1440.news
    

OAuth Flow

  1. User navigates to /dashboard -> redirected to /auth/login
  2. User enters their Bluesky handle
  3. User is redirected to Bluesky authorization
  4. After approval, callback verifies handle is 1440.news
  5. Session cookie is set, user redirected to dashboard

OAuth Endpoints

  • /.well-known/oauth-client-metadata - Client metadata (public)
  • /.well-known/jwks.json - Public JWK set (public)
  • /auth/login - Login page / initiates OAuth flow
  • /auth/callback - OAuth callback handler
  • /auth/logout - Clears session
  • /auth/session - Returns current session info (JSON)
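
Route registration in routes.go pairs protected handlers with the RequireAuth middleware; a sketch of the shape, with handler and parameter names as placeholders:

import "net/http"

// registerRoutes illustrates how the public OAuth endpoints and the
// protected dashboard could be wired together.
func registerRoutes(mux *http.ServeMux, requireAuth func(http.Handler) http.Handler,
	metadata, jwks, login, callback, logout, session, dashboard http.Handler) {
	// Public OAuth endpoints.
	mux.Handle("/.well-known/oauth-client-metadata", metadata)
	mux.Handle("/.well-known/jwks.json", jwks)
	mux.Handle("/auth/login", login)
	mux.Handle("/auth/callback", callback)
	mux.Handle("/auth/logout", logout)
	mux.Handle("/auth/session", session)

	// The dashboard requires a valid session cookie.
	mux.Handle("/dashboard", requireAuth(dashboard))
}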

Security Notes

  • Tokens are stored server-side only (BFF pattern)
  • Browser only receives encrypted session cookie (AES-256-GCM)
  • Access restricted to single handle (1440.news)
  • Sessions expire after 24 hours
  • Automatic token refresh when within 5 minutes of expiry
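
For reference, sealing a session value with AES-256-GCM before writing it into the cookie looks roughly like this (a sketch, not the exact code in oauth_session.go):

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"io"
)

// sealSession encrypts the serialized session with AES-256-GCM using a
// 32-byte key (e.g. decoded from OAUTH_COOKIE_SECRET) and returns a
// base64 string suitable for a cookie value.
func sealSession(key, plaintext []byte) (string, error) {
	block, err := aes.NewCipher(key) // 32-byte key selects AES-256
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nonce, nonce, plaintext, nil) // nonce prepended to ciphertext
	return base64.RawURLEncoding.EncodeToString(sealed), nil
}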