CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

IMPORTANT: Always use ./.launch.sh to deploy changes. This script updates version numbers in static files (CSS/JS cache busting) before running docker compose up -d --build. Never use docker compose directly.

Build & Run

go build -o 1440.news .    # Build
./1440.news                 # Run (starts dashboard at http://localhost:4321)
go fmt ./...                # Format
go vet ./...                # Static analysis

Database Setup

Requires PostgreSQL. Start the database first:

cd ../postgres && docker compose up -d

Environment Variables

Set via environment or create a .env file:

# Database connection (individual vars)
DB_HOST=atproto-postgres    # Default: atproto-postgres
DB_PORT=5432                # Default: 5432
DB_USER=news_1440           # Default: news_1440
DB_PASSWORD=<password>      # Or use DB_PASSWORD_FILE
DB_NAME=news_1440           # Default: news_1440

# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable

For Docker, use DB_PASSWORD_FILE=/run/secrets/db_password with Docker secrets.
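
For illustration, the connection settings could be resolved roughly as in the sketch below. This is not the crawler's actual code; it assumes DATABASE_URL takes precedence over the individual variables and that DB_PASSWORD_FILE overrides DB_PASSWORD.

import (
    "fmt"
    "net/url"
    "os"
    "strings"
)

// resolveDSN builds a PostgreSQL connection string from the environment.
// Sketch only: the crawler's real resolution order may differ.
func resolveDSN() (string, error) {
    if dsn := os.Getenv("DATABASE_URL"); dsn != "" {
        return dsn, nil // a full connection string wins
    }
    getenv := func(key, def string) string {
        if v := os.Getenv(key); v != "" {
            return v
        }
        return def
    }
    password := os.Getenv("DB_PASSWORD")
    if file := os.Getenv("DB_PASSWORD_FILE"); file != "" {
        data, err := os.ReadFile(file) // Docker secret
        if err != nil {
            return "", fmt.Errorf("read DB_PASSWORD_FILE: %w", err)
        }
        password = strings.TrimSpace(string(data))
    }
    u := url.URL{
        Scheme:   "postgres",
        User:     url.UserPassword(getenv("DB_USER", "news_1440"), password),
        Host:     getenv("DB_HOST", "atproto-postgres") + ":" + getenv("DB_PORT", "5432"),
        Path:     "/" + getenv("DB_NAME", "news_1440"),
        RawQuery: "sslmode=disable",
    }
    return u.String(), nil
}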

Requires vertices.txt.gz (Common Crawl domain list) in the working directory.

Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.

Concurrent Loops (main.go)

The application runs five independent goroutine loops:

  • Import loop - Reads vertices.txt.gz and inserts domains into DB in batches of 100 (status='pass')
  • Crawl loop - Worker pool crawls approved domains for feed discovery
  • Feed check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
  • Stats loop - Updates cached dashboard statistics every minute
  • Cleanup loop - Removes items older than 12 months (weekly)

Note: Publishing is handled by the separate publish service.
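
The five loops above could be wired up roughly as sketched below; the loop type, function names, and intervals are illustrative, not the actual identifiers in main.go.

import (
    "context"
    "sync"
    "time"
)

// loop describes one background task: how often it runs and its body.
type loop struct {
    name     string
    interval time.Duration
    run      func(context.Context)
}

// runLoops starts each loop in its own goroutine and blocks until ctx is done.
// Each loop runs once immediately, then again on every tick.
func runLoops(ctx context.Context, loops []loop) {
    var wg sync.WaitGroup
    for _, l := range loops {
        l := l
        wg.Add(1)
        go func() {
            defer wg.Done()
            t := time.NewTicker(l.interval)
            defer t.Stop()
            for {
                l.run(ctx)
                select {
                case <-ctx.Done():
                    return
                case <-t.C:
                }
            }
        }()
    }
    wg.Wait()
}

With this shape, main.go would register the import, crawl, feed check, stats, and cleanup loops with their own intervals (e.g. one minute for stats, weekly for cleanup).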

File Structure

File                  Purpose
crawler.go            Crawler struct, worker pools, page fetching, recursive crawl logic
domain.go             Domain struct, DB operations, vertices file import
feed.go               Feed/Item structs, DB operations, feed checking with HTTP caching
parser.go             RSS/Atom XML parsing, date parsing, next-crawl calculation
html.go               HTML parsing: feed link extraction, anchor feed detection
util.go               URL normalization, host utilities, TLD extraction
db.go                 PostgreSQL schema (domains, feeds, items tables with tsvector FTS)
dashboard.go          HTTP server, JSON APIs, HTML template
oauth.go              OAuth 2.0 client wrapper for AT Protocol authentication
oauth_session.go      Session management with AES-256-GCM encrypted cookies
oauth_middleware.go   RequireAuth middleware for protecting routes
oauth_handlers.go     OAuth HTTP endpoints (login, callback, logout, metadata)
routes.go             HTTP route registration with auth middleware

Database Schema

PostgreSQL with pgx driver, using connection pooling:

  • domains - Hosts to crawl (status: hold/pass/skip)
  • feeds - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
  • items - Individual feed entries (guid + feed_url unique)
  • search_vector - GENERATED tsvector columns for full-text search (GIN indexed)

Column naming: snake_case (e.g., source_host, pub_date, item_count)
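
As a concrete illustration of the FTS setup, the items table could look roughly like the DDL below. Columns other than guid, feed_url, pub_date, and search_vector are assumptions; the real schema in db.go may differ.

// Rough shape of the items table; the search_vector column is generated
// from text fields and indexed with GIN for full-text search.
const createItemsSketch = `
CREATE TABLE IF NOT EXISTS items (
    id            BIGSERIAL PRIMARY KEY,
    feed_url      TEXT NOT NULL,
    guid          TEXT NOT NULL,
    title         TEXT,
    description   TEXT,
    link          TEXT,
    pub_date      TIMESTAMPTZ,
    search_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
    ) STORED,
    UNIQUE (guid, feed_url)
);
CREATE INDEX IF NOT EXISTS items_search_idx ON items USING GIN (search_vector);
`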

Processing Terminology

  • domain_check: DNS lookup to verify domain is live
  • feed_crawl: Crawl a live domain to discover RSS/Atom feeds
  • feed_check: Check a known feed for new items

Domain Processing Flow

  1. Domains import as pass by default
  2. Domain loop runs domain_check (DNS lookup) for unchecked domains
  3. Domain loop runs feed_crawl for checked domains (recursive crawl up to MaxDepth=10, MaxPagesPerHost=10)
  4. Extract <link rel="alternate"> and anchor hrefs containing rss/atom/feed
  5. Parse discovered feeds for metadata, save with next_check_at
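
A minimal sketch of the extraction in step 4, using golang.org/x/net/html; the function below is illustrative and html.go may implement this differently.

import (
    "strings"

    "golang.org/x/net/html"
)

// extractFeedLinks walks a parsed HTML document and collects candidate feed
// URLs: <link rel="alternate"> elements with an RSS/Atom type, plus anchor
// hrefs containing rss/atom/feed. Sketch only.
func extractFeedLinks(doc *html.Node) []string {
    var out []string
    var walk func(*html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode {
            attrs := map[string]string{}
            for _, a := range n.Attr {
                attrs[strings.ToLower(a.Key)] = a.Val
            }
            switch n.Data {
            case "link":
                if strings.EqualFold(attrs["rel"], "alternate") &&
                    (strings.Contains(attrs["type"], "rss") || strings.Contains(attrs["type"], "atom")) {
                    out = append(out, attrs["href"])
                }
            case "a":
                href := strings.ToLower(attrs["href"])
                if strings.Contains(href, "rss") || strings.Contains(href, "atom") || strings.Contains(href, "feed") {
                    out = append(out, attrs["href"])
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)
    return out
}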

Feed Checking

feed_check uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 100s per consecutive no-change. Respects RSS <ttl> and Syndication namespace hints.
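
A sketch of that check with the backoff applied afterwards. The Feed field names are assumptions, and the <ttl> handling is one reasonable reading of "respects ... hints"; feed.go's actual logic may differ.

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// Feed holds only the fields this sketch needs; the real struct is richer.
type Feed struct {
    URL           string
    ETag          string
    LastModified  string
    NoChangeCount int
    TTLMinutes    int
    NextCheckAt   time.Time
}

// checkFeed issues a conditional GET and computes the next check time.
func checkFeed(ctx context.Context, client *http.Client, f *Feed) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, f.URL, nil)
    if err != nil {
        return err
    }
    if f.ETag != "" {
        req.Header.Set("If-None-Match", f.ETag)
    }
    if f.LastModified != "" {
        req.Header.Set("If-Modified-Since", f.LastModified)
    }
    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    switch resp.StatusCode {
    case http.StatusNotModified:
        f.NoChangeCount++ // nothing new: back off further
    case http.StatusOK:
        f.NoChangeCount = 0
        f.ETag = resp.Header.Get("ETag")
        f.LastModified = resp.Header.Get("Last-Modified")
        // ... parse the body and upsert new items here ...
    default:
        return fmt.Errorf("unexpected status %d", resp.StatusCode)
    }

    // Base 100s plus 100s per consecutive unchanged check.
    delay := time.Duration(100*(1+f.NoChangeCount)) * time.Second
    // If the feed advertises a longer <ttl> (minutes), honor it.
    if ttl := time.Duration(f.TTLMinutes) * time.Minute; ttl > delay {
        delay = ttl
    }
    f.NextCheckAt = time.Now().Add(delay)
    return nil
}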

Publishing

Feeds with publish_status = 'pass' have their items automatically posted to AT Protocol by the separate publish service. Status values: hold (default/pending review), pass (approved), skip (rejected).

Domain Processing

Domain status values:

  • pass (default on import) - Domain is crawled and checked automatically
  • hold (manual) - Pauses crawling, keeps existing feeds and items
  • skip (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
  • drop (manual, via button) - Permanently deletes all feeds, items, and PDS accounts (requires skip first)

Note: Errors during check/crawl are recorded in last_error but do not change the domain status.

Skip vs Drop:

  • skip is reversible - use "un-skip" to restore accounts and resume publishing
  • drop is permanent - all data is deleted and cannot be recovered

Auto-skip patterns (imported as skip): bare TLDs, domains starting with a digit, domains starting with letter-dash. Non-English feeds are also auto-skipped.
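
A rough sketch of how those auto-skip rules could be expressed. This is one plausible reading of the patterns above (e.g. treating a bare TLD as a host with no dot), not the crawler's actual rules.

import (
    "regexp"
    "strings"
)

var (
    startsWithDigit      = regexp.MustCompile(`^[0-9]`)
    startsWithLetterDash = regexp.MustCompile(`^[a-z]-`)
)

// shouldAutoSkip flags hosts matching the auto-skip patterns. Sketch only.
func shouldAutoSkip(host string) bool {
    host = strings.ToLower(host)
    if !strings.Contains(host, ".") { // bare TLD such as "com"
        return true
    }
    return startsWithDigit.MatchString(host) || startsWithLetterDash.MatchString(host)
}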

AT Protocol Integration

Domain: 1440.news

User structure:

  • wehrv.1440.news - Owner/admin account
  • {domain}.1440.news - Catch-all feed per source (e.g., wsj.com.1440.news)
  • {category}.{domain}.1440.news - Category-specific feeds (future)

PDS configuration in pds.env:

PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>

Dashboard Authentication

The dashboard is protected by AT Protocol OAuth 2.0. Only the @1440.news handle can access it.

OAuth Setup

  1. Generate configuration:

    go run ./cmd/genkey
    
  2. Create oauth.env with the generated values:

    OAUTH_COOKIE_SECRET=<generated_hex_string>
    OAUTH_PRIVATE_JWK=<generated_jwk_json>
    
  3. Optionally set the base URL (defaults to https://app.1440.news):

    OAUTH_BASE_URL=https://app.1440.news
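
For reference, a generator along the lines of cmd/genkey might produce a random hex cookie secret and a P-256 private key serialized as a JWK, roughly as sketched below; the actual genkey output format (extra JWK fields such as kid or alg) may differ.

package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "encoding/base64"
    "encoding/hex"
    "encoding/json"
    "fmt"
    "log"
)

func main() {
    // 32 random bytes, hex-encoded, for the AES-256-GCM cookie key.
    secret := make([]byte, 32)
    if _, err := rand.Read(secret); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("OAUTH_COOKIE_SECRET=%s\n", hex.EncodeToString(secret))

    // ES256 (P-256) private key encoded as a JWK.
    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        log.Fatal(err)
    }
    b64 := func(b []byte) string { return base64.RawURLEncoding.EncodeToString(b) }
    jwk := map[string]string{
        "kty": "EC",
        "crv": "P-256",
        "x":   b64(key.PublicKey.X.FillBytes(make([]byte, 32))),
        "y":   b64(key.PublicKey.Y.FillBytes(make([]byte, 32))),
        "d":   b64(key.D.FillBytes(make([]byte, 32))),
    }
    out, _ := json.Marshal(jwk)
    fmt.Printf("OAUTH_PRIVATE_JWK=%s\n", out)
}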
    

OAuth Flow

  1. User navigates to /dashboard -> redirected to /auth/login
  2. User enters their Bluesky handle
  3. User is redirected to Bluesky authorization
  4. After approval, callback verifies handle is 1440.news
  5. Session cookie is set, user redirected to dashboard
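
Step 1 corresponds to the RequireAuth middleware. A bare-bones sketch follows; the cookie name is an assumption, and the real middleware also decrypts and validates the session payload.

import "net/http"

// requireAuth redirects requests without a session cookie to the login page.
// Sketch only: the real middleware decrypts the AES-256-GCM payload and
// verifies the handle and expiry before passing the request through.
func requireAuth(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if _, err := r.Cookie("session"); err != nil {
            http.Redirect(w, r, "/auth/login", http.StatusFound)
            return
        }
        next.ServeHTTP(w, r)
    })
}

routes.go would then wrap protected routes in this middleware, e.g. mux.Handle("/dashboard", requireAuth(dashboardHandler)) (names here are from the sketch, not the actual code).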

OAuth Endpoints

  • /.well-known/oauth-client-metadata - Client metadata (public)
  • /.well-known/jwks.json - Public JWK set (public)
  • /auth/login - Login page / initiates OAuth flow
  • /auth/callback - OAuth callback handler
  • /auth/logout - Clears session
  • /auth/session - Returns current session info (JSON)

Security Notes

  • Tokens are stored server-side only (BFF pattern)
  • Browser only receives encrypted session cookie (AES-256-GCM)
  • Access restricted to single handle (1440.news)
  • Sessions expire after 24 hours
  • Automatic token refresh when within 5 minutes of expiry
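
A compact sketch of the AES-256-GCM cookie sealing described above, keyed by the decoded OAUTH_COOKIE_SECRET; the actual cookie layout in oauth_session.go may differ.

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "encoding/base64"
    "errors"
)

// seal encrypts a session payload and returns a cookie-safe base64 string of
// nonce||ciphertext. key must be 32 bytes for AES-256.
func seal(key, payload []byte) (string, error) {
    block, err := aes.NewCipher(key)
    if err != nil {
        return "", err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return "", err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return "", err
    }
    return base64.RawURLEncoding.EncodeToString(gcm.Seal(nonce, nonce, payload, nil)), nil
}

// open authenticates and decrypts a cookie value produced by seal.
func open(key []byte, cookie string) ([]byte, error) {
    raw, err := base64.RawURLEncoding.DecodeString(cookie)
    if err != nil {
        return nil, err
    }
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    if len(raw) < gcm.NonceSize() {
        return nil, errors.New("cookie too short")
    }
    return gcm.Open(nil, raw[:gcm.NonceSize()], raw[gcm.NonceSize():], nil)
}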