# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
IMPORTANT: Always use `./.launch.sh` to deploy changes. This script updates version numbers in static files (CSS/JS cache busting) before running `docker compose up -d --build`. Never use `docker compose` directly.
## Build & Run

```bash
go build -o 1440.news .   # Build
./1440.news               # Run (starts dashboard at http://localhost:4321)
go fmt ./...              # Format
go vet ./...              # Static analysis
```
## Database Setup

Requires PostgreSQL. Start the database first:

```bash
cd ../postgres && docker compose up -d
```
## Environment Variables

Set via environment or create a `.env` file:

```bash
# Database connection (individual vars)
DB_HOST=atproto-postgres   # Default: atproto-postgres
DB_PORT=5432               # Default: 5432
DB_USER=news_1440          # Default: news_1440
DB_PASSWORD=<password>     # Or use DB_PASSWORD_FILE
DB_NAME=news_1440          # Default: news_1440

# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable
```
For Docker, use `DB_PASSWORD_FILE=/run/secrets/db_password` with Docker secrets.
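A minimal sketch of how these settings might be assembled into a connection string. The helper names (`envOr`, `loadDBPassword`, `buildDSN`) and the DSN layout are assumptions for illustration, not the repo's actual code; it honors `DATABASE_URL`, the individual variables, and `DB_PASSWORD_FILE`:

```go
// Illustrative only: helper names and the DSN layout are assumptions.
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

// envOr returns the variable's value or a default when it is unset.
func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// loadDBPassword prefers DB_PASSWORD_FILE (Docker secrets) over DB_PASSWORD.
func loadDBPassword() (string, error) {
	if path := os.Getenv("DB_PASSWORD_FILE"); path != "" {
		b, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		return strings.TrimSpace(string(b)), nil
	}
	return os.Getenv("DB_PASSWORD"), nil
}

// buildDSN prefers DATABASE_URL, falling back to the individual variables.
func buildDSN() (string, error) {
	if dsn := os.Getenv("DATABASE_URL"); dsn != "" {
		return dsn, nil
	}
	pw, err := loadDBPassword()
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("postgres://%s:%s@%s:%s/%s?sslmode=disable",
		envOr("DB_USER", "news_1440"), pw,
		envOr("DB_HOST", "atproto-postgres"), envOr("DB_PORT", "5432"),
		envOr("DB_NAME", "news_1440")), nil
}

func main() {
	dsn, err := buildDSN()
	if err != nil {
		log.Fatal(err)
	}
	_ = dsn // hand this to the pgx connection pool in the real application
}
```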
Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.
## Architecture
Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.
### Concurrent Loops (`main.go`)
The application runs five independent goroutine loops:
- Import loop - Reads `vertices.txt.gz` and inserts domains into DB in batches of 100 (status='pass')
- Crawl loop - Worker pool crawls approved domains for feed discovery
- Feed check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
- Stats loop - Updates cached dashboard statistics every minute
- Cleanup loop - Removes items older than 12 months (weekly)
Note: Publishing is handled by the separate publish service.
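A minimal sketch of how `main.go` could start these loops as goroutines. The loop function names, signatures, and intervals shown are illustrative assumptions, not the repo's actual API:

```go
// Illustrative sketch only: loop names and signatures are assumptions.
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// Stubs standing in for the real loop bodies described above.
func importLoop(ctx context.Context)                   {}
func crawlLoop(ctx context.Context)                    {}
func feedCheckLoop(ctx context.Context)                {}
func statsLoop(ctx context.Context, d time.Duration)   {}
func cleanupLoop(ctx context.Context, d time.Duration) {}

func main() {
	ctx := context.Background()

	go importLoop(ctx)                  // read vertices.txt.gz, insert domains in batches of 100
	go crawlLoop(ctx)                   // worker pool: crawl approved domains for feeds
	go feedCheckLoop(ctx)               // worker pool: re-check known feeds with conditional HTTP
	go statsLoop(ctx, time.Minute)      // refresh cached dashboard statistics every minute
	go cleanupLoop(ctx, 7*24*time.Hour) // weekly: remove items older than 12 months

	// The dashboard server runs in the foreground and keeps the process alive.
	log.Fatal(http.ListenAndServe(":4321", nil))
}
```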
### File Structure
| File | Purpose |
|---|---|
| `crawler.go` | Crawler struct, worker pools, page fetching, recursive crawl logic |
| `domain.go` | Domain struct, DB operations, vertices file import |
| `feed.go` | Feed/Item structs, DB operations, feed checking with HTTP caching |
| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
| `util.go` | URL normalization, host utilities, TLD extraction |
| `db.go` | PostgreSQL schema (domains, feeds, items tables with tsvector FTS) |
| `dashboard.go` | HTTP server, JSON APIs, HTML template |
| `oauth.go` | OAuth 2.0 client wrapper for AT Protocol authentication |
| `oauth_session.go` | Session management with AES-256-GCM encrypted cookies |
| `oauth_middleware.go` | RequireAuth middleware for protecting routes |
| `oauth_handlers.go` | OAuth HTTP endpoints (login, callback, logout, metadata) |
| `routes.go` | HTTP route registration with auth middleware |
## Database Schema

PostgreSQL with pgx driver, using connection pooling:

- `domains` - Hosts to crawl (status: hold/pass/skip)
- `feeds` - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
- `items` - Individual feed entries (guid + feed_url unique)
- `search_vector` - GENERATED tsvector columns for full-text search (GIN indexed)

Column naming: snake_case (e.g., `source_host`, `pub_date`, `item_count`)
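An illustrative fragment in the spirit of `db.go`, showing the snake_case naming, a GENERATED `tsvector` column, and the GIN index described above; the exact column sets and constraints here are assumptions, not the repo's real schema:

```go
// Illustrative schema fragment; column sets and constraints are assumptions.
package main

import "fmt"

const schemaSketch = `
CREATE TABLE IF NOT EXISTS feeds (
    feed_url       text PRIMARY KEY,
    source_host    text NOT NULL,
    title          text,
    item_count     int  NOT NULL DEFAULT 0,
    publish_status text NOT NULL DEFAULT 'hold'  -- hold / pass / skip
);

CREATE TABLE IF NOT EXISTS items (
    guid      text NOT NULL,
    feed_url  text NOT NULL REFERENCES feeds(feed_url),
    title     text,
    body      text,
    pub_date  timestamptz,
    search_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED,
    UNIQUE (guid, feed_url)
);

CREATE INDEX IF NOT EXISTS items_search_idx ON items USING GIN (search_vector);
`

func main() { fmt.Print(schemaSketch) }
```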
## Processing Terminology

- `domain_check`: DNS lookup to verify domain is live
- `feed_crawl`: Crawl a live domain to discover RSS/Atom feeds
- `feed_check`: Check a known feed for new items
## Domain Processing Flow

1. Domains import as `pass` by default
2. Domain loop runs domain_check (DNS lookup) for unchecked domains
3. Domain loop runs feed_crawl for checked domains (recursive crawl up to MaxDepth=10, MaxPagesPerHost=10)
4. Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed (see the sketch after this list)
5. Parse discovered feeds for metadata, save with `next_check_at`
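A minimal sketch of the feed-link extraction step, assuming the `golang.org/x/net/html` parser; the function name and the exact matching rules are illustrative, not necessarily what `html.go` does:

```go
// Illustrative sketch; matching rules and names are assumptions.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// findFeedLinks walks a parsed HTML tree and collects hrefs that look like feeds:
// <link rel="alternate" type="...rss/atom..."> elements and anchors whose href
// contains rss, atom, or feed.
func findFeedLinks(n *html.Node, out *[]string) {
	if n.Type == html.ElementNode {
		attrs := map[string]string{}
		for _, a := range n.Attr {
			attrs[strings.ToLower(a.Key)] = a.Val
		}
		switch n.Data {
		case "link":
			if attrs["rel"] == "alternate" &&
				(strings.Contains(attrs["type"], "rss") || strings.Contains(attrs["type"], "atom")) {
				*out = append(*out, attrs["href"])
			}
		case "a":
			href := strings.ToLower(attrs["href"])
			if strings.Contains(href, "rss") || strings.Contains(href, "atom") || strings.Contains(href, "feed") {
				*out = append(*out, attrs["href"])
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		findFeedLinks(c, out)
	}
}

func main() {
	page := `<html><head><link rel="alternate" type="application/rss+xml" href="/feed.xml"></head>
<body><a href="/blog/atom">Atom feed</a></body></html>`
	doc, err := html.Parse(strings.NewReader(page))
	if err != nil {
		panic(err)
	}
	var links []string
	findFeedLinks(doc, &links)
	fmt.Println(links) // [/feed.xml /blog/atom]
}
```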
## Feed Checking

feed_check uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 100s per consecutive no-change check. Respects RSS `<ttl>` and Syndication namespace hints.
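A sketch of a conditional feed check with the adaptive backoff described above, using only `net/http`; the `Feed` fields and helper names are assumptions, not `feed.go`'s actual API:

```go
// Illustrative sketch; struct fields and helper names are assumptions.
package main

import (
	"fmt"
	"net/http"
	"time"
)

type Feed struct {
	URL            string
	ETag           string
	LastModified   string
	NoChangeStreak int // consecutive checks that returned 304 / no new items
}

// nextCheckDelay applies the adaptive backoff: base 100s plus 100s per
// consecutive no-change check.
func nextCheckDelay(f Feed) time.Duration {
	return time.Duration(100+100*f.NoChangeStreak) * time.Second
}

// checkFeed issues a conditional GET using the stored ETag / Last-Modified
// validators and updates the feed's backoff state.
func checkFeed(f *Feed) (changed bool, err error) {
	req, err := http.NewRequest(http.MethodGet, f.URL, nil)
	if err != nil {
		return false, err
	}
	if f.ETag != "" {
		req.Header.Set("If-None-Match", f.ETag)
	}
	if f.LastModified != "" {
		req.Header.Set("If-Modified-Since", f.LastModified)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		f.NoChangeStreak++
		return false, nil
	}
	// On 200, remember the new validators and reset the backoff streak.
	f.ETag = resp.Header.Get("ETag")
	f.LastModified = resp.Header.Get("Last-Modified")
	f.NoChangeStreak = 0
	return true, nil
}

func main() {
	f := Feed{URL: "https://example.com/feed.xml"} // requires network access when run
	changed, err := checkFeed(&f)
	fmt.Println(changed, err, "next check in", nextCheckDelay(f))
}
```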
## Publishing

Feeds with `publish_status = 'pass'` have their items automatically posted to AT Protocol by the publish service.

Status values: `hold` (default/pending review), `pass` (approved), `skip` (rejected).
## Domain Processing

Domain status values:

- `pass` (default on import) - Domain is crawled and checked automatically
- `hold` (manual) - Pauses crawling, keeps existing feeds and items
- `skip` (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
- `drop` (manual, via button) - Permanently deletes all feeds, items, and PDS accounts (requires skip first)

Note: Errors during check/crawl are recorded in `last_error` but do not change the domain status.
Skip vs Drop:

- `skip` is reversible - use "un-skip" to restore accounts and resume publishing
- `drop` is permanent - all data is deleted and cannot be recovered

Auto-skip patterns (imported as `skip`): bare TLDs, domains starting with a digit, and domains starting with a letter followed by a dash (see the sketch below). Non-English feeds are auto-skipped.
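A sketch of how the auto-skip patterns could be expressed; the exact rules and regular expressions here are assumptions, not the repo's actual checks:

```go
// Illustrative sketch; the regexes and the bare-TLD test are assumptions.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	startsWithDigit      = regexp.MustCompile(`^[0-9]`)
	startsWithLetterDash = regexp.MustCompile(`^[a-z]-`)
)

// shouldAutoSkip returns true for host names matching the auto-skip patterns:
// bare TLDs (no dot), a leading digit, or a leading letter followed by a dash.
func shouldAutoSkip(host string) bool {
	host = strings.ToLower(host)
	if !strings.Contains(host, ".") { // bare TLD, e.g. "com"
		return true
	}
	return startsWithDigit.MatchString(host) || startsWithLetterDash.MatchString(host)
}

func main() {
	for _, h := range []string{"com", "9news.example", "a-example.org", "example.com"} {
		fmt.Println(h, shouldAutoSkip(h))
	}
}
```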
## AT Protocol Integration

Domain: `1440.news`

User structure:

- `wehrv.1440.news` - Owner/admin account
- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
- `{category}.{domain}.1440.news` - Category-specific feeds (future)
PDS configuration in `pds.env`:

```bash
PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>
```
## Dashboard Authentication
The dashboard is protected by AT Protocol OAuth 2.0. Only the @1440.news handle can access it.
### OAuth Setup

1. Generate configuration:

   ```bash
   go run ./cmd/genkey
   ```

2. Create `oauth.env` with the generated values:

   ```bash
   OAUTH_COOKIE_SECRET=<generated_hex_string>
   OAUTH_PRIVATE_JWK=<generated_jwk_json>
   ```

3. Optionally set the base URL (defaults to https://app.1440.news):

   ```bash
   OAUTH_BASE_URL=https://app.1440.news
   ```
### OAuth Flow

1. User navigates to `/dashboard` -> redirected to `/auth/login`
2. User enters their Bluesky handle
3. User is redirected to Bluesky authorization
4. After approval, the callback verifies the handle is `1440.news`
5. Session cookie is set, user is redirected to the dashboard
### OAuth Endpoints

- `/.well-known/oauth-client-metadata` - Client metadata (public)
- `/.well-known/jwks.json` - Public JWK set (public)
- `/auth/login` - Login page / initiates OAuth flow
- `/auth/callback` - OAuth callback handler
- `/auth/logout` - Clears session
- `/auth/session` - Returns current session info (JSON)
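A sketch of how `routes.go` might register these endpoints and wrap protected routes in `RequireAuth`; the handler bodies, the cookie name, and the middleware behavior shown are assumptions, not the repo's actual implementation:

```go
// Illustrative sketch; cookie name, handlers, and middleware logic are assumptions.
package main

import (
	"fmt"
	"log"
	"net/http"
)

// RequireAuth stands in for the middleware in oauth_middleware.go: it checks
// for the session cookie and redirects to /auth/login when it is missing.
func RequireAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, err := r.Cookie("session"); err != nil {
			http.Redirect(w, r, "/auth/login", http.StatusFound)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()

	// Public OAuth endpoints.
	mux.HandleFunc("/.well-known/oauth-client-metadata", func(w http.ResponseWriter, r *http.Request) { fmt.Fprint(w, "{}") })
	mux.HandleFunc("/auth/login", func(w http.ResponseWriter, r *http.Request) { fmt.Fprint(w, "login") })

	// Dashboard and JSON APIs are wrapped in the auth middleware.
	dashboard := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { fmt.Fprint(w, "dashboard") })
	mux.Handle("/dashboard", RequireAuth(dashboard))

	log.Fatal(http.ListenAndServe(":4321", mux))
}
```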
### Security Notes
- Tokens are stored server-side only (BFF pattern)
- Browser only receives encrypted session cookie (AES-256-GCM)
- Access restricted to a single handle (`1440.news`)
- Sessions expire after 24 hours
- Automatic token refresh when within 5 minutes of expiry
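For illustration, a minimal AES-256-GCM sealing of a session value with Go's standard library; the key handling, cookie encoding, and payload shown are assumptions, not `oauth_session.go`'s actual code:

```go
// Illustrative sketch; key handling, encoding, and payload are assumptions.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// sealSession encrypts plaintext with a 32-byte key (AES-256) using GCM and
// returns a base64 string suitable for a cookie value.
func sealSession(key, plaintext []byte) (string, error) {
	block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	// Prepend the nonce so it can be recovered when opening the cookie later.
	sealed := gcm.Seal(nonce, nonce, plaintext, nil)
	return base64.RawURLEncoding.EncodeToString(sealed), nil
}

func main() {
	key := make([]byte, 32) // in the real app this would derive from OAUTH_COOKIE_SECRET
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	cookie, err := sealSession(key, []byte(`{"did":"did:plc:example"}`))
	fmt.Println(cookie, err)
}
```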