CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

IMPORTANT: Always use ./.launch.sh to deploy changes. This script updates version numbers in static files (CSS/JS cache busting) before running docker compose up -d --build. Never use docker compose directly.

Build & Run

go build -o 1440.news .    # Build
./1440.news                 # Run (starts dashboard at http://localhost:4321)
go fmt ./...                # Format
go vet ./...                # Static analysis

Database Setup

Requires PostgreSQL. Start the database first:

cd ../postgres && docker compose up -d

Environment Variables

Set via environment or create a .env file:

# Database connection (individual vars)
DB_HOST=atproto-postgres    # Default: atproto-postgres
DB_PORT=5432                # Default: 5432
DB_USER=news_1440           # Default: news_1440
DB_PASSWORD=<password>      # Or use DB_PASSWORD_FILE
DB_NAME=news_1440           # Default: news_1440

# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable

For Docker, use DB_PASSWORD_FILE=/run/secrets/db_password with Docker secrets.
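
For illustration, the connection settings could be resolved roughly as in the sketch below. This is not the crawler's actual code; it assumes DATABASE_URL takes precedence over the individual variables and that DB_PASSWORD_FILE overrides DB_PASSWORD.

import (
    "fmt"
    "net/url"
    "os"
    "strings"
)

// resolveDSN builds a PostgreSQL connection string from the environment.
// Sketch only: the crawler's real resolution order may differ.
func resolveDSN() (string, error) {
    if dsn := os.Getenv("DATABASE_URL"); dsn != "" {
        return dsn, nil // a full connection string wins
    }
    getenv := func(key, def string) string {
        if v := os.Getenv(key); v != "" {
            return v
        }
        return def
    }
    password := os.Getenv("DB_PASSWORD")
    if file := os.Getenv("DB_PASSWORD_FILE"); file != "" {
        data, err := os.ReadFile(file) // Docker secret
        if err != nil {
            return "", fmt.Errorf("read DB_PASSWORD_FILE: %w", err)
        }
        password = strings.TrimSpace(string(data))
    }
    u := url.URL{
        Scheme:   "postgres",
        User:     url.UserPassword(getenv("DB_USER", "news_1440"), password),
        Host:     getenv("DB_HOST", "atproto-postgres") + ":" + getenv("DB_PORT", "5432"),
        Path:     "/" + getenv("DB_NAME", "news_1440"),
        RawQuery: "sslmode=disable",
    }
    return u.String(), nil
}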

Requires vertices.txt.gz (Common Crawl domain list) in the working directory.

Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.

Concurrent Loops (main.go)

The application runs five independent goroutine loops:

  • Import loop - Reads vertices.txt.gz and inserts domains into DB in batches of 100 (status='pass')
  • Crawl loop - Worker pool crawls approved domains for feed discovery
  • Feed check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
  • Stats loop - Updates cached dashboard statistics every minute
  • Cleanup loop - Removes items older than 12 months (weekly)

Note: Publishing is handled by the separate publish service.
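
The five loops above could be wired up roughly as sketched below; the loop type, function names, and intervals are illustrative, not the actual identifiers in main.go.

import (
    "context"
    "sync"
    "time"
)

// loop describes one background task: how often it runs and its body.
type loop struct {
    name     string
    interval time.Duration
    run      func(context.Context)
}

// runLoops starts each loop in its own goroutine and blocks until ctx is done.
// Each loop runs once immediately, then again on every tick.
func runLoops(ctx context.Context, loops []loop) {
    var wg sync.WaitGroup
    for _, l := range loops {
        l := l
        wg.Add(1)
        go func() {
            defer wg.Done()
            t := time.NewTicker(l.interval)
            defer t.Stop()
            for {
                l.run(ctx)
                select {
                case <-ctx.Done():
                    return
                case <-t.C:
                }
            }
        }()
    }
    wg.Wait()
}

With this shape, main.go would register the import, crawl, feed check, stats, and cleanup loops with their own intervals (e.g. one minute for stats, weekly for cleanup).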

File Structure

File                  Purpose
crawler.go            Crawler struct, worker pools, page fetching, recursive crawl logic
domain.go             Domain struct, DB operations, vertices file import
feed.go               Feed/Item structs, DB operations, feed checking with HTTP caching
parser.go             RSS/Atom XML parsing, date parsing, next-crawl calculation
html.go               HTML parsing: feed link extraction, anchor feed detection
util.go               URL normalization, host utilities, TLD extraction
db.go                 PostgreSQL schema (domains, feeds, items tables with tsvector FTS)
dashboard.go          HTTP server, JSON APIs, HTML template
oauth.go              OAuth 2.0 client wrapper for AT Protocol authentication
oauth_session.go      Session management with AES-256-GCM encrypted cookies
oauth_middleware.go   RequireAuth middleware for protecting routes
oauth_handlers.go     OAuth HTTP endpoints (login, callback, logout, metadata)
routes.go             HTTP route registration with auth middleware

Database Schema

PostgreSQL with pgx driver, using connection pooling:

  • domains - Hosts to crawl (status: hold/pass/skip)
  • feeds - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
  • items - Individual feed entries (guid + feed_url unique)
  • search_vector - GENERATED tsvector columns for full-text search (GIN indexed)

Column naming: snake_case (e.g., source_host, pub_date, item_count)
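
As a concrete illustration of the FTS setup, the items table could look roughly like the DDL below. Columns other than guid, feed_url, pub_date, and search_vector are assumptions; the real schema in db.go may differ.

// Rough shape of the items table; the search_vector column is generated
// from text fields and indexed with GIN for full-text search.
const createItemsSketch = `
CREATE TABLE IF NOT EXISTS items (
    id            BIGSERIAL PRIMARY KEY,
    feed_url      TEXT NOT NULL,
    guid          TEXT NOT NULL,
    title         TEXT,
    description   TEXT,
    link          TEXT,
    pub_date      TIMESTAMPTZ,
    search_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
    ) STORED,
    UNIQUE (guid, feed_url)
);
CREATE INDEX IF NOT EXISTS items_search_idx ON items USING GIN (search_vector);
`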

Processing Terminology

  • domain_check: DNS lookup to verify domain is live
  • feed_crawl: Crawl a live domain to discover RSS/Atom feeds
  • feed_check: Check a known feed for new items

Domain Processing Flow

  1. Domains import as pass by default
  2. Domain loop runs domain_check (DNS lookup) for unchecked domains
  3. Domain loop runs feed_crawl for checked domains (recursive crawl up to MaxDepth=10, MaxPagesPerHost=10)
  4. Extract <link rel="alternate"> and anchor hrefs containing rss/atom/feed
  5. Parse discovered feeds for metadata, save with next_check_at
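
A minimal sketch of the extraction in step 4, using golang.org/x/net/html; the function below is illustrative and html.go may implement this differently.

import (
    "strings"

    "golang.org/x/net/html"
)

// extractFeedLinks walks a parsed HTML document and collects candidate feed
// URLs: <link rel="alternate"> elements with an RSS/Atom type, plus anchor
// hrefs containing rss/atom/feed. Sketch only.
func extractFeedLinks(doc *html.Node) []string {
    var out []string
    var walk func(*html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode {
            attrs := map[string]string{}
            for _, a := range n.Attr {
                attrs[strings.ToLower(a.Key)] = a.Val
            }
            switch n.Data {
            case "link":
                if strings.EqualFold(attrs["rel"], "alternate") &&
                    (strings.Contains(attrs["type"], "rss") || strings.Contains(attrs["type"], "atom")) {
                    out = append(out, attrs["href"])
                }
            case "a":
                href := strings.ToLower(attrs["href"])
                if strings.Contains(href, "rss") || strings.Contains(href, "atom") || strings.Contains(href, "feed") {
                    out = append(out, attrs["href"])
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)
    return out
}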

Feed Checking

feed_check uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 100s per consecutive no-change. Respects RSS <ttl> and Syndication namespace hints.
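
A sketch of that check with the backoff applied afterwards. The Feed field names are assumptions, and the <ttl> handling is one reasonable reading of "respects ... hints"; feed.go's actual logic may differ.

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// Feed holds only the fields this sketch needs; the real struct is richer.
type Feed struct {
    URL           string
    ETag          string
    LastModified  string
    NoChangeCount int
    TTLMinutes    int
    NextCheckAt   time.Time
}

// checkFeed issues a conditional GET and computes the next check time.
func checkFeed(ctx context.Context, client *http.Client, f *Feed) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, f.URL, nil)
    if err != nil {
        return err
    }
    if f.ETag != "" {
        req.Header.Set("If-None-Match", f.ETag)
    }
    if f.LastModified != "" {
        req.Header.Set("If-Modified-Since", f.LastModified)
    }
    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    switch resp.StatusCode {
    case http.StatusNotModified:
        f.NoChangeCount++ // nothing new: back off further
    case http.StatusOK:
        f.NoChangeCount = 0
        f.ETag = resp.Header.Get("ETag")
        f.LastModified = resp.Header.Get("Last-Modified")
        // ... parse the body and upsert new items here ...
    default:
        return fmt.Errorf("unexpected status %d", resp.StatusCode)
    }

    // Base 100s plus 100s per consecutive unchanged check.
    delay := time.Duration(100*(1+f.NoChangeCount)) * time.Second
    // If the feed advertises a longer <ttl> (minutes), honor it.
    if ttl := time.Duration(f.TTLMinutes) * time.Minute; ttl > delay {
        delay = ttl
    }
    f.NextCheckAt = time.Now().Add(delay)
    return nil
}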

Publishing

Feeds with publish_status = 'pass' have their items automatically posted to AT Protocol by the separate publish service. Status values: hold (default/pending review), pass (approved), skip (rejected).

Domain Processing

Domain status values:

  • pass (default on import) - Domain is crawled and checked automatically
  • hold (manual) - Pauses crawling, keeps existing feeds and items
  • skip (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
  • drop (manual, via button) - Permanently deletes all feeds, items, and PDS accounts (requires skip first)

Note: Errors during check/crawl are recorded in last_error but do not change the domain status.

Skip vs Drop:

  • skip is reversible - use "un-skip" to restore accounts and resume publishing
  • drop is permanent - all data is deleted and cannot be recovered

Auto-skip patterns (imported as skip): bare TLDs, domains starting with a digit, domains starting with letter-dash. Non-English feeds are also auto-skipped.
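
A rough sketch of how those auto-skip rules could be expressed. This is one plausible reading of the patterns above (e.g. treating a bare TLD as a host with no dot), not the crawler's actual rules.

import (
    "regexp"
    "strings"
)

var (
    startsWithDigit      = regexp.MustCompile(`^[0-9]`)
    startsWithLetterDash = regexp.MustCompile(`^[a-z]-`)
)

// shouldAutoSkip flags hosts matching the auto-skip patterns. Sketch only.
func shouldAutoSkip(host string) bool {
    host = strings.ToLower(host)
    if !strings.Contains(host, ".") { // bare TLD such as "com"
        return true
    }
    return startsWithDigit.MatchString(host) || startsWithLetterDash.MatchString(host)
}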

AT Protocol Integration

Domain: 1440.news

User structure:

  • wehrv.1440.news - Owner/admin account
  • {domain}.1440.news - Catch-all feed per source (e.g., wsj.com.1440.news)
  • {category}.{domain}.1440.news - Category-specific feeds (future)

PDS configuration in pds.env:

PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>

Dashboard Authentication

The dashboard is protected by AT Protocol OAuth 2.0. Only the @1440.news handle can access it.

OAuth Setup

  1. Generate configuration:

    go run ./cmd/genkey
    
  2. Create oauth.env with the generated values:

    OAUTH_COOKIE_SECRET=<generated_hex_string>
    OAUTH_PRIVATE_JWK=<generated_jwk_json>
    
  3. Optionally set the base URL (defaults to https://app.1440.news):

    OAUTH_BASE_URL=https://app.1440.news
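
For reference, a generator along the lines of cmd/genkey might produce a random hex cookie secret and a P-256 private key serialized as a JWK, roughly as sketched below; the actual genkey output format (extra JWK fields such as kid or alg) may differ.

package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "encoding/base64"
    "encoding/hex"
    "encoding/json"
    "fmt"
    "log"
)

func main() {
    // 32 random bytes, hex-encoded, for the AES-256-GCM cookie key.
    secret := make([]byte, 32)
    if _, err := rand.Read(secret); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("OAUTH_COOKIE_SECRET=%s\n", hex.EncodeToString(secret))

    // ES256 (P-256) private key encoded as a JWK.
    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        log.Fatal(err)
    }
    b64 := func(b []byte) string { return base64.RawURLEncoding.EncodeToString(b) }
    jwk := map[string]string{
        "kty": "EC",
        "crv": "P-256",
        "x":   b64(key.PublicKey.X.FillBytes(make([]byte, 32))),
        "y":   b64(key.PublicKey.Y.FillBytes(make([]byte, 32))),
        "d":   b64(key.D.FillBytes(make([]byte, 32))),
    }
    out, _ := json.Marshal(jwk)
    fmt.Printf("OAUTH_PRIVATE_JWK=%s\n", out)
}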
    

OAuth Flow

  1. User navigates to /dashboard -> redirected to /auth/login
  2. User enters their Bluesky handle
  3. User is redirected to Bluesky authorization
  4. After approval, callback verifies handle is 1440.news
  5. Session cookie is set, user redirected to dashboard
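
Step 1 corresponds to the RequireAuth middleware. A bare-bones sketch follows; the cookie name is an assumption, and the real middleware also decrypts and validates the session payload.

import "net/http"

// requireAuth redirects requests without a session cookie to the login page.
// Sketch only: the real middleware decrypts the AES-256-GCM payload and
// verifies the handle and expiry before passing the request through.
func requireAuth(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if _, err := r.Cookie("session"); err != nil {
            http.Redirect(w, r, "/auth/login", http.StatusFound)
            return
        }
        next.ServeHTTP(w, r)
    })
}

routes.go would then wrap protected routes in this middleware, e.g. mux.Handle("/dashboard", requireAuth(dashboardHandler)) (names here are from the sketch, not the actual code).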

OAuth Endpoints

  • /.well-known/oauth-client-metadata - Client metadata (public)
  • /.well-known/jwks.json - Public JWK set (public)
  • /auth/login - Login page / initiates OAuth flow
  • /auth/callback - OAuth callback handler
  • /auth/logout - Clears session
  • /auth/session - Returns current session info (JSON)

Security Notes

  • Tokens are stored server-side only (BFF pattern)
  • Browser only receives encrypted session cookie (AES-256-GCM)
  • Access restricted to single handle (1440.news)
  • Sessions expire after 24 hours
  • Automatic token refresh when within 5 minutes of expiry
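
A compact sketch of the AES-256-GCM cookie sealing described above, keyed by the decoded OAUTH_COOKIE_SECRET; the actual cookie layout in oauth_session.go may differ.

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "encoding/base64"
    "errors"
)

// seal encrypts a session payload and returns a cookie-safe base64 string of
// nonce||ciphertext. key must be 32 bytes for AES-256.
func seal(key, payload []byte) (string, error) {
    block, err := aes.NewCipher(key)
    if err != nil {
        return "", err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return "", err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return "", err
    }
    return base64.RawURLEncoding.EncodeToString(gcm.Seal(nonce, nonce, payload, nil)), nil
}

// open authenticates and decrypts a cookie value produced by seal.
func open(key []byte, cookie string) ([]byte, error) {
    raw, err := base64.RawURLEncoding.DecodeString(cookie)
    if err != nil {
        return nil, err
    }
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    if len(raw) < gcm.NonceSize() {
        return nil, errors.New("cookie too short")
    }
    return gcm.Open(nil, raw[:gcm.NonceSize()], raw[gcm.NonceSize():], nil)
}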