CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Note: Always run applications in containers via docker compose up -d --build when possible. This ensures proper networking between services (database, traefik, etc.) and matches the production environment.

Build & Run

go build -o 1440.news .    # Build
./1440.news                 # Run (starts dashboard at http://localhost:4321)
go fmt ./...                # Format
go vet ./...                # Static analysis

Database Setup

Requires PostgreSQL. Start the database first:

cd ../postgres && docker compose up -d

Environment Variables

Set via environment or create a .env file:

# Database connection (individual vars)
DB_HOST=atproto-postgres    # Default: atproto-postgres
DB_PORT=5432                # Default: 5432
DB_USER=news_1440           # Default: news_1440
DB_PASSWORD=<password>      # Or use DB_PASSWORD_FILE
DB_NAME=news_1440           # Default: news_1440

# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable

For Docker, use DB_PASSWORD_FILE=/run/secrets/db_password with Docker secrets.
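
A minimal sketch of how that precedence could be resolved in Go, assuming a hypothetical loadDBPassword helper (not the actual code): when DB_PASSWORD_FILE is set, the password is read from the secrets file; otherwise DB_PASSWORD is used directly.

import (
	"os"
	"strings"
)

// loadDBPassword is an illustrative helper: prefer the Docker secrets file
// (DB_PASSWORD_FILE, e.g. /run/secrets/db_password) over the plain env var.
func loadDBPassword() (string, error) {
	if path := os.Getenv("DB_PASSWORD_FILE"); path != "" {
		b, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		return strings.TrimSpace(string(b)), nil
	}
	return os.Getenv("DB_PASSWORD"), nil
}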

Requires vertices.txt.gz (Common Crawl domain list) in the working directory.

Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.

Concurrent Loops (main.go)

The application runs six independent goroutine loops:

  • Import loop - Reads vertices.txt.gz and inserts domains into DB in batches of 100 (status='pass')
  • Crawl loop - Worker pool crawls approved domains for feed discovery
  • Feed check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
  • Stats loop - Updates cached dashboard statistics every minute
  • Cleanup loop - Removes items older than 12 months (weekly)
  • Publish loop - Autopublishes items from approved feeds to AT Protocol PDS
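
Each of the loops above follows the same ticker-driven pattern; a rough sketch (function names and intervals here are illustrative, not the actual signatures in main.go):

import (
	"context"
	"time"
)

// runLoop illustrates the pattern: each of the six loops runs in its own
// goroutine and repeats its work on a fixed interval until the context ends.
func runLoop(ctx context.Context, interval time.Duration, work func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		work(ctx)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// Example wiring (placeholder intervals and function names):
//   go runLoop(ctx, time.Minute, statsLoop)      // stats refresh every minute
//   go runLoop(ctx, 7*24*time.Hour, cleanupLoop) // weekly cleanup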

File Structure

File                 Purpose
crawler.go           Crawler struct, worker pools, page fetching, recursive crawl logic
domain.go            Domain struct, DB operations, vertices file import
feed.go              Feed/Item structs, DB operations, feed checking with HTTP caching
parser.go            RSS/Atom XML parsing, date parsing, next-crawl calculation
html.go              HTML parsing: feed link extraction, anchor feed detection
util.go              URL normalization, host utilities, TLD extraction
db.go                PostgreSQL schema (domains, feeds, items tables with tsvector FTS)
dashboard.go         HTTP server, JSON APIs, HTML template
publisher.go         AT Protocol PDS integration for posting items
oauth.go             OAuth 2.0 client wrapper for AT Protocol authentication
oauth_session.go     Session management with AES-256-GCM encrypted cookies
oauth_middleware.go  RequireAuth middleware for protecting routes
oauth_handlers.go    OAuth HTTP endpoints (login, callback, logout, metadata)
routes.go            HTTP route registration with auth middleware

Database Schema

PostgreSQL with pgx driver, using connection pooling:

  • domains - Hosts to crawl (status: hold/pass/skip)
  • feeds - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
  • items - Individual feed entries (guid + feed_url unique)
  • search_vector - GENERATED tsvector columns for full-text search (GIN indexed)

Column naming: snake_case (e.g., source_host, pub_date, item_count)
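
For reference, a GENERATED tsvector column with a GIN index looks roughly like the DDL below (illustrative, embedded as a Go constant; the exact column names and schema live in db.go):

// Illustrative full-text search DDL; column names are examples only.
const itemsSearchDDL = `
ALTER TABLE items
    ADD COLUMN IF NOT EXISTS search_vector tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
    ) STORED;

CREATE INDEX IF NOT EXISTS items_search_idx ON items USING GIN (search_vector);
`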

Crawl Logic

  1. Domains import as pass by default (auto-crawled)
  2. Crawl loop picks up domains where last_crawled_at IS NULL
  3. Full recursive crawl (HTTPS first, falling back to HTTP) up to MaxDepth=10, MaxPagesPerHost=10
  4. Extract <link rel="alternate"> and anchor hrefs containing rss/atom/feed
  5. Parse discovered feeds for metadata, save with next_crawl_at
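
A condensed sketch of the depth- and page-limited crawl described above (function and constant names are illustrative; the real logic lives in crawler.go and html.go):

// crawlHost illustrates the bounded recursive crawl: visit at most
// maxPagesPerHost pages, no deeper than maxDepth, collecting candidate
// feed URLs discovered on each page.
const (
	maxDepth        = 10
	maxPagesPerHost = 10
)

// fetch is assumed to download and parse one page, returning feed URLs and
// same-host links found on it.
func crawlHost(startURL string, fetch func(string) (feeds, links []string, err error)) []string {
	visited := map[string]bool{}
	var found []string
	var walk func(u string, depth int)
	walk = func(u string, depth int) {
		if depth > maxDepth || len(visited) >= maxPagesPerHost || visited[u] {
			return
		}
		visited[u] = true
		feeds, links, err := fetch(u)
		if err != nil {
			return
		}
		found = append(found, feeds...)
		for _, l := range links {
			walk(l, depth+1)
		}
	}
	walk(startURL, 0)
	return found
}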

Feed Checking

Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 100s per consecutive no-change. Respects RSS <ttl> and Syndication namespace hints.
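
A minimal sketch of a conditional fetch plus the backoff rule (helper names are illustrative, not the actual code in feed.go):

import (
	"net/http"
	"time"
)

// checkFeed sends the cached ETag and Last-Modified values and treats
// 304 Not Modified as "no change". The caller closes resp.Body.
func checkFeed(client *http.Client, url, etag, lastModified string) (changed bool, resp *http.Response, err error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, nil, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	resp, err = client.Do(req)
	if err != nil {
		return false, nil, err
	}
	return resp.StatusCode != http.StatusNotModified, resp, nil
}

// nextCheckDelay applies the adaptive backoff described above:
// base 100s plus 100s per consecutive unchanged check.
func nextCheckDelay(consecutiveNoChange int) time.Duration {
	return time.Duration(100+100*consecutiveNoChange) * time.Second
}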

Publishing

Feeds with publish_status = 'pass' have their items automatically posted to AT Protocol. Status values: hold (default/pending review), pass (approved), skip (rejected).
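
The selection step can be pictured as a single query over the two status columns (illustrative SQL embedded as a Go constant; the column and table details are assumptions, the real query lives in publisher.go):

// Illustrative publish-loop query: pick not-yet-published items whose feed
// has publish_status = 'pass'. Join and column names are examples only.
const selectPublishableSQL = `
SELECT i.guid, i.feed_url, i.title, i.link
FROM items i
JOIN feeds f ON f.url = i.feed_url
WHERE f.publish_status = 'pass'
  AND i.published_at IS NULL
ORDER BY i.pub_date
LIMIT 100;
`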

Domain Processing

Domain status values:

  • pass (default on import) - Domain is crawled and checked automatically
  • hold (manual) - Pauses crawling, keeps existing feeds and items
  • skip (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
  • drop (manual, via button) - Permanently deletes all feeds, items, and PDS accounts (requires skip first)

Note: Errors during check/crawl are recorded in last_error but do not change the domain status.

Skip vs Drop:

  • skip is reversible - use "un-skip" to restore accounts and resume publishing
  • drop is permanent - all data is deleted and cannot be recovered

Auto-skip patterns (imported as skip): bare TLDs, domains starting with a digit, domains starting with letter-dash. Non-English feeds are auto-skipped.

AT Protocol Integration

Domain: 1440.news

User structure:

  • wehrv.1440.news - Owner/admin account
  • {domain}.1440.news - Catch-all feed per source (e.g., wsj.com.1440.news)
  • {category}.{domain}.1440.news - Category-specific feeds (future)

PDS configuration in pds.env:

PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>

Dashboard Authentication

The dashboard is protected by AT Protocol OAuth 2.0. Only the @1440.news handle can access it.

OAuth Setup

  1. Generate configuration:

    go run ./cmd/genkey
    
  2. Create oauth.env with the generated values:

    OAUTH_COOKIE_SECRET=<generated_hex_string>
    OAUTH_PRIVATE_JWK=<generated_jwk_json>
    
  3. Optionally set the base URL (defaults to https://app.1440.news):

    OAUTH_BASE_URL=https://app.1440.news
    

OAuth Flow

  1. User navigates to /dashboard -> redirected to /auth/login
  2. User enters their Bluesky handle
  3. User is redirected to Bluesky authorization
  4. After approval, callback verifies handle is 1440.news
  5. Session cookie is set, user redirected to dashboard

OAuth Endpoints

  • /.well-known/oauth-client-metadata - Client metadata (public)
  • /.well-known/jwks.json - Public JWK set (public)
  • /auth/login - Login page / initiates OAuth flow
  • /auth/callback - OAuth callback handler
  • /auth/logout - Clears session
  • /auth/session - Returns current session info (JSON)
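
Route registration in routes.go pairs protected handlers with the RequireAuth middleware; a sketch of the shape, with handler and parameter names as placeholders:

import "net/http"

// registerRoutes illustrates how the public OAuth endpoints and the
// protected dashboard could be wired together.
func registerRoutes(mux *http.ServeMux, requireAuth func(http.Handler) http.Handler,
	metadata, jwks, login, callback, logout, session, dashboard http.Handler) {
	// Public OAuth endpoints.
	mux.Handle("/.well-known/oauth-client-metadata", metadata)
	mux.Handle("/.well-known/jwks.json", jwks)
	mux.Handle("/auth/login", login)
	mux.Handle("/auth/callback", callback)
	mux.Handle("/auth/logout", logout)
	mux.Handle("/auth/session", session)

	// The dashboard requires a valid session cookie.
	mux.Handle("/dashboard", requireAuth(dashboard))
}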

Security Notes

  • Tokens are stored server-side only (BFF pattern)
  • Browser only receives encrypted session cookie (AES-256-GCM)
  • Access restricted to single handle (1440.news)
  • Sessions expire after 24 hours
  • Automatic token refresh when within 5 minutes of expiry
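
For reference, sealing a session value with AES-256-GCM before writing it into the cookie looks roughly like this (a sketch, not the exact code in oauth_session.go):

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"io"
)

// sealSession encrypts the serialized session with AES-256-GCM using a
// 32-byte key (e.g. decoded from OAUTH_COOKIE_SECRET) and returns a
// base64 string suitable for a cookie value.
func sealSession(key, plaintext []byte) (string, error) {
	block, err := aes.NewCipher(key) // 32-byte key selects AES-256
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nonce, nonce, plaintext, nil) // nonce prepended to ciphertext
	return base64.RawURLEncoding.EncodeToString(sealed), nil
}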