- Implement full OAuth 2.0 with PKCE using haileyok/atproto-oauth-golang
- Backend For Frontend (BFF) pattern: tokens stored server-side only
- AES-256-GCM encrypted session cookies
- Auto token refresh when near expiry
- Restrict access to allowed handles (1440.news, wehrv.bsky.social)
- Add genkey utility for generating OAuth configuration
- Generic error messages to prevent handle enumeration
- Server-side logging of failed login attempts for security monitoring

New files:
- oauth.go: OAuth client wrapper and DID/handle resolution
- oauth_session.go: Session management with encrypted cookies
- oauth_middleware.go: RequireAuth middleware for route protection
- oauth_handlers.go: Login, callback, logout, metadata endpoints
- cmd/genkey/main.go: Generate OAuth secrets and JWK keypair
- oauth.env.example: Configuration template

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
185 lines
7.0 KiB
Markdown
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

> **Note:** Always run applications in containers via `docker compose up -d --build` when possible. This ensures proper networking between services (database, traefik, etc.) and matches the production environment.

## Build & Run

```bash
go build -o 1440.news .    # Build
./1440.news                # Run (starts dashboard at http://localhost:4321)
go fmt ./...               # Format
go vet ./...               # Static analysis
```

### Database Setup

Requires PostgreSQL. Start the database first:

```bash
cd ../postgres && docker compose up -d
```

### Environment Variables

Set via environment or create a `.env` file:

```bash
# Database connection (individual vars)
DB_HOST=atproto-postgres    # Default: atproto-postgres
DB_PORT=5432                # Default: 5432
DB_USER=news_1440           # Default: news_1440
DB_PASSWORD=<password>      # Or use DB_PASSWORD_FILE
DB_NAME=news_1440           # Default: news_1440

# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable
```

For Docker, use `DB_PASSWORD_FILE=/run/secrets/db_password` with Docker secrets.

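As a rough illustration of how these variables could be resolved, the sketch below builds a connection URL from the documented defaults. The function names (`envOr`, `dbPassword`, `databaseURL`), the precedence order (`DATABASE_URL` first, `DB_PASSWORD` before `DB_PASSWORD_FILE`), and the `sslmode=disable` default are assumptions made for this example, not a description of the actual code in `db.go`.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// envOr returns getenv(key), or def when the variable is unset or empty.
func envOr(getenv func(string) string, key, def string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return def
}

// dbPassword prefers DB_PASSWORD and falls back to the file named by
// DB_PASSWORD_FILE (for example a Docker secret). The order is an assumption.
func dbPassword(getenv func(string) string) (string, error) {
	if pw := getenv("DB_PASSWORD"); pw != "" {
		return pw, nil
	}
	path := getenv("DB_PASSWORD_FILE")
	if path == "" {
		return "", fmt.Errorf("neither DB_PASSWORD nor DB_PASSWORD_FILE is set")
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("read DB_PASSWORD_FILE: %w", err)
	}
	return strings.TrimSpace(string(data)), nil
}

// databaseURL returns DATABASE_URL when set; otherwise it assembles a URL
// from the individual DB_* variables using the documented defaults.
func databaseURL(getenv func(string) string) (string, error) {
	if dsn := getenv("DATABASE_URL"); dsn != "" {
		return dsn, nil
	}
	pw, err := dbPassword(getenv)
	if err != nil {
		return "", err
	}
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(envOr(getenv, "DB_USER", "news_1440"), pw),
		Host:     envOr(getenv, "DB_HOST", "atproto-postgres") + ":" + envOr(getenv, "DB_PORT", "5432"),
		Path:     "/" + envOr(getenv, "DB_NAME", "news_1440"),
		RawQuery: "sslmode=disable", // assumed default, taken from the example URL
	}
	return u.String(), nil
}
```

In a real program the lookup function would simply be `os.Getenv`, as in `databaseURL(os.Getenv)`; passing it in keeps the helper easy to exercise with a map in tests.
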
Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.

## Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.

### Concurrent Loops (main.go)

The application runs seven independent goroutine loops:
- **Import loop** - Reads `vertices.txt.gz` and inserts domains into DB in batches of 100 (status='pass')
- **Domain check loop** - HEAD requests to verify approved domains are reachable
- **Crawl loop** - Worker pool crawls verified domains for feed discovery
- **Feed check loop** - Worker pool re-checks known feeds for updates (conditional HTTP)
- **Stats loop** - Updates cached dashboard statistics every minute
- **Cleanup loop** - Removes items older than 12 months (weekly)
- **Publish loop** - Autopublishes items from approved feeds to AT Protocol PDS

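The general shape of independent periodic loops can be sketched as follows. This is a minimal pattern example, not the code in `main.go`: the `loop` type, `startLoops`, and the plain `func()` callback are hypothetical, and the real loops likely accept a context and have their own scheduling.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// loop describes one periodic background task (hypothetical type).
type loop struct {
	name     string
	interval time.Duration
	run      func()
}

// startLoops launches every loop in its own goroutine. Each loop runs once
// immediately and then once per interval. The returned stop function cancels
// all loops and blocks until their goroutines have exited.
func startLoops(loops []loop) (stop func()) {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	for _, l := range loops {
		wg.Add(1)
		go func(l loop) {
			defer wg.Done()
			ticker := time.NewTicker(l.interval)
			defer ticker.Stop()
			for {
				l.run()
				select {
				case <-ctx.Done():
					return
				case <-ticker.C:
				}
			}
		}(l)
	}
	return func() {
		cancel()
		wg.Wait()
	}
}
```

With this shape, the stats loop would be registered with `interval: time.Minute` and the cleanup loop with `interval: 7 * 24 * time.Hour`, matching the cadences listed above.
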
### File Structure

| File | Purpose |
|------|---------|
| `crawler.go` | Crawler struct, worker pools, page fetching, recursive crawl logic |
| `domain.go` | Domain struct, DB operations, vertices file import |
| `feed.go` | Feed/Item structs, DB operations, feed checking with HTTP caching |
| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
| `util.go` | URL normalization, host utilities, TLD extraction |
| `db.go` | PostgreSQL schema (domains, feeds, items tables with tsvector FTS) |
| `dashboard.go` | HTTP server, JSON APIs, HTML template |
| `publisher.go` | AT Protocol PDS integration for posting items |
| `oauth.go` | OAuth 2.0 client wrapper for AT Protocol authentication |
| `oauth_session.go` | Session management with AES-256-GCM encrypted cookies |
| `oauth_middleware.go` | RequireAuth middleware for protecting routes |
| `oauth_handlers.go` | OAuth HTTP endpoints (login, callback, logout, metadata) |
| `routes.go` | HTTP route registration with auth middleware |

### Database Schema

PostgreSQL with pgx driver, using connection pooling:
- **domains** - Hosts to crawl (status: hold/pass/skip/fail)
- **feeds** - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
- **items** - Individual feed entries (guid + feed_url unique)
- **search_vector** - GENERATED tsvector columns for full-text search (GIN indexed)

Column naming: snake_case (e.g., `source_host`, `pub_date`, `item_count`)

### Crawl Logic

1. Domains import as `pass` by default (auto-crawled)
2. Check stage: HEAD request verifies domain is reachable, sets last_checked_at
3. Crawl stage: Full recursive crawl (HTTPS, fallback HTTP)
4. Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
5. Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed
6. Parse discovered feeds for metadata, save with next_crawl_at

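A few of the crawl steps above can be sketched with small helpers. The names `candidateURLs`, `looksLikeFeedHref`, and `withinLimits` are hypothetical, and the exact boundary semantics of the two limits (inclusive depth, exclusive page count) are an assumption. The actual logic lives in `crawler.go` and `html.go` and may differ.

```go
package main

import "strings"

// Limits as documented in the crawl steps.
const (
	maxDepth        = 10
	maxPagesPerHost = 10
)

// candidateURLs lists the start URLs for a host in the order they would be
// tried: HTTPS first, then plain HTTP as a fallback.
func candidateURLs(host string) []string {
	return []string{"https://" + host + "/", "http://" + host + "/"}
}

// looksLikeFeedHref applies the substring heuristic to an anchor href. It is
// deliberately loose (it also matches "/feedback"), so a candidate still has
// to be fetched and parsed before it counts as a feed.
func looksLikeFeedHref(href string) bool {
	h := strings.ToLower(href)
	return strings.Contains(h, "rss") ||
		strings.Contains(h, "atom") ||
		strings.Contains(h, "feed")
}

// withinLimits reports whether another page may be fetched at the given
// depth, assuming depth is inclusive and the page budget is exclusive.
func withinLimits(depth, pagesFetched int) bool {
	return depth <= maxDepth && pagesFetched < maxPagesPerHost
}
```

Extracting `<link rel="alternate">` elements needs a real HTML parser and is left out of this sketch.
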
### Feed Checking

Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 100s per consecutive no-change. Respects RSS `<ttl>` and Syndication namespace hints.

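The backoff rule and the conditional request can be sketched as below. `nextCheckDelay` and `newConditionalRequest` are hypothetical names, and the sketch ignores the `<ttl>` and Syndication hints, which the real implementation in `feed.go` and `parser.go` takes into account.

```go
package main

import (
	"net/http"
	"time"
)

// nextCheckDelay implements the documented backoff: a 100s base plus 100s
// for every consecutive check that found no change.
func nextCheckDelay(consecutiveNoChange int) time.Duration {
	if consecutiveNoChange < 0 {
		consecutiveNoChange = 0
	}
	return 100*time.Second + time.Duration(consecutiveNoChange)*100*time.Second
}

// newConditionalRequest builds a GET that lets the server answer with
// 304 Not Modified when the feed is unchanged. etag and lastModified are the
// values stored from the previous response; empty strings are skipped.
func newConditionalRequest(feedURL, etag, lastModified string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, feedURL, nil)
	if err != nil {
		return nil, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	return req, nil
}
```

Under this rule a feed that has been unchanged for three checks in a row is next checked after 400 seconds. A `304` response would count as another no-change result.
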
### Publishing

Feeds with `publish_status = 'pass'` have their items automatically posted to AT Protocol.
Status values: `hold` (default/pending review), `pass` (approved), `skip` (rejected).

### Domain Processing (Two-Stage)

1. **Check stage** - HEAD request to verify domain is reachable
2. **Crawl stage** - Full recursive crawl for feed discovery

Domain status values:
- `pass` (default on import) - Domain is crawled and checked automatically
- `hold` (manual) - Pauses crawling, keeps existing feeds and items
- `skip` (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
- `drop` (manual, via button) - Permanently **deletes** all feeds, items, and PDS accounts (requires skip first)
- `fail` (automatic) - Set when check/crawl fails, keeps existing feeds and items

Skip vs Drop:
- `skip` is reversible - use "un-skip" to restore accounts and resume publishing
- `drop` is permanent - all data is deleted, cannot be recovered

Auto-skip patterns (imported as `skip`): bare TLDs, domains starting with digit, domains starting with letter-dash.
Non-English feeds are auto-skipped.

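The import-time rules and the skip-before-drop constraint can be illustrated with the sketch below. The function names are hypothetical. Reading "letter-dash" as a single ASCII letter followed by `-`, and treating any host without a dot as a bare TLD, are assumptions, and the real checks in `domain.go` may be stricter.

```go
package main

import "strings"

// shouldAutoSkip reports whether a host matches one of the documented
// auto-skip patterns.
func shouldAutoSkip(host string) bool {
	h := strings.ToLower(strings.TrimSpace(host))
	if h == "" {
		return false
	}
	// Bare TLD: assumed here to mean a host with no dot at all.
	if !strings.Contains(h, ".") {
		return true
	}
	// Starts with a digit.
	if h[0] >= '0' && h[0] <= '9' {
		return true
	}
	// Starts with letter-dash, e.g. "a-example.com" (assumed interpretation).
	if len(h) >= 2 && h[0] >= 'a' && h[0] <= 'z' && h[1] == '-' {
		return true
	}
	return false
}

// importStatus returns the status a newly imported domain would receive.
func importStatus(host string) string {
	if shouldAutoSkip(host) {
		return "skip"
	}
	return "pass"
}

// canDrop encodes the rule that a domain must be skipped before it can be
// dropped.
func canDrop(currentStatus string) bool {
	return currentStatus == "skip"
}
```
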
## AT Protocol Integration

Domain: 1440.news

User structure:
- `wehrv.1440.news` - Owner/admin account
- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
- `{category}.{domain}.1440.news` - Category-specific feeds (future)

PDS configuration in `pds.env`:
```
PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>
```

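The handle scheme above amounts to joining the optional category, the source domain, and the base domain. The helper below is a hypothetical illustration (`feedHandle` is not a function confirmed to exist in `publisher.go`) and does not validate handle length or character rules.

```go
package main

import "strings"

// baseDomain is the domain under which all feed accounts live.
const baseDomain = "1440.news"

// feedHandle builds the account handle for a source domain. An empty
// category yields the catch-all handle; a non-empty one yields the
// category-specific form, which the document marks as a future feature.
func feedHandle(sourceDomain, category string) string {
	parts := make([]string, 0, 3)
	if category != "" {
		parts = append(parts, category)
	}
	parts = append(parts, sourceDomain, baseDomain)
	// Handles are case-insensitive, so normalize to lower case.
	return strings.ToLower(strings.Join(parts, "."))
}
```

For example, `feedHandle("wsj.com", "")` yields `wsj.com.1440.news`.
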
## Dashboard Authentication

The dashboard is protected by AT Protocol OAuth 2.0. Only the `@1440.news` handle can access it.

### OAuth Setup

1. Generate configuration:
```bash
go run ./cmd/genkey
```

2. Create `oauth.env` with the generated values:
```
OAUTH_COOKIE_SECRET=<generated_hex_string>
OAUTH_PRIVATE_JWK=<generated_jwk_json>
```

3. Optionally set the base URL (defaults to https://app.1440.news):
```
OAUTH_BASE_URL=https://app.1440.news
```

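A startup check for these values might look like the sketch below. The `oauthConfig` type and `loadOAuthConfig` function are hypothetical. The sketch assumes the hex secret decodes to a 32-byte key, as AES-256 requires, and only checks that the JWK is syntactically valid JSON rather than parsing the key itself.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// oauthConfig holds the values documented for oauth.env (hypothetical type).
type oauthConfig struct {
	CookieKey  []byte          // decoded OAUTH_COOKIE_SECRET
	PrivateJWK json.RawMessage // raw OAUTH_PRIVATE_JWK, parsed later by the OAuth client
	BaseURL    string
}

// loadOAuthConfig reads and validates the OAuth settings through getenv.
func loadOAuthConfig(getenv func(string) string) (*oauthConfig, error) {
	key, err := hex.DecodeString(getenv("OAUTH_COOKIE_SECRET"))
	if err != nil {
		return nil, fmt.Errorf("OAUTH_COOKIE_SECRET is not valid hex: %w", err)
	}
	if len(key) != 32 { // assumed: the secret is used directly as an AES-256 key
		return nil, fmt.Errorf("OAUTH_COOKIE_SECRET must decode to 32 bytes, got %d", len(key))
	}
	jwk := getenv("OAUTH_PRIVATE_JWK")
	if !json.Valid([]byte(jwk)) {
		return nil, fmt.Errorf("OAUTH_PRIVATE_JWK is not valid JSON")
	}
	baseURL := getenv("OAUTH_BASE_URL")
	if baseURL == "" {
		baseURL = "https://app.1440.news" // documented default
	}
	return &oauthConfig{
		CookieKey:  key,
		PrivateJWK: json.RawMessage(jwk),
		BaseURL:    baseURL,
	}, nil
}
```

In the application the lookup function would be `os.Getenv`. Failing fast here gives a clearer error than a decryption failure on the first request.
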
### OAuth Flow

1. User navigates to `/dashboard` -> redirected to `/auth/login`
2. User enters their Bluesky handle
3. User is redirected to Bluesky authorization
4. After approval, callback verifies handle is `1440.news`
5. Session cookie is set, user redirected to dashboard

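The first and fourth steps of the flow, the redirect for unauthenticated requests and the handle check, can be sketched as a small middleware. This is not the `RequireAuth` implementation from `oauth_middleware.go`: the `session` type, the `lookup` callback, and the handle normalization are assumptions made for illustration.

```go
package main

import (
	"net/http"
	"strings"
	"time"
)

// allowedHandle is the single handle the document says may use the dashboard.
const allowedHandle = "1440.news"

// session is a minimal stand-in for the server-side session record.
type session struct {
	Handle    string
	ExpiresAt time.Time
}

// isAllowedHandle compares a handle to the allowed one, ignoring case,
// surrounding whitespace, and a leading "@".
func isAllowedHandle(handle string) bool {
	h := strings.ToLower(strings.TrimPrefix(strings.TrimSpace(handle), "@"))
	return h == allowedHandle
}

// requireAuth wraps next so that requests without a valid, unexpired session
// for the allowed handle are redirected to the login page.
func requireAuth(lookup func(*http.Request) (*session, bool), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		s, ok := lookup(r)
		if !ok || time.Now().After(s.ExpiresAt) || !isAllowedHandle(s.Handle) {
			http.Redirect(w, r, "/auth/login", http.StatusFound)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

A route would then be registered along the lines of `mux.Handle("/dashboard", requireAuth(lookup, dashboardHandler))`, where `lookup` decrypts the session cookie and loads the matching server-side session.
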
### OAuth Endpoints

- `/.well-known/oauth-client-metadata` - Client metadata (public)
- `/.well-known/jwks.json` - Public JWK set (public)
- `/auth/login` - Login page / initiates OAuth flow
- `/auth/callback` - OAuth callback handler
- `/auth/logout` - Clears session
- `/auth/session` - Returns current session info (JSON)

### Security Notes

- Tokens are stored server-side only (BFF pattern)
- Browser only receives encrypted session cookie (AES-256-GCM)
- Access restricted to single handle (`1440.news`)
- Sessions expire after 24 hours
- Automatic token refresh when within 5 minutes of expiry

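The cookie encryption and the two timing rules above can be sketched with the standard library alone. The names `sealCookie`, `openCookie`, `sessionExpired`, and `needsRefresh` are hypothetical, and the cookie layout (nonce followed by ciphertext, base64url-encoded) is an assumption rather than the format used in `oauth_session.go`.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"time"
)

const (
	sessionLifetime = 24 * time.Hour  // sessions expire after 24 hours
	refreshWindow   = 5 * time.Minute // refresh tokens this close to expiry
)

// newGCM builds an AES-256-GCM AEAD from a 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealCookie encrypts plaintext under a fresh random nonce and returns
// base64url(nonce || ciphertext || tag).
func sealCookie(key, plaintext []byte) (string, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nonce, nonce, plaintext, nil)
	return base64.RawURLEncoding.EncodeToString(sealed), nil
}

// openCookie reverses sealCookie and fails if the value was modified or was
// encrypted under a different key.
func openCookie(key []byte, value string) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	raw, err := base64.RawURLEncoding.DecodeString(value)
	if err != nil {
		return nil, err
	}
	if len(raw) < gcm.NonceSize() {
		return nil, errors.New("cookie value too short")
	}
	nonce, ciphertext := raw[:gcm.NonceSize()], raw[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

// sessionExpired reports whether a session of the given age is past its lifetime.
func sessionExpired(age time.Duration) bool { return age >= sessionLifetime }

// needsRefresh reports whether a token with the given time left should be refreshed.
func needsRefresh(untilExpiry time.Duration) bool { return untilExpiry <= refreshWindow }
```

Because GCM authenticates as well as encrypts, a tampered cookie fails to open instead of yielding corrupted session data, so a decryption error can be treated the same as a missing session.
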