# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
> **IMPORTANT:** Always use `./.launch.sh` to deploy changes. This script updates version numbers in static files (CSS/JS cache busting) before running `docker compose up -d --build`. Never use `docker compose` directly.
## Build & Run
```bash
go build -o 1440.news . # Build
./1440.news # Run (starts dashboard at http://localhost:4321)
go fmt ./... # Format
go vet ./... # Static analysis
```
### Database Setup
Requires PostgreSQL. Start the database first:
```bash
cd ../postgres && docker compose up -d
```
### Environment Variables
Set via environment or create a `.env` file:
```bash
# Database connection (individual vars)
DB_HOST=atproto-postgres # Default: atproto-postgres
DB_PORT=5432 # Default: 5432
DB_USER=news_1440 # Default: news_1440
DB_PASSWORD=<password> # Or use DB_PASSWORD_FILE
DB_NAME=news_1440 # Default: news_1440
# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable
```
For Docker, use `DB_PASSWORD_FILE=/run/secrets/db_password` with Docker secrets.
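The following Go sketch shows one way the settings above could be resolved at startup, preferring `DB_PASSWORD_FILE` and falling back to the documented defaults; the helper names (`getenv`, `dbPassword`, `buildDSN`) are illustrative, not the actual code in this repo:
```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// getenv returns the value of key, or def if it is unset or empty.
func getenv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// dbPassword prefers DB_PASSWORD_FILE (Docker secrets) over DB_PASSWORD.
func dbPassword() (string, error) {
	if path := os.Getenv("DB_PASSWORD_FILE"); path != "" {
		b, err := os.ReadFile(path)
		if err != nil {
			return "", fmt.Errorf("read DB_PASSWORD_FILE: %w", err)
		}
		return strings.TrimSpace(string(b)), nil
	}
	return os.Getenv("DB_PASSWORD"), nil
}

// buildDSN prefers DATABASE_URL, otherwise assembles a connection string
// from the individual DB_* variables and their documented defaults.
func buildDSN() (string, error) {
	if url := os.Getenv("DATABASE_URL"); url != "" {
		return url, nil
	}
	pass, err := dbPassword()
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("postgres://%s:%s@%s:%s/%s?sslmode=disable",
		getenv("DB_USER", "news_1440"), pass,
		getenv("DB_HOST", "atproto-postgres"), getenv("DB_PORT", "5432"),
		getenv("DB_NAME", "news_1440")), nil
}

func main() {
	dsn, err := buildDSN()
	if err != nil {
		panic(err)
	}
	fmt.Println(dsn)
}
```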
Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.
## Architecture
Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.
### Concurrent Loops (main.go)
The application runs five independent goroutine loops:
- **Import loop** - Reads `vertices.txt.gz` and inserts domains into DB in batches of 100 (status='pass')
- **Crawl loop** - Worker pool crawls approved domains for feed discovery
- **Feed check loop** - Worker pool re-checks known feeds for updates (conditional HTTP)
- **Stats loop** - Updates cached dashboard statistics every minute
- **Cleanup loop** - Removes items older than 12 months (weekly)
Note: Publishing is handled by the separate `publish` service.
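A minimal sketch of how independent loops like these might be wired up in `main.go`; the loop bodies and most intervals here are placeholders (only the stats and cleanup cadences come from the list above):
```go
package main

import (
	"context"
	"time"
)

// runLoop calls fn immediately, then again on every tick, until ctx is cancelled.
func runLoop(ctx context.Context, interval time.Duration, fn func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		fn(ctx)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	ctx := context.Background()
	// Hypothetical loop bodies; the real ones live in crawler.go, feed.go, etc.
	importVertices := func(context.Context) { /* insert domains in batches of 100, status='pass' */ }
	crawlDomains := func(context.Context) { /* worker pool: feed discovery */ }
	checkFeeds := func(context.Context) { /* conditional HTTP feed re-checks */ }
	refreshStats := func(context.Context) { /* cache dashboard statistics */ }
	cleanupItems := func(context.Context) { /* delete items older than 12 months */ }

	go runLoop(ctx, time.Hour, importVertices)
	go runLoop(ctx, time.Minute, crawlDomains)
	go runLoop(ctx, time.Minute, checkFeeds)
	go runLoop(ctx, time.Minute, refreshStats)
	go runLoop(ctx, 7*24*time.Hour, cleanupItems)

	select {} // block forever; the real main also starts the dashboard HTTP server
}
```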
### File Structure
| File | Purpose |
|------|---------|
| `crawler.go` | Crawler struct, worker pools, page fetching, recursive crawl logic |
| `domain.go` | Domain struct, DB operations, vertices file import |
| `feed.go` | Feed/Item structs, DB operations, feed checking with HTTP caching |
| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
| `util.go` | URL normalization, host utilities, TLD extraction |
| `db.go` | PostgreSQL schema (domains, feeds, items tables with tsvector FTS) |
| `dashboard.go` | HTTP server, JSON APIs, HTML template |
| `oauth.go` | OAuth 2.0 client wrapper for AT Protocol authentication |
| `oauth_session.go` | Session management with AES-256-GCM encrypted cookies |
| `oauth_middleware.go` | RequireAuth middleware for protecting routes |
| `oauth_handlers.go` | OAuth HTTP endpoints (login, callback, logout, metadata) |
| `routes.go` | HTTP route registration with auth middleware |
### Database Schema
PostgreSQL with pgx driver, using connection pooling:
- **domains** - Hosts to crawl (status: hold/pass/skip)
- **feeds** - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
- **items** - Individual feed entries (guid + feed_url unique)
- **search_vector** - GENERATED tsvector columns for full-text search (GIN indexed)
Column naming: snake_case (e.g., `source_host`, `pub_date`, `item_count`)
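For illustration, a hedged pgx sketch of a full-text query against the GIN-indexed `search_vector` column; the selected columns (`title`, `link`, `pub_date`) are assumptions, not necessarily the repo's actual schema:
```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// searchItems runs a full-text query against the generated tsvector column.
func searchItems(ctx context.Context, pool *pgxpool.Pool, query string) error {
	rows, err := pool.Query(ctx, `
		SELECT title, link, pub_date
		FROM items
		WHERE search_vector @@ plainto_tsquery('english', $1)
		ORDER BY pub_date DESC
		LIMIT 20`, query)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var title, link string
		var pubDate time.Time
		if err := rows.Scan(&title, &link, &pubDate); err != nil {
			return err
		}
		fmt.Println(pubDate.Format(time.RFC3339), title, link)
	}
	return rows.Err()
}

func main() {
	ctx := context.Background()
	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
	if err := searchItems(ctx, pool, "election"); err != nil {
		log.Fatal(err)
	}
}
```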
### Processing Terminology
- **domain_check**: DNS lookup to verify domain is live
- **feed_crawl**: Crawl a live domain to discover RSS/Atom feeds
- **feed_check**: Check a known feed for new items
### Domain Processing Flow
1. Domains import as `pass` by default
2. Domain loop runs **domain_check** (DNS lookup) for unchecked domains
3. Domain loop runs **feed_crawl** for checked domains (recursive crawl up to MaxDepth=10, MaxPagesPerHost=10)
4. Extract `<link rel="alternate">` feed links and anchor hrefs containing rss/atom/feed (see the sketch after this list)
5. Parse discovered feeds for metadata, save with `next_check_at`
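A sketch of the feed-link extraction in step 4 using `golang.org/x/net/html`; the function name and matching details are illustrative rather than the actual logic in `html.go`:
```go
package main

import (
	"fmt"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

// findFeedLinks collects <link rel="alternate" type="...xml"> hrefs plus
// anchor hrefs that mention rss/atom/feed. Illustrative only.
func findFeedLinks(pageURL string) ([]string, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var feeds []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			attrs := map[string]string{}
			for _, a := range n.Attr {
				attrs[strings.ToLower(a.Key)] = a.Val
			}
			switch n.Data {
			case "link":
				if strings.EqualFold(attrs["rel"], "alternate") &&
					strings.Contains(attrs["type"], "xml") && attrs["href"] != "" {
					feeds = append(feeds, attrs["href"])
				}
			case "a":
				href := strings.ToLower(attrs["href"])
				if strings.Contains(href, "rss") || strings.Contains(href, "atom") ||
					strings.Contains(href, "feed") {
					feeds = append(feeds, attrs["href"])
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return feeds, nil
}

func main() {
	links, err := findFeedLinks("https://example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(links)
}
```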
### Feed Checking
**feed_check** uses conditional HTTP requests (ETag, If-Modified-Since). Adaptive backoff: a base interval of 100s plus 100s for each consecutive check with no changes. RSS `<ttl>` and Syndication namespace hints are respected.
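A hedged sketch of a conditional feed check plus the documented backoff calculation; the `Feed` fields and the parsing step are placeholders, not the structs in `feed.go`:
```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Feed carries the cache headers and backoff state stored per feed.
type Feed struct {
	URL                 string
	ETag                string
	LastModified        string
	ConsecutiveNoChange int
}

// checkFeed issues a conditional GET; a 304 response means no new items.
func checkFeed(f *Feed) (changed bool, err error) {
	req, err := http.NewRequest(http.MethodGet, f.URL, nil)
	if err != nil {
		return false, err
	}
	if f.ETag != "" {
		req.Header.Set("If-None-Match", f.ETag)
	}
	if f.LastModified != "" {
		req.Header.Set("If-Modified-Since", f.LastModified)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		f.ConsecutiveNoChange++
		return false, nil
	}
	f.ConsecutiveNoChange = 0
	f.ETag = resp.Header.Get("ETag")
	f.LastModified = resp.Header.Get("Last-Modified")
	// ...parse the body for new items here...
	return true, nil
}

// nextCheckDelay applies the documented backoff: 100s base plus 100s per
// consecutive unchanged check (TTL / Syndication hints would override this).
func nextCheckDelay(f *Feed) time.Duration {
	return time.Duration(100+100*f.ConsecutiveNoChange) * time.Second
}

func main() {
	f := &Feed{URL: "https://example.com/feed.xml"}
	if changed, err := checkFeed(f); err == nil {
		fmt.Println("changed:", changed, "next check in", nextCheckDelay(f))
	} else {
		fmt.Println("check failed:", err)
	}
}
```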
### Publishing
Feeds with `publish_status = 'pass'` have their items posted to AT Protocol by the separate publish service.
Status values: `hold` (default/pending review), `pass` (approved), `skip` (rejected).
### Domain Processing
Domain status values:
- `pass` (default on import) - Domain is crawled and checked automatically
- `hold` (manual) - Pauses crawling, keeps existing feeds and items
- `skip` (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
- `drop` (manual, via button) - Permanently **deletes** all feeds, items, and PDS accounts (requires skip first)
Note: Errors during check/crawl are recorded in `last_error` but do not change the domain status.
Skip vs Drop:
- `skip` is reversible - use "un-skip" to restore accounts and resume publishing
- `drop` is permanent - all data is deleted, cannot be recovered
Auto-skip patterns (imported as `skip`): bare TLDs, domains starting with a digit, and domains starting with a letter followed by a dash.
Non-English feeds are auto-skipped.
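One possible reading of those auto-skip rules as Go code; the exact matching logic in the importer may differ:
```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	startsWithDigit      = regexp.MustCompile(`^[0-9]`)
	startsWithLetterDash = regexp.MustCompile(`^[a-z]-`)
)

// shouldAutoSkip applies the documented import-time skip rules.
func shouldAutoSkip(domain string) bool {
	d := strings.ToLower(domain)
	if !strings.Contains(d, ".") { // bare TLD, e.g. "com"
		return true
	}
	if startsWithDigit.MatchString(d) { // e.g. "1stnews.example"
		return true
	}
	if startsWithLetterDash.MatchString(d) { // e.g. "a-example.com"
		return true
	}
	return false
}

func main() {
	for _, d := range []string{"com", "1stnews.example", "x-site.org", "wsj.com"} {
		fmt.Println(d, shouldAutoSkip(d))
	}
}
```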
## AT Protocol Integration
Domain: 1440.news
User structure:
- `wehrv.1440.news` - Owner/admin account
- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
- `{category}.{domain}.1440.news` - Category-specific feeds (future)
PDS configuration in `pds.env`:
```
PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>
```
## Dashboard Authentication
The dashboard is protected by AT Protocol OAuth 2.0. Only the `@1440.news` handle can access it.
### OAuth Setup
1. Generate configuration:
```bash
go run ./cmd/genkey
```
2. Create `oauth.env` with the generated values:
```
OAUTH_COOKIE_SECRET=<generated_hex_string>
OAUTH_PRIVATE_JWK=<generated_jwk_json>
```
3. Optionally set the base URL (defaults to https://app.1440.news):
```
OAUTH_BASE_URL=https://app.1440.news
```
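For reference, a hedged sketch of the kind of output a generator like `cmd/genkey` could produce (a random 32-byte cookie secret and an ES256 private JWK); the real command's format may differ:
```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// b64 encodes bytes as base64url without padding (JWK style).
func b64(b []byte) string { return base64.RawURLEncoding.EncodeToString(b) }

func main() {
	// 32 random bytes for the AES-256-GCM cookie key, hex-encoded.
	secret := make([]byte, 32)
	if _, err := rand.Read(secret); err != nil {
		panic(err)
	}
	fmt.Printf("OAUTH_COOKIE_SECRET=%s\n", hex.EncodeToString(secret))

	// P-256 key as a private JWK (ES256), a curve commonly used for
	// AT Protocol OAuth client assertions.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	jwk := map[string]string{
		"kty": "EC",
		"crv": "P-256",
		"x":   b64(key.PublicKey.X.FillBytes(make([]byte, 32))),
		"y":   b64(key.PublicKey.Y.FillBytes(make([]byte, 32))),
		"d":   b64(key.D.FillBytes(make([]byte, 32))),
	}
	out, _ := json.Marshal(jwk)
	fmt.Printf("OAUTH_PRIVATE_JWK=%s\n", out)
}
```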
### OAuth Flow
1. User navigates to `/dashboard` -> redirected to `/auth/login`
2. User enters their Bluesky handle
3. User is redirected to Bluesky authorization
4. After approval, callback verifies handle is `1440.news`
5. Session cookie is set, user redirected to dashboard
### OAuth Endpoints
- `/.well-known/oauth-client-metadata` - Client metadata (public)
- `/.well-known/jwks.json` - Public JWK set (public)
- `/auth/login` - Login page / initiates OAuth flow
- `/auth/callback` - OAuth callback handler
- `/auth/logout` - Clears session
- `/auth/session` - Returns current session info (JSON)
### Security Notes
- Tokens are stored server-side only (BFF pattern)
- Browser only receives an encrypted session cookie (AES-256-GCM; see the sketch below)
- Access restricted to single handle (`1440.news`)
- Sessions expire after 24 hours
- Automatic token refresh when within 5 minutes of expiry
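As an illustration of the AES-256-GCM cookie sealing mentioned above, a self-contained sketch (not the actual code in `oauth_session.go`):
```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// seal encrypts a session payload with AES-256-GCM; the nonce is prepended
// to the ciphertext and the result is base64url-encoded as the cookie value.
func seal(key, plaintext []byte) (string, error) {
	block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(gcm.Seal(nonce, nonce, plaintext, nil)), nil
}

// open reverses seal, authenticating and decrypting the cookie value.
func open(key []byte, cookie string) ([]byte, error) {
	raw, err := base64.RawURLEncoding.DecodeString(cookie)
	if err != nil {
		return nil, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(raw) < gcm.NonceSize() {
		return nil, errors.New("cookie too short")
	}
	return gcm.Open(nil, raw[:gcm.NonceSize()], raw[gcm.NonceSize():], nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	c, _ := seal(key, []byte(`{"did":"did:plc:example","handle":"1440.news"}`))
	p, _ := open(key, c)
	fmt.Println(string(p))
}
```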