# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Build & Run

```sh
go build -o 1440.news .   # Build
./1440.news               # Run (starts dashboard at http://localhost:4321)
go fmt ./...              # Format
go vet ./...              # Static analysis
```
## Database Setup

Requires PostgreSQL. Start the database first:

```sh
cd ../postgres && docker compose up -d
```
## Environment Variables

Set via environment or create a `.env` file:

```sh
# Database connection (individual vars)
DB_HOST=atproto-postgres   # Default: atproto-postgres
DB_PORT=5432               # Default: 5432
DB_USER=news_1440          # Default: news_1440
DB_PASSWORD=<password>     # Or use DB_PASSWORD_FILE
DB_NAME=news_1440          # Default: news_1440

# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable
```

For Docker, use `DB_PASSWORD_FILE=/run/secrets/db_password` with Docker secrets.
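A minimal sketch of how the lookup can honor both variables; the `resolvePassword` helper is hypothetical, and the real code in this repo may structure it differently:

```go
package config

import (
	"os"
	"strings"
)

// resolvePassword returns DB_PASSWORD, or reads the Docker secret named by
// DB_PASSWORD_FILE when the plain variable is unset. (Illustrative helper.)
func resolvePassword() (string, error) {
	if pw := os.Getenv("DB_PASSWORD"); pw != "" {
		return pw, nil
	}
	if path := os.Getenv("DB_PASSWORD_FILE"); path != "" {
		data, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		// Secret files often end with a trailing newline; strip it.
		return strings.TrimSpace(string(data)), nil
	}
	return "", nil
}
```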
Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.
## Architecture
Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.
### Concurrent Loops (main.go)
The application runs six independent goroutine loops (a sketch of the shared loop pattern follows the list):

- Import loop - Reads `vertices.txt.gz` and inserts domains into DB in 10k batches
- Crawl loop - Worker pool processes unchecked domains, discovers feeds
- Check loop - Worker pool re-checks known feeds for updates (conditional HTTP)
- Stats loop - Updates cached dashboard statistics every minute
- Cleanup loop - Removes items older than 12 months (weekly)
- Publish loop - Autopublishes items from approved feeds to AT Protocol PDS
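A minimal sketch of a ticker-driven loop, assuming each loop does one unit of work per tick; `runLoop`, the intervals, and the work functions are illustrative, not the actual main.go wiring:

```go
package main

import (
	"context"
	"time"
)

// runLoop sketches the pattern each goroutine loop can follow:
// do one unit of work, then wait for the next tick or shutdown.
func runLoop(ctx context.Context, interval time.Duration, work func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		work(ctx)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func updateStats(ctx context.Context)  { /* refresh cached dashboard stats */ }
func cleanupItems(ctx context.Context) { /* drop items older than 12 months */ }

func main() {
	ctx := context.Background()
	// Hypothetical wiring; the real intervals and loop bodies live in main.go.
	go runLoop(ctx, time.Minute, updateStats)     // Stats loop
	go runLoop(ctx, 7*24*time.Hour, cleanupItems) // Cleanup loop
	select {} // the dashboard HTTP server blocks here in the real app
}
```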
### File Structure
| File | Purpose |
|---|---|
| `crawler.go` | Crawler struct, worker pools, page fetching, recursive crawl logic |
| `domain.go` | Domain struct, DB operations, vertices file import |
| `feed.go` | Feed/Item structs, DB operations, feed checking with HTTP caching |
| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
| `util.go` | URL normalization, host utilities, TLD extraction |
| `db.go` | PostgreSQL schema (domains, feeds, items tables with tsvector FTS) |
| `dashboard.go` | HTTP server, JSON APIs, HTML template |
| `publisher.go` | AT Protocol PDS integration for posting items |
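As a sketch of the URL-normalization rule that feed URLs follow (strip `https://`, `http://`, and a leading `www.`); the Go helper name is hypothetical, and util.go's actual implementation may differ:

```go
package util

import "strings"

// normalizeFeedURL sketches the scheme/www stripping applied to feed URLs
// (the database also enforces this via a trigger).
func normalizeFeedURL(raw string) string {
	u := strings.TrimSpace(raw)
	u = strings.TrimPrefix(u, "https://")
	u = strings.TrimPrefix(u, "http://")
	u = strings.TrimPrefix(u, "www.")
	return u
}
```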
## Database Schema

PostgreSQL with the pgx driver, using connection pooling:

- `domains` - Hosts to crawl (status: unchecked/checked/error)
- `feeds` - Discovered RSS/Atom feeds with metadata and cache headers
- `items` - Individual feed entries (`guid` + `feed_url` unique)
- `search_vector` - GENERATED tsvector columns for full-text search (GIN indexed)

Column naming: snake_case (e.g., `source_host`, `pub_date`, `item_count`)
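A minimal sketch of the pooled connection plus the generated-tsvector pattern; the column list is illustrative (db.go defines the real schema):

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Illustrative DDL showing the GENERATED tsvector + GIN index pattern;
// the actual items table in db.go has more columns.
const schemaSketch = `
CREATE TABLE IF NOT EXISTS items (
    feed_url      TEXT NOT NULL,
    guid          TEXT NOT NULL,
    title         TEXT,
    description   TEXT,
    pub_date      TIMESTAMPTZ,
    search_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
    ) STORED,
    UNIQUE (guid, feed_url)
);
CREATE INDEX IF NOT EXISTS items_search_idx ON items USING GIN (search_vector);
`

func main() {
	ctx := context.Background()
	pool, err := pgxpool.New(ctx, "postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
	if _, err := pool.Exec(ctx, schemaSketch); err != nil {
		log.Fatal(err)
	}
}
```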
## Crawl Logic

- Domain picked from `unchecked` status (random order)
- Try HTTPS, fall back to HTTP
- Recursive crawl up to `MaxDepth=10`, `MaxPagesPerHost=10`
- Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed (see the sketch after this list)
- Parse discovered feeds for metadata, save with `next_crawl_at`
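A sketch of the extraction step using `golang.org/x/net/html`; the traversal and filtering here are illustrative, and html.go's anchor-feed detection applies stricter filtering to avoid false positives:

```go
package htmlscan

import (
	"strings"

	"golang.org/x/net/html"
)

// extractFeedLinks sketches feed discovery: collect <link rel="alternate">
// hrefs plus anchor hrefs that look feed-ish (contain rss/atom/feed).
func extractFeedLinks(root *html.Node) []string {
	var found []string
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			attrs := map[string]string{}
			for _, a := range n.Attr {
				attrs[strings.ToLower(a.Key)] = a.Val
			}
			switch n.Data {
			case "link":
				if strings.EqualFold(attrs["rel"], "alternate") && attrs["href"] != "" {
					found = append(found, attrs["href"])
				}
			case "a":
				href := strings.ToLower(attrs["href"])
				if strings.Contains(href, "rss") || strings.Contains(href, "atom") ||
					strings.Contains(href, "feed") {
					found = append(found, attrs["href"])
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(root)
	return found
}
```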
## Feed Checking

Uses conditional HTTP (`ETag`, `If-Modified-Since`). Adaptive backoff: base 100s + 100s per consecutive no-change. Respects RSS `<ttl>` and Syndication namespace hints.
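A sketch of the conditional fetch and the stated backoff formula; function names are illustrative, and feed.go's real signatures will differ:

```go
package feedcheck

import (
	"net/http"
	"time"
)

// conditionalGet sends the cached validators so the server can answer
// 304 Not Modified when the feed hasn't changed.
func conditionalGet(client *http.Client, url, etag, lastModified string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	return client.Do(req)
}

// nextCheckDelay implements the documented backoff: base 100s plus 100s per
// consecutive no-change response (before any <ttl>/Syndication overrides).
func nextCheckDelay(noChangeStreak int) time.Duration {
	return time.Duration(100+100*noChangeStreak) * time.Second
}
```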
## Publishing

Feeds with `publish_status = 'pass'` have their items automatically posted to AT Protocol.

Status values: `held` (default), `pass` (approved), `deny` (rejected).
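A sketch of one publish step using the standard `com.atproto.repo.createRecord` XRPC call; the record type (`app.bsky.feed.post`) and helper shape are assumptions, since publisher.go's actual record schema isn't shown here:

```go
package publisher

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// postItem sketches publishing one item as a post record on the PDS.
// accessJwt and did come from a com.atproto.server.createSession login.
func postItem(pdsHost, accessJwt, did, text string) error {
	body, err := json.Marshal(map[string]any{
		"repo":       did,
		"collection": "app.bsky.feed.post",
		"record": map[string]any{
			"$type":     "app.bsky.feed.post",
			"text":      text,
			"createdAt": time.Now().UTC().Format(time.RFC3339),
		},
	})
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost,
		pdsHost+"/xrpc/com.atproto.repo.createRecord", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+accessJwt)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("createRecord failed: %s", resp.Status)
	}
	return nil
}
```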
## AT Protocol Integration

Domain: `1440.news`

User structure:

- `wehrv.1440.news` - Owner/admin account
- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
- `{category}.{domain}.1440.news` - Category-specific feeds (future)

PDS configuration in `pds.env`:

```sh
PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>
```