Migrate from SQLite to PostgreSQL

- Replace modernc.org/sqlite with jackc/pgx/v5
- Update all SQL queries for PostgreSQL syntax ($1, $2 placeholders)
- Use snake_case column names throughout
- Replace SQLite FTS5 with PostgreSQL tsvector/tsquery full-text search
- Add connection pooling with pgxpool
- Support Docker secrets for database password
- Add trigger to normalize feed URLs (strip https://, http://, www.)
- Fix anchor feed detection regex to avoid false positives
- Connect app container to atproto network for PostgreSQL access
- Add version indicator to dashboard UI

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in: primal
2026-01-28 20:38:13 -05:00
parent 75835d771d
commit f4afb29980
11 changed files with 1525 additions and 1137 deletions
+48 -14
@@ -11,20 +11,47 @@ go fmt ./... # Format
go vet ./... # Static analysis
```
### Database Setup
Requires PostgreSQL. Start the database first:
```bash
cd ../postgres && docker compose up -d
```
### Environment Variables
Set via environment or create a `.env` file:
```bash
# Database connection (individual vars)
DB_HOST=atproto-postgres # Default: atproto-postgres
DB_PORT=5432 # Default: 5432
DB_USER=news_1440 # Default: news_1440
DB_PASSWORD=<password> # Or use DB_PASSWORD_FILE
DB_NAME=news_1440 # Default: news_1440
# Or use a connection string
DATABASE_URL=postgres://news_1440:password@atproto-postgres:5432/news_1440?sslmode=disable
```
For Docker, use `DB_PASSWORD_FILE=/run/secrets/db_password` with Docker secrets.
Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.
## Architecture
Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in SQLite, and provides a web dashboard.
Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in PostgreSQL, and provides a web dashboard.
### Concurrent Loops (main.go)
The application runs five independent goroutine loops:
The application runs six independent goroutine loops:
- **Import loop** - Reads `vertices.txt.gz` and inserts domains into DB in 10k batches
- **Crawl loop** - Worker pool processes unchecked domains, discovers feeds
- **Check loop** - Worker pool re-checks known feeds for updates (conditional HTTP)
- **Stats loop** - Updates cached dashboard statistics every minute
- **Cleanup loop** - Removes items older than 12 months (weekly)
- **Publish loop** - Autopublishes items from approved feeds to AT Protocol PDS
### File Structure
@@ -36,16 +63,19 @@ The application runs five independent goroutine loops:
| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
| `util.go` | URL normalization, host utilities, TLD extraction |
| `db.go` | SQLite schema (domains, feeds, items tables with FTS5) |
| `db.go` | PostgreSQL schema (domains, feeds, items tables with tsvector FTS) |
| `dashboard.go` | HTTP server, JSON APIs, HTML template |
| `publisher.go` | AT Protocol PDS integration for posting items |
### Database Schema
SQLite with WAL mode at `feeds/feeds.db`:
PostgreSQL with pgx driver, using connection pooling:
- **domains** - Hosts to crawl (status: unchecked/checked/error)
- **feeds** - Discovered RSS/Atom feeds with metadata and cache headers
- **items** - Individual feed entries (guid + feedUrl unique)
- **feeds_fts / items_fts** - FTS5 virtual tables for search
- **items** - Individual feed entries (guid + feed_url unique)
- **search_vector** - GENERATED tsvector columns for full-text search (GIN indexed)
Column naming: snake_case (e.g., `source_host`, `pub_date`, `item_count`)
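For example, a full-text search against the GIN-indexed `search_vector` columns might look like the following sketch (illustrative only, not part of this commit; `pool` and `ctx` stand in for the application's pgxpool handle and context):
```go
// Sketch: rank feeds by relevance using the generated search_vector column.
rows, err := pool.Query(ctx, `
    SELECT url, title
    FROM feeds
    WHERE search_vector @@ to_tsquery('english', $1)
    ORDER BY ts_rank(search_vector, to_tsquery('english', $1)) DESC
    LIMIT 20`, "climate & change")
if err != nil {
    return err
}
defer rows.Close()
```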
### Crawl Logic
@@ -53,13 +83,18 @@ SQLite with WAL mode at `feeds/feeds.db`:
2. Try HTTPS, fall back to HTTP
3. Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
4. Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed
5. Parse discovered feeds for metadata, save with nextCrawlAt
5. Parse discovered feeds for metadata, save with next_crawl_at
### Feed Checking
Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 100s per consecutive no-change. Respects RSS `<ttl>` and Syndication namespace hints.
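A minimal sketch of that backoff (illustrative only; the function and parameter names are hypothetical, and the real next-crawl calculation lives in `parser.go`):
```go
// Sketch: the re-check delay grows by 100s per consecutive unchanged fetch;
// an RSS <ttl> hint wins when it asks for a longer wait.
func nextCheckDelay(consecutiveNoChange, ttlMinutes int) time.Duration {
    delay := 100*time.Second + time.Duration(consecutiveNoChange)*100*time.Second
    if ttl := time.Duration(ttlMinutes) * time.Minute; ttl > delay {
        delay = ttl
    }
    return delay
}
```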
## AT Protocol Integration (Planned)
### Publishing
Feeds with `publish_status = 'pass'` have their items automatically posted to AT Protocol.
Status values: `held` (default), `pass` (approved), `deny` (rejected).
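Flipping a feed to `pass` is a single update; the publish loop then picks up its backlog automatically (sketch, using the `DB.Exec` wrapper from `db.go`):
```go
// Sketch: approve a feed for publishing.
_, err := db.Exec(`UPDATE feeds SET publish_status = 'pass' WHERE url = $1`, feedURL)
```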
## AT Protocol Integration
Domain: 1440.news
@@ -68,9 +103,8 @@ User structure:
- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
- `{category}.{domain}.1440.news` - Category-specific feeds (future)
Phases:
1. Local PDS setup
2. Account management
3. Auto-create domain users
4. Post articles to accounts
5. Category detection
PDS configuration in `pds.env`:
```
PDS_HOST=https://pds.1440.news
PDS_ADMIN_PASSWORD=<admin_password>
```
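A sketch of how the catch-all handle could be derived from a feed URL under this scheme (`DeriveHandleFromFeed` is referenced by the crawler but not shown in this diff, so this is an assumption rather than its actual implementation):
```go
// Hypothetical sketch: "https://www.wsj.com/rss" -> "wsj.com.1440.news".
func deriveHandle(feedURL string) string {
    u := strings.TrimPrefix(strings.TrimPrefix(feedURL, "https://"), "http://")
    u = strings.TrimPrefix(u, "www.")
    host := strings.SplitN(u, "/", 2)[0]
    return host + ".1440.news"
}
```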
+233 -45
@@ -1,10 +1,10 @@
package main
import (
"database/sql"
"fmt"
"io"
"net/http"
"os"
"runtime"
"strings"
"sync"
@@ -25,17 +25,17 @@ type Crawler struct {
hostsProcessed int32
feedsChecked int32
startTime time.Time
db *sql.DB
db *DB
displayedCrawlRate int
displayedCheckRate int
domainsImported int32
cachedStats *DashboardStats
cachedAllDomains []DomainStat
statsMu sync.RWMutex
cachedStats *DashboardStats
cachedAllDomains []DomainStat
statsMu sync.RWMutex
}
func NewCrawler(dbPath string) (*Crawler, error) {
db, err := OpenDatabase(dbPath)
func NewCrawler(connString string) (*Crawler, error) {
db, err := OpenDatabase(connString)
if err != nil {
return nil, fmt.Errorf("failed to open database: %v", err)
}
@@ -61,12 +61,6 @@ func NewCrawler(dbPath string) (*Crawler, error) {
func (c *Crawler) Close() error {
if c.db != nil {
// Checkpoint WAL to merge it back into main database before closing
// This prevents corruption if the container is stopped mid-write
fmt.Println("Checkpointing WAL...")
if _, err := c.db.Exec("PRAGMA wal_checkpoint(TRUNCATE)"); err != nil {
fmt.Printf("WAL checkpoint warning: %v\n", err)
}
fmt.Println("Closing database...")
return c.db.Close()
}
@@ -95,53 +89,247 @@ func (c *Crawler) StartCleanupLoop() {
}
// StartMaintenanceLoop performs periodic database maintenance
// - WAL checkpoint every 5 minutes to prevent WAL bloat and reduce corruption risk
// - Quick integrity check every hour to detect issues early
// - Hot backup every 24 hours for recovery
func (c *Crawler) StartMaintenanceLoop() {
checkpointTicker := time.NewTicker(5 * time.Minute)
integrityTicker := time.NewTicker(1 * time.Hour)
backupTicker := time.NewTicker(24 * time.Hour)
defer checkpointTicker.Stop()
defer integrityTicker.Stop()
defer backupTicker.Stop()
vacuumTicker := time.NewTicker(24 * time.Hour)
analyzeTicker := time.NewTicker(1 * time.Hour)
defer vacuumTicker.Stop()
defer analyzeTicker.Stop()
for {
select {
case <-checkpointTicker.C:
// Passive checkpoint - doesn't block writers
if _, err := c.db.Exec("PRAGMA wal_checkpoint(PASSIVE)"); err != nil {
fmt.Printf("WAL checkpoint error: %v\n", err)
case <-analyzeTicker.C:
// Update statistics for query planner
if _, err := c.db.Exec("ANALYZE"); err != nil {
fmt.Printf("ANALYZE error: %v\n", err)
}
case <-integrityTicker.C:
// Quick check is faster than full integrity_check
var result string
if err := c.db.QueryRow("PRAGMA quick_check").Scan(&result); err != nil {
fmt.Printf("Integrity check error: %v\n", err)
} else if result != "ok" {
fmt.Printf("WARNING: Database integrity issue detected: %s\n", result)
case <-vacuumTicker.C:
// Reclaim dead tuple space (VACUUM is lighter than VACUUM FULL)
fmt.Println("Running VACUUM...")
if _, err := c.db.Exec("VACUUM"); err != nil {
fmt.Printf("VACUUM error: %v\n", err)
} else {
fmt.Println("VACUUM complete")
}
case <-backupTicker.C:
c.createBackup()
}
}
}
// createBackup creates a hot backup of the database using SQLite's backup API
func (c *Crawler) createBackup() {
backupPath := "feeds/feeds.db.backup"
fmt.Println("Creating database backup...")
// StartPublishLoop automatically publishes unpublished items for approved feeds
// Grabs up to 50 items sorted by discovered_at, publishes one per second, then reloops
func (c *Crawler) StartPublishLoop() {
// Load PDS credentials from environment or pds.env file
pdsHost := os.Getenv("PDS_HOST")
pdsAdminPassword := os.Getenv("PDS_ADMIN_PASSWORD")
// Use SQLite's online backup via VACUUM INTO (available in SQLite 3.27+)
// This creates a consistent snapshot without blocking writers
if _, err := c.db.Exec("VACUUM INTO ?", backupPath); err != nil {
fmt.Printf("Backup error: %v\n", err)
if pdsHost == "" || pdsAdminPassword == "" {
if data, err := os.ReadFile("pds.env"); err == nil {
for _, line := range strings.Split(string(data), "\n") {
line = strings.TrimSpace(line)
if strings.HasPrefix(line, "#") || line == "" {
continue
}
parts := strings.SplitN(line, "=", 2)
if len(parts) == 2 {
key := strings.TrimSpace(parts[0])
value := strings.TrimSpace(parts[1])
switch key {
case "PDS_HOST":
pdsHost = value
case "PDS_ADMIN_PASSWORD":
pdsAdminPassword = value
}
}
}
}
}
if pdsHost == "" || pdsAdminPassword == "" {
fmt.Println("Publish loop: PDS credentials not configured, skipping")
return
}
fmt.Printf("Backup created: %s\n", backupPath)
fmt.Printf("Publish loop: starting with PDS %s\n", pdsHost)
feedPassword := "feed1440!"
// Cache sessions per account
sessions := make(map[string]*PDSSession)
publisher := NewPublisher(pdsHost)
for {
// Get up to 50 unpublished items from approved feeds, sorted by discovered_at ASC
items, err := c.GetAllUnpublishedItems(50)
if err != nil {
fmt.Printf("Publish loop error: %v\n", err)
time.Sleep(1 * time.Second)
continue
}
if len(items) == 0 {
time.Sleep(1 * time.Second)
continue
}
// Publish one item per second
for _, item := range items {
// Get or create session for this feed's account
account := c.getAccountForFeed(item.FeedURL)
if account == "" {
time.Sleep(1 * time.Second)
continue
}
session, ok := sessions[account]
if !ok {
// Try to log in
session, err = publisher.CreateSession(account, feedPassword)
if err != nil {
// Account might not exist - try to create it
inviteCode, err := publisher.CreateInviteCode(pdsAdminPassword, 1)
if err != nil {
fmt.Printf("Publish: failed to create invite for %s: %v\n", account, err)
time.Sleep(1 * time.Second)
continue
}
email := account + "@1440.news"
session, err = publisher.CreateAccount(account, email, feedPassword, inviteCode)
if err != nil {
fmt.Printf("Publish: failed to create account %s: %v\n", account, err)
time.Sleep(1 * time.Second)
continue
}
fmt.Printf("Publish: created account %s\n", account)
c.db.Exec("UPDATE feeds SET publish_account = $1 WHERE url = $2", account, item.FeedURL)
// Set up profile for new account
feedInfo := c.getFeedInfo(item.FeedURL)
if feedInfo != nil {
displayName := feedInfo.Title
if displayName == "" {
displayName = account
}
description := feedInfo.Description
if description == "" {
description = "News feed via 1440.news"
}
// Truncate if needed
if len(displayName) > 64 {
displayName = displayName[:61] + "..."
}
if len(description) > 256 {
description = description[:253] + "..."
}
if err := publisher.UpdateProfile(session, displayName, description, nil); err != nil {
fmt.Printf("Publish: failed to set profile for %s: %v\n", account, err)
} else {
fmt.Printf("Publish: set profile for %s\n", account)
}
}
}
sessions[account] = session
}
// Publish the item
uri, err := publisher.PublishItem(session, &item)
if err != nil {
fmt.Printf("Publish: failed item %d: %v\n", item.ID, err)
// Clear session cache on auth errors
if strings.Contains(err.Error(), "401") || strings.Contains(err.Error(), "auth") {
delete(sessions, account)
}
} else {
c.MarkItemPublished(item.ID, uri)
fmt.Printf("Publish: %s -> %s\n", item.Title[:min(40, len(item.Title))], account)
}
time.Sleep(1 * time.Second)
}
time.Sleep(1 * time.Second)
}
}
// getAccountForFeed returns the publish account for a feed URL
func (c *Crawler) getAccountForFeed(feedURL string) string {
var account *string
err := c.db.QueryRow(`
SELECT publish_account FROM feeds
WHERE url = $1 AND publish_status = 'pass' AND status = 'active'
`, feedURL).Scan(&account)
if err != nil || account == nil || *account == "" {
// Derive handle from feed URL
return DeriveHandleFromFeed(feedURL)
}
return *account
}
// FeedInfo holds basic feed metadata for profile setup
type FeedInfo struct {
Title string
Description string
SiteURL string
}
// getFeedInfo returns feed metadata for profile setup
func (c *Crawler) getFeedInfo(feedURL string) *FeedInfo {
var title, description, siteURL *string
err := c.db.QueryRow(`
SELECT title, description, site_url FROM feeds WHERE url = $1
`, feedURL).Scan(&title, &description, &siteURL)
if err != nil {
return nil
}
return &FeedInfo{
Title: StringValue(title),
Description: StringValue(description),
SiteURL: StringValue(siteURL),
}
}
// GetAllUnpublishedItems returns unpublished items from all approved feeds
func (c *Crawler) GetAllUnpublishedItems(limit int) ([]Item, error) {
rows, err := c.db.Query(`
SELECT i.id, i.feed_url, i.guid, i.title, i.link, i.description, i.content,
i.author, i.pub_date, i.discovered_at
FROM items i
JOIN feeds f ON i.feed_url = f.url
WHERE f.publish_status = 'pass'
AND f.status = 'active'
AND i.published_at IS NULL
ORDER BY i.discovered_at ASC
LIMIT $1
`, limit)
if err != nil {
return nil, err
}
defer rows.Close()
var items []Item
for rows.Next() {
var item Item
var guid, title, link, description, content, author *string
var pubDate, discoveredAt *time.Time
err := rows.Scan(&item.ID, &item.FeedURL, &guid, &title, &link, &description,
&content, &author, &pubDate, &discoveredAt)
if err != nil {
continue
}
item.GUID = StringValue(guid)
item.Title = StringValue(title)
item.Link = StringValue(link)
item.Description = StringValue(description)
item.Content = StringValue(content)
item.Author = StringValue(author)
item.PubDate = TimeValue(pubDate)
item.DiscoveredAt = TimeValue(discoveredAt)
items = append(items, item)
}
return items, nil
}
// StartCrawlLoop runs the domain crawling loop independently
+417 -353
File diff suppressed because it is too large
+222 -140
@@ -1,27 +1,31 @@
package main
import (
"database/sql"
"context"
"fmt"
"net/url"
"os"
"strings"
"time"
_ "modernc.org/sqlite"
"github.com/jackc/pgx/v5"
"github.com/jackc/pgx/v5/pgxpool"
)
const schema = `
CREATE TABLE IF NOT EXISTS domains (
host TEXT PRIMARY KEY,
status TEXT NOT NULL DEFAULT 'unchecked',
discoveredAt DATETIME NOT NULL,
lastCrawledAt DATETIME,
feedsFound INTEGER DEFAULT 0,
lastError TEXT,
discovered_at TIMESTAMPTZ NOT NULL,
last_crawled_at TIMESTAMPTZ,
feeds_found INTEGER DEFAULT 0,
last_error TEXT,
tld TEXT
);
CREATE INDEX IF NOT EXISTS idx_domains_status ON domains(status);
CREATE INDEX IF NOT EXISTS idx_domains_tld ON domains(tld);
CREATE INDEX IF NOT EXISTS idx_domains_feedsFound ON domains(feedsFound DESC) WHERE feedsFound > 0;
CREATE INDEX IF NOT EXISTS idx_domains_feeds_found ON domains(feeds_found DESC) WHERE feeds_found > 0;
CREATE TABLE IF NOT EXISTS feeds (
url TEXT PRIMARY KEY,
@@ -30,196 +34,195 @@ CREATE TABLE IF NOT EXISTS feeds (
title TEXT,
description TEXT,
language TEXT,
siteUrl TEXT,
site_url TEXT,
discoveredAt DATETIME NOT NULL,
lastCrawledAt DATETIME,
nextCrawlAt DATETIME,
lastBuildDate DATETIME,
discovered_at TIMESTAMPTZ NOT NULL,
last_crawled_at TIMESTAMPTZ,
next_crawl_at TIMESTAMPTZ,
last_build_date TIMESTAMPTZ,
etag TEXT,
lastModified TEXT,
last_modified TEXT,
ttlMinutes INTEGER,
updatePeriod TEXT,
updateFreq INTEGER,
ttl_minutes INTEGER,
update_period TEXT,
update_freq INTEGER,
status TEXT DEFAULT 'active',
errorCount INTEGER DEFAULT 0,
lastError TEXT,
lastErrorAt DATETIME,
error_count INTEGER DEFAULT 0,
last_error TEXT,
last_error_at TIMESTAMPTZ,
sourceUrl TEXT,
sourceHost TEXT,
source_url TEXT,
source_host TEXT,
tld TEXT,
itemCount INTEGER,
avgPostFreqHrs REAL,
oldestItemDate DATETIME,
newestItemDate DATETIME,
item_count INTEGER,
avg_post_freq_hrs DOUBLE PRECISION,
oldest_item_date TIMESTAMPTZ,
newest_item_date TIMESTAMPTZ,
noUpdate INTEGER DEFAULT 0,
no_update INTEGER DEFAULT 0,
-- Publishing to PDS
publishStatus TEXT DEFAULT 'held' CHECK(publishStatus IN ('held', 'pass', 'fail')),
publishAccount TEXT
publish_status TEXT DEFAULT 'held' CHECK(publish_status IN ('held', 'pass', 'deny')),
publish_account TEXT,
-- Full-text search vector
search_vector tsvector GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(description, '')), 'B') ||
setweight(to_tsvector('english', coalesce(url, '')), 'C')
) STORED
);
CREATE INDEX IF NOT EXISTS idx_feeds_sourceHost ON feeds(sourceHost);
CREATE INDEX IF NOT EXISTS idx_feeds_publishStatus ON feeds(publishStatus);
CREATE INDEX IF NOT EXISTS idx_feeds_sourceHost_url ON feeds(sourceHost, url);
CREATE INDEX IF NOT EXISTS idx_feeds_source_host ON feeds(source_host);
CREATE INDEX IF NOT EXISTS idx_feeds_publish_status ON feeds(publish_status);
CREATE INDEX IF NOT EXISTS idx_feeds_source_host_url ON feeds(source_host, url);
CREATE INDEX IF NOT EXISTS idx_feeds_tld ON feeds(tld);
CREATE INDEX IF NOT EXISTS idx_feeds_tld_sourceHost ON feeds(tld, sourceHost);
CREATE INDEX IF NOT EXISTS idx_feeds_tld_source_host ON feeds(tld, source_host);
CREATE INDEX IF NOT EXISTS idx_feeds_type ON feeds(type);
CREATE INDEX IF NOT EXISTS idx_feeds_category ON feeds(category);
CREATE INDEX IF NOT EXISTS idx_feeds_status ON feeds(status);
CREATE INDEX IF NOT EXISTS idx_feeds_discoveredAt ON feeds(discoveredAt);
CREATE INDEX IF NOT EXISTS idx_feeds_discovered_at ON feeds(discovered_at);
CREATE INDEX IF NOT EXISTS idx_feeds_title ON feeds(title);
CREATE INDEX IF NOT EXISTS idx_feeds_search ON feeds USING GIN(search_vector);
CREATE TABLE IF NOT EXISTS items (
id INTEGER PRIMARY KEY AUTOINCREMENT,
feedUrl TEXT NOT NULL,
id BIGSERIAL PRIMARY KEY,
feed_url TEXT NOT NULL,
guid TEXT,
title TEXT,
link TEXT,
description TEXT,
content TEXT,
author TEXT,
pubDate DATETIME,
discoveredAt DATETIME NOT NULL,
updatedAt DATETIME,
pub_date TIMESTAMPTZ,
discovered_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ,
-- Media attachments
enclosureUrl TEXT,
enclosureType TEXT,
enclosureLength INTEGER,
imageUrls TEXT, -- JSON array of image URLs
enclosure_url TEXT,
enclosure_type TEXT,
enclosure_length BIGINT,
image_urls TEXT, -- JSON array of image URLs
-- Publishing to PDS
publishedAt DATETIME,
publishedUri TEXT,
published_at TIMESTAMPTZ,
published_uri TEXT,
UNIQUE(feedUrl, guid)
-- Full-text search vector
search_vector tsvector GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(description, '')), 'B') ||
setweight(to_tsvector('english', coalesce(content, '')), 'C') ||
setweight(to_tsvector('english', coalesce(author, '')), 'D')
) STORED,
UNIQUE(feed_url, guid)
);
CREATE INDEX IF NOT EXISTS idx_items_feedUrl ON items(feedUrl);
CREATE INDEX IF NOT EXISTS idx_items_pubDate ON items(pubDate DESC);
CREATE INDEX IF NOT EXISTS idx_items_feed_url ON items(feed_url);
CREATE INDEX IF NOT EXISTS idx_items_pub_date ON items(pub_date DESC);
CREATE INDEX IF NOT EXISTS idx_items_link ON items(link);
CREATE INDEX IF NOT EXISTS idx_items_feedUrl_pubDate ON items(feedUrl, pubDate DESC);
CREATE INDEX IF NOT EXISTS idx_items_unpublished ON items(feedUrl, publishedAt) WHERE publishedAt IS NULL;
CREATE INDEX IF NOT EXISTS idx_items_feed_url_pub_date ON items(feed_url, pub_date DESC);
CREATE INDEX IF NOT EXISTS idx_items_unpublished ON items(feed_url, published_at) WHERE published_at IS NULL;
CREATE INDEX IF NOT EXISTS idx_items_search ON items USING GIN(search_vector);
-- Full-text search for feeds
CREATE VIRTUAL TABLE IF NOT EXISTS feeds_fts USING fts5(
url,
title,
description,
content='feeds',
content_rowid='rowid'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER IF NOT EXISTS feeds_ai AFTER INSERT ON feeds BEGIN
INSERT INTO feeds_fts(rowid, url, title, description)
VALUES (NEW.rowid, NEW.url, NEW.title, NEW.description);
-- Trigger to normalize feed URLs on insert/update (strips https://, http://, www.)
CREATE OR REPLACE FUNCTION normalize_feed_url()
RETURNS TRIGGER AS $$
BEGIN
NEW.url = regexp_replace(NEW.url, '^https?://', '');
NEW.url = regexp_replace(NEW.url, '^www\.', '');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER IF NOT EXISTS feeds_ad AFTER DELETE ON feeds BEGIN
INSERT INTO feeds_fts(feeds_fts, rowid, url, title, description)
VALUES ('delete', OLD.rowid, OLD.url, OLD.title, OLD.description);
END;
CREATE TRIGGER IF NOT EXISTS feeds_au AFTER UPDATE ON feeds BEGIN
INSERT INTO feeds_fts(feeds_fts, rowid, url, title, description)
VALUES ('delete', OLD.rowid, OLD.url, OLD.title, OLD.description);
INSERT INTO feeds_fts(rowid, url, title, description)
VALUES (NEW.rowid, NEW.url, NEW.title, NEW.description);
END;
-- Full-text search for items
CREATE VIRTUAL TABLE IF NOT EXISTS items_fts USING fts5(
title,
description,
content,
author,
content='items',
content_rowid='id'
);
-- Triggers to keep items FTS in sync
CREATE TRIGGER IF NOT EXISTS items_ai AFTER INSERT ON items BEGIN
INSERT INTO items_fts(rowid, title, description, content, author)
VALUES (NEW.id, NEW.title, NEW.description, NEW.content, NEW.author);
END;
CREATE TRIGGER IF NOT EXISTS items_ad AFTER DELETE ON items BEGIN
INSERT INTO items_fts(items_fts, rowid, title, description, content, author)
VALUES ('delete', OLD.id, OLD.title, OLD.description, OLD.content, OLD.author);
END;
CREATE TRIGGER IF NOT EXISTS items_au AFTER UPDATE ON items BEGIN
INSERT INTO items_fts(items_fts, rowid, title, description, content, author)
VALUES ('delete', OLD.id, OLD.title, OLD.description, OLD.content, OLD.author);
INSERT INTO items_fts(rowid, title, description, content, author)
VALUES (NEW.id, NEW.title, NEW.description, NEW.content, NEW.author);
END;
DROP TRIGGER IF EXISTS normalize_feed_url_trigger ON feeds;
CREATE TRIGGER normalize_feed_url_trigger
BEFORE INSERT OR UPDATE ON feeds
FOR EACH ROW
EXECUTE FUNCTION normalize_feed_url();
`
func OpenDatabase(dbPath string) (*sql.DB, error) {
fmt.Printf("Opening database: %s\n", dbPath)
// DB wraps pgxpool.Pool with helper methods
type DB struct {
*pgxpool.Pool
}
// Use pragmas in connection string for consistent application
// - busy_timeout: wait up to 10s for locks instead of failing immediately
// - journal_mode: WAL for better concurrency and crash recovery
// - synchronous: NORMAL is safe with WAL (fsync at checkpoint, not every commit)
// - wal_autocheckpoint: checkpoint every 1000 pages (~4MB) to prevent WAL bloat
// - foreign_keys: enforce referential integrity
connStr := dbPath + "?_pragma=busy_timeout(10000)&_pragma=journal_mode(WAL)&_pragma=synchronous(NORMAL)&_pragma=wal_autocheckpoint(1000)&_pragma=foreign_keys(ON)"
db, err := sql.Open("sqlite", connStr)
func OpenDatabase(connString string) (*DB, error) {
fmt.Printf("Connecting to database...\n")
// If connection string not provided, try environment variables
if connString == "" {
connString = os.Getenv("DATABASE_URL")
}
if connString == "" {
// Build from individual env vars
host := getEnvOrDefault("DB_HOST", "atproto-postgres")
port := getEnvOrDefault("DB_PORT", "5432")
user := getEnvOrDefault("DB_USER", "news_1440")
dbname := getEnvOrDefault("DB_NAME", "news_1440")
// Support Docker secrets (password file) or direct password
password := os.Getenv("DB_PASSWORD")
if password == "" {
if passwordFile := os.Getenv("DB_PASSWORD_FILE"); passwordFile != "" {
data, err := os.ReadFile(passwordFile)
if err != nil {
return nil, fmt.Errorf("failed to read password file: %v", err)
}
password = strings.TrimSpace(string(data))
}
}
connString = fmt.Sprintf("postgres://%s:%s@%s:%s/%s?sslmode=disable",
user, url.QueryEscape(password), host, port, dbname)
}
config, err := pgxpool.ParseConfig(connString)
if err != nil {
return nil, fmt.Errorf("failed to open database: %v", err)
return nil, fmt.Errorf("failed to parse connection string: %v", err)
}
// Connection pool settings for stability
db.SetMaxOpenConns(4) // Limit concurrent connections
db.SetMaxIdleConns(2) // Keep some connections warm
db.SetConnMaxLifetime(5 * time.Minute) // Recycle connections periodically
db.SetConnMaxIdleTime(1 * time.Minute) // Close idle connections
// Connection pool settings
config.MaxConns = 10
config.MinConns = 2
config.MaxConnLifetime = 5 * time.Minute
config.MaxConnIdleTime = 1 * time.Minute
// Verify connection and show journal mode
var journalMode string
if err := db.QueryRow("PRAGMA journal_mode").Scan(&journalMode); err != nil {
fmt.Printf(" Warning: could not query journal_mode: %v\n", err)
} else {
fmt.Printf(" Journal mode: %s\n", journalMode)
ctx := context.Background()
pool, err := pgxpool.NewWithConfig(ctx, config)
if err != nil {
return nil, fmt.Errorf("failed to connect to database: %v", err)
}
// Verify connection
if err := pool.Ping(ctx); err != nil {
pool.Close()
return nil, fmt.Errorf("failed to ping database: %v", err)
}
fmt.Println(" Connected to PostgreSQL")
db := &DB{pool}
// Create schema
if _, err := db.Exec(schema); err != nil {
db.Close()
if _, err := pool.Exec(ctx, schema); err != nil {
pool.Close()
return nil, fmt.Errorf("failed to create schema: %v", err)
}
fmt.Println(" Schema OK")
// Migrations for existing databases
migrations := []string{
"ALTER TABLE items ADD COLUMN enclosureUrl TEXT",
"ALTER TABLE items ADD COLUMN enclosureType TEXT",
"ALTER TABLE items ADD COLUMN enclosureLength INTEGER",
"ALTER TABLE items ADD COLUMN imageUrls TEXT",
}
for _, m := range migrations {
db.Exec(m) // Ignore errors (column may already exist)
}
// Run stats and ANALYZE in background to avoid blocking startup with large databases
// Run stats in background
go func() {
var domainCount, feedCount int
db.QueryRow("SELECT COUNT(*) FROM domains").Scan(&domainCount)
db.QueryRow("SELECT COUNT(*) FROM feeds").Scan(&feedCount)
pool.QueryRow(context.Background(), "SELECT COUNT(*) FROM domains").Scan(&domainCount)
pool.QueryRow(context.Background(), "SELECT COUNT(*) FROM feeds").Scan(&feedCount)
fmt.Printf(" Existing data: %d domains, %d feeds\n", domainCount, feedCount)
fmt.Println(" Running ANALYZE...")
if _, err := db.Exec("ANALYZE"); err != nil {
if _, err := pool.Exec(context.Background(), "ANALYZE"); err != nil {
fmt.Printf(" Warning: ANALYZE failed: %v\n", err)
} else {
fmt.Println(" ANALYZE complete")
@@ -228,3 +231,82 @@ func OpenDatabase(dbPath string) (*sql.DB, error) {
return db, nil
}
func getEnvOrDefault(key, defaultVal string) string {
if val := os.Getenv(key); val != "" {
return val
}
return defaultVal
}
// QueryRow wraps pool.QueryRow for compatibility
func (db *DB) QueryRow(query string, args ...interface{}) pgx.Row {
return db.Pool.QueryRow(context.Background(), query, args...)
}
// Query wraps pool.Query for compatibility
func (db *DB) Query(query string, args ...interface{}) (pgx.Rows, error) {
return db.Pool.Query(context.Background(), query, args...)
}
// Exec wraps pool.Exec for compatibility
func (db *DB) Exec(query string, args ...interface{}) (int64, error) {
result, err := db.Pool.Exec(context.Background(), query, args...)
if err != nil {
return 0, err
}
return result.RowsAffected(), nil
}
// Begin starts a transaction
func (db *DB) Begin() (pgx.Tx, error) {
return db.Pool.Begin(context.Background())
}
// Close closes the connection pool
func (db *DB) Close() error {
db.Pool.Close()
return nil
}
// NullableString returns nil for empty strings, otherwise the string pointer
func NullableString(s string) *string {
if s == "" {
return nil
}
return &s
}
// NullableTime returns nil for zero times, otherwise the time pointer
func NullableTime(t time.Time) *time.Time {
if t.IsZero() {
return nil
}
return &t
}
// StringValue returns empty string for nil, otherwise the dereferenced value
func StringValue(s *string) string {
if s == nil {
return ""
}
return *s
}
// TimeValue returns zero time for nil, otherwise the dereferenced value
func TimeValue(t *time.Time) time.Time {
if t == nil {
return time.Time{}
}
return *t
}
// ToSearchQuery converts a user query to PostgreSQL tsquery format
func ToSearchQuery(query string) string {
// Simple conversion: split on spaces and join with &
words := strings.Fields(query)
if len(words) == 0 {
return ""
}
return strings.Join(words, " & ")
}
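// Illustrative usage (not part of this commit): the converted query feeds
// to_tsquery against the GIN-indexed search_vector columns, e.g.
//
//	q := ToSearchQuery("solar power")  // "solar & power"
//	rows, err := db.Query(`SELECT url, title FROM feeds
//	    WHERE search_vector @@ to_tsquery('english', $1) LIMIT 50`, q)
//
// Note: raw input containing tsquery operators (':', '!', '&', '|') can make
// to_tsquery error; plainto_tsquery is the forgiving alternative.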
+15 -1
@@ -6,11 +6,19 @@ services:
stop_grace_period: 30s
env_file:
- pds.env
environment:
DB_HOST: atproto-postgres
DB_PORT: 5432
DB_USER: news_1440
DB_PASSWORD_FILE: /run/secrets/db_password
DB_NAME: news_1440
secrets:
- db_password
volumes:
- ./feeds:/app/feeds
- ./vertices.txt.gz:/app/vertices.txt.gz:ro
networks:
- proxy
- atproto
labels:
- "traefik.enable=true"
# Production: HTTPS with Let's Encrypt
@@ -29,6 +37,12 @@ services:
# Shared service
- "traefik.http.services.app-1440-news.loadbalancer.server.port=4321"
secrets:
db_password:
file: ../postgres/secrets/news_1440_password.txt
networks:
proxy:
external: true
atproto:
external: true
+104 -105
@@ -3,13 +3,15 @@ package main
import (
"bufio"
"compress/gzip"
"database/sql"
"context"
"fmt"
"io"
"os"
"strings"
"sync/atomic"
"time"
"github.com/jackc/pgx/v5"
)
// Domain represents a host to be crawled for feeds
@@ -23,78 +25,74 @@ type Domain struct {
TLD string `json:"tld,omitempty"`
}
// saveDomain stores a domain in SQLite
// saveDomain stores a domain in PostgreSQL
func (c *Crawler) saveDomain(domain *Domain) error {
_, err := c.db.Exec(`
INSERT INTO domains (host, status, discoveredAt, lastCrawledAt, feedsFound, lastError, tld)
VALUES (?, ?, ?, ?, ?, ?, ?)
INSERT INTO domains (host, status, discovered_at, last_crawled_at, feeds_found, last_error, tld)
VALUES ($1, $2, $3, $4, $5, $6, $7)
ON CONFLICT(host) DO UPDATE SET
status = excluded.status,
lastCrawledAt = excluded.lastCrawledAt,
feedsFound = excluded.feedsFound,
lastError = excluded.lastError,
tld = excluded.tld
`, domain.Host, domain.Status, domain.DiscoveredAt, nullTime(domain.LastCrawledAt),
domain.FeedsFound, nullString(domain.LastError), domain.TLD)
status = EXCLUDED.status,
last_crawled_at = EXCLUDED.last_crawled_at,
feeds_found = EXCLUDED.feeds_found,
last_error = EXCLUDED.last_error,
tld = EXCLUDED.tld
`, domain.Host, domain.Status, domain.DiscoveredAt, NullableTime(domain.LastCrawledAt),
domain.FeedsFound, NullableString(domain.LastError), domain.TLD)
return err
}
// saveDomainTx stores a domain using a transaction
func (c *Crawler) saveDomainTx(tx *sql.Tx, domain *Domain) error {
_, err := tx.Exec(`
INSERT INTO domains (host, status, discoveredAt, lastCrawledAt, feedsFound, lastError, tld)
VALUES (?, ?, ?, ?, ?, ?, ?)
func (c *Crawler) saveDomainTx(tx pgx.Tx, domain *Domain) error {
_, err := tx.Exec(context.Background(), `
INSERT INTO domains (host, status, discovered_at, last_crawled_at, feeds_found, last_error, tld)
VALUES ($1, $2, $3, $4, $5, $6, $7)
ON CONFLICT(host) DO NOTHING
`, domain.Host, domain.Status, domain.DiscoveredAt, nullTime(domain.LastCrawledAt),
domain.FeedsFound, nullString(domain.LastError), domain.TLD)
`, domain.Host, domain.Status, domain.DiscoveredAt, NullableTime(domain.LastCrawledAt),
domain.FeedsFound, NullableString(domain.LastError), domain.TLD)
return err
}
// domainExists checks if a domain already exists in the database
func (c *Crawler) domainExists(host string) bool {
var exists bool
err := c.db.QueryRow("SELECT EXISTS(SELECT 1 FROM domains WHERE host = ?)", normalizeHost(host)).Scan(&exists)
err := c.db.QueryRow("SELECT EXISTS(SELECT 1 FROM domains WHERE host = $1)", normalizeHost(host)).Scan(&exists)
return err == nil && exists
}
// getDomain retrieves a domain from SQLite
// getDomain retrieves a domain from PostgreSQL
func (c *Crawler) getDomain(host string) (*Domain, error) {
domain := &Domain{}
var lastCrawledAt sql.NullTime
var lastError sql.NullString
var lastCrawledAt *time.Time
var lastError *string
err := c.db.QueryRow(`
SELECT host, status, discoveredAt, lastCrawledAt, feedsFound, lastError, tld
FROM domains WHERE host = ?
SELECT host, status, discovered_at, last_crawled_at, feeds_found, last_error, tld
FROM domains WHERE host = $1
`, normalizeHost(host)).Scan(
&domain.Host, &domain.Status, &domain.DiscoveredAt, &lastCrawledAt,
&domain.FeedsFound, &lastError, &domain.TLD,
)
if err == sql.ErrNoRows {
if err == pgx.ErrNoRows {
return nil, nil
}
if err != nil {
return nil, err
}
if lastCrawledAt.Valid {
domain.LastCrawledAt = lastCrawledAt.Time
}
if lastError.Valid {
domain.LastError = lastError.String
}
domain.LastCrawledAt = TimeValue(lastCrawledAt)
domain.LastError = StringValue(lastError)
return domain, nil
}
// GetUncheckedDomains returns up to limit unchecked domains ordered by discoveredAt (FIFO)
// GetUncheckedDomains returns up to limit unchecked domains ordered by discovered_at (FIFO)
func (c *Crawler) GetUncheckedDomains(limit int) ([]*Domain, error) {
rows, err := c.db.Query(`
SELECT host, status, discoveredAt, lastCrawledAt, feedsFound, lastError, tld
SELECT host, status, discovered_at, last_crawled_at, feeds_found, last_error, tld
FROM domains WHERE status = 'unchecked'
ORDER BY discoveredAt ASC
LIMIT ?
ORDER BY discovered_at ASC
LIMIT $1
`, limit)
if err != nil {
return nil, err
@@ -105,12 +103,12 @@ func (c *Crawler) GetUncheckedDomains(limit int) ([]*Domain, error) {
}
// scanDomains is a helper to scan multiple domain rows
func (c *Crawler) scanDomains(rows *sql.Rows) ([]*Domain, error) {
func (c *Crawler) scanDomains(rows pgx.Rows) ([]*Domain, error) {
var domains []*Domain
for rows.Next() {
domain := &Domain{}
var lastCrawledAt sql.NullTime
var lastError sql.NullString
var lastCrawledAt *time.Time
var lastError *string
if err := rows.Scan(
&domain.Host, &domain.Status, &domain.DiscoveredAt, &lastCrawledAt,
@@ -119,12 +117,8 @@ func (c *Crawler) scanDomains(rows *sql.Rows) ([]*Domain, error) {
continue
}
if lastCrawledAt.Valid {
domain.LastCrawledAt = lastCrawledAt.Time
}
if lastError.Valid {
domain.LastError = lastError.String
}
domain.LastCrawledAt = TimeValue(lastCrawledAt)
domain.LastError = StringValue(lastError)
domains = append(domains, domain)
}
@@ -142,13 +136,13 @@ func (c *Crawler) markDomainCrawled(host string, feedsFound int, lastError strin
var err error
if lastError != "" {
_, err = c.db.Exec(`
UPDATE domains SET status = ?, lastCrawledAt = ?, feedsFound = ?, lastError = ?
WHERE host = ?
UPDATE domains SET status = $1, last_crawled_at = $2, feeds_found = $3, last_error = $4
WHERE host = $5
`, status, time.Now(), feedsFound, lastError, normalizeHost(host))
} else {
_, err = c.db.Exec(`
UPDATE domains SET status = ?, lastCrawledAt = ?, feedsFound = ?, lastError = NULL
WHERE host = ?
UPDATE domains SET status = $1, last_crawled_at = $2, feeds_found = $3, last_error = NULL
WHERE host = $4
`, status, time.Now(), feedsFound, normalizeHost(host))
}
return err
@@ -164,6 +158,23 @@ func (c *Crawler) GetDomainCount() (total int, unchecked int, err error) {
return total, unchecked, err
}
// ImportTestDomains adds a list of specific domains for testing
func (c *Crawler) ImportTestDomains(domains []string) {
now := time.Now()
for _, host := range domains {
_, err := c.db.Exec(`
INSERT INTO domains (host, status, discovered_at, tld)
VALUES ($1, 'unchecked', $2, $3)
ON CONFLICT(host) DO NOTHING
`, host, now, getTLD(host))
if err != nil {
fmt.Printf("Error adding test domain %s: %v\n", host, err)
} else {
fmt.Printf("Added test domain: %s\n", host)
}
}
}
// ImportDomainsFromFile reads a vertices file and stores new domains as "unchecked"
func (c *Crawler) ImportDomainsFromFile(filename string, limit int) (imported int, skipped int, err error) {
file, err := os.Open(filename)
@@ -212,7 +223,6 @@ func (c *Crawler) ImportDomainsInBackground(filename string) {
const batchSize = 1000
now := time.Now()
nowStr := now.Format("2006-01-02 15:04:05")
totalImported := 0
batchCount := 0
@@ -240,31 +250,43 @@ func (c *Crawler) ImportDomainsInBackground(filename string) {
break
}
// Build bulk INSERT statement
var sb strings.Builder
sb.WriteString("INSERT INTO domains (host, status, discoveredAt, tld) VALUES ")
args := make([]interface{}, 0, len(domains)*4)
for i, d := range domains {
if i > 0 {
sb.WriteString(",")
}
sb.WriteString("(?, 'unchecked', ?, ?)")
args = append(args, d.host, nowStr, d.tld)
}
sb.WriteString(" ON CONFLICT(host) DO NOTHING")
// Execute bulk insert
result, err := c.db.Exec(sb.String(), args...)
imported := 0
// Use COPY for bulk insert (much faster than individual INSERTs)
ctx := context.Background()
conn, err := c.db.Acquire(ctx)
if err != nil {
fmt.Printf("Bulk insert error: %v\n", err)
} else {
rowsAffected, _ := result.RowsAffected()
imported = int(rowsAffected)
fmt.Printf("Failed to acquire connection: %v\n", err)
break
}
// Build rows for copy
rows := make([][]interface{}, len(domains))
for i, d := range domains {
rows[i] = []interface{}{d.host, "unchecked", now, d.tld}
}
// Use CopyFrom for bulk insert
imported, err := conn.CopyFrom(
ctx,
pgx.Identifier{"domains"},
[]string{"host", "status", "discovered_at", "tld"},
pgx.CopyFromRows(rows),
)
conn.Release()
if err != nil {
// Fall back to individual inserts with ON CONFLICT
for _, d := range domains {
c.db.Exec(`
INSERT INTO domains (host, status, discovered_at, tld)
VALUES ($1, 'unchecked', $2, $3)
ON CONFLICT(host) DO NOTHING
`, d.host, now, d.tld)
}
imported = int64(len(domains))
}
batchCount++
totalImported += imported
totalImported += int(imported)
atomic.AddInt32(&c.domainsImported, int32(imported))
// Wait 1 second before the next batch
@@ -304,7 +326,6 @@ func (c *Crawler) parseAndStoreDomains(reader io.Reader, limit int) (imported in
scanner.Buffer(buf, 1024*1024)
now := time.Now()
nowStr := now.Format("2006-01-02 15:04:05")
count := 0
const batchSize = 1000
@@ -336,28 +357,21 @@ func (c *Crawler) parseAndStoreDomains(reader io.Reader, limit int) (imported in
break
}
// Build bulk INSERT statement
var sb strings.Builder
sb.WriteString("INSERT INTO domains (host, status, discoveredAt, tld) VALUES ")
args := make([]interface{}, 0, len(domains)*4)
for i, d := range domains {
if i > 0 {
sb.WriteString(",")
// Insert with ON CONFLICT
for _, d := range domains {
result, err := c.db.Exec(`
INSERT INTO domains (host, status, discovered_at, tld)
VALUES ($1, 'unchecked', $2, $3)
ON CONFLICT(host) DO NOTHING
`, d.host, now, d.tld)
if err != nil {
skipped++
} else if result > 0 {
imported++
} else {
skipped++
}
sb.WriteString("(?, 'unchecked', ?, ?)")
args = append(args, d.host, nowStr, d.tld)
}
sb.WriteString(" ON CONFLICT(host) DO NOTHING")
// Execute bulk insert
result, execErr := c.db.Exec(sb.String(), args...)
if execErr != nil {
skipped += len(domains)
continue
}
rowsAffected, _ := result.RowsAffected()
imported += int(rowsAffected)
skipped += len(domains) - int(rowsAffected)
if limit > 0 && count >= limit {
break
@@ -370,18 +384,3 @@ func (c *Crawler) parseAndStoreDomains(reader io.Reader, limit int) (imported in
return imported, skipped, nil
}
// Helper functions for SQL null handling
func nullTime(t time.Time) sql.NullTime {
if t.IsZero() {
return sql.NullTime{}
}
return sql.NullTime{Time: t, Valid: true}
}
func nullString(s string) sql.NullString {
if s == "" {
return sql.NullString{}
}
return sql.NullString{String: s, Valid: true}
}
+332 -402
File diff suppressed because it is too large
+5 -1
@@ -77,7 +77,11 @@ func (c *Crawler) extractFeedLinks(n *html.Node, baseURL string) []simpleFeed {
func (c *Crawler) extractAnchorFeeds(n *html.Node, baseURL string) []simpleFeed {
feeds := make([]simpleFeed, 0)
feedPattern := regexp.MustCompile(`(?i)(rss|atom|feed)`)
// Match feed URLs more precisely:
// - /feed, /rss, /atom as path segments (not "feeds" or "feedback")
// - .rss, .atom, .xml file extensions
// - ?feed=, ?format=rss, etc.
feedPattern := regexp.MustCompile(`(?i)(/feed/?$|/feed/|/rss/?$|/rss/|/atom/?$|/atom/|\.rss|\.atom|\.xml|\?.*feed=|\?.*format=rss|\?.*format=atom)`)
var f func(*html.Node)
f = func(n *html.Node) {
+13 -9
@@ -8,13 +8,8 @@ import (
)
func main() {
// Ensure feeds directory exists
if err := os.MkdirAll("feeds", 0755); err != nil {
fmt.Fprintf(os.Stderr, "Error creating feeds directory: %v\n", err)
os.Exit(1)
}
crawler, err := NewCrawler("feeds/feeds.db")
// Connection string from environment (DATABASE_URL or DB_* vars)
crawler, err := NewCrawler("")
if err != nil {
fmt.Fprintf(os.Stderr, "Error initializing crawler: %v\n", err)
os.Exit(1)
@@ -37,8 +32,14 @@ func main() {
// Start all loops independently
fmt.Println("Starting import, crawl, check, and stats loops...")
// Import loop (background)
go crawler.ImportDomainsInBackground("vertices.txt.gz")
// Import loop (background) - DISABLED for testing, using manual domains
// go crawler.ImportDomainsInBackground("vertices.txt.gz")
// Add only ycombinator domains for testing
go crawler.ImportTestDomains([]string{
"news.ycombinator.com",
"ycombinator.com",
})
// Check loop (background)
go crawler.StartCheckLoop()
@@ -52,6 +53,9 @@ func main() {
// Maintenance loop (background) - WAL checkpoints and integrity checks
go crawler.StartMaintenanceLoop()
// Publish loop (background) - autopublishes items for approved feeds
go crawler.StartPublishLoop()
// Crawl loop (background)
go crawler.StartCrawlLoop()
+80 -57
@@ -3,7 +3,6 @@ package main
import (
"bytes"
"crypto/sha256"
"encoding/base32"
"encoding/json"
"fmt"
"io"
@@ -12,6 +11,7 @@ import (
"regexp"
"strings"
"time"
"unicode/utf8"
)
// Publisher handles posting items to AT Protocol PDS
@@ -196,22 +196,41 @@ func (p *Publisher) CreateInviteCode(adminPassword string, useCount int) (string
return result.Code, nil
}
// GenerateRkey creates a deterministic rkey from a GUID and timestamp
// Uses a truncated base32-encoded SHA256 hash
// Including the timestamp allows regenerating a new rkey by updating discoveredAt
// TID alphabet for base32-sortable encoding
const tidAlphabet = "234567abcdefghijklmnopqrstuvwxyz"
// GenerateRkey creates a deterministic TID-format rkey from a GUID and timestamp
// TIDs are required by Bluesky relay for indexing - custom rkeys don't sync
// Format: 13 chars base32-sortable, 53 bits timestamp + 10 bits clock ID
func GenerateRkey(guid string, timestamp time.Time) string {
if guid == "" {
return ""
}
// Combine GUID with timestamp for the hash input
// Format timestamp to second precision for consistency
input := guid + "|" + timestamp.UTC().Format(time.RFC3339)
hash := sha256.Sum256([]byte(input))
// Use first 10 bytes (80 bits) - plenty for uniqueness
// Base32 encode without padding, lowercase for rkey compatibility
encoded := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(hash[:10])
return strings.ToLower(encoded)
// Get microseconds since Unix epoch (53 bits)
microsInt := timestamp.UnixMicro()
if microsInt < 0 {
microsInt = 0
}
// Convert to uint64 and mask to 53 bits
micros := uint64(microsInt) & ((1 << 53) - 1)
// Generate deterministic 10-bit clock ID from GUID hash
hash := sha256.Sum256([]byte(guid))
clockID := uint64(hash[0])<<2 | uint64(hash[1])>>6
clockID = clockID & ((1 << 10) - 1) // 10 bits = 0-1023
// Combine: top bit 0, 53 bits timestamp, 10 bits clock ID
tid := (micros << 10) | clockID
// Encode as base32-sortable (13 characters)
var result [13]byte
for i := 12; i >= 0; i-- {
result[i] = tidAlphabet[tid&0x1f]
tid >>= 5
}
return string(result[:])
}
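// Illustrative sketch (not part of this commit): decode a TID back to its
// timestamp, which makes the 53-bit-microseconds / 10-bit-clock-ID layout
// above explicit. Uses the tidAlphabet constant defined above.
func tidTimestamp(tid string) (time.Time, bool) {
	if len(tid) != 13 {
		return time.Time{}, false
	}
	var v uint64
	for i := 0; i < len(tid); i++ {
		idx := strings.IndexByte(tidAlphabet, tid[i])
		if idx < 0 {
			return time.Time{}, false
		}
		v = v<<5 | uint64(idx)
	}
	return time.UnixMicro(int64(v >> 10)), true // drop the 10-bit clock ID
}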
// extractURLs finds all URLs in a string
@@ -239,7 +258,8 @@ func (p *Publisher) PublishItem(session *PDSSession, item *Item) (string, error)
return "", fmt.Errorf("item has no GUID or link, cannot publish")
}
// Collect all unique URLs: main link + any URLs in description
// Collect URLs: main link + HN comments link (if applicable)
// Limit to 2 URLs max to stay under 300 grapheme limit
urlSet := make(map[string]bool)
var allURLs []string
@@ -249,8 +269,18 @@ func (p *Publisher) PublishItem(session *PDSSession, item *Item) (string, error)
allURLs = append(allURLs, item.Link)
}
// Add enclosure URL for podcasts/media (audio/video)
if item.Enclosure != nil && item.Enclosure.URL != "" {
// For HN feeds, add comments link from description (looks like "https://news.ycombinator.com/item?id=...")
descURLs := extractURLs(item.Description)
for _, u := range descURLs {
if strings.Contains(u, "news.ycombinator.com/item") && !urlSet[u] {
urlSet[u] = true
allURLs = append(allURLs, u)
break // Only add one comments link
}
}
// Add enclosure URL for podcasts/media (audio/video) if we have room
if len(allURLs) < 2 && item.Enclosure != nil && item.Enclosure.URL != "" {
encType := strings.ToLower(item.Enclosure.Type)
if strings.HasPrefix(encType, "audio/") || strings.HasPrefix(encType, "video/") {
if !urlSet[item.Enclosure.URL] {
@@ -260,59 +290,52 @@ func (p *Publisher) PublishItem(session *PDSSession, item *Item) (string, error)
}
}
// Extract URLs from description
descURLs := extractURLs(item.Description)
for _, u := range descURLs {
if !urlSet[u] {
urlSet[u] = true
allURLs = append(allURLs, u)
}
}
// Extract URLs from content if available
contentURLs := extractURLs(item.Content)
for _, u := range contentURLs {
if !urlSet[u] {
urlSet[u] = true
allURLs = append(allURLs, u)
}
}
// Build post text: title + all links
// Bluesky has 300 grapheme limit
var textBuilder strings.Builder
textBuilder.WriteString(item.Title)
// Bluesky has 300 grapheme limit - use rune count as approximation
const maxGraphemes = 295 // Leave some margin
// Calculate space needed for URLs (in runes)
urlSpace := 0
for _, u := range allURLs {
textBuilder.WriteString("\n\n")
textBuilder.WriteString(u)
urlSpace += utf8.RuneCountInString(u) + 2 // +2 for \n\n
}
text := textBuilder.String()
// Truncate title if needed
title := item.Title
titleRunes := utf8.RuneCountInString(title)
maxTitleRunes := maxGraphemes - urlSpace - 3 // -3 for "..."
// Truncate title if text is too long (keep URLs intact)
const maxLen = 300
if len(text) > maxLen {
// Calculate space needed for URLs
urlSpace := 0
for _, u := range allURLs {
urlSpace += len(u) + 2 // +2 for \n\n
}
maxTitleLen := maxLen - urlSpace - 3 // -3 for "..."
if maxTitleLen > 10 {
text = item.Title[:maxTitleLen] + "..."
for _, u := range allURLs {
text += "\n\n" + u
if titleRunes+urlSpace > maxGraphemes {
if maxTitleRunes > 10 {
// Truncate title to fit
runes := []rune(title)
if len(runes) > maxTitleRunes {
title = string(runes[:maxTitleRunes]) + "..."
}
} else {
// Title too long even with minimal space - just truncate hard
runes := []rune(title)
if len(runes) > 50 {
title = string(runes[:50]) + "..."
}
}
}
// Use item's pubDate for createdAt, fall back to now
createdAt := time.Now()
if !item.PubDate.IsZero() {
createdAt = item.PubDate
// Build final text
var textBuilder strings.Builder
textBuilder.WriteString(title)
for _, u := range allURLs {
textBuilder.WriteString("\n\n")
textBuilder.WriteString(u)
}
text := textBuilder.String()
// Use current time for createdAt (Bluesky won't index backdated posts)
// TODO: Restore original pubDate once Bluesky indexing is understood
createdAt := time.Now()
// if !item.PubDate.IsZero() {
// createdAt = item.PubDate
// }
post := BskyPost{
Type: "app.bsky.feed.post",
+56 -10
@@ -258,6 +258,7 @@ function initDashboard() {
output.innerHTML = html;
attachTldHandlers(output.querySelector('.tld-list'));
} catch (err) {
console.error('TLDs error:', err);
output.innerHTML = '<div style="color: #f66; padding: 10px;">Error: ' + escapeHtml(err.message) + '</div>';
}
}
@@ -301,7 +302,7 @@ function initDashboard() {
const result = await response.json();
if (!result.data || result.data.length === 0) {
infiniteScrollState.ended = true;
if (infiniteScrollState) infiniteScrollState.ended = true;
document.getElementById('infiniteLoader').textContent = offset === 0 ? 'No results found' : 'End of list';
return;
}
@@ -319,11 +320,12 @@ function initDashboard() {
offset += result.data.length;
if (result.data.length < limit) {
infiniteScrollState.ended = true;
if (infiniteScrollState) infiniteScrollState.ended = true;
document.getElementById('infiniteLoader').textContent = 'End of list';
}
} catch (err) {
document.getElementById('infiniteLoader').textContent = 'Error loading';
console.error('Filter error:', err);
document.getElementById('infiniteLoader').textContent = 'Error loading: ' + err.message;
}
}
@@ -479,17 +481,26 @@ function initDashboard() {
output.innerHTML = '<div style="color: #666; padding: 10px;">Loading publish data...</div>';
try {
const [candidatesRes, passedRes] = await Promise.all([
const [candidatesRes, passedRes, deniedRes] = await Promise.all([
fetch('/api/publishCandidates?limit=50'),
fetch('/api/publishEnabled')
fetch('/api/publishEnabled'),
fetch('/api/publishDenied')
]);
const candidates = await candidatesRes.json();
const passed = await passedRes.json();
const denied = await deniedRes.json();
let html = '<div style="padding: 10px;">';
// Filter buttons
html += '<div style="margin-bottom: 15px; display: flex; gap: 10px;">';
html += '<button class="filter-btn" data-filter="pass" style="padding: 6px 16px; background: #040; border: 1px solid #060; border-radius: 3px; color: #0a0; cursor: pointer;">Pass (' + passed.length + ')</button>';
html += '<button class="filter-btn" data-filter="held" style="padding: 6px 16px; background: #330; border: 1px solid #550; border-radius: 3px; color: #f90; cursor: pointer;">Held (' + candidates.length + ')</button>';
html += '<button class="filter-btn" data-filter="deny" style="padding: 6px 16px; background: #400; border: 1px solid #600; border-radius: 3px; color: #f66; cursor: pointer;">Deny (' + denied.length + ')</button>';
html += '</div>';
// Passed feeds (approved for publishing)
html += '<div style="margin-bottom: 20px;">';
html += '<div id="section-pass" style="margin-bottom: 20px;">';
html += '<div style="color: #0a0; font-weight: bold; margin-bottom: 10px; border-bottom: 1px solid #333; padding-bottom: 5px;">✓ Approved for Publishing (' + passed.length + ')</div>';
if (passed.length === 0) {
html += '<div style="color: #666; padding: 10px;">No feeds approved yet</div>';
@@ -501,14 +512,14 @@ function initDashboard() {
html += '<div style="color: #666; font-size: 0.85em;">' + escapeHtml(f.url) + '</div>';
html += '<div style="color: #888; font-size: 0.85em;">→ ' + escapeHtml(f.account) + ' (' + f.unpublished_count + ' unpublished)</div>';
html += '</div>';
html += '<button class="status-btn" data-url="' + escapeHtml(f.url) + '" data-status="fail" style="padding: 4px 12px; background: #400; border: 1px solid #600; border-radius: 3px; color: #f66; cursor: pointer; margin-left: 10px;">Revoke</button>';
html += '<button class="status-btn" data-url="' + escapeHtml(f.url) + '" data-status="deny" style="padding: 4px 12px; background: #400; border: 1px solid #600; border-radius: 3px; color: #f66; cursor: pointer; margin-left: 10px;">Revoke</button>';
html += '</div>';
});
}
html += '</div>';
// Candidates (held for review)
html += '<div>';
html += '<div id="section-held">';
html += '<div style="color: #f90; font-weight: bold; margin-bottom: 10px; border-bottom: 1px solid #333; padding-bottom: 5px;">⏳ Held for Review (' + candidates.length + ')</div>';
if (candidates.length === 0) {
html += '<div style="color: #666; padding: 10px;">No candidates held</div>';
@@ -523,7 +534,28 @@ function initDashboard() {
html += '<div style="color: #555; font-size: 0.8em;">' + escapeHtml(f.source_host) + ' · ' + f.item_count + ' items · ' + escapeHtml(f.category) + '</div>';
html += '</div>';
html += '<button class="status-btn pass-btn" data-url="' + escapeHtml(f.url) + '" data-status="pass" style="padding: 4px 12px; background: #040; border: 1px solid #060; border-radius: 3px; color: #0a0; cursor: pointer; margin-left: 10px;">Pass</button>';
html += '<button class="status-btn fail-btn" data-url="' + escapeHtml(f.url) + '" data-status="fail" style="padding: 4px 12px; background: #400; border: 1px solid #600; border-radius: 3px; color: #f66; cursor: pointer; margin-left: 5px;">Fail</button>';
html += '<button class="status-btn deny-btn" data-url="' + escapeHtml(f.url) + '" data-status="deny" style="padding: 4px 12px; background: #400; border: 1px solid #600; border-radius: 3px; color: #f66; cursor: pointer; margin-left: 5px;">Deny</button>';
html += '</div>';
html += '</div>';
});
}
html += '</div>';
// Denied feeds
html += '<div id="section-deny" style="display: none;">';
html += '<div style="color: #f66; font-weight: bold; margin-bottom: 10px; border-bottom: 1px solid #333; padding-bottom: 5px;">✗ Denied (' + denied.length + ')</div>';
if (denied.length === 0) {
html += '<div style="color: #666; padding: 10px;">No feeds denied</div>';
} else {
denied.forEach(f => {
html += '<div class="publish-row" style="padding: 8px; border-bottom: 1px solid #202020;">';
html += '<div style="display: flex; align-items: center;">';
html += '<div style="flex: 1;">';
html += '<div style="color: #0af;">' + escapeHtml(f.title || f.url) + '</div>';
html += '<div style="color: #666; font-size: 0.85em;">' + escapeHtml(f.url) + '</div>';
html += '<div style="color: #555; font-size: 0.8em;">' + escapeHtml(f.source_host) + ' · ' + f.item_count + ' items</div>';
html += '</div>';
html += '<button class="status-btn" data-url="' + escapeHtml(f.url) + '" data-status="held" style="padding: 4px 12px; background: #330; border: 1px solid #550; border-radius: 3px; color: #f90; cursor: pointer; margin-left: 10px;">Restore</button>';
html += '</div>';
html += '</div>';
});
@@ -533,7 +565,21 @@ function initDashboard() {
html += '</div>';
output.innerHTML = html;
// Attach handlers for pass/fail buttons
// Filter button handlers
output.querySelectorAll('.filter-btn').forEach(btn => {
btn.addEventListener('click', () => {
const filter = btn.dataset.filter;
document.getElementById('section-pass').style.display = filter === 'pass' ? 'block' : 'none';
document.getElementById('section-held').style.display = filter === 'held' ? 'block' : 'none';
document.getElementById('section-deny').style.display = filter === 'deny' ? 'block' : 'none';
// Update button styles
output.querySelectorAll('.filter-btn').forEach(b => {
b.style.opacity = b.dataset.filter === filter ? '1' : '0.5';
});
});
});
// Attach handlers for pass/deny buttons
output.querySelectorAll('.status-btn').forEach(btn => {
btn.addEventListener('click', async () => {
const url = btn.dataset.url;