Commit Graph

74 Commits

Author SHA1 Message Date
primal
f4f80e91cc Add enclosure support for podcast/media items
- Load enclosure fields in GetAllUnpublishedItems query
- Only include enclosure URL if it fits within post length limit
- Shorter video/audio enclosures will be included when they fit

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 22:36:28 -05:00
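
The "only include the enclosure URL if it fits" rule from commit f4f80e91cc can be sketched as below. This is a minimal illustration, not the project's code: `appendEnclosure` and `maxPostLen` are hypothetical names, the limit of 300 is an assumption, and counting runes only approximates the grapheme count that Bluesky applies.

```go
package main

import "unicode/utf8"

// maxPostLen is an assumed post length limit (hypothetical constant).
// Bluesky counts graphemes; this sketch approximates that with runes.
const maxPostLen = 300

// appendEnclosure returns text with enclosureURL appended on a new line
// when the combined post still fits within maxPostLen. Otherwise it
// returns text unchanged, dropping the enclosure.
func appendEnclosure(text, enclosureURL string) string {
	if enclosureURL == "" {
		return text
	}
	candidate := text + "\n" + enclosureURL
	if utf8.RuneCountInString(candidate) > maxPostLen {
		return text
	}
	return candidate
}
```

Calling `appendEnclosure("Episode 1", "https://example.com/a.mp3")` keeps the URL, while a post that is already close to the limit is returned without it.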
primal
1609220a27 Limit handle subdomain to 18 chars (PDS restriction)
The PDS restricts the first segment of local handles to 18 characters,
not the 63 allowed by the AT Protocol spec. Added an abbreviation map
for long category names:
- science-and-environment -> sci-env
- entertainment-and-arts -> ent-arts
- technology -> tech (when needed)
- etc.

Fixes "Handle too long" errors for BBC category feeds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 22:22:59 -05:00
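
The abbreviation step described in commit 1609220a27 might look roughly like this. The three map entries come from the commit message; the function name `shortenLabel`, the "site-category" layout, and the truncation fallback are assumptions made for illustration. Byte slicing is safe here only because handle labels are ASCII.

```go
package main

import "strings"

// maxHandleLabel is the PDS limit on the first handle segment (from the commit).
const maxHandleLabel = 18

// abbreviations lists the mappings named in the commit message; the real
// map is longer ("etc." in the message) and its other entries are unknown.
var abbreviations = map[string]string{
	"science-and-environment": "sci-env",
	"entertainment-and-arts":  "ent-arts",
	"technology":              "tech",
}

// shortenLabel builds "<site>-<category>" and abbreviates the category only
// when the full label exceeds the limit. Truncation is a last resort
// (an assumption; the commit does not say what happens in that case).
func shortenLabel(site, category string) string {
	label := site + "-" + category
	if len(label) <= maxHandleLabel {
		return label
	}
	if short, ok := abbreviations[category]; ok {
		label = site + "-" + short
	}
	if len(label) > maxHandleLabel {
		label = strings.TrimRight(label[:maxHandleLabel], "-")
	}
	return label
}
```

With these inputs, `shortenLabel("bbc", "technology")` stays `bbc-technology` because it already fits, while `shortenLabel("bbc", "science-and-environment")` becomes `bbc-sci-env`.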
primal
1f092c87e9 Add automatic image resize for blobs exceeding size limit
Bluesky has a ~976KB blob limit. Images larger than 900KB are now
automatically resized using CatmullRom scaling and re-encoded as
JPEG with 85% quality. Iteratively scales down (90%, 72%, 58%...)
until under limit, with minimum dimensions of 100x100.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 22:10:10 -05:00
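
The iterative scale-down loop from commit 1f092c87e9 can be sketched without the imaging code. In this sketch the actual CatmullRom resize and JPEG encoding are hidden behind a caller-supplied `encodedSize` callback; that callback, the name `shrinkToFit`, and reading "900KB" as 900 * 1024 bytes are all assumptions.

```go
package main

const (
	// resizeThreshold is the size above which images are resized
	// (900KB in the commit; the exact byte value is an assumption).
	resizeThreshold = 900 * 1024
	// minDimension is the smallest width or height the loop will try.
	minDimension = 100
)

// shrinkToFit tries scales 0.90, 0.72, 0.576, ... (each step is 0.8 times
// the previous one) and returns the first dimensions whose encoded size,
// as reported by encodedSize, is at or below resizeThreshold. It returns
// ok == false once either dimension would drop below minDimension.
func shrinkToFit(w, h int, encodedSize func(w, h int) int) (nw, nh int, ok bool) {
	for scale := 0.9; ; scale *= 0.8 {
		nw, nh = int(float64(w)*scale), int(float64(h)*scale)
		if nw < minDimension || nh < minDimension {
			return 0, 0, false
		}
		if encodedSize(nw, nh) <= resizeThreshold {
			return nw, nh, true
		}
	}
}
```

In a real caller, `encodedSize` would scale the decoded image to the given dimensions, encode it as JPEG at 85% quality, and return the byte count.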
primal
959abf06c0 Enable .com domain import from vertices.txt.gz
Filter imported domains to only .com TLD for now.
Re-enabled the import loop that was disabled for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 21:59:14 -05:00
primal
c54005b5ba Upgrade BBC images from 240px to 800px for better quality
BBC CDN supports larger image sizes by changing the URL path.
Upgrade /standard/240/ and /standard/480/ to /standard/800/.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 21:49:11 -05:00
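
The URL rewrite described in commit c54005b5ba amounts to a path substitution. A minimal sketch, with `upgradeBBCImageURL` as a hypothetical name:

```go
package main

import "strings"

// bbcSizeUpgrader rewrites the two smaller size segments named in the
// commit message to the 800px variant.
var bbcSizeUpgrader = strings.NewReplacer(
	"/standard/240/", "/standard/800/",
	"/standard/480/", "/standard/800/",
)

// upgradeBBCImageURL returns the URL with the size segment upgraded.
// URLs without a matching segment are returned unchanged.
func upgradeBBCImageURL(u string) string {
	return bbcSizeUpgrader.Replace(u)
}
```

A URL that already contains `/standard/800/`, or that has no `/standard/` segment at all, passes through untouched.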
primal
5975df6771 Use PubDate for TID/rkey generation for consistent ordering
Both createdAt and rkey now use the original publication date,
so posts sort consistently by their original publication time.
Falls back to DiscoveredAt if PubDate is not available.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 21:32:07 -05:00
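
To illustrate why deriving the rkey from PubDate in commit 5975df6771 gives consistent ordering: an AT Protocol TID packs a microsecond timestamp and a 10-bit clock ID into 64 bits and encodes them as 13 sortable base32 characters, so string order follows time order. The sketch below follows that layout; the function names and the `postTime` fallback helper are hypothetical, not the project's code.

```go
package main

import "time"

// tidAlphabet is the base32-sortable alphabet used for TIDs.
const tidAlphabet = "234567abcdefghijklmnopqrstuvwxyz"

// tidFromMicros encodes a 53-bit microsecond timestamp and a 10-bit
// clock ID as a 13-character TID. The top bit is always zero.
func tidFromMicros(micros int64, clockID uint16) string {
	v := (uint64(micros)&(1<<53-1))<<10 | uint64(clockID&0x3ff)
	out := make([]byte, 13)
	for i := 12; i >= 0; i-- {
		out[i] = tidAlphabet[v&0x1f]
		v >>= 5
	}
	return string(out)
}

// postTime picks the publication date, falling back to the discovery
// time when no publication date is available (as the commit describes).
func postTime(pubDate, discoveredAt time.Time) time.Time {
	if pubDate.IsZero() {
		return discoveredAt
	}
	return pubDate
}

// tidForItem derives the rkey from the same time used for createdAt.
func tidForItem(pubDate, discoveredAt time.Time, clockID uint16) string {
	return tidFromMicros(postTime(pubDate, discoveredAt).UnixMicro(), clockID)
}
```

Because both `createdAt` and the rkey come from `postTime`, two items published a second apart get TIDs that sort in the same order as their publication times.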
primal
bce6c93242 Use original publication date for post createdAt
Posts now use the item pub_date for the createdAt field instead
of the current time, so posts show their original publication time.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 21:29:44 -05:00
primal
a1f02cd0bc Fix image embeds and rkey collisions
- Add image_urls to GetAllUnpublishedItems query
- Add aspectRatio to image embeds (required by Bluesky)
- Add image decoding to get dimensions (width/height)
- Fix rkey collision by using XOR of multiple hash bytes

The rkey collision was caused by using only 10 bits taken from 2 hash
bytes, which gave a ~0.1% (1 in 1024) collision rate per pair of items
with the same timestamp. Now XORs 8 hash bytes for better entropy
distribution.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 21:24:35 -05:00
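
One way to fold 8 hash bytes into the 10-bit clock ID, as commit a1f02cd0bc describes, is sketched below. The exact folding used in the project is not shown in the log, so the byte pairing here is an assumption, as is the name `clockIDFromGUID`. The hash input follows the SHA256(guid + discoveredAt) scheme mentioned in an earlier commit.

```go
package main

import "crypto/sha256"

// clockIDFromGUID hashes the item GUID together with its timestamp string
// and XOR-folds the first 8 hash bytes into a 10-bit value.
// Even-indexed bytes feed the low byte, odd-indexed bytes the high byte
// (an illustrative choice, not necessarily the project's).
func clockIDFromGUID(guid, timestamp string) uint16 {
	sum := sha256.Sum256([]byte(guid + timestamp))
	var lo, hi byte
	for i := 0; i < 8; i += 2 {
		lo ^= sum[i]
		hi ^= sum[i+1]
	}
	return (uint16(hi)<<8 | uint16(lo)) & 0x3ff
}
```

The result is deterministic for a given item, so republishing the same item yields the same rkey. The output space is still 10 bits, so two items with the same timestamp can still collide with probability of about 1 in 1024.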
primal
4e4e8c939a Add favicon as profile picture for feed accounts
Fetches the site's favicon and uses it as the avatar when creating
or updating feed account profiles. Tries common favicon locations
(/favicon.ico, /favicon.png, /apple-touch-icon.png) then falls back
to Google's favicon service.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 21:05:50 -05:00
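
The lookup order from commit 4e4e8c939a could be expressed as an ordered candidate list. The three site paths come from the commit message; the Google favicon service URL format and the name `faviconCandidates` are assumptions.

```go
package main

import "net/url"

// faviconCandidates returns the URLs to try, in order, when looking for
// a site's icon. The last entry is the fallback service.
func faviconCandidates(domain string) []string {
	base := "https://" + domain
	return []string{
		base + "/favicon.ico",
		base + "/favicon.png",
		base + "/apple-touch-icon.png",
		// Assumed URL shape for Google's favicon service.
		"https://www.google.com/s2/favicons?sz=128&domain=" + url.QueryEscape(domain),
	}
}
```

A caller would fetch each candidate in turn and use the first response that decodes as an image.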
primal
9a43b69b4b Add profile refresh on startup to backfill feed URLs
Updates existing account profiles with the feed URL on startup.
This ensures all accounts have the source feed URL in their
profile description.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 21:03:10 -05:00
primal
9ecf0f700d Add feed URL to profile description
When creating new accounts, include the full RSS/Atom feed URL
in the profile description so users can find the original source.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:56:09 -05:00
primal
39714858e5 Fix subdomain length limit to match AT Protocol spec
AT Protocol allows 63 characters per label, not 18. The previous
limit was incorrectly truncating category names like
"science-and-environment" and "entertainment-and-arts".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:53:42 -05:00
primal
27c3fa1a3c Fix handle derivation for two-part TLDs and categories
Rewrite DeriveHandleFromFeed to properly handle domains like BBC:
- Handle two-part TLDs (co.uk, com.au, etc.)
- Skip noise subdomains (feeds, www, news, rss)
- Map bbci -> bbc for cleaner handles
- Include category from path (technology, sport, world, etc.)

Before: feeds.bbci.co.uk/news/technology/rss.xml -> news-co.1440.news
After:  feeds.bbci.co.uk/news/technology/rss.xml -> bbc-technology.1440.news

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:48:26 -05:00
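
The derivation rules listed in commit 27c3fa1a3c can be approximated as follows. This is a simplified sketch rather than the real `DeriveHandleFromFeed`: the function name `deriveHandle`, the short TLD list, and the choice of the first non-noise path segment as the category are assumptions. It does not apply any label length limit.

```go
package main

import "strings"

var (
	// twoPartTLDs holds only the examples named in the commit message.
	twoPartTLDs = map[string]bool{"co.uk": true, "com.au": true}
	// noise lists subdomains and path segments that carry no identity.
	noise = map[string]bool{"feeds": true, "www": true, "news": true, "rss": true}
	// aliases maps host labels to cleaner names.
	aliases = map[string]string{"bbci": "bbc"}
)

// deriveHandle turns a feed URL into "<site>[-<category>].1440.news".
func deriveHandle(feedURL string) string {
	feedURL = strings.TrimPrefix(strings.TrimPrefix(feedURL, "https://"), "http://")
	host, path, _ := strings.Cut(feedURL, "/")
	labels := strings.Split(strings.ToLower(host), ".")

	// Drop the TLD, treating entries in twoPartTLDs as a single suffix.
	n := len(labels)
	if n >= 3 && twoPartTLDs[labels[n-2]+"."+labels[n-1]] {
		labels = labels[:n-2]
	} else if n >= 2 {
		labels = labels[:n-1]
	}

	// The site name is the rightmost remaining label that is not noise.
	name := ""
	for i := len(labels) - 1; i >= 0; i-- {
		if !noise[labels[i]] {
			name = labels[i]
			break
		}
	}
	if alias, ok := aliases[name]; ok {
		name = alias
	}

	// The category is the first path segment that is neither noise
	// nor a file name such as rss.xml.
	for _, seg := range strings.Split(strings.ToLower(path), "/") {
		if seg == "" || noise[seg] || strings.Contains(seg, ".") {
			continue
		}
		name += "-" + seg
		break
	}
	return name + ".1440.news"
}
```

For the example in the commit message, `deriveHandle("feeds.bbci.co.uk/news/technology/rss.xml")` returns `bbc-technology.1440.news`.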
primal
f4afb29980 Migrate from SQLite to PostgreSQL
- Replace modernc.org/sqlite with jackc/pgx/v5
- Update all SQL queries for PostgreSQL syntax ($1, $2 placeholders)
- Use snake_case column names throughout
- Replace SQLite FTS5 with PostgreSQL tsvector/tsquery full-text search
- Add connection pooling with pgxpool
- Support Docker secrets for database password
- Add trigger to normalize feed URLs (strip https://, http://, www.)
- Fix anchor feed detection regex to avoid false positives
- Connect app container to atproto network for PostgreSQL access
- Add version indicator to dashboard UI

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:38:13 -05:00
primal
75835d771d Add AT Protocol publishing, media support, and SQLite stability
Publishing:
- Add publisher.go for posting feed items to AT Protocol PDS
- Support deterministic rkeys from SHA256(guid + discoveredAt)
- Handle multiple URLs in posts with facets for each link
- Image embed support (app.bsky.embed.images) for up to 4 images
- External embed with thumbnail fallback
- Podcast/audio enclosure URLs included in post text

Media extraction:
- Parse RSS enclosures (audio, video, images)
- Extract Media RSS content and thumbnails
- Extract images from HTML content in descriptions
- Store enclosure and imageUrls in items table

SQLite stability improvements:
- Add synchronous=NORMAL and wal_autocheckpoint pragmas
- Connection pool tuning (idle conns, max lifetime)
- Periodic WAL checkpoint every 5 minutes
- Hourly integrity checks with PRAGMA quick_check
- Daily hot backup via VACUUM INTO
- Docker stop_grace_period: 30s for graceful shutdown

Dashboard:
- Feed publishing UI and API endpoints
- Account creation with invite codes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 15:30:02 -05:00
primal
aa6f571215 Add PDS credentials env file for service auth 2026-01-26 19:38:36 -05:00
primal
67bd8339b2 Move crawler to app.1440.news subdomain 2026-01-26 17:10:50 -05:00
primal
6a3f894d6a Rename container to app-1440-news 2026-01-26 16:30:00 -05:00
primal
143807378f Add Docker support and refactor data layer 2026-01-26 16:02:05 -05:00
primal
398e7b3969 Add Docker Compose config with Traefik HTTPS routing
Configure container deployment with:
- HTTPS via Traefik with LetsEncrypt certificate
- HTTP to HTTPS redirect for production (1440.news)
- HTTP-only routing for local development (1440.localhost)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 15:30:03 -05:00
primal
93ab1f8117 Update CLAUDE.md to reflect current multi-file architecture
The codebase evolved from a single-file app to a multi-file structure
with SQLite persistence, dashboard, and concurrent processing loops.
Updated documentation to accurately describe current architecture.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 13:47:19 -05:00
primal
219b49352e Add PebbleDB storage, domain tracking, and web dashboard
- Split main.go into separate files for better organization:
  crawler.go, domain.go, feed.go, parser.go, html.go, util.go
- Add PebbleDB for persistent storage of feeds and domains
- Store feeds with metadata: title, TTL, update frequency, ETag, etc.
- Track domains with crawl status (uncrawled/crawled/error)
- Normalize URLs by stripping scheme and www. prefix
- Add web dashboard on port 4321 with real-time stats:
  - Crawl progress with completion percentage
  - Feed counts by type (RSS/Atom)
  - Top TLDs and domains by feed count
  - Recent feeds table
- Filter out comment feeds from results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 16:29:00 -05:00
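
The URL normalization mentioned in commit 219b49352e (stripping the scheme and the www. prefix) is small enough to sketch directly. The name `normalizeURL` is hypothetical.

```go
package main

import "strings"

// normalizeURL strips a leading https:// or http:// scheme and then a
// leading "www." so that equivalent URLs share one storage key.
func normalizeURL(u string) string {
	u = strings.TrimPrefix(u, "https://")
	u = strings.TrimPrefix(u, "http://")
	u = strings.TrimPrefix(u, "www.")
	return u
}
```

With this, `https://www.example.com/feed` and `http://example.com/feed` both normalize to `example.com/feed`.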
primal
0dd612b7e1 Rename feed directory to feeds
Update output directory path in main.go and .gitignore to use
feeds/ instead of feed/.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 15:36:56 -05:00
primal
f4cae127cc Add feed crawler with documentation
- main.go: RSS/Atom feed crawler using Common Crawl data
- CLAUDE.md: Project documentation for Claude Code
- .gitignore: Ignore binary and go.* files
- Feed output now written to feed/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 15:15:30 -05:00