Commit Graph

15 Commits

Author SHA1 Message Date
primal
6eaa39f9db Remove publishing code - now handled by publish service
Publishing functionality has been moved to the standalone publish service.
Removed:
- publisher.go, pds_auth.go, pds_records.go, image.go, handle.go
- StartPublishLoop and related functions from crawler.go
- Publish loop invocation from main.go

Updated CLAUDE.md to reflect the new architecture.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:40:49 -05:00
primal
7ec4207173 Migrate to normalized FK schema (domain_host, domain_tld)
Replace source_host column with proper FK to domains table using
composite key (domain_host, domain_tld). This enables JOIN queries
instead of string concatenation for domain lookups.

Changes:
- Update Feed struct: SourceHost/TLD → DomainHost/DomainTLD
- Update all SQL queries to use domain_host/domain_tld columns
- Add column aliases (as source_host) for API backwards compatibility
- Update trigram index from source_host to domain_host
- Add getDomainHost() helper for extracting host from domain

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 22:36:25 -05:00
primal
8a9001c02c Restore working codebase with all methods
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:08:53 -05:00
primal
be595cb403 v100
2026-01-30 22:35:08 -05:00
primal
1066f42189 Refactor large Go files into focused modules
Split dashboard.go (3,528 lines) into:
- routes.go: HTTP route registration
- api_domains.go: Domain API handlers
- api_feeds.go: Feed API handlers
- api_publish.go: Publishing API handlers
- api_search.go: Search API handlers
- templates.go: HTML templates
- dashboard.go: Stats functions only (235 lines)

Split publisher.go (1,502 lines) into:
- pds_auth.go: Authentication and account management
- pds_records.go: Record operations (upload, update, delete)
- handle.go: Handle derivation from feed URLs
- image.go: Image processing and favicon fetching
- publisher.go: Core types and PublishItem (439 lines)

Split feed.go (1,137 lines) into:
- item.go: Item struct and DB operations
- feed_check.go: Feed checking and processing
- feed.go: Feed struct and DB operations (565 lines)

Also includes domain import batch size increase (1k -> 100k).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 22:25:02 -05:00
primal
3999e96f26 Dashboard UI overhaul: inline feed details, TLD filtering, status improvements
- Feed details now expand inline instead of navigating to new page
- Add TLD section headers with domains sorted by TLD then name
- Add TLD filter button to show/hide domain sections by TLD
- Feed status behavior: pass creates account, hold crawls only, skip stops, drop cleans up
- Auto-follow new accounts from directory account (1440.news)
- Fix handle derivation (removed duplicate .1440.news suffix)
- Increase domain import batch size to 100k
- Various bug fixes for account creation and profile updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 20:51:05 -05:00
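The four feed statuses described in this commit can be read as a small decision table. The sketch below is one possible encoding, not the repository's code: the `feedPlan` type and `planFor` function are hypothetical names, and the assumption that "pass" keeps crawling in addition to creating an account is ours.

```go
package main

import "fmt"

// feedPlan is a hypothetical summary of what happens to a feed
// in a given status; the field names are illustrative only.
type feedPlan struct {
	Crawl         bool // keep checking the feed for new items
	CreateAccount bool // create an account for the feed
	CleanUp       bool // remove data created for the feed
}

// planFor maps a status string to a plan. The second return value
// is false for statuses this sketch does not know about.
func planFor(status string) (feedPlan, bool) {
	switch status {
	case "pass":
		// Assumption: a passed feed is still crawled.
		return feedPlan{Crawl: true, CreateAccount: true}, true
	case "hold":
		return feedPlan{Crawl: true}, true
	case "skip":
		return feedPlan{}, true
	case "drop":
		return feedPlan{CleanUp: true}, true
	default:
		return feedPlan{}, false
	}
}

func main() {
	for _, s := range []string{"pass", "hold", "skip", "drop"} {
		p, _ := planFor(s)
		fmt.Printf("%-4s %+v\n", s, p)
	}
}
```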
primal
2386d551fc Auto-deny all-digit domains, whitelist 1440.news
- Deny domains where hostname is all digits (e.g., 0000114.com)
- Never auto-deny 1440.news or subdomains
- Auto-pass feeds from 1440.news sources
- Updated 554,085 domains and 3,213 feeds in database

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 13:27:48 -05:00
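The auto-deny rule in this commit could be sketched roughly as follows. `shouldAutoDeny` is a hypothetical name, and the sketch assumes "hostname" means everything before the last dot of the domain (so `0000114.com` yields `0000114`); the repository may define it differently.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldAutoDeny reports whether a domain would be denied under the
// all-digit rule. 1440.news and its subdomains are never denied.
func shouldAutoDeny(domain string) bool {
	d := strings.ToLower(strings.TrimSuffix(domain, "."))
	if d == "1440.news" || strings.HasSuffix(d, ".1440.news") {
		return false
	}
	// Assumption: the hostname is the part before the last dot.
	i := strings.LastIndex(d, ".")
	if i <= 0 {
		return false
	}
	for _, r := range d[:i] {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

func main() {
	for _, d := range []string{"0000114.com", "1440.news", "feeds.1440.news", "example.com"} {
		fmt.Println(d, shouldAutoDeny(d))
	}
}
```

The whitelist check comes first because `1440.news` would otherwise match the all-digit rule itself.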
primal
897ae66e81 Fix NULL handling for nullable integer columns in getFeed
TTLMinutes, UpdateFreq, ErrorCount, ItemCount, and NoUpdate columns
can be NULL in the database. Use pointer types and handle them properly
to avoid scan errors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 13:20:18 -05:00
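A minimal illustration of the pointer-type approach this commit describes: a nullable integer column is represented as `*int`, where `nil` stands for NULL, and a small helper supplies a fallback when the value is needed. The `feedRow` struct and `intOr` helper are hypothetical and only mirror two of the columns named above.

```go
package main

import "fmt"

// feedRow is an illustrative subset of a feed record. Pointer fields
// are nil when the database column is NULL.
type feedRow struct {
	URL        string
	TTLMinutes *int
	ErrorCount *int
}

// intOr returns the pointed-to value, or fallback when p is nil.
func intOr(p *int, fallback int) int {
	if p == nil {
		return fallback
	}
	return *p
}

func main() {
	ttl := 60
	row := feedRow{URL: "example.com/feed", TTLMinutes: &ttl} // ErrorCount left NULL
	fmt.Println(intOr(row.TTLMinutes, 0), intOr(row.ErrorCount, 0))
}
```

Scanning directly into a plain `int` fails on NULL, which is the class of scan error the commit message refers to.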
primal
ad78c1a4c0 Add JSON Feed support
- Detect JSON Feed format (jsonfeed.org) via version field
- Parse JSON Feed metadata and items
- Support application/feed+json MIME type for feed discovery
- Include "json" as valid feed type (not auto-denied)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 13:16:50 -05:00
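Detection via the version field might look like the sketch below. JSON Feed documents carry a top-level `version` string that begins with `https://jsonfeed.org/version/`; the function name `isJSONFeed` is hypothetical and the repository's actual check may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// isJSONFeed reports whether body is a JSON document whose top-level
// "version" field identifies it as a JSON Feed.
func isJSONFeed(body []byte) bool {
	var probe struct {
		Version string `json:"version"`
	}
	if err := json.Unmarshal(body, &probe); err != nil {
		return false // not JSON, or version is not a string
	}
	return strings.HasPrefix(probe.Version, "https://jsonfeed.org/version/")
}

func main() {
	feed := []byte(`{"version":"https://jsonfeed.org/version/1.1","title":"Example","items":[]}`)
	fmt.Println(isJSONFeed(feed))
	fmt.Println(isJSONFeed([]byte(`<rss version="2.0"></rss>`)))
}
```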
primal
798f79bfe9 Auto-deny feeds that are not RSS or Atom type
Feeds with type other than 'rss' or 'atom' (e.g., 'unknown') are now
automatically denied on discovery. Also updated 164 existing feeds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 13:13:22 -05:00
primal
254b751799 Add rich text links, language filter, and domain deny feature
- Use labeled links (Article · Audio) instead of raw URLs in posts
- Add language filter dropdown to dashboard with toggle selection
- Auto-deny feeds with no language on discovery
- Add deny/undeny buttons for domains to block crawling
- Denied domains set feeds to dead status, preventing future checks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 12:36:58 -05:00
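Labeled links such as "Article · Audio" rely on rich text facets, which address spans of the post text by UTF-8 byte offsets rather than character positions. The sketch below shows how those offsets could be computed; the types and `buildLinkText` are hypothetical and only model the byte range and target URI of each link.

```go
package main

import (
	"fmt"
	"strings"
)

// labeledLink pairs a display label with its target URI.
type labeledLink struct {
	Label string
	URI   string
}

// linkFacet records the byte range of a label within the post text.
type linkFacet struct {
	ByteStart int
	ByteEnd   int
	URI       string
}

// buildLinkText joins labels with " · " and returns the text together
// with one facet per label. Offsets are in bytes, so the two-byte
// middle dot shifts later labels by more than its visible width.
func buildLinkText(links []labeledLink) (string, []linkFacet) {
	const sep = " · "
	var b strings.Builder
	facets := make([]linkFacet, 0, len(links))
	for i, l := range links {
		if i > 0 {
			b.WriteString(sep)
		}
		start := b.Len()
		b.WriteString(l.Label)
		facets = append(facets, linkFacet{ByteStart: start, ByteEnd: b.Len(), URI: l.URI})
	}
	return b.String(), facets
}

func main() {
	text, facets := buildLinkText([]labeledLink{
		{Label: "Article", URI: "https://example.com/post"},
		{Label: "Audio", URI: "https://example.com/post.mp3"},
	})
	fmt.Println(text)
	fmt.Printf("%+v\n", facets)
}
```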
primal
f4afb29980 Migrate from SQLite to PostgreSQL
- Replace modernc.org/sqlite with jackc/pgx/v5
- Update all SQL queries for PostgreSQL syntax ($1, $2 placeholders)
- Use snake_case column names throughout
- Replace SQLite FTS5 with PostgreSQL tsvector/tsquery full-text search
- Add connection pooling with pgxpool
- Support Docker secrets for database password
- Add trigger to normalize feed URLs (strip https://, http://, www.)
- Fix anchor feed detection regex to avoid false positives
- Connect app container to atproto network for PostgreSQL access
- Add version indicator to dashboard UI

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:38:13 -05:00
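To make the placeholder change concrete: SQLite queries typically use `?` markers, while PostgreSQL uses numbered `$1, $2, ...` markers. The helper below is a deliberately naive illustration of that difference, not something the commit says it used; `rebind` is a hypothetical name, and it would mis-handle a literal `?` inside a quoted SQL string.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rebind rewrites each '?' in query as $1, $2, ... in order.
// Naive by design: it does not skip '?' inside string literals.
func rebind(query string) string {
	var b strings.Builder
	n := 0
	for i := 0; i < len(query); i++ {
		if query[i] == '?' {
			n++
			b.WriteByte('$')
			b.WriteString(strconv.Itoa(n))
			continue
		}
		b.WriteByte(query[i])
	}
	return b.String()
}

func main() {
	fmt.Println(rebind("SELECT url FROM feeds WHERE type = ? AND status = ?"))
}
```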
primal
75835d771d Add AT Protocol publishing, media support, and SQLite stability
Publishing:
- Add publisher.go for posting feed items to AT Protocol PDS
- Support deterministic rkeys from SHA256(guid + discoveredAt)
- Handle multiple URLs in posts with facets for each link
- Image embed support (app.bsky.embed.images) for up to 4 images
- External embed with thumbnail fallback
- Podcast/audio enclosure URLs included in post text

Media extraction:
- Parse RSS enclosures (audio, video, images)
- Extract Media RSS content and thumbnails
- Extract images from HTML content in descriptions
- Store enclosure and imageUrls in items table

SQLite stability improvements:
- Add synchronous=NORMAL and wal_autocheckpoint pragmas
- Connection pool tuning (idle conns, max lifetime)
- Periodic WAL checkpoint every 5 minutes
- Hourly integrity checks with PRAGMA quick_check
- Daily hot backup via VACUUM INTO
- Docker stop_grace_period: 30s for graceful shutdown

Dashboard:
- Feed publishing UI and API endpoints
- Account creation with invite codes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 15:30:02 -05:00
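A deterministic record key means re-publishing the same item yields the same key, so a retry cannot create a duplicate record. The sketch below hashes the GUID concatenated with the discovery time, as the commit message states; the hex encoding, the use of Unix seconds for `discoveredAt`, and the name `deterministicRkey` are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// deterministicRkey derives a stable key from an item's GUID and the
// time it was discovered (assumed here to be Unix seconds).
func deterministicRkey(guid string, discoveredAtUnix int64) string {
	input := guid + strconv.FormatInt(discoveredAtUnix, 10)
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:]) // 64 lowercase hex characters
}

func main() {
	a := deterministicRkey("https://example.com/post/1", 1769630000)
	b := deterministicRkey("https://example.com/post/1", 1769630000)
	fmt.Println(a == b, len(a))
}
```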
primal
143807378f Add Docker support and refactor data layer
2026-01-26 16:02:05 -05:00
primal
219b49352e Add PebbleDB storage, domain tracking, and web dashboard
- Split main.go into separate files for better organization:
  crawler.go, domain.go, feed.go, parser.go, html.go, util.go
- Add PebbleDB for persistent storage of feeds and domains
- Store feeds with metadata: title, TTL, update frequency, ETag, etc.
- Track domains with crawl status (uncrawled/crawled/error)
- Normalize URLs by stripping scheme and www. prefix
- Add web dashboard on port 4321 with real-time stats:
  - Crawl progress with completion percentage
  - Feed counts by type (RSS/Atom)
  - Top TLDs and domains by feed count
  - Recent feeds table
- Filter out comment feeds from results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 16:29:00 -05:00
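The URL normalization mentioned in this commit (strip the scheme and a leading `www.`) could be sketched as below. `normalizeURL` and `hasPrefixFold` are hypothetical names; the case-insensitive prefix match and whitespace trimming are assumptions beyond what the commit message states.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPrefixFold reports whether s starts with prefix, ignoring case.
func hasPrefixFold(s, prefix string) bool {
	return len(s) >= len(prefix) && strings.EqualFold(s[:len(prefix)], prefix)
}

// normalizeURL strips an http:// or https:// scheme and then a
// leading "www.", leaving the rest of the URL untouched.
func normalizeURL(raw string) string {
	s := strings.TrimSpace(raw)
	for _, scheme := range []string{"https://", "http://"} {
		if hasPrefixFold(s, scheme) {
			s = s[len(scheme):]
			break
		}
	}
	if hasPrefixFold(s, "www.") {
		s = s[len("www."):]
	}
	return s
}

func main() {
	fmt.Println(normalizeURL("https://www.example.com/feed"))
	fmt.Println(normalizeURL("http://example.com/rss.xml"))
}
```

Normalizing this way lets `http://`, `https://`, and `www.` variants of the same feed collapse to a single stored key.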