16 Commits

Author SHA1 Message Date
primal
f2bb1e72d2 Split domain processing into separate check and crawl loops
- StartDomainCheckLoop: DNS verification for unchecked domains (1000 workers)
- StartFeedCrawlLoop: Feed discovery on DNS-verified domains (100 workers)

This fixes starvation where 104M unchecked domains blocked 1.2M
DNS-verified domains from ever being crawled for feeds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 20:35:46 -05:00
primal
8a9001c02c Restore working codebase with all methods
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:08:53 -05:00
primal
be595cb403 v100 2026-01-30 22:35:08 -05:00
primal
522233c4a2 Tune concurrency settings: 100 workers, 100 batch size, 100 buffer
- Import batch size: 100 (was 100k)
- Domain check/crawl fetch size: 100 (was 1000)
- Feed check fetch size: 100 (was 1000)
- Worker count: 100 fixed (was NumCPU)
- Channel buffers: 100 (was 256)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 23:33:57 -05:00
primal
516848e529 Revise domain status flow: skip uses takedown, add drop for permanent deletion
- Import default changed from 'hold' to 'pass' (auto-crawl)
- Skip now uses PDS takedown (hides posts but preserves data)
- Added 'drop' status for permanent deletion (requires skip first)
- Added TakedownAccount/RestoreAccount PDS functions
- Un-skip restores PDS accounts and reactivates feeds
- Dashboard shows 'drop' button only for skipped domains

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 23:18:17 -05:00
primal
edce82f1af Skip bare TLDs during domain import
Domains without a dot (e.g., "com", "net", "org") are bare TLDs
from Common Crawl data and should not be processed as domains.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 22:28:09 -05:00
primal
1066f42189 Refactor large Go files into focused modules
Split dashboard.go (3,528 lines) into:
- routes.go: HTTP route registration
- api_domains.go: Domain API handlers
- api_feeds.go: Feed API handlers
- api_publish.go: Publishing API handlers
- api_search.go: Search API handlers
- templates.go: HTML templates
- dashboard.go: Stats functions only (235 lines)

Split publisher.go (1,502 lines) into:
- pds_auth.go: Authentication and account management
- pds_records.go: Record operations (upload, update, delete)
- handle.go: Handle derivation from feed URLs
- image.go: Image processing and favicon fetching
- publisher.go: Core types and PublishItem (439 lines)

Split feed.go (1,137 lines) into:
- item.go: Item struct and DB operations
- feed_check.go: Feed checking and processing
- feed.go: Feed struct and DB operations (565 lines)

Also includes domain import batch size increase (1k -> 100k).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 22:25:02 -05:00
primal
3999e96f26 Dashboard UI overhaul: inline feed details, TLD filtering, status improvements
- Feed details now expand inline instead of navigating to new page
- Add TLD section headers with domains sorted by TLD then name
- Add TLD filter button to show/hide domain sections by TLD
- Feed status behavior: pass creates account, hold crawls only, skip stops, drop cleans up
- Auto-follow new accounts from directory account (1440.news)
- Fix handle derivation (removed duplicate .1440.news suffix)
- Increase domain import batch size to 100k
- Various bug fixes for account creation and profile updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 20:51:05 -05:00
primal
a5af4e14c3 Simplify: auto-deny all domains starting with digits
Replaced complex checks (digit-dash, all-digit hostname) with simple rule:
deny any domain starting with a digit, except 1440.news.

Database: denied 2,870,090 domains and 24,885 feeds.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 13:32:42 -05:00
primal
2386d551fc Auto-deny all-digit domains, whitelist 1440.news
- Deny domains where hostname is all digits (e.g., 0000114.com)
- Never auto-deny 1440.news or subdomains
- Auto-pass feeds from 1440.news sources
- Updated 554,085 domains and 3,213 feeds in database

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 13:27:48 -05:00
primal
f780c493c2 Enable all TLDs for import and auto-deny spam domains
- Remove .com-only filter from vertices import
- Add shouldAutoDenyDomain() to detect spam patterns (^[0-9]-)
- Apply auto-deny to all domain insert paths (save, saveTx, bulk import)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 12:49:52 -05:00
primal
959abf06c0 Enable .com domain import from vertices.txt.gz
Filter imported domains to only .com TLD for now.
Re-enabled the import loop that was disabled for testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 21:59:14 -05:00
primal
f4afb29980 Migrate from SQLite to PostgreSQL
- Replace modernc.org/sqlite with jackc/pgx/v5
- Update all SQL queries for PostgreSQL syntax ($1, $2 placeholders)
- Use snake_case column names throughout
- Replace SQLite FTS5 with PostgreSQL tsvector/tsquery full-text search
- Add connection pooling with pgxpool
- Support Docker secrets for database password
- Add trigger to normalize feed URLs (strip https://, http://, www.)
- Fix anchor feed detection regex to avoid false positives
- Connect app container to atproto network for PostgreSQL access
- Add version indicator to dashboard UI

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:38:13 -05:00
primal
75835d771d Add AT Protocol publishing, media support, and SQLite stability
Publishing:
- Add publisher.go for posting feed items to AT Protocol PDS
- Support deterministic rkeys from SHA256(guid + discoveredAt)
- Handle multiple URLs in posts with facets for each link
- Image embed support (app.bsky.embed.images) for up to 4 images
- External embed with thumbnail fallback
- Podcast/audio enclosure URLs included in post text

Media extraction:
- Parse RSS enclosures (audio, video, images)
- Extract Media RSS content and thumbnails
- Extract images from HTML content in descriptions
- Store enclosure and imageUrls in items table

SQLite stability improvements:
- Add synchronous=NORMAL and wal_autocheckpoint pragmas
- Connection pool tuning (idle conns, max lifetime)
- Periodic WAL checkpoint every 5 minutes
- Hourly integrity checks with PRAGMA quick_check
- Daily hot backup via VACUUM INTO
- Docker stop_grace_period: 30s for graceful shutdown

Dashboard:
- Feed publishing UI and API endpoints
- Account creation with invite codes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 15:30:02 -05:00
primal
143807378f Add Docker support and refactor data layer 2026-01-26 16:02:05 -05:00
primal
219b49352e Add PebbleDB storage, domain tracking, and web dashboard
- Split main.go into separate files for better organization:
  crawler.go, domain.go, feed.go, parser.go, html.go, util.go
- Add PebbleDB for persistent storage of feeds and domains
- Store feeds with metadata: title, TTL, update frequency, ETag, etc.
- Track domains with crawl status (uncrawled/crawled/error)
- Normalize URLs by stripping scheme and www. prefix
- Add web dashboard on port 4321 with real-time stats:
  - Crawl progress with completion percentage
  - Feed counts by type (RSS/Atom)
  - Top TLDs and domains by feed count
  - Recent feeds table
- Filter out comment feeds from results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 16:29:00 -05:00