@@ -47,10 +47,9 @@ Multi-file Go application that crawls websites for RSS/Atom feeds, stores them i
 
 ### Concurrent Loops (main.go)
 
-The application runs seven independent goroutine loops:
+The application runs six independent goroutine loops:
 - **Import loop** - Reads `vertices.txt.gz` and inserts domains into DB in batches of 100 (status='pass')
-- **Domain check loop** - HEAD requests to verify approved domains are reachable
-- **Crawl loop** - Worker pool crawls verified domains for feed discovery
+- **Crawl loop** - Worker pool crawls approved domains for feed discovery
 - **Feed check loop** - Worker pool re-checks known feeds for updates (conditional HTTP)
 - **Stats loop** - Updates cached dashboard statistics every minute
 - **Cleanup loop** - Removes items older than 12 months (weekly)
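The loops listed in this hunk all follow the same ticker-driven goroutine shape. A minimal sketch of one such loop; the `runCrawlLoop`/`crawlPendingDomains` names and the one-minute interval are assumptions for illustration, not the application's actual code:

```go
// Illustrative sketch only - names and interval are assumptions,
// not the application's real main.go.
package main

import (
	"context"
	"log"
	"time"
)

// crawlPendingDomains is a hypothetical stand-in for one unit of work
// (e.g. picking up approved domains and crawling them for feeds).
func crawlPendingDomains(ctx context.Context) error {
	// ... query DB, run worker pool, save results ...
	return nil
}

// runCrawlLoop runs one independent loop until the context is cancelled.
// Each loop would follow this shape with its own interval and work function.
func runCrawlLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := crawlPendingDomains(ctx); err != nil {
				log.Printf("crawl loop: %v", err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go runCrawlLoop(ctx, time.Minute) // one goroutine per loop
	select {}                         // block forever (real code would wait on a signal)
}
```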
@@ -78,7 +77,7 @@ The application runs seven independent goroutine loops:
 ### Database Schema
 
 PostgreSQL with pgx driver, using connection pooling:
-- **domains** - Hosts to crawl (status: hold/pass/skip/fail)
+- **domains** - Hosts to crawl (status: hold/pass/skip)
 - **feeds** - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
 - **items** - Individual feed entries (guid + feed_url unique)
 - **search_vector** - GENERATED tsvector columns for full-text search (GIN indexed)
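The schema above is accessed through pgx with a connection pool. A minimal sketch of opening a pool and reading from the `domains` table; the DSN, column names, and query are assumptions based on the notes in this hunk, not the project's code:

```go
// Illustrative sketch only - DSN, column names, and query are assumptions.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()

	// pgxpool manages a pool of connections shared by the concurrent loops.
	pool, err := pgxpool.New(ctx, "postgres://user:pass@localhost:5432/feeds")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer pool.Close()

	// Example query against the domains table (status: hold/pass/skip).
	rows, err := pool.Query(ctx, "SELECT host, status FROM domains WHERE status = $1", "pass")
	if err != nil {
		log.Fatalf("query: %v", err)
	}
	defer rows.Close()

	for rows.Next() {
		var host, status string
		if err := rows.Scan(&host, &status); err != nil {
			log.Fatalf("scan: %v", err)
		}
		fmt.Println(host, status)
	}
}
```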
@@ -88,11 +87,10 @@ Column naming: snake_case (e.g., `source_host`, `pub_date`, `item_count`)
 ### Crawl Logic
 
 1. Domains import as `pass` by default (auto-crawled)
-2. Check stage: HEAD request verifies domain is reachable, sets last_checked_at
-3. Crawl stage: Full recursive crawl (HTTPS, fallback HTTP)
-4. Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
-5. Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed
-6. Parse discovered feeds for metadata, save with next_crawl_at
+2. Crawl loop picks up domains where `last_crawled_at IS NULL`
+3. Full recursive crawl (HTTPS, fallback HTTP) up to MaxDepth=10, MaxPagesPerHost=10
+4. Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed
+5. Parse discovered feeds for metadata, save with next_crawl_at
 
 ### Feed Checking
 
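The crawl steps in this hunk extract `<link rel="alternate">` elements and anchors whose href mentions rss/atom/feed. A sketch of that extraction, assuming `golang.org/x/net/html`; the project's actual parser is not shown in the diff:

```go
// Illustrative sketch only - the real crawler's extraction logic is not in this diff.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// extractFeedLinks walks parsed HTML and collects candidate feed URLs:
// <link rel="alternate"> with an XML feed type, plus any anchor whose
// href mentions rss, atom, or feed.
func extractFeedLinks(doc *html.Node) []string {
	var out []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			attrs := map[string]string{}
			for _, a := range n.Attr {
				attrs[strings.ToLower(a.Key)] = a.Val
			}
			switch n.Data {
			case "link":
				if strings.EqualFold(attrs["rel"], "alternate") &&
					strings.Contains(attrs["type"], "xml") {
					out = append(out, attrs["href"])
				}
			case "a":
				href := strings.ToLower(attrs["href"])
				if strings.Contains(href, "rss") || strings.Contains(href, "atom") ||
					strings.Contains(href, "feed") {
					out = append(out, attrs["href"])
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return out
}

func main() {
	page := `<html><head><link rel="alternate" type="application/rss+xml" href="/index.xml"></head>
<body><a href="/feed/">Feed</a></body></html>`
	doc, err := html.Parse(strings.NewReader(page))
	if err != nil {
		panic(err)
	}
	fmt.Println(extractFeedLinks(doc)) // [/index.xml /feed/]
}
```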
@@ -103,17 +101,15 @@ Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 1
 Feeds with `publish_status = 'pass'` have their items automatically posted to AT Protocol.
 Status values: `hold` (default/pending review), `pass` (approved), `skip` (rejected).
 
-### Domain Processing (Two-Stage)
-
-1. **Check stage** - HEAD request to verify domain is reachable
-2. **Crawl stage** - Full recursive crawl for feed discovery
+### Domain Processing
 
 Domain status values:
 - `pass` (default on import) - Domain is crawled and checked automatically
 - `hold` (manual) - Pauses crawling, keeps existing feeds and items
 - `skip` (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
 - `drop` (manual, via button) - Permanently **deletes** all feeds, items, and PDS accounts (requires skip first)
-- `fail` (automatic) - Set when check/crawl fails, keeps existing feeds and items
 
+Note: Errors during check/crawl are recorded in `last_error` but do not change the domain status.
+
 Skip vs Drop:
 - `skip` is reversible - use "un-skip" to restore accounts and resume publishing
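Feed checking above relies on conditional HTTP (ETag, If-Modified-Since). A minimal sketch of that request pattern; the `cachedFeed` struct and its field names are assumptions standing in for the cache headers stored on the feed row:

```go
// Illustrative sketch only - struct and caller are assumptions; only the
// ETag / If-Modified-Since pattern itself comes from the notes above.
package main

import (
	"fmt"
	"io"
	"net/http"
)

// cachedFeed carries the cache headers stored alongside a feed row.
type cachedFeed struct {
	URL          string
	ETag         string
	LastModified string
}

// fetchIfChanged performs a conditional GET. A 304 response means the feed
// is unchanged and the body can be skipped entirely.
func fetchIfChanged(client *http.Client, f *cachedFeed) ([]byte, bool, error) {
	req, err := http.NewRequest(http.MethodGet, f.URL, nil)
	if err != nil {
		return nil, false, err
	}
	if f.ETag != "" {
		req.Header.Set("If-None-Match", f.ETag)
	}
	if f.LastModified != "" {
		req.Header.Set("If-Modified-Since", f.LastModified)
	}

	resp, err := client.Do(req)
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		return nil, false, nil // unchanged - back off before the next check
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, false, err
	}
	// Remember the validators for the next conditional request.
	f.ETag = resp.Header.Get("ETag")
	f.LastModified = resp.Header.Get("Last-Modified")
	return body, true, nil
}

func main() {
	feed := &cachedFeed{URL: "https://example.com/index.xml"}
	if body, changed, err := fetchIfChanged(http.DefaultClient, feed); err == nil && changed {
		fmt.Printf("fetched %d bytes\n", len(body))
	}
}
```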