Update CLAUDE.md to reflect current multi-file architecture
The codebase evolved from a single-file app to a multi-file structure with SQLite persistence, a dashboard, and concurrent processing loops. Updated the documentation to accurately describe the current architecture.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Build & Run

```bash
go build -o 1440.news .   # Build
./1440.news               # Run (starts dashboard at http://localhost:4321)
go fmt ./...              # Format
go vet ./...              # Static analysis
```

Requires `vertices.txt.gz` (Common Crawl domain list) in the working directory.
## Architecture

Multi-file Go application that crawls websites for RSS/Atom feeds, stores them in SQLite, and provides a web dashboard.

### Concurrent Loops (main.go)

The application runs five independent goroutine loops (see the sketch after this list):

- **Import loop** - Reads `vertices.txt.gz` and inserts domains into the DB in 10k batches
- **Crawl loop** - Worker pool processes unchecked domains, discovers feeds
- **Check loop** - Worker pool re-checks known feeds for updates (conditional HTTP)
- **Stats loop** - Updates cached dashboard statistics every minute
- **Cleanup loop** - Removes items older than 12 months (weekly)
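A minimal sketch of this startup pattern, purely for illustration: the loop bodies are placeholders and the helper `every` is an assumption, not the project's actual code; only the one-minute stats cadence, the weekly cleanup, and the dashboard port come from the description above.

```go
// Sketch of the five-loop startup; every name here is illustrative.
package main

import (
	"log"
	"net/http"
	"time"
)

// every runs fn immediately, then again after each interval.
func every(d time.Duration, fn func()) {
	for {
		fn()
		time.Sleep(d)
	}
}

func main() {
	go func() { /* import loop: read vertices.txt.gz, insert domains in 10k batches */ }()
	go func() { /* crawl loop: worker pool over unchecked domains */ }()
	go func() { /* check loop: worker pool re-checking known feeds via conditional HTTP */ }()
	go every(time.Minute, func() { /* stats loop: refresh cached dashboard statistics */ })
	go every(7*24*time.Hour, func() { /* cleanup loop: delete items older than 12 months */ })

	// Dashboard server (real handlers live in dashboard.go).
	log.Fatal(http.ListenAndServe(":4321", nil))
}
```

Because each loop owns its own cadence, a slow crawl or a long import never blocks the dashboard or the feed checks.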
### File Structure

| File | Purpose |
|------|---------|
| `crawler.go` | Crawler struct, worker pools, page fetching, recursive crawl logic |
| `domain.go` | Domain struct, DB operations, vertices file import |
| `feed.go` | Feed/Item structs, DB operations, feed checking with HTTP caching |
| `parser.go` | RSS/Atom XML parsing, date parsing, next-crawl calculation |
| `html.go` | HTML parsing: feed link extraction, anchor feed detection |
| `util.go` | URL normalization, host utilities, TLD extraction |
| `db.go` | SQLite schema (domains, feeds, items tables with FTS5) |
| `dashboard.go` | HTTP server, JSON APIs, HTML template |
### Database Schema

SQLite with WAL mode at `feeds/feeds.db` (a schema sketch follows the list):

- **domains** - Hosts to crawl (status: unchecked/checked/error)
- **feeds** - Discovered RSS/Atom feeds with metadata and cache headers
- **items** - Individual feed entries (guid + feedUrl unique)
- **feeds_fts / items_fts** - FTS5 virtual tables for search
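A rough reconstruction of that schema in Go, assuming the pure-Go `modernc.org/sqlite` driver; the driver choice and any column not named above (title, link, pubDate, and so on) are guesses, while the table names, WAL mode, status values, and the `(guid, feedUrl)` uniqueness come from the description.

```go
// Hypothetical schema bootstrap; column details beyond the documented ones are assumed.
package main

import (
	"database/sql"
	"log"
	"os"

	_ "modernc.org/sqlite" // driver choice is an assumption, not necessarily the project's
)

var schema = []string{
	`PRAGMA journal_mode=WAL`,
	`CREATE TABLE IF NOT EXISTS domains (
		host   TEXT PRIMARY KEY,
		status TEXT NOT NULL DEFAULT 'unchecked' -- unchecked / checked / error
	)`,
	`CREATE TABLE IF NOT EXISTS feeds (
		url          TEXT PRIMARY KEY,
		title        TEXT,
		type         TEXT,    -- rss / atom
		etag         TEXT,    -- cache headers for conditional requests
		lastModified TEXT,
		nextCrawlAt  INTEGER
	)`,
	`CREATE TABLE IF NOT EXISTS items (
		guid    TEXT NOT NULL,
		feedUrl TEXT NOT NULL,
		title   TEXT,
		link    TEXT,
		pubDate INTEGER,
		UNIQUE (guid, feedUrl)
	)`,
	`CREATE VIRTUAL TABLE IF NOT EXISTS feeds_fts USING fts5(url, title)`,
	`CREATE VIRTUAL TABLE IF NOT EXISTS items_fts USING fts5(feedUrl, title, link)`,
}

func main() {
	if err := os.MkdirAll("feeds", 0o755); err != nil {
		log.Fatal(err)
	}
	db, err := sql.Open("sqlite", "feeds/feeds.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	for _, stmt := range schema {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}
}
```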
### Crawl Logic

1. Domain picked from `unchecked` status (random order)
2. Try HTTPS, fall back to HTTP
3. Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
4. Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed (extraction sketched below)
5. Parse discovered feeds for metadata, save with nextCrawlAt
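For step 4, a self-contained sketch of `<link rel="alternate">` extraction using `golang.org/x/net/html`; it mirrors the general idea of `html.go`, but the function name, filtering rules, and anchor-href handling in the real code may differ.

```go
// Minimal feed-link extraction sketch; only <link rel="alternate"> is handled here.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// extractFeedLinks returns the href of every <link rel="alternate"> whose type
// looks like an RSS or Atom feed.
func extractFeedLinks(doc *html.Node) []string {
	var feeds []string
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "link" {
			var rel, typ, href string
			for _, a := range n.Attr {
				switch strings.ToLower(a.Key) {
				case "rel":
					rel = strings.ToLower(a.Val)
				case "type":
					typ = strings.ToLower(a.Val)
				case "href":
					href = a.Val
				}
			}
			if rel == "alternate" && href != "" &&
				(strings.Contains(typ, "rss") || strings.Contains(typ, "atom")) {
				feeds = append(feeds, href)
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return feeds
}

func main() {
	page := `<html><head>
		<link rel="alternate" type="application/rss+xml" href="/feed.xml">
		<link rel="alternate" type="application/atom+xml" href="/atom.xml">
	</head><body><a href="/blog/rss">RSS</a></body></html>`
	doc, err := html.Parse(strings.NewReader(page))
	if err != nil {
		panic(err)
	}
	fmt.Println(extractFeedLinks(doc)) // [/feed.xml /atom.xml]
}
```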
### Feed Checking

Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 100s per consecutive no-change. Respects RSS `<ttl>` and Syndication namespace hints.
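A sketch of that checking strategy: the 100-second constants come from the line above, while the `feedState` struct, its field names, and the `checkFeed` helper are illustrative assumptions (the TTL and Syndication hints are omitted here).

```go
// Conditional GET plus adaptive backoff, roughly as described above.
package main

import (
	"fmt"
	"net/http"
	"time"
)

type feedState struct {
	URL                 string
	ETag                string
	LastModified        string
	ConsecutiveNoChange int
}

// nextInterval implements "base 100s + 100s per consecutive no-change".
func nextInterval(f feedState) time.Duration {
	return time.Duration(100+100*f.ConsecutiveNoChange) * time.Second
}

// checkFeed issues a conditional GET; a 304 response means nothing changed.
func checkFeed(f *feedState) (changed bool, err error) {
	req, err := http.NewRequest(http.MethodGet, f.URL, nil)
	if err != nil {
		return false, err
	}
	if f.ETag != "" {
		req.Header.Set("If-None-Match", f.ETag)
	}
	if f.LastModified != "" {
		req.Header.Set("If-Modified-Since", f.LastModified)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		f.ConsecutiveNoChange++
		return false, nil
	}
	// Changed: remember the new validators and reset the backoff counter.
	f.ETag = resp.Header.Get("ETag")
	f.LastModified = resp.Header.Get("Last-Modified")
	f.ConsecutiveNoChange = 0
	return true, nil
}

func main() {
	f := feedState{URL: "https://example.com/feed.xml", ConsecutiveNoChange: 3}
	fmt.Println(nextInterval(f)) // 6m40s (100s base + 3 * 100s)
}
```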
## AT Protocol Integration (Planned)

Domain: 1440.news

User structure (handle construction sketched below):

- `wehrv.1440.news` - Owner/admin account
- `{domain}.1440.news` - Catch-all feed per source (e.g., `wsj.com.1440.news`)
- `{category}.{domain}.1440.news` - Category-specific feeds (future)
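Since this is still planned, the following is only an illustration of the naming convention; the helper and the `markets` category in the example are hypothetical.

```go
// Illustration of the planned handle scheme only; nothing here exists yet.
package main

import "fmt"

// handleFor builds the catch-all handle for a source domain, optionally scoped
// to a category (a future phase).
func handleFor(domain, category string) string {
	if category == "" {
		return domain + ".1440.news"
	}
	return category + "." + domain + ".1440.news"
}

func main() {
	fmt.Println(handleFor("wsj.com", ""))        // wsj.com.1440.news
	fmt.Println(handleFor("wsj.com", "markets")) // markets.wsj.com.1440.news (hypothetical category)
}
```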
Phases:

1. Local PDS setup
2. Account management
3. Auto-create domain users
4. Post articles to accounts
5. Category detection