v100
@@ -47,10 +47,9 @@ Multi-file Go application that crawls websites for RSS/Atom feeds, stores them i
 ### Concurrent Loops (main.go)
 
-The application runs seven independent goroutine loops:
+The application runs six independent goroutine loops:
 - **Import loop** - Reads `vertices.txt.gz` and inserts domains into DB in batches of 100 (status='pass')
-- **Domain check loop** - HEAD requests to verify approved domains are reachable
-- **Crawl loop** - Worker pool crawls verified domains for feed discovery
+- **Crawl loop** - Worker pool crawls approved domains for feed discovery
 - **Feed check loop** - Worker pool re-checks known feeds for updates (conditional HTTP)
 - **Stats loop** - Updates cached dashboard statistics every minute
 - **Cleanup loop** - Removes items older than 12 months (weekly)
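
For orientation, a minimal sketch of how `main.go`'s independent loops could be wired up. The loop names follow the documentation above; the `loop` helper, the intervals, and the inclusion of a publish loop are illustrative assumptions, not code from this commit.

```go
package main

import (
	"fmt"
	"time"
)

// loop is an illustrative stand-in for one long-running goroutine loop.
// The real loop bodies live on the Crawler type and are not reproduced here.
func loop(name string, every time.Duration, work func()) {
	for {
		work()
		time.Sleep(every)
	}
}

func main() {
	go loop("import", time.Second, func() { fmt.Println("import a batch of domains from vertices.txt.gz") })
	go loop("crawl", time.Second, func() { fmt.Println("crawl approved domains for feed discovery") })
	go loop("feed-check", time.Second, func() { fmt.Println("re-check known feeds with conditional HTTP") })
	go loop("stats", time.Minute, func() { fmt.Println("refresh cached dashboard statistics") })
	go loop("cleanup", 7*24*time.Hour, func() { fmt.Println("purge items older than 12 months") })
	go loop("publish", time.Second, func() { fmt.Println("post approved items to AT Protocol") })
	select {} // block forever; the loops run until the process exits
}
```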
@@ -78,7 +77,7 @@ The application runs seven independent goroutine loops:
 ### Database Schema
 
 PostgreSQL with pgx driver, using connection pooling:
-- **domains** - Hosts to crawl (status: hold/pass/skip/fail)
+- **domains** - Hosts to crawl (status: hold/pass/skip)
 - **feeds** - Discovered RSS/Atom feeds with metadata and cache headers (publish_status: hold/pass/skip)
 - **items** - Individual feed entries (guid + feed_url unique)
 - **search_vector** - GENERATED tsvector columns for full-text search (GIN indexed)
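
As a rough illustration of the pgx-backed full-text search path, the sketch below runs the same `search_vector @@ to_tsquery(...)` predicate the search handler uses later in this diff. The `pgxpool` wiring and the selected columns are assumptions.

```go
// Package sketch is illustrative only.
package sketch

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// SearchFeedURLs queries the GIN-indexed search_vector column the same way the
// search handler does; pool setup and column choice are assumptions.
func SearchFeedURLs(ctx context.Context, pool *pgxpool.Pool, query string, limit int) ([]string, error) {
	rows, err := pool.Query(ctx, `
		SELECT url
		FROM feeds
		WHERE search_vector @@ to_tsquery('english', $1)
		LIMIT $2
	`, query, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var urls []string
	for rows.Next() {
		var u string
		if err := rows.Scan(&u); err != nil {
			return nil, err
		}
		urls = append(urls, u)
	}
	return urls, rows.Err()
}
```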
@@ -88,11 +87,10 @@ Column naming: snake_case (e.g., `source_host`, `pub_date`, `item_count`)
 ### Crawl Logic
 
 1. Domains import as `pass` by default (auto-crawled)
-2. Check stage: HEAD request verifies domain is reachable, sets last_checked_at
-3. Crawl stage: Full recursive crawl (HTTPS, fallback HTTP)
-4. Recursive crawl up to MaxDepth=10, MaxPagesPerHost=10
-5. Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed
-6. Parse discovered feeds for metadata, save with next_crawl_at
+2. Crawl loop picks up domains where `last_crawled_at IS NULL`
+3. Full recursive crawl (HTTPS, fallback HTTP) up to MaxDepth=10, MaxPagesPerHost=10
+4. Extract `<link rel="alternate">` and anchor hrefs containing rss/atom/feed
+5. Parse discovered feeds for metadata, save with next_crawl_at
 
 ### Feed Checking
 
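
The extraction step (step 4 above) is not shown in this diff; a rough sketch of how it could look using `golang.org/x/net/html`, which is an assumption about the parser used:

```go
// Package sketch is illustrative only.
package sketch

import (
	"strings"

	"golang.org/x/net/html"
)

// ExtractFeedLinks collects <link rel="alternate"> hrefs plus anchor hrefs that
// look like rss/atom/feed URLs. The real crawler's extraction code is not part
// of this diff.
func ExtractFeedLinks(doc *html.Node) []string {
	var found []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			attrs := map[string]string{}
			for _, a := range n.Attr {
				attrs[strings.ToLower(a.Key)] = a.Val
			}
			switch n.Data {
			case "link":
				if strings.EqualFold(attrs["rel"], "alternate") && attrs["href"] != "" {
					found = append(found, attrs["href"])
				}
			case "a":
				href := strings.ToLower(attrs["href"])
				if strings.Contains(href, "rss") || strings.Contains(href, "atom") || strings.Contains(href, "feed") {
					found = append(found, attrs["href"])
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return found
}
```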
@@ -103,17 +101,15 @@ Uses conditional HTTP (ETag, If-Modified-Since). Adaptive backoff: base 100s + 1
 Feeds with `publish_status = 'pass'` have their items automatically posted to AT Protocol.
 Status values: `hold` (default/pending review), `pass` (approved), `skip` (rejected).
 
-### Domain Processing (Two-Stage)
+### Domain Processing
 
-1. **Check stage** - HEAD request to verify domain is reachable
-2. **Crawl stage** - Full recursive crawl for feed discovery
-
 Domain status values:
 - `pass` (default on import) - Domain is crawled and checked automatically
 - `hold` (manual) - Pauses crawling, keeps existing feeds and items
 - `skip` (manual) - Takes down PDS accounts (hides posts), marks feeds inactive, preserves all data
 - `drop` (manual, via button) - Permanently **deletes** all feeds, items, and PDS accounts (requires skip first)
-- `fail` (automatic) - Set when check/crawl fails, keeps existing feeds and items
+
+Note: Errors during check/crawl are recorded in `last_error` but do not change the domain status.
 
 Skip vs Drop:
 - `skip` is reversible - use "un-skip" to restore accounts and resume publishing
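
The conditional-HTTP behaviour mentioned in the hunk header above (ETag / If-Modified-Since, with the values stored on the feeds table) could look roughly like the following sketch. The helper name and the exact 304 handling are assumptions, not code from this commit.

```go
// Package sketch is illustrative only.
package sketch

import "net/http"

// ConditionalFetch sends stored ETag / Last-Modified values and treats a 304
// response as "feed unchanged". Scheduling and backoff are out of scope here.
func ConditionalFetch(client *http.Client, url, etag, lastModified string) (*http.Response, bool, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, false, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, false, err
	}
	if resp.StatusCode == http.StatusNotModified {
		resp.Body.Close()
		return nil, false, nil // unchanged; caller can push next_crawl_at further out
	}
	return resp, true, nil
}
```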
+27 -15
@@ -50,6 +50,7 @@ func (c *Crawler) handleAPIDomains(w http.ResponseWriter, r *http.Request) {
 status := r.URL.Query().Get("status")
 hasFeeds := r.URL.Query().Get("has_feeds") == "true"
 search := r.URL.Query().Get("search")
+tldFilter := r.URL.Query().Get("tld")
 limit := 100
 offset := 0
 if l := r.URL.Query().Get("limit"); l != "" {
@@ -68,7 +69,22 @@ func (c *Crawler) handleAPIDomains(w http.ResponseWriter, r *http.Request) {
 if hasFeeds {
 // Only domains with feeds
 searchPattern := "%" + strings.ToLower(search) + "%"
-if search != "" {
+if tldFilter != "" {
+// Filter by specific TLD
+rows, err = c.db.Query(`
+SELECT d.host, d.tld, d.status, d.last_error, f.feed_count
+FROM domains d
+INNER JOIN (
+SELECT source_host, COUNT(*) as feed_count
+FROM feeds
+WHERE item_count > 0
+GROUP BY source_host
+) f ON d.host = f.source_host
+WHERE d.status != 'skip' AND d.tld = $1
+ORDER BY d.host ASC
+LIMIT $2 OFFSET $3
+`, tldFilter, limit, offset)
+} else if search != "" {
 // Search in domain host or feed title/url
 rows, err = c.db.Query(`
 SELECT DISTINCT d.host, d.tld, d.status, d.last_error, f.feed_count
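
A hypothetical call exercising the new `tld` parameter. The `/api/domains` route and the localhost address are assumptions; only the handler's query parameters (status, has_feeds, search, tld, limit, offset) come from this diff.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// List domains under the .org TLD via the handler shown above (route assumed).
	resp, err := http.Get("http://localhost:8080/api/domains?tld=org&limit=50")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```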
@@ -306,7 +322,7 @@ func (c *Crawler) handleAPIDomainFeeds(w http.ResponseWriter, r *http.Request) {
 }
 
 rows, err := c.db.Query(`
-SELECT url, title, type, status, error_count, last_error, item_count, publish_status, language
+SELECT url, title, type, status, last_error, item_count, publish_status, language
 FROM feeds
 WHERE source_host = $1
 ORDER BY url ASC
@@ -323,7 +339,6 @@ func (c *Crawler) handleAPIDomainFeeds(w http.ResponseWriter, r *http.Request) {
 Title string `json:"title"`
 Type string `json:"type"`
 Status string `json:"status,omitempty"`
-ErrorCount int `json:"error_count,omitempty"`
 LastError string `json:"last_error,omitempty"`
 ItemCount int `json:"item_count,omitempty"`
 PublishStatus string `json:"publish_status,omitempty"`
@@ -334,8 +349,8 @@ func (c *Crawler) handleAPIDomainFeeds(w http.ResponseWriter, r *http.Request) {
 for rows.Next() {
 var f FeedInfo
 var title, status, lastError, publishStatus, language *string
-var errorCount, itemCount *int
-if err := rows.Scan(&f.URL, &title, &f.Type, &status, &errorCount, &lastError, &itemCount, &publishStatus, &language); err != nil {
+var itemCount *int
+if err := rows.Scan(&f.URL, &title, &f.Type, &status, &lastError, &itemCount, &publishStatus, &language); err != nil {
 continue
 }
 f.Title = StringValue(title)
@@ -343,9 +358,6 @@ func (c *Crawler) handleAPIDomainFeeds(w http.ResponseWriter, r *http.Request) {
 f.LastError = StringValue(lastError)
 f.PublishStatus = StringValue(publishStatus)
 f.Language = StringValue(language)
-if errorCount != nil {
-f.ErrorCount = *errorCount
-}
 if itemCount != nil {
 f.ItemCount = *itemCount
 }
@@ -357,7 +369,7 @@ func (c *Crawler) handleAPIDomainFeeds(w http.ResponseWriter, r *http.Request) {
 }
 
 // handleAPISetDomainStatus sets the status for a domain
-// status must be 'hold', 'pass', 'skip', or 'fail' (use /api/dropDomain for 'drop')
+// status must be 'hold', 'pass', or 'skip' (use /api/dropDomain for 'drop')
 func (c *Crawler) handleAPISetDomainStatus(w http.ResponseWriter, r *http.Request) {
 host := r.URL.Query().Get("host")
 status := r.URL.Query().Get("status")
@@ -366,8 +378,8 @@ func (c *Crawler) handleAPISetDomainStatus(w http.ResponseWriter, r *http.Reques
 http.Error(w, "host parameter required", http.StatusBadRequest)
 return
 }
-if status != "hold" && status != "pass" && status != "skip" && status != "fail" {
-http.Error(w, "status must be 'hold', 'pass', 'skip', or 'fail' (use /api/dropDomain for permanent deletion)", http.StatusBadRequest)
+if status != "hold" && status != "pass" && status != "skip" {
+http.Error(w, "status must be 'hold', 'pass', or 'skip' (use /api/dropDomain for permanent deletion)", http.StatusBadRequest)
 return
 }
 
@@ -839,9 +851,9 @@ func (c *Crawler) skipDomain(host string) DomainActionResult {
 }
 }
 
-// Mark feeds as inactive (but don't delete)
+// Mark feeds as skipped (but don't delete)
 feedsAffected, err := c.db.Exec(`
-UPDATE feeds SET status = 'inactive', publish_status = 'skip'
+UPDATE feeds SET status = 'skip', publish_status = 'skip'
 WHERE source_host = $1
 `, host)
 if err != nil {
@@ -999,9 +1011,9 @@ func (c *Crawler) restoreDomain(host string) DomainActionResult {
 }
 }
 
-// Restore feeds to active status
+// Restore feeds to pass status
 feedsAffected, err := c.db.Exec(`
-UPDATE feeds SET status = 'active', publish_status = 'pass'
+UPDATE feeds SET status = 'pass', publish_status = 'pass'
 WHERE source_host = $1
 `, host)
 if err != nil {
+34 -67
@@ -18,56 +18,48 @@ func (c *Crawler) handleAPIFeedInfo(w http.ResponseWriter, r *http.Request) {
 }
 
 type FeedDetails struct {
 URL string `json:"url"`
 Type string `json:"type,omitempty"`
 Category string `json:"category,omitempty"`
 Title string `json:"title,omitempty"`
 Description string `json:"description,omitempty"`
 Language string `json:"language,omitempty"`
 SiteURL string `json:"siteUrl,omitempty"`
 DiscoveredAt string `json:"discoveredAt,omitempty"`
 LastCrawledAt string `json:"lastCrawledAt,omitempty"`
 NextCrawlAt string `json:"nextCrawlAt,omitempty"`
 LastBuildDate string `json:"lastBuildDate,omitempty"`
-TTLMinutes int `json:"ttlMinutes,omitempty"`
-UpdatePeriod string `json:"updatePeriod,omitempty"`
-UpdateFreq int `json:"updateFreq,omitempty"`
-Status string `json:"status,omitempty"`
-ErrorCount int `json:"errorCount,omitempty"`
-LastError string `json:"lastError,omitempty"`
-ItemCount int `json:"itemCount,omitempty"`
-AvgPostFreqHrs float64 `json:"avgPostFreqHrs,omitempty"`
-OldestItemDate string `json:"oldestItemDate,omitempty"`
-NewestItemDate string `json:"newestItemDate,omitempty"`
-PublishStatus string `json:"publishStatus,omitempty"`
-PublishAccount string `json:"publishAccount,omitempty"`
+Status string `json:"status,omitempty"`
+LastError string `json:"lastError,omitempty"`
+ItemCount int `json:"itemCount,omitempty"`
+OldestItemDate string `json:"oldestItemDate,omitempty"`
+NewestItemDate string `json:"newestItemDate,omitempty"`
+PublishStatus string `json:"publishStatus,omitempty"`
+PublishAccount string `json:"publishAccount,omitempty"`
 }
 
 var f FeedDetails
 var category, title, description, language, siteUrl *string
 var lastCrawledAt, nextCrawlAt, lastBuildDate *time.Time
-var updatePeriod, status, lastError *string
+var status, lastError *string
 var oldestItemDate, newestItemDate *time.Time
-var ttlMinutes, updateFreq, errorCount, itemCount *int
-var avgPostFreqHrs *float64
+var itemCount *int
 var discoveredAt time.Time
 var publishStatus, publishAccount *string
 
 err := c.db.QueryRow(`
 SELECT url, type, category, title, description, language, site_url,
 discovered_at, last_crawled_at, next_crawl_at, last_build_date,
-ttl_minutes, update_period, update_freq,
-status, error_count, last_error,
+status, last_error,
 (SELECT COUNT(*) FROM items WHERE feed_url = feeds.url) as item_count,
-avg_post_freq_hrs, oldest_item_date, newest_item_date,
+oldest_item_date, newest_item_date,
 publish_status, publish_account
 FROM feeds WHERE url = $1
 `, feedURL).Scan(
 &f.URL, &f.Type, &category, &title, &description, &language, &siteUrl,
 &discoveredAt, &lastCrawledAt, &nextCrawlAt, &lastBuildDate,
-&ttlMinutes, &updatePeriod, &updateFreq,
-&status, &errorCount, &lastError,
-&itemCount, &avgPostFreqHrs, &oldestItemDate, &newestItemDate,
+&status, &lastError,
+&itemCount, &oldestItemDate, &newestItemDate,
 &publishStatus, &publishAccount,
 )
 
@@ -95,24 +87,11 @@ func (c *Crawler) handleAPIFeedInfo(w http.ResponseWriter, r *http.Request) {
 if lastBuildDate != nil {
 f.LastBuildDate = lastBuildDate.Format(time.RFC3339)
 }
-if ttlMinutes != nil {
-f.TTLMinutes = *ttlMinutes
-}
-f.UpdatePeriod = StringValue(updatePeriod)
-if updateFreq != nil {
-f.UpdateFreq = *updateFreq
-}
 f.Status = StringValue(status)
-if errorCount != nil {
-f.ErrorCount = *errorCount
-}
 f.LastError = StringValue(lastError)
 if itemCount != nil {
 f.ItemCount = *itemCount
 }
-if avgPostFreqHrs != nil {
-f.AvgPostFreqHrs = *avgPostFreqHrs
-}
 if oldestItemDate != nil {
 f.OldestItemDate = oldestItemDate.Format(time.RFC3339)
 }
@@ -175,7 +154,7 @@ func (c *Crawler) handleAPIFeedsByStatus(w http.ResponseWriter, r *http.Request)
 }
 
 rows, err := c.db.Query(`
-SELECT url, title, type, source_host, tld, status, error_count, last_error, item_count
+SELECT url, title, type, source_host, tld, status, last_error, item_count
 FROM feeds
 WHERE status = $1
 ORDER BY url ASC
@@ -194,7 +173,6 @@ func (c *Crawler) handleAPIFeedsByStatus(w http.ResponseWriter, r *http.Request)
 SourceHost string `json:"source_host"`
 TLD string `json:"tld"`
 Status string `json:"status"`
-ErrorCount int `json:"error_count,omitempty"`
 LastError string `json:"last_error,omitempty"`
 ItemCount int `json:"item_count,omitempty"`
 }
@@ -203,17 +181,14 @@ func (c *Crawler) handleAPIFeedsByStatus(w http.ResponseWriter, r *http.Request)
 for rows.Next() {
 var f FeedInfo
 var title, sourceHost, tld, lastError *string
-var errorCount, itemCount *int
-if err := rows.Scan(&f.URL, &title, &f.Type, &sourceHost, &tld, &f.Status, &errorCount, &lastError, &itemCount); err != nil {
+var itemCount *int
+if err := rows.Scan(&f.URL, &title, &f.Type, &sourceHost, &tld, &f.Status, &lastError, &itemCount); err != nil {
 continue
 }
 f.Title = StringValue(title)
 f.SourceHost = StringValue(sourceHost)
 f.TLD = StringValue(tld)
 f.LastError = StringValue(lastError)
-if errorCount != nil {
-f.ErrorCount = *errorCount
-}
 if itemCount != nil {
 f.ItemCount = *itemCount
 }
@@ -243,7 +218,7 @@ func (c *Crawler) handleAPIFeeds(w http.ResponseWriter, r *http.Request) {
 var err error
 if publishStatus != "" {
 rows, err = c.db.Query(`
-SELECT url, title, type, source_host, tld, status, error_count, last_error, item_count, publish_status, language
+SELECT url, title, type, source_host, tld, status, last_error, item_count, publish_status, language
 FROM feeds
 WHERE publish_status = $1
 ORDER BY url ASC
@@ -251,7 +226,7 @@ func (c *Crawler) handleAPIFeeds(w http.ResponseWriter, r *http.Request) {
 `, publishStatus, limit, offset)
 } else {
 rows, err = c.db.Query(`
-SELECT url, title, type, source_host, tld, status, error_count, last_error, item_count, publish_status, language
+SELECT url, title, type, source_host, tld, status, last_error, item_count, publish_status, language
 FROM feeds
 ORDER BY url ASC
 LIMIT $1 OFFSET $2
@@ -270,7 +245,6 @@ func (c *Crawler) handleAPIFeeds(w http.ResponseWriter, r *http.Request) {
 SourceHost string `json:"source_host"`
 TLD string `json:"tld"`
 Status string `json:"status"`
-ErrorCount int `json:"error_count,omitempty"`
 LastError string `json:"last_error,omitempty"`
 ItemCount int `json:"item_count,omitempty"`
 PublishStatus string `json:"publish_status,omitempty"`
@@ -281,8 +255,8 @@ func (c *Crawler) handleAPIFeeds(w http.ResponseWriter, r *http.Request) {
 for rows.Next() {
 var f FeedInfo
 var title, sourceHost, tld, lastError, publishStatus, language *string
-var errorCount, itemCount *int
-if err := rows.Scan(&f.URL, &title, &f.Type, &sourceHost, &tld, &f.Status, &errorCount, &lastError, &itemCount, &publishStatus, &language); err != nil {
+var itemCount *int
+if err := rows.Scan(&f.URL, &title, &f.Type, &sourceHost, &tld, &f.Status, &lastError, &itemCount, &publishStatus, &language); err != nil {
 continue
 }
 f.Title = StringValue(title)
@@ -291,9 +265,6 @@ func (c *Crawler) handleAPIFeeds(w http.ResponseWriter, r *http.Request) {
 f.LastError = StringValue(lastError)
 f.PublishStatus = StringValue(publishStatus)
 f.Language = StringValue(language)
-if errorCount != nil {
-f.ErrorCount = *errorCount
-}
 if itemCount != nil {
 f.ItemCount = *itemCount
 }
@@ -308,7 +279,7 @@ func (c *Crawler) filterFeeds(w http.ResponseWriter, tld, domain, status string,
 var args []interface{}
 argNum := 1
 query := `
-SELECT url, title, type, category, source_host, tld, status, error_count, last_error, item_count, language
+SELECT url, title, type, category, source_host, tld, status, last_error, item_count, language
 FROM feeds
 WHERE 1=1`
 
@@ -360,7 +331,6 @@ func (c *Crawler) filterFeeds(w http.ResponseWriter, tld, domain, status string,
 SourceHost string `json:"source_host"`
 TLD string `json:"tld"`
 Status string `json:"status"`
-ErrorCount int `json:"error_count,omitempty"`
 LastError string `json:"last_error,omitempty"`
 ItemCount int `json:"item_count,omitempty"`
 Language string `json:"language,omitempty"`
@@ -370,8 +340,8 @@ func (c *Crawler) filterFeeds(w http.ResponseWriter, tld, domain, status string,
 for rows.Next() {
 var f FeedInfo
 var title, category, sourceHost, tldVal, lastError, language *string
-var errorCount, itemCount *int
-if err := rows.Scan(&f.URL, &title, &f.Type, &category, &sourceHost, &tldVal, &f.Status, &errorCount, &lastError, &itemCount, &language); err != nil {
+var itemCount *int
+if err := rows.Scan(&f.URL, &title, &f.Type, &category, &sourceHost, &tldVal, &f.Status, &lastError, &itemCount, &language); err != nil {
 continue
 }
 f.Title = StringValue(title)
@@ -383,9 +353,6 @@ func (c *Crawler) filterFeeds(w http.ResponseWriter, tld, domain, status string,
 f.SourceHost = StringValue(sourceHost)
 f.TLD = StringValue(tldVal)
 f.LastError = StringValue(lastError)
-if errorCount != nil {
-f.ErrorCount = *errorCount
-}
 if itemCount != nil {
 f.ItemCount = *itemCount
 }
+35 -73
@@ -16,32 +16,27 @@ type SearchResult struct {
 }
 
 type SearchFeed struct {
 URL string `json:"url"`
 Type string `json:"type"`
 Category string `json:"category"`
 Title string `json:"title"`
 Description string `json:"description"`
 Language string `json:"language"`
 SiteURL string `json:"site_url"`
 DiscoveredAt string `json:"discovered_at"`
 LastCrawledAt string `json:"last_crawled_at"`
 NextCrawlAt string `json:"next_crawl_at"`
 LastBuildDate string `json:"last_build_date"`
-TTLMinutes int `json:"ttl_minutes"`
-UpdatePeriod string `json:"update_period"`
-UpdateFreq int `json:"update_freq"`
 Status string `json:"status"`
-ErrorCount int `json:"error_count"`
 LastError string `json:"last_error"`
 LastErrorAt string `json:"last_error_at"`
 SourceURL string `json:"source_url"`
 SourceHost string `json:"source_host"`
 TLD string `json:"tld"`
 ItemCount int `json:"item_count"`
-AvgPostFreqHrs float64 `json:"avg_post_freq_hrs"`
 OldestItemDate string `json:"oldest_item_date"`
 NewestItemDate string `json:"newest_item_date"`
 NoUpdate bool `json:"no_update"`
 }
 
 type SearchItem struct {
@@ -82,20 +77,18 @@ func (c *Crawler) handleAPISearch(w http.ResponseWriter, r *http.Request) {
 var feedType, category, title, description, language, siteUrl *string
 var discoveredAt time.Time
 var lastCrawledAt, nextCrawlAt, lastBuildDate *time.Time
-var ttlMinutes, updateFreq, errorCount, itemCount *int
-var updatePeriod, status, lastError *string
+var itemCount *int
+var status, lastError *string
 var lastErrorAt *time.Time
 var sourceUrl, sourceHost, tld *string
-var avgPostFreqHrs *float64
 var oldestItemDate, newestItemDate *time.Time
 var noUpdate *bool
 
 if err := rows.Scan(&url, &feedType, &category, &title, &description, &language, &siteUrl,
 &discoveredAt, &lastCrawledAt, &nextCrawlAt, &lastBuildDate,
-&ttlMinutes, &updatePeriod, &updateFreq,
-&status, &errorCount, &lastError, &lastErrorAt,
+&status, &lastError, &lastErrorAt,
 &sourceUrl, &sourceHost, &tld,
-&itemCount, &avgPostFreqHrs, &oldestItemDate, &newestItemDate, &noUpdate); err != nil {
+&itemCount, &oldestItemDate, &newestItemDate, &noUpdate); err != nil {
 return "", SearchFeed{}, false
 }
 cat := StringValue(category)
@@ -111,7 +104,6 @@ func (c *Crawler) handleAPISearch(w http.ResponseWriter, r *http.Request) {
 Language: StringValue(language),
 SiteURL: StringValue(siteUrl),
 DiscoveredAt: discoveredAt.Format(time.RFC3339),
-UpdatePeriod: StringValue(updatePeriod),
 Status: StringValue(status),
 LastError: StringValue(lastError),
 SourceURL: StringValue(sourceUrl),
@@ -127,24 +119,12 @@ func (c *Crawler) handleAPISearch(w http.ResponseWriter, r *http.Request) {
 if lastBuildDate != nil {
 sf.LastBuildDate = lastBuildDate.Format(time.RFC3339)
 }
-if ttlMinutes != nil {
-sf.TTLMinutes = *ttlMinutes
-}
-if updateFreq != nil {
-sf.UpdateFreq = *updateFreq
-}
-if errorCount != nil {
-sf.ErrorCount = *errorCount
-}
 if lastErrorAt != nil {
 sf.LastErrorAt = lastErrorAt.Format(time.RFC3339)
 }
 if itemCount != nil {
 sf.ItemCount = *itemCount
 }
-if avgPostFreqHrs != nil {
-sf.AvgPostFreqHrs = *avgPostFreqHrs
-}
 if oldestItemDate != nil {
 sf.OldestItemDate = oldestItemDate.Format(time.RFC3339)
 }
@@ -161,10 +141,9 @@ func (c *Crawler) handleAPISearch(w http.ResponseWriter, r *http.Request) {
 hostRows, err := c.db.Query(`
 SELECT url, type, category, title, description, language, site_url,
 discovered_at, last_crawled_at, next_crawl_at, last_build_date,
-ttl_minutes, update_period, update_freq,
-status, error_count, last_error, last_error_at,
+status, last_error, last_error_at,
 source_url, source_host, tld,
-item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date, no_update
+item_count, oldest_item_date, newest_item_date, no_update
 FROM feeds
 WHERE source_host ILIKE $1 OR url ILIKE $1
 LIMIT $2
@@ -185,10 +164,9 @@ func (c *Crawler) handleAPISearch(w http.ResponseWriter, r *http.Request) {
 feedRows, err := c.db.Query(`
 SELECT url, type, category, title, description, language, site_url,
 discovered_at, last_crawled_at, next_crawl_at, last_build_date,
-ttl_minutes, update_period, update_freq,
-status, error_count, last_error, last_error_at,
+status, last_error, last_error_at,
 source_url, source_host, tld,
-item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date, no_update
+item_count, oldest_item_date, newest_item_date, no_update
 FROM feeds
 WHERE search_vector @@ to_tsquery('english', $1)
 LIMIT $2
@@ -251,28 +229,25 @@ func (c *Crawler) handleAPISearch(w http.ResponseWriter, r *http.Request) {
 var fType, fCategory, fTitle, fDesc, fLang, fSiteUrl *string
 var fDiscoveredAt time.Time
 var fLastCrawledAt, fNextCrawlAt, fLastBuildDate *time.Time
-var fTTLMinutes, fUpdateFreq, fErrorCount, fItemCount *int
-var fUpdatePeriod, fStatus, fLastError *string
+var fItemCount *int
+var fStatus, fLastError *string
 var fLastErrorAt *time.Time
 var fSourceUrl, fSourceHost, fTLD *string
-var fAvgPostFreqHrs *float64
 var fOldestItemDate, fNewestItemDate *time.Time
 var fNoUpdate *bool
 
 c.db.QueryRow(`
 SELECT type, category, title, description, language, site_url,
 discovered_at, last_crawled_at, next_crawl_at, last_build_date,
-ttl_minutes, update_period, update_freq,
-status, error_count, last_error, last_error_at,
+status, last_error, last_error_at,
 source_url, source_host, tld,
-item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date, no_update
+item_count, oldest_item_date, newest_item_date, no_update
 FROM feeds WHERE url = $1
 `, feedUrl).Scan(&fType, &fCategory, &fTitle, &fDesc, &fLang, &fSiteUrl,
 &fDiscoveredAt, &fLastCrawledAt, &fNextCrawlAt, &fLastBuildDate,
-&fTTLMinutes, &fUpdatePeriod, &fUpdateFreq,
-&fStatus, &fErrorCount, &fLastError, &fLastErrorAt,
+&fStatus, &fLastError, &fLastErrorAt,
 &fSourceUrl, &fSourceHost, &fTLD,
-&fItemCount, &fAvgPostFreqHrs, &fOldestItemDate, &fNewestItemDate, &fNoUpdate)
+&fItemCount, &fOldestItemDate, &fNewestItemDate, &fNoUpdate)
 
 fCat := StringValue(fCategory)
 if fCat == "" {
@@ -287,7 +262,6 @@ func (c *Crawler) handleAPISearch(w http.ResponseWriter, r *http.Request) {
 Language: StringValue(fLang),
 SiteURL: StringValue(fSiteUrl),
 DiscoveredAt: fDiscoveredAt.Format(time.RFC3339),
-UpdatePeriod: StringValue(fUpdatePeriod),
 Status: StringValue(fStatus),
 LastError: StringValue(fLastError),
 SourceURL: StringValue(fSourceUrl),
@@ -303,24 +277,12 @@ func (c *Crawler) handleAPISearch(w http.ResponseWriter, r *http.Request) {
 if fLastBuildDate != nil {
 sf.LastBuildDate = fLastBuildDate.Format(time.RFC3339)
 }
-if fTTLMinutes != nil {
-sf.TTLMinutes = *fTTLMinutes
-}
-if fUpdateFreq != nil {
-sf.UpdateFreq = *fUpdateFreq
-}
-if fErrorCount != nil {
-sf.ErrorCount = *fErrorCount
-}
 if fLastErrorAt != nil {
 sf.LastErrorAt = fLastErrorAt.Format(time.RFC3339)
 }
 if fItemCount != nil {
 sf.ItemCount = *fItemCount
 }
-if fAvgPostFreqHrs != nil {
-sf.AvgPostFreqHrs = *fAvgPostFreqHrs
-}
 if fOldestItemDate != nil {
 sf.OldestItemDate = fOldestItemDate.Format(time.RFC3339)
 }
+3 -80
@@ -293,7 +293,7 @@ func (c *Crawler) getAccountForFeed(feedURL string) string {
 var account *string
 err := c.db.QueryRow(`
 SELECT publish_account FROM feeds
-WHERE url = $1 AND publish_status = 'pass' AND status = 'active'
+WHERE url = $1 AND publish_status = 'pass' AND status = 'pass'
 `, feedURL).Scan(&account)
 if err != nil || account == nil || *account == "" {
 // Derive handle from feed URL
@@ -406,7 +406,7 @@ func (c *Crawler) GetAllUnpublishedItems(limit int) ([]Item, error) {
 FROM items i
 JOIN feeds f ON i.feed_url = f.url
 WHERE f.publish_status = 'pass'
-AND f.status = 'active'
+AND f.status = 'pass'
 AND i.published_at IS NULL
 ORDER BY i.discovered_at ASC
 LIMIT $1
@@ -467,84 +467,7 @@ func (c *Crawler) GetAllUnpublishedItems(limit int) ([]Item, error) {
 return items, nil
 }
 
-// StartDomainCheckLoop runs HEAD requests on approved domains to verify they're reachable
-func (c *Crawler) StartDomainCheckLoop() {
-numWorkers := 100
-
-// Buffered channel for domain work
-workChan := make(chan *Domain, 100)
-
-// Start workers
-for i := 0; i < numWorkers; i++ {
-go func() {
-for domain := range workChan {
-// Do HEAD request to verify domain is reachable
-checkErr := c.checkDomain(domain.Host)
-errStr := ""
-if checkErr != nil {
-errStr = checkErr.Error()
-}
-if err := c.markDomainChecked(domain.Host, errStr); err != nil {
-fmt.Printf("Error marking domain %s as checked: %v\n", domain.Host, err)
-}
-}
-}()
-}
-
-const fetchSize = 100
-for {
-domains, err := c.GetDomainsToCheck(fetchSize)
-if err != nil {
-fmt.Printf("Error fetching domains to check: %v\n", err)
-}
-
-if len(domains) == 0 {
-time.Sleep(1 * time.Second)
-continue
-}
-
-fmt.Printf("%s domain-check: %d domains to verify\n", time.Now().Format("15:04:05"), len(domains))
-
-for _, domain := range domains {
-workChan <- domain
-}
-
-time.Sleep(1 * time.Second)
-}
-}
-
-// checkDomain performs a HEAD request to verify a domain is reachable
-func (c *Crawler) checkDomain(host string) error {
-url := "https://" + host
-req, err := http.NewRequest("HEAD", url, nil)
-if err != nil {
-return err
-}
-req.Header.Set("User-Agent", c.UserAgent)
-
-resp, err := c.client.Do(req)
-if err != nil {
-// Try HTTP fallback
-url = "http://" + host
-req, err = http.NewRequest("HEAD", url, nil)
-if err != nil {
-return err
-}
-req.Header.Set("User-Agent", c.UserAgent)
-resp, err = c.client.Do(req)
-if err != nil {
-return err
-}
-}
-defer resp.Body.Close()
-
-if resp.StatusCode >= 400 {
-return fmt.Errorf("HTTP %d", resp.StatusCode)
-}
-return nil
-}
-
-// StartCrawlLoop runs the domain crawling loop independently (crawls checked domains)
+// StartCrawlLoop runs the domain crawling loop independently
 func (c *Crawler) StartCrawlLoop() {
 numWorkers := 100
 
@@ -12,7 +12,6 @@ type DashboardStats struct {
 HoldDomains int `json:"hold_domains"`
 PassDomains int `json:"pass_domains"`
 SkipDomains int `json:"skip_domains"`
-FailDomains int `json:"fail_domains"`
 
 // Feed stats
 TotalFeeds int `json:"total_feeds"`
@@ -187,8 +186,6 @@ func (c *Crawler) collectDomainStats(stats *DashboardStats) error {
 stats.PassDomains = count
 case "skip":
 stats.SkipDomains = count
-case "fail":
-stats.FailDomains = count
 }
 }
 if err := rows.Err(); err != nil {
@@ -27,8 +27,7 @@ CREATE TABLE IF NOT EXISTS domains (
 CREATE INDEX IF NOT EXISTS idx_domains_status ON domains(status);
 CREATE INDEX IF NOT EXISTS idx_domains_tld ON domains(tld);
 CREATE INDEX IF NOT EXISTS idx_domains_feeds_found ON domains(feeds_found DESC) WHERE feeds_found > 0;
-CREATE INDEX IF NOT EXISTS idx_domains_to_check ON domains(status) WHERE last_checked_at IS NULL;
-CREATE INDEX IF NOT EXISTS idx_domains_to_crawl ON domains(status) WHERE last_checked_at IS NOT NULL AND last_crawled_at IS NULL;
+CREATE INDEX IF NOT EXISTS idx_domains_to_crawl ON domains(status) WHERE last_crawled_at IS NULL;
 
 CREATE TABLE IF NOT EXISTS feeds (
 url TEXT PRIMARY KEY,
@@ -47,12 +46,7 @@ CREATE TABLE IF NOT EXISTS feeds (
 etag TEXT,
 last_modified TEXT,
 
-ttl_minutes INTEGER,
-update_period TEXT,
-update_freq INTEGER,
-
-status TEXT DEFAULT 'active',
-error_count INTEGER DEFAULT 0,
+status TEXT DEFAULT 'pass' CHECK(status IN ('hold', 'pass', 'skip')),
 last_error TEXT,
 last_error_at TIMESTAMPTZ,
 
@@ -61,7 +55,6 @@ CREATE TABLE IF NOT EXISTS feeds (
 tld TEXT,
 
 item_count INTEGER,
-avg_post_freq_hrs DOUBLE PRECISION,
 oldest_item_date TIMESTAMPTZ,
 newest_item_date TIMESTAMPTZ,
 
@@ -90,6 +83,7 @@ CREATE INDEX IF NOT EXISTS idx_feeds_status ON feeds(status);
 CREATE INDEX IF NOT EXISTS idx_feeds_discovered_at ON feeds(discovered_at);
 CREATE INDEX IF NOT EXISTS idx_feeds_title ON feeds(title);
 CREATE INDEX IF NOT EXISTS idx_feeds_search ON feeds USING GIN(search_vector);
+CREATE INDEX IF NOT EXISTS idx_feeds_due_check ON feeds(next_crawl_at, no_update DESC) WHERE status = 'pass';
 
 CREATE TABLE IF NOT EXISTS items (
 id BIGSERIAL PRIMARY KEY,
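
The new `idx_feeds_due_check` partial index suggests the feed-check loop selects due feeds by `status = 'pass'` ordered by `next_crawl_at`. A sketch of such a selection is below; the query itself and the `pgxpool` wiring are assumptions, only the index definition comes from this diff.

```go
// Package sketch is illustrative only.
package sketch

import (
	"context"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

// DueFeeds returns feeds that are due for a conditional re-check, shaped so the
// idx_feeds_due_check partial index (status = 'pass', next_crawl_at) can serve it.
func DueFeeds(ctx context.Context, pool *pgxpool.Pool, limit int) (pgx.Rows, error) {
	return pool.Query(ctx, `
		SELECT url, etag, last_modified
		FROM feeds
		WHERE status = 'pass'
		  AND (next_crawl_at IS NULL OR next_crawl_at <= NOW())
		ORDER BY next_crawl_at NULLS FIRST
		LIMIT $1
	`, limit)
}
```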
@@ -15,7 +15,7 @@ import (
 )
 
 // Domain represents a host to be crawled for feeds
-// Status: hold (pending review), pass (approved), skip (not processing), fail (error)
+// Status: hold (pending review), pass (approved), skip (not processing)
 type Domain struct {
 Host string `json:"host"`
 Status string `json:"status"`
@@ -123,28 +123,12 @@ func (c *Crawler) getDomain(host string) (*Domain, error) {
 return domain, nil
 }
 
-// GetDomainsToCheck returns domains ready for checking (status='pass', never checked)
-func (c *Crawler) GetDomainsToCheck(limit int) ([]*Domain, error) {
-rows, err := c.db.Query(`
-SELECT host, status, discovered_at, last_checked_at, last_crawled_at, feeds_found, last_error, tld
-FROM domains WHERE status = 'pass' AND last_checked_at IS NULL
-ORDER BY discovered_at ASC
-LIMIT $1
-`, limit)
-if err != nil {
-return nil, err
-}
-defer rows.Close()
-
-return c.scanDomains(rows)
-}
-
-// GetDomainsToCrawl returns domains ready for crawling (status='pass', checked but not crawled)
+// GetDomainsToCrawl returns domains ready for crawling (status='pass', not yet crawled)
 func (c *Crawler) GetDomainsToCrawl(limit int) ([]*Domain, error) {
 rows, err := c.db.Query(`
 SELECT host, status, discovered_at, last_checked_at, last_crawled_at, feeds_found, last_error, tld
-FROM domains WHERE status = 'pass' AND last_checked_at IS NOT NULL AND last_crawled_at IS NULL
-ORDER BY discovered_at ASC
+FROM domains WHERE status = 'pass' AND last_crawled_at IS NULL
+ORDER BY discovered_at DESC
 LIMIT $1
 `, limit)
 if err != nil {
@@ -180,29 +164,12 @@ func (c *Crawler) scanDomains(rows pgx.Rows) ([]*Domain, error) {
 return domains, rows.Err()
 }
 
-// markDomainChecked updates a domain after the check (HEAD request) stage
-func (c *Crawler) markDomainChecked(host string, lastError string) error {
-now := time.Now()
-if lastError != "" {
-_, err := c.db.Exec(`
-UPDATE domains SET status = 'fail', last_checked_at = $1, last_error = $2
-WHERE host = $3
-`, now, lastError, normalizeHost(host))
-return err
-}
-_, err := c.db.Exec(`
-UPDATE domains SET last_checked_at = $1, last_error = NULL
-WHERE host = $2
-`, now, normalizeHost(host))
-return err
-}
-
 // markDomainCrawled updates a domain after the crawl stage
 func (c *Crawler) markDomainCrawled(host string, feedsFound int, lastError string) error {
 now := time.Now()
 if lastError != "" {
 _, err := c.db.Exec(`
-UPDATE domains SET status = 'fail', last_crawled_at = $1, feeds_found = $2, last_error = $3
+UPDATE domains SET last_crawled_at = $1, feeds_found = $2, last_error = $3
 WHERE host = $4
 `, now, feedsFound, lastError, normalizeHost(host))
 return err
@@ -109,14 +109,8 @@ type Feed struct {
 ETag string `json:"etag,omitempty"`
 LastModified string `json:"last_modified,omitempty"`
 
-// Feed hints for crawl scheduling
-TTLMinutes int `json:"ttl_minutes,omitempty"` // From RSS <ttl> element
-UpdatePeriod string `json:"update_period,omitempty"` // From sy:updatePeriod (hourly, daily, weekly, monthly, yearly)
-UpdateFreq int `json:"update_freq,omitempty"` // From sy:updateFrequency
-
 // Health tracking
-Status string `json:"status"` // "active", "dead", "redirect", "error"
-ErrorCount int `json:"error_count"`
+Status string `json:"status"` // "pass", "hold", "skip"
 LastError string `json:"last_error,omitempty"`
 LastErrorAt time.Time `json:"last_error_at,omitempty"`
 
@@ -126,8 +120,7 @@ type Feed struct {
 TLD string `json:"tld,omitempty"`
 
 // Content stats
 ItemCount int `json:"item_count,omitempty"` // Number of items in last crawl
-AvgPostFreqHrs float64 `json:"avg_post_freq_hrs,omitempty"` // Average hours between posts
 OldestItemDate time.Time `json:"oldest_item_date,omitempty"`
 NewestItemDate time.Time `json:"newest_item_date,omitempty"`
 
@@ -162,13 +155,12 @@ func (c *Crawler) saveFeed(feed *Feed) error {
 url, type, category, title, description, language, site_url,
 discovered_at, last_crawled_at, next_crawl_at, last_build_date,
 etag, last_modified,
-ttl_minutes, update_period, update_freq,
-status, error_count, last_error, last_error_at,
+status, last_error, last_error_at,
 source_url, source_host, tld,
-item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date,
+item_count, oldest_item_date, newest_item_date,
 no_update,
 publish_status, publish_account
-) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30)
+) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25)
 ON CONFLICT(url) DO UPDATE SET
 type = EXCLUDED.type,
 category = EXCLUDED.category,
@@ -181,15 +173,10 @@ func (c *Crawler) saveFeed(feed *Feed) error {
 last_build_date = EXCLUDED.last_build_date,
 etag = EXCLUDED.etag,
 last_modified = EXCLUDED.last_modified,
-ttl_minutes = EXCLUDED.ttl_minutes,
-update_period = EXCLUDED.update_period,
-update_freq = EXCLUDED.update_freq,
 status = EXCLUDED.status,
-error_count = EXCLUDED.error_count,
 last_error = EXCLUDED.last_error,
 last_error_at = EXCLUDED.last_error_at,
 item_count = EXCLUDED.item_count,
-avg_post_freq_hrs = EXCLUDED.avg_post_freq_hrs,
 oldest_item_date = EXCLUDED.oldest_item_date,
 newest_item_date = EXCLUDED.newest_item_date,
 no_update = EXCLUDED.no_update,
@@ -200,10 +187,9 @@ func (c *Crawler) saveFeed(feed *Feed) error {
 NullableString(feed.Language), NullableString(feed.SiteURL),
 feed.DiscoveredAt, NullableTime(feed.LastCrawledAt), NullableTime(feed.NextCrawlAt), NullableTime(feed.LastBuildDate),
 NullableString(feed.ETag), NullableString(feed.LastModified),
-feed.TTLMinutes, NullableString(feed.UpdatePeriod), feed.UpdateFreq,
-feed.Status, feed.ErrorCount, NullableString(feed.LastError), NullableTime(feed.LastErrorAt),
+feed.Status, NullableString(feed.LastError), NullableTime(feed.LastErrorAt),
 NullableString(feed.SourceURL), NullableString(feed.SourceHost), NullableString(feed.TLD),
-feed.ItemCount, feed.AvgPostFreqHrs, NullableTime(feed.OldestItemDate), NullableTime(feed.NewestItemDate),
+feed.ItemCount, NullableTime(feed.OldestItemDate), NullableTime(feed.NewestItemDate),
 feed.NoUpdate,
 publishStatus, NullableString(feed.PublishAccount),
 )
@@ -215,19 +201,17 @@ func (c *Crawler) getFeed(feedURL string) (*Feed, error) {
|
|||||||
feed := &Feed{}
|
feed := &Feed{}
|
||||||
var category, title, description, language, siteURL *string
|
var category, title, description, language, siteURL *string
|
||||||
var lastCrawledAt, nextCrawlAt, lastBuildDate, lastErrorAt, oldestItemDate, newestItemDate *time.Time
|
var lastCrawledAt, nextCrawlAt, lastBuildDate, lastErrorAt, oldestItemDate, newestItemDate *time.Time
|
||||||
var etag, lastModified, updatePeriod, lastError, sourceURL, sourceHost, tld *string
|
var etag, lastModified, lastError, sourceURL, sourceHost, tld *string
|
||||||
var avgPostFreqHrs *float64
|
|
||||||
var publishStatus, publishAccount *string
|
var publishStatus, publishAccount *string
|
||||||
var ttlMinutes, updateFreq, errorCount, itemCount, noUpdate *int
|
var itemCount, noUpdate *int
|
||||||
|
|
||||||
err := c.db.QueryRow(`
|
err := c.db.QueryRow(`
|
||||||
SELECT url, type, category, title, description, language, site_url,
|
SELECT url, type, category, title, description, language, site_url,
|
||||||
discovered_at, last_crawled_at, next_crawl_at, last_build_date,
|
discovered_at, last_crawled_at, next_crawl_at, last_build_date,
|
||||||
etag, last_modified,
|
etag, last_modified,
|
||||||
ttl_minutes, update_period, update_freq,
|
status, last_error, last_error_at,
|
||||||
status, error_count, last_error, last_error_at,
|
|
||||||
source_url, source_host, tld,
|
source_url, source_host, tld,
|
||||||
item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date,
|
item_count, oldest_item_date, newest_item_date,
|
||||||
no_update,
|
no_update,
|
||||||
publish_status, publish_account
|
publish_status, publish_account
|
||||||
FROM feeds WHERE url = $1
|
FROM feeds WHERE url = $1
|
||||||
@@ -235,10 +219,9 @@ func (c *Crawler) getFeed(feedURL string) (*Feed, error) {
|
|||||||
&feed.URL, &feed.Type, &category, &title, &description, &language, &siteURL,
|
&feed.URL, &feed.Type, &category, &title, &description, &language, &siteURL,
|
||||||
&feed.DiscoveredAt, &lastCrawledAt, &nextCrawlAt, &lastBuildDate,
|
&feed.DiscoveredAt, &lastCrawledAt, &nextCrawlAt, &lastBuildDate,
|
||||||
&etag, &lastModified,
|
&etag, &lastModified,
|
||||||
&ttlMinutes, &updatePeriod, &updateFreq,
|
&feed.Status, &lastError, &lastErrorAt,
|
||||||
&feed.Status, &errorCount, &lastError, &lastErrorAt,
|
|
||||||
&sourceURL, &sourceHost, &tld,
|
&sourceURL, &sourceHost, &tld,
|
||||||
&itemCount, &avgPostFreqHrs, &oldestItemDate, &newestItemDate,
|
&itemCount, &oldestItemDate, &newestItemDate,
|
||||||
&noUpdate,
|
&noUpdate,
|
||||||
&publishStatus, &publishAccount,
|
&publishStatus, &publishAccount,
|
||||||
)
|
)
|
||||||
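getFeed scans nullable columns into pointer variables and then converts them with the NullableString/NullableTime and StringValue/TimeValue helpers visible above. Their definitions are not part of this diff, so the following sketch only illustrates an assumed shape for those helpers:

```go
package main

import (
	"fmt"
	"time"
)

// Assumed shapes for the nullable helpers used in saveFeed/getFeed above;
// the real definitions live elsewhere in the repo and may differ.
func NullableString(s string) *string {
	if s == "" {
		return nil // write NULL rather than an empty string
	}
	return &s
}

func StringValue(p *string) string {
	if p == nil {
		return "" // NULL scans back as the zero value
	}
	return *p
}

func NullableTime(t time.Time) *time.Time {
	if t.IsZero() {
		return nil
	}
	return &t
}

func TimeValue(p *time.Time) time.Time {
	if p == nil {
		return time.Time{}
	}
	return *p
}

func main() {
	var etag *string // as scanned from a nullable column
	fmt.Printf("etag=%q\n", StringValue(etag))   // etag=""
	fmt.Println(NullableString("") == nil)       // true
	fmt.Println(TimeValue(nil).IsZero())         // true
	fmt.Println(NullableTime(time.Now()) != nil) // true
}
```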
@@ -265,16 +248,6 @@ func (c *Crawler) getFeed(feedURL string) (*Feed, error) {
  feed.LastBuildDate = TimeValue(lastBuildDate)
  feed.ETag = StringValue(etag)
  feed.LastModified = StringValue(lastModified)
- if ttlMinutes != nil {
- feed.TTLMinutes = *ttlMinutes
- }
- feed.UpdatePeriod = StringValue(updatePeriod)
- if updateFreq != nil {
- feed.UpdateFreq = *updateFreq
- }
- if errorCount != nil {
- feed.ErrorCount = *errorCount
- }
  feed.LastError = StringValue(lastError)
  feed.LastErrorAt = TimeValue(lastErrorAt)
  feed.SourceURL = StringValue(sourceURL)
@@ -283,9 +256,6 @@ func (c *Crawler) getFeed(feedURL string) (*Feed, error) {
  if itemCount != nil {
  feed.ItemCount = *itemCount
  }
- if avgPostFreqHrs != nil {
- feed.AvgPostFreqHrs = *avgPostFreqHrs
- }
  feed.OldestItemDate = TimeValue(oldestItemDate)
  feed.NewestItemDate = TimeValue(newestItemDate)
  if noUpdate != nil {
@@ -314,10 +284,9 @@ func (c *Crawler) GetAllFeeds() ([]*Feed, error) {
  SELECT url, type, category, title, description, language, site_url,
  discovered_at, last_crawled_at, next_crawl_at, last_build_date,
  etag, last_modified,
- ttl_minutes, update_period, update_freq,
- status, error_count, last_error, last_error_at,
+ status, last_error, last_error_at,
  source_url, source_host, tld,
- item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date,
+ item_count, oldest_item_date, newest_item_date,
  no_update,
  publish_status, publish_account
  FROM feeds
@@ -344,21 +313,20 @@ func (c *Crawler) GetFeedCountByHost(host string) (int, error) {
  return count, err
  }

- // GetFeedsDueForCheck returns feeds where next_crawl_at <= now, ordered randomly, limited to n
+ // GetFeedsDueForCheck returns feeds where next_crawl_at <= now, ordered by no_update desc (prioritize infrequent feeds)
  func (c *Crawler) GetFeedsDueForCheck(limit int) ([]*Feed, error) {
  rows, err := c.db.Query(`
  SELECT url, type, category, title, description, language, site_url,
  discovered_at, last_crawled_at, next_crawl_at, last_build_date,
  etag, last_modified,
- ttl_minutes, update_period, update_freq,
- status, error_count, last_error, last_error_at,
+ status, last_error, last_error_at,
  source_url, source_host, tld,
- item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date,
+ item_count, oldest_item_date, newest_item_date,
  no_update,
  publish_status, publish_account
  FROM feeds
- WHERE next_crawl_at <= NOW() AND status != 'dead'
+ WHERE next_crawl_at <= NOW() AND status = 'pass'
- ORDER BY RANDOM()
+ ORDER BY no_update DESC
  LIMIT $1
  `, limit)
  if err != nil {
@@ -375,10 +343,9 @@ func (c *Crawler) GetFeedsByHost(host string) ([]*Feed, error) {
  SELECT url, type, category, title, description, language, site_url,
  discovered_at, last_crawled_at, next_crawl_at, last_build_date,
  etag, last_modified,
- ttl_minutes, update_period, update_freq,
- status, error_count, last_error, last_error_at,
+ status, last_error, last_error_at,
  source_url, source_host, tld,
- item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date,
+ item_count, oldest_item_date, newest_item_date,
  no_update,
  publish_status, publish_account
  FROM feeds WHERE source_host = $1
@@ -398,10 +365,9 @@ func (c *Crawler) SearchFeeds(query string) ([]*Feed, error) {
  SELECT url, type, category, title, description, language, site_url,
  discovered_at, last_crawled_at, next_crawl_at, last_build_date,
  etag, last_modified,
- ttl_minutes, update_period, update_freq,
- status, error_count, last_error, last_error_at,
+ status, last_error, last_error_at,
  source_url, source_host, tld,
- item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date,
+ item_count, oldest_item_date, newest_item_date,
  no_update,
  publish_status, publish_account
  FROM feeds
@@ -424,9 +390,8 @@ func scanFeeds(rows pgx.Rows) ([]*Feed, error) {
  feed := &Feed{}
  var feedType, category, title, description, language, siteURL *string
  var lastCrawledAt, nextCrawlAt, lastBuildDate, lastErrorAt, oldestItemDate, newestItemDate *time.Time
- var etag, lastModified, updatePeriod, lastError, sourceURL, sourceHost, tld *string
+ var etag, lastModified, lastError, sourceURL, sourceHost, tld *string
- var ttlMinutes, updateFreq, errorCount, itemCount, noUpdate *int
+ var itemCount, noUpdate *int
- var avgPostFreqHrs *float64
  var status *string
  var publishStatus, publishAccount *string

@@ -434,10 +399,9 @@ func scanFeeds(rows pgx.Rows) ([]*Feed, error) {
  &feed.URL, &feedType, &category, &title, &description, &language, &siteURL,
  &feed.DiscoveredAt, &lastCrawledAt, &nextCrawlAt, &lastBuildDate,
  &etag, &lastModified,
- &ttlMinutes, &updatePeriod, &updateFreq,
- &status, &errorCount, &lastError, &lastErrorAt,
+ &status, &lastError, &lastErrorAt,
  &sourceURL, &sourceHost, &tld,
- &itemCount, &avgPostFreqHrs, &oldestItemDate, &newestItemDate,
+ &itemCount, &oldestItemDate, &newestItemDate,
  &noUpdate,
  &publishStatus, &publishAccount,
  ); err != nil {
@@ -460,17 +424,7 @@ func scanFeeds(rows pgx.Rows) ([]*Feed, error) {
  feed.LastBuildDate = TimeValue(lastBuildDate)
  feed.ETag = StringValue(etag)
  feed.LastModified = StringValue(lastModified)
- if ttlMinutes != nil {
- feed.TTLMinutes = *ttlMinutes
- }
- feed.UpdatePeriod = StringValue(updatePeriod)
- if updateFreq != nil {
- feed.UpdateFreq = *updateFreq
- }
  feed.Status = StringValue(status)
- if errorCount != nil {
- feed.ErrorCount = *errorCount
- }
  feed.LastError = StringValue(lastError)
  feed.LastErrorAt = TimeValue(lastErrorAt)
  feed.SourceURL = StringValue(sourceURL)
@@ -479,9 +433,6 @@ func scanFeeds(rows pgx.Rows) ([]*Feed, error) {
  if itemCount != nil {
  feed.ItemCount = *itemCount
  }
- if avgPostFreqHrs != nil {
- feed.AvgPostFreqHrs = *avgPostFreqHrs
- }
  feed.OldestItemDate = TimeValue(oldestItemDate)
  feed.NewestItemDate = TimeValue(newestItemDate)
  if noUpdate != nil {
@@ -522,10 +473,9 @@ func (c *Crawler) GetFeedsByPublishStatus(status string) ([]*Feed, error) {
  SELECT url, type, category, title, description, language, site_url,
  discovered_at, last_crawled_at, next_crawl_at, last_build_date,
  etag, last_modified,
- ttl_minutes, update_period, update_freq,
- status, error_count, last_error, last_error_at,
+ status, last_error, last_error_at,
  source_url, source_host, tld,
- item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date,
+ item_count, oldest_item_date, newest_item_date,
  no_update,
  publish_status, publish_account
  FROM feeds
@@ -545,14 +495,13 @@ func (c *Crawler) GetPublishCandidates(limit int) ([]*Feed, error) {
  SELECT url, type, category, title, description, language, site_url,
  discovered_at, last_crawled_at, next_crawl_at, last_build_date,
  etag, last_modified,
- ttl_minutes, update_period, update_freq,
- status, error_count, last_error, last_error_at,
+ status, last_error, last_error_at,
  source_url, source_host, tld,
- item_count, avg_post_freq_hrs, oldest_item_date, newest_item_date,
+ item_count, oldest_item_date, newest_item_date,
  no_update,
  publish_status, publish_account
  FROM feeds
- WHERE publish_status = 'hold' AND item_count > 0 AND status = 'active'
+ WHERE publish_status = 'hold' AND item_count > 0 AND status = 'pass'
  ORDER BY item_count DESC
  LIMIT $1
  `, limit)
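The feed upsert above shrinks from 30 to 25 bind parameters because ttl_minutes, update_period, update_freq, error_count and avg_post_freq_hrs are no longer read or written by any query. If those columns are also meant to disappear from the schema — an assumption, since this commit only touches the queries — a follow-up migration could look like the sketch below (the pgx v5 import path and the DSN are placeholders):

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()
	pool, err := pgxpool.New(ctx, "postgres://localhost/feeds") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// Drop the columns that the queries in this commit stopped using.
	_, err = pool.Exec(ctx, `
		ALTER TABLE feeds
			DROP COLUMN IF EXISTS ttl_minutes,
			DROP COLUMN IF EXISTS update_period,
			DROP COLUMN IF EXISTS update_freq,
			DROP COLUMN IF EXISTS error_count,
			DROP COLUMN IF EXISTS avg_post_freq_hrs`)
	if err != nil {
		log.Fatal(err)
	}
}
```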
(+21, -25)
@@ -32,7 +32,7 @@ func (c *Crawler) processFeed(feedURL, sourceHost, body string, headers http.Hea
  Category: classifyFeed(feedURL),
  DiscoveredAt: now,
  LastCrawledAt: now,
- Status: "active",
+ Status: "pass",
  SourceHost: sourceHost,
  TLD: getTLD(sourceHost),
  ETag: headers.Get("ETag"),
@@ -88,7 +88,7 @@ func (c *Crawler) addFeed(feedURL, feedType, sourceHost, sourceURL string) {
  Type: feedType,
  Category: classifyFeed(feedURL),
  DiscoveredAt: now,
- Status: "active",
+ Status: "pass",
  SourceURL: normalizeURL(sourceURL),
  SourceHost: sourceHost,
  TLD: getTLD(sourceHost),
@@ -149,16 +149,15 @@ func (c *Crawler) CheckFeed(feed *Feed) (bool, error) {
  }
  now := time.Now()
  feed.LastCrawledAt = now
- feed.ErrorCount++
  feed.NoUpdate++
  feed.NextCrawlAt = now.Add(time.Duration(100+100*feed.NoUpdate) * time.Second)
  feed.LastError = err.Error()
  feed.LastErrorAt = now
- feed.Status = "error"
+ feed.Status = "hold"
- // Auto-hold feeds that fail 100+ times
+ // Auto-hold feeds after 1000 consecutive failures/no-changes
- if feed.ErrorCount >= 100 && feed.PublishStatus == "pass" {
+ if feed.NoUpdate >= 1000 && feed.PublishStatus == "pass" {
  feed.PublishStatus = "hold"
- fmt.Printf("Feed auto-held after %d errors: %s\n", feed.ErrorCount, feed.URL)
+ fmt.Printf("Feed auto-held after %d no-updates: %s\n", feed.NoUpdate, feed.URL)
  }
  c.saveFeed(feed)
  return false, err
@@ -173,29 +172,28 @@ func (c *Crawler) CheckFeed(feed *Feed) (bool, error) {
  feed.NoUpdate++
  // Adaptive backoff: 100s base + 100s per consecutive no-change
  feed.NextCrawlAt = now.Add(time.Duration(100+100*feed.NoUpdate) * time.Second)
- feed.ErrorCount = 0
  feed.LastError = ""
- feed.Status = "active"
+ feed.Status = "pass"
+ // Auto-hold feeds after 1000 consecutive no-changes
+ if feed.NoUpdate >= 1000 && feed.PublishStatus == "pass" {
+ feed.PublishStatus = "hold"
+ fmt.Printf("Feed auto-held after %d no-updates: %s\n", feed.NoUpdate, feed.URL)
+ }
  c.saveFeed(feed)
  return false, nil
  }

  // Non-200 response
  if resp.StatusCode != http.StatusOK {
- feed.ErrorCount++
  feed.NoUpdate++
  feed.NextCrawlAt = now.Add(time.Duration(100+100*feed.NoUpdate) * time.Second)
  feed.LastError = resp.Status
  feed.LastErrorAt = now
- if resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusGone {
- feed.Status = "dead"
- } else {
- feed.Status = "error"
- }
- // Auto-hold feeds that fail 100+ times
- if feed.ErrorCount >= 100 && feed.PublishStatus == "pass" {
+ feed.Status = "hold"
+ // Auto-hold feeds after 1000 consecutive failures/no-changes
+ if feed.NoUpdate >= 1000 && feed.PublishStatus == "pass" {
  feed.PublishStatus = "hold"
- fmt.Printf("Feed auto-held after %d errors: %s\n", feed.ErrorCount, feed.URL)
+ fmt.Printf("Feed auto-held after %d no-updates: %s\n", feed.NoUpdate, feed.URL)
  }
  c.saveFeed(feed)
  return false, nil
@@ -204,16 +202,15 @@ func (c *Crawler) CheckFeed(feed *Feed) (bool, error) {
  // 200 OK - feed has new content
  bodyBytes, err := io.ReadAll(resp.Body)
  if err != nil {
- feed.ErrorCount++
  feed.NoUpdate++
  feed.NextCrawlAt = now.Add(time.Duration(100+100*feed.NoUpdate) * time.Second)
  feed.LastError = err.Error()
  feed.LastErrorAt = now
- feed.Status = "error"
+ feed.Status = "hold"
- // Auto-hold feeds that fail 100+ times
+ // Auto-hold feeds after 1000 consecutive failures/no-changes
- if feed.ErrorCount >= 100 && feed.PublishStatus == "pass" {
+ if feed.NoUpdate >= 1000 && feed.PublishStatus == "pass" {
  feed.PublishStatus = "hold"
- fmt.Printf("Feed auto-held after %d errors: %s\n", feed.ErrorCount, feed.URL)
+ fmt.Printf("Feed auto-held after %d no-updates: %s\n", feed.NoUpdate, feed.URL)
  }
  c.saveFeed(feed)
  return false, err
@@ -242,9 +239,8 @@ func (c *Crawler) CheckFeed(feed *Feed) (bool, error) {
  // Content changed - reset backoff
  feed.NoUpdate = 0
  feed.NextCrawlAt = now.Add(100 * time.Second)
- feed.ErrorCount = 0
  feed.LastError = ""
- feed.Status = "active"
+ feed.Status = "pass"
  c.saveFeed(feed)

  // Save items
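Every error and not-modified branch in CheckFeed now shares one schedule: the next check moves out by 100s plus 100s per consecutive no-update, and publishing is auto-held once no_update reaches 1000. A small worked example of that backoff (the values follow directly from the formula in the diff):

```go
package main

import (
	"fmt"
	"time"
)

// nextCheckDelay mirrors the backoff used in CheckFeed above:
// 100s base plus 100s per consecutive no-change or error.
func nextCheckDelay(noUpdate int) time.Duration {
	return time.Duration(100+100*noUpdate) * time.Second
}

func main() {
	for _, n := range []int{0, 1, 10, 100, 1000} {
		fmt.Printf("no_update=%4d -> next check in %v\n", n, nextCheckDelay(n))
	}
	// no_update=   0 -> next check in 1m40s
	// no_update=   1 -> next check in 3m20s
	// no_update=  10 -> next check in 18m20s
	// no_update= 100 -> next check in 2h48m20s
	// no_update=1000 -> next check in 27h48m20s
}
```

By the time a feed hits the 1000 no-update auto-hold threshold, the gap between checks is already more than a day.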
@@ -30,7 +30,7 @@ func main() {
  go crawler.UpdateStats()

  // Start all loops independently
- fmt.Println("Starting import, crawl, check, and stats loops...")
+ fmt.Println("Starting import, crawl, and stats loops...")

  // Import loop (background) - imports .com domains from vertices.txt.gz
  go crawler.ImportDomainsInBackground("vertices.txt.gz")
@@ -56,10 +56,7 @@ func main() {
  // Publish loop (background) - autopublishes items for approved feeds
  go crawler.StartPublishLoop()

- // Domain check loop (background) - verifies approved domains are reachable
- go crawler.StartDomainCheckLoop()
-
- // Crawl loop (background) - crawls checked domains for feeds
+ // Crawl loop (background) - crawls approved domains for feeds
  go crawler.StartCrawlLoop()

  // Wait for shutdown signal
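main now starts one fewer goroutine loop (the domain check loop is gone) and then blocks on a shutdown signal, as the trailing comment indicates. A minimal sketch of that final wait, assuming a conventional signal.Notify pattern (the actual implementation is not shown in this excerpt):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// ... start the import, crawl, feed-check, stats, cleanup and publish
	// goroutines here, as in the diff above ...

	// Wait for shutdown signal (assumed pattern, not taken from the repo).
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
	<-sig // block until Ctrl-C or SIGTERM
	fmt.Println("shutting down")
}
```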
@@ -163,9 +163,6 @@ func (c *Crawler) parseRSSMetadata(body string, feed *Feed) []*Item {
  feed.Description = ch.Description
  feed.Language = ch.Language
  feed.SiteURL = normalizeURL(ch.Link)
- feed.TTLMinutes = ch.TTL
- feed.UpdatePeriod = ch.UpdatePeriod
- feed.UpdateFreq = ch.UpdateFreq
  feed.ItemCount = len(ch.Items)

  // Detect podcast
@@ -251,10 +248,6 @@ func (c *Crawler) parseRSSMetadata(body string, feed *Feed) []*Item {
  feed.OldestItemDate = oldest
  feed.NewestItemDate = newest

- if len(dates) > 1 {
- totalHours := newest.Sub(oldest).Hours()
- feed.AvgPostFreqHrs = totalHours / float64(len(dates)-1)
- }
  }

  return items
@@ -367,10 +360,6 @@ func (c *Crawler) parseAtomMetadata(body string, feed *Feed) []*Item {
  feed.OldestItemDate = oldest
  feed.NewestItemDate = newest

- if len(dates) > 1 {
- totalHours := newest.Sub(oldest).Hours()
- feed.AvgPostFreqHrs = totalHours / float64(len(dates)-1)
- }
  }

  return items
@@ -399,48 +388,8 @@ func parseRSSDate(s string) (time.Time, error) {

  // calculateNextCrawl determines when to next crawl this feed
  func (c *Crawler) calculateNextCrawl(feed *Feed) time.Time {
- now := time.Now()
+ // Adaptive backoff: 100s base + 100s per consecutive no-change
+ return time.Now().Add(time.Duration(100+100*feed.NoUpdate) * time.Second)
-
- // If TTL is specified, use it
- if feed.TTLMinutes > 0 {
- return now.Add(time.Duration(feed.TTLMinutes) * time.Minute)
- }
-
- // If updatePeriod is specified
- if feed.UpdatePeriod != "" {
- freq := feed.UpdateFreq
- if freq == 0 {
- freq = 1
- }
- switch strings.ToLower(feed.UpdatePeriod) {
- case "hourly":
- return now.Add(time.Duration(freq) * time.Hour)
- case "daily":
- return now.Add(time.Duration(freq) * 24 * time.Hour)
- case "weekly":
- return now.Add(time.Duration(freq) * 7 * 24 * time.Hour)
- case "monthly":
- return now.Add(time.Duration(freq) * 30 * 24 * time.Hour)
- case "yearly":
- return now.Add(time.Duration(freq) * 365 * 24 * time.Hour)
- }
- }
-
- // If we have average post frequency, use that
- if feed.AvgPostFreqHrs > 0 {
- // Crawl at half the average frequency, but at least every hour and at most once per day
- crawlInterval := feed.AvgPostFreqHrs / 2
- if crawlInterval < 1 {
- crawlInterval = 1
- }
- if crawlInterval > 24 {
- crawlInterval = 24
- }
- return now.Add(time.Duration(crawlInterval * float64(time.Hour)))
- }
-
- // Default: crawl every 6 hours
- return now.Add(6 * time.Hour)
  }

  // extractItemImages extracts image URLs from an RSS item
@@ -661,10 +610,6 @@ func (c *Crawler) parseJSONFeedMetadata(body string, feed *Feed) []*Item {
  feed.OldestItemDate = oldest
  feed.NewestItemDate = newest

- if len(dates) > 1 {
- totalHours := newest.Sub(oldest).Hours()
- feed.AvgPostFreqHrs = totalHours / float64(len(dates)-1)
- }
  }

  return items
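With TTL, update_period/update_freq and average post frequency gone, calculateNextCrawl reduces to the same 100s + 100s×no_update backoff used in CheckFeed. A hypothetical test sketch for that behaviour (the package name is assumed; Crawler and Feed are the types from the diff):

```go
package crawler

import (
	"testing"
	"time"
)

// TestCalculateNextCrawlBackoff checks that the next crawl time moves out
// linearly with feed.NoUpdate: 100s base + 100s per consecutive no-change.
func TestCalculateNextCrawlBackoff(t *testing.T) {
	c := &Crawler{}
	for _, n := range []int{0, 1, 10} {
		feed := &Feed{NoUpdate: n}
		want := time.Duration(100+100*n) * time.Second
		got := time.Until(c.calculateNextCrawl(feed))
		if got < want-time.Second || got > want+time.Second {
			t.Fatalf("NoUpdate=%d: next crawl in %v, want ~%v", n, got, want)
		}
	}
}
```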
(+161, -66)
@@ -91,16 +91,13 @@ function initDashboard() {
  ['Language', f.language],
  ['Site URL', f.siteUrl],
  ['Status', f.status],
- ['Error Count', f.errorCount],
  ['Last Error', f.lastError],
  ['Item Count', f.itemCount],
- ['Avg Post Freq', f.avgPostFreqHrs ? f.avgPostFreqHrs.toFixed(1) + ' hrs' : null],
  ['Oldest Item', f.oldestItemDate],
  ['Newest Item', f.newestItemDate],
  ['Discovered', f.discoveredAt],
  ['Last Crawled', f.lastCrawledAt],
  ['Next Crawl', f.nextCrawlAt],
- ['TTL', f.ttlMinutes ? f.ttlMinutes + ' min' : null],
  ['Publish Status', f.publishStatus],
  ['Publish Account', f.publishAccount],
  ];
@@ -132,8 +129,8 @@ function initDashboard() {
  let html = '';
  items.forEach(item => {
  const date = item.pub_date ? new Date(item.pub_date).toLocaleDateString() : '';
- html += `<div style="padding: 2px 0; border-bottom: 1px solid #1a1a1a;">`;
+ html += `<div style="padding: 2px 0; border-bottom: 1px solid #1a1a1a; overflow: hidden;">`;
- html += `<span style="color: #666; margin-right: 8px;">${escapeHtml(date)}</span>`;
+ html += `<div style="float: left; width: 6em; white-space: nowrap; margin-right: 6px; color: #666; text-align: right;">${escapeHtml(date)} </div>`;
  if (item.link) {
  html += `<a href="${escapeHtml(item.link)}" target="_blank" style="color: #0af; text-decoration: none;">${escapeHtml(item.title || item.link)}</a>`;
  } else {
@@ -151,44 +148,88 @@ function initDashboard() {
  const statusConfig = {
  hold: { color: '#f90', bg: '#330', border: '#550' },
  skip: { color: '#f66', bg: '#400', border: '#600' },
- pass: { color: '#0f0', bg: '#040', border: '#060' },
+ pass: { color: '#0f0', bg: '#040', border: '#060' }
- fail: { color: '#f00', bg: '#400', border: '#600' }
  };

  // Render status buttons
- function renderStatusBtns(currentStatus, type, id, errorStatus) {
+ function renderStatusBtns(currentStatus, type, id) {
  const order = ['pass', 'hold', 'skip'];
- const showFail = errorStatus === 'error' || errorStatus === 'dead';
  let html = '<div class="status-btn-group" style="display: inline-flex; margin-right: 10px;">';
  order.forEach((s, i) => {
  const cfg = statusConfig[s];
  const isActive = s === currentStatus;
- const bg = isActive ? cfg.bg : '#111';
+ const bg = isActive ? cfg.bg : '#1a1a1a';
  const border = isActive ? cfg.border : '#333';
- const color = isActive ? cfg.color : '#444';
+ const color = isActive ? cfg.color : '#ccc';
  html += `<button class="status-btn" data-type="${type}" data-id="${escapeHtml(id)}" data-status="${s}"
- style="padding: 2px 6px; font-family: monospace;
- background: ${bg}; border: 1px solid ${border}; border-radius: 3px;
+ style="padding: 2px 6px; background: ${bg}; border: 1px solid ${border}; border-radius: 3px;
  color: ${color}; cursor: pointer; margin-left: ${i > 0 ? '1px' : '0'};">${s}</button>`;
  });
- if (showFail) {
- const cfg = statusConfig.fail;
- html += `<button disabled style="padding: 2px 6px; font-family: monospace;
- background: ${cfg.bg}; border: 1px solid ${cfg.border}; border-radius: 3px;
- color: ${cfg.color}; cursor: default; margin-left: 1px;">fail</button>`;
- }
  html += '</div>';
  return html;
  }

+ // Render TLD section header
+ function renderTLDHeader(tld) {
+ return `<div class="tld-section" data-tld="${escapeHtml(tld)}">
+ <div class="tld-header" style="display: flex; align-items: center; padding: 10px; background: #1a1a1a; border-bottom: 1px solid #333; cursor: pointer; user-select: none;">
+ <span class="tld-toggle" style="color: #666; margin-right: 10px;">▼</span>
+ <span style="color: #0af; font-weight: bold; font-size: 1.1em;">.${escapeHtml(tld)}</span>
+ </div>
+ <div class="tld-content" style="display: block;">`;
+ }
+
+ function renderTLDFooter(tld) {
+ return `<div class="tld-footer" style="display: flex; align-items: center; justify-content: flex-start; padding: 6px 10px; background: #1a1a1a; border-top: 1px solid #333; cursor: pointer; user-select: none;">
+ <span style="color: #666; font-size: 0.9em;">▲ .${escapeHtml(tld)}</span>
+ </div>`;
+ }
+
+ function closeTLDSection(container, tld) {
+ const tldContent = container.querySelector(`.tld-section[data-tld="${tld}"] .tld-content`);
+ if (tldContent) {
+ tldContent.insertAdjacentHTML('beforeend', renderTLDFooter(tld));
+ }
+ }
+
+ // Event delegation for TLD header/footer clicks (toggle section)
+ document.addEventListener('click', (e) => {
+ const tldHeader = e.target.closest('.tld-header');
+ const tldFooter = e.target.closest('.tld-footer');
+ if (tldHeader || tldFooter) {
+ const section = (tldHeader || tldFooter).closest('.tld-section');
+ if (section) {
+ const content = section.querySelector('.tld-content');
+ const toggle = section.querySelector('.tld-toggle');
+ if (content) {
+ const isVisible = content.style.display !== 'none';
+ content.style.display = isVisible ? 'none' : 'block';
+ if (toggle) toggle.textContent = isVisible ? '▶' : '▼';
+
+ if (isVisible) {
+ // Closing - scroll to next TLD section
+ const nextSection = section.nextElementSibling;
+ if (nextSection && nextSection.classList.contains('tld-section')) {
+ nextSection.scrollIntoView({ behavior: 'smooth', block: 'start' });
+ }
+ } else {
+ // Opening - load domains if not already loaded
+ if (section.dataset.loaded === 'false') {
+ loadTLDDomains(section, searchQuery);
+ }
+ }
+ }
+ }
+ }
+ });
+
  // Render domain row with feeds
  function renderDomainRow(d) {
  const status = d.status || 'hold';
- const hasError = !!d.last_error;

  let html = `<div class="domain-block" data-host="${escapeHtml(d.host)}" data-status="${status}">`;
  html += `<div class="domain-row" style="display: flex; align-items: center; padding: 8px 10px; border-bottom: 1px solid #202020;">`;
- html += renderStatusBtns(status, 'domain', d.host, hasError ? 'error' : null);
+ html += renderStatusBtns(status, 'domain', d.host);
  html += `<a class="domain-name" href="https://${escapeHtml(d.host)}" target="_blank" style="color: #0af; text-decoration: none;">${escapeHtml(d.host)}</a>`;

  if (d.last_error) {
@@ -206,15 +247,11 @@ function initDashboard() {
  html += `<div class="inline-feed-block" data-url="${escapeHtml(f.url)}" data-status="${feedStatus}">`;
  html += `<div class="feed-row" style="display: flex; align-items: center; padding: 4px 0;">`;

- const lang = f.language || '';
- html += `<span style="display: inline-block; width: 32px; margin-right: 6px; color: #666; font-family: monospace; text-align: center;">${escapeHtml(lang)}</span>`;
- html += renderStatusBtns(feedStatus, 'feed', f.url, f.status);
+ html += `<span style="width: 48px; flex-shrink: 0; white-space: nowrap; margin-right: 6px; color: #666; text-align: center;">${escapeHtml(f.language || '')} </span>`;
+ html += renderStatusBtns(feedStatus, 'feed', f.url);
-
- const statusColor = f.status === 'active' ? '#484' : f.status === 'error' ? '#a66' : '#666';
- html += `<span style="color: ${statusColor}; font-family: monospace; width: 50px; margin-right: 6px;">${escapeHtml(f.status || 'active')}</span>`;

  if (f.item_count > 0) {
- html += `<span style="color: #888; font-family: monospace; width: 55px; margin-right: 6px; text-align: right;">${commaFormat(f.item_count)}</span>`;
+ html += `<span style="color: #888; width: 55px; margin-right: 6px; text-align: center;">${commaFormat(f.item_count)}</span>`;
  } else {
  html += `<span style="width: 55px; margin-right: 6px;"></span>`;
  }
@@ -235,6 +272,7 @@ function initDashboard() {
  html += '<div class="feed-items" style="display: block; padding: 4px 10px; margin-left: 10px; border-left: 2px solid #333;"></div>';
  html += '</div>';
  });
+ html += '<div style="height: 8px;"></div>';
  html += '</div>';
  }
  html += '</div>';
@@ -285,7 +323,7 @@ function initDashboard() {
  infiniteScrollState = null;
  }

- window.addEventListener('scroll', async () => {
+ async function checkInfiniteScroll() {
  if (!infiniteScrollState || infiniteScrollState.ended || isLoadingMore) return;
  const scrollY = window.scrollY + window.innerHeight;
  const docHeight = document.documentElement.scrollHeight;
@@ -294,60 +332,119 @@ function initDashboard() {
  await infiniteScrollState.loadMore();
  isLoadingMore = false;
  }
- });
+ }

+ window.addEventListener('scroll', checkInfiniteScroll);
+
+ // Load and display feeds with lazy-loading TLD sections
+ let tldObserver = null;
+
- // Load and display feeds
  async function loadFeeds(query = '') {
  const output = document.getElementById('output');
- output.innerHTML = '<div class="domain-list"></div><div id="infiniteLoader" style="text-align: center; padding: 10px; color: #666;">Loading...</div>';
+ output.innerHTML = '<div class="domain-list"></div><div id="infiniteLoader" style="text-align: center; padding: 10px; color: #666;">Loading TLDs...</div>';

- let offset = 0;
+ // Disconnect previous observer if any
- const limit = 100;
+ if (tldObserver) {
+ tldObserver.disconnect();
+ }
+
- async function loadMore() {
+ try {
- try {
+ // Fetch all TLDs first
- let url = `/api/domains?limit=${limit}&offset=${offset}&sort=alpha&has_feeds=true`;
+ const tldsResp = await fetch('/api/tlds?has_feeds=true');
- if (query) {
+ const tlds = await tldsResp.json();
- url += `&search=${encodeURIComponent(query)}`;
- }

- const resp = await fetch(url);
+ if (!tlds || tlds.length === 0) {
- const domains = await resp.json();
+ document.getElementById('infiniteLoader').textContent = 'No feeds found';
+ return;
+ }
+
- if (!domains || domains.length === 0) {
+ const container = output.querySelector('.domain-list');
- if (infiniteScrollState) infiniteScrollState.ended = true;
- document.getElementById('infiniteLoader').textContent = offset === 0 ? 'No feeds found' : 'End of list';
- return;
- }
-
- const container = output.querySelector('.domain-list');
+ // Render all TLD sections as collapsed placeholders
- domains.forEach(d => {
+ tlds.forEach(t => {
- container.insertAdjacentHTML('beforeend', renderDomainRow(d));
+ const tld = t.tld || 'unknown';
+ container.insertAdjacentHTML('beforeend', `
+ <div class="tld-section" data-tld="${escapeHtml(tld)}" data-loaded="false">
+ <div class="tld-header" style="display: flex; align-items: center; padding: 10px; background: #1a1a1a; border-bottom: 1px solid #333; cursor: pointer; user-select: none;">
+ <span class="tld-toggle" style="color: #666; margin-right: 10px;">▶</span>
+ <span style="color: #0af; font-weight: bold; font-size: 1.1em;">.${escapeHtml(tld)}</span>
+ <span style="color: #666; margin-left: 10px; font-size: 0.9em;">(${t.domain_count} domains)</span>
+ </div>
+ <div class="tld-content" style="display: none;">
+ <div class="tld-loading" style="padding: 10px; color: #666;">Loading...</div>
+ </div>
+ </div>
+ `);
+ });
+
+ document.getElementById('infiniteLoader').textContent = `${tlds.length} TLDs loaded`;
+
+ // Set up IntersectionObserver for lazy loading (loads even when collapsed)
+ tldObserver = new IntersectionObserver((entries) => {
+ entries.forEach(entry => {
+ if (entry.isIntersecting) {
+ const section = entry.target;
+ if (section.dataset.loaded === 'false') {
+ loadTLDDomains(section, query);
+ tldObserver.unobserve(section);
+ }
+ }
  });
- attachStatusHandlers(container);
+ }, { rootMargin: '500px' });
+
+ // Observe all TLD sections
+ container.querySelectorAll('.tld-section').forEach(section => {
+ tldObserver.observe(section);
+ });
+
+ } catch (err) {
+ document.getElementById('infiniteLoader').textContent = 'Error: ' + err.message;
+ }
+ }
+
+ // Load domains for a specific TLD section
+ async function loadTLDDomains(section, query = '') {
+ const tld = section.dataset.tld;
+ section.dataset.loaded = 'loading';
+
+ try {
+ let url = `/api/domains?has_feeds=true&tld=${encodeURIComponent(tld)}&limit=500`;
+ if (query) {
+ url += `&search=${encodeURIComponent(query)}`;
+ }
+
+ const resp = await fetch(url);
+ const domains = await resp.json();
+
+ const content = section.querySelector('.tld-content');
+ content.innerHTML = '';
+
+ if (!domains || domains.length === 0) {
+ content.innerHTML = '<div style="padding: 10px; color: #666;">No domains with feeds</div>';
+ } else {
+ domains.forEach(d => {
+ content.insertAdjacentHTML('beforeend', renderDomainRow(d));
+ });
+ // Add footer
+ content.insertAdjacentHTML('beforeend', renderTLDFooter(tld));
+ attachStatusHandlers(content);
+
  // Load items for all feeds
- container.querySelectorAll('.inline-feed-block').forEach(feedBlock => {
+ content.querySelectorAll('.inline-feed-block').forEach(feedBlock => {
  const itemsDiv = feedBlock.querySelector('.feed-items');
  if (itemsDiv && !itemsDiv.dataset.loaded) {
  itemsDiv.dataset.loaded = 'true';
  loadFeedItems(feedBlock.dataset.url, itemsDiv);
  }
  });

- offset += domains.length;
-
- if (domains.length < limit) {
- if (infiniteScrollState) infiniteScrollState.ended = true;
- document.getElementById('infiniteLoader').textContent = 'End of list';
- }
- } catch (err) {
- document.getElementById('infiniteLoader').textContent = 'Error: ' + err.message;
  }
- }
-
- await loadMore();
+ section.dataset.loaded = 'true';
- setupInfiniteScroll(loadMore);
+ } catch (err) {
+ const content = section.querySelector('.tld-content');
+ content.innerHTML = `<div style="padding: 10px; color: #f66;">Error: ${escapeHtml(err.message)}</div>`;
+ section.dataset.loaded = 'false';
+ }
  }

  // Search handler
@@ -357,7 +454,6 @@ function initDashboard() {
  clearTimeout(searchTimeout);
  searchTimeout = setTimeout(() => {
  searchQuery = searchInput.value.trim();
- clearInfiniteScroll();
  loadFeeds(searchQuery);
  }, 300);
  });
@@ -374,7 +470,6 @@ function initDashboard() {
  document.getElementById('holdDomains').textContent = commaFormat(stats.hold_domains);
  document.getElementById('passDomains').textContent = commaFormat(stats.pass_domains);
  document.getElementById('skipDomains').textContent = commaFormat(stats.skip_domains);
- document.getElementById('failDomains').textContent = commaFormat(stats.fail_domains);
  document.getElementById('crawlRate').textContent = commaFormat(stats.crawl_rate);
  document.getElementById('checkRate').textContent = commaFormat(stats.check_rate);
  document.getElementById('totalFeeds').textContent = commaFormat(stats.total_feeds);
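loadFeeds now renders one collapsed section per TLD from `/api/tlds?has_feeds=true` and lazily fills each section from `/api/domains?...&tld=...` via an IntersectionObserver. The server side of that endpoint is not part of this excerpt; the sketch below shows one plausible Go handler returning the `[{tld, domain_count}]` array the dashboard expects (the SQL, counting distinct source_host per TLD from the feeds table, the pgx v5 import path, DSN and port are all assumptions):

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"

	"github.com/jackc/pgx/v5/pgxpool"
)

// tldCount matches the JSON the dashboard reads: {"tld": "...", "domain_count": N}.
type tldCount struct {
	TLD         string `json:"tld"`
	DomainCount int    `json:"domain_count"`
}

// tldsHandler is a guess at what backs GET /api/tlds?has_feeds=true.
func tldsHandler(db *pgxpool.Pool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		rows, err := db.Query(r.Context(), `
			SELECT tld, COUNT(DISTINCT source_host) AS domain_count
			FROM feeds
			WHERE tld IS NOT NULL
			GROUP BY tld
			ORDER BY tld`)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer rows.Close()

		out := []tldCount{}
		for rows.Next() {
			var t tldCount
			if err := rows.Scan(&t.TLD, &t.DomainCount); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			out = append(out, t)
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(out)
	}
}

func main() {
	db, err := pgxpool.New(context.Background(), "postgres://localhost/feeds") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	http.HandleFunc("/api/tlds", tldsHandler(db))
	log.Fatal(http.ListenAndServe(":8080", nil)) // placeholder port
}
```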
(+5, -8)
@@ -444,8 +444,9 @@ const dashboardHTML = `<!DOCTYPE html>
  <head>
  <title>1440.news Feed Crawler</title>
  <meta charset="utf-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="stylesheet" href="/static/dashboard.css">
- <script src="/static/dashboard.js?v=37"></script>
+ <script src="/static/dashboard.js?v=99"></script>
  </head>
  <body>
  <h1>1440.news Feed Crawler</h1>
@@ -468,10 +469,6 @@ const dashboardHTML = `<!DOCTYPE html>
  <div class="stat-value" id="skipDomains" style="color: #f66;">{{comma .SkipDomains}}</div>
  <div class="stat-label">Skip</div>
  </div>
- <div class="card">
- <div class="stat-value" id="failDomains" style="color: #f00;">{{comma .FailDomains}}</div>
- <div class="stat-label">Fail</div>
- </div>
  <div class="card">
  <div class="stat-value" id="crawlRate">{{comma .CrawlRate}}</div>
  <div class="stat-label">crawls/min</div>
@@ -502,16 +499,16 @@ const dashboardHTML = `<!DOCTYPE html>
  </div>
  </div>

- <div class="card" id="inputCard">
+ <div class="card" id="inputCard" style="position: sticky; top: 0; z-index: 100; background: #111;">
  <input type="text" id="searchInput" placeholder="Search feeds..."
- style="width: 100%; padding: 12px; background: #0a0a0a; border: 1px solid #333; border-radius: 4px; color: #fff; font-family: monospace;">
+ style="width: 100%; padding: 12px; background: #0a0a0a; border: 1px solid #333; border-radius: 4px; color: #fff;">
  </div>

  <div class="card" id="outputCard">
  <div id="output"></div>
  </div>

- <div style="color: #333; font-size: 11px; margin-top: 10px;">v59</div>
+ <div style="color: #333; font-size: 11px; margin-top: 10px;">v100</div>

  <div class="updated" id="updatedAt">Last updated: {{.UpdatedAt.Format "2006-01-02 15:04:05"}}</div>
  </body>