Crawl

Start an asynchronous breadth-first search (BFS) crawl of a website. The job runs in the background: POST /v1/crawl returns immediately with a job ID, and you poll GET /v1/crawl/:id for progress.

Flow

POST /v1/crawl { url, maxDepth, maxPages, formats }
    │
    ├── Validate URL
    ├── Fetch robots.txt  (if respect_robots_txt = true)
    ├── Create goroutine for crawling
    └── Return job ID immediately
            │
            └── BFS Crawler (background)
                    │
                    ├── Depth 0: Seed URL
                    │       ├── Fetch (same render path as /v1/scrape)
                    │       ├── Extract content
                    │       └── Extract links
                    │
                    ├── Depth 1: Discovered URLs (same-origin, not disallowed)
                    │       └── ...
                    │
                    └── Until maxDepth or maxPages reached
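
From the client's side, the flow above amounts to one POST followed by repeated GETs. The following is a minimal sketch in Go, not the official client: the response fields (`id`, `status`), the helper names `startCrawl` and `waitForCrawl`, and the in-process fake server `newFakeAPI` are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// crawlRequest mirrors the request body shown in the flow diagram.
type crawlRequest struct {
	URL      string   `json:"url"`
	MaxDepth int      `json:"maxDepth"`
	MaxPages int      `json:"maxPages"`
	Formats  []string `json:"formats"`
}

// crawlJob is an assumed response shape; the real field names may differ.
type crawlJob struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}

// startCrawl submits the job and returns as soon as the server replies.
func startCrawl(baseURL string, req crawlRequest) (crawlJob, error) {
	var job crawlJob
	body, err := json.Marshal(req)
	if err != nil {
		return job, err
	}
	resp, err := http.Post(baseURL+"/v1/crawl", "application/json", bytes.NewReader(body))
	if err != nil {
		return job, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&job)
	return job, err
}

// waitForCrawl polls GET /v1/crawl/:id until the job completes or fails.
func waitForCrawl(baseURL, id string, interval time.Duration, maxPolls int) (crawlJob, error) {
	var job crawlJob
	for i := 0; i < maxPolls; i++ {
		resp, err := http.Get(baseURL + "/v1/crawl/" + id)
		if err != nil {
			return job, err
		}
		err = json.NewDecoder(resp.Body).Decode(&job)
		resp.Body.Close()
		if err != nil {
			return job, err
		}
		if job.Status == "completed" || job.Status == "failed" {
			return job, nil
		}
		time.Sleep(interval)
	}
	return job, fmt.Errorf("crawl %s still %q after %d polls", id, job.Status, maxPolls)
}

// newFakeAPI is a stand-in server for this sketch: the first poll reports
// "scraping", every later poll reports "completed".
func newFakeAPI() *httptest.Server {
	var polls int32
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/crawl", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(crawlJob{ID: "job-1", Status: "pending"})
	})
	mux.HandleFunc("/v1/crawl/", func(w http.ResponseWriter, r *http.Request) {
		status := "scraping"
		if atomic.AddInt32(&polls, 1) > 1 {
			status = "completed"
		}
		json.NewEncoder(w).Encode(crawlJob{ID: "job-1", Status: status})
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeAPI()
	defer srv.Close()
	req := crawlRequest{URL: "https://example.com", MaxDepth: 2, MaxPages: 50, Formats: []string{"markdown"}}
	job, err := startCrawl(srv.URL, req)
	if err != nil {
		panic(err)
	}
	job, err = waitForCrawl(srv.URL, job.ID, 10*time.Millisecond, 10)
	fmt.Println(job.ID, job.Status, err) // job-1 completed <nil>
}
```

Against a real deployment you would replace `srv.URL` with the API base URL and pick a polling interval that suits the expected crawl size.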

BFS Algorithm

Queue = [seedURL at depth 0]

while queue not empty and results < maxPages:
    Extract all items at current depth as "frontier"

    Process frontier concurrently (up to maxConcurrency):
        ├── Check robots.txt ──→ skip if disallowed
        ├── Rate limit sleep (per-domain)
        ├── Fetch via shared *core.Scraper
        │     (same render path: HTTP or browser auto-escalation)
        ├── Extract content (markdown, html, links, etc.)
        └── Send result via StateCh

    Wait for all goroutines to complete

    For each discovered link (depth + 1):
        ├── Same origin? ──→ No: skip
        ├── Already visited? ──→ Yes: skip
        ├── Safe URL? (no javascript:, data:) ──→ No: skip
        ├── robots.txt allowed? ──→ No: skip
        └── Add to queue
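
The loop above can be sketched in Go as a level-by-level traversal. This is a simplified, single-threaded illustration: `crawlBFS` is a hypothetical name, the `links` map stands in for fetching a page and extracting its links, and the robots.txt, rate-limiting, and concurrency steps are left out.

```go
package main

import "fmt"

// crawlBFS visits pages level by level, starting from seed, and stops when
// maxDepth or maxPages is reached. The links map is a stand-in for fetching a
// page and extracting its same-origin links (an assumption for this sketch).
func crawlBFS(seed string, links map[string][]string, maxDepth, maxPages int) []string {
	visited := map[string]bool{seed: true}
	frontier := []string{seed}
	var results []string

	for depth := 0; depth <= maxDepth && len(frontier) > 0; depth++ {
		var next []string // URLs discovered at depth+1
		for _, u := range frontier {
			if len(results) >= maxPages {
				return results
			}
			results = append(results, u) // "fetch" the page and record it
			for _, l := range links[u] {
				if !visited[l] { // global deduplication
					visited[l] = true
					next = append(next, l)
				}
			}
		}
		frontier = next
	}
	return results
}

func main() {
	links := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b", "/c"},
		"/b": {"/"},
	}
	fmt.Println(crawlBFS("/", links, 1, 10)) // [/ /a /b]
}
```

With `maxDepth = 1`, `/c` is discovered at depth 2 and therefore never fetched.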

Key Features

robots.txt

robots.txt is fetched once per origin, at crawl start:

robots := FetchRobotsTxt(origin, userAgent)
// Stored and reused for all URLs in this crawl

IsAllowed(path) checks both User-agent: * and the specific user-agent configured.
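
As a rough illustration of that check, the sketch below collects Disallow rules from the `User-agent: *` group and from the group naming the configured agent, then rejects any path that starts with one of them. The `robotsRules` type and `parseRobots` function are hypothetical, and the matching is deliberately simplified: Allow rules, wildcards, and longest-match precedence are not handled.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// robotsRules holds the Disallow prefixes that apply to one crawler.
type robotsRules struct {
	disallow []string
}

// parseRobots collects Disallow rules from groups addressed to "*" or to
// userAgent. Allow rules and wildcards are ignored in this sketch.
func parseRobots(body, userAgent string) robotsRules {
	var rules robotsRules
	applies := false    // does the current group apply to us?
	inAgentRun := false // true while reading consecutive User-agent lines
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		key, value, ok := strings.Cut(line, ":")
		if !ok || strings.HasPrefix(line, "#") {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)
		switch key {
		case "user-agent":
			if !inAgentRun {
				applies = false // a new group starts here
			}
			inAgentRun = true
			if value == "*" || strings.EqualFold(value, userAgent) {
				applies = true
			}
		case "disallow":
			inAgentRun = false
			if applies && value != "" {
				rules.disallow = append(rules.disallow, value)
			}
		default:
			inAgentRun = false
		}
	}
	return rules
}

// IsAllowed reports whether path matches none of the Disallow prefixes.
func (r robotsRules) IsAllowed(path string) bool {
	for _, prefix := range r.disallow {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	body := "User-agent: *\nDisallow: /admin\n\nUser-agent: mybot\nDisallow: /private\n"
	rules := parseRobots(body, "mybot")
	fmt.Println(rules.IsAllowed("/blog"), rules.IsAllowed("/admin/x"), rules.IsAllowed("/private"))
	// true false false
}
```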

Per-Domain Rate Limiting

limiter := newDomainRateLimiter(host, requestsPerSecond)
limiter.Sleep()  // blocks until it's time to make the next request

The limiter can also add random jitter (jitterFactor) to each sleep so that requests do not arrive at perfectly uniform intervals.
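
One way such a limiter could be built is sketched below. The struct fields, the extra `jitterFactor` constructor argument, the `reserve` helper, and the jitter formula (up to `jitterFactor` times the base interval of extra delay) are assumptions for illustration, not the crawler's actual implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// domainRateLimiter spaces out requests to a single host.
type domainRateLimiter struct {
	host         string
	mu           sync.Mutex
	interval     time.Duration // base gap between requests: 1s / requestsPerSecond
	jitterFactor float64       // e.g. 0.2 adds up to 20% extra delay
	next         time.Time     // earliest time the next request may start
}

func newDomainRateLimiter(host string, requestsPerSecond, jitterFactor float64) *domainRateLimiter {
	return &domainRateLimiter{
		host:         host,
		interval:     time.Duration(float64(time.Second) / requestsPerSecond),
		jitterFactor: jitterFactor,
	}
}

// reserve returns how long the caller must wait and books the following slot.
func (l *domainRateLimiter) reserve(now time.Time) time.Duration {
	l.mu.Lock()
	defer l.mu.Unlock()
	wait := time.Duration(0)
	if l.next.After(now) {
		wait = l.next.Sub(now)
	}
	jitter := time.Duration(rand.Float64() * l.jitterFactor * float64(l.interval))
	l.next = now.Add(wait + l.interval + jitter)
	return wait
}

// Sleep blocks until it is time to make the next request.
func (l *domainRateLimiter) Sleep() {
	time.Sleep(l.reserve(time.Now()))
}

func main() {
	limiter := newDomainRateLimiter("example.com", 20, 0.2) // about 20 requests/second
	start := time.Now()
	for i := 0; i < 3; i++ {
		limiter.Sleep()
	}
	fmt.Println("3 requests issued in", time.Since(start).Round(10*time.Millisecond))
}
```

Separating `reserve` from `Sleep` keeps the slot arithmetic testable without real waiting.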

Concurrency Control

Two-level semaphore:

Global: Max maxConcurrency concurrent fetches across all hosts
Per-host: Max 10 concurrent fetches per individual host

sem := make(chan struct{}, maxConcurrency) // global
hostSem[host] = make(chan struct{}, 10)    // per-host (hostSem is a map keyed by host)
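
The sketch below shows how the two levels might be combined using buffered channels as counting semaphores. The `limiter` type, its `acquire` method, and the `runAll` helper are hypothetical; the acquisition order (per-host slot first, then global slot) is a design choice of this sketch, not something the crawler is documented to do.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// limiter combines a global semaphore with lazily created per-host semaphores.
type limiter struct {
	global  chan struct{}
	perHost int
	mu      sync.Mutex
	hostSem map[string]chan struct{}
}

func newLimiter(maxConcurrency, perHost int) *limiter {
	return &limiter{
		global:  make(chan struct{}, maxConcurrency),
		perHost: perHost,
		hostSem: make(map[string]chan struct{}),
	}
}

// acquire blocks until both a per-host and a global slot are free,
// then returns a function that releases them.
func (l *limiter) acquire(host string) (release func()) {
	l.mu.Lock()
	hs, ok := l.hostSem[host]
	if !ok {
		hs = make(chan struct{}, l.perHost)
		l.hostSem[host] = hs
	}
	l.mu.Unlock()

	hs <- struct{}{}       // per-host slot first, so waiting on a busy host holds no global slot
	l.global <- struct{}{} // then a global slot
	return func() {
		<-l.global
		<-hs
	}
}

// runAll runs n fake fetches against host and returns the peak concurrency observed.
func runAll(l *limiter, host string, n int) int32 {
	var wg sync.WaitGroup
	var inFlight, peak int32
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			release := l.acquire(host)
			defer release()
			cur := atomic.AddInt32(&inFlight, 1)
			for { // record the highest in-flight count seen so far
				old := atomic.LoadInt32(&peak)
				if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
					break
				}
			}
			atomic.AddInt32(&inFlight, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	l := newLimiter(3, 10)
	fmt.Println("peak concurrency:", runAll(l, "example.com", 50)) // never more than 3
}
```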

Link Filtering

Only same-origin links are followed. Before adding a discovered URL to the queue:

Parse URL and extract origin
Check same origin as seed
Check not already visited (global deduplication map)
Check safe protocol (http/https only)
Check robots.txt allowance
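
The checks above can be sketched as a single function using Go's standard `net/url` package. `shouldEnqueue` and `mustSeed` are hypothetical names, `isAllowed` stands in for the robots.txt check, and stripping the fragment before deduplication is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// mustSeed parses the seed URL and panics on error (acceptable for a sketch).
func mustSeed(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

// shouldEnqueue applies the five checks in order and marks accepted URLs as visited.
func shouldEnqueue(raw string, seed *url.URL, visited map[string]bool, isAllowed func(path string) bool) bool {
	u, err := seed.Parse(raw) // 1. parse, resolving relative links against the seed
	if err != nil {
		return false
	}
	if u.Scheme != seed.Scheme || u.Host != seed.Host { // 2. same origin as seed
		return false
	}
	u.Fragment = "" // /page#a and /page#b are the same page
	key := u.String()
	if visited[key] { // 3. global deduplication
		return false
	}
	if u.Scheme != "http" && u.Scheme != "https" { // 4. safe protocol
		return false
	}
	if !isAllowed(u.Path) { // 5. robots.txt allowance
		return false
	}
	visited[key] = true
	return true
}

func main() {
	seed := mustSeed("https://example.com/")
	visited := map[string]bool{seed.String(): true}
	allowAll := func(string) bool { return true }
	for _, link := range []string{"/about", "/about#team", "https://other.com/", "javascript:void(0)"} {
		fmt.Println(link, shouldEnqueue(link, seed, visited, allowAll))
	}
}
```

Only `/about` is accepted here: the second link deduplicates to the same page, the third is a different origin, and the fourth is not an http(s) URL.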

State Reporting

Crawl state is streamed via a channel to the HTTP handler:

type CrawlState struct {
    ID        string       // Job ID
    Status    CrawlStatus  // "pending" | "scraping" | "completed" | "failed"
    Total     uint32       // URLs discovered so far
    Completed uint32       // Pages successfully scraped
    Data      []ScrapeData // Completed page results
    Error     *string      // Error message if failed
}
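
To illustrate the streaming pattern, the sketch below has a stand-in producer push a `CrawlState` snapshot after each page while a consumer keeps the most recent one, roughly what a handler would need in order to answer a poll. The `crawl` and `latestState` functions, the status constants, and the single-field `ScrapeData` are assumptions for this sketch.

```go
package main

import "fmt"

type CrawlStatus string

const (
	StatusScraping  CrawlStatus = "scraping"
	StatusCompleted CrawlStatus = "completed"
)

// ScrapeData is reduced to one field here; the real type is not shown in this document.
type ScrapeData struct{ URL string }

type CrawlState struct {
	ID        string
	Status    CrawlStatus
	Total     uint32
	Completed uint32
	Data      []ScrapeData
	Error     *string
}

// crawl is a stand-in producer: it emits one snapshot per page, a final
// "completed" snapshot, and then closes the channel.
func crawl(id string, urls []string, stateCh chan<- CrawlState) {
	defer close(stateCh)
	state := CrawlState{ID: id, Status: StatusScraping, Total: uint32(len(urls))}
	for _, u := range urls {
		state.Data = append(state.Data, ScrapeData{URL: u})
		state.Completed++
		stateCh <- state
	}
	state.Status = StatusCompleted
	stateCh <- state
}

// latestState drains the channel and returns the last snapshot received.
func latestState(stateCh <-chan CrawlState) CrawlState {
	var last CrawlState
	for s := range stateCh {
		last = s
	}
	return last
}

func main() {
	stateCh := make(chan CrawlState)
	go crawl("job-1", []string{"/", "/a"}, stateCh)
	final := latestState(stateCh)
	fmt.Println(final.Status, final.Completed, final.Total) // completed 2 2
}
```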

Per-Page Render Mode

Each page in the crawl uses the same render path as /v1/scrape:

If renderMode = "http" → HTTP fetcher only
If renderMode = "browser" → Always browser (CDP)
If renderMode = "auto" (default) → HTTP first, escalate on SPA/anti-bot triggers

Every page goes through the shared *core.Scraper, the same single render path used by all endpoints.
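
The mode selection can be pictured as a small dispatch function. In the sketch below, `fetchPage`, `fetchFunc`, and the `errNeedsBrowser` sentinel are hypothetical; in particular, signalling "escalate to browser" through an error value is an assumption, since the actual SPA and anti-bot triggers are not described here.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchFunc fetches a URL and returns the page body.
type fetchFunc func(url string) (string, error)

// errNeedsBrowser is an assumed signal that the HTTP result looks like an
// SPA shell or an anti-bot page and should be retried in a browser.
var errNeedsBrowser = errors.New("page needs browser rendering")

// fetchPage picks a fetcher based on renderMode; "auto" is the default.
func fetchPage(renderMode, url string, httpFetch, browserFetch fetchFunc) (string, error) {
	switch renderMode {
	case "http":
		return httpFetch(url)
	case "browser":
		return browserFetch(url)
	case "", "auto":
		body, err := httpFetch(url)
		if errors.Is(err, errNeedsBrowser) {
			return browserFetch(url) // escalate
		}
		return body, err
	default:
		return "", fmt.Errorf("unknown renderMode %q", renderMode)
	}
}

func main() {
	httpFetch := func(url string) (string, error) {
		if url == "https://example.com/app" {
			return "", errNeedsBrowser // pretend an SPA shell was detected
		}
		return "http:" + url, nil
	}
	browserFetch := func(url string) (string, error) { return "browser:" + url, nil }

	body, _ := fetchPage("auto", "https://example.com/app", httpFetch, browserFetch)
	fmt.Println(body) // browser:https://example.com/app
}
```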
