Scrape

Scrape a single URL and return content in one or more formats: markdown, html, rawHtml, plainText, links, imageLinks, or json (LLM extraction).

Render Flow

Request
│
├── renderMode = "http" ──→ HTTP Fetcher ──────────→ Extractor
│
└── renderMode = "auto" ──→ HTTP Fetcher
                                  │
                         ┌────────┴────────┐
                         │ Success +       │
                         │ no triggers?    │
                         └────────┬────────┘
                                  │
                   ┌──────────────┼──────────────┐
                   ↓              ↓              ↓
               Return it   SPA / anti-bot Thin content /
                   │        / soft-block    Cloudflare?
                   │              ↓              ↓
                   │         Browser CDP    Browser CDP
                   ↓              ↓              ↓
                   └────────→ Extractor ←────────┘
                                  │
                            ┌─────┴─────┐
                            ↓           ↓
                        Markdown     HTML /
                        (primary)  Plain Text

HTTP Fetcher

The HTTP fetcher (internal/core/http.go) makes a plain HTTP GET request with:

  • Stealth headers — Real browser User-Agent from a pool, injected via HeaderProfile
  • Custom headers — Caller-provided headers merged on top
  • TLS 1.2/1.3 — Minimum TLS version enforced
  • Retries — Up to HTTPMaxRetries for retriable errors (502, 503, 504, timeouts, connection resets)
  • Response size limit — Rejects responses over MaxResponseBytes

Example call:

result, err := httpFetcher.Fetch(url, headers, waitForMs)

Response

type FetchResult struct {
    URL          string  // Requested URL
    FinalURL     string  // After redirects
    StatusCode   uint16  // HTTP status
    HTML         string  // Response body
    ContentType  string  // Content-Type header
    RenderedWith string  // "http" or "pdf"
    Warning      *string // Non-fatal warning
}

Browser (CDP) Fetcher

When renderMode = "browser" or auto mode escalates, the CDP fetcher (internal/core/renderer.go) drives a headless browser over the Chrome DevTools Protocol (CDP).

Browser Startup

RendererBootstrap
│
├── LightPanda ──→ WebSocket CDP endpoint
│                  (auto-launched if no ws_url configured)
│
└── Chrome / Cloak ──→ User-provided WS URL

CDP Action Sequence

For each browser request, chromedp runs this action chain:

1. enableNetworkTracking
   → Enables Network CDP domain
   → Captures document response status code (normally hidden from chromedp)
   → Tracks in-flight requests for SPA network-idle fast exit

2. fetchBlockAction
   → Blocks analytics, ads, and tracking URLs
   → Logs blocked URLs for reporting

3. stealthInjectionAction (if stealth.enabled)
   → Patches navigator.webdriver → false
   → Adds window.chrome object
   → Patches navigator.plugins with fake Chrome plugins
   → Patches navigator.languages → ["en-US", "en"]
   → Patches WebGLRenderingContext.getParameter → Intel GPU
   → Patches Function.toString for native functions
   → Runs BEFORE page's own scripts via Page.addScriptToEvaluateOnNewDocument

4. navigateIgnoringHTTPStatus
   → Raw CDP page.Navigate
   → Ignores errorText (chromedp normally aborts on non-2xx)
   → Waits for Page.loadEventFired

5. dismissCookieBannersFastAction (only when waitMs == 0)
   → Fast cookie banner dismissal via JS

6a. SPA Readiness Poll (when waitMs == 0)
   → Polls every 200ms for:
     - CSS selector match (main, article, [role=main], #content, #root > *, #app > *)
     - Body text ≥ 800 chars
     - Optional JS predicate
     - Network idle for 500ms (fast exit for static pages)
   → Timeout: 15s budget

   OR

6b. Fixed Wait (when waitMs > 0)
   → Single time.Sleep(waitMs)

7. AutoScrollAction
   → 30 steps of 90% viewport scroll, 200ms pause
   → Lazy-loads images and triggers XHR/fetch responses

8. HTML Extraction
   → Single Evaluate: returns [headHTML, bodyHTML, window.location.href]

CloakBrowser

When browser = "cloak" is set in the config, a fresh Chrome instance is created for each request:

Each request → discoverCloakBrowserWSURL() → new Chrome instance → CDP

This prevents cookie/state sharing between requests and is useful for anti-bot avoidance.

Per-Host Concurrency Limiting

The hostPool semaphore prevents overwhelming any single origin:

Global pool (maxConcurrency)
│
└── Per-host slot (10 per host)
    │
    └── Browser tab (chromedp context)

Auto Mode — Escalation Triggers

In auto mode, QuickCrawl tries HTTP first, then checks for these triggers:

Trigger             Condition                                                               Action
------------------  ----------------------------------------------------------------------  -------------------
SPA shell           HTML contains framework markers (React, Vue, Angular, Next.js, Gatsby)  Escalate to browser
Soft-block status   HTTP 401, 403, 404, 405, 406, 410, 412, 429, 450, 500, 502, 503         Escalate to browser
Thin content        Body text < 200 chars on 2xx                                            Escalate to browser
Anti-bot challenge  Cloudflare, CAPTCHA, generic bot wall in HTML                           Escalate to browser

If the HTTP fetch fails entirely and a browser is available, auto mode also escalates.

SPA Readiness Poll

The SPA poll (internal/core/spa.go) is the core of auto mode's intelligence:

Default selectors: main, article, [role=main], #content, #root > *, #app > *
Default min text: 800 chars
Poll interval: 200ms
Timeout: 15s

Exit conditions:

  • Ready: Selector matched + body text ≥ 800 + (predicate truthy if set)
  • Timeout: Budget exhausted — returns last observed state
  • Lenient exit: Body text ≥ 800 chars even without selector match (handles pages like Hacker News)

Network-idle fast exit: If there are no in-flight network requests for 500ms, the poll exits immediately. This is the dominant path for static/SSR pages.

Extractor

After fetching, the Extractor (internal/extractor/) transforms raw HTML:

Raw HTML
│
├── ExtractMetadata → title, description, OG tags, canonical, language
│
├── preprocessHTML
│   ├── Strip <head>
│   ├── cleanNoise(): removes scripts, styles, nav, footer
│   └── Apply IncludeTags / ExcludeTags / CSSSelector filters
│
├── postprocessHTML
│   ├── Sanitize
│   ├── Deduplicate
│   └── Normalize whitespace
│
├── HTMLToMarkdown ──→ Clean, LLM-ready markdown
├── HTMLToPlainText ──→ Plain text without markup
└── ExtractLinks ──→ All href URLs + image src URLs

LLM Extraction

When formats: ["json"] is set and [extraction.llm] is configured:

Markdown + JSON Schema
│
├── Send to OpenAI (chat/completions)
│   with extraction_prompt system message
│
└── Validate against schema
    │
    └── data.json populated in response

Requires EXTRACTION__LLM__API_KEY and optionally EXTRACTION__LLM__MODEL.