Scrape API

POST /v1/scrape — Scrape a single URL with one or more output formats.

The canonical endpoint for fetching and extracting content from one page. Supports HTTP and browser (JS) rendering, content filters, and LLM-based structured extraction.

Request Body

| Field | Type | Required | Description |
|---|---|---|---|
| url | string | yes | Absolute http:// or https:// URL to scrape |
| formats | string[] | no | Output formats: markdown, html, rawHtml, plainText, links, imageLinks, json. Default: ["markdown"] |
| renderMode | string | no | "auto", "browser", "http". Inherits server default |
| waitFor | int | no | Milliseconds to wait after navigation (0-120000). Default: 0 |
| headers | object | no | Custom HTTP headers for the fetch |
| includeTags | string[] | no | CSS selectors to keep (e.g. ["article", "h1"]) |
| excludeTags | string[] | no | CSS selectors to drop (e.g. ["nav", ".ad"]) |
| cssSelector | string | no | Extract content matching this CSS selector only |
| jsonSchema | object | no | JSON Schema for LLM extraction (when formats: ["json"]) |
| extract | object | no | LLM extraction overrides: { schema, prompt, responseFormat } |
| llmExtractionPrompt | string | no | Per-request LLM system prompt override |
| llmResponseFormat | string | no | Per-request LLM response_format name override |
| ttl | int | no | Cache TTL in seconds. 0 = bypass cache |
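
The constraints in the field table can be checked on the client before a request is sent. The following is a minimal sketch, assuming a hypothetical helper named `build_scrape_body`; the allowed format names, render modes, and the `waitFor` range come from the table, while the helper itself and its parameter names are illustrative only. The server still performs its own validation.

```python
from urllib.parse import urlparse

# Values copied from the request body table.
ALLOWED_FORMATS = {"markdown", "html", "rawHtml", "plainText", "links", "imageLinks", "json"}
ALLOWED_RENDER_MODES = {"auto", "browser", "http"}
MAX_WAIT_FOR_MS = 120000


def build_scrape_body(url, formats=None, render_mode=None, wait_for=None, ttl=None):
    """Build a /v1/scrape request body (hypothetical client-side helper).

    Optional fields are omitted when not given, so server defaults apply.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must be an absolute http:// or https:// URL")

    body = {"url": url}
    if formats is not None:
        unknown = sorted(set(formats) - ALLOWED_FORMATS)
        if unknown:
            raise ValueError(f"unknown formats: {unknown}")
        body["formats"] = list(formats)
    if render_mode is not None:
        if render_mode not in ALLOWED_RENDER_MODES:
            raise ValueError(f"unknown renderMode: {render_mode!r}")
        body["renderMode"] = render_mode
    if wait_for is not None:
        if not 0 <= wait_for <= MAX_WAIT_FOR_MS:
            raise ValueError("waitFor must be between 0 and 120000 milliseconds")
        body["waitFor"] = wait_for
    if ttl is not None:
        body["ttl"] = ttl  # 0 bypasses the cache, per the table
    return body
```

Calling `build_scrape_body("https://example.com")` yields a body containing only `url`, which matches the minimal request.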

Minimal Request

{
  "url": "https://example.com"
}

Full Request with Filters

{
  "url": "https://example.com/article",
  "formats": ["markdown", "html", "links"],
  "renderMode": "browser",
  "waitFor": 2000,
  "headers": { "Cookie": "session=abc" },
  "includeTags": ["article", "h1", "h2", "p"],
  "excludeTags": ["nav", "footer", ".advertisement"]
}
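
A request like the ones above could be sent with Python's standard library as sketched here. The base URL `http://localhost:8080`, the `Content-Type: application/json` header, the absence of authentication, and the function names are all assumptions for illustration; this page does not specify them, so adjust to your deployment.

```python
import json
import urllib.request

# Assumption: the service listens here; replace with your own host and port.
BASE_URL = "http://localhost:8080"


def make_scrape_request(body, base_url=BASE_URL):
    """Prepare (but do not send) a POST /v1/scrape request with a JSON body."""
    return urllib.request.Request(
        f"{base_url}/v1/scrape",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not documented here
        method="POST",
    )


def scrape(body, base_url=BASE_URL, timeout=150):
    """Send the request and return the decoded JSON response.

    urlopen raises urllib.error.HTTPError for 4xx/5xx statuses, so only
    2xx responses reach the json.loads call below.
    """
    request = make_scrape_request(body, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The 150-second timeout is a design choice rather than a documented value: `waitFor` can be as large as 120000 ms, so a shorter client timeout could expire before a slow browser render finishes.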

Success Response

{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use...",
    "html": "<h1>Example Domain</h1><p>...</p>",
    "plainText": "Example Domain...",
    "links": ["https://www.iana.org/domains/example"],
    "imageLinks": [],
    "metadata": {
      "title": "Example Domain",
      "description": null,
      "sourceURL": "https://example.com",
      "language": "en",
      "statusCode": 200,
      "renderedMode": "http",
      "timeTaken": 281
    }
  }
}
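
One way to consume this response shape is to separate the requested outputs from the `metadata` object. The sketch below assumes the decoded JSON is already available as a Python dict; `ScrapeMetadata` and `parse_scrape_response` are hypothetical names, and only the fields shown in the sample response are read.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class ScrapeMetadata:
    """Subset of the metadata fields shown in the sample response."""
    title: Optional[str]
    source_url: Optional[str]
    status_code: Optional[int]
    rendered_mode: Optional[str]
    time_taken: Optional[int]


def parse_scrape_response(payload: Dict[str, Any]) -> Tuple[Dict[str, Any], ScrapeMetadata]:
    """Split a successful response into (outputs, metadata).

    Raises ValueError when "success" is not true, since "data" may be absent.
    """
    if payload.get("success") is not True:
        raise ValueError("scrape did not succeed")
    data = payload.get("data", {})
    meta = data.get("metadata", {})
    # Everything under "data" except "metadata" is an output format.
    outputs = {key: value for key, value in data.items() if key != "metadata"}
    metadata = ScrapeMetadata(
        title=meta.get("title"),
        source_url=meta.get("sourceURL"),
        status_code=meta.get("statusCode"),
        rendered_mode=meta.get("renderedMode"),
        time_taken=meta.get("timeTaken"),
    )
    return outputs, metadata
```

Using `.get` throughout keeps the parser tolerant of formats that were not requested and of metadata fields that are null, such as `description` in the sample.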

LLM Extraction Response

When formats includes "json" and an LLM is configured, the extracted object is returned under data.json:

{
  "success": true,
  "data": {
    "markdown": "...",
    "json": {
      "title": "Example Domain",
      "purpose": "documentation example"
    },
    "metadata": { "...": "..." }
  }
}
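
A request that would produce a response of this shape might pair `formats: ["json"]` with a `jsonSchema`. The sketch below is illustrative: the schema is a made-up example matching the two fields in the sample output, and `build_extraction_body` and `read_extraction` are hypothetical helper names. How the server behaves when no LLM is configured is not described here, so the reader simply returns `None` when `data.json` is missing.

```python
# Hypothetical schema describing the two fields in the sample output.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "purpose": {"type": "string"},
    },
    "required": ["title"],
}


def build_extraction_body(url, schema):
    """Build a request body asking for LLM extraction against a JSON Schema."""
    return {"url": url, "formats": ["json"], "jsonSchema": schema}


def read_extraction(payload):
    """Return the extracted object from data.json, or None when it is absent."""
    if payload.get("success") is not True:
        return None
    return payload.get("data", {}).get("json")
```

Add `"markdown"` to `formats` alongside `"json"` if the page text is also needed.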

Error Responses

| Status | Code | Cause |
|---|---|---|
| 400 | invalid_request | Missing url, non-http(s) scheme, malformed JSON |
| 400 | forbidden | Blocked by robots.txt |
| 500 | internal_error | Scraper not initialized |
| 200 with success: false | http | Target returned HTTP 4xx/5xx |
| 200 with success: false | renderer_error | Browser requested but no Chrome WS URL configured |
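
Because some failures arrive with HTTP 200, a client cannot rely on the status code alone and also needs to check the `success` flag. The following is a minimal sketch of that two-step check; `classify_result` and its return labels are hypothetical. It deliberately does not read the error code string, since this section does not show where the code appears in the response body.

```python
def classify_result(http_status, body):
    """Classify a /v1/scrape outcome from the HTTP status and the success flag.

    Returns one of: "ok", "scrape_failed", "client_error", "server_error",
    "unexpected". The labels are illustrative, not part of the API.
    """
    if http_status == 200:
        # Covers the "200 with success: false" rows (http, renderer_error).
        return "ok" if body.get("success") is True else "scrape_failed"
    if 400 <= http_status < 500:
        # Covers invalid_request and forbidden.
        return "client_error"
    if http_status >= 500:
        # Covers internal_error.
        return "server_error"
    return "unexpected"
```

A caller might retry on `"server_error"`, fix the request on `"client_error"`, and inspect the response body on `"scrape_failed"`.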