Web Extract API

Base: /web-extract/v1 · 5 endpoints

POST /web-extract/v1/scrape · 1 credit

Fetch one URL and return its content in the formats you choose (markdown, plain text, cleaned HTML, raw HTML, metadata, links, images, or structured data). Thin single-page-app pages are automatically upgraded to a full browser render when needed. Works on any public page: news articles, product pages, documentation, blogs.

| Parameter | Allowed / range | Description |
| --- | --- | --- |
| url (required) | | The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private, internal, and metadata targets are SSRF-blocked. |
| formats (optional, default markdown,metadata) | markdown · text · html · rawHtml · metadata · links · images · jsonld | Which outputs to return (array or comma-separated string). html is the cleaned main content. Unknown values are ignored. For screenshots, use the web-capture engine. |
| render (optional, default auto) | auto · never · force | When to use browser rendering for JavaScript-heavy pages. auto detects thin or client-rendered pages and re-fetches them with a real browser; rich server-rendered pages are served directly, without a browser. force always uses a browser; never skips rendering for maximum speed. |
| only_main_content (optional, default true) | | Remove navigation bars, headers, footers, and boilerplate, keeping only the main article or content body. Set to false to get the full page content. |
| include_tags (optional) | | CSS selectors to include. Only elements matching these selectors are kept in the extracted content, e.g. ['article', 'main', '.post-body']. |
| exclude_tags (optional) | | CSS selectors to remove before extraction. Useful for stripping ads, cookie banners, or sidebar widgets, e.g. ['nav', 'footer', '.ads']. |
| target_selector (optional) | | Return only the content inside this CSS selector, e.g. 'article.post' to extract one section of the page and ignore everything else. |
| timeout (optional, default 25) | 3–60 | Per-request timeout in seconds. |
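
The parameters above could be assembled into a request roughly as follows. This is a minimal sketch, not the official client: it assumes the endpoint accepts a JSON body, and the host `api.example.com`, the bearer-token header, and the helper names `build_scrape_body` and `post_json` are all hypothetical. Only the parameter names, allowed values, and ranges come from the table.

```python
import json
import urllib.request

# Assumption: the real host and auth scheme are not given in this reference.
BASE_URL = "https://api.example.com/web-extract/v1"  # hypothetical host

ALLOWED_FORMATS = {"markdown", "text", "html", "rawHtml",
                   "metadata", "links", "images", "jsonld"}
ALLOWED_RENDER = {"auto", "never", "force"}


def build_scrape_body(url, formats=("markdown", "metadata"), render="auto",
                      only_main_content=True, timeout=25, **selectors):
    """Build a body for POST /scrape, checking the documented ranges client-side."""
    if render not in ALLOWED_RENDER:
        raise ValueError(f"render must be one of {sorted(ALLOWED_RENDER)}")
    if not 3 <= timeout <= 60:
        raise ValueError("timeout must be between 3 and 60 seconds")
    body = {
        "url": url,
        # The API ignores unknown formats; dropping them here keeps the request tidy.
        "formats": [f for f in formats if f in ALLOWED_FORMATS],
        "render": render,
        "only_main_content": only_main_content,
        "timeout": timeout,
    }
    # Optional CSS-selector parameters are only sent when the caller supplies them.
    for key in ("include_tags", "exclude_tags", "target_selector"):
        if key in selectors:
            body[key] = selectors[key]
    return body


def post_json(path, body, api_key):
    """Send the request. The Authorization header format is an assumption."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    # Give the HTTP client a little more time than the server-side timeout.
    with urllib.request.urlopen(req, timeout=body.get("timeout", 25) + 5) as resp:
        return json.load(resp)


body = build_scrape_body(
    "https://example.com/blog/post",
    formats=["markdown", "links", "pdf"],   # "pdf" is not a documented format
    exclude_tags=["nav", "footer", ".ads"],
)
print(json.dumps(body, indent=2))
```

Calling `post_json("/scrape", body, api_key)` would then send it; the response shape is not described in this reference, so the sketch stops at returning the parsed JSON.
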
POST /web-extract/v1/map · 2 credits

Discover a site's URL surface. The endpoint reads robots.txt, sitemap.xml (including sitemap indexes), and the homepage's internal links, then returns a deduplicated URL list with a page-type label for each URL (home, pricing, docs, blog, product, legal, contact, about). Use it to index a site or audit its structure.

| Parameter | Allowed / range | Description |
| --- | --- | --- |
| url (required) | | The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private, internal, and metadata targets are SSRF-blocked. |
| search (optional) | | Filter discovered links by a URL substring. Matched URLs are returned first, e.g. '/blog' to prioritize blog pages or '/docs' for documentation. |
| limit (optional, default 200) | 1–1000 | Maximum number of links to return. |
| include_subdomains (optional, default false) | | Include subdomain URLs found in sitemaps. |
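
A small sketch of a map request body, plus one way to picture the "matched URLs are returned first" ordering of `search`. It assumes a JSON body; `build_map_body` and `matched_first` are hypothetical helpers, the sample URLs are invented, and the stable-partition ordering is only an illustration of the described behavior, not the service's documented implementation.

```python
def build_map_body(url, search=None, limit=200, include_subdomains=False):
    """Build a body for POST /map, checking the documented limit range."""
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    body = {"url": url, "limit": limit, "include_subdomains": include_subdomains}
    if search:
        body["search"] = search
    return body


def matched_first(urls, search):
    """Local illustration only: move URLs containing `search` to the front."""
    # sorted() is stable and False sorts before True, so matching URLs
    # come first while both groups keep their original relative order.
    return sorted(urls, key=lambda u: search not in u)


# Invented sample data for the illustration.
discovered = [
    "https://example.com/",
    "https://example.com/blog/launch",
    "https://example.com/pricing",
    "https://example.com/blog/changelog",
]

print(build_map_body("example.com", search="/blog", limit=50))
print(matched_first(discovered, "/blog"))
```
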
POST /web-extract/v1/crawl · 1 credit

Crawl a site from a seed URL, up to 25 pages per call. The crawler follows internal links breadth-first, with configurable depth and include/exclude URL patterns, and returns every visited page in the formats you choose. Suited to content indexing, site audits, and building knowledge bases from documentation or blog sites.

| Parameter | Allowed / range | Description |
| --- | --- | --- |
| url (required) | | The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private, internal, and metadata targets are SSRF-blocked. |
| max_pages (optional, default 10) | 1–25 | Maximum number of pages to crawl in one call. Increase for broader site coverage. |
| max_depth (optional, default 2) | 0–5 | Maximum link depth from the seed URL. |
| same_domain_only (optional, default true) | | Only follow links on the seed's registrable domain. |
| include_patterns (optional) | | Regex patterns; only URLs whose path matches are crawled. |
| exclude_patterns (optional) | | Regex patterns; URLs whose path matches are skipped. |
| formats (optional, default markdown,metadata) | markdown · text · html · rawHtml · metadata · links · images · jsonld | Which outputs to return (array or comma-separated string). html is the cleaned main content. Unknown values are ignored. For screenshots, use the web-capture engine. |
| timeout (optional, default 25) | 3–60 | Per-request timeout in seconds. |
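
Because include_patterns and exclude_patterns match against the URL path, it can help to preview locally which URLs a pattern set would admit before spending a crawl. The sketch below assumes a JSON body, that patterns match anywhere in the path (search rather than full-match semantics), that exclusions take precedence over inclusions, and that Python's `re` dialect is close enough to the service's; none of that is stated in this reference. `build_crawl_body` and `would_crawl` are hypothetical names.

```python
import re
from urllib.parse import urlparse


def build_crawl_body(url, max_pages=10, max_depth=2, same_domain_only=True,
                     include_patterns=None, exclude_patterns=None):
    """Build a body for POST /crawl, checking the documented ranges."""
    if not 1 <= max_pages <= 25:
        raise ValueError("max_pages must be between 1 and 25")
    if not 0 <= max_depth <= 5:
        raise ValueError("max_depth must be between 0 and 5")
    body = {
        "url": url,
        "max_pages": max_pages,
        "max_depth": max_depth,
        "same_domain_only": same_domain_only,
    }
    for key, patterns in (("include_patterns", include_patterns),
                          ("exclude_patterns", exclude_patterns)):
        if patterns:
            for pattern in patterns:
                re.compile(pattern)  # fail early on an invalid regex
            body[key] = list(patterns)
    return body


def would_crawl(url, include_patterns=(), exclude_patterns=()):
    """Local preview: does this URL's path pass the include/exclude filters?"""
    path = urlparse(url).path or "/"
    # Assumption: an exclude match wins over an include match.
    if any(re.search(p, path) for p in exclude_patterns):
        return False
    # With no include patterns, every non-excluded path is allowed.
    return not include_patterns or any(re.search(p, path) for p in include_patterns)


include = [r"^/docs/"]
exclude = [r"/changelog"]
for candidate in ("https://example.com/docs/api",
                  "https://example.com/docs/changelog/2024",
                  "https://example.com/pricing"):
    print(candidate, would_crawl(candidate, include, exclude))
```
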
POST /web-extract/v1/extract · 2 credits

Pull structured data from any public page. The endpoint returns structured-data objects (Product, Article, Organization, etc.), microdata presence, parsed table rows, heading outline, prices, emails, and phone numbers. Supply an optional field-to-path schema to extract specific values directly, which is useful for e-commerce pricing, article metadata, and business listings.

| Parameter | Allowed / range | Description |
| --- | --- | --- |
| url (required) | | The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private, internal, and metadata targets are SSRF-blocked. |
| schema (optional) | | Field extraction map: a JSON object where each key is your output field name and each value is a dot-path into the page's structured data, e.g. {'price': 'offers.price', 'name': 'name'}. |
| deterministic_only (optional, default true) | | Use only rule-based extraction (structured data, microdata, tables, headings, prices), which is always available and fully predictable. Set to false to opt into AI-assisted free-form extraction when available. |
| timeout (optional, default 25) | 3–60 | Per-request timeout in seconds. |
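
To make the schema's dot-path idea concrete, here is a local sketch of how a path such as 'offers.price' walks into a structured-data object. The sample Product object is invented for illustration, `resolve_path` and `apply_schema` are hypothetical helpers, and the service's own resolution rules (for example, how it handles arrays or missing keys) are not described in this reference.

```python
def resolve_path(data, path):
    """Follow a dot-path through nested dicts; return None if any step is missing."""
    current = data
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def apply_schema(structured, schema):
    """Map each output field name to the value found at its dot-path."""
    return {field: resolve_path(structured, path) for field, path in schema.items()}


# Invented sample resembling a schema.org Product object.
product = {
    "@type": "Product",
    "name": "Trail Mug",
    "offers": {"price": "19.99", "priceCurrency": "USD"},
}

schema = {"price": "offers.price", "name": "name"}

# Assumption: the endpoint accepts a JSON body with these documented parameters.
request_body = {
    "url": "https://example.com/products/trail-mug",
    "schema": schema,
    "deterministic_only": True,
    "timeout": 25,
}

print(apply_schema(product, schema))
```
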
POST /web-extract/v1/batch · 1 credit

Scrape up to 10 URLs concurrently in one call. Each URL is SSRF-guarded independently, and all URLs share the same formats setting. Comparable to Firecrawl's /batch-scrape, with a bounded URL count.

| Parameter | Allowed / range | Description |
| --- | --- | --- |
| urls (required) | | List of page URLs to scrape in one call (max 10). Each is independently SSRF-guarded and fetched concurrently. |
| formats (optional, default markdown,metadata) | markdown · text · html · rawHtml · metadata · links · images · jsonld | Which outputs to return (array or comma-separated string). html is the cleaned main content. Unknown values are ignored. For screenshots, use the web-capture engine. |
| only_main_content (optional, default true) | | Remove navigation bars, headers, footers, and boilerplate, keeping only the main article or content body. Set to false to get the full page content. |
| timeout (optional, default 25) | 3–60 | Per-request timeout in seconds. |
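
With a cap of 10 URLs per call, a longer list has to be split across several requests. The sketch below shows one way to do that; it assumes a JSON body, and `chunked` and `build_batch_bodies` are hypothetical helper names. Removing duplicate URLs before chunking is a client-side choice made here, not something the API is documented to require.

```python
MAX_URLS_PER_BATCH = 10  # documented cap for the batch endpoint


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_bodies(urls, formats=("markdown", "metadata"),
                       only_main_content=True, timeout=25):
    """Split a URL list into batch request bodies that share one formats setting."""
    if not 3 <= timeout <= 60:
        raise ValueError("timeout must be between 3 and 60 seconds")
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(urls))
    return [
        {
            "urls": group,
            "formats": list(formats),
            "only_main_content": only_main_content,
            "timeout": timeout,
        }
        for group in chunked(unique, MAX_URLS_PER_BATCH)
    ]


# 23 distinct URLs plus one duplicate of the first.
urls = [f"https://example.com/page/{i}" for i in range(23)]
urls.append("https://example.com/page/0")

bodies = build_batch_bodies(urls, formats=["markdown", "links"])
print([len(b["urls"]) for b in bodies])
```

Each body in `bodies` would be sent as its own POST to /web-extract/v1/batch.
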