Web Extract API

/web-extract/v1/scrape (1 credit)

Fetch one URL and return its content in the formats you choose (markdown, plain text, cleaned HTML, raw HTML, metadata, links, images, or structured data). Automatically upgrades thin single-page-app pages to a full browser render when needed. Works on any public page: news articles, product pages, documentation, blogs.

| Parameter (= default) | Requirement | Allowed / range | Description |
|---|---|---|---|
| url | required | — | The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private/internal/metadata targets are SSRF-blocked. |
| formats = markdown,metadata | optional | markdown · text · html · rawHtml · metadata · links · images · jsonld | Which outputs to return (array or comma-string). Any of: markdown, text, html (cleaned main-content), rawHtml, metadata, links, images, jsonld. Default: markdown+metadata. Unknown values are ignored. For screenshots use the web-capture engine. |
| render = auto | optional | auto · never · force | When to activate browser rendering for JavaScript-heavy pages. 'auto' (default) detects thin/client-rendered pages and re-fetches them with a real browser automatically — rich server-rendered pages are served instantly without a browser. 'force' always uses a browser; 'never' skips rendering for maximum speed. |
| only_main_content = true | optional | — | Remove navigation bars, headers, footers, and boilerplate — keep only the main article or content body (default true). Turn off to get the full raw page content. |
| include_tags | optional | — | CSS selectors to include — only elements matching these selectors are kept in the extracted content, e.g. ['article', 'main', '.post-body']. |
| exclude_tags | optional | — | CSS selectors to remove before extraction — useful for stripping ads, cookie banners, or sidebar widgets, e.g. ['nav', 'footer', '.ads']. |
| target_selector | optional | — | Return only the content inside this CSS selector — e.g. 'article.post' to extract a specific section of the page and ignore everything else. |
| timeout = 25 | optional | 3–60 | Per-request timeout in seconds (3-60, default 25). |
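
The parameters above map directly onto a JSON request body. The following is a minimal sketch, not an official client: the host and the `x-api-key` header are taken from the curl example in this document, while the function name `build_scrape_request` and its keyword passthrough for `include_tags`, `exclude_tags`, and `target_selector` are illustrative choices. The request is only built, not sent.

```python
import json
import urllib.request

# Host taken from the curl example in this document.
API_BASE = "https://api.reefapi.com"


def build_scrape_request(api_key, url, formats=("markdown", "metadata"),
                         render="auto", only_main_content=True,
                         timeout=25, **options):
    """Build (but do not send) a POST request for /web-extract/v1/scrape.

    `options` is a hypothetical passthrough for the optional selector
    parameters (include_tags, exclude_tags, target_selector).
    """
    if render not in ("auto", "never", "force"):
        raise ValueError("render must be 'auto', 'never', or 'force'")
    if not 3 <= timeout <= 60:
        raise ValueError("timeout must be between 3 and 60 seconds")
    payload = {
        "url": url,
        "formats": list(formats),
        "render": render,
        "only_main_content": only_main_content,
        "timeout": timeout,
    }
    payload.update(options)
    return urllib.request.Request(
        f"{API_BASE}/web-extract/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "content-type": "application/json"},
        method="POST",
    )


req = build_scrape_request(
    "YOUR_KEY",
    "https://example.com/post",
    render="force",
    exclude_tags=["nav", "footer", ".ads"],
)
# To send it: urllib.request.urlopen(req, timeout=70)
```

Validating `render` and `timeout` locally is a convenience that fails fast before a request is made; the ranges checked are the ones listed in the table.
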

/web-extract/v1/map (2 credits)

Discover a site's URL surface. The endpoint reads robots.txt, sitemap.xml (including sitemap indexes), and the homepage's internal links, then returns a deduplicated URL list with a page-type label for each entry (home, pricing, docs, blog, product, legal, contact, about). Use this to index a site or audit its structure.

| Parameter (= default) | Requirement | Allowed / range | Description |
|---|---|---|---|
| url | required | — | The site URL to map. Full URL or bare domain (https:// assumed). Only http/https; private/internal/metadata targets are SSRF-blocked. |
| search | optional | — | Filter discovered links by a URL substring — matched URLs are returned first, e.g. '/blog' to prioritize blog pages or '/docs' for documentation. |
| limit = 200 | optional | 1–1000 | Max links to return (1-1000, default 200). |
| include_subdomains = false | optional | — | Include subdomain URLs found in sitemaps (default false). |
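
As a rough illustration of how a map call might be prepared and its result consumed, the sketch below builds the request body from the parameters above and groups a URL list by page-type label. The response field names `url` and `type` are assumptions made for this example (this document does not specify the response shape), and `sample_entries` is made-up data.

```python
from collections import defaultdict


def map_payload(url, search=None, limit=200, include_subdomains=False):
    """Build the JSON body for /web-extract/v1/map from the documented parameters."""
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    payload = {"url": url, "limit": limit,
               "include_subdomains": include_subdomains}
    if search:
        payload["search"] = search
    return payload


def group_by_page_type(entries):
    """Group mapped URLs by their label.

    ASSUMPTION: each entry looks like {"url": ..., "type": ...};
    the real field names may differ.
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.get("type", "unknown")].append(entry["url"])
    return dict(groups)


# Hypothetical entries, for illustration only.
sample_entries = [
    {"url": "https://example.com/", "type": "home"},
    {"url": "https://example.com/docs/start", "type": "docs"},
    {"url": "https://example.com/docs/api", "type": "docs"},
]

print(map_payload("example.com", search="/docs", limit=50))
print(group_by_page_type(sample_entries))
```
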

/web-extract/v1/crawl (1 credit)

Crawl a site starting from a seed URL (up to 25 pages per call). The crawler follows internal links breadth-first, honors a configurable depth limit and include/exclude URL patterns, and returns every visited page in the formats you choose. Suited to content indexing, site audits, and building knowledge bases from documentation or blog sites.

| Parameter (= default) | Requirement | Allowed / range | Description |
|---|---|---|---|
| url | required | — | The seed URL to start crawling from. Full URL or bare domain (https:// assumed). Only http/https; private/internal/metadata targets are SSRF-blocked. |
| max_pages = 10 | optional | 1–25 | Maximum number of pages to crawl in one call (1–25, default 10). Increase for deeper site coverage. |
| max_depth = 2 | optional | 0–5 | Max link depth from the seed URL (0-5, default 2). |
| same_domain_only = true | optional | — | Only follow links on the seed's registrable domain (default true). |
| include_patterns | optional | — | Regex patterns; only URLs whose path matches are crawled. |
| exclude_patterns | optional | — | Regex patterns; URLs whose path matches are skipped. |
| formats = markdown,metadata | optional | markdown · text · html · rawHtml · metadata · links · images · jsonld | Which outputs to return (array or comma-string). Any of: markdown, text, html (cleaned main-content), rawHtml, metadata, links, images, jsonld. Default: markdown+metadata. Unknown values are ignored. For screenshots use the web-capture engine. |
| timeout = 25 | optional | 3–60 | Per-request timeout in seconds (3-60, default 25). |
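
Because `include_patterns` and `exclude_patterns` are regexes matched against the URL path, it can help to preview locally which URLs a pattern set would admit before spending a crawl call. The sketch below is a client-side approximation only: it assumes unanchored matching (`re.search`) and that exclusions take precedence over inclusions, neither of which this document confirms. The function names are illustrative.

```python
import re
from urllib.parse import urlparse


def would_crawl(url, include_patterns=(), exclude_patterns=()):
    """Approximate the include/exclude filter on a URL's path.

    ASSUMPTIONS: patterns are applied with re.search (unanchored) and
    exclude patterns win over include patterns.
    """
    path = urlparse(url).path or "/"
    if any(re.search(p, path) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(re.search(p, path) for p in include_patterns)
    return True


def crawl_payload(url, max_pages=10, max_depth=2, same_domain_only=True,
                  include_patterns=(), exclude_patterns=()):
    """Build the JSON body for /web-extract/v1/crawl with range checks."""
    if not 1 <= max_pages <= 25:
        raise ValueError("max_pages must be between 1 and 25")
    if not 0 <= max_depth <= 5:
        raise ValueError("max_depth must be between 0 and 5")
    return {
        "url": url,
        "max_pages": max_pages,
        "max_depth": max_depth,
        "same_domain_only": same_domain_only,
        "include_patterns": list(include_patterns),
        "exclude_patterns": list(exclude_patterns),
    }


include = [r"^/docs/"]
exclude = [r"/changelog"]
for candidate in ("https://example.com/docs/api",
                  "https://example.com/docs/changelog",
                  "https://example.com/blog/launch"):
    print(candidate, would_crawl(candidate, include, exclude))
```
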

/web-extract/v1/extract (2 credits)

Pull structured data from any public page: returns structured-data objects (Product, Article, Organization, etc.), microdata presence, parsed table rows, heading outline, prices, emails, and phone numbers. Supply an optional field-to-path schema to extract specific values directly. This is useful for e-commerce pricing, article metadata, and business listings.

| Parameter (= default) | Requirement | Allowed / range | Description |
|---|---|---|---|
| url | required | — | The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private/internal/metadata targets are SSRF-blocked. |
| schema | optional | — | Optional field extraction map — a JSON object where each key is your output field name and the value is a dot-path into the page's structured data, e.g. {'price': 'offers.price', 'name': 'name'}. |
| deterministic_only = true | optional | — | Use only rule-based extraction (structured data, microdata, tables, headings, prices) — always-on and fully predictable. Default true; set false to opt into AI-assisted free-form extraction when available. |
| timeout = 25 | optional | 3–60 | Per-request timeout in seconds (3-60, default 25). |
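
To make the `schema` parameter concrete, the sketch below shows what a field-to-dot-path map means when applied to a structured-data object. It is a local illustration of the idea, not the service's resolver: `resolve_path` and `apply_schema` are hypothetical helpers, the `product` object is made-up sample data, and details such as array handling are not covered.

```python
def resolve_path(data, path):
    """Follow a dot-path such as 'offers.price' through nested dicts.

    Returns None when any segment is missing.
    """
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return None
    return current


def apply_schema(structured, schema):
    """Map each output field name to the value found at its dot-path."""
    return {field: resolve_path(structured, path)
            for field, path in schema.items()}


# Made-up structured-data object of the kind a product page might carry.
product = {
    "@type": "Product",
    "name": "Trail Shoe",
    "offers": {"price": "89.00", "priceCurrency": "USD"},
}

# Same schema as the example in the parameter table.
schema = {"price": "offers.price", "name": "name"}

# Request body for /web-extract/v1/extract using the documented parameters.
payload = {"url": "https://example.com/product/1", "schema": schema,
           "deterministic_only": True}

print(apply_schema(product, schema))
# → {'price': '89.00', 'name': 'Trail Shoe'}
```
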

/web-extract/v1/batch (1 credit)

Scrape up to 10 URLs concurrently in one call; each URL is SSRF-guarded independently. All URLs share the same formats setting. Comparable to Firecrawl's /batch-scrape endpoint, but bounded to 10 URLs per call.

| Parameter (= default) | Requirement | Allowed / range | Description |
|---|---|---|---|
| urls | required | — | List of page URLs to scrape in one call (max 10); each is independently SSRF-guarded and fetched concurrently. |
| formats = markdown,metadata | optional | markdown · text · html · rawHtml · metadata · links · images · jsonld | Which outputs to return (array or comma-string). Any of: markdown, text, html (cleaned main-content), rawHtml, metadata, links, images, jsonld. Default: markdown+metadata. Unknown values are ignored. For screenshots use the web-capture engine. |
| only_main_content = true | optional | — | Remove navigation bars, headers, footers, and boilerplate — keep only the main article or content body (default true). Turn off to get the full raw page content. |
| timeout = 25 | optional | 3–60 | Per-request timeout in seconds (3-60, default 25). |
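
Since one batch call accepts at most 10 URLs, a longer list has to be split across several calls. The sketch below shows one way to do that; `batch_payloads` is an illustrative helper, and the order-preserving deduplication is a design choice of this example rather than documented behavior.

```python
def batch_payloads(urls, formats=("markdown", "metadata"),
                   only_main_content=True, timeout=25, max_per_call=10):
    """Split a URL list into request bodies of at most `max_per_call` URLs each."""
    if not 3 <= timeout <= 60:
        raise ValueError("timeout must be between 3 and 60 seconds")
    # Drop duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(urls))
    return [
        {
            "urls": unique[start:start + max_per_call],
            "formats": list(formats),
            "only_main_content": only_main_content,
            "timeout": timeout,
        }
        for start in range(0, len(unique), max_per_call)
    ]


pages = [f"https://example.com/page/{n}" for n in range(23)]
payloads = batch_payloads(pages, formats=("text",))
print([len(p["urls"]) for p in payloads])
# → [10, 10, 3]
```

Each resulting body would be sent as its own POST to /web-extract/v1/batch.
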

Example request:

curl -X POST https://api.reefapi.com/web-extract/v1/scrape \
  -H "x-api-key: $REEF_KEY" \
  -H "content-type: application/json" \
  -d '{"url":"https://en.wikipedia.org/wiki/Web_scraping","formats":["markdown","metadata"]}'

Response:

{
  "ok": true,
  "data": { /* the result */ },
  "meta": {
    "latency_ms": 240,
    "record_count": 12,
    "completeness_pct": 100
  },
  "error": null
}
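
The response shown above wraps the result in an envelope with `ok`, `data`, `meta`, and `error` fields. A minimal sketch of unwrapping it follows; `ReefApiError` and `unwrap` are hypothetical names, the shape of a non-null `error` value is not documented here, and the `markdown` key inside `data` in the sample is an assumption for illustration.

```python
import json


class ReefApiError(Exception):
    """Hypothetical exception raised when the envelope reports a failure."""


def unwrap(body):
    """Parse a response body and return (data, meta), or raise on failure.

    Relies only on the envelope fields shown in this document:
    ok, data, meta, error.
    """
    envelope = json.loads(body)
    if not envelope.get("ok"):
        # The structure of `error` is unknown here, so it is passed through as-is.
        raise ReefApiError(envelope.get("error"))
    return envelope.get("data"), envelope.get("meta") or {}


# Sample body; the "markdown" key inside data is assumed, not documented.
sample = json.dumps({
    "ok": True,
    "data": {"markdown": "# Title"},
    "meta": {"latency_ms": 240, "record_count": 12, "completeness_pct": 100},
    "error": None,
})

data, meta = unwrap(sample)
print(meta["latency_ms"], meta["completeness_pct"])
```

Checking `ok` before touching `data` keeps failure handling in one place, whichever endpoint produced the response.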