Web Extract API

base /web-extract/v1 · 5 endpoints

POST /web-extract/v1/scrape · 1 credit

Fetch one URL and return its content in the formats you choose (markdown, plain text, cleaned HTML, raw HTML, metadata, links, images, or structured data). Automatically upgrades thin single-page-app pages to a full browser render when needed. Works on any public page — news articles, product pages, documentation, blogs.

Parameter	Required	Allowed / range	Description
url	required	—	The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private/internal/metadata targets are SSRF-blocked.
formats = markdown,metadata	optional	markdown · text · html · rawHtml · metadata · links · images · jsonld	Which outputs to return (array or comma-string). Any of: markdown, text, html (cleaned main-content), rawHtml, metadata, links, images, jsonld. Default: markdown+metadata. Unknown values are ignored. For screenshots use the web-capture engine.
render = auto	optional	auto · never · force	When to activate browser rendering for JavaScript-heavy pages. 'auto' (default) detects thin/client-rendered pages and re-fetches them with a real browser automatically — rich server-rendered pages are served instantly without a browser. 'force' always uses a browser; 'never' skips rendering for maximum speed.
only_main_content = true	optional	—	Remove navigation bars, headers, footers, and boilerplate — keep only the main article or content body (default true). Turn off to get the full raw page content.
include_tags	optional	—	CSS selectors to include — only elements matching these selectors are kept in the extracted content, e.g. ['article', 'main', '.post-body'].
exclude_tags	optional	—	CSS selectors to remove before extraction — useful for stripping ads, cookie banners, or sidebar widgets, e.g. ['nav', 'footer', '.ads'].
target_selector	optional	—	Return only the content inside this CSS selector — e.g. 'article.post' to extract a specific section of the page and ignore everything else.
timeout = 25	optional	3–60	Per-request timeout in seconds (3-60, default 25).
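
The parameters above can be assembled and checked client-side before a call is made. The following is a minimal sketch, not an official client: the helper names `build_scrape_payload` and `make_request` are hypothetical, the host and the `x-api-key` header are taken from the curl example on this page, and rejecting unknown formats is a client-side choice (the API itself ignores them).

```python
import json
import urllib.request

# Host copied from the curl example on this page.
API_BASE = "https://api.reefapi.com/web-extract/v1"
ALLOWED_FORMATS = {"markdown", "text", "html", "rawHtml",
                   "metadata", "links", "images", "jsonld"}
ALLOWED_RENDER = {"auto", "never", "force"}


def build_scrape_payload(url, formats=("markdown", "metadata"), render="auto",
                         only_main_content=True, target_selector=None, timeout=25):
    """Validate arguments locally and return the JSON body for /scrape."""
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        # The API ignores unknown formats; failing early is this sketch's choice.
        raise ValueError(f"unknown formats: {sorted(unknown)}")
    if render not in ALLOWED_RENDER:
        raise ValueError(f"render must be one of {sorted(ALLOWED_RENDER)}")
    if not 3 <= timeout <= 60:
        raise ValueError("timeout must be between 3 and 60 seconds")
    payload = {"url": url, "formats": list(formats), "render": render,
               "only_main_content": only_main_content, "timeout": timeout}
    if target_selector is not None:
        payload["target_selector"] = target_selector
    return payload


def make_request(endpoint, payload, api_key):
    """Wrap a payload in a POST request; nothing is sent until urlopen() is called."""
    return urllib.request.Request(
        f"{API_BASE}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "content-type": "application/json"},
        method="POST",
    )


payload = build_scrape_payload("https://example.com/blog/post",
                               formats=["markdown", "links"],
                               render="never",
                               target_selector="article.post")
request = make_request("scrape", payload, api_key="YOUR_KEY")
# To actually send it:
# with urllib.request.urlopen(request, timeout=30) as resp:
#     body = json.load(resp)
```

Keeping payload construction separate from sending makes the validation easy to unit-test without spending credits.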

POST /web-extract/v1/map · 2 credits

Discover a site's complete URL surface: reads robots.txt, sitemap.xml (and sitemap indexes), and the homepage's internal links — returns a deduplicated URL list with a page-type label for each (home, pricing, docs, blog, product, legal, contact, about). Use this to index any site or audit its structure.

Parameter	Required	Allowed / range	Description
url	required	—	The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private/internal/metadata targets are SSRF-blocked.
search	optional	—	Filter discovered links by a URL substring — matched URLs are returned first, e.g. '/blog' to prioritize blog pages or '/docs' for documentation.
limit = 200	optional	1–1000	Max links to return (1-1000, default 200).
include_subdomains = false	optional	—	Include subdomain URLs found in sitemaps (default false).
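
To make the `search` and `limit` semantics concrete, here is a small local illustration of "deduplicate, put matching URLs first, then truncate". It is only a sketch of the documented behavior, not the service's implementation: the function name `prioritize` is hypothetical, and applying `limit` after the reordering is an assumption.

```python
def prioritize(urls, search, limit=200):
    """Deduplicate, move URLs containing `search` to the front, then truncate."""
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    matched = [u for u in unique if search in u]
    rest = [u for u in unique if search not in u]
    # Assumption: the limit is applied after matched URLs are moved forward.
    return (matched + rest)[:limit]


discovered = [
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/blog/a",
    "https://example.com/pricing",   # duplicate, e.g. found in sitemap and homepage
    "https://example.com/blog/b",
]
top = prioritize(discovered, "/blog", limit=3)
```

With this input, `top` holds the two blog URLs followed by the homepage.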

POST /web-extract/v1/crawl · 1 credit

Crawl a site starting from a seed URL (up to 25 pages): follows internal links breadth-first with configurable depth, include/exclude URL patterns, and returns every visited page in the formats you choose. Ideal for content indexing, site audits, and building knowledge bases from documentation or blog sites.

Parameter	Required	Allowed / range	Description
url	required	—	The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private/internal/metadata targets are SSRF-blocked.
max_pages = 10	optional	1–25	Maximum number of pages to crawl in one call (1–25, default 10). Increase for deeper site coverage.
max_depth = 2	optional	0–5	Max link depth from the seed URL (0-5, default 2).
same_domain_only = true	optional	—	Only follow links on the seed's registrable domain (default true).
include_patterns	optional	—	Regex patterns; only URLs whose path matches are crawled.
exclude_patterns	optional	—	Regex patterns; URLs whose path matches are skipped.
formats = markdown,metadata	optional	markdown · text · html · rawHtml · metadata · links · images · jsonld	Which outputs to return (array or comma-string). Any of: markdown, text, html (cleaned main-content), rawHtml, metadata, links, images, jsonld. Default: markdown+metadata. Unknown values are ignored. For screenshots use the web-capture engine.
timeout = 25	optional	3–60	Per-request timeout in seconds (3-60, default 25).
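
How `max_pages`, `max_depth`, and the include/exclude patterns interact is easiest to see on a toy link graph. The sketch below simulates a breadth-first crawl over an in-memory dictionary instead of fetching pages. The names `path_allowed` and `crawl_order` are hypothetical, and several details are assumptions rather than documented behavior: patterns are matched with `re.search` against the URL path, exclusion wins over inclusion, and the seed itself is never filtered.

```python
import re
from collections import deque
from urllib.parse import urlsplit


def path_allowed(url, include_patterns=(), exclude_patterns=()):
    """Apply regex filters to the URL path (exclusion takes precedence here)."""
    path = urlsplit(url).path or "/"
    if any(re.search(p, path) for p in exclude_patterns):
        return False
    return not include_patterns or any(re.search(p, path) for p in include_patterns)


def crawl_order(seed, links, max_pages=10, max_depth=2,
                include_patterns=(), exclude_patterns=()):
    """Return URLs in the order a breadth-first crawl would visit them.

    `links` maps each URL to the URLs it links to; it stands in for fetching.
    """
    visited = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen and path_allowed(nxt, include_patterns, exclude_patterns):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


site = {
    "https://example.com/": ["https://example.com/docs",
                             "https://example.com/blog",
                             "https://example.com/login"],
    "https://example.com/docs": ["https://example.com/docs/api",
                                 "https://example.com/"],
    "https://example.com/docs/api": ["https://example.com/docs/api/v1"],
}
order = crawl_order("https://example.com/", site, exclude_patterns=[r"^/login"])
```

Here `/login` is skipped by the exclude pattern and `/docs/api/v1` is never reached because it sits at depth 3, beyond the default `max_depth` of 2.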

POST /web-extract/v1/extract · 2 credits

Pull structured data from any public page: returns structured-data objects (Product, Article, Organization, etc.), microdata presence, parsed table rows, heading outline, prices, emails, and phone numbers. Supply an optional field-to-path schema to extract specific values directly, which suits e-commerce pricing, article metadata, and business listings.

Parameter	Required	Allowed / range	Description
url	required	—	The page URL to extract. Full URL or bare domain (https:// assumed). Only http/https; private/internal/metadata targets are SSRF-blocked.
schema	optional	—	Optional field extraction map — a JSON object where each key is your output field name and the value is a dot-path into the page's structured data, e.g. {'price': 'offers.price', 'name': 'name'}.
deterministic_only = true	optional	—	Use only rule-based extraction (structured data, microdata, tables, headings, prices) — always-on and fully predictable. Default true; set false to opt into AI-assisted free-form extraction when available.
timeout = 25	optional	3–60	Per-request timeout in seconds (3-60, default 25).
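
The `schema` parameter maps output field names to dot-paths, and the idea can be sketched locally. The example below resolves paths such as `offers.price` against a nested dictionary. It illustrates the concept only: `resolve_path` and `apply_schema` are hypothetical names, the sample product object is invented, and returning `None` for a missing segment is an assumption about behavior the page does not specify.

```python
def resolve_path(data, path):
    """Follow a dot-path such as 'offers.price' through nested dictionaries."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return None  # assumption: missing segments yield None
    return current


def apply_schema(structured, schema):
    """Build {output_field: value} by resolving each dot-path in the schema."""
    return {field: resolve_path(structured, path) for field, path in schema.items()}


# Invented sample in the style of a schema.org Product object.
product = {
    "@type": "Product",
    "name": "Trail Shoe",
    "offers": {"price": "89.00", "priceCurrency": "EUR"},
}
schema = {"price": "offers.price", "name": "name", "sku": "sku"}
fields = apply_schema(product, schema)
```

In this run `fields` contains the price and name, and `sku` resolves to `None` because the sample object has no such key.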

POST /web-extract/v1/batch · 1 credit

Scrape up to 10 URLs concurrently in one call. Each URL is SSRF-guarded independently, and all URLs share the same formats. Comparable to Firecrawl's /batch-scrape, but with a bounded URL count.

Parameter	Required	Allowed / range	Description
urls	required	—	List of page URLs to scrape in one call (max 10); each is independently SSRF-guarded and fetched concurrently.
formats = markdown,metadata	optional	markdown · text · html · rawHtml · metadata · links · images · jsonld	Which outputs to return (array or comma-string). Any of: markdown, text, html (cleaned main-content), rawHtml, metadata, links, images, jsonld. Default: markdown+metadata. Unknown values are ignored. For screenshots use the web-capture engine.
only_main_content = true	optional	—	Remove navigation bars, headers, footers, and boilerplate — keep only the main article or content body (default true). Turn off to get the full raw page content.
timeout = 25	optional	3–60	Per-request timeout in seconds (3-60, default 25).
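
Because a single call accepts at most 10 URLs, longer lists have to be split on the client. A minimal sketch of that splitting follows. The helper names `chunk_urls` and `batch_payloads` are hypothetical, and removing duplicates before chunking is a design choice of this sketch, not something the API requires.

```python
MAX_BATCH = 10  # per-call URL limit stated for /batch


def chunk_urls(urls, size=MAX_BATCH):
    """Deduplicate URLs (keeping order) and split them into lists of at most `size`."""
    unique = list(dict.fromkeys(urls))
    return [unique[i:i + size] for i in range(0, len(unique), size)]


def batch_payloads(urls, formats=("markdown", "metadata"), timeout=25):
    """Return one /batch request body per chunk, all sharing the same formats."""
    return [
        {"urls": chunk, "formats": list(formats), "timeout": timeout}
        for chunk in chunk_urls(urls)
    ]


pages = [f"https://example.com/page/{n}" for n in range(23)]
payloads = batch_payloads(pages, formats=["text"])
sizes = [len(p["urls"]) for p in payloads]
```

For 23 distinct URLs this produces three request bodies holding 10, 10, and 3 URLs.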

Example request · scrape

curl -X POST https://api.reefapi.com/web-extract/v1/scrape \
  -H "x-api-key: $REEF_KEY" \
  -H "content-type: application/json" \
  -d '{"url":"https://en.wikipedia.org/wiki/Web_scraping","formats":["markdown","metadata"]}'

Response shape

{
  "ok": true,
  "data": { /* the result */ },
  "meta": {
    "latency_ms": 240,
    "record_count": 12,
    "completeness_pct": 100
  },
  "error": null
}
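
A caller will typically want to check `ok` before touching `data`. The sketch below unwraps the envelope shown above. It is an illustration under assumptions: `ExtractError` and `unwrap` are hypothetical names, the sample bodies are invented, and since this page does not describe the type of `error`, the value is passed through unchanged.

```python
import json


class ExtractError(Exception):
    """Raised when the envelope reports ok = false."""


def unwrap(body):
    """Parse a response body and return (data, meta), or raise on failure."""
    envelope = json.loads(body)
    if not envelope.get("ok"):
        # The shape of `error` is not documented here, so it is passed as-is.
        raise ExtractError(envelope.get("error"))
    return envelope.get("data"), envelope.get("meta") or {}


# Invented sample bodies that follow the envelope shape above.
success = json.dumps({
    "ok": True,
    "data": {"markdown": "# Title"},
    "meta": {"latency_ms": 240, "record_count": 12, "completeness_pct": 100},
    "error": None,
})
failure = json.dumps({"ok": False, "data": None, "meta": {}, "error": "timeout"})

data, meta = unwrap(success)
try:
    unwrap(failure)
    failed = False
except ExtractError:
    failed = True
```

Raising on `ok: false` keeps error handling in one place, so downstream code can assume `data` is present.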