Selection-based crawl
Crawl exactly the URLs you pick — no link discovery, no BFS. The usual flow is two requests: GET /v1/sitemap/preview to list candidates, then POST /v1/jobs with mode: "select" and the URLs you chose.
Quota cost = number of URLs you submit. The pre-flight check reserves exactly that amount, with none of the over-counting that full mode can incur.
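For orientation, here is the whole flow as one script. This is a minimal sketch in Python with the third-party `requests` library, built on the endpoints documented below; the `SYTTRA_API_KEY` environment variable and the take-the-first-three selection are illustrative assumptions, not part of the API.

```python
import os
import requests

API = "https://api.syttra.com"
# Assumption: the API key lives in an environment variable.
HEADERS = {"Authorization": f"Bearer {os.environ['SYTTRA_API_KEY']}"}

# Step 1: list candidates from the seed's sitemap (consumes no quota).
preview = requests.get(
    f"{API}/v1/sitemap/preview",
    headers=HEADERS,
    params={"url": "https://www.qbus.be/nl"},
)
preview.raise_for_status()
candidates = preview.json()["urls"]

# Step 2: submit only the URLs you want. Quota cost == len(picked).
picked = candidates[:3]  # illustrative; apply your own selection here
job = requests.post(
    f"{API}/v1/jobs",
    headers=HEADERS,
    json={
        "url": "https://www.qbus.be/nl",  # same-domain anchor
        "mode": "select",
        "urls": picked,
        "export_formats": ["markdown"],
    },
)
job.raise_for_status()
print(job.json()["status"])  # "queued" on acceptance
```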
When to use it
- The site has a sitemap and you want a coherent subset (one language, one section, one product category).
- Full mode would burn through quota on irrelevant pages — e.g. fluxys.com's sitemap contains all 6 language variants of every page.
- You already know the URLs you need and want to skip discovery entirely.
Step 1 — Discover URLs
GET /v1/sitemap/preview → 200

Reads the seed domain's sitemap.xml (or sitemap_index.xml, or any sitemap advertised in robots.txt), falling back to a shallow link-crawl of the seed page if none is reachable. No quota is consumed; the preview is informational.
Query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url`* | string (URL) | required | Seed URL, usually a homepage. Discovery is same-domain only. |
| `limit` | integer | (unlimited) | Optional cap on URLs returned. Default is no cap, with a server-side safety net at 50,000. |
Example request

GET /v1/sitemap/preview

```bash
curl -G https://api.syttra.com/v1/sitemap/preview \
  -H "Authorization: Bearer sk_live_..." \
  --data-urlencode "url=https://www.qbus.be/nl"
```

Example response
200 OK

```json
{
  "urls": [
    "https://www.qbus.be/nl/about",
    "https://www.qbus.be/nl/catalogus/dimmers",
    "https://www.qbus.be/nl/catalogus/controllers",
    ...
  ],
  "source": "sitemap",
  "count": 1048,
  "capped": false,
  "assets_filtered": 1261
}
```

Response fields
- `source`: `"sitemap"` when at least one well-known or robots-advertised sitemap returned URLs; `"shallow_crawl"` when we fell back to scraping the homepage's `<a href>`s.
- `capped`: `true` only if you passed an explicit `limit` and the result hit it. Always `false` when `limit` is omitted.
- `assets_filtered`: how many URLs were dropped because they pointed at images, PDFs, RSS feeds, JS/CSS bundles, or other non-page assets. Filtering is done on the URL path, so `?v=12345`-style cache-busters don't fool it.
Rate-limited at 20 req/min per user.
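In practice you filter the preview client-side before submitting. A sketch of that selection step, again in Python with `requests`; the shallow-crawl warning and the `/nl/` prefix filter are illustrative choices, grounded only in the field semantics above.

```python
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['SYTTRA_API_KEY']}"}
data = requests.get(
    "https://api.syttra.com/v1/sitemap/preview",
    headers=HEADERS,
    params={"url": "https://www.qbus.be/nl"},
).json()

# A shallow-crawl fallback only sees the seed page's <a href>s,
# so the candidate list may be incomplete.
if data["source"] == "shallow_crawl":
    print("warning: no sitemap found; preview may be incomplete")

# capped can only be true when an explicit limit was passed.
if data["capped"]:
    print("warning: preview hit your limit; retry with a higher limit")

# Illustrative selection: keep a single language section.
picked = [u for u in data["urls"] if u.startswith("https://www.qbus.be/nl/")]
```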
Step 2 — Submit the crawl
POST /v1/jobs → 202 Accepted

Submits the URLs you picked from the preview. The crawler fetches each one exactly once; there is no link discovery.
Request body
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url`* | string (URL) | required | Same-domain anchor: must share a host with every entry in `urls`. Acts as the security boundary so `/v1/jobs` can't be used as a generic any-site crawler proxy. |
| `mode`* | string enum | required | Set to `"select"` for this guide. |
| `urls`* | array of URLs | required | 1-1000 URLs to crawl. Every entry must share its domain with `url`. Order is preserved (duplicates are dropped). `max_pages` is auto-set to `len(urls)`; `max_depth` is forced to 1 (no link-following). |
| `export_formats` | array of strings | `["markdown"]` | Output formats. Currently: `markdown`, `text`. Multi-page output is concatenated with `---` separators. |
| `concurrency` | integer | 5 | How many pages to fetch in parallel (1-20). Higher = faster, but harder on the target site. |
| `respect_robots` | boolean | `true` | Check robots.txt before crawling. Disabling is only allowed with an Enterprise agreement. |
| `check_tos` | boolean | `true` | Inspect the target site's ToS for anti-scraping clauses. Jobs fail with `tos_blocked` if such a clause is detected. |
Example request

POST /v1/jobs

```bash
curl https://api.syttra.com/v1/jobs \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.qbus.be/nl",
    "mode": "select",
    "urls": [
      "https://www.qbus.be/nl/about",
      "https://www.qbus.be/nl/catalogus/dimmers",
      "https://www.qbus.be/nl/catalogus/controllers"
    ],
    "export_formats": ["markdown"]
  }'
```

Example response
202 Accepted

```json
{
  "id": "c2d9f5e8-...",
  "status": "queued",
  "url": "https://www.qbus.be/nl",
  "mode": "select",
  "pages_total": 3,
  "links": {
    "self": "/v1/jobs/c2d9f5e8-...",
    "result": "/v1/jobs/c2d9f5e8-.../result"
  }
}
```

Constraints
- Same-domain only. Cross-domain entries in `urls` return `422 validation_error` with a specific message naming the offending URL.
- 1 ≤ len(urls) ≤ 1000. An empty list, or a larger one, is rejected.
- `max_pages` and `max_depth` are derived. Passing them explicitly is harmless but ignored: `max_pages = len(urls)` and `max_depth = 1`.
- Order preserved, duplicates collapsed. If you send the same URL twice, the second occurrence is silently dropped.
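Because a single job accepts at most 1,000 URLs, a larger selection has to be split across several jobs. A sketch with a hypothetical `submit_in_batches` helper (Python, `requests`; `SYTTRA_API_KEY` is again an assumed environment variable):

```python
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['SYTTRA_API_KEY']}"}

def submit_in_batches(anchor: str, urls: list[str]) -> list[str]:
    """Split a selection larger than 1,000 URLs across several jobs."""
    job_ids = []
    for i in range(0, len(urls), 1000):  # per-job cap documented above
        resp = requests.post(
            "https://api.syttra.com/v1/jobs",
            headers=HEADERS,
            json={"url": anchor, "mode": "select", "urls": urls[i : i + 1000]},
        )
        resp.raise_for_status()
        job_ids.append(resp.json()["id"])
    return job_ids
```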
What's next
Polling, multi-page download format, and error handling live in the REST API overview.
- `GET /v1/jobs/{id}`: poll until `completed` (see the polling sketch below).
- `GET /v1/jobs/{id}/result`: concatenated markdown with `---` page separators.
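A minimal polling loop over those two endpoints, as a sketch: `wait_for_result` is a hypothetical helper, the fixed two-second interval is arbitrary, and it assumes the result endpoint returns the markdown body directly. Failure statuses are documented in the REST API overview and should be handled too.

```python
import os
import time
import requests

API = "https://api.syttra.com"
HEADERS = {"Authorization": f"Bearer {os.environ['SYTTRA_API_KEY']}"}

def wait_for_result(job_id: str, interval: float = 2.0) -> str:
    """Poll a job until it completes, then return the concatenated markdown."""
    while True:
        job = requests.get(f"{API}/v1/jobs/{job_id}", headers=HEADERS).json()
        if job["status"] == "completed":
            break
        # Real code should also bail out on failure statuses
        # (see the REST API overview) instead of looping forever.
        time.sleep(interval)
    result = requests.get(f"{API}/v1/jobs/{job_id}/result", headers=HEADERS)
    result.raise_for_status()
    return result.text  # pages separated by "---"
```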
Prefer a UI? /dashboard/crawls/new drives this exact two-step flow with a tree-view checkbox picker.