Syttra
REST API · select mode

Selection-based crawl

Crawl exactly the URLs you pick — no link discovery, no BFS. The usual flow is two requests: GET /v1/sitemap/preview to list candidates, then POST /v1/jobs with mode: "select" and the URLs you chose.

Quota cost equals the number of URLs you submit. The pre-flight check reserves exactly that amount; there is no over-counting as in full mode.

When to use it

  • The site has a sitemap and you want a coherent subset (one language, one section, one product category).
  • Full mode would burn through quota on irrelevant pages — e.g. fluxys.com's sitemap contains all 6 language variants of every page.
  • You already know the URLs you need and want to skip discovery entirely.
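Picking a coherent subset from the preview is plain client-side filtering before you submit. A minimal sketch (illustrative only; the URLs and helper name are hypothetical, not part of the API):

```python
# Illustrative only: filter a hypothetical preview response down to one
# language section before submitting a select-mode job.

def pick_language_section(urls, prefix):
    """Keep only URLs under the given path prefix (e.g. one language variant)."""
    return [u for u in urls if u.startswith(prefix)]

preview = [
    "https://example.com/nl/about",
    "https://example.com/fr/about",
    "https://example.com/nl/products",
    "https://example.com/en/products",
]

selected = pick_language_section(preview, "https://example.com/nl/")
print(selected)  # only the /nl/ pages remain
```

The same pattern works for one section or one product category: any predicate over the preview list produces the `urls` array for step 2.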

Step 1 — Discover URLs

GET /v1/sitemap/preview · 200 OK

Reads the seed domain's sitemap.xml (or sitemap_index.xml, or any sitemap advertised in robots.txt), and falls back to a shallow link-crawl of the seed page if none is reachable. No quota is consumed; this endpoint is purely informational.

Query parameters

url* · string (URL) · required
  Seed URL, usually a homepage. Discovery is same-domain only.

limit · integer · default: unlimited
  Optional cap on URLs returned. No cap by default; a server-side safety net stops at 50,000.

Example request

GET /v1/sitemap/preview
curl -G https://api.syttra.com/v1/sitemap/preview \
  -H "Authorization: Bearer sk_live_..." \
  --data-urlencode "url=https://www.qbus.be/nl"

Example response

200 OK
{
  "urls": [
    "https://www.qbus.be/nl/about",
    "https://www.qbus.be/nl/catalogus/dimmers",
    "https://www.qbus.be/nl/catalogus/controllers",
    ...
  ],
  "source": "sitemap",
  "count": 1048,
  "capped": false,
  "assets_filtered": 1261
}

Response fields

  • source: "sitemap" when at least one well-known or robots-advertised sitemap returned URLs; "shallow_crawl" when we fell back to scraping the homepage's <a href>s.
  • capped: true only if you passed an explicit limit and the result hit it. Always false when limit is omitted.
  • assets_filtered: how many URLs were dropped because they pointed at images, PDFs, RSS feeds, JS / CSS bundles, or other non-page assets. Filtering is done on the URL path so ?v=12345-style cache-busters don't fool it.
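The path-based filtering behind assets_filtered can be sketched as follows. This is an illustration of the described behavior, not the service's actual implementation, and the extension list is an assumption:

```python
# Sketch of path-based asset filtering, as described for assets_filtered.
# The real service's extension list is not documented here; this one is assumed.
from urllib.parse import urlsplit

ASSET_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf",
                    ".rss", ".js", ".css", ".woff2"}  # assumed list

def is_asset(url):
    """Decide on the URL *path* only, so ?v=12345 cache-busters don't matter."""
    path = urlsplit(url).path.lower()
    return any(path.endswith(ext) for ext in ASSET_EXTENSIONS)

pages = [u for u in [
    "https://example.com/nl/about",
    "https://example.com/assets/app.js?v=12345",
    "https://example.com/brochure.pdf",
] if not is_asset(u)]
```

Because urlsplit separates the query string from the path, `app.js?v=12345` is still recognized as a JS bundle, matching the cache-buster note above.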

Rate-limited at 20 req/min per user.

Step 2 — Submit the crawl

POST /v1/jobs · 202 Accepted

Submits the URLs you picked from the preview. The crawler fetches each one once — no link discovery.

Request body

url* · string (URL) · required
  Same-domain anchor; must share a host with every entry in `urls`. Acts as the security boundary so /v1/jobs can't be used as a generic any-site crawler proxy.

mode* · string enum · required
  Set to "select" for this guide.

urls* · array of URLs · required
  1-1000 URLs to crawl. Every entry must share its domain with `url`. Order is preserved (duplicates are dropped). max_pages is auto-set to len(urls); max_depth is forced to 1 (no link-following).

export_formats · array of strings · default: ["markdown"]
  Output formats. Currently: markdown, text. Multi-page output is concatenated with --- separators.

concurrency · integer · default: 5
  How many pages to fetch in parallel (1-20). Higher is faster, but harder on the target site.

respect_robots · boolean · default: true
  Check robots.txt before crawling. Disabling is only allowed with an Enterprise agreement.

check_tos · boolean · default: true
  Inspect the target site's ToS for anti-scraping clauses. Jobs fail with tos_blocked if a clause is detected.
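The multi-page concatenation described for export_formats can be sketched as below. Only the --- token is documented; the blank lines around it are an assumption:

```python
def join_pages(pages):
    """Concatenate per-page markdown with --- separators, as described for
    multi-page export_formats output. Blank lines around the rule are assumed."""
    return "\n\n---\n\n".join(pages)

doc = join_pages(["# About\nWho we are.", "# Dimmers\nProduct list."])
```

A downstream consumer can split on the same separator to recover the per-page documents.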

Example request

POST /v1/jobs
curl https://api.syttra.com/v1/jobs \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.qbus.be/nl",
    "mode": "select",
    "urls": [
      "https://www.qbus.be/nl/about",
      "https://www.qbus.be/nl/catalogus/dimmers",
      "https://www.qbus.be/nl/catalogus/controllers"
    ],
    "export_formats": ["markdown"]
  }'

Example response

202 Accepted
{
  "id": "c2d9f5e8-...",
  "status": "queued",
  "url": "https://www.qbus.be/nl",
  "mode": "select",
  "pages_total": 3,
  "links": {
    "self":   "/v1/jobs/c2d9f5e8-...",
    "result": "/v1/jobs/c2d9f5e8-.../result"
  }
}

Constraints

  • Same-domain only. Any cross-domain entry in urls causes a 422 validation_error with a message naming the offending URL.
  • 1 ≤ len(urls) ≤ 1000. Empty list or larger lists are rejected.
  • max_pages and max_depth are derived. Passing them explicitly is harmless but ignored — max_pages = len(urls) and max_depth = 1.
  • Order preserved, duplicates collapsed. If you send the same URL twice, the second occurrence is silently dropped.
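These constraints can be checked client-side before submitting, so a bad batch fails locally instead of with a 422. A sketch under the rules above (the function name is hypothetical, not part of any SDK):

```python
# Client-side pre-check mirroring the documented constraints:
# same host as the anchor, 1-1000 entries, order-preserving dedup.
from urllib.parse import urlsplit

def prepare_urls(anchor, urls):
    host = urlsplit(anchor).hostname
    seen, out = set(), []
    for u in urls:
        if urlsplit(u).hostname != host:
            raise ValueError(f"cross-domain entry: {u}")
        if u not in seen:          # duplicates collapsed, order preserved
            seen.add(u)
            out.append(u)
    if not 1 <= len(out) <= 1000:
        raise ValueError("need between 1 and 1000 URLs")
    return out
```

Running the same dedup locally also makes the quota cost predictable, since the server charges per submitted URL.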

What's next

Polling, multi-page download format, and error handling live in the REST API overview.

Prefer a UI? /dashboard/crawls/new drives this exact two-step flow with a tree-view checkbox picker.