Syttra
REST API · select mode

Selection-based crawl

Crawl exactly the URLs you pick — no link discovery, no BFS. The usual flow is two requests: GET /v1/sitemap/preview to list candidates, then POST /v1/jobs with mode: "select" and the URLs you chose.

Quota cost equals the number of URLs you submit. The pre-flight check reserves exactly that amount; there is no over-counting as in full mode.

When to use it

  • The site has a sitemap and you want a coherent subset (one language, one section, one product category).
  • Full mode would burn through quota on irrelevant pages — e.g. fluxys.com's sitemap contains all 6 language variants of every page.
  • You already know the URLs you need and want to skip discovery entirely.
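Picking a coherent subset from the preview is plain client-side filtering before you submit. A minimal sketch (illustrative only; the URLs and helper name are hypothetical, not part of the API):

```python
# Illustrative only: filter a hypothetical preview response down to one
# language section before submitting a select-mode job.

def pick_language_section(urls, prefix):
    """Keep only URLs under the given path prefix (e.g. one language variant)."""
    return [u for u in urls if u.startswith(prefix)]

preview = [
    "https://example.com/nl/about",
    "https://example.com/fr/about",
    "https://example.com/nl/products",
    "https://example.com/en/products",
]

selected = pick_language_section(preview, "https://example.com/nl/")
print(selected)  # only the /nl/ pages remain
```

The same pattern works for one section or one product category: any predicate over the preview list produces the `urls` array for step 2.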

Step 1 — Discover URLs

GET /v1/sitemap/preview · 200 OK

Reads the seed domain's sitemap.xml (or sitemap_index.xml, or any sitemap advertised in robots.txt), and falls back to a shallow link-crawl of the seed page if none is reachable. No quota is consumed; this endpoint is purely informational.

Query parameters

url* · string (URL) · required
  Seed URL, usually a homepage. Discovery is same-domain only.

limit · integer · default: unlimited
  Optional cap on URLs returned. No cap by default; a server-side safety net stops at 50,000.

Example request

GET /v1/sitemap/preview
curl -G https://api.syttra.com/v1/sitemap/preview \
  -H "Authorization: Bearer sk_live_..." \
  --data-urlencode "url=https://www.qbus.be/nl"

Example response

200 OK
{
  "urls": [
    "https://www.qbus.be/nl/about",
    "https://www.qbus.be/nl/catalogus/dimmers",
    "https://www.qbus.be/nl/catalogus/controllers",
    ...
  ],
  "source": "sitemap",
  "count": 1048,
  "capped": false,
  "assets_filtered": 1261
}

Response fields

  • source: "sitemap" when at least one well-known or robots-advertised sitemap returned URLs; "shallow_crawl" when we fell back to scraping the homepage's <a href>s.
  • capped: true only if you passed an explicit limit and the result hit it. Always false when limit is omitted.
  • assets_filtered: how many URLs were dropped because they pointed at images, PDFs, RSS feeds, JS / CSS bundles, or other non-page assets. Filtering is done on the URL path so ?v=12345-style cache-busters don't fool it.
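The path-based filtering behind assets_filtered can be sketched as follows. This is an illustration of the described behavior, not the service's actual implementation, and the extension list is an assumption:

```python
# Sketch of path-based asset filtering, as described for assets_filtered.
# The real service's extension list is not documented here; this one is assumed.
from urllib.parse import urlsplit

ASSET_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf",
                    ".rss", ".js", ".css", ".woff2"}  # assumed list

def is_asset(url):
    """Decide on the URL *path* only, so ?v=12345 cache-busters don't matter."""
    path = urlsplit(url).path.lower()
    return any(path.endswith(ext) for ext in ASSET_EXTENSIONS)

pages = [u for u in [
    "https://example.com/nl/about",
    "https://example.com/assets/app.js?v=12345",
    "https://example.com/brochure.pdf",
] if not is_asset(u)]
```

Because urlsplit separates the query string from the path, `app.js?v=12345` is still recognized as a JS bundle, matching the cache-buster note above.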

Rate-limited at 20 req/min per user.

Step 2 — Submit the crawl

POST /v1/jobs · 202 Accepted

Submits the URLs you picked from the preview. The crawler fetches each one once — no link discovery.

Request body

url* · string (URL) · required
  Same-domain anchor; must share a host with every entry in `urls`. Acts as the security boundary so /v1/jobs can't be used as a generic any-site crawler proxy.

mode* · string enum · required
  Set to "select" for this guide.

urls* · array of URLs · required
  1-1000 URLs to crawl. Every entry must share its domain with `url`. Order is preserved (duplicates are dropped). max_pages is auto-set to len(urls); max_depth is forced to 1 (no link-following).

export_formats · array of strings · default: ["markdown"]
  Output formats. Currently: markdown, text. Multi-page output is concatenated with --- separators.

concurrency · integer · default: 5
  How many pages to fetch in parallel (1-20). Higher is faster, but harder on the target site.

respect_robots · boolean · default: true
  Check robots.txt before crawling. Disabling is only allowed with an Enterprise agreement.

check_tos · boolean · default: true
  Inspect the target site's ToS for anti-scraping clauses. Jobs fail with tos_blocked if a clause is detected.
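The multi-page concatenation described for export_formats can be sketched as below. Only the --- token is documented; the blank lines around it are an assumption:

```python
def join_pages(pages):
    """Concatenate per-page markdown with --- separators, as described for
    multi-page export_formats output. Blank lines around the rule are assumed."""
    return "\n\n---\n\n".join(pages)

doc = join_pages(["# About\nWho we are.", "# Dimmers\nProduct list."])
```

A downstream consumer can split on the same separator to recover the per-page documents.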

Example request

POST /v1/jobs
curl https://api.syttra.com/v1/jobs \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.qbus.be/nl",
    "mode": "select",
    "urls": [
      "https://www.qbus.be/nl/about",
      "https://www.qbus.be/nl/catalogus/dimmers",
      "https://www.qbus.be/nl/catalogus/controllers"
    ],
    "export_formats": ["markdown"]
  }'

Example response

202 Accepted
{
  "id": "c2d9f5e8-...",
  "status": "queued",
  "url": "https://www.qbus.be/nl",
  "mode": "select",
  "pages_total": 3,
  "links": {
    "self":   "/v1/jobs/c2d9f5e8-...",
    "result": "/v1/jobs/c2d9f5e8-.../result"
  }
}

Constraints

  • Same-domain only. Any cross-domain entry in urls causes a 422 validation_error with a message naming the offending URL.
  • 1 ≤ len(urls) ≤ 1000. Empty list or larger lists are rejected.
  • max_pages and max_depth are derived. Passing them explicitly is harmless but ignored — max_pages = len(urls) and max_depth = 1.
  • Order preserved, duplicates collapsed. If you send the same URL twice, the second occurrence is silently dropped.
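These constraints can be checked client-side before submitting, so a bad batch fails locally instead of with a 422. A sketch under the rules above (the function name is hypothetical, not part of any SDK):

```python
# Client-side pre-check mirroring the documented constraints:
# same host as the anchor, 1-1000 entries, order-preserving dedup.
from urllib.parse import urlsplit

def prepare_urls(anchor, urls):
    host = urlsplit(anchor).hostname
    seen, out = set(), []
    for u in urls:
        if urlsplit(u).hostname != host:
            raise ValueError(f"cross-domain entry: {u}")
        if u not in seen:          # duplicates collapsed, order preserved
            seen.add(u)
            out.append(u)
    if not 1 <= len(out) <= 1000:
        raise ValueError("need between 1 and 1000 URLs")
    return out
```

Running the same dedup locally also makes the quota cost predictable, since the server charges per submitted URL.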

What's next

Polling, multi-page download format, and error handling live in the REST API overview.

Prefer a UI? /dashboard/crawls/new drives this exact two-step flow with a tree-view checkbox picker.