# Full-site crawl
Start from one URL and follow internal links breadth-first. Bounded by `max_depth` (how far from the seed) and `max_pages` (the hard cap on total pages). Same-domain only — links to other hosts are ignored.
## When to use it
- You want everything reachable from a seed URL.
- The site has no useful sitemap (or you want fresher discovery than the sitemap shows).
- You don't mind paying quota for some pages you'll never read — full mode reserves the entire `max_pages` budget upfront.
If you only want specific pages, prefer select mode — it lets you pick exact URLs from the sitemap before any quota is spent.
## Create a full-site job
`POST /v1/jobs` → `202 Accepted`

Submits a seed URL. The crawler walks all same-domain internal links via BFS until it hits `max_depth` or `max_pages`.
### Request body
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url`* | string (URL) | required | Seed URL. Must be a public HTTPS URL. Same-domain link discovery starts here. |
| `mode`* | string enum | required | Set to `"full"` for this guide. |
| `export_formats` | array of strings | `["markdown"]` | Output formats. Currently: `markdown`, `text`. Multi-page output is concatenated with `---` separators. |
| `max_depth` | integer | `3` | Maximum link depth from the seed, 1-10. Depth 1 = pages directly linked from the seed; 2 = links of those pages; etc. |
| `max_pages` | integer | `50` | Hard cap on total pages crawled, 1-1000. Quota pre-flight reserves this number — pick conservatively. |
| `concurrency` | integer | `5` | How many pages to fetch in parallel, 1-20. Higher = faster, but harder on the target site. |
| `respect_robots` | boolean | `true` | Check `robots.txt` before crawling. Disabling is only allowed with an Enterprise agreement. |
| `check_tos` | boolean | `true` | Inspect the target site's ToS for anti-scraping clauses. Jobs fail with `tos_blocked` if detected. |
The `urls` field is select-mode only and is rejected here with `422 validation_error`.
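For instance, a request like this sketch (the `urls` payload here is purely illustrative) is rejected up front:

```bash
# Invalid in full mode: "urls" belongs to select mode only.
curl https://api.syttra.com/v1/jobs \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "mode": "full",
    "urls": ["https://example.com/docs"]
  }'
# → HTTP 422 with error code "validation_error"
```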
### Example request
```bash
# POST /v1/jobs
curl https://api.syttra.com/v1/jobs \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "mode": "full",
    "max_depth": 2,
    "max_pages": 30,
    "concurrency": 5,
    "export_formats": ["markdown"]
  }'
```

### Example response
`202 Accepted`

```json
{
  "id": "b1c8e4d7-...",
  "status": "queued",
  "url": "https://example.com",
  "mode": "full",
  "links": {
    "self": "/v1/jobs/b1c8e4d7-...",
    "result": "/v1/jobs/b1c8e4d7-.../result"
  }
}
```

## Behaviour notes
- Same-domain only. Links to other hostnames are silently dropped. Use multiple jobs (or select mode) for cross-domain crawls.
- Asset URLs are skipped. Links to `.png`, `.pdf`, `.css`, etc. are filtered before queuing — they don't count against `max_pages`.
- Quota pre-flight reserves the budget. Submitting `max_pages=50` when you've already used 60 of a 100-page allowance returns `402 quota_exceeded` even if the actual crawl would only fetch 12 (see the sketch after this list for one way to handle this).
- BFS, not DFS. Pages closer to the seed are crawled first — handy if the crawl is cancelled mid-flight: the most relevant pages land first.
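A minimal sketch covering the first and third notes, assuming `jq` is installed and a `SYTTRA_KEY` environment variable holds your key (both are our own conventions, not part of the API): it submits one full-mode job per seed host, and halves `max_pages` once if the quota pre-flight answers `402 quota_exceeded`.

```bash
#!/usr/bin/env bash
# Sketch only. Assumes jq and SYTTRA_KEY; neither is required by the API itself.
set -euo pipefail

seeds=("https://example.com" "https://docs.example.org")  # one job per host
pages_default=50

for seed in "${seeds[@]}"; do
  pages=$pages_default
  for _ in 1 2; do
    # -w prints the HTTP status; the response body lands in /tmp/job.json.
    code=$(curl -s -o /tmp/job.json -w '%{http_code}' https://api.syttra.com/v1/jobs \
      -H "Authorization: Bearer $SYTTRA_KEY" \
      -H "Content-Type: application/json" \
      -d "{\"url\": \"$seed\", \"mode\": \"full\", \"max_pages\": $pages}")
    if [ "$code" = "202" ]; then
      echo "queued $seed as job $(jq -r .id /tmp/job.json)"
      break
    elif [ "$code" = "402" ]; then
      pages=$((pages / 2))  # pre-flight rejected the reservation; try a smaller cap
      echo "quota_exceeded for $seed, retrying with max_pages=$pages"
    else
      echo "unexpected HTTP $code for $seed" >&2
      break
    fi
  done
done
```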
## What's next
Polling, multi-page download format, and error handling live in the REST API overview. Quick links:
- `GET /v1/jobs/{id}` — poll until `completed`
- `GET /v1/jobs/{id}/result` — multi-page output is one document with `---` separators
- Error envelope
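As a taste of the first two links, a minimal polling loop. The 5-second interval and 60-attempt cap are our own choices, and status values other than `queued` and `completed` are covered in the overview:

```bash
# Minimal polling sketch. Assumes jq; JOB_ID comes from the 202 response above.
JOB_ID="b1c8e4d7-..."  # replace with a real job id

for _ in $(seq 1 60); do
  status=$(curl -s "https://api.syttra.com/v1/jobs/$JOB_ID" \
    -H "Authorization: Bearer sk_live_..." | jq -r .status)
  echo "status: $status"
  [ "$status" = "completed" ] && break
  sleep 5
done

# Multi-page output is one document with --- separators.
curl -s "https://api.syttra.com/v1/jobs/$JOB_ID/result" \
  -H "Authorization: Bearer sk_live_..." -o result.md
```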