# Full-site crawl
Start from one URL and follow internal links breadth-first. Bounded by `max_depth` (how far from the seed) and `max_pages` (the hard cap on total pages). Same-domain only — links to other hosts are ignored.
## When to use it
- You want everything reachable from a seed URL.
- The site has no useful sitemap (or you want fresher discovery than the sitemap shows).
- You don't mind paying quota for some pages you'll never read — full mode reserves the entire `max_pages` budget upfront.
If you only want specific pages, prefer select mode — it lets you pick exact URLs from the sitemap before any quota is spent.
## Create a full-site job
`POST /v1/jobs` → `202 Accepted`

Submits a seed URL. The crawler walks all same-domain internal links via BFS until it hits `max_depth` or `max_pages`.
### Request body
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url`* | string (URL) | required | Seed URL. Must be a public HTTPS URL. Same-domain link discovery starts here. |
| `mode`* | string enum | required | Set to `"full"` for this guide. |
| `export_formats` | array of strings | `["markdown"]` | Output formats. Currently: `markdown`, `text`. Multi-page output is concatenated with `---` separators. |
| `max_depth` | integer | `3` | Maximum link depth from the seed, 1-10. Depth 1 = pages directly linked from the seed; 2 = links of those pages; etc. |
| `max_pages` | integer | `50` | Hard cap on total pages crawled, 1-1000. Quota pre-flight reserves this number — pick conservatively. |
| `concurrency` | integer | `5` | How many pages to fetch in parallel, 1-20. Higher = faster, but harder on the target site. |
| `respect_robots` | boolean | `true` | Check `robots.txt` before crawling. Disabling is only allowed with an Enterprise agreement. |
| `check_tos` | boolean | `true` | Inspect the target site's ToS for anti-scraping clauses. Jobs fail with `tos_blocked` if detected. |
The `urls` field is select-mode only and is rejected here with `422 validation_error`.
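For instance, a request like this sketch (the `urls` payload here is purely illustrative) is rejected up front:

```bash
# Invalid in full mode: "urls" belongs to select mode only.
curl https://api.syttra.com/v1/jobs \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "mode": "full",
    "urls": ["https://example.com/docs"]
  }'
# → HTTP 422 with error code "validation_error"
```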
### Example request
```bash
# POST /v1/jobs
curl https://api.syttra.com/v1/jobs \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "mode": "full",
    "max_depth": 2,
    "max_pages": 30,
    "concurrency": 5,
    "export_formats": ["markdown"]
  }'
```

### Example response
`202 Accepted`

```json
{
  "id": "b1c8e4d7-...",
  "status": "queued",
  "url": "https://example.com",
  "mode": "full",
  "links": {
    "self": "/v1/jobs/b1c8e4d7-...",
    "result": "/v1/jobs/b1c8e4d7-.../result"
  }
}
```

## Behaviour notes
- Same-domain only. Links to other hostnames are silently dropped. Use multiple jobs (or select mode) for cross-domain crawls.
- Asset URLs are skipped. Links to `.png`, `.pdf`, `.css`, etc. are filtered before queuing — they don't count against `max_pages`.
- Quota pre-flight reserves the budget. Submitting `max_pages=50` when you've already used 60 of a 100-page allowance returns `402 quota_exceeded` even if the actual crawl would only fetch 12 (see the sketch after this list for one way to handle this).
- BFS, not DFS. Pages closer to the seed are crawled first — handy if the crawl is cancelled mid-flight: the most relevant pages land first.
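A minimal sketch covering the first and third notes, assuming `jq` is installed and a `SYTTRA_KEY` environment variable holds your key (both are our own conventions, not part of the API): it submits one full-mode job per seed host, and halves `max_pages` once if the quota pre-flight answers `402 quota_exceeded`.

```bash
#!/usr/bin/env bash
# Sketch only. Assumes jq and SYTTRA_KEY; neither is required by the API itself.
set -euo pipefail

seeds=("https://example.com" "https://docs.example.org")  # one job per host
pages_default=50

for seed in "${seeds[@]}"; do
  pages=$pages_default
  for _ in 1 2; do
    # -w prints the HTTP status; the response body lands in /tmp/job.json.
    code=$(curl -s -o /tmp/job.json -w '%{http_code}' https://api.syttra.com/v1/jobs \
      -H "Authorization: Bearer $SYTTRA_KEY" \
      -H "Content-Type: application/json" \
      -d "{\"url\": \"$seed\", \"mode\": \"full\", \"max_pages\": $pages}")
    if [ "$code" = "202" ]; then
      echo "queued $seed as job $(jq -r .id /tmp/job.json)"
      break
    elif [ "$code" = "402" ]; then
      pages=$((pages / 2))  # pre-flight rejected the reservation; try a smaller cap
      echo "quota_exceeded for $seed, retrying with max_pages=$pages"
    else
      echo "unexpected HTTP $code for $seed" >&2
      break
    fi
  done
done
```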
## What's next
Polling, multi-page download format, and error handling live in the REST API overview. Quick links:
- `GET /v1/jobs/{id}` — poll until `completed`
- `GET /v1/jobs/{id}/result` — multi-page output is one document with `---` separators
- Error envelope
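As a taste of the first two links, a minimal polling loop. The 5-second interval and 60-attempt cap are our own choices, and status values other than `queued` and `completed` are covered in the overview:

```bash
# Minimal polling sketch. Assumes jq; JOB_ID comes from the 202 response above.
JOB_ID="b1c8e4d7-..."  # replace with a real job id

for _ in $(seq 1 60); do
  status=$(curl -s "https://api.syttra.com/v1/jobs/$JOB_ID" \
    -H "Authorization: Bearer sk_live_..." | jq -r .status)
  echo "status: $status"
  [ "$status" = "completed" ] && break
  sleep 5
done

# Multi-page output is one document with --- separators.
curl -s "https://api.syttra.com/v1/jobs/$JOB_ID/result" \
  -H "Authorization: Bearer sk_live_..." -o result.md
```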