
# Crawling

> Crawl websites and convert pages into structured JSON or markdown.

Kadoa crawls all accessible subpages of a website and converts them into structured JSON or markdown. This guide covers:

* Initiating crawl sessions with single or multiple URLs
* Checking crawl status
* Listing crawled pages
* Accessing page content

## Prerequisites

* Kadoa account with [API key](/api-reference/introduction)
* SDK installed: `npm install @kadoa/node-sdk` or `uv add kadoa-sdk`

## 1. Start a Crawl

Start a crawl session with a single URL or multiple URLs from the same domain.

[View full API reference →](/api-reference/crawling/start-crawling-session)

<CodeGroup>
  ```typescript Node SDK theme={null}
  import { KadoaClient } from '@kadoa/node-sdk';

  const client = new KadoaClient({ apiKey: 'YOUR_API_KEY' });

  const result = await client.crawler.session.start({
    url: "https://demo.vercel.store/",
    maxDepth: 10,
    maxPages: 50,
  });

  console.log(result.sessionId);
  ```

  ```python Python SDK theme={null}
  from kadoa_sdk import KadoaClient, KadoaClientConfig

  client = KadoaClient(KadoaClientConfig(api_key="YOUR_API_KEY"))

  result = client.crawler.session.start({
      "url": "https://demo.vercel.store/",
      "maxDepth": 10,
      "maxPages": 50
  })

  print(result.session_id)
  ```

  ```bash REST API theme={null}
  curl -X POST https://api.kadoa.com/v4/crawl \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url": "https://demo.vercel.store/",
      "maxDepth": 10,
      "maxPages": 50
    }'
  ```

  ```json Response theme={null}
  {
    "message": "Session started",
    "sessionId": "YOUR_SESSION_ID",
    "error": null
  }
  ```
</CodeGroup>

### Multiple URLs

Crawl from multiple entry points on the same domain:

<CodeGroup>
  ```typescript Node SDK theme={null}
  const result = await client.crawler.session.start({
    startUrls: [
      "https://demo.vercel.store/",
      "https://demo.vercel.store/collections",
      "https://demo.vercel.store/about",
    ],
    maxDepth: 10,
    maxPages: 50,
  });
  ```

  ```python Python SDK theme={null}
  result = client.crawler.session.start({
      "startUrls": [
          "https://demo.vercel.store/",
          "https://demo.vercel.store/collections",
          "https://demo.vercel.store/about"
      ],
      "maxDepth": 10,
      "maxPages": 50
  })
  ```

  ```bash REST API theme={null}
  curl -X POST https://api.kadoa.com/v4/crawl \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "startUrls": [
        "https://demo.vercel.store/",
        "https://demo.vercel.store/collections",
        "https://demo.vercel.store/about"
      ],
      "maxDepth": 10,
      "maxPages": 50
    }'
  ```
</CodeGroup>

<Note>
  When using `startUrls`, all URLs must belong to the same domain; subdomains of that domain are allowed. For example, you can mix `www.example.com` and `shop.example.com`, but not `example.com` and `different-site.com`.
</Note>
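
The same-domain rule can be checked on the client before starting a session. The following is a minimal sketch, not part of the SDK: `site_key` and `same_site` are hypothetical helpers, and comparing the last two hostname labels is an assumption that only approximates the rule (it mishandles multi-part suffixes such as `.co.uk`). The API's own matching logic may differ.

```python
from urllib.parse import urlsplit


def site_key(url: str) -> str:
    """Return the last two hostname labels, e.g. 'shop.example.com' -> 'example.com'.

    Assumption: a naive stand-in for the registered domain. It does not
    consult the public suffix list, so 'a.co.uk' and 'b.co.uk' would collide.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def same_site(urls: list[str]) -> bool:
    """True when every URL shares one site key (hypothetical pre-flight check)."""
    return len({site_key(u) for u in urls}) == 1
```

Calling `same_site(start_urls)` before `session.start` gives an early, local error instead of a rejected request.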

## 2. Check Crawl Status

Poll the session status to track progress. The crawl is complete when `finished` is `true`.

[View full API reference →](/api-reference/crawling/crawling-session-status)

<CodeGroup>
  ```typescript Node SDK theme={null}
  // sessionId is the value returned by session.start() in step 1
  const status = await client.crawler.session.getSessionStatus(sessionId);

  console.log(status.payload.crawledPages);
  console.log(status.payload.finished);
  ```

  ```python Python SDK theme={null}
  # session_id is the value returned by session.start() in step 1
  status = client.crawler.session.get_session_status(session_id)

  print(status.payload["crawledPages"])
  print(status.payload["finished"])
  ```

  ```bash REST API theme={null}
  curl -X GET https://api.kadoa.com/v4/crawl/YOUR_SESSION_ID/status \
    -H "x-api-key: YOUR_API_KEY"
  ```

  ```json Response theme={null}
  {
    "payload": {
      "crawledPages": 14,
      "finished": true
    },
    "sessionId": "YOUR_SESSION_ID",
    "error": null
  }
  ```
</CodeGroup>
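
Because a crawl runs asynchronously, the status call is usually repeated until `finished` becomes `true`. The sketch below shows one way to do that. It assumes `fetch_status` is a callable you supply that returns a dict shaped like the JSON response above; the function name, the polling interval, and the timeout are illustrative choices, not SDK features.

```python
import time
from typing import Any, Callable, Mapping


def wait_until_finished(
    fetch_status: Callable[[], Mapping[str, Any]],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Poll until payload['finished'] is true, then return the last status.

    fetch_status is any zero-argument callable returning a dict shaped like
    the JSON status response, e.g. a wrapper around the SDK or REST call.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status.get("error"):
            raise RuntimeError(f"Crawl status error: {status['error']}")
        if status["payload"]["finished"]:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Crawl did not finish before the timeout")
        sleep(poll_seconds)
```

If polling is undesirable, the `callbackUrl` option listed under Configuration Options is the documented alternative for completion notifications.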

## 3. List Crawled Pages

Get a paginated list of crawled pages with their statuses.

[View full API reference →](/api-reference/crawling/get-crawled-pages)

<CodeGroup>
  ```typescript Node SDK theme={null}
  const pages = await client.crawler.session.getPages(sessionId, {
    currentPage: 1,
    pageSize: 100,
  });

  for (const page of pages.payload) {
    console.log(page.id, page.url, page.status);
  }
  ```

  ```python Python SDK theme={null}
  pages = client.crawler.session.get_pages(session_id, {
      "current_page": 1,
      "page_size": 100
  })

  for page in pages.payload:
      print(page["id"], page["url"], page["status"])
  ```

  ```bash REST API theme={null}
  curl -X GET "https://api.kadoa.com/v4/crawl/YOUR_SESSION_ID/pages?currentPage=1&pageSize=100" \
    -H "x-api-key: YOUR_API_KEY"
  ```

  ```json Response theme={null}
  {
    "payload": [
      { "id": "YOUR_PAGE_ID", "url": "https://demo.vercel.store/", "status": "DONE" },
      { "id": "abc123", "url": "https://demo.vercel.store/about", "status": "CRAWLING" }
    ],
    "pagination": {
      "currentPage": 1,
      "pageSize": 100,
      "totalRecords": 132,
      "totalPages": 2
    },
    "sessionId": "YOUR_SESSION_ID",
    "error": null
  }
  ```
</CodeGroup>

Page statuses: `DONE`, `CRAWLING`, `PENDING`
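
When a session has more records than one page holds (the sample response reports 132 records across 2 pages), every page of results has to be requested in turn. A minimal sketch of that loop follows. It assumes `fetch_page` is a callable you supply that returns a dict shaped like the JSON response above, and that page numbering starts at 1 as in that sample; `iter_crawled_pages` is a hypothetical helper, not an SDK method.

```python
from typing import Any, Callable, Iterator, Mapping


def iter_crawled_pages(
    fetch_page: Callable[[int, int], Mapping[str, Any]],
    page_size: int = 100,
) -> Iterator[Mapping[str, Any]]:
    """Yield every page record by following pagination.totalPages.

    fetch_page(current_page, page_size) must return a dict shaped like the
    JSON list response. Assumption: page numbering starts at 1.
    """
    current = 1
    while True:
        response = fetch_page(current, page_size)
        yield from response["payload"]
        if current >= response["pagination"]["totalPages"]:
            return
        current += 1
```

Filtering the result, for example `[p for p in iter_crawled_pages(fetch_page) if p["status"] == "DONE"]`, gives the pages whose content is ready to retrieve.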

## 4. Retrieve Page Content

Get page content in markdown (LLM-ready) or HTML format.

[View full API reference →](/api-reference/crawling/get-crawled-page-meta)

<CodeGroup>
  ```typescript Node SDK theme={null}
  // Get as markdown (pageId is a page.id value from step 3)
  const markdown = await client.crawler.session.getPage(sessionId, pageId, {
    format: "markdown",
  });

  console.log(markdown.payload);

  // Get as HTML
  const html = await client.crawler.session.getPage(sessionId, pageId, {
    format: "html",
  });
  ```

  ```python Python SDK theme={null}
  # Get as markdown (page_id is a page["id"] value from step 3)
  markdown = client.crawler.session.get_page(
      session_id,
      page_id,
      {"format": "markdown"}
  )

  print(markdown.payload)

  # Get as HTML
  html = client.crawler.session.get_page(
      session_id,
      page_id,
      {"format": "html"}
  )
  ```

  ```bash REST API theme={null}
  # Markdown format
  curl -X GET "https://api.kadoa.com/v4/crawl/YOUR_SESSION_ID/pages/YOUR_PAGE_ID?format=md" \
    -H "x-api-key: YOUR_API_KEY"

  # HTML format
  curl -X GET "https://api.kadoa.com/v4/crawl/YOUR_SESSION_ID/pages/YOUR_PAGE_ID?format=html" \
    -H "x-api-key: YOUR_API_KEY"
  ```

  ```json Response theme={null}
  {
    "payload": "# Sample Markdown\n\n## Heading 2\n\nThis is **bold** text.",
    "pageId": "YOUR_PAGE_ID",
    "pageFormat": "md",
    "sessionId": "YOUR_SESSION_ID",
    "error": null
  }
  ```
</CodeGroup>

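Supported values for the REST `format` query parameter: `md` (markdown) and `html`. Note that the SDK examples above pass `"markdown"` rather than `md`.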

## Configuration Options

| Parameter        | Type      | Default | Description                                        |
| ---------------- | --------- | ------- | -------------------------------------------------- |
| `url`            | string    | -       | Single URL to crawl (use this or `startUrls`)      |
| `startUrls`      | string\[] | -       | Multiple URLs to crawl (use this or `url`)         |
| `maxDepth`       | number    | -       | Maximum crawl depth from entry points              |
| `maxPages`       | number    | -       | Maximum pages to crawl                             |
| `maxMatches`     | number    | -       | Stop after N matched pages (with blueprint)        |
| `pathsFilterIn`  | string\[] | -       | Regex patterns to include paths                    |
| `pathsFilterOut` | string\[] | -       | Regex patterns to exclude paths                    |
| `proxyType`      | string    | null    | Proxy type: `"dc"` (datacenter) or `"residential"` |
| `proxyCountry`   | string    | -       | ISO country code for proxy location                |
| `concurrency`    | number    | 20      | Number of parallel crawlers                        |
| `timeout`        | number    | -       | Request timeout in milliseconds                    |
| `strictDomain`   | boolean   | true    | Stay within the same domain                        |
| `loadImages`     | boolean   | true    | Load images during crawl                           |
| `callbackUrl`    | string    | -       | Webhook URL for completion notification            |

### Artifact Options

| Parameter        | Type    | Default | Description                   |
| ---------------- | ------- | ------- | ----------------------------- |
| `screenshot`     | boolean | false   | Capture page screenshots      |
| `screenshotFull` | boolean | false   | Capture full-page screenshots |
| `archivePdf`     | boolean | false   | Generate PDF archives         |

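The options in the tables above are combined into a single request body. The sketch below assembles one while enforcing the "use `url` or `startUrls`" rule. `build_crawl_request` is a hypothetical helper, not part of the SDK, and the regex patterns and option values are illustrative only; how the API applies the patterns to paths is not specified here.

```python
from typing import Any, Optional, Sequence


def build_crawl_request(
    url: Optional[str] = None,
    start_urls: Optional[Sequence[str]] = None,
    **options: Any,
) -> dict:
    """Assemble a crawl request body (hypothetical helper, not part of the SDK).

    Enforces the "url or startUrls" rule from the table and drops options
    left as None so that server-side defaults apply.
    """
    if (url is None) == (start_urls is None):
        raise ValueError("Provide exactly one of url or start_urls")
    body: dict = {"url": url} if url is not None else {"startUrls": list(start_urls)}
    body.update({key: value for key, value in options.items() if value is not None})
    return body


# Illustrative values: crawl product pages only, skip the cart, take screenshots.
request_body = build_crawl_request(
    url="https://demo.vercel.store/",
    maxDepth=3,
    maxPages=50,
    pathsFilterIn=[r"^/product/"],
    pathsFilterOut=[r"^/cart"],
    screenshot=True,
    proxyType=None,  # dropped from the body, so the default applies
)
```

The resulting dict can be passed to `client.crawler.session.start(...)` or sent as the JSON body of the REST request shown in step 1.
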
## Error Handling

| Error                 | Cause                      | Solution                    |
| --------------------- | -------------------------- | --------------------------- |
| 401 Unauthorized      | Invalid API key            | Verify API key in dashboard |
| 402 Payment Required  | Insufficient credits       | Top up account credits      |
| 404 Not Found         | Invalid session or page ID | Verify ID exists            |
| 429 Too Many Requests | Rate limit exceeded        | Reduce request frequency    |

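The table's advice for 429 responses, reducing request frequency, is commonly implemented as a retry with exponential backoff. The sketch below is one such approach, not a documented Kadoa behavior. It assumes `send` is a callable you supply that returns an `(http_status, json_body)` tuple; the function name, attempt count, and delays are illustrative.

```python
import time
from typing import Callable, Tuple

# Hints taken from the error table; everything else here is illustrative.
ERROR_HINTS = {
    401: "Verify API key in dashboard",
    402: "Top up account credits",
    404: "Verify ID exists",
    429: "Reduce request frequency",
}


def call_with_retry(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() and retry only on 429, doubling the delay each attempt.

    send is any zero-argument callable returning (http_status, json_body).
    Other 4xx/5xx statuses raise immediately with a hint from the table.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status == 429 and attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
            continue
        if status >= 400:
            hint = ERROR_HINTS.get(status, "See the API reference")
            raise RuntimeError(f"HTTP {status}: {hint}")
        return body
    raise RuntimeError("max_attempts must be at least 1")
```

Errors such as 401, 402, and 404 are not retried because repeating the same request cannot fix an invalid key, missing credits, or an unknown ID.
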
## Next Steps

* [Set up webhooks](/docs/integrations/webhooks) for crawl completion notifications
* [API Reference →](/api-reference/crawling/start-crawling-session)