Introduction

Crawling automatically visits all accessible subpages of a website and converts them into structured JSON or Markdown. This guide shows you how to:

  • Initiate a crawling session
  • Check crawling session status
  • List crawled pages
  • Access crawled page content

Prerequisites

To get the most out of this guide, you’ll need a Kadoa API key, which is passed in the x-api-key header of every request.

1. Start a Crawl

To initiate a web crawl, send a POST request to the /crawl endpoint with the desired configuration.

curl -X POST "https://api.kadoa.com/v4/crawl" \
  -H "Content-Type: application/json" \
  -H "x-api-key: <Kadoa API Key>" \
  -d '{
    "url": "https://demo.vercel.store/",
    "maxDepth": 10,
    "maxPages": 50
  }'

See API Reference

A successful response should look like this:

{
  "message": "Session started",
  "sessionId": "<sessionId>",
  "error": null
}
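
The sessionId identifies this crawl in all subsequent requests. One way to capture it in a shell variable (a sketch that assumes jq is installed) is:

SESSION_ID=$(curl -s -X POST "https://api.kadoa.com/v4/crawl" \
  -H "Content-Type: application/json" \
  -H "x-api-key: <Kadoa API Key>" \
  -d '{"url": "https://demo.vercel.store/", "maxDepth": 10, "maxPages": 50}' \
  | jq -r '.sessionId')

The remaining examples use $SESSION_ID wherever <sessionId> appears.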

Use pathsFilterIn and pathsFilterOut to include or exclude specific paths. Adjust timeout, maxDepth, and maxPages to refine the crawling process.
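
The exact filter syntax is documented in the API Reference; as an illustrative sketch, assuming both parameters accept arrays of path patterns, a crawl limited to product pages might look like this:

curl -X POST "https://api.kadoa.com/v4/crawl" \
  -H "Content-Type: application/json" \
  -H "x-api-key: <Kadoa API Key>" \
  -d '{
    "url": "https://demo.vercel.store/",
    "maxPages": 50,
    "pathsFilterIn": ["/product"],
    "pathsFilterOut": ["/search"]
  }'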

2. Check Crawl Status

Monitor the progress of your crawling session using the /crawl/<sessionId>/status endpoint.

curl -X GET "https://api.kadoa.com/v4/crawl/<sessionId>/status" \
  -H "x-api-key: <Kadoa API Key>"

See API Reference

The response shows how many pages have been crawled and whether the session has finished:

{
  "payload": {
    "crawledPages": 14,
    "finished": true
  },
  "sessionId": "<sessionId>",
  "error": null
}
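
Because a crawl runs asynchronously, a common pattern is to poll this endpoint until finished is true. A minimal polling sketch (assuming jq and the $SESSION_ID variable from step 1):

# Poll every 5 seconds until the crawl reports finished: true
until [ "$(curl -s "https://api.kadoa.com/v4/crawl/$SESSION_ID/status" \
    -H "x-api-key: <Kadoa API Key>" | jq -r '.payload.finished')" = "true" ]; do
  echo "Still crawling..."
  sleep 5
done
echo "Crawl finished."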

3. List Crawled Pages

List the crawled pages using the paginated /crawl/<sessionId>/pages endpoint.

curl -X GET "https://api.kadoa.com/v4/crawl/<sessionId>/pages?currentPage=0&pageSize=100" \
  -H "x-api-key: <Kadoa API Key>"

See API Reference

Query parameters:

  • currentPage: Zero-based page index (integer, starting at 0).
  • pageSize: Number of results per page (positive integer, minimum 1).

Sample Response:

{
    "payload": [
        { "id": "<pageId>", "url": "https://demo.vercel.store/", "status": "DONE" },
        { "id": "<pageId>", "url": "https://demo.vercel.store/about", "status": "CRAWLING" }
    ],
    "pagination": {
        "currentPage": 1,
        "pageSize": 25,
        "totalRecords": 132,
        "totalPages": 5
    },
    "sessionId": "<sessionId>",
    "error": null
}
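
To enumerate every crawled page, walk the zero-based pagination until totalPages is reached. A sketch under the same assumptions (jq installed, $SESSION_ID set):

# Print the ID of every crawled page, one pagination page at a time
PAGE=0
while :; do
  RESPONSE=$(curl -s "https://api.kadoa.com/v4/crawl/$SESSION_ID/pages?currentPage=$PAGE&pageSize=100" \
    -H "x-api-key: <Kadoa API Key>")
  echo "$RESPONSE" | jq -r '.payload[].id'
  TOTAL=$(echo "$RESPONSE" | jq -r '.pagination.totalPages')
  PAGE=$((PAGE + 1))
  [ "$PAGE" -ge "$TOTAL" ] && break
done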

4. Retrieve Page Content

Finally, retrieve the content of the crawled pages in your preferred format. The API can deliver the page payload directly in an LLM-ready format, such as Markdown.

curl -X GET "https://api.kadoa.com/v4/crawl/<sessionId>/pages/page1?format=md" \
  -H "x-api-key: <Kadoa API Key>"

See API Reference

Supported Formats:

  • html: Full HTML structure
  • md: Markdown format

Sample response (Markdown):

{
    "payload": "# Sample Markdown\n\n## Heading 2\n\nThis is a **bold** text and this is *italic*.\n\n",
    "pageId": "<pageId>",
    "pageFormat": "md",
    "error": null
}

Sample response (HTML):

{
    "payload": "<!DOCTYPE html>\n<html>\n<head>\n<title>Sample HTML</title></head><body>...</body>\n</html>",
    "pageId": "<pageId>",
    "pageFormat": "html",
    "error": null
}
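
Putting the steps together: once the crawl has finished, you can fetch the Markdown version of every completed page and write it to disk. A sketch under the same assumptions as above, plus the simplification that a single request with pageSize=100 covers all crawled pages:

# Save the Markdown payload of every finished page to <pageId>.md
for PAGE_ID in $(curl -s "https://api.kadoa.com/v4/crawl/$SESSION_ID/pages?currentPage=0&pageSize=100" \
    -H "x-api-key: <Kadoa API Key>" | jq -r '.payload[] | select(.status == "DONE") | .id'); do
  curl -s "https://api.kadoa.com/v4/crawl/$SESSION_ID/pages/$PAGE_ID?format=md" \
    -H "x-api-key: <Kadoa API Key>" | jq -r '.payload' > "$PAGE_ID.md"
done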