Crawling
Introduction
Crawling automatically visits all accessible subpages of a website and converts them into structured JSON or Markdown. This guide helps you:
- Initiate a crawling session
- Check crawling session status
- List crawled pages
- Access crawled page content
Prerequisites
To get the most out of this guide, you’ll need to:
- Create a Kadoa account
- Get your API key
1. Start a Crawl
To initiate a web crawl, send a POST request to the `/crawl` endpoint with the desired configuration.
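A minimal sketch of this request using Python's requests library. The base URL (`https://api.kadoa.com`), the `x-api-key` header, the `url` payload field, and the `sessionId` response field are assumptions here; the filter and limit options are the parameters described below. Check the API reference for the exact names.

```python
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.kadoa.com"  # assumed base URL

# Crawl configuration: start from a URL and limit the crawl scope.
payload = {
    "url": "https://example.com",     # assumed name for the starting URL field
    "pathsFilterIn": ["/blog"],       # only crawl paths matching these prefixes
    "pathsFilterOut": ["/blog/tag"],  # skip paths matching these prefixes
    "timeout": 60,
    "maxDepth": 3,
    "maxPages": 100,
}

response = requests.post(
    f"{BASE_URL}/crawl",
    headers={"x-api-key": API_KEY},  # assumed auth header name
    json=payload,
)
response.raise_for_status()
session_id = response.json()["sessionId"]  # assumed name for the session ID field
print(session_id)
```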
A successful response returns a session ID for the crawl, which you'll use in the following steps.
Use `pathsFilterIn` and `pathsFilterOut` to include or exclude specific paths. Adjust `timeout`, `maxDepth`, and `maxPages` to refine the crawling process.
2. Check Crawl Status
Monitor the progress of your crawling session using the `/crawl/status/<sessionId>` endpoint.
You'll see how many pages have been crawled and whether the session has finished.
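A sketch of polling the status endpoint, under the same base URL and auth header assumptions as above; the `finished` field name in the loop is also an assumption.

```python
import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.kadoa.com"  # assumed base URL
session_id = "YOUR_SESSION_ID"

while True:
    status = requests.get(
        f"{BASE_URL}/crawl/status/{session_id}",
        headers={"x-api-key": API_KEY},  # assumed auth header name
    ).json()
    # The status response reports how many pages have been crawled
    # and whether the session has finished.
    print(status)
    if status.get("finished"):  # assumed field name
        break
    time.sleep(5)  # poll every few seconds until the crawl completes
```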
3. List Crawled Pages
Access the crawled pages using the `/crawl/<sessionId>/pages` endpoint with pagination.
Query parameters:
- `currentPage`: Positive integer, starting from 0.
- `pageSize`: Positive integer, starting from 1.
The response contains the crawled pages for the requested page of results.
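For example, paging through the results could look like the sketch below, built on the documented `currentPage` and `pageSize` query parameters; the `pages` and `url` response fields are assumed names, as are the base URL and auth header.

```python
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.kadoa.com"  # assumed base URL
session_id = "YOUR_SESSION_ID"

current_page = 0
while True:
    resp = requests.get(
        f"{BASE_URL}/crawl/{session_id}/pages",
        headers={"x-api-key": API_KEY},  # assumed auth header name
        params={"currentPage": current_page, "pageSize": 25},
    )
    resp.raise_for_status()
    pages = resp.json().get("pages", [])  # assumed field holding the crawled pages
    if not pages:
        break  # stop once a page of results comes back empty
    for page in pages:
        print(page.get("url"))  # assumed per-page field
    current_page += 1
```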
4. Retrieve Page Content
Now let's retrieve the content of the crawled pages in our preferred format. The API can deliver the page payload directly in an LLM-ready format, such as Markdown.
Supported Formats:
- `html`: Full HTML structure
- `md`: Markdown format
Depending on the requested format, the response body contains either the page's Markdown or its full HTML.
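As a hypothetical sketch of fetching one page's content in Markdown: the route and the `format` query parameter below are assumed, not taken from this guide, so check the API reference for the actual content endpoint.

```python
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.kadoa.com"  # assumed base URL
session_id = "YOUR_SESSION_ID"
page_id = "YOUR_PAGE_ID"  # hypothetical identifier of a crawled page

resp = requests.get(
    f"{BASE_URL}/crawl/{session_id}/pages/{page_id}",  # hypothetical route
    headers={"x-api-key": API_KEY},   # assumed auth header name
    params={"format": "md"},          # "md" for Markdown, "html" for full HTML
)
resp.raise_for_status()
print(resp.text)  # prints the response body, e.g. LLM-ready Markdown
```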