Overview
This guide shows you how to create workflows programmatically using either the Kadoa SDK or REST API. You’ll learn how to:
- Create workflows with different navigation modes
- Use existing schemas or define custom ones
- Set up AI Navigation with natural language instructions
- Configure monitoring and scheduling options
Prerequisites
Before you begin, you’ll need:
- A Kadoa account
- Your API key
- For the SDK: run npm install @kadoa/node-sdk, yarn add @kadoa/node-sdk, or pip install kadoa-sdk
Authentication
import { KadoaClient } from '@kadoa/node-sdk';
const client = new KadoaClient({
apiKey: 'your-api-key'
});
const status = await client.status();
console.log(status);
console.log(status.user);
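Hard-coding the key as in the snippet above is fine for a quick test, but the key is usually better loaded from the environment. The following is a minimal sketch, assuming a hypothetical KADOA_API_KEY variable name that the SDK itself does not define. The helper only validates the value and does not call Kadoa.

```typescript
// Hypothetical helper: reads an API key from an environment-like object.
// The variable name KADOA_API_KEY is an assumption made for this example.
function readApiKey(
  env: Record<string, string | undefined>,
  name: string = "KADOA_API_KEY",
): string {
  const value = (env[name] ?? "").trim();
  if (value.length === 0) {
    throw new Error(`Missing API key: set the ${name} environment variable`);
  }
  return value;
}

// Usage (in Node.js):
//   const client = new KadoaClient({ apiKey: readApiKey(process.env) });
```

Failing early with a clear message avoids sending requests with an empty key.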
Choose how you want to extract data from websites:
Auto-Detection
Auto-detect uses AI to detect and extract what’s on the page. If you’re using the REST API directly, auto-detection isn’t available and you need to pass a data schema.
// SDK: AI automatically detects and extracts data
const result = await client.extraction.run({
urls: ["https://sandbox.kadoa.com/ecommerce"],
name: "Auto Product Extraction",
limit: 10,
});
console.log(result.data);
Custom Schema
Define exactly what fields you want to extract for precise control:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/ecommerce"],
name: "Structured Product Extraction",
extraction: (builder) =>
builder
.entity("Product")
.field("title", "Product name", "STRING", {
example: "iPhone 15 Pro",
})
.field("price", "Price in USD", "MONEY")
.field("inStock", "Availability", "BOOLEAN")
.field("rating", "Rating 1-5", "NUMBER")
.field("releaseDate", "Launch date", "DATE"),
})
.create();
const result = await workflow.run({ limit: 10 });
// Use destructuring for cleaner access
const { data } = await result.fetchData({});
console.log(data);
Available Data Types:
STRING - Text content
NUMBER - Numeric values
BOOLEAN - True/false values
DATE - Date values
DATETIME - Date and time values
MONEY / CURRENCY - Monetary values
IMAGE - Image URLs
LINK - Hyperlinks
OBJECT - Nested objects
ARRAY - Lists of items
See all data types →
PDF Page Selection
When extracting from PDF URLs, you can specify which pages to process:
// POST https://api.kadoa.com/v4/workflows
{
"urls": ["https://example.com/report.pdf"],
"name": "PDF Extraction",
"entity": "Data",
"fields": [
{
"name": "content",
"dataType": "STRING",
"description": "Extracted content"
}
],
"pageNumbers": [1, 2, 3] // Extract only pages 1, 2, and 3
}
If pageNumbers is omitted, all pages are processed.
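The request body above can also be assembled in code. This sketch mirrors the fields shown in the example (urls, name, entity, fields, pageNumbers). The helper name buildPdfPayload is hypothetical, and treating page numbers as 1-based positive integers is an assumption drawn from the example rather than a documented rule. It leaves pageNumbers out entirely when no pages are given, so that all pages are processed.

```typescript
// Shapes mirror the request body shown above; nothing here calls the API.
interface FieldDef {
  name: string;
  dataType: string;
  description: string;
}

interface PdfWorkflowPayload {
  urls: string[];
  name: string;
  entity: string;
  fields: FieldDef[];
  pageNumbers?: number[];
}

// Hypothetical helper: normalizes page numbers and omits the key when empty.
function buildPdfPayload(
  url: string,
  name: string,
  entity: string,
  fields: FieldDef[],
  pages: number[] = [],
): PdfWorkflowPayload {
  for (const p of pages) {
    // Assumption: pages are 1-based positive integers.
    if (!Number.isInteger(p) || p < 1) {
      throw new Error(`Invalid page number: ${p}`);
    }
  }
  // De-duplicate and sort ascending.
  const unique = pages
    .filter((p, i) => pages.indexOf(p) === i)
    .sort((a, b) => a - b);
  const payload: PdfWorkflowPayload = { urls: [url], name, entity, fields };
  if (unique.length > 0) {
    payload.pageNumbers = unique;
  }
  return payload;
}

const payload = buildPdfPayload(
  "https://example.com/report.pdf",
  "PDF Extraction",
  "Data",
  [{ name: "content", dataType: "STRING", description: "Extracted content" }],
  [3, 1, 2, 3],
);
console.log(JSON.stringify(payload.pageNumbers)); // [1,2,3]
```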
Extract unstructured page content as raw HTML or Markdown, or return the URLs of the extracted pages:
// Extract as Markdown
const extraction = await client
.extract({
urls: ["https://sandbox.kadoa.com/news"],
name: "Article Content",
extraction: (builder) => builder.raw("MARKDOWN"),
})
.create();
const run = await extraction.run({ limit: 10 });
const data = await run.fetchData({});
console.log(data);
Available Formats:
HTML - Raw HTML content
MARKDOWN - Markdown formatted text
PAGE_URL - URLs of extracted pages
Classification
Automatically categorize content into predefined classes:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/news"],
name: "Article Classifier",
extraction: (builder) =>
builder
.entity("Article")
.field("title", "Headline", "STRING", {
example: "Tech Company Announces New Product",
})
.field("content", "Article text", "STRING", {
example: "The article discusses the latest innovations...",
})
.classify("sentiment", "Content tone", [
{ title: "Positive", definition: "Optimistic tone" },
{ title: "Negative", definition: "Critical tone" },
{ title: "Neutral", definition: "Balanced tone" },
])
.classify("category", "Article topic", [
{ title: "Technology", definition: "Tech news" },
{ title: "Business", definition: "Business news" },
{ title: "Politics", definition: "Political news" },
]),
})
.create();
// Note: 'limit' here caps the number of records extracted, not the number fetched
const result = await workflow.run({ limit: 10, variables: {} });
console.log(result.jobId);
const data = await result.fetchData({ limit: 10 });
console.log(data);
Navigation Modes
Kadoa supports five navigation modes to handle different website structures:
| Mode | Value | Best For |
|---|---|---|
| Single Page | single-page | Extract data from a single page |
| List | paginated-page | Navigate through lists with pagination |
| List + Details | page-and-detail | Navigate lists then open each item for details |
| All Pages | all-pages | Crawl all pages or up to maxPages pages and extract matching entities |
| AI Navigation | agentic-navigation | AI-driven navigation using natural language |
Learn more about Navigation Modes →
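Because the mode is passed as a plain string, a typo only surfaces when the request is sent. One way to catch it earlier is a small type guard. This is a sketch, not part of the SDK: the mode values are copied from the table above, and the names NAVIGATION_MODES and isNavigationMode are hypothetical.

```typescript
// Mode values copied from the table above.
const NAVIGATION_MODES = [
  "single-page",
  "paginated-page",
  "page-and-detail",
  "all-pages",
  "agentic-navigation",
] as const;

type NavigationMode = (typeof NAVIGATION_MODES)[number];

// Type guard: narrows an arbitrary string to a known mode value.
function isNavigationMode(value: string): value is NavigationMode {
  return (NAVIGATION_MODES as readonly string[]).indexOf(value) !== -1;
}

console.log(isNavigationMode("page-and-detail")); // true
console.log(isNavigationMode("list"));            // false
```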
Navigation Mode Examples
Extract data from a single page, such as a job posting or product page:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/careers-simple"],
name: "Job Posting Monitor",
navigationMode: "single-page",
extraction: (builder) =>
builder
.entity("Job Posting")
.field("jobTitle", "Job title", "STRING", {
example: "Senior Software Engineer",
})
.field("department", "Department or team", "STRING", {
example: "Engineering",
})
.field("location", "Job location", "STRING", {
example: "San Francisco, CA",
}),
})
.setInterval({ interval: "DAILY" })
.create();
console.log("Workflow created:", workflow.workflowId);
const result = await workflow.run({ limit: 10, variables: {} });
console.log(result.jobId);
List Navigation
Navigate through paginated lists to extract multiple items:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/ecommerce"],
name: "Product Catalog Monitor",
navigationMode: "paginated-page",
// schemaId holds the ID of an existing schema in your account
extraction: () => ({ schemaId }),
})
.setInterval({ interval: "HOURLY" })
.create();
// Run the workflow
const result = await workflow.run({ limit: 10 });
const response = await result.fetchData({});
console.log("Extracted items:", response.data);
List + Details Navigation
Navigate through a list and then open each item for detailed extraction:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/ecommerce"],
name: "Product Details Extractor",
navigationMode: "page-and-detail",
extraction: (builder) =>
builder
.entity("Product")
.field("title", "Product name", "STRING", {
example: "Wireless Headphones",
})
.field("price", "Product price", "MONEY")
.field("description", "Full description", "STRING", {
example: "Premium noise-cancelling headphones...",
})
.field("specifications", "Technical specs", "STRING", {
example: "Battery life: 30 hours, Bluetooth 5.0...",
}),
})
.create();
const result = await workflow.run({ limit: 10 });
const productDetails = await result.fetchData({});
console.log(productDetails.data);
All Pages (Crawler) Navigation
Crawl all pages or up to maxPages pages (if specified) and extract matching entities from discovered pages.
The starting URL must display the entity you want to extract.
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/ecommerce"],
name: "Product Catalog Crawler",
navigationMode: "all-pages",
extraction: (builder) =>
builder
.entity("Product")
.field("title", "Product name", "STRING", {
example: "Sennheiser HD 6XX",
})
.field("price", "Product price", "MONEY")
.field("reviews", "Number of reviews", "STRING", {
example: "155 reviews",
}),
})
.create();
const result = await workflow.run({ limit: 10 });
const response = await result.fetchData({});
console.log(response.data);
All URLs must share the exact same hostname. For example, https://example.com and https://example.com/products are valid, but mixing https://example.com with https://www.example.com or https://shop.example.com will be rejected.
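The hostname rule can be checked locally before a workflow is created. This is a minimal sketch using the standard URL class, and the function name shareHostname is hypothetical. It compares hostnames only, so how Kadoa treats differing ports or protocols is not covered and would be an assumption.

```typescript
// Returns true when every URL has exactly the same hostname.
// Throws a TypeError (from the URL constructor) if any URL is malformed.
function shareHostname(urls: string[]): boolean {
  if (urls.length === 0) return true;
  const first = new URL(urls[0]).hostname;
  return urls.every((u) => new URL(u).hostname === first);
}

console.log(shareHostname(["https://example.com", "https://example.com/products"])); // true
console.log(shareHostname(["https://example.com", "https://www.example.com"]));      // false
console.log(shareHostname(["https://example.com", "https://shop.example.com"]));     // false
```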
Crawler parameters:
| Parameter | Description | Default |
|---|---|---|
| maxPages | Maximum pages to crawl (1-100,000). Crawling stops when reached. | 10,000 |
| maxDepth | Maximum crawl depth from starting URL (1-200) | 50 |
| pathsFilterIn | Regex patterns to include specific paths (e.g., ["/products/.*"]) | None |
| pathsFilterOut | Regex patterns to exclude specific paths (e.g., ["/admin/.*"]) | None |
The crawler will crawl all pages or up to maxPages pages (if specified) and extract entities matching your schema from those pages.
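The ranges in the table lend themselves to a local check before the request is sent. The following is a sketch only: validateCrawlerOptions is a hypothetical helper, and it verifies that each pattern compiles as a JavaScript regular expression, which may differ from the regex dialect the crawler actually uses.

```typescript
interface CrawlerOptions {
  maxPages?: number;
  maxDepth?: number;
  pathsFilterIn?: string[];
  pathsFilterOut?: string[];
}

// Checks the ranges from the table above and that each pattern compiles.
// Returns a list of human-readable problems; an empty list means no issues.
function validateCrawlerOptions(opts: CrawlerOptions): string[] {
  const errors: string[] = [];
  const inRange = (v: number, lo: number, hi: number) =>
    Number.isInteger(v) && v >= lo && v <= hi;
  if (opts.maxPages !== undefined && !inRange(opts.maxPages, 1, 100000)) {
    errors.push("maxPages must be an integer between 1 and 100,000");
  }
  if (opts.maxDepth !== undefined && !inRange(opts.maxDepth, 1, 200)) {
    errors.push("maxDepth must be an integer between 1 and 200");
  }
  const patterns = (opts.pathsFilterIn ?? []).concat(opts.pathsFilterOut ?? []);
  for (const pattern of patterns) {
    try {
      new RegExp(pattern); // assumption: JavaScript regex syntax
    } catch {
      errors.push(`Invalid regex pattern: ${pattern}`);
    }
  }
  return errors;
}

const okErrors = validateCrawlerOptions({
  maxPages: 500,
  maxDepth: 5,
  pathsFilterIn: ["/products/.*"],
});
console.log(okErrors.length); // 0

const badErrors = validateCrawlerOptions({ maxPages: 0, pathsFilterOut: ["("] });
console.log(badErrors.length); // 2
```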
Raw Data Mode (No Schema)
Crawl a website and retrieve raw page artifacts (HTML, Markdown, screenshots) without defining an entity or schema. Useful for LLM ingestion, site archival, or content analysis.
// POST https://api.kadoa.com/v4/workflows
{
"urls": ["https://example.com"],
"name": "Site Archive",
"outputOptions": {
"includeHtml": true,
"includeMarkdown": true,
"includeScreenshots": false,
"includeJson": false
},
"maxPages": 500,
"maxDepth": 5
}
Output options:
| Field | Type | Default | Description |
|---|---|---|---|
| includeHtml | boolean | true | Include raw HTML content |
| includeMarkdown | boolean | false | Include Markdown-converted content |
| includeScreenshots | boolean | false | Include Base64-encoded screenshots |
| includeJson | boolean | false | Include JSON extraction results |
At least one of includeHtml or includeMarkdown must be true.
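The defaults and the "at least one" rule can be expressed as a small resolver. This sketch uses defaults copied from the table above, and the names DEFAULT_OUTPUT_OPTIONS and resolveOutputOptions are hypothetical. It assumes omitted keys fall back to the documented defaults, and it does not call the API.

```typescript
interface OutputOptions {
  includeHtml: boolean;
  includeMarkdown: boolean;
  includeScreenshots: boolean;
  includeJson: boolean;
}

// Defaults copied from the table above.
const DEFAULT_OUTPUT_OPTIONS: OutputOptions = {
  includeHtml: true,
  includeMarkdown: false,
  includeScreenshots: false,
  includeJson: false,
};

// Merges overrides onto the defaults and enforces the documented rule.
function resolveOutputOptions(
  overrides: Partial<OutputOptions> = {},
): OutputOptions {
  const resolved: OutputOptions = { ...DEFAULT_OUTPUT_OPTIONS, ...overrides };
  if (!resolved.includeHtml && !resolved.includeMarkdown) {
    throw new Error("At least one of includeHtml or includeMarkdown must be true");
  }
  return resolved;
}

const markdownOnly = resolveOutputOptions({ includeHtml: false, includeMarkdown: true });
console.log(markdownOnly.includeMarkdown); // true
```

Note that turning off includeHtml without enabling includeMarkdown is rejected by this check.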
Crawler parameters (same as structured crawling):
| Parameter | Description | Default |
|---|---|---|
| maxPages | Maximum pages to crawl (1-100,000) | 10,000 |
| maxDepth | Maximum crawl depth from starting URL (1-200) | 50 |
| pathsFilterIn | Regex patterns to include specific paths | None |
| pathsFilterOut | Regex patterns to exclude specific paths | None |
Retrieving data:
Use the standard /data endpoint. Each row represents one crawled page:
// GET https://api.kadoa.com/v4/workflows/{workflowId}/data
{
"data": [
{
"url": "https://example.com/page-1",
"html": "<!DOCTYPE html>...",
"markdown": "# Page Title\n\nContent..."
},
{
"url": "https://example.com/page-2",
"html": "<!DOCTYPE html>...",
"markdown": "# Another Page\n\n..."
}
],
"pagination": { "page": 1, "totalPages": 5 }
}
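Since the response reports page and totalPages, collecting a full crawl means requesting pages until the last one is reached. This is a sketch under stated assumptions: the response shape follows the example above, collectAllPages is a hypothetical helper, and the caller supplies fetchPage, so nothing is assumed about authentication or about how the page number is passed to the endpoint.

```typescript
interface CrawledPage {
  url: string;
  html?: string;
  markdown?: string;
}

interface DataPage {
  data: CrawledPage[];
  pagination: { page: number; totalPages: number };
}

// Collects rows from every page by calling `fetchPage` with 1, 2, 3, ...
// until the reported totalPages is reached.
async function collectAllPages(
  fetchPage: (page: number) => Promise<DataPage>,
): Promise<CrawledPage[]> {
  const rows: CrawledPage[] = [];
  let page = 1;
  let totalPages = 1;
  do {
    const res = await fetchPage(page);
    rows.push(...res.data);
    totalPages = res.pagination.totalPages;
    page += 1;
  } while (page <= totalPages);
  return rows;
}
```

For large crawls, processing each page as it arrives instead of accumulating all rows keeps memory use lower.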
AI Navigation
AI Navigation enables autonomous website navigation through natural language instructions. The AI understands your intent and navigates complex websites automatically.
Learn more about AI Navigation →
Schema Options
AI Navigation supports three approaches:
- Existing Schema (schemaId) - Reference a pre-built schema from your account
- Custom Schema (entity + fields) - Define specific fields and data types
- Auto-Detected Schema (no schema) - Let AI determine what data to extract
AI Navigation with Existing Schema
Use a pre-built schema by referencing its ID:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/careers-directory"],
name: "AI Job Scraper",
navigationMode: "agentic-navigation",
extraction: () => ({
schemaId: schema.id,
}),
userPrompt: `Navigate to the careers section, find all
engineering job postings, and extract the job details
including requirements and benefits. Make sure to
click 'Load More' if present.`,
})
.create();
console.log(`Workflow ${workflow.workflowId} started`);
// Note: AI Navigation flows typically take ~1 hour to complete.
// We recommend using webhooks to receive notifications when finished.
workflow.run().then((run) => {
console.log(`Workflow finished. RunId: ${run.jobId}`);
});
AI Navigation with Custom Schema
Define your own schema for precise data extraction:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/careers-directory"],
name: "AI Job Scraper with Schema",
navigationMode: "agentic-navigation",
extraction: (builder) =>
builder
.entity("Job Posting")
.field("jobTitle", "Job title", "STRING", {
example: "Product Manager",
})
.field("description", "Job description", "STRING", {
example: "Lead product strategy and roadmap...",
})
.field("requirements", "Job requirements", "STRING", {
example: "5+ years experience in product management",
})
.field("benefits", "Benefits offered", "STRING", {
example: "Health insurance, 401k, remote work",
}),
userPrompt: `Navigate to the careers section and extract job details.`,
})
.create();
console.log(`Workflow ${workflow.workflowId} started`);
// Note: AI Navigation flows typically take ~1 hour to complete.
// We recommend using webhooks to receive notifications when finished.
workflow.run().then((run) => {
console.log(`Workflow finished. RunId: ${run.jobId}`);
});
AI Navigation with Auto-Detected Schema
Let AI determine what data to extract based on your instructions:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/news"],
name: "AI Blog Scraper",
navigationMode: "agentic-navigation",
userPrompt: `Find all blog posts from 2024. For each post,
extract the title, author, publication date, and content.`,
})
.create();
console.log(`Workflow ${workflow.workflowId} started`);
// Note: AI Navigation flows typically take ~1 hour to complete.
// We recommend using webhooks to receive notifications when finished.
workflow.run().then((run) => {
console.log(`Workflow finished. RunId: ${run.jobId}`);
});
Using Variables in AI Navigation
Variables allow dynamic workflows that reference values defined in your dashboard. Create variables in the UI first, then reference them in API requests:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/ecommerce"],
name: "Dynamic Product Search",
navigationMode: "agentic-navigation",
userPrompt: `Navigate to search and loop through
'@productTypes', press search, and extract
product details for all results.`,
})
.create();
console.log(`Workflow ${workflow.workflowId} started`);
// Note: AI Navigation flows typically take ~1 hour to complete.
// We recommend using webhooks to receive notifications when finished.
workflow.run().then((run) => {
console.log(`Workflow finished. RunId: ${run.jobId}`);
});
Variable Workflow:
- Create variables in the dashboard UI (e.g., productTypes)
- Reference them using @variableName syntax in your prompt
- The backend automatically interpolates variables using account values
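Because variables must exist in the dashboard before a prompt can reference them, it can help to list the @variableName references in a prompt and compare them with the names you know you created. This is a sketch only: both helpers are hypothetical, and the set of characters allowed in a variable name (letters, digits, underscore) is an assumption. Text such as email addresses would also match this pattern.

```typescript
// Extracts the distinct @variableName references found in a prompt.
// The allowed name characters (letters, digits, underscore) are an assumption.
function findVariableRefs(promptText: string): string[] {
  const names: string[] = [];
  const pattern = /@([A-Za-z_][A-Za-z0-9_]*)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(promptText)) !== null) {
    if (names.indexOf(match[1]) === -1) {
      names.push(match[1]);
    }
  }
  return names;
}

// Returns the referenced names that are missing from the known list.
function missingVariables(promptText: string, known: string[]): string[] {
  return findVariableRefs(promptText).filter((n) => known.indexOf(n) === -1);
}

const examplePrompt =
  "Navigate to search and loop through '@productTypes', press search.";
console.log(JSON.stringify(findVariableRefs(examplePrompt)));             // ["productTypes"]
console.log(JSON.stringify(missingVariables(examplePrompt, ["regions"]))); // ["productTypes"]
```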
Scheduling & Running Workflows
Scheduling Options
Configure when your workflow runs:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/ecommerce/pagination"],
name: "Scheduled Extraction",
extraction: (builder) =>
builder
.entity("Product")
.field("title", "Product name", "STRING", { example: "Sample" }),
})
.setInterval({
schedules: ["0 9 * * MON-FRI", "0 18 * * MON-FRI"],
})
.create();
// Workflow runs automatically on schedule
console.log("Scheduled workflow:", workflow.workflowId);
Available intervals:
ONLY_ONCE - Run once
HOURLY, DAILY, WEEKLY, MONTHLY - Standard intervals
REAL_TIME - Continuous monitoring (Self-service limited to 10 workflows; Enterprise is unlimited)
CUSTOM - Use cron expressions
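The cron strings in the scheduling example use the common five-field layout (minute, hour, day of month, month, day of week). This is a sketch that only checks the shape of such an expression and names its parts. parseCron is a hypothetical helper, and whether Kadoa accepts other cron variants, or which time zone it applies, is not stated here.

```typescript
interface CronFields {
  minute: string;
  hour: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
}

// Splits a five-field cron expression into named parts.
// This checks shape only; it does not validate the value ranges.
function parseCron(expression: string): CronFields {
  const parts = expression.trim().split(/\s+/);
  if (parts.length !== 5) {
    throw new Error(`Expected 5 cron fields, got ${parts.length}`);
  }
  const [minute, hour, dayOfMonth, month, dayOfWeek] = parts;
  return { minute, hour, dayOfMonth, month, dayOfWeek };
}

const morning = parseCron("0 9 * * MON-FRI");
console.log(`${morning.hour}:${morning.minute} on ${morning.dayOfWeek}`); // 9:0 on MON-FRI
```

Read this way, the two schedules in the example run at 09:00 and 18:00, Monday through Friday.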
Manual Execution
Run workflows on demand:
const workflow = await client.workflow.get(workflowId);
console.log(`Current workflow state: ${workflow.displayState}`);
const result = await client.workflow.runWorkflow(workflowId, {
limit: 10,
});
console.log(`Workflow scheduled with runId: ${result.jobId}`);
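After triggering a run manually, one option is to poll the workflow state until it settles. The following is a sketch, not SDK functionality: waitForCompletion is a hypothetical helper, and the state names COMPLETED and FAILED come from the Workflow States list in this guide. Whether displayState reports exactly those strings is an assumption. The caller supplies the state lookup, so the helper stays independent of the client.

```typescript
// Polls `getState` until it reports a terminal state or attempts run out.
// `getState` is supplied by the caller, for example:
//   async () => (await client.workflow.get(workflowId)).displayState
async function waitForCompletion(
  getState: () => Promise<string>,
  intervalMs: number = 5000,
  maxAttempts: number = 120,
): Promise<string> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const state = await getState();
    // Assumption: these strings match the states the API reports.
    if (state === "COMPLETED" || state === "FAILED") {
      return state;
    }
    // Wait before asking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Workflow did not finish after ${maxAttempts} attempts`);
}
```

For long-running workflows, a fixed attempt cap like this one prevents a script from polling indefinitely.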
Checking Workflow Status
When using the API, poll the workflow status to know when extraction is complete.
Workflow States:
IN_PROGRESS - Extraction is running
COMPLETED - Data is ready to retrieve
FAILED - Extraction failed (check errors field)
Pagination Handling
Automatically navigate through multiple pages of results:
const extraction = await client
.extract({
urls: ["https://sandbox.kadoa.com/ecommerce/pagination"],
name: "Paginated Extraction",
navigationMode: "paginated-page",
extraction: (builder) =>
builder
.entity("Product")
.field("title", "Product name", "STRING", {
example: "Sennheiser HD 6XX",
})
.field("price", "Product price", "MONEY"),
})
.create();
const result = await extraction.run({ limit: 10 });
// Fetch a single page with pagination info
const page = await result.fetchData({ page: 1, limit: 5 });
console.log("Page data:", page.data);
console.log("Pagination:", page.pagination);
// Or get all data at once
const allData = await result.fetchAllData({});
console.log("All data:", allData);
Advanced Configuration
Proxy Locations
Specify geographic location for scraping:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/magic"],
name: "Geo-located Extraction",
extraction: (builder) =>
builder
.entity("Product")
.field("title", "Title", "STRING", { example: "example" }),
})
.setLocation({
type: "manual",
isoCode: "US",
})
.create();
Available locations:
US - United States
GB - United Kingdom
DE - Germany
NL - Netherlands
CA - Canada
auto - Automatic selection
Preview Mode
Skip manual review and activate workflows immediately:
const workflow = await client
.extract({
urls: ["https://sandbox.kadoa.com/magic"],
name: "Direct Activation",
extraction: (builder) =>
builder
.entity("Product")
.field("title", "Title", "STRING", { example: "example" }),
})
.bypassPreview() // Skip review step
.create();
// Workflow is immediately active
Next Steps