Create Workflows

Overview

This guide shows you how to create workflows programmatically using either the Kadoa SDK or REST API. You’ll learn how to:

Create workflows with different navigation modes
Use existing schemas or define custom ones
Set up AI Navigation with natural language instructions
Configure monitoring and scheduling options

Prerequisites

Before you begin, you’ll need:

A Kadoa account
Your API key
For SDK: npm install @kadoa/node-sdk or yarn add @kadoa/node-sdk or pip install kadoa-sdk

Authentication

import { KadoaClient } from '@kadoa/node-sdk';

const client = new KadoaClient({
  apiKey: 'your-api-key'
});

const status = await client.status();
console.log(status);
console.log(status.user);
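
Hardcoding the key works for a quick test, but scripts usually read it from the environment instead. The following is a minimal sketch of that pattern; the variable name `KADOA_API_KEY` and the helper `resolveApiKey` are assumptions for illustration, not part of the SDK.

```typescript
// Minimal sketch: resolve the API key from an environment-like object.
// KADOA_API_KEY is an assumed variable name, not an SDK convention.
type Env = Record<string, string | undefined>;

function resolveApiKey(env: Env, name = "KADOA_API_KEY"): string {
  const key = env[name]?.trim();
  if (!key) {
    throw new Error(`Missing API key: set the ${name} environment variable`);
  }
  return key;
}
```

In Node.js this could be passed to the client as `new KadoaClient({ apiKey: resolveApiKey(process.env) })`.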

Extraction Methods

Choose how you want to extract data from websites:

Auto-Detection

Auto-detection uses AI to identify and extract the data on the page. It isn't available when you call the REST API directly; there you must pass a data schema.

// SDK: AI automatically detects and extracts data
const result = await client.extraction.run({
  urls: ["https://sandbox.kadoa.com/ecommerce"],
  name: "Auto Product Extraction",
  limit: 10,
});

console.log(result.data);

Custom Schema

Define exactly what fields you want to extract for precise control:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/ecommerce"],
    name: "Structured Product Extraction",
    extraction: (builder) =>
      builder
        .entity("Product")
        .field("title", "Product name", "STRING", {
          example: "iPhone 15 Pro",
        })
        .field("price", "Price in USD", "MONEY")
        .field("inStock", "Availability", "BOOLEAN")
        .field("rating", "Rating 1-5", "NUMBER")
        .field("releaseDate", "Launch date", "DATE"),
  })
  .create();

const result = await workflow.run({ limit: 10 });

// Use destructuring for cleaner access
const { data } = await result.fetchData({});
console.log(data);

Available Data Types:

STRING - Text content
NUMBER - Numeric values
BOOLEAN - True/false values
DATE - Date values
DATETIME - Date and time values
MONEY / CURRENCY - Monetary values
IMAGE - Image URLs
LINK - Hyperlinks
OBJECT - Nested objects
ARRAY - Lists of items

See all data types →
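
If field definitions come from user input or a config file, it can help to check the type names before sending them. This is a small sketch that uses only the names listed above; `isDataType` and `parseDataType` are hypothetical helpers, and the full list in the reference may contain more types.

```typescript
// Data type names copied from the list above (the full reference may have more).
const DATA_TYPES = [
  "STRING", "NUMBER", "BOOLEAN", "DATE", "DATETIME",
  "MONEY", "CURRENCY", "IMAGE", "LINK", "OBJECT", "ARRAY",
] as const;
type DataType = (typeof DATA_TYPES)[number];

// Type guard: true only for names in the list above.
function isDataType(value: string): value is DataType {
  return (DATA_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: normalize input such as " money " to "MONEY".
function parseDataType(value: string): DataType {
  const upper = value.trim().toUpperCase();
  if (!isDataType(upper)) {
    throw new Error(`Unknown data type: ${value}`);
  }
  return upper;
}
```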

PDF Page Selection

When extracting from PDF URLs, you can specify which pages to process:

API

// POST https://api.kadoa.com/v4/workflows
{
  "urls": ["https://example.com/report.pdf"],
  "name": "PDF Extraction",
  "entity": "Data",
  "fields": [
    {
      "name": "content",
      "dataType": "STRING",
      "description": "Extracted content"
    }
  ],
  "pageNumbers": [1, 2, 3]  // Extract only pages 1, 2, and 3
}

If pageNumbers is omitted, all pages are processed.
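
The request body above can also be assembled in code. The sketch below builds the same shape and leaves `pageNumbers` out when no pages are given. Treating page numbers as positive, 1-based integers is an assumption drawn from the example, and `buildPdfRequest` is a hypothetical helper.

```typescript
interface FieldDefinition {
  name: string;
  dataType: string;
  description: string;
}

interface PdfWorkflowRequest {
  urls: string[];
  name: string;
  entity: string;
  fields: FieldDefinition[];
  pageNumbers?: number[];
}

// Hypothetical helper: build the JSON body for a PDF extraction workflow.
function buildPdfRequest(
  url: string,
  name: string,
  fields: FieldDefinition[],
  pages?: number[],
): PdfWorkflowRequest {
  const request: PdfWorkflowRequest = { urls: [url], name, entity: "Data", fields };
  if (pages && pages.length > 0) {
    for (const p of pages) {
      // Assumption: pages are 1-based positive integers.
      if (Math.floor(p) !== p || p < 1) {
        throw new Error(`Invalid page number: ${p}`);
      }
    }
    // Drop duplicates and sort ascending.
    request.pageNumbers = pages
      .filter((p, i) => pages.indexOf(p) === i)
      .sort((a, b) => a - b);
  }
  return request; // pageNumbers stays absent, so all pages are processed
}
```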

Raw Content Extraction

Extract unstructured content as HTML, Markdown, or plain text:

// Extract as Markdown
const extraction = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/news"],
    name: "Article Content",
    extraction: (builder) => builder.raw("MARKDOWN"),
  })
  .create();

const run = await extraction.run({ limit: 10 });
const data = await run.fetchData({});
console.log(data);

Available Formats:

HTML - Raw HTML content
MARKDOWN - Markdown formatted text
PAGE_URL - URLs of extracted pages

Classification

Automatically categorize content into predefined classes:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/news"],
    name: "Article Classifier",
    extraction: (builder) =>
      builder
        .entity("Article")
        .field("title", "Headline", "STRING", {
          example: "Tech Company Announces New Product",
        })
        .field("content", "Article text", "STRING", {
          example: "The article discusses the latest innovations...",
        })
        .classify("sentiment", "Content tone", [
          { title: "Positive", definition: "Optimistic tone" },
          { title: "Negative", definition: "Critical tone" },
          { title: "Neutral", definition: "Balanced tone" },
        ])
        .classify("category", "Article topic", [
          { title: "Technology", definition: "Tech news" },
          { title: "Business", definition: "Business news" },
          { title: "Politics", definition: "Political news" },
        ]),
  })
  .create();
// Note: `limit` here caps the number of records extracted, not the number fetched
const result = await workflow.run({ limit: 10, variables: {} });
console.log(result.jobId);
// fetchData returns a promise, so await it before logging
const data = await result.fetchData({ limit: 10 });
console.log(data);

Navigation Modes

Kadoa supports five navigation modes to handle different website structures:

Mode	Value	Best For
Single Page	`single-page`	Extract data from a single page
List	`paginated-page`	Navigate through lists with pagination
List + Details	`page-and-detail`	Navigate lists then open each item for details
All Pages	`all-pages`	Crawl all pages or up to `maxPages` pages and extract matching entities
AI Navigation	`agentic-navigation`	AI-driven navigation using natural language

Learn more about Navigation Modes →
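
The mode values in the table can be captured as a union type so that typos are caught at compile time. The selection rule below is only one possible reading of the "Best For" column; the `SiteShape` flags and `pickNavigationMode` are hypothetical and not part of the SDK.

```typescript
// Mode values copied from the table above.
type NavigationMode =
  | "single-page"
  | "paginated-page"
  | "page-and-detail"
  | "all-pages"
  | "agentic-navigation";

// Hypothetical description of the target site.
interface SiteShape {
  hasList: boolean;          // items are shown in a (paginated) list
  needsDetailPages: boolean; // each item must be opened for full data
  crawlWholeSite: boolean;   // matching entities are spread across the site
  needsInteraction: boolean; // searches, forms, "Load More" buttons, etc.
}

// Illustrative rule of thumb, not an official recommendation.
function pickNavigationMode(site: SiteShape): NavigationMode {
  if (site.needsInteraction) return "agentic-navigation";
  if (site.crawlWholeSite) return "all-pages";
  if (site.hasList) {
    return site.needsDetailPages ? "page-and-detail" : "paginated-page";
  }
  return "single-page";
}
```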

Navigation Mode Examples

Single Page Extraction

Extract data from a single page, such as a job posting or product page:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/careers-simple"],
    name: "Job Posting Monitor",
    navigationMode: "single-page",
    extraction: (builder) =>
      builder
        .entity("Job Posting")
        .field("jobTitle", "Job title", "STRING", {
          example: "Senior Software Engineer",
        })
        .field("department", "Department or team", "STRING", {
          example: "Engineering",
        })
        .field("location", "Job location", "STRING", {
          example: "San Francisco, CA",
        }),
  })
  .setInterval({ interval: "DAILY" })
  .create();

console.log("Workflow created:", workflow.workflowId);
const result = await workflow.run({ limit: 10, variables: {} });
console.log(result.jobId);

List Navigation

Navigate through paginated lists to extract multiple items:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/ecommerce"],
    name: "Product Catalog Monitor",
    navigationMode: "paginated-page",
    // schemaId is the ID of an existing schema in your account
    extraction: () => ({ schemaId }),
  })
  .setInterval({ interval: "HOURLY" })
  .create();

// Run the workflow
const result = await workflow.run({ limit: 10 });
const response = await result.fetchData({});
console.log("Extracted items:", response.data);

List + Details Navigation

Navigate through a list and then open each item for detailed extraction:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/ecommerce"],
    name: "Product Details Extractor",
    navigationMode: "page-and-detail",
    extraction: (builder) =>
      builder
        .entity("Product")
        .field("title", "Product name", "STRING", {
          example: "Wireless Headphones",
        })
        .field("price", "Product price", "MONEY")
        .field("description", "Full description", "STRING", {
          example: "Premium noise-cancelling headphones...",
        })
        .field("specifications", "Technical specs", "STRING", {
          example: "Battery life: 30 hours, Bluetooth 5.0...",
        }),
  })
  .create();

const result = await workflow.run({ limit: 10 });
const productDetails = await result.fetchData({});
console.log(productDetails.data);

All Pages (Crawler) Navigation

Crawl all pages, or up to maxPages pages if specified, and extract matching entities from the discovered pages.

The starting URL must display the entity you want to extract.

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/ecommerce"],
    name: "Product Catalog Crawler",
    navigationMode: "all-pages",
    extraction: (builder) =>
      builder
        .entity("Product")
        .field("title", "Product name", "STRING", {
          example: "Sennheiser HD 6XX",
        })
        .field("price", "Product price", "MONEY")
        .field("reviews", "Number of reviews", "STRING", {
          example: "155 reviews",
        }),
  })
  .create();

const result = await workflow.run({ limit: 10 });
const response = await result.fetchData({});
console.log(response.data);

All URLs must share the exact same hostname. For example, https://example.com and https://example.com/products are valid, but mixing https://example.com with https://www.example.com or https://shop.example.com will be rejected.
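
A quick local check can catch a hostname mismatch before the request is sent. This sketch uses the standard `URL` class to compare hostnames; `assertSameHostname` is a hypothetical helper, and the server-side validation may apply further rules.

```typescript
// Hypothetical pre-flight check: every URL must have the same hostname.
function assertSameHostname(urls: string[]): string {
  if (urls.length === 0) {
    throw new Error("At least one URL is required");
  }
  const expected = new URL(urls[0]).hostname;
  for (const u of urls) {
    const host = new URL(u).hostname;
    if (host !== expected) {
      throw new Error(`Hostname mismatch: ${host} does not match ${expected}`);
    }
  }
  return expected;
}
```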

Crawler parameters:

Parameter	Description	Default
`maxPages`	Maximum pages to crawl (1-100,000). Crawling stops when reached.	10,000
`maxDepth`	Maximum crawl depth from starting URL (1-200)	50
`pathsFilterIn`	Regex patterns to include specific paths (e.g., `["/products/.*"]`)	None
`pathsFilterOut`	Regex patterns to exclude specific paths (e.g., `["/admin/.*"]`)	None

The crawler will crawl all pages or up to maxPages pages (if specified) and extract entities matching your schema from those pages.
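
The documented ranges can be checked client-side before creating a crawler workflow. The sketch below validates the two numeric limits and confirms that each filter pattern compiles as a JavaScript regular expression. `validateCrawlerParams` is a hypothetical helper, and the service's regex dialect may differ from JavaScript's.

```typescript
interface CrawlerParams {
  maxPages?: number;
  maxDepth?: number;
  pathsFilterIn?: string[];
  pathsFilterOut?: string[];
}

// Returns a list of human-readable problems; an empty list means the params look valid.
function validateCrawlerParams(params: CrawlerParams): string[] {
  const errors: string[] = [];
  const isIntInRange = (v: number, min: number, max: number): boolean =>
    Math.floor(v) === v && v >= min && v <= max;

  if (params.maxPages !== undefined && !isIntInRange(params.maxPages, 1, 100000)) {
    errors.push("maxPages must be an integer between 1 and 100,000");
  }
  if (params.maxDepth !== undefined && !isIntInRange(params.maxDepth, 1, 200)) {
    errors.push("maxDepth must be an integer between 1 and 200");
  }

  const patterns = (params.pathsFilterIn || []).concat(params.pathsFilterOut || []);
  for (const pattern of patterns) {
    try {
      new RegExp(pattern); // throws SyntaxError on malformed patterns
    } catch {
      errors.push(`Invalid regex pattern: ${pattern}`);
    }
  }
  return errors;
}
```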

Raw Data Mode (No Schema)

Crawl a website and retrieve raw page artifacts (HTML, Markdown, screenshots) without defining an entity or schema. Useful for LLM ingestion, site archival, or content analysis.

// POST https://api.kadoa.com/v4/workflows
{
  "urls": ["https://example.com"],
  "name": "Site Archive",
  "outputOptions": {
    "includeHtml": true,
    "includeMarkdown": true,
    "includeScreenshots": false,
    "includeJson": false
  },
  "maxPages": 500,
  "maxDepth": 5
}

Output options:

Field	Type	Default	Description
`includeHtml`	boolean	`true`	Include raw HTML content
`includeMarkdown`	boolean	`false`	Include Markdown-converted content
`includeScreenshots`	boolean	`false`	Include Base64-encoded screenshots
`includeJson`	boolean	`false`	Include JSON extraction results

At least one of includeHtml or includeMarkdown must be true.
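
The defaults and the "at least one" rule can be expressed as a small helper. This is a sketch based on the table above; `resolveOutputOptions` is a hypothetical name.

```typescript
interface OutputOptions {
  includeHtml: boolean;
  includeMarkdown: boolean;
  includeScreenshots: boolean;
  includeJson: boolean;
}

// Defaults taken from the output options table.
const DEFAULT_OUTPUT_OPTIONS: OutputOptions = {
  includeHtml: true,
  includeMarkdown: false,
  includeScreenshots: false,
  includeJson: false,
};

// Merge caller overrides with the defaults and enforce the documented rule.
function resolveOutputOptions(overrides: Partial<OutputOptions> = {}): OutputOptions {
  const resolved: OutputOptions = { ...DEFAULT_OUTPUT_OPTIONS, ...overrides };
  if (!resolved.includeHtml && !resolved.includeMarkdown) {
    throw new Error("At least one of includeHtml or includeMarkdown must be true");
  }
  return resolved;
}
```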

Crawler parameters (same as structured crawling):

Parameter	Description	Default
`maxPages`	Maximum pages to crawl (1-100,000)	10,000
`maxDepth`	Maximum crawl depth from starting URL (1-200)	50
`pathsFilterIn`	Regex patterns to include specific paths	None
`pathsFilterOut`	Regex patterns to exclude specific paths	None

Retrieving data: Use the standard /data endpoint. Each row represents one crawled page:

// GET https://api.kadoa.com/v4/workflows/{workflowId}/data
{
  "data": [
    {
      "url": "https://example.com/page-1",
      "html": "<!DOCTYPE html>...",
      "markdown": "# Page Title\n\nContent..."
    },
    {
      "url": "https://example.com/page-2",
      "html": "<!DOCTYPE html>...",
      "markdown": "# Another Page\n\n..."
    }
  ],
  "pagination": { "page": 1, "totalPages": 5 }
}
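
Because the response is paginated, collecting every crawled page means looping until `totalPages` is reached. The sketch below keeps the HTTP call behind an injected `fetchPage` function, since how the page number is passed to the endpoint is not shown here and would be an assumption. `collectAllRows` is a hypothetical helper.

```typescript
interface PageRow {
  url: string;
  html?: string;
  markdown?: string;
}

interface DataPage {
  data: PageRow[];
  pagination: { page: number; totalPages: number };
}

// The caller supplies the actual request; this sketch only handles the loop.
type FetchPage = (page: number) => Promise<DataPage>;

async function collectAllRows(fetchPage: FetchPage): Promise<PageRow[]> {
  const rows: PageRow[] = [];
  let page = 1;
  let totalPages = 1;
  do {
    const response = await fetchPage(page);
    rows.push(...response.data);
    totalPages = response.pagination.totalPages;
    page += 1;
  } while (page <= totalPages);
  return rows;
}
```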

AI Navigation

AI Navigation enables autonomous website navigation through natural language instructions. The AI understands your intent and navigates complex websites automatically. Learn more about AI Navigation →

Schema Options

AI Navigation supports three approaches:

Existing Schema (schemaId) - Reference a pre-built schema from your account
Custom Schema (entity + fields) - Define specific fields and data types
Auto-Detected Schema (no schema) - Let AI determine what data to extract

AI Navigation with Existing Schema

Use a pre-built schema by referencing its ID:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/careers-directory"],
    name: "AI Job Scraper",
    navigationMode: "agentic-navigation",
    extraction: () => ({
      // `schema` is an existing schema object loaded from your account
      schemaId: schema.id,
    }),
    userPrompt: `Navigate to the careers section, find all
                 engineering job postings, and extract the job details
                 including requirements and benefits. Make sure to
                 click 'Load More' if present.`,
  })
  .create();

console.log(`Workflow ${workflow.workflowId} created`);
// Note: AI Navigation flows typically take ~1 hour to complete.
// We recommend using webhooks to receive notifications when finished.
workflow.run().then((run) => {
  console.log(`Workflow finished. RunId: ${run.jobId}`);
});

AI Navigation with Custom Schema

Define your own schema for precise data extraction:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/careers-directory"],
    name: "AI Job Scraper with Schema",
    navigationMode: "agentic-navigation",
    extraction: (builder) =>
      builder
        .entity("Job Posting")
        .field("jobTitle", "Job title", "STRING", {
          example: "Product Manager",
        })
        .field("description", "Job description", "STRING", {
          example: "Lead product strategy and roadmap...",
        })
        .field("requirements", "Job requirements", "STRING", {
          example: "5+ years experience in product management",
        })
        .field("benefits", "Benefits offered", "STRING", {
          example: "Health insurance, 401k, remote work",
        }),
    userPrompt: `Navigate to the careers section and extract job details.`,
  })
  .create();

console.log(`Workflow ${workflow.workflowId} created`);
// Note: AI Navigation flows typically take ~1 hour to complete.
// We recommend using webhooks to receive notifications when finished.
workflow.run().then((run) => {
  console.log(`Workflow finished. RunId: ${run.jobId}`);
});

AI Navigation with Auto-Detected Schema

Let AI determine what data to extract based on your instructions:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/news"],
    name: "AI Blog Scraper",
    navigationMode: "agentic-navigation",
    userPrompt: `Find all blog posts from 2024. For each post,
      extract the title, author, publication date, and content.`,
  })
  .create();

console.log(`Workflow ${workflow.workflowId} created`);
// Note: AI Navigation flows typically take ~1 hour to complete.
// We recommend using webhooks to receive notifications when finished.
workflow.run().then((run) => {
  console.log(`Workflow finished. RunId: ${run.jobId}`);
});

Using Variables in AI Navigation

Variables allow dynamic workflows that reference values defined in your dashboard. Create variables in the UI first, then reference them in API requests:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/ecommerce"],
    name: "Dynamic Product Search",
    navigationMode: "agentic-navigation",
    userPrompt: `Navigate to search and loop through
      '@productTypes', press search, and extract
      product details for all results.`,
  })
  .create();

console.log(`Workflow ${workflow.workflowId} created`);
// Note: AI Navigation flows typically take ~1 hour to complete.
// We recommend using webhooks to receive notifications when finished.
workflow.run().then((run) => {
  console.log(`Workflow finished. RunId: ${run.jobId}`);
});

Variable Workflow:

Create variables in the dashboard UI (e.g., productTypes)
Reference them using @variableName syntax in your prompt
The backend automatically interpolates variables using account values
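
Since interpolation happens on the backend, a misspelled variable name may only surface at run time. The sketch below scans a prompt for `@variableName` references so they can be compared with the variables you created. The name grammar (letters, digits, underscore) is an assumption, and both helpers are hypothetical.

```typescript
// Find distinct @variableName references in a prompt.
// Assumption: names start with a letter or underscore, followed by letters, digits, or underscores.
function findVariableReferences(prompt: string): string[] {
  const names: string[] = [];
  const pattern = /@([A-Za-z_][A-Za-z0-9_]*)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(prompt)) !== null) {
    if (names.indexOf(match[1]) === -1) {
      names.push(match[1]);
    }
  }
  return names;
}

// Return the referenced names that are not in the list of defined variables.
function missingVariables(prompt: string, defined: string[]): string[] {
  return findVariableReferences(prompt).filter((name) => defined.indexOf(name) === -1);
}
```

Note that this simple pattern would also match the part after `@` in an email address, so it is best suited to prompts that use `@` only for variables.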

Scheduling & Running Workflows

Scheduling Options

Configure when your workflow runs:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/ecommerce/pagination"],
    name: "Scheduled Extraction",
    extraction: (builder) =>
      builder
        .entity("Product")
        .field("title", "Product name", "STRING", { example: "Sample" }),
  })
  .setInterval({
    schedules: ["0 9 * * MON-FRI", "0 18 * * MON-FRI"],
  })
  .create();

// Workflow runs automatically on schedule
console.log("Scheduled workflow:", workflow.workflowId);

Available intervals:

ONLY_ONCE - Run once
HOURLY, DAILY, WEEKLY, MONTHLY - Standard intervals
REAL_TIME - Continuous monitoring (Self-service limited to 10 workflows; Enterprise is unlimited)
CUSTOM - Use cron expressions
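
The examples in this guide pass either `{ interval: ... }` or `{ schedules: [...] }` to `setInterval`. The sketch below builds one of those two shapes and performs a basic sanity check on cron strings. The five-field check is an assumption based on the expressions shown above, and `toIntervalConfig` is a hypothetical helper.

```typescript
// Named intervals from the list above (CUSTOM is expressed through cron schedules).
type StandardInterval =
  | "ONLY_ONCE" | "HOURLY" | "DAILY" | "WEEKLY" | "MONTHLY" | "REAL_TIME";

type IntervalConfig = { interval: StandardInterval } | { schedules: string[] };

// Build the argument for setInterval from a named interval or a list of cron expressions.
function toIntervalConfig(input: StandardInterval | string[]): IntervalConfig {
  if (typeof input === "string") {
    return { interval: input };
  }
  if (input.length === 0) {
    throw new Error("At least one cron expression is required");
  }
  for (const expression of input) {
    // Assumption: five whitespace-separated fields, as in "0 9 * * MON-FRI".
    if (expression.trim().split(/\s+/).length !== 5) {
      throw new Error(`Expected 5 cron fields in: ${expression}`);
    }
  }
  return { schedules: input };
}
```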

Manual Execution

Run workflows on demand:

const workflow = await client.workflow.get(workflowId);
console.log(`Current workflow state: ${workflow.displayState}`);

const result = await client.workflow.runWorkflow(workflowId, {
  limit: 10,
});
console.log(`Workflow scheduled with runId: ${result.jobId}`);
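
After triggering a run, a script often needs to wait until it finishes. The following is a minimal polling sketch built around the IN_PROGRESS, COMPLETED, and FAILED states listed under Checking Workflow Status. The state lookup is injected as `getState`, because whether `displayState` returns exactly these strings is an assumption; `waitForCompletion` is a hypothetical helper.

```typescript
type WorkflowState = "IN_PROGRESS" | "COMPLETED" | "FAILED";

interface PollOptions {
  intervalMs: number;  // delay between status checks
  maxAttempts: number; // give up after this many checks
}

// Poll until the workflow leaves IN_PROGRESS or the attempt budget is used up.
async function waitForCompletion(
  getState: () => Promise<WorkflowState>,
  options: PollOptions,
): Promise<WorkflowState> {
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    const state = await getState();
    if (state !== "IN_PROGRESS") {
      return state; // COMPLETED or FAILED
    }
    if (attempt < options.maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
    }
  }
  throw new Error(`Workflow still in progress after ${options.maxAttempts} attempts`);
}
```

With the SDK, `getState` could wrap `client.workflow.get(workflowId)` and map its reported state to one of the three values. For long-running workflows, webhooks avoid polling altogether.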

Checking Workflow Status

When using the API, poll the workflow status to know when extraction is complete.

Workflow States:

IN_PROGRESS - Extraction is running
COMPLETED - Data is ready to retrieve
FAILED - Extraction failed (check errors field)

Pagination Handling

Automatically navigate through multiple pages of results, then fetch the extracted data one page at a time or all at once:

const extraction = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/ecommerce/pagination"],
    name: "Paginated Extraction",
    navigationMode: "paginated-page",
    extraction: (builder) =>
      builder
        .entity("Product")
        .field("title", "Product name", "STRING", {
          example: "Sennheiser HD 6XX",
        })
        .field("price", "Product price", "MONEY"),
  })
  .create();

const result = await extraction.run({ limit: 10 });

// Fetch a single page with pagination info
const page = await result.fetchData({ page: 1, limit: 5 });
console.log("Page data:", page.data);
console.log("Pagination:", page.pagination);

// Or get all data at once
const allData = await result.fetchAllData({});
console.log("All data:", allData);

Advanced Configuration

Proxy Locations

Specify geographic location for scraping:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/magic"],
    name: "Geo-located Extraction",
    extraction: (builder) =>
      builder
        .entity("Product")
        .field("title", "Title", "STRING", { example: "example" }),
  })
  .setLocation({
    type: "manual",
    isoCode: "US",
  })
  .create();

Available locations:

US - United States
GB - United Kingdom
DE - Germany
NL - Netherlands
CA - Canada
auto - Automatic selection

Preview Mode

Skip manual review and activate workflows immediately:

const workflow = await client
  .extract({
    urls: ["https://sandbox.kadoa.com/magic"],
    name: "Direct Activation",
    extraction: (builder) =>
      builder
        .entity("Product")
        .field("title", "Title", "STRING", { example: "example" }),
  })
  .bypassPreview() // Skip review step
  .create();

// Workflow is immediately active
