
Working with Schemas

Define the structure of data you want to extract using the builder API:
const extraction = await client
  .extract({
    urls: ['https://sandbox.kadoa.com/ecommerce'],
    name: 'Product Extraction',
    extraction: builder => builder
      .entity('Product')
      .field('title', 'Product name', 'STRING', { example: 'Laptop' })
      .field('price', 'Product price', 'MONEY')
      .field('inStock', 'Availability', 'BOOLEAN')
      .field('rating', 'Star rating 1-5', 'NUMBER')
  })
  .create();

Reusable Schemas

For consistent data extraction across multiple workflows, you can create and manage schemas separately using the Schema Management API.

Schema Management API

The Schema Management API allows you to create, retrieve, and delete schemas programmatically. Saved schemas can be reused across multiple extractions, ensuring consistent data structure.

When to Use Saved Schemas

Use saved schemas when you:
  • Extract the same data structure from multiple websites
  • Want to maintain consistent field definitions across workflows
  • Need to programmatically manage schema lifecycle
  • Share schemas across different parts of your application
For one-off extractions, inline schema definitions (shown above) are simpler and don’t require separate schema management.

Create a Schema

const schema = await client.schema.create({
  name: 'Product Schema',
  entity: 'Product',
  fields: [
    {
      name: 'title',
      description: 'Product name',
      fieldType: 'SCHEMA',
      dataType: 'STRING',
      example: 'iPhone 15 Pro'
    },
    {
      name: 'price',
      description: 'Product price',
      fieldType: 'SCHEMA',
      dataType: 'MONEY'
    },
    {
      name: 'inStock',
      description: 'Availability',
      fieldType: 'SCHEMA',
      dataType: 'BOOLEAN'
    },
    {
      name: 'rating',
      description: 'Star rating',
      fieldType: 'SCHEMA',
      dataType: 'NUMBER'
    }
  ]
});

console.log('Schema created:', schema.id);
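
Every regular field in the call above repeats fieldType: 'SCHEMA'. One way to cut that repetition is a small local helper. The sketch below is not part of the SDK: schemaField and the SchemaField type are hypothetical names, and the object shape is copied from the example above rather than from the SDK's type definitions.

```typescript
// Hypothetical local helper, not an SDK export. The object shape mirrors
// the field literals used in the example above.
type SchemaField = {
  name: string;
  description: string;
  fieldType: 'SCHEMA';
  dataType: string;
  example?: string;
};

function schemaField(
  name: string,
  description: string,
  dataType: string,
  example?: string
): SchemaField {
  const field: SchemaField = { name, description, fieldType: 'SCHEMA', dataType };
  // Attach `example` only when one is given, so the key is absent otherwise.
  if (example !== undefined) {
    field.example = example;
  }
  return field;
}

// The same four fields as in the example above, written with the helper.
const productFields: SchemaField[] = [
  schemaField('title', 'Product name', 'STRING', 'iPhone 15 Pro'),
  schemaField('price', 'Product price', 'MONEY'),
  schemaField('inStock', 'Availability', 'BOOLEAN'),
  schemaField('rating', 'Star rating', 'NUMBER'),
];
```

The resulting array could then be passed as fields to client.schema.create in place of the hand-written literals.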

Get a Schema

Retrieve an existing schema by ID:
const schema = await client.schema.get('schema-id-123');

console.log(schema.name);     // 'Product Schema'
console.log(schema.entity);   // 'Product'
console.log(schema.fields);   // Array of field definitions

Delete a Schema

Remove a schema when it’s no longer needed:
await client.schema.delete('schema-id-123');
Deleting a schema does not affect existing workflows or extractions that were created using it.

Use a Saved Schema

Reference a saved schema in your extraction:
const extraction = await client
  .extract({
    urls: ['https://sandbox.kadoa.com/ecommerce'],
    name: 'Product Extraction',
    extraction: { schemaId: schema.id }
  })
  .create();

const result = await extraction.run();
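
The main benefit of a saved schema is that several extractions can point at the same schemaId. The sketch below loops over a list of sites and creates one extraction per site. The interfaces are assumptions that cover only the call shapes shown in this guide (extract, create, run); they are not the SDK's real type definitions, and runWithSharedSchema and Site are hypothetical names. What run() resolves to is not described here, so it is typed as unknown.

```typescript
// Minimal structural types covering only the calls shown in this guide.
// Assumptions for illustration, not the SDK's actual types.
interface ExtractionHandle {
  run(): Promise<unknown>;
}
interface PendingExtraction {
  create(): Promise<ExtractionHandle>;
}
interface ExtractOptions {
  urls: string[];
  name: string;
  extraction: { schemaId: string };
}
interface ExtractClient {
  extract(options: ExtractOptions): PendingExtraction;
}

// Hypothetical description of one target site.
interface Site {
  name: string;
  url: string;
}

// Create and run one extraction per site, all referencing the same saved schema.
async function runWithSharedSchema(
  client: ExtractClient,
  schemaId: string,
  sites: Site[]
): Promise<unknown[]> {
  const results: unknown[] = [];
  for (const site of sites) {
    const extraction = await client
      .extract({ urls: [site.url], name: site.name, extraction: { schemaId } })
      .create();
    results.push(await extraction.run());
  }
  return results;
}
```

The loop runs sequentially to keep the sketch simple; whether parallel runs are appropriate depends on limits this guide does not cover.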

Field Types

Schemas support three types of fields, selected by the fieldType property:
  1. Regular fields (fieldType: 'SCHEMA') - Structured data extraction (shown above)
  2. Classification fields (fieldType: 'CLASSIFICATION') - Categorize content into predefined labels
  3. Metadata fields (fieldType: 'METADATA') - Extract raw page content (HTML, Markdown, URLs)
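
The three field kinds can be modeled as a TypeScript discriminated union keyed on fieldType, which lets the compiler check that each kind carries the right properties. The types below are inferred from the field literals used in this guide's examples; the names (RegularField, ClassificationField, MetadataField, summarizeField) are hypothetical, and the SDK may export its own definitions.

```typescript
// Local types inferred from this guide's examples, not SDK exports.
type RegularField = {
  name: string;
  description: string;
  fieldType: 'SCHEMA';
  dataType: string;
  example?: string;
};

type ClassificationField = {
  name: string;
  description: string;
  fieldType: 'CLASSIFICATION';
  categories: { title: string; definition: string }[];
};

type MetadataField = {
  name: string;
  description: string;
  fieldType: 'METADATA';
  metadataKey: 'HTML' | 'MARKDOWN' | 'PAGE_URL';
};

type Field = RegularField | ClassificationField | MetadataField;

// Switching on fieldType narrows the union, so each branch only sees
// the properties that belong to that kind of field.
function summarizeField(field: Field): string {
  switch (field.fieldType) {
    case 'SCHEMA':
      return `${field.name}: ${field.dataType}`;
    case 'CLASSIFICATION':
      return `${field.name}: one of ${field.categories.map((c) => c.title).join(' | ')}`;
    case 'METADATA':
      return `${field.name}: raw ${field.metadataKey}`;
  }
}
```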

Available Data Types

For regular fields, specify the dataType as one of: STRING, NUMBER, BOOLEAN, DATE, DATETIME, MONEY, IMAGE, LINK, OBJECT, ARRAY.
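
If field definitions come from configuration or user input, it can help to check dataType values locally before sending them. The following is a minimal sketch: DATA_TYPES, DataType, and isDataType are hypothetical local names, and the list is copied from the data types named above rather than read from the SDK.

```typescript
// Data type names as listed in this guide; kept as a const tuple so a
// union type can be derived from it. Not an SDK export.
const DATA_TYPES = [
  'STRING',
  'NUMBER',
  'BOOLEAN',
  'DATE',
  'DATETIME',
  'MONEY',
  'IMAGE',
  'LINK',
  'OBJECT',
  'ARRAY',
] as const;

type DataType = (typeof DATA_TYPES)[number];

// Type guard: true only for an exact, case-sensitive match.
function isDataType(value: string): value is DataType {
  return (DATA_TYPES as readonly string[]).includes(value);
}
```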

Classification Fields

Categorize extracted content into predefined labels:
const schema = await client.schema.create({
  name: 'Article Schema',
  entity: 'Article',
  fields: [
    {
      name: 'title',
      description: 'Article headline',
      fieldType: 'SCHEMA',
      dataType: 'STRING',
      example: 'Breaking News'
    },
    {
      name: 'category',
      description: 'Article category',
      fieldType: 'CLASSIFICATION',
      categories: [
        { title: 'Technology', definition: 'Tech news and updates' },
        { title: 'Business', definition: 'Business and finance' },
        { title: 'Sports', definition: 'Sports coverage' }
      ]
    }
  ]
});
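
Classification labels are easy to duplicate by accident when the list grows. The sketch below builds a classification field and checks locally that the category list is non-empty and that titles are unique. This is an assumption about what makes a sensible definition: this guide does not say whether the API itself rejects such input, and classificationField is a hypothetical helper, not an SDK function.

```typescript
// Shapes copied from the classification example above; not SDK exports.
type Category = { title: string; definition: string };

type ClassificationField = {
  name: string;
  description: string;
  fieldType: 'CLASSIFICATION';
  categories: Category[];
};

function classificationField(
  name: string,
  description: string,
  categories: Category[]
): ClassificationField {
  if (categories.length === 0) {
    throw new Error(`Field "${name}" needs at least one category`);
  }
  // Local sanity check: reject repeated titles before building the field.
  const titles = new Set<string>();
  for (const category of categories) {
    if (titles.has(category.title)) {
      throw new Error(`Duplicate category title: ${category.title}`);
    }
    titles.add(category.title);
  }
  return { name, description, fieldType: 'CLASSIFICATION', categories };
}

const categoryField = classificationField('category', 'Article category', [
  { title: 'Technology', definition: 'Tech news and updates' },
  { title: 'Business', definition: 'Business and finance' },
  { title: 'Sports', definition: 'Sports coverage' },
]);
```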

Metadata Fields (Raw Content)

Extract raw page content alongside structured data:
const schema = await client.schema.create({
  name: 'Article with Raw Content',
  entity: 'Article',
  fields: [
    {
      name: 'title',
      description: 'Article headline',
      fieldType: 'SCHEMA',
      dataType: 'STRING',
      example: 'Latest News'
    },
    {
      name: 'rawMarkdown',
      description: 'Page content as Markdown',
      fieldType: 'METADATA',
      metadataKey: 'MARKDOWN'
    },
    {
      name: 'rawHtml',
      description: 'Page HTML source',
      fieldType: 'METADATA',
      metadataKey: 'HTML'
    },
    {
      name: 'pageUrl',
      description: 'Page URL',
      fieldType: 'METADATA',
      metadataKey: 'PAGE_URL'
    }
  ]
});
Available metadataKey options: HTML, MARKDOWN, PAGE_URL