Working with Schemas
Define the structure of data you want to extract using the builder API:
const extraction = await client
  .extract({
    urls: ['https://sandbox.kadoa.com/ecommerce'],
    name: 'Product Extraction',
    extraction: builder => builder
      .entity('Product')
      .field('title', 'Product name', 'STRING', { example: 'Laptop' })
      .field('price', 'Product price', 'MONEY')
      .field('inStock', 'Availability', 'BOOLEAN')
      .field('rating', 'Star rating 1-5', 'NUMBER')
  })
  .create();
Reusable Schemas
For consistent data extraction across multiple workflows, you can create and manage schemas separately using the Schema Management API.
Schema Management API
The Schema Management API allows you to create, retrieve, and delete schemas programmatically. Saved schemas can be reused across multiple extractions, ensuring consistent data structure.
When to Use Saved Schemas
Use saved schemas when you:
- Extract the same data structure from multiple websites
- Want to maintain consistent field definitions across workflows
- Need to programmatically manage schema lifecycle
- Share schemas across different parts of your application
For one-off extractions, inline schema definitions (shown above) are simpler and don’t require separate schema management.
Create a Schema
const schema = await client.schema.create({
  name: 'Product Schema',
  entity: 'Product',
  fields: [
    {
      name: 'title',
      description: 'Product name',
      fieldType: 'SCHEMA',
      dataType: 'STRING',
      example: 'iPhone 15 Pro'
    },
    {
      name: 'price',
      description: 'Product price',
      fieldType: 'SCHEMA',
      dataType: 'MONEY'
    },
    {
      name: 'inStock',
      description: 'Availability',
      fieldType: 'SCHEMA',
      dataType: 'BOOLEAN'
    },
    {
      name: 'rating',
      description: 'Star rating',
      fieldType: 'SCHEMA',
      dataType: 'NUMBER'
    }
  ]
});

console.log('Schema created:', schema.id);
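The field objects passed to client.schema.create carry the same information as the .field(...) calls in the inline builder example. As a rough illustration of that correspondence (not a description of the SDK's internals), the sketch below converts builder-style arguments into the field object shape. The names toSchemaField and SchemaField are hypothetical and introduced only for this example; the property names are the ones used above.

```typescript
// Hypothetical local type: only the property names come from the example above.
interface SchemaField {
  name: string;
  description: string;
  fieldType: 'SCHEMA';
  dataType: string;
  example?: string;
}

// Convert builder-style arguments (name, description, type, options)
// into the object shape used in the `fields` array.
function toSchemaField(
  name: string,
  description: string,
  dataType: string,
  options: { example?: string } = {},
): SchemaField {
  const field: SchemaField = { name, description, fieldType: 'SCHEMA', dataType };
  // Attach `example` only when one was given, so the key is absent otherwise.
  if (options.example !== undefined) {
    field.example = options.example;
  }
  return field;
}

const productFields: SchemaField[] = [
  toSchemaField('title', 'Product name', 'STRING', { example: 'iPhone 15 Pro' }),
  toSchemaField('price', 'Product price', 'MONEY'),
  toSchemaField('inStock', 'Availability', 'BOOLEAN'),
  toSchemaField('rating', 'Star rating', 'NUMBER'),
];
```

An array built this way could be supplied as the fields value in the create call above.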
Get a Schema
Retrieve an existing schema by ID:
const schema = await client.schema.get('schema-id-123');
console.log(schema.name); // 'Product Schema'
console.log(schema.entity); // 'Product'
console.log(schema.fields); // Array of field definitions
Delete a Schema
Remove a schema when it’s no longer needed:
await client.schema.delete('schema-id-123');
Deleting a schema does not affect existing workflows or extractions that were created using it.
Use a Saved Schema
Reference a saved schema in your extraction:
const extraction = await client
  .extract({
    urls: ['https://sandbox.kadoa.com/ecommerce'],
    name: 'Product Extraction',
    extraction: { schemaId: schema.id }
  })
  .create();

const result = await extraction.run();
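Because the extraction only references the schema by ID, the same ID can be used for several sites. A minimal sketch of preparing one options object per site follows. The helper buildExtractionOptions and the SiteTarget type are hypothetical, the second URL is a placeholder, and the returned objects simply follow the shape passed to client.extract above.

```typescript
// Hypothetical local types mirroring the options object shown above.
interface SiteTarget {
  name: string;
  urls: string[];
}

interface ExtractionOptions {
  urls: string[];
  name: string;
  extraction: { schemaId: string };
}

// Build one options object per site, all pointing at the same saved schema.
function buildExtractionOptions(
  schemaId: string,
  targets: SiteTarget[],
): ExtractionOptions[] {
  return targets.map((target): ExtractionOptions => ({
    urls: target.urls.slice(), // copy so later changes to the input do not leak in
    name: target.name,
    extraction: { schemaId },
  }));
}

const optionsList = buildExtractionOptions('schema-id-123', [
  { name: 'Sandbox Products', urls: ['https://sandbox.kadoa.com/ecommerce'] },
  { name: 'Example Products', urls: ['https://example.com/products'] }, // placeholder URL
]);
```

Each entry could then be passed to client.extract(...) and followed by .create(), as in the example above.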
Field Types
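Schemas support three types of fields, distinguished by the fieldType property:
- Regular fields (fieldType: 'SCHEMA') - Structured data extraction (shown above)
- Classification fields (fieldType: 'CLASSIFICATION') - Categorize content into predefined labels
- Metadata fields (fieldType: 'METADATA') - Extract raw page content (HTML, Markdown, URLs)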
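The examples on this page set fieldType to 'SCHEMA', 'CLASSIFICATION', or 'METADATA', and each kind carries its own extra property (dataType, categories, or metadataKey). One way to model that in application code is a discriminated union, sketched below. All type and function names here are hypothetical local definitions based on those examples; the SDK may export its own types.

```typescript
// Hypothetical local types modelled on the field objects used on this page.
interface RegularField {
  name: string;
  description: string;
  fieldType: 'SCHEMA';
  dataType: string;
  example?: string;
}

interface ClassificationField {
  name: string;
  description: string;
  fieldType: 'CLASSIFICATION';
  categories: { title: string; definition: string }[];
}

interface MetadataField {
  name: string;
  description: string;
  fieldType: 'METADATA';
  metadataKey: 'HTML' | 'MARKDOWN' | 'PAGE_URL';
}

type Field = RegularField | ClassificationField | MetadataField;

// Switching on `fieldType` narrows the union, so each branch can only
// touch the properties that exist for that kind of field.
function summarizeField(field: Field): string {
  switch (field.fieldType) {
    case 'SCHEMA':
      return `${field.name}: ${field.dataType}`;
    case 'CLASSIFICATION':
      return `${field.name}: one of ${field.categories.map(c => c.title).join(', ')}`;
    case 'METADATA':
      return `${field.name}: raw ${field.metadataKey}`;
  }
}

const sampleFields: Field[] = [
  { name: 'title', description: 'Article headline', fieldType: 'SCHEMA', dataType: 'STRING' },
  {
    name: 'category',
    description: 'Article category',
    fieldType: 'CLASSIFICATION',
    categories: [
      { title: 'Technology', definition: 'Tech news and updates' },
      { title: 'Business', definition: 'Business and finance' },
    ],
  },
  { name: 'pageUrl', description: 'Page URL', fieldType: 'METADATA', metadataKey: 'PAGE_URL' },
];

const summaries = sampleFields.map(summarizeField);
```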
Available Data Types
For regular fields, specify the dataType:
- STRING
- NUMBER
- BOOLEAN
- DATE
- DATETIME
- MONEY
- IMAGE
- LINK
- OBJECT
- ARRAY
See data type details and examples →
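When field definitions come from configuration or user input, it can help to check the dataType string before building a schema. The sketch below is a local guard over the values listed above. The names DATA_TYPES and isDataType are hypothetical, and the assumption that values are case-sensitive is based only on the uppercase spelling used in the examples.

```typescript
// The data types listed above, kept as a readonly tuple so the
// union type can be derived from the same source of truth.
const DATA_TYPES = [
  'STRING', 'NUMBER', 'BOOLEAN', 'DATE', 'DATETIME',
  'MONEY', 'IMAGE', 'LINK', 'OBJECT', 'ARRAY',
] as const;

type DataType = (typeof DATA_TYPES)[number];

// Type guard: true only for an exact (case-sensitive) match.
function isDataType(value: string): value is DataType {
  return (DATA_TYPES as readonly string[]).indexOf(value) !== -1;
}

const moneyIsValid = isDataType('MONEY');
const lowercaseIsValid = isDataType('money');
const unknownIsValid = isDataType('TEXT');
```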
Classification Fields
Categorize extracted content into predefined labels:
const schema = await client.schema.create({
  name: 'Article Schema',
  entity: 'Article',
  fields: [
    {
      name: 'title',
      description: 'Article headline',
      fieldType: 'SCHEMA',
      dataType: 'STRING',
      example: 'Breaking News'
    },
    {
      name: 'category',
      description: 'Article category',
      fieldType: 'CLASSIFICATION',
      categories: [
        { title: 'Technology', definition: 'Tech news and updates' },
        { title: 'Business', definition: 'Business and finance' },
        { title: 'Sports', definition: 'Sports coverage' }
      ]
    }
  ]
});
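If several schemas use classification fields, a small helper can keep the category lists tidy before they are sent. The sketch below builds a classification field and rejects empty or duplicate labels. The helper name classificationField and both checks are local conventions assumed for this example, not documented API requirements.

```typescript
// Hypothetical local types following the classification example above.
interface Category {
  title: string;
  definition: string;
}

interface ClassificationField {
  name: string;
  description: string;
  fieldType: 'CLASSIFICATION';
  categories: Category[];
}

// Build a classification field, failing early on an empty label list
// or on titles that differ only by case or surrounding whitespace.
function classificationField(
  name: string,
  description: string,
  categories: Category[],
): ClassificationField {
  if (categories.length === 0) {
    throw new Error(`"${name}" needs at least one category`);
  }
  const seen: string[] = [];
  for (const category of categories) {
    const key = category.title.trim().toLowerCase();
    if (seen.indexOf(key) !== -1) {
      throw new Error(`duplicate category "${category.title}" in "${name}"`);
    }
    seen.push(key);
  }
  return { name, description, fieldType: 'CLASSIFICATION', categories: categories.slice() };
}

const categoryField = classificationField('category', 'Article category', [
  { title: 'Technology', definition: 'Tech news and updates' },
  { title: 'Business', definition: 'Business and finance' },
  { title: 'Sports', definition: 'Sports coverage' },
]);
```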
Metadata Fields (Raw Content)
Extract raw page content alongside structured data:
const schema = await client.schema.create({
  name: 'Article with Raw Content',
  entity: 'Article',
  fields: [
    {
      name: 'title',
      description: 'Article headline',
      fieldType: 'SCHEMA',
      dataType: 'STRING',
      example: 'Latest News'
    },
    {
      name: 'rawMarkdown',
      description: 'Page content as Markdown',
      fieldType: 'METADATA',
      metadataKey: 'MARKDOWN'
    },
    {
      name: 'rawHtml',
      description: 'Page HTML source',
      fieldType: 'METADATA',
      metadataKey: 'HTML'
    },
    {
      name: 'pageUrl',
      description: 'Page URL',
      fieldType: 'METADATA',
      metadataKey: 'PAGE_URL'
    }
  ]
});
Available metadataKey options: HTML, MARKDOWN, PAGE_URL
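Since metadata fields differ only in their metadataKey, they can be generated from a list of keys. The sketch below does that with default names and descriptions copied from the example above. The names metadataFields and METADATA_DEFAULTS are hypothetical, and dropping repeated keys is a choice made for this example rather than documented behaviour.

```typescript
type MetadataKey = 'HTML' | 'MARKDOWN' | 'PAGE_URL';

// Hypothetical local type following the metadata example above.
interface MetadataField {
  name: string;
  description: string;
  fieldType: 'METADATA';
  metadataKey: MetadataKey;
}

// Default field names and descriptions, taken from the example above.
const METADATA_DEFAULTS: Record<MetadataKey, { name: string; description: string }> = {
  HTML: { name: 'rawHtml', description: 'Page HTML source' },
  MARKDOWN: { name: 'rawMarkdown', description: 'Page content as Markdown' },
  PAGE_URL: { name: 'pageUrl', description: 'Page URL' },
};

// Build one metadata field per distinct key, preserving first-seen order.
function metadataFields(keys: MetadataKey[]): MetadataField[] {
  const unique = keys.filter((key, index) => keys.indexOf(key) === index);
  return unique.map((key): MetadataField => ({
    name: METADATA_DEFAULTS[key].name,
    description: METADATA_DEFAULTS[key].description,
    fieldType: 'METADATA',
    metadataKey: key,
  }));
}

const rawContentFields = metadataFields(['MARKDOWN', 'PAGE_URL', 'MARKDOWN']);
```

The resulting array could be appended to the regular fields of a schema before it is created.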