Extraction Schema
The Schema tells the AI exactly which fields to extract and what data type each field should contain. Ocriva uses standard JSON Schema syntax.
Basic Structure
Every schema should declare object as its top-level type and list its fields in a properties map:
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "What this field contains"
    }
  }
}
Supported Field Types
| Type | Description | Example Value |
|---|---|---|
| string | Text value | "INV-2024-001" |
| number | Numeric value (int or float) | 1500.00 |
| boolean | True/false flag | true |
| array | List of items | ["item1", "item2"] |
| object | Nested object | { "street": "...", "city": "..." } |
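
The two rules above (a top-level object with a properties map, and field types taken from the table) can be checked before a schema is used. The following Python sketch is one possible way to do that. The names `check_schema` and `SUPPORTED_TYPES` are hypothetical and not part of Ocriva, and treating the five types in the table as the complete set is an assumption.

```python
# Assumption: the five types listed in the table are the full supported set.
SUPPORTED_TYPES = {"string", "number", "boolean", "array", "object"}


def check_schema(schema, path="$"):
    """Return a list of problems found in a schema dict (empty if none)."""
    problems = []
    field_type = schema.get("type")
    if field_type not in SUPPORTED_TYPES:
        problems.append(f"{path}: unsupported type {field_type!r}")
    if path == "$":
        # Only the root is required to be an object with a properties map.
        if field_type != "object":
            problems.append("$: top-level type must be 'object'")
        if not isinstance(schema.get("properties"), dict):
            problems.append("$: missing 'properties' map")
    # Recurse into nested object fields.
    properties = schema.get("properties")
    if isinstance(properties, dict):
        for name, sub in properties.items():
            problems.extend(check_schema(sub, f"{path}.{name}"))
    # Recurse into the item definition of an array field.
    items = schema.get("items")
    if isinstance(items, dict):
        problems.extend(check_schema(items, f"{path}[]"))
    return problems


basic = {
    "type": "object",
    "properties": {
        "field_name": {"type": "string", "description": "What this field contains"}
    },
}
print(check_schema(basic))  # prints []
```

A schema that uses a type outside the table, such as "integer", would be reported with the path of the offending field, for example `$.count`.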
TIP
Add a description to every field, even simple ones. The AI uses these descriptions as extraction instructions. A field with "description": "Invoice issue date in YYYY-MM-DD format" produces far more consistent results than a bare date field with no description.
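A small lint step can catch fields that were left without a description. The sketch below is illustrative only: `fields_missing_description` is a hypothetical helper, not an Ocriva function, and it assumes the schema is available as a plain Python dict.

```python
def fields_missing_description(schema, path=""):
    """Yield dotted paths of fields that have no non-empty 'description'."""
    for name, field in schema.get("properties", {}).items():
        field_path = f"{path}.{name}" if path else name
        if not str(field.get("description", "")).strip():
            yield field_path
        # Recurse into nested objects.
        yield from fields_missing_description(field, field_path)
        # Recurse into the objects described by an array's "items".
        items = field.get("items")
        if isinstance(items, dict):
            yield from fields_missing_description(items, field_path + "[]")


schema = {
    "type": "object",
    "properties": {
        "invoice_date": {
            "type": "string",
            "description": "Invoice issue date in YYYY-MM-DD format",
        },
        "vendor": {
            "type": "object",
            "description": "Seller/vendor details",
            "properties": {"name": {"type": "string"}},
        },
    },
}
print(list(fields_missing_description(schema)))  # prints ['vendor.name']
```

Run against the invoice schema in this section, a check like this would list the nested vendor, customer, and line-item fields, since they carry no descriptions of their own.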
Example: Invoice Schema
A comprehensive invoice extraction schema with nested line items:
{
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "Unique invoice identifier, e.g. INV-2024-001"
    },
    "invoice_date": {
      "type": "string",
      "description": "Invoice issue date in YYYY-MM-DD format"
    },
    "due_date": {
      "type": "string",
      "description": "Payment due date in YYYY-MM-DD format"
    },
    "vendor": {
      "type": "object",
      "description": "Seller/vendor details",
      "properties": {
        "name": { "type": "string" },
        "tax_id": { "type": "string" },
        "address": { "type": "string" }
      }
    },
    "customer": {
      "type": "object",
      "description": "Buyer/customer details",
      "properties": {
        "name": { "type": "string" },
        "tax_id": { "type": "string" },
        "address": { "type": "string" }
      }
    },
    "line_items": {
      "type": "array",
      "description": "List of products or services on the invoice",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": {
      "type": "number",
      "description": "Amount before tax"
    },
    "vat_amount": {
      "type": "number",
      "description": "VAT/tax amount (7% in Thailand)"
    },
    "total_amount": {
      "type": "number",
      "description": "Final total including all taxes"
    },
    "currency": {
      "type": "string",
      "description": "Currency code, e.g. THB, USD"
    }
  }
}
Tips for Writing Good Schemas
- Add description to every field: The AI reads these descriptions as instructions. A field named date with description "Invoice issue date in YYYY-MM-DD format" will produce far more consistent results than a bare date field.
- Use snake_case for field names: Consistent naming prevents confusion downstream.
- Keep arrays flat when possible: Deep nesting (3+ levels) can reduce AI accuracy.
- Specify format in description: For dates, phone numbers, and IDs, tell the AI the expected format.
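
The snake_case tip is easy to enforce mechanically. The following sketch shows one way to do it. The helper `non_snake_case_fields` and the regular expression are assumptions chosen for this example, and the pattern encodes one common reading of snake_case (lowercase letters and digits separated by single underscores).

```python
import re

# Assumption: snake_case means lowercase words/digits joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case_fields(schema):
    """Return field names, at any nesting level, that are not snake_case."""
    bad = []
    for name, field in schema.get("properties", {}).items():
        if not SNAKE_CASE.match(name):
            bad.append(name)
        # Check fields of nested objects.
        bad.extend(non_snake_case_fields(field))
        # Check fields of objects inside arrays.
        if isinstance(field.get("items"), dict):
            bad.extend(non_snake_case_fields(field["items"]))
    return bad


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "dueDate": {"type": "string"},
        "vendor": {
            "type": "object",
            "properties": {"tax_id": {"type": "string"}, "Tax-ID": {"type": "string"}},
        },
    },
}
print(non_snake_case_fields(schema))  # prints ['dueDate', 'Tax-ID']
```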
WARNING
Deeply nested schemas (objects inside objects inside arrays, 3+ levels deep) are harder for AI models to fill accurately. If you find extraction quality dropping on nested fields, flatten the schema or break the document type into multiple simpler templates.
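To spot schemas that cross the 3-level threshold, nesting depth can be measured directly. The sketch below counts object and array levels beneath the top-level object. This counting rule and the name `nesting_depth` are assumptions made for illustration, as the warning does not define precisely how levels are counted.

```python
def nesting_depth(schema):
    """Count container levels (objects and arrays) nested below this schema node."""
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    depths = [
        1 + nesting_depth(child)
        for child in children
        if child.get("type") in ("object", "array")
    ]
    return max(depths, default=0)


# An array of objects, as in the invoice example's line_items: 2 levels.
shallow = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"total": {"type": "number"}}},
        }
    },
}

# Adding an object inside each array item brings it to 3 levels.
deep = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "tax": {"type": "object", "properties": {"rate": {"type": "number"}}}
                },
            },
        }
    },
}

for candidate in (shallow, deep):
    depth = nesting_depth(candidate)
    print(depth, "consider flattening" if depth >= 3 else "ok")
```

Under this counting rule the invoice schema in this section measures 2, so it stays below the threshold.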
