Extraction Schema

Define JSON Schema for structured data extraction.

Tags: templates, schema, json

Published: 3/31/2026

The schema tells the AI exactly which fields to extract and which data type each field should have. Ocriva uses standard JSON Schema syntax.

Basic Structure

Every schema should declare a top-level type of object and list its fields in a properties map:

{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "What this field contains"
    }
  }
}

Supported Field Types

Type      Description                     Example Value
string    Text value                      "INV-2024-001"
number    Numeric value (int or float)    1500.00
boolean   True/false flag                 true
array     List of items                   ["item1", "item2"]
object    Nested object                   { "street": "...", "city": "..." }
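
To make the five types concrete, here is a minimal Python sketch that checks whether an already-parsed value has the shape a schema describes. It is an illustration only, not Ocriva's validator: the helper name matches_type is hypothetical, it covers only the five types listed above, and it treats fields missing from an object as acceptable.

```python
from typing import Any

# JSON Schema type name -> Python types that satisfy it.
_PY_TYPES = {
    "string": (str,),
    "number": (int, float),
    "boolean": (bool,),
    "array": (list,),
    "object": (dict,),
}


def matches_type(value: Any, schema: dict) -> bool:
    """Return True if `value` has the shape described by `schema` (hypothetical helper)."""
    expected = schema.get("type")
    if expected not in _PY_TYPES:
        return False  # type outside the five covered by this sketch
    # bool is a subclass of int in Python, so reject it explicitly for "number".
    if expected == "number" and isinstance(value, bool):
        return False
    if not isinstance(value, _PY_TYPES[expected]):
        return False
    if expected == "array" and "items" in schema:
        return all(matches_type(v, schema["items"]) for v in value)
    if expected == "object":
        props = schema.get("properties", {})
        # Only fields that are present get checked; missing fields are allowed here.
        return all(matches_type(value[k], props[k]) for k in props if k in value)
    return True


print(matches_type(1500.00, {"type": "number"}))  # True
print(matches_type(True, {"type": "number"}))     # False
```

The explicit boolean check matters in Python specifically: without it, true would pass as a number.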

TIP

Add a description to every field, even simple ones. The AI uses these descriptions as extraction instructions. A field with "description": "Invoice issue date in YYYY-MM-DD format" produces far more consistent results than a bare date field with no description.

Example: Invoice Schema

A comprehensive invoice extraction schema with nested line items:

{
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "Unique invoice identifier, e.g. INV-2024-001"
    },
    "invoice_date": {
      "type": "string",
      "description": "Invoice issue date in YYYY-MM-DD format"
    },
    "due_date": {
      "type": "string",
      "description": "Payment due date in YYYY-MM-DD format"
    },
    "vendor": {
      "type": "object",
      "description": "Seller/vendor details",
      "properties": {
        "name": { "type": "string" },
        "tax_id": { "type": "string" },
        "address": { "type": "string" }
      }
    },
    "customer": {
      "type": "object",
      "description": "Buyer/customer details",
      "properties": {
        "name": { "type": "string" },
        "tax_id": { "type": "string" },
        "address": { "type": "string" }
      }
    },
    "line_items": {
      "type": "array",
      "description": "List of products or services on the invoice",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": {
      "type": "number",
      "description": "Amount before tax"
    },
    "vat_amount": {
      "type": "number",
      "description": "VAT/tax amount (7% in Thailand)"
    },
    "total_amount": {
      "type": "number",
      "description": "Final total including all taxes"
    },
    "currency": {
      "type": "string",
      "description": "Currency code, e.g. THB, USD"
    }
  }
}
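
When reviewing a schema of this size, it can help to list every field as a flat path. The sketch below does that with plain Python over a parsed schema; field_paths and the "[]" marker for array items are conventions chosen for this example, not part of Ocriva, and the schema shown is a trimmed copy of the invoice schema above.

```python
from typing import Iterator


def field_paths(schema: dict, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, type) for every field, descending into objects and arrays."""
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        yield path, spec.get("type", "")
        if spec.get("type") == "object":
            yield from field_paths(spec, prefix=f"{path}.")
        elif spec.get("type") == "array":
            items = spec.get("items", {})
            # Only arrays of objects have named sub-fields to list.
            if items.get("type") == "object":
                yield from field_paths(items, prefix=f"{path}[].")


# Trimmed copy of the invoice schema, for illustration.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor": {"type": "object", "properties": {"name": {"type": "string"}}},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"quantity": {"type": "number"}}},
        },
    },
}

for path, kind in field_paths(schema):
    print(f"{path}: {kind}")
```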

Tips for Writing Good Schemas

  • Add a description to every field: the AI reads descriptions as instructions. A field named date described as "Invoice issue date in YYYY-MM-DD format" produces far more consistent results than a bare date field.
  • Use snake_case for field names: consistent naming prevents confusion downstream.
  • Keep nesting shallow when possible: deep nesting (3+ levels) can reduce AI accuracy.
  • Specify the format in the description: for dates, phone numbers, and IDs, tell the AI the expected format.
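
The first two tips are easy to check mechanically before saving a template. The following is a rough sketch of such a check; lint_schema is a hypothetical helper written for this page, and the snake_case pattern is one reasonable definition (lowercase letters and digits separated by single underscores), not an Ocriva rule.

```python
import re

# Lowercase words or digits joined by single underscores, starting with a letter.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def lint_schema(schema: dict, prefix: str = "") -> list[str]:
    """Return warnings for fields with non-snake_case names or no description."""
    warnings = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if not _SNAKE_CASE.match(name):
            warnings.append(f"{path}: name is not snake_case")
        if not spec.get("description"):
            warnings.append(f"{path}: missing description")
        # Recurse into nested objects and into the items of arrays.
        if spec.get("type") == "object":
            warnings.extend(lint_schema(spec, f"{path}."))
        elif spec.get("type") == "array":
            warnings.extend(lint_schema(spec.get("items", {}), f"{path}[]."))
    return warnings


draft = {
    "type": "object",
    "properties": {
        "invoiceDate": {"type": "string"},
        "due_date": {
            "type": "string",
            "description": "Payment due date in YYYY-MM-DD format",
        },
    },
}

for warning in lint_schema(draft):
    print(warning)
```

Here the draft field invoiceDate is reported twice, once for its name and once for its missing description, while due_date passes.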

WARNING

Deeply nested schemas (objects inside objects inside arrays, 3+ levels deep) are harder for AI models to fill accurately. If you find extraction quality dropping on nested fields, flatten the schema or break the document type into multiple simpler templates.
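
One way to spot schemas at risk is to measure nesting depth. The sketch below counts how many object or array levels sit beneath the top-level object. That counting convention is an assumption made for this example (the page does not define "levels" precisely), and nesting_below_root is a hypothetical helper. Under this convention, an array of objects such as line_items counts as 2 levels.

```python
def _levels(node: dict) -> int:
    """Count container levels (objects and arrays) from this node down."""
    kind = node.get("type")
    if kind == "object":
        children = node.get("properties", {}).values()
        return 1 + max((_levels(child) for child in children), default=0)
    if kind == "array":
        return 1 + _levels(node.get("items", {}))
    return 0  # scalar field: string, number, or boolean


def nesting_below_root(schema: dict) -> int:
    """Container levels nested under the top-level object (assumed convention)."""
    return max(_levels(schema) - 1, 0)


# Hypothetical schema: array -> object -> object, i.e. 3 levels below the root.
deep = {
    "type": "object",
    "properties": {
        "orders": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "shipping": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                    }
                },
            },
        }
    },
}

if nesting_below_root(deep) >= 3:
    print("Consider flattening this schema or splitting it into simpler templates.")
```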