Data Files & RAG

Upload reference documents for retrieval-augmented generation.

Tags: templates, rag, data-files, vector-store

Published: 3/31/2026

Data Files let you attach reference documents to a template. The AI uses these documents as additional context when processing your uploads — this is called Retrieval-Augmented Generation (RAG).

Supported File Formats

| Format | Use Case |
| --- | --- |
| .json | Structured reference data (product catalogs, code tables) |
| .md | Markdown documentation, glossaries, rules |
| .txt | Plain text reference material, lookup tables |

IMPORTANT

Data files are uploaded to the AI provider's servers (e.g., OpenAI) for vector search. Ensure your files don't contain sensitive information that shouldn't leave your systems.

NOTE

Only .json, .md, and .txt files are supported as data files. Other formats (PDF, DOCX, CSV, images) cannot be attached as reference documents. If your reference data is in another format, convert it to one of these three supported formats first.
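
The format restriction above can be checked locally before uploading. The following is a minimal sketch, not part of Ocriva: the function names are hypothetical, and the case-insensitive extension match is an assumption, since this page does not say whether an extension such as `.JSON` is accepted.

```python
from pathlib import Path

# Extensions listed as supported on this page.
SUPPORTED_EXTENSIONS = {".json", ".md", ".txt"}


def is_supported_data_file(filename: str) -> bool:
    """Return True if the filename ends in a supported data-file extension.

    Assumption: matching is case-insensitive.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def unsupported_files(filenames: list[str]) -> list[str]:
    """Return the filenames that would need converting before upload."""
    return [name for name in filenames if not is_supported_data_file(name)]


print(unsupported_files(["glossary.md", "customers.csv", "rules.txt"]))
```

Any file the helper reports would need converting to .json, .md, or .txt first.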

How It Works

  1. You upload one or more data files to a template.
  2. When a document is processed, the vector store (hosted by the AI provider) indexes the data files.
  3. The AI retrieves the most relevant chunks from your data files alongside the document content.
  4. The AI can reference this context to improve extraction accuracy.
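
To make the retrieval step more concrete, here is a toy sketch of "find the most relevant chunks". It is an illustration only and not how Ocriva or the AI provider implements retrieval: a real vector store compares embeddings, whereas this stand-in simply counts shared tokens. The `tokenize` and `retrieve` helpers are hypothetical names.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k chunks that share the most tokens with the query."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(chunk)), chunk) for chunk in chunks]
    # Keep only chunks with at least one shared token, best score first.
    relevant = [item for item in scored if item[0] > 0]
    relevant.sort(key=lambda item: item[0], reverse=True)
    return [chunk for _, chunk in relevant[:top_k]]


# Each chunk stands in for one indexed piece of a data file.
chunks = [
    "PRD-001 Wireless Keyboard TH Layout",
    "PRD-002 USB-C Hub 7-Port",
    "PRD-003 Ergonomic Office Chair",
]
print(retrieve("Invoice line: PRD-002 x3", chunks, top_k=1))
# ['PRD-002 USB-C Hub 7-Port']
```

The retrieved chunk is what the AI would see next to the document content in step 3.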

IMPORTANT

Data files are uploaded to the AI provider's vector store to enable retrieval. Do not upload files containing sensitive personal data, credentials, or confidential information as data files. Use them only for non-sensitive reference material such as product catalogs, glossaries, or extraction rule documents.

Use Cases for Data Files

  • Product catalog: Attach a JSON list of product codes and names so the AI can resolve abbreviations found in invoices.
  • Customer list: Provide a JSON list of known customer names and IDs so the AI can match name variants to the correct customer.
  • Extraction rules: A Markdown file describing business-specific rules, such as "if the department code starts with 'MKT', tag it as Marketing".
  • Glossary: Define terms specific to your industry so the AI understands domain jargon.

Data File Example

product-catalog.json:

[
  { "code": "PRD-001", "name": "Wireless Keyboard TH Layout", "category": "Electronics" },
  { "code": "PRD-002", "name": "USB-C Hub 7-Port", "category": "Electronics" },
  { "code": "PRD-003", "name": "Ergonomic Office Chair", "category": "Furniture" }
]

When an invoice references PRD-002, the AI will automatically populate the product name as "USB-C Hub 7-Port" rather than leaving it as a bare code.
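
Before uploading a catalog like this, it can help to confirm that it parses and that every code resolves as expected. The sketch below is a local check with hypothetical helper names (`build_index`, `resolve_name`). It mirrors the lookup described above but is not the mechanism the AI uses.

```python
import json

# Contents of product-catalog.json from the example above.
CATALOG_JSON = """
[
  { "code": "PRD-001", "name": "Wireless Keyboard TH Layout", "category": "Electronics" },
  { "code": "PRD-002", "name": "USB-C Hub 7-Port", "category": "Electronics" },
  { "code": "PRD-003", "name": "Ergonomic Office Chair", "category": "Furniture" }
]
"""


def build_index(catalog_text: str) -> dict[str, dict]:
    """Parse the catalog and index its entries by product code."""
    entries = json.loads(catalog_text)
    return {entry["code"]: entry for entry in entries}


def resolve_name(index: dict[str, dict], code: str) -> str:
    """Return the product name for a code, or the bare code if unknown."""
    entry = index.get(code)
    return entry["name"] if entry else code


index = build_index(CATALOG_JSON)
print(resolve_name(index, "PRD-002"))  # USB-C Hub 7-Port
print(resolve_name(index, "PRD-999"))  # PRD-999 (not in the catalog)
```

The same index can be used to spot-check extraction results against the catalog.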

Extraction Rules Example

Markdown files are well-suited for encoding business logic that would otherwise require custom post-processing. The AI reads the rules during retrieval and applies them inline while extracting field values.

extraction-rules.md:

# Invoice Extraction Rules
 
## Department Codes
- Codes starting with `MKT` → tag as "Marketing"
- Codes starting with `ENG` → tag as "Engineering"
- Codes starting with `FIN` → tag as "Finance"
 
## Currency Handling
- Always convert amounts to THB
- If no currency symbol is present, assume THB
- Round to 2 decimal places
 
## Vendor Name Normalization
- "บ." or "บจ." → expand to "บริษัท ... จำกัด"
- Remove trailing spaces and special characters

When the AI encounters a department code such as MKT-04 on an invoice, it retrieves the relevant rule chunk and tags the extracted field as "Marketing" automatically — no post-processing step required.
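
For spot-checking extraction results, the department rules can also be written as a deterministic reference. This is a sketch under stated assumptions: `tag_department` is a hypothetical helper, and uppercasing the code before matching is an assumption that the rules file does not state.

```python
from typing import Optional

# Prefix-to-department mapping copied from extraction-rules.md.
DEPARTMENT_PREFIXES = {
    "MKT": "Marketing",
    "ENG": "Engineering",
    "FIN": "Finance",
}


def tag_department(code: str) -> Optional[str]:
    """Return the department for a code, or None if no rule matches.

    Assumption: matching ignores letter case.
    """
    normalized = code.strip().upper()
    for prefix, department in DEPARTMENT_PREFIXES.items():
        if normalized.startswith(prefix):
            return department
    return None


print(tag_department("MKT-04"))  # Marketing
print(tag_department("OPS-01"))  # None
```

Comparing the AI's `department` field against this function on a few sample invoices is one way to confirm the rule file is being applied.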

Lookup Table Example

Plain text files work well for simple key-value lookups. Keep each entry on its own line so the vector store can retrieve individual rows efficiently.

country-codes.txt:

Country Code Lookup
TH = Thailand
US = United States
JP = Japan
SG = Singapore
GB = United Kingdom
DE = Germany
CN = China
AU = Australia

If a document contains a two-letter country code such as SG, the AI retrieves the matching row and expands it to "Singapore" in the extracted output.
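
A lookup file in this shape is easy to validate before upload. The sketch below parses `KEY = Value` lines into a dictionary. `parse_lookup` is a hypothetical helper, and skipping lines without an `=` (such as the title line) is a choice made for this example.

```python
# Contents of country-codes.txt from the example above.
COUNTRY_CODES_TXT = """\
Country Code Lookup
TH = Thailand
US = United States
JP = Japan
SG = Singapore
GB = United Kingdom
DE = Germany
CN = China
AU = Australia
"""


def parse_lookup(text: str) -> dict[str, str]:
    """Parse 'KEY = Value' lines; lines without '=' are skipped."""
    table: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() and value.strip():
            table[key.strip()] = value.strip()
    return table


table = parse_lookup(COUNTRY_CODES_TXT)
print(table["SG"])  # Singapore
print(len(table))   # 8
```

A row count that does not match the file's entry count usually points to a malformed line.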

Practical Scenario: Invoice Processing with RAG

The following walkthrough shows how combining multiple data files produces richer, more accurate extraction results.

Setup

  • Template: Invoice extractor with fields vendor_name, line_items, department, total_thb
  • Data files attached:
    • product-catalog.json: maps product codes to full names and categories
    • extraction-rules.md: department tagging rules and currency normalization rules

Upload

You upload a scanned invoice PDF. The invoice contains:

  • Vendor written as "บ. เทคโนโลยี จก."
  • Line items listed as PRD-001 x2, PRD-003 x1
  • A subtotal in USD: $120.00
  • A department code: ENG-12

What RAG does

| Raw value on invoice | Data file consulted | Extracted result |
| --- | --- | --- |
| บ. เทคโนโลยี จก. | extraction-rules.md | บริษัท เทคโนโลยี จำกัด |
| PRD-001 | product-catalog.json | Wireless Keyboard TH Layout |
| PRD-003 | product-catalog.json | Ergonomic Office Chair |
| $120.00 | extraction-rules.md | converted and rounded to THB |
| ENG-12 | extraction-rules.md | tagged as "Engineering" |

Without the data files, the AI would return the raw codes and abbreviations. With RAG, each value is resolved and normalized in a single pass.
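
The currency rules in extraction-rules.md can be mirrored in a few lines to sanity-check the `total_thb` field. This is only a sketch: the rules file does not state an exchange rate, so the rate below is a placeholder. Treating `$` as USD and rounding half up are also assumptions, and `to_thb` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical rate for illustration only; this page does not say
# where the exchange rate comes from.
EXAMPLE_USD_TO_THB = Decimal("35.00")


def to_thb(raw_amount: str, usd_to_thb: Decimal = EXAMPLE_USD_TO_THB) -> Decimal:
    """Convert a raw amount string to THB, rounded to 2 decimal places."""
    text = raw_amount.strip().replace(",", "")
    if text.startswith("$"):
        # Assumption: a leading "$" means US dollars.
        amount = Decimal(text[1:]) * usd_to_thb
    else:
        # No currency symbol: the rules say to assume THB.
        amount = Decimal(text)
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(to_thb("$120.00"))  # 4200.00 at the placeholder rate above
print(to_thb("1500.5"))   # 1500.50
```

Using `Decimal` rather than `float` keeps the two-decimal rounding exact.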

Tips for Data Files

  • Keep files small and focused — aim for under 1 MB per file. Large files increase retrieval latency and may cause the vector store to return less relevant chunks. Split a large catalog into domain-specific files (e.g., catalog-electronics.json, catalog-furniture.json) rather than uploading one monolithic file.
  • Update data files when reference data changes — if your product catalog is updated, re-upload the file so the vector store reflects the latest entries. Stale data files will cause the AI to return outdated values.
  • Choose the right format for each purpose — use .md for rules and narrative instructions, .json for structured lookups with multiple fields, and .txt for simple one-to-one mappings. Mixing concerns in a single file reduces retrieval precision.
  • Avoid sensitive content — data files are sent to the AI provider's vector store. Use them only for non-sensitive reference material such as product catalogs, glossaries, or rule documents.
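
The first tip (small, focused files) can be sketched as a small pre-upload step that splits one catalog into per-category files and flags anything over the size guideline. The helper names are hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption, since the page gives no exact byte limit.

```python
import json
from collections import defaultdict

MAX_BYTES = 1_000_000  # Rough stand-in for the "under 1 MB" guideline.


def split_by_category(entries: list[dict]) -> dict[str, str]:
    """Group catalog entries by category and return filename -> JSON text."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for entry in entries:
        groups[entry["category"]].append(entry)
    return {
        f"catalog-{category.lower()}.json": json.dumps(items, ensure_ascii=False, indent=2)
        for category, items in groups.items()
    }


def oversized(files: dict[str, str], limit: int = MAX_BYTES) -> list[str]:
    """Return the filenames whose UTF-8 encoded size exceeds the limit."""
    return [name for name, text in files.items() if len(text.encode("utf-8")) > limit]


catalog = [
    {"code": "PRD-001", "name": "Wireless Keyboard TH Layout", "category": "Electronics"},
    {"code": "PRD-002", "name": "USB-C Hub 7-Port", "category": "Electronics"},
    {"code": "PRD-003", "name": "Ergonomic Office Chair", "category": "Furniture"},
]
files = split_by_category(catalog)
print(sorted(files))     # ['catalog-electronics.json', 'catalog-furniture.json']
print(oversized(files))  # []
```

Re-running the split whenever the source catalog changes also keeps the uploaded files current, in line with the second tip.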