Data Files & RAG
Data Files let you attach reference documents to a template. The AI uses these documents as additional context when processing your uploads — this is called Retrieval-Augmented Generation (RAG).
Supported File Formats
| Format | Use Case |
|---|---|
| .json | Structured reference data (product catalogs, code tables) |
| .md | Markdown documentation, glossaries, rules |
| .txt | Plain text reference material, lookup tables |
IMPORTANT
Data files are uploaded to the AI provider's servers (e.g., OpenAI) for vector search. Ensure your files don't contain sensitive information that shouldn't leave your systems.
NOTE
Only .json, .md, and .txt files are supported as data files. Other formats (PDF, DOCX, CSV, images) cannot be attached as reference documents. If your reference data is in another format, convert it to one of these three supported formats first.
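As an illustration of the conversion step, the following is a minimal sketch that turns CSV text into a JSON array suitable for saving as a .json data file. It uses only the Python standard library; the function name csv_to_json_text and the sample rows are hypothetical and not part of Ocriva.

```python
import csv
import io
import json


def csv_to_json_text(csv_text: str) -> str:
    """Convert CSV text (first row is the header) into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # ensure_ascii=False keeps non-ASCII names (for example Thai) readable in the file.
    return json.dumps(rows, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    sample = "code,name\nPRD-001,Wireless Keyboard TH Layout\nPRD-002,USB-C Hub 7-Port\n"
    print(csv_to_json_text(sample))
```

Every CSV value comes out as a string, so numeric columns would need an explicit cast if the reference data should carry numbers.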
How It Works
- You upload one or more data files to a template.
- When a document is processed, Ocriva's vector store indexes the data files.
- The AI retrieves the most relevant chunks from your data files alongside the document content.
- The AI can reference this context to improve extraction accuracy.
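The retrieval step above can be pictured with a toy example. The sketch below is not how Ocriva or the AI provider implements vector search: it swaps real embeddings for plain word counts and ranks chunks by cosine similarity, only to show why a chunk that shares terms with the document is the one returned. All names here (tokenize, cosine, retrieve) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts: a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunks: list[str], query: str, top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query text."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: cosine(tokenize(c), q), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    chunks = [
        "PRD-001 Wireless Keyboard TH Layout",
        "PRD-002 USB-C Hub 7-Port",
        "PRD-003 Ergonomic Office Chair",
    ]
    print(retrieve(chunks, "Invoice line: PRD-002 x3", top_k=1))
```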
IMPORTANT
Data files are uploaded to the AI provider's vector store to enable retrieval. Do not upload files containing sensitive personal data, credentials, or confidential information as data files. Use them only for non-sensitive reference material such as product catalogs, glossaries, or extraction rule documents.
Use Cases for Data Files
- Product catalog — Attach a JSON list of product codes and names so the AI can resolve abbreviations found in invoices.
- Customer list — Provide a JSON list of known customer names and IDs (convert CSV exports to JSON first, since CSV is not a supported data file format) so the AI can normalize vendor names.
- Extraction rules — A Markdown file describing business-specific rules, such as "if the department code starts with 'MKT', tag it as Marketing".
- Glossary — Define terms specific to your industry so the AI understands domain jargon.
Data File Example
product-catalog.json:
[
{ "code": "PRD-001", "name": "Wireless Keyboard TH Layout", "category": "Electronics" },
{ "code": "PRD-002", "name": "USB-C Hub 7-Port", "category": "Electronics" },
{ "code": "PRD-003", "name": "Ergonomic Office Chair", "category": "Furniture" }
]
When an invoice references PRD-002, the AI will automatically populate the product name as "USB-C Hub 7-Port" rather than leaving it as a bare code.
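To spot-check extraction results against the same catalog, a small script can perform the lookup deterministically. This is a sketch under assumptions: build_index and check_product_name are hypothetical helpers, and how you obtain the extracted name from Ocriva is outside its scope.

```python
import json

# Same content as product-catalog.json above.
CATALOG_JSON = """
[
  { "code": "PRD-001", "name": "Wireless Keyboard TH Layout", "category": "Electronics" },
  { "code": "PRD-002", "name": "USB-C Hub 7-Port", "category": "Electronics" },
  { "code": "PRD-003", "name": "Ergonomic Office Chair", "category": "Furniture" }
]
"""


def build_index(catalog_text: str) -> dict:
    """Map each product code to its full catalog entry."""
    return {item["code"]: item for item in json.loads(catalog_text)}


def check_product_name(index: dict, code: str, extracted_name: str) -> bool:
    """True when the extracted name matches the catalog entry for the code."""
    entry = index.get(code)
    return entry is not None and entry["name"] == extracted_name


if __name__ == "__main__":
    index = build_index(CATALOG_JSON)
    print(check_product_name(index, "PRD-002", "USB-C Hub 7-Port"))
```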
Extraction Rules Example
Markdown files are well-suited for encoding business logic that would otherwise require custom post-processing. The AI reads the rules during retrieval and applies them inline while extracting field values.
extraction-rules.md:
# Invoice Extraction Rules
## Department Codes
- Codes starting with `MKT` → tag as "Marketing"
- Codes starting with `ENG` → tag as "Engineering"
- Codes starting with `FIN` → tag as "Finance"
## Currency Handling
- Always convert amounts to THB
- If no currency symbol is present, assume THB
- Round to 2 decimal places
## Vendor Name Normalization
- "บ." or "บจ." → expand to "บริษัท ... จำกัด"
- Remove trailing spaces and special characters
When the AI encounters a department code such as MKT-04 on an invoice, it retrieves the relevant rule chunk and tags the extracted field as "Marketing" automatically, with no post-processing step required.
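For comparison, the department rules in this file are equivalent to the deterministic prefix match sketched below. The sketch is only a reference for testing what the AI returns; expected_department is a hypothetical helper, and prefixes other than the three listed in the rules file map to None here by assumption.

```python
from typing import Optional

# Prefix-to-department mapping copied from extraction-rules.md.
DEPARTMENT_PREFIXES = {
    "MKT": "Marketing",
    "ENG": "Engineering",
    "FIN": "Finance",
}


def expected_department(code: str) -> Optional[str]:
    """Return the department tag the rules imply, or None if no rule applies."""
    cleaned = code.strip()
    for prefix, department in DEPARTMENT_PREFIXES.items():
        if cleaned.startswith(prefix):
            return department
    return None


if __name__ == "__main__":
    for code in ("MKT-04", "ENG-12", "HR-01"):
        print(code, "->", expected_department(code))
```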
Lookup Table Example
Plain text files work well for simple key-value lookups. Keep each entry on its own line so the vector store can retrieve individual rows efficiently.
country-codes.txt:
Country Code Lookup
TH = Thailand
US = United States
JP = Japan
SG = Singapore
GB = United Kingdom
DE = Germany
CN = China
AU = Australia
If a document contains a two-letter country code such as SG, the AI retrieves the matching row and expands it to "Singapore" in the extracted output.
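Before uploading a lookup file like this, it can help to confirm that every row parses as a single key and value. The sketch below assumes the `KEY = Value` layout shown above and treats any line without an equals sign (such as the title line) as a heading to skip; parse_lookup is a hypothetical name.

```python
def parse_lookup(text: str) -> dict[str, str]:
    """Parse 'KEY = Value' lines into a dict, skipping lines without '='."""
    table: dict[str, str] = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # title line or blank line
        key, value = line.split("=", 1)
        table[key.strip()] = value.strip()
    return table


if __name__ == "__main__":
    sample = "Country Code Lookup\nTH = Thailand\nSG = Singapore\n"
    print(parse_lookup(sample))
```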
Practical Scenario: Invoice Processing with RAG
The following walkthrough shows how combining multiple data files produces richer, more accurate extraction results.
Setup
- Template: Invoice extractor with fields vendor_name, line_items, department, total_thb
- Data files attached:
  - product-catalog.json: maps product codes to full names and categories
  - extraction-rules.md: department tagging rules and currency normalisation rules
Upload
You upload a scanned invoice PDF. The invoice contains:
- Vendor written as "บ. เทคโนโลยี จก."
- Line items listed as PRD-001 x2, PRD-003 x1
- A subtotal in USD: $120.00
- A department code: ENG-12
What RAG does
| Raw value on invoice | Data file consulted | Extracted result |
|---|---|---|
| บ. เทคโนโลยี จก. | extraction-rules.md | บริษัท เทคโนโลยี จำกัด |
| PRD-001 | product-catalog.json | Wireless Keyboard TH Layout |
| PRD-003 | product-catalog.json | Ergonomic Office Chair |
| $120.00 | extraction-rules.md | converted and rounded to THB |
| ENG-12 | extraction-rules.md | tagged as "Engineering" |
Without the data files, the AI would return the raw codes and abbreviations. With RAG, each value is resolved and normalised in a single pass.
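The currency row can be checked by hand with a few lines of arithmetic. The rules file does not say where an exchange rate comes from or which rounding mode applies, so the sketch below takes the rate as an explicit argument and assumes half-up rounding; the rate in the example is a placeholder, not a real quote, and to_thb is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_thb(amount: Decimal, rate: Decimal) -> Decimal:
    """Convert an amount to THB at the given rate, rounded to 2 decimal places."""
    # Half-up rounding is an assumption; the rules only say "2 decimal places".
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    placeholder_rate = Decimal("35.125")  # illustrative only, not a real USD/THB rate
    print(to_thb(Decimal("120.00"), placeholder_rate))
```

Using Decimal rather than float avoids binary rounding surprises when comparing the result with the extracted total_thb value.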
Tips for Data Files
- Keep files small and focused: aim for under 1 MB per file. Large files increase retrieval latency and may cause the vector store to return less relevant chunks. Split a large catalog into domain-specific files (e.g., catalog-electronics.json, catalog-furniture.json) rather than uploading one monolithic file.
- Update data files when reference data changes: if your product catalog is updated, re-upload the file so the vector store reflects the latest entries. Stale data files will cause the AI to return outdated values.
- Choose the right format for each purpose: use .md for rules and narrative instructions, .json for structured lookups with multiple fields, and .txt for simple one-to-one mappings. Mixing concerns in a single file reduces retrieval precision.
- Avoid sensitive content: data files are sent to the AI provider's vector store. Use them only for non-sensitive reference material such as product catalogs, glossaries, or rule documents.
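The first tip can be automated before upload. The following is a minimal sketch that splits a catalog into one file per category and flags any result at or above a size threshold. It assumes each entry has a category field as in product-catalog.json, and it treats "1 MB" as 1,000,000 bytes, which is an assumption rather than a documented limit. The helper names are hypothetical.

```python
import json
from collections import defaultdict

MAX_BYTES = 1_000_000  # assumed reading of the "under 1 MB" guideline


def split_by_category(items: list[dict]) -> dict[str, str]:
    """Group catalog entries by category and return {filename: JSON text}."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        groups[item["category"]].append(item)
    return {
        f"catalog-{category.lower()}.json": json.dumps(members, ensure_ascii=False, indent=2)
        for category, members in groups.items()
    }


def oversized(files: dict[str, str]) -> list[str]:
    """Return the names of files whose UTF-8 size reaches MAX_BYTES."""
    return [name for name, text in files.items() if len(text.encode("utf-8")) >= MAX_BYTES]


if __name__ == "__main__":
    catalog = [
        {"code": "PRD-001", "name": "Wireless Keyboard TH Layout", "category": "Electronics"},
        {"code": "PRD-002", "name": "USB-C Hub 7-Port", "category": "Electronics"},
        {"code": "PRD-003", "name": "Ergonomic Office Chair", "category": "Furniture"},
    ]
    files = split_by_category(catalog)
    print(sorted(files), oversized(files))
```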
