Routing Parameters
| Parameter | Values | Description |
|---|---|---|
| Document type | PDF, DOCX, PLAINTEXT, IMAGE, XLSX, CSV, PPTX | Detected from file extension at upload time |
| Ingestion mode | standard (default), high_accuracy, single_chunk | Controls conversion backend and chunking granularity |
Routing applies to ingestion requests sent to POST /v1/documents/ingest or POST /v1/documents/{id}/ingest.
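As a rough illustration of detecting the document type from a file extension, the sketch below maps extensions to the type names in the table. Only the type names come from this page; the extension list and the function name `detect_document_type` are assumptions made for the example, and the real service may recognize a different set.

```python
from pathlib import PurePosixPath

# Hypothetical extension table. The document type names come from the
# parameter table; which extensions map to each type is an assumption.
EXTENSION_TO_DOCUMENT_TYPE = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".txt": "PLAINTEXT",
    ".png": "IMAGE",
    ".jpg": "IMAGE",
    ".jpeg": "IMAGE",
    ".xlsx": "XLSX",
    ".xlsm": "XLSX",
    ".csv": "CSV",
    ".pptx": "PPTX",
}


def detect_document_type(filename: str) -> str:
    """Map a filename to a document type using its case-insensitive extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    try:
        return EXTENSION_TO_DOCUMENT_TYPE[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix or '(none)'}") from None
```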
Smart Defaults
When you omit the ingestion mode, the system picks the best default for your file type:

| Document Type | Default Mode |
|---|---|
| | high_accuracy |
| IMAGE, CSV | single_chunk |
| All others | standard |
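The default selection can be sketched as a small lookup. The function name `resolve_ingestion_mode` is hypothetical. The high_accuracy row of the table does not show which document types it covers, so the sketch leaves that set empty rather than guessing.

```python
# Types whose default is single_chunk, per the table above.
SINGLE_CHUNK_DEFAULT_TYPES = frozenset({"IMAGE", "CSV"})

# The table's high_accuracy row does not name its document types, so this
# set is deliberately empty here; fill it in to match your deployment.
HIGH_ACCURACY_DEFAULT_TYPES = frozenset()


def resolve_ingestion_mode(document_type, requested_mode=None):
    """Return the caller's mode if given, otherwise the default for the type."""
    if requested_mode is not None:
        return requested_mode
    if document_type in HIGH_ACCURACY_DEFAULT_TYPES:
        return "high_accuracy"
    if document_type in SINGLE_CHUNK_DEFAULT_TYPES:
        return "single_chunk"
    return "standard"
```

An explicit mode always wins over the default, so `resolve_ingestion_mode("CSV", "standard")` returns the requested value unchanged.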
Routing Decision Tree
Processing Paths
1. High-Accuracy Path
Best for: PDFs and images where maximum extraction quality is needed.
What happens:
- Preparation — Download source, remove watermarks (PDF)
- High-accuracy conversion — Advanced extraction using a secondary engine that produces detailed content lists, images, and markdown
- Smart chunking — Walks through extracted content, builds section hierarchy, creates text/image/table chunks with token-based merging
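To make token-based merging concrete, here is a minimal sketch that merges adjacent text chunks while the combined size stays within a token budget. It is not the service's actual algorithm: the `Chunk` type, the `max_tokens` default, and the whitespace tokenizer are all assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str  # "text", "image", or "table"
    text: str


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words. A real
    # pipeline would likely use the tokenizer of its embedding model.
    return len(text.split())


def merge_small_text_chunks(chunks, max_tokens=200):
    """Merge runs of adjacent text chunks while the result fits max_tokens."""
    merged = []
    for chunk in chunks:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous.kind == "text"
            and chunk.kind == "text"
            and count_tokens(previous.text) + count_tokens(chunk.text) <= max_tokens
        ):
            # Fold the new text into the previous chunk instead of appending.
            merged[-1] = Chunk("text", previous.text + "\n\n" + chunk.text)
        else:
            merged.append(chunk)
    return merged
```

Image and table chunks are never merged in this sketch, so they act as boundaries that keep text from different parts of a section apart.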
2. Single-Chunk Path — Simple Image
Best for: Standalone images (photos, screenshots, diagrams). The uploaded image becomes a single IMAGE chunk directly. No preparation, conversion, or chunking is needed.
3. Single-Chunk Path — Text-Based (CSV / Plain Text)
Best for: CSV files and plain text documents you want treated as a single unit.
- Preparation — Download source file
- Single chunk creation — CSV files become a single TABLE chunk (with a head+tail preview of the data); plain text files become a single TEXT chunk with full content
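A head+tail preview for a CSV can be sketched as follows. The row counts, the wording of the omission marker, and the function name `csv_head_tail_preview` are assumptions; the page does not say how many rows the real preview keeps.

```python
import csv
import io


def csv_head_tail_preview(csv_text: str, head: int = 5, tail: int = 5) -> str:
    """Keep the header, the first `head` and the last `tail` data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    if len(body) > head + tail:
        omitted = len(body) - head - tail
        marker = [f"... {omitted} rows omitted ..."]
        # Slice from an explicit index so tail=0 yields no trailing rows.
        body = body[:head] + [marker] + body[len(body) - tail:]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + body)
    return out.getvalue()
```

Files short enough to fit within `head + tail` data rows are returned whole, so the marker row only appears when rows were actually dropped.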
4. Single-Chunk Path — PDF
Best for: Short PDFs you want treated as a single retrievable unit (e.g., a one-page form or receipt).
- Preparation — Extract page screenshots and text
- Single chunk creation — Creates one chunk from page screenshots and extracted text
5. Excel Path
Best for: .xlsx and .xlsm spreadsheet files.
- Preparation — Download source file
- Excel conversion — Parse workbook structure (sheets, cells, formulas, dependencies) into structured JSON
- Excel chunking — Create one section per sheet, with table and text chunks preserving spreadsheet structure
6. Standard Path (Default)
Best for: PDFs, DOCX, PPTX, and plain text under standard processing.
- Preparation — Download source, remove watermarks (PDF)
- Conversion — Vision-language model pipeline: OCR, table detection, image extraction, structured JSON output
- Chunking — Hybrid chunking with interleaved text, table, and image chunks in document order
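The six paths above can be summarized as a selection function over document type and ingestion mode. This is a sketch inferred from the "Best for" lines, not the service's routing code: the path identifiers are made up for the example, and routing XLSX through standard mode is an assumption based on the "All others" default.

```python
def select_processing_path(document_type: str, ingestion_mode: str) -> str:
    """Pick a processing path name for a (document type, mode) pair."""
    if ingestion_mode == "high_accuracy":
        if document_type in {"PDF", "IMAGE"}:
            return "high_accuracy"
    elif ingestion_mode == "single_chunk":
        if document_type == "IMAGE":
            return "single_chunk_image"
        if document_type in {"CSV", "PLAINTEXT"}:
            return "single_chunk_text"
        if document_type == "PDF":
            return "single_chunk_pdf"
    elif ingestion_mode == "standard":
        # Assumption: spreadsheets reach the Excel path under standard mode.
        if document_type == "XLSX":
            return "excel"
        if document_type in {"PDF", "DOCX", "PPTX", "PLAINTEXT"}:
            return "standard"
    # Combinations not described on this page are rejected in this sketch.
    raise ValueError(
        f"no processing path for {document_type!r} in mode {ingestion_mode!r}"
    )
```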
Common Tail Steps
After the format-specific processing, all paths converge on the same finishing steps:

| Step | Purpose |
|---|---|
| Vector cleanup | Remove stale vectors from prior failed runs |
| Enrichment | Generate LLM descriptions for images and summaries for tables |
| Embedding | Generate vector embeddings and store in the vector database |
| Statistics | Calculate token counts, section and chunk statistics |
| Version activation | For version upgrades: switch the active version |
| Completion | Mark the pipeline as completed |
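One way to picture the shared tail is as an ordered step list in which only version activation is conditional. The sketch assumes the table's row order is the execution order, which this page does not state explicitly; the step identifiers are illustrative.

```python
def tail_steps(is_version_upgrade: bool) -> list:
    """Return the finishing steps in the order the table lists them."""
    steps = ["vector_cleanup", "enrichment", "embedding", "statistics"]
    if is_version_upgrade:
        # Only version upgrades switch the active version.
        steps.append("version_activation")
    steps.append("completion")
    return steps
```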
Validation Rules
These rules are checked before the workflow starts:

| Rule | Error |
|---|---|
| Explicit chunk type requires single_chunk mode | 400 |
| Secondary taxonomy requires IMAGE chunk type | 400 |
| Chunk type cannot be overridden for CSV/plain text | 400 |
| Each ingestion mode only supports specific document types | 400 |
| Custom page DPI only applies to PDFs | 400 |
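The rules in the table could be checked with a pre-flight function along these lines. The parameter names (`chunk_type`, `secondary_taxonomy`, `page_dpi`), the `ValidationError` class, and the per-mode support matrix are assumptions; the matrix is inferred from the "Best for" lines of each path and may not match the service exactly. The taxonomy check looks only at an explicitly supplied chunk type.

```python
# Assumed support matrix, inferred from the "Best for" lines of each path.
SUPPORTED_TYPES_BY_MODE = {
    "standard": {"PDF", "DOCX", "PPTX", "PLAINTEXT", "XLSX"},
    "high_accuracy": {"PDF", "IMAGE"},
    "single_chunk": {"IMAGE", "CSV", "PLAINTEXT", "PDF"},
}


class ValidationError(Exception):
    """Hypothetical error type; every rule in the table maps to HTTP 400."""
    status_code = 400


def validate_ingest_request(document_type, ingestion_mode, chunk_type=None,
                            secondary_taxonomy=None, page_dpi=None):
    """Raise ValidationError on the first rule the request violates."""
    if chunk_type is not None and ingestion_mode != "single_chunk":
        raise ValidationError("explicit chunk type requires single_chunk mode")
    if secondary_taxonomy is not None and chunk_type != "IMAGE":
        raise ValidationError("secondary taxonomy requires IMAGE chunk type")
    if chunk_type is not None and document_type in {"CSV", "PLAINTEXT"}:
        raise ValidationError("chunk type cannot be overridden for CSV/plain text")
    if document_type not in SUPPORTED_TYPES_BY_MODE.get(ingestion_mode, set()):
        raise ValidationError(
            f"mode {ingestion_mode!r} does not support {document_type!r}"
        )
    if page_dpi is not None and document_type != "PDF":
        raise ValidationError("custom page DPI only applies to PDFs")
```

Running these checks before the workflow starts means a rejected request never reaches preparation or conversion.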
