Documentation Index

Fetch the complete documentation index at: https://docs.knowledgestack.ai/llms.txt

Use this file to discover all available pages before exploring further.

Workflow Overview

When you upload a document, the pipeline executes these steps in order:
Ingestion Workflow
  |
  +-- Step 1: Document Preparation
  |
  +-- Step 2: Conversion (standard, high-accuracy, or Excel)
  |
  +-- Step 3: Chunking
  |
  +-- Step 3.5: Enrichment (parallel per chunk)
  |
  +-- Step 4: Embedding (parallel per batch)
Key characteristics:
  • Sequential pipeline: Steps 1-3 run one after another; steps 3.5 and 4 run as sub-workflows with internal parallelism
  • Fan-out pattern: Both enrichment and embedding discover work items first, then process them in parallel
  • Safe retries: All operations can be safely retried without producing duplicate data
  • Smart error handling: Transient errors (rate limits, temporary outages) are retried automatically; permanent errors fail immediately

Workflow Trigger

The workflow starts when you call POST /v1/documents/ingest:
  1. You upload a source document with a parent folder
  2. The API generates unique IDs for the document, version, and workflow
  3. The source file is uploaded to storage
  4. Document and version records are created in the database
  5. The ingestion workflow starts (with a 30-minute execution timeout)
  6. You receive the workflow_id, document_id, and document_version_id
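The trigger sequence above can be sketched as a client call. The endpoint path and the three returned identifiers come from this page; everything else (JSON body with a base64 payload, Bearer auth, field names, host) is an assumption, so treat this as a shape sketch rather than the real contract.

```python
import base64
import json
import urllib.request


def ingest_document(base_url: str, token: str, file_path: str,
                    parent_folder_id: str) -> dict:
    """Trigger the ingestion workflow via POST /v1/documents/ingest.

    The request schema here (base64 content, these field names) is
    hypothetical; consult the API reference for the actual format.
    """
    with open(file_path, "rb") as f:
        content = base64.b64encode(f.read()).decode()
    body = json.dumps({
        "filename": file_path.rsplit("/", 1)[-1],   # assumed field name
        "parent_folder_id": parent_folder_id,       # assumed field name
        "content_base64": content,                  # assumed encoding
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/documents/ingest",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",     # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def parse_ingest_response(payload: dict) -> tuple[str, str, str]:
    """Pull out the three identifiers the docs say the call returns."""
    return (payload["workflow_id"], payload["document_id"],
            payload["document_version_id"])
```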

Pipeline Steps

Step 1: Document Preparation

Downloads the source file and prepares it for conversion.
  • For PDF files: runs watermark removal and uploads the cleaned version
  • For other file types: passes the original source through unchanged
  • Updates pipeline status metadata

Step 2: Conversion

Converts the document into a structured format with extracted visual assets.
  • Submits the document to the conversion service
  • Extracts and uploads visual assets (page screenshots, images, tables) as WEBP files
  • Produces a structured JSON representation of the document
  • See Pipeline Routing for how different file types are handled
Conversion options include:
  • OCR with multi-language support (Chinese and English)
  • Accurate table structure detection
  • Image extraction at 2x scale
  • VLM (vision-language model) pipeline for PDFs

Step 3: Chunking

Breaks the converted document into searchable chunks organized by section.
  • Deletes any existing content (for safe retry)
  • Parses the structured JSON from the conversion step
  • Creates a section hierarchy based on document headings
  • Produces text, table, and image chunks in document order
  • See Chunk Handling for details on the chunking algorithm
Section hierarchy: Headings in your document are used to build a nested section tree. For example, a document with headings “Chapter 1” > “Background” creates sections that preserve the document’s logical structure.
Chunk types:
  • Text chunks: Regular text content with detected type classification
  • Table chunks: HTML representation for DOCX tables; placeholder for PDF tables (enriched in the next step)
  • Image chunks: Placeholder content (enriched in the next step)
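The heading-to-section-tree idea above can be sketched with a heading stack. This is a minimal illustration, not the service's chunker; it assumes headings arrive in document order as (level, title) pairs, with larger level meaning deeper nesting.

```python
def build_section_paths(headings: list[tuple[int, str]]) -> list[list[str]]:
    """For each heading, return the full section path it opens.

    A stack holds the currently open sections; a new heading pops
    everything at the same or deeper level before being pushed.
    """
    stack: list[tuple[int, str]] = []
    paths: list[list[str]] = []
    for level, title in headings:
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append([t for _, t in stack])
    return paths
```

Running this over the "Chapter 1" > "Background" example yields the nested paths the chunker would attach to chunks in those sections.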

Step 3.5: Enrichment

Uses LLM vision to generate searchable content for image and table chunks. This step runs as a sub-workflow that:
  1. Identifies which chunks need enrichment (images and tables with placeholder content)
  2. Enriches each chunk in parallel
For image chunks:
  • The image is analyzed by a vision LLM
  • A natural-language description (2-4 sentences) replaces the placeholder content
  • The description is optimized for search retrieval
For table chunks (from PDFs):
  • The table image is analyzed by a vision LLM to extract semantic HTML
  • A text summary is generated from the HTML
  • The placeholder is replaced with structured HTML content
For table chunks (from DOCX):
  • A text summary is generated from the existing HTML content
  • The summary is stored as metadata for improved search
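The filter-then-fan-out pattern of this sub-workflow can be sketched with asyncio. The chunk shape, the placeholder marker, and the fake enrichment call are all hypothetical stand-ins; the real step awaits a vision LLM per chunk.

```python
import asyncio

PLACEHOLDER = "<pending-enrichment>"  # marker value is hypothetical


async def enrich_chunk(chunk: dict) -> dict:
    """Stand-in for the vision-LLM call; here we just synthesize content."""
    await asyncio.sleep(0)  # the real implementation awaits an LLM API
    return {**chunk, "content": f"description of {chunk['id']}"}


async def enrich_all(chunks: list[dict]) -> list[dict]:
    """Step 1: find image/table chunks still holding placeholder content.
    Step 2: enrich them in parallel, leaving other chunks untouched."""
    pending = [c for c in chunks
               if c["type"] in ("image", "table")
               and c["content"] == PLACEHOLDER]
    enriched = await asyncio.gather(*(enrich_chunk(c) for c in pending))
    by_id = {c["id"]: c for c in enriched}
    return [by_id.get(c["id"], c) for c in chunks]
```

Because already-enriched chunks no longer match the placeholder filter, re-running the sub-workflow skips them, which is the retry-safety property described under Reliability.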

Step 4: Embedding

Generates vector embeddings and stores them in the vector database. This step runs as a sub-workflow that:
  1. Collects all chunk IDs, deactivates old version vectors, and splits chunks into batches
  2. Embeds and upserts each batch in parallel
Each chunk’s content is augmented before embedding with:
  • Section heading context (e.g., “Section: Architecture > Subsystem”)
  • Overlap context from adjacent chunks in the same section
  • For tables: HTML content plus the LLM-generated summary
  • For images: the LLM-generated description
See Chunk Handling for full details on how embedding text is constructed.
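The context assembly above can be sketched as string concatenation. The separator, the field names, and the overlap handling here are assumptions; only the "Section: A > B" prefix format and the table HTML-plus-summary rule come from this page.

```python
def build_embedding_text(chunk: dict, section_path: list[str],
                         prev_tail: str = "", next_head: str = "") -> str:
    """Assemble the text that gets embedded for one chunk.

    Joins (in order): section heading context, overlap from the previous
    chunk, the chunk body (HTML plus summary for tables), and overlap
    from the next chunk. Separators are illustrative.
    """
    parts: list[str] = []
    if section_path:
        parts.append("Section: " + " > ".join(section_path))
    if prev_tail:
        parts.append(prev_tail)
    if chunk["type"] == "table":
        parts.append(chunk["content"])        # HTML representation
        if chunk.get("summary"):
            parts.append(chunk["summary"])    # LLM-generated summary
    else:
        parts.append(chunk["content"])        # text, or image description
    if next_head:
        parts.append(next_head)
    return "\n".join(parts)
```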

Re-Embed Folder

The re-embed workflow regenerates embeddings for all documents in a folder and its subfolders. Trigger it with POST /v1/folders/{folder_id}?action=reembed; the workflow then:
  1. Lists all documents in the folder tree (up to 30 levels deep)
  2. Starts an embedding sub-workflow for each document in parallel
  3. Returns the total number of vectors upserted
When to use: Re-embedding is needed when your embedding model or its configuration changes and existing vectors need to be regenerated.
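The depth-limited folder walk in step 1 can be sketched recursively. The in-memory folder shape used here is purely illustrative; the service performs this traversal against its own store, but the 30-level cap comes from this page.

```python
def list_documents(folder: dict, depth: int = 0, max_depth: int = 30) -> list[str]:
    """Collect document IDs from a folder tree, at most max_depth levels deep.

    `folder` is a hypothetical dict with "documents" (list of IDs) and
    "subfolders" (list of nested folders of the same shape).
    """
    if depth >= max_depth:
        return []
    docs = list(folder.get("documents", []))
    for sub in folder.get("subfolders", []):
        docs.extend(list_documents(sub, depth + 1, max_depth))
    return docs
```

Each collected document then gets its own embedding sub-workflow, started in parallel.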

Configuration

Pipeline behavior is controlled by these settings:
document:
  max_pages: 150
  supported_formats: [pdf, docx, pptx, md, txt]

chunking:
  max_tokens: 512
  overlap_tokens: 64

embedding:
  chunks_per_batch: 200
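The chunks_per_batch setting drives the batch split in the embedding sub-workflow. A minimal sketch of that split, with the default matching the config above:

```python
def split_batches(chunk_ids: list[str],
                  chunks_per_batch: int = 200) -> list[list[str]]:
    """Slice chunk IDs into fixed-size embedding batches; the final
    batch holds whatever remains."""
    return [chunk_ids[i:i + chunks_per_batch]
            for i in range(0, len(chunk_ids), chunks_per_batch)]
```

So a 450-chunk document embeds as three parallel batches of 200, 200, and 50.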

Processing Timeouts

Step                       Timeout      Heartbeat
Document Preparation       60s          -
Conversion                 2 hours      -
Chunking                   10 minutes   -
Enrichment (filter)        60s          -
Enrichment (per chunk)     2 minutes    60s
Embedding (split)          60s          -
Embedding (per batch)      2 minutes    60s
Folder document listing    2 minutes    30s
Overall workflow execution timeout: 30 minutes.

Error Handling

Errors are classified and handled automatically:
  • Retryable: HTTP 429 (rate limit), 502, 503, and network errors — retried with exponential backoff
  • Non-retryable: HTTP 400, 401, 403, 404, 500 — fail immediately
Retry policy: up to 3 retries, starting at 5 seconds and backing off up to 60 seconds.

Reliability

All pipeline operations are idempotent (safe to retry):
  • Chunking: Clears existing content before re-creating
  • Storage uploads: Overwrites existing objects
  • Content deduplication: Identical content is stored once per tenant
  • Metadata updates: Safely overwrite existing values
  • Enrichment: Already-enriched chunks are skipped on retry
  • Embedding: Old vectors are deactivated before upserting new ones
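The deactivate-then-upsert pattern from the last bullet can be sketched against an in-memory store. The store shape is hypothetical (a real vector database API will differ); the point is that re-running the operation overwrites rather than duplicates.

```python
def upsert_vectors(store: dict, version_id: str, vectors: dict) -> dict:
    """Idempotent embedding write: deactivate vectors from other document
    versions, then upsert the new batch keyed by vector ID.

    `store` maps vector ID -> {"version": str, "active": bool, "values": list};
    this shape is illustrative only.
    """
    # 1. Deactivate everything belonging to older versions.
    for vec in store.values():
        if vec["version"] != version_id:
            vec["active"] = False
    # 2. Upsert the new batch; a retry rewrites the same keys, so no
    #    duplicate vectors can accumulate.
    for vid, values in vectors.items():
        store[vid] = {"version": version_id, "active": True, "values": values}
    return store
```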