Documentation Index
Fetch the complete documentation index at: https://docs.knowledgestack.ai/llms.txt
Use this file to discover all available pages before exploring further.
Workflow Overview
When you upload a document, the pipeline executes these steps in order:
- Sequential pipeline: Steps 1-3 run one after another; steps 3.5 and 4 run as sub-workflows with internal parallelism
- Fan-out pattern: Both enrichment and embedding discover work items first, then process them in parallel
- Safe retries: All operations can be safely retried without producing duplicate data
- Smart error handling: Transient errors (rate limits, temporary outages) are retried automatically; permanent errors fail immediately
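The orchestration shape described above can be sketched with asyncio; the step and function names here are illustrative, not the pipeline's real API:

```python
import asyncio

async def run_step(name: str) -> str:
    # Stand-in for a real pipeline step (prepare, convert, chunk, ...).
    return f"{name}:done"

async def fan_out(name: str, items: list[str]) -> list[str]:
    # Fan-out pattern: discover work items first, process them in parallel.
    return list(await asyncio.gather(*(run_step(f"{name}/{i}") for i in items)))

async def ingest(document_id: str) -> list[str]:
    results: list[str] = []
    for step in ("prepare", "convert", "chunk"):      # steps 1-3: sequential
        results.append(await run_step(step))
    work = ["img-1", "tbl-1"]                         # e.g. chunks needing enrichment
    results += await fan_out("enrich", work)          # step 3.5 sub-workflow
    results += await fan_out("embed", ["batch-0"])    # step 4 sub-workflow
    return results
```

Because each fan-out gathers its items concurrently, a slow enrichment of one chunk does not serialize the others.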
Workflow Trigger
The workflow starts when you call POST /v1/documents/ingest:
- You upload a source document with a parent folder
- The API generates unique IDs for the document, version, and workflow
- The source file is uploaded to storage
- Document and version records are created in the database
- The ingestion workflow starts (with a 30-minute execution timeout)
- You receive the workflow_id, document_id, and document_version_id
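As a hedged sketch, a client-side helper that validates the response fields listed above; the exact JSON shape is an assumption, since the docs only name the three identifiers:

```python
REQUIRED_KEYS = {"workflow_id", "document_id", "document_version_id"}

def parse_ingest_response(body: dict) -> dict:
    # Pull out the three identifiers the ingest endpoint returns; raise
    # early if the response is missing any of them.
    missing = REQUIRED_KEYS - body.keys()
    if missing:
        raise ValueError(f"ingest response missing: {sorted(missing)}")
    return {k: body[k] for k in REQUIRED_KEYS}
```

You would call this on the parsed JSON body of a successful POST /v1/documents/ingest response and keep the workflow_id around for status polling.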
Pipeline Steps
Step 1: Document Preparation
Downloads the source file and prepares it for conversion.
- For PDF files: runs watermark removal and uploads the cleaned version
- For other file types: passes the original source through unchanged
- Updates pipeline status metadata
Step 2: Conversion
Converts the document into a structured format with extracted visual assets.
- Submits the document to the conversion service
- Extracts and uploads visual assets (page screenshots, images, tables) as WEBP files
- Produces a structured JSON representation of the document
- See Pipeline Routing for how different file types are handled
Key conversion capabilities:
- OCR with multi-language support (Chinese and English)
- Accurate table structure detection
- Image extraction at 2x scale
- VLM (vision-language model) pipeline for PDFs
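One plausible (assumed) shape for the structured JSON output, with a helper that collects the visual asset references the pipeline would upload as WEBP files; the real schema is not documented here:

```python
def visual_assets(converted: dict) -> list[str]:
    # Collect asset references (page screenshots, images, tables) for upload.
    return [b["asset"] for b in converted.get("blocks", []) if "asset" in b]

# Hypothetical converted-document payload, for illustration only.
sample = {
    "blocks": [
        {"type": "heading", "level": 1, "text": "Architecture"},
        {"type": "text", "text": "Overview paragraph."},
        {"type": "image", "asset": "page-1.webp"},
        {"type": "table", "asset": "table-1.webp"},
    ],
}
```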
Step 3: Chunking
Breaks the converted document into searchable chunks organized by section.
- Deletes any existing content (for safe retry)
- Parses the structured JSON from the conversion step
- Creates a section hierarchy based on document headings
- Produces text, table, and image chunks in document order
- See Chunk Handling for details on the chunking algorithm
Chunk types:
- Text chunks: Regular text content with detected type classification
- Table chunks: HTML representation for DOCX tables; placeholder for PDF tables (enriched in the next step)
- Image chunks: Placeholder content (enriched in the next step)
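The heading-based sectioning can be sketched as follows, assuming the converted document arrives as a flat list of blocks with a level on each heading; the block shape and field names are illustrative, not the real schema:

```python
def chunk_by_sections(blocks: list[dict]) -> list[dict]:
    stack: list[str] = []           # current heading path, outermost first
    chunks: list[dict] = []
    for block in blocks:
        if block["type"] == "heading":
            del stack[block["level"] - 1:]   # pop deeper or equal headings
            stack.append(block["text"])
        else:
            chunks.append({
                "section": " > ".join(stack),
                "type": block["type"],
                "content": block.get("text", ""),
            })
    return chunks
```

Each chunk carries its full heading path, which is what later lets the embedding step prefix text with section context.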
Step 3.5: Enrichment
Uses LLM vision to generate searchable content for image and table chunks. This step runs as a sub-workflow that:
- Identifies which chunks need enrichment (images and tables with placeholder content)
- Enriches each chunk in parallel
For image chunks:
- The image is analyzed by a vision LLM
- A natural-language description (2-4 sentences) replaces the placeholder content
- The description is optimized for search retrieval
For PDF table chunks:
- The table image is analyzed by a vision LLM to extract semantic HTML
- The placeholder is replaced with structured HTML content
- A text summary is generated from the HTML
For DOCX table chunks:
- A text summary is generated from the existing HTML content
- The summary is stored as metadata for improved search
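The "identify, then skip already-enriched chunks on retry" filter can be sketched like this; the placeholder marker and chunk shape are assumptions:

```python
PLACEHOLDER = "[pending enrichment]"   # illustrative marker, not the real value

def needs_enrichment(chunk: dict) -> bool:
    # Only images and tables still carrying placeholder content qualify.
    return chunk["type"] in {"image", "table"} and chunk["content"] == PLACEHOLDER

def enrich_all(chunks: list[dict], describe) -> list[dict]:
    # Already-enriched chunks fail the filter, so retries do no extra work.
    for chunk in chunks:
        if needs_enrichment(chunk):
            chunk["content"] = describe(chunk)   # e.g. a vision-LLM call
    return chunks
```

Running `enrich_all` twice is safe: after the first pass no chunk still holds the placeholder, so the second pass makes no LLM calls.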
Step 4: Embedding
Generates vector embeddings and stores them in the vector database. This step runs as a sub-workflow that:
- Collects all chunk IDs, deactivates old version vectors, and splits chunks into batches
- Embeds and upserts each batch in parallel
The text embedded for each chunk includes:
- Section heading context (e.g., “Section: Architecture > Subsystem”)
- Overlap context from adjacent chunks in the same section
- For tables: HTML content plus the LLM-generated summary
- For images: the LLM-generated description
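A sketch of how the embedded text might be assembled from the pieces listed above; the field names and exact ordering are assumptions:

```python
def embedding_text(chunk: dict, section_path: str,
                   prev_text: str = "", next_text: str = "") -> str:
    # Assemble the text to embed: section context, overlap from adjacent
    # chunks in the same section, then the chunk body. For tables the body
    # is HTML plus the LLM summary; for images it is the LLM description.
    parts = [f"Section: {section_path}"]
    if prev_text:
        parts.append(prev_text)
    parts.append(chunk["content"])
    if chunk["type"] == "table" and chunk.get("summary"):
        parts.append(chunk["summary"])
    if next_text:
        parts.append(next_text)
    return "\n".join(parts)
```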
Re-Embed Folder
The re-embed workflow regenerates embeddings for all documents in a folder and its subfolders. Triggered by POST /v1/folders/{folder_id}?action=reembed:
- Lists all documents in the folder tree (up to 30 levels deep)
- Starts an embedding sub-workflow for each document in parallel
- Returns the total number of vectors upserted
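The depth-limited folder walk can be sketched as follows, assuming a simple nested-dict folder shape:

```python
MAX_DEPTH = 30   # folder trees are walked at most 30 levels deep

def list_documents(folder: dict, depth: int = 0) -> list[str]:
    # folder = {"documents": [...], "subfolders": [...]} is an assumed shape.
    if depth >= MAX_DEPTH:
        return []
    docs = list(folder.get("documents", []))
    for sub in folder.get("subfolders", []):
        docs += list_documents(sub, depth + 1)
    return docs
```

In the real workflow each returned document then gets its own embedding sub-workflow started in parallel.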
Configuration
Pipeline behavior is controlled by these settings:
Processing Timeouts
| Step | Timeout | Heartbeat |
|---|---|---|
| Document Preparation | 60s | — |
| Conversion | 2 hours | — |
| Chunking | 10 minutes | — |
| Enrichment (filter) | 60s | — |
| Enrichment (per chunk) | 2 minutes | 60s |
| Embedding (split) | 60s | — |
| Embedding (per batch) | 2 minutes | 60s |
| Folder document listing | 2 minutes | 30s |
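The same timeouts as a config mapping (in seconds); the step keys are illustrative names, not the pipeline's real identifiers:

```python
TIMEOUTS = {
    "prepare":          {"timeout": 60},
    "convert":          {"timeout": 2 * 60 * 60},
    "chunk":            {"timeout": 10 * 60},
    "enrich_filter":    {"timeout": 60},
    "enrich_per_chunk": {"timeout": 2 * 60, "heartbeat": 60},
    "embed_split":      {"timeout": 60},
    "embed_per_batch":  {"timeout": 2 * 60, "heartbeat": 60},
    "folder_listing":   {"timeout": 2 * 60, "heartbeat": 30},
}

# Sanity check: every heartbeat must fire well before its step timeout.
assert all(cfg.get("heartbeat", 0) < cfg["timeout"] for cfg in TIMEOUTS.values())
```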
Error Handling
Errors are classified and handled automatically:
- Retryable: HTTP 429 (rate limit), 502, 503, and network errors — retried with exponential backoff
- Non-retryable: HTTP 400, 401, 403, 404, 500 — fail immediately
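A minimal sketch of this classification plus exponential backoff with jitter; the status sets come from the list above, while the backoff parameters are illustrative:

```python
import random

RETRYABLE_STATUSES = {429, 502, 503}
NON_RETRYABLE_STATUSES = {400, 401, 403, 404, 500}

def is_retryable(status) -> bool:
    # A None status stands in for a network error, which is retryable.
    return status is None or status in RETRYABLE_STATUSES

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Exponential backoff with full jitter: delay in [0, min(cap, base * 2^n)].
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A caller would sleep for `backoff_delay(attempt)` between retryable failures and give up immediately on non-retryable ones.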
Reliability
All pipeline operations are idempotent (safe to retry):
- Chunking: Clears existing content before re-creating
- Storage uploads: Overwrites existing objects
- Content deduplication: Identical content is stored once per tenant
- Metadata updates: Safely overwrite existing values
- Enrichment: Already-enriched chunks are skipped on retry
- Embedding: Old vectors are deactivated before upserting new ones
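A toy in-memory sketch of the deactivate-then-upsert pattern that makes re-embedding idempotent; real vector stores would express this with metadata filters rather than a dict scan:

```python
def upsert_version(store: dict, version_id: str, vectors: dict) -> None:
    # First deactivate every vector belonging to this document version...
    for record in store.values():
        if record["version"] == version_id:
            record["active"] = False
    # ...then upsert the new vectors; repeat runs converge to the same state.
    for vec_id, vec in vectors.items():
        store[vec_id] = {"version": version_id, "vector": vec, "active": True}
```

Because the deactivation happens before the upsert, a retried run never leaves stale active vectors behind and never duplicates live ones.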
