Documentation Index

Fetch the complete documentation index at: https://docs.knowledgestack.ai/llms.txt

Use this file to discover all available pages before exploring further.

Workflow Overview

When you upload a document, the pipeline executes these steps in order:
Ingestion Workflow
  |
  +-- Step 1: Document Preparation
  |
  +-- Step 2: Conversion (standard, high-accuracy, or Excel)
  |
  +-- Step 3: Chunking
  |
  +-- Step 3.5: Enrichment (parallel per chunk)
  |
  +-- Step 4: Embedding (parallel per batch)
Key characteristics:
  • Sequential pipeline: Steps 1-3 run one after another; steps 3.5 and 4 run as sub-workflows with internal parallelism
  • Fan-out pattern: Both enrichment and embedding discover work items first, then process them in parallel
  • Safe retries: All operations can be safely retried without producing duplicate data
  • Smart error handling: Transient errors (rate limits, temporary outages) are retried automatically; permanent errors fail immediately

Workflow Trigger

The workflow starts when you call POST /v1/documents/ingest:
  1. You upload a source document with a parent folder
  2. The API generates unique IDs for the document, version, and workflow
  3. The source file is uploaded to storage
  4. Document and version records are created in the database
  5. The ingestion workflow starts (with a 30-minute execution timeout)
  6. You receive the workflow_id, document_id, and document_version_id
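The trigger sequence above can be sketched as a client call. The endpoint path and the three returned identifiers come from this page; everything else (JSON body with a base64 payload, Bearer auth, field names, host) is an assumption, so treat this as a shape sketch rather than the real contract.

```python
import base64
import json
import urllib.request


def ingest_document(base_url: str, token: str, file_path: str,
                    parent_folder_id: str) -> dict:
    """Trigger the ingestion workflow via POST /v1/documents/ingest.

    The request schema here (base64 content, these field names) is
    hypothetical; consult the API reference for the actual format.
    """
    with open(file_path, "rb") as f:
        content = base64.b64encode(f.read()).decode()
    body = json.dumps({
        "filename": file_path.rsplit("/", 1)[-1],   # assumed field name
        "parent_folder_id": parent_folder_id,       # assumed field name
        "content_base64": content,                  # assumed encoding
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/documents/ingest",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",     # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def parse_ingest_response(payload: dict) -> tuple[str, str, str]:
    """Pull out the three identifiers the docs say the call returns."""
    return (payload["workflow_id"], payload["document_id"],
            payload["document_version_id"])
```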

Pipeline Steps

Step 1: Document Preparation

Downloads the source file and prepares it for conversion.
  • For PDF files: runs watermark removal and uploads the cleaned version
  • For other file types: passes the original source through unchanged
  • Updates pipeline status metadata

Step 2: Conversion

Converts the document into a structured format with extracted visual assets.
  • Submits the document to the conversion service
  • Extracts and uploads visual assets (page screenshots, images, tables) as WEBP files
  • Produces a structured JSON representation of the document
  • See Pipeline Routing for how different file types are handled
Conversion options include:
  • OCR with multi-language support (Chinese and English)
  • Accurate table structure detection
  • Image extraction at 2x scale
  • VLM (vision-language model) pipeline for PDFs

Step 3: Chunking

Breaks the converted document into searchable chunks organized by section.
  • Deletes any existing content (for safe retry)
  • Parses the structured JSON from the conversion step
  • Creates a section hierarchy based on document headings
  • Produces text, table, and image chunks in document order
  • See Chunk Handling for details on the chunking algorithm
Section hierarchy: Headings in your document are used to build a nested section tree. For example, a document with headings “Chapter 1” > “Background” creates sections that preserve the document’s logical structure.
Chunk types:
  • Text chunks: Regular text content with detected type classification
  • Table chunks: HTML representation for DOCX tables; placeholder for PDF tables (enriched in the next step)
  • Image chunks: Placeholder content (enriched in the next step)
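The heading-to-section-tree idea above can be sketched with a heading stack. This is a minimal illustration, not the service's chunker; it assumes headings arrive in document order as (level, title) pairs, with larger level meaning deeper nesting.

```python
def build_section_paths(headings: list[tuple[int, str]]) -> list[list[str]]:
    """For each heading, return the full section path it opens.

    A stack holds the currently open sections; a new heading pops
    everything at the same or deeper level before being pushed.
    """
    stack: list[tuple[int, str]] = []
    paths: list[list[str]] = []
    for level, title in headings:
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append([t for _, t in stack])
    return paths
```

Running this over the "Chapter 1" > "Background" example yields the nested paths the chunker would attach to chunks in those sections.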

Step 3.5: Enrichment

Uses LLM vision to generate searchable content for image and table chunks. This step runs as a sub-workflow that:
  1. Identifies which chunks need enrichment (images and tables with placeholder content)
  2. Enriches each chunk in parallel
For image chunks:
  • The image is analyzed by a vision LLM
  • A natural-language description (2-4 sentences) replaces the placeholder content
  • The description is optimized for search retrieval
For table chunks (from PDFs):
  • The table image is analyzed by a vision LLM to extract semantic HTML
  • A text summary is generated from the HTML
  • The placeholder is replaced with structured HTML content
For table chunks (from DOCX):
  • A text summary is generated from the existing HTML content
  • The summary is stored as metadata for improved search
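The filter-then-fan-out pattern of this sub-workflow can be sketched with asyncio. The chunk shape, the placeholder marker, and the fake enrichment call are all hypothetical stand-ins; the real step awaits a vision LLM per chunk.

```python
import asyncio

PLACEHOLDER = "<pending-enrichment>"  # marker value is hypothetical


async def enrich_chunk(chunk: dict) -> dict:
    """Stand-in for the vision-LLM call; here we just synthesize content."""
    await asyncio.sleep(0)  # the real implementation awaits an LLM API
    return {**chunk, "content": f"description of {chunk['id']}"}


async def enrich_all(chunks: list[dict]) -> list[dict]:
    """Step 1: find image/table chunks still holding placeholder content.
    Step 2: enrich them in parallel, leaving other chunks untouched."""
    pending = [c for c in chunks
               if c["type"] in ("image", "table")
               and c["content"] == PLACEHOLDER]
    enriched = await asyncio.gather(*(enrich_chunk(c) for c in pending))
    by_id = {c["id"]: c for c in enriched}
    return [by_id.get(c["id"], c) for c in chunks]
```

Because already-enriched chunks no longer match the placeholder filter, re-running the sub-workflow skips them, which is the retry-safety property described under Reliability.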

Step 4: Embedding

Generates vector embeddings and stores them in the vector database. This step runs as a sub-workflow that:
  1. Collects all chunk IDs, deactivates old version vectors, and splits chunks into batches
  2. Embeds and upserts each batch in parallel
Each chunk’s content is augmented before embedding with:
  • Section heading context (e.g., “Section: Architecture > Subsystem”)
  • Overlap context from adjacent chunks in the same section
  • For tables: HTML content plus the LLM-generated summary
  • For images: the LLM-generated description
See Chunk Handling for full details on how embedding text is constructed.
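The context assembly above can be sketched as string concatenation. The separator, the field names, and the overlap handling here are assumptions; only the "Section: A > B" prefix format and the table HTML-plus-summary rule come from this page.

```python
def build_embedding_text(chunk: dict, section_path: list[str],
                         prev_tail: str = "", next_head: str = "") -> str:
    """Assemble the text that gets embedded for one chunk.

    Joins (in order): section heading context, overlap from the previous
    chunk, the chunk body (HTML plus summary for tables), and overlap
    from the next chunk. Separators are illustrative.
    """
    parts: list[str] = []
    if section_path:
        parts.append("Section: " + " > ".join(section_path))
    if prev_tail:
        parts.append(prev_tail)
    if chunk["type"] == "table":
        parts.append(chunk["content"])        # HTML representation
        if chunk.get("summary"):
            parts.append(chunk["summary"])    # LLM-generated summary
    else:
        parts.append(chunk["content"])        # text, or image description
    if next_head:
        parts.append(next_head)
    return "\n".join(parts)
```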

Re-Embed Folder

The re-embed workflow regenerates embeddings for all documents in a folder and its subfolders. Trigger it with POST /v1/folders/{folder_id}?action=reembed; the workflow then:
  1. Lists all documents in the folder tree (up to 30 levels deep)
  2. Starts an embedding sub-workflow for each document in parallel
  3. Returns the total number of vectors upserted
When to use: Re-embedding is needed when your embedding model or its configuration changes and existing vectors need to be regenerated.
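The depth-limited folder walk in step 1 can be sketched recursively. The in-memory folder shape used here is purely illustrative; the service performs this traversal against its own store, but the 30-level cap comes from this page.

```python
def list_documents(folder: dict, depth: int = 0, max_depth: int = 30) -> list[str]:
    """Collect document IDs from a folder tree, at most max_depth levels deep.

    `folder` is a hypothetical dict with "documents" (list of IDs) and
    "subfolders" (list of nested folders of the same shape).
    """
    if depth >= max_depth:
        return []
    docs = list(folder.get("documents", []))
    for sub in folder.get("subfolders", []):
        docs.extend(list_documents(sub, depth + 1, max_depth))
    return docs
```

Each collected document then gets its own embedding sub-workflow, started in parallel.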

Configuration

Pipeline behavior is controlled by these settings:
document:
  max_pages: 150
  supported_formats: [pdf, docx, pptx, md, txt]

chunking:
  max_tokens: 512
  overlap_tokens: 64

embedding:
  chunks_per_batch: 200
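The chunks_per_batch setting drives the batch split in the embedding sub-workflow. A minimal sketch of that split, with the default matching the config above:

```python
def split_batches(chunk_ids: list[str],
                  chunks_per_batch: int = 200) -> list[list[str]]:
    """Slice chunk IDs into fixed-size embedding batches; the final
    batch holds whatever remains."""
    return [chunk_ids[i:i + chunks_per_batch]
            for i in range(0, len(chunk_ids), chunks_per_batch)]
```

So a 450-chunk document embeds as three parallel batches of 200, 200, and 50.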

Processing Timeouts

Step                       Timeout      Heartbeat
Document Preparation       60s          -
Conversion                 2 hours      -
Chunking                   10 minutes   -
Enrichment (filter)        60s          -
Enrichment (per chunk)     2 minutes    60s
Embedding (split)          60s          -
Embedding (per batch)      2 minutes    60s
Folder document listing    2 minutes    30s
Overall workflow execution timeout: 30 minutes.

Error Handling

Errors are classified and handled automatically:
  • Retryable: HTTP 429 (rate limit), 502, 503, and network errors — retried with exponential backoff
  • Non-retryable: HTTP 400, 401, 403, 404, 500 — fail immediately
Retry policy: up to 3 retries, starting at 5 seconds and backing off up to 60 seconds.

Reliability

All pipeline operations are idempotent (safe to retry):
  • Chunking: Clears existing content before re-creating
  • Storage uploads: Overwrites existing objects
  • Content deduplication: Identical content is stored once per tenant
  • Metadata updates: Safely overwrite existing values
  • Enrichment: Already-enriched chunks are skipped on retry
  • Embedding: Old vectors are deactivated before upserting new ones
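The deactivate-then-upsert pattern from the last bullet can be sketched against an in-memory store. The store shape is hypothetical (a real vector database API will differ); the point is that re-running the operation overwrites rather than duplicates.

```python
def upsert_vectors(store: dict, version_id: str, vectors: dict) -> dict:
    """Idempotent embedding write: deactivate vectors from other document
    versions, then upsert the new batch keyed by vector ID.

    `store` maps vector ID -> {"version": str, "active": bool, "values": list};
    this shape is illustrative only.
    """
    # 1. Deactivate everything belonging to older versions.
    for vec in store.values():
        if vec["version"] != version_id:
            vec["active"] = False
    # 2. Upsert the new batch; a retry rewrites the same keys, so no
    #    duplicate vectors can accumulate.
    for vid, values in vectors.items():
        store[vid] = {"version": version_id, "active": True, "values": values}
    return store
```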