Ingest a document

When you POST /v1/documents/ingest, Knowledge Stack starts a durable Temporal workflow that prepares, converts, chunks, enriches, and embeds the file. The endpoint returns immediately with a workflow_id you can poll. Files become searchable as soon as the workflow completes.

Looking for the request/response schema? See /v1/documents/ingest in the API Reference.
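The ingest-then-poll flow can be sketched as below. This is a minimal sketch, not the official client: the JSON field names (`workflow_id`, `status`) and terminal states are assumptions based on this page, and the HTTP helpers are injected so the flow is easy to test offline.

```python
import time
from typing import Callable

def ingest_and_wait(post: Callable[[str, dict], dict],
                    get: Callable[[str], dict],
                    payload: dict,
                    interval: float = 2.0,
                    max_polls: int = 100) -> dict:
    """Start an ingest and poll the workflow until it finishes.

    `post`/`get` are injected HTTP helpers (e.g. thin wrappers around an
    HTTP client) so the flow can be exercised without a network.
    """
    # POST returns immediately with a workflow_id to poll.
    workflow_id = post("/v1/documents/ingest", payload)["workflow_id"]
    for _ in range(max_polls):
        status = get(f"/v1/workflows/{workflow_id}")
        # COMPLETED is set once embeddings are upserted into Qdrant.
        if status["status"] in ("COMPLETED", "FAILED"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"workflow {workflow_id} still running")
```

In production code the injected helpers would wrap your HTTP client with auth headers; the structure stays the same.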
Pipeline stages
| Step | What happens | Deep dive |
|---|---|---|
| 1. Preparation | Validate format/size, persist source file to S3, create Document + DocumentVersion rows, dewatermark PDFs. | PDF watermark, S3 storage |
| 2. Conversion | Route to a converter based on MIME type (Docling for PDFs, Excel pipeline for XLSX, etc.) and extract text + visual assets. | Routing, Docling PDF, Excel |
| 3. Chunking | Walk the document hierarchy, emit Section and Chunk rows with bounding boxes, page numbers, and roles. | Chunk handling |
| 3.5. Enrichment | Generate captions for images and structured summaries for tables via an LLM. | Chunk handling |
| 4. Embedding | Embed each chunk and upsert vectors into Qdrant; mark the workflow COMPLETED. | Qdrant, Temporal workflow |
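The stage-2 routing step can be sketched as a MIME-type dispatch. The converter names below echo the table (Docling for PDFs, a dedicated Excel pipeline), but the exact mapping and identifiers are assumptions for illustration:

```python
# Hypothetical MIME-type routing for the conversion stage (stage 2).
CONVERTERS = {
    "application/pdf": "docling",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "excel",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "office",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "office",
    "text/markdown": "markdown",
    "text/plain": "plaintext",
}

def pick_converter(mime_type: str) -> str:
    """Return the converter responsible for a given MIME type."""
    try:
        return CONVERTERS[mime_type]
    except KeyError:
        raise ValueError(f"unsupported format: {mime_type}") from None
```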
Supported formats
PDF, DOCX, PPTX, XLSX, Markdown, plaintext. Hard limits: 100 MB per file, 150 pages per document.

Watch a workflow
Each ingest returns a workflow_id. Poll it via /v1/workflows:
| Method | Path | Description |
|---|---|---|
| GET | /v1/workflows | List recent workflows for your tenant |
| GET | /v1/workflows/{id} | Live status, per-activity timing, error traces |
| POST | /v1/workflows/{id}?action=cancel | Cancel (OWNER / ADMIN only) |
| POST | /v1/workflows/{id}?action=rerun | Rerun from scratch (OWNER / ADMIN only) |
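The two action endpoints can be wrapped in a small helper. A sketch, with the transport injected for testability; the client-side role check simply mirrors the OWNER / ADMIN restriction in the table (the API enforces it server-side):

```python
from typing import Callable

PRIVILEGED_ROLES = {"OWNER", "ADMIN"}

def workflow_action(post: Callable[[str], dict],
                    workflow_id: str,
                    action: str,
                    role: str) -> dict:
    """Cancel or rerun a workflow via POST /v1/workflows/{id}?action=..."""
    if action not in ("cancel", "rerun"):
        raise ValueError(f"unknown action: {action}")
    # Fail fast client-side; the server enforces this anyway.
    if role not in PRIVILEGED_ROLES:
        raise PermissionError(f"{role} may not {action} workflows")
    return post(f"/v1/workflows/{workflow_id}?action={action}")
```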
Reliability model
- Durable — each step runs as a Temporal activity with timeout + retry policy. The workflow survives worker crashes and infrastructure restarts.
- Idempotent — chunking clears prior content before re-creating; storage uploads overwrite at the same paths; identical content is deduplicated per-tenant.
- Observable — every step emits structured logs, metrics, and a span tree visible in the Temporal UI.
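The per-tenant dedup in the idempotency bullet could be keyed on (tenant, content hash), so identical bytes map back to the first document for that tenant only. A minimal sketch; the keying scheme and class are assumptions:

```python
import hashlib

class DedupIndex:
    """Deduplicate identical content per tenant, as in the idempotency model."""

    def __init__(self):
        self._seen: dict[tuple[str, str], str] = {}

    def register(self, tenant_id: str, content: bytes, document_id: str) -> str:
        """Return the canonical document id for this content.

        A second upload of identical bytes by the same tenant maps back to
        the first document; another tenant's identical upload does not.
        """
        key = (tenant_id, hashlib.sha256(content).hexdigest())
        return self._seen.setdefault(key, document_id)
```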
Retry classification
| Class | Examples | Behavior |
|---|---|---|
| Retryable | 429, 502, 503, network errors | Up to 3 retries with exponential backoff (5s → 60s) |
| Non-retryable | 400, 401, 403, 404, 500 | Fail immediately; surface error on the workflow |
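A client-side mirror of this policy might look like the sketch below: 3 retries with exponential delays starting at 5s and capped at 60s. The growth factor is an assumption; only the status-code split and the 5s/60s bounds come from the tables above.

```python
RETRYABLE = {429, 502, 503}

def is_retryable(status_code: int) -> bool:
    """Transient failures retry; 400/401/403/404/500 fail the workflow."""
    return status_code in RETRYABLE

def backoff_schedule(retries: int = 3, base: float = 5.0,
                     cap: float = 60.0, factor: float = 2.0) -> list[float]:
    """Exponential delays starting at `base`, never exceeding `cap`."""
    return [min(base * factor ** i, cap) for i in range(retries)]
```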
Per-step timeouts
| Step | Timeout |
|---|---|
| Preparation | 60s |
| Conversion | up to 2h |
| Chunking | 10m |
| Enrichment | 2m / chunk |
| Embedding | 2m / batch |
| Workflow total | 30m |
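These limits could be carried as structured activity options rather than hard-coded per call site. A sketch mirroring the table; note the enrichment and embedding limits apply per chunk/batch, not to the whole step:

```python
from datetime import timedelta

# Start-to-close timeouts per pipeline step, mirroring the table above.
STEP_TIMEOUTS = {
    "preparation": timedelta(seconds=60),
    "conversion": timedelta(hours=2),
    "chunking": timedelta(minutes=10),
    "enrichment": timedelta(minutes=2),   # per chunk
    "embedding": timedelta(minutes=2),    # per batch
}

def timeout_for(step: str) -> timedelta:
    """Look up the timeout for a step, failing loudly on unknown steps."""
    if step not in STEP_TIMEOUTS:
        raise KeyError(f"no timeout configured for step: {step}")
    return STEP_TIMEOUTS[step]
```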
Specialized task queues
Each stage is dispatched to its own queue so heavy work doesn’t block lighter work:

| Queue | Purpose |
|---|---|
| document-ingestion | Preparation, chunking, orchestration |
| document-conversion | DOCX / PPTX / XLSX / MD conversion |
| pdf-conversion | VLM-backed PDF conversion (GPU-heavy) |
| enrichment | Image captions + table summaries |
| embedding | Embedding generation + Qdrant upsert |
| vector-ops | Reindex, re-embed, vector maintenance |
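Queue selection can be sketched as a dispatch on pipeline stage, with conversion split on MIME type so PDFs land on the GPU-backed queue. The queue names come from the table; the function and stage names are assumptions:

```python
def queue_for(stage: str, mime_type: str = "") -> str:
    """Pick the task queue for a pipeline stage.

    Conversion splits on MIME type: PDFs go to the GPU-heavy queue,
    everything else to the general converter pool.
    """
    if stage == "conversion":
        return "pdf-conversion" if mime_type == "application/pdf" else "document-conversion"
    return {
        "preparation": "document-ingestion",
        "chunking": "document-ingestion",
        "enrichment": "enrichment",
        "embedding": "embedding",
        "reindex": "vector-ops",
    }[stage]
```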
Re-embedding a folder
Switching embedding models? Trigger a folder-wide re-embed.

Recipes
Bulk ingest from S3
Stream a whole bucket through /documents/ingest with backpressure.

CI ingest pipeline
Re-ingest changed docs on every PR using kscli.