Knowledge Stack extends the conversion engine with a postprocessor that infers and corrects heading hierarchy, preserving your document’s structure.
How Headings Are Detected
The postprocessor uses three strategies, applied in priority order:
1. PDF Bookmarks / Table of Contents
If the PDF has a table of contents or bookmark tree embedded in its metadata, the postprocessor extracts this structure and maps it to the parsed headings. This is the most reliable source of heading hierarchy.
2. Numbering Patterns
When bookmarks are unavailable, the postprocessor detects hierarchical numbering in heading text:
- Decimal numbering: 1, 1.1, 1.2, 1.2.1
- Roman numerals: I, II, III
- Lettered outlines: A, B, C
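As an illustration of how numbering patterns can map to heading depth, here is a minimal Python sketch. The regular expressions and the `numbering_depth` function are hypothetical and are not Knowledge Stack's actual implementation; they only show the general idea that a prefix such as `1.2.1` implies a third-level heading.

```python
import re

# Hypothetical patterns for illustration; the real postprocessor's rules may differ.
DECIMAL = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")
# Roman numerals up to XXXIX. Single letters I, V, and X are ambiguous with
# lettered outlines; this sketch treats them as Roman numerals.
ROMAN = re.compile(r"^(?=[IVX])(X{0,3}(?:IX|IV|V?I{0,3}))\.\s+\S")
LETTER = re.compile(r"^([A-Z])\.\s+\S")


def numbering_depth(heading):
    """Return (scheme, depth) for a heading, or None if no numbering is found."""
    text = heading.strip()
    match = DECIMAL.match(text)
    if match:
        # "1" -> depth 1, "1.2" -> depth 2, "1.2.1" -> depth 3
        return ("decimal", match.group(1).count(".") + 1)
    if ROMAN.match(text):
        return ("roman", 1)
    if LETTER.match(text):
        return ("letter", 1)
    return None
```

In this sketch, Roman and lettered headings report depth 1 within their own scheme. How the schemes nest relative to one another (for example, `A` under `I`) depends on the document and is left out here.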
3. Font Styling
As a fallback when numbering is also absent, the postprocessor clusters headings by font size and weight (bold/italic). Larger, bolder text is placed higher in the hierarchy, closer to the top-level heading. Statistical clustering finds the natural groupings in the document’s typography.
What This Means for Your Documents
The heading hierarchy directly affects how your documents are chunked and organized:
- Better section structure: Chunks are organized under the correct heading hierarchy, making navigation intuitive
- Improved search: Section context (e.g., “Chapter 3 > Methods > Data Collection”) is included in embeddings, helping the AI understand where content lives
- Accurate breadcrumbs: When the agent cites content, the citation path reflects the real document structure
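The section context mentioned in the bullets above can be pictured as a breadcrumb path built from heading levels. The sketch below is illustrative only: the `breadcrumbs` function and the `(level, title)` input shape are assumptions, not the product's API.

```python
def breadcrumbs(headings):
    """Build a breadcrumb path for each heading.

    headings: list of (level, title) tuples in document order, level 1 = top.
    Returns one " > "-joined path per heading.
    """
    stack = []  # titles of the current heading and its ancestors
    paths = []
    for level, title in headings:
        # Drop the previous heading at this level and anything deeper.
        del stack[level - 1:]
        stack.append(title)
        paths.append(" > ".join(stack))
    return paths


example = [(1, "Chapter 3"), (2, "Methods"), (3, "Data Collection")]
print(breadcrumbs(example)[-1])  # Chapter 3 > Methods > Data Collection
```

A wrong heading level changes the path directly: if "Methods" were misread as level 1, the last path would lose "Chapter 3", which is why correcting the hierarchy matters for search context and citations.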
Format Compatibility
| Format | Hierarchy Support |
|---|---|
| PDF with bookmarks | Best — uses the embedded table of contents |
| PDF with numbered headings | Good — infers from numbering patterns |
| PDF with styled headings | Good — infers from font size and weight |
| Scanned PDFs | Limited — depends on OCR correctly identifying headings; only numbering patterns and bookmarks are usable (font styling data is unavailable) |
| DOCX, PPTX | Not needed — these formats natively preserve heading levels |
Limitations
- Scanned PDFs have limited font styling data, so hierarchy inference relies on numbering patterns and bookmarks only.
- Non-PDF formats (DOCX, PPTX, HTML, etc.) skip this postprocessing entirely because they typically preserve heading levels natively.
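Taken together, the priority order and the limitations above amount to a simple fallback chain. The following Python sketch shows one way to express it; `PdfSignals`, `pick_strategy`, and the strategy names are hypothetical and chosen for illustration, not taken from Knowledge Stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdfSignals:
    """Hypothetical summary of what a parsed PDF exposes."""
    has_bookmarks: bool
    has_numbered_headings: bool
    has_font_data: bool  # assumed False for scanned PDFs


def pick_strategy(file_format: str, signals: Optional[PdfSignals] = None) -> Optional[str]:
    """Return the first applicable strategy name, or None if nothing applies."""
    if file_format.lower() != "pdf" or signals is None:
        return None  # non-PDF formats skip the postprocessor
    if signals.has_bookmarks:
        return "bookmarks"
    if signals.has_numbered_headings:
        return "numbering"
    if signals.has_font_data:
        return "font_styling"
    return None
```

For a scanned PDF, `has_font_data` would be false in this model, so only the first two branches can fire, matching the limitation described above.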
