Knowledge Stack extends the conversion engine with a postprocessor that infers and corrects heading hierarchy, preserving your document’s structure.

How Headings Are Detected

The postprocessor uses three strategies, applied in priority order:

1. PDF Bookmarks / Table of Contents

If the PDF has a table of contents or bookmark tree embedded in its metadata, the postprocessor extracts this structure and maps it to the parsed headings. This is the most reliable source of heading hierarchy.
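As a rough illustration of the mapping step, the sketch below matches bookmark titles against parsed heading text. The function names, the `(level, title)` tuple shape, and the title-normalization rule are all assumptions made for this example; the postprocessor's actual matching logic is not documented here.

```python
import re

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so small extraction differences still match."""
    return re.sub(r"\s+", " ", title).strip().lower()

def apply_bookmark_levels(bookmarks, headings):
    """Assign each parsed heading the level of the bookmark with the same title.

    bookmarks: list of (level, title) pairs taken from the PDF outline (hypothetical shape).
    headings:  list of heading strings in document order.
    Returns a list of (level_or_None, heading) pairs; None means no bookmark matched.
    """
    levels = {}
    for level, title in bookmarks:
        # Keep the first level seen if two bookmarks share a title.
        levels.setdefault(normalize(title), level)
    return [(levels.get(normalize(h)), h) for h in headings]

bookmarks = [(1, "Introduction"), (2, "Background"), (1, "Methods")]
headings = ["Introduction", "Background ", "METHODS", "Appendix"]
result = apply_bookmark_levels(bookmarks, headings)
```

Headings with no matching bookmark come back with `None`, which is where a lower-priority strategy could take over.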

2. Numbering Patterns

When bookmarks are unavailable, the postprocessor detects hierarchical numbering in heading text:
  • Decimal numbering: 1, 1.1, 1.2, 1.2.1
  • Roman numerals: I, II, III
  • Lettered outlines: A, B, C
The nesting level is inferred from the numbering depth.
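A minimal sketch of numbering detection might look like the following. The regular expressions and the `classify` helper are assumptions for illustration only. Depth is derived just for decimal numbering, because the way Roman and lettered schemes map to levels is not specified above.

```python
import re

# Hypothetical patterns; the real detector may accept more variants.
DECIMAL = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")   # "1", "1.2", "1.2.1", optional trailing dot
ROMAN = re.compile(r"^[IVX]+\.\s+\S")               # "I.", "II.", "XIV." (small numerals only)
LETTER = re.compile(r"^[A-Z]\.\s+\S")               # "A.", "B.", "C."

def classify(heading: str):
    """Return (scheme, depth) for a heading; depth is None when it cannot be inferred."""
    match = DECIMAL.match(heading)
    if match:
        # "1.2.1" has two dots, so three components, so depth 3.
        return "decimal", match.group(1).count(".") + 1
    # Checked before LETTER, so "I.", "V." and "X." are treated as Roman numerals.
    if ROMAN.match(heading):
        return "roman", None
    if LETTER.match(heading):
        return "letter", None
    return None, None

examples = ["1.2.1 Data Collection", "1. Introduction", "II. Methods", "B. Sampling", "Overview"]
classified = [classify(text) for text in examples]
```

Requiring a trailing period for Roman and lettered headings is a deliberate choice in this sketch: it avoids treating ordinary sentences such as "A study of..." as outline entries. The decimal pattern can still produce false positives on headings that start with a plain number, such as a year.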

3. Font Styling

As a fallback when numbering is also absent, the postprocessor clusters headings by font size and style (bold/italic). Larger, bolder text is placed higher in the hierarchy, closer to the top level. This uses statistical clustering to find natural groupings in the document’s typography.
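To make the idea concrete, here is a simplified sketch that groups headings by a style score and ranks the groups. It uses a plain tolerance-based grouping of sorted scores rather than the statistical clustering the postprocessor applies, ignores italics, and the `BOLD_BONUS` and `tolerance` values are arbitrary assumptions.

```python
BOLD_BONUS = 1.0  # assumed weight: bold outranks regular text of the same size

def style_score(size: float, bold: bool) -> float:
    """Combine font size and weight into a single number for ranking."""
    return size + (BOLD_BONUS if bold else 0.0)

def assign_levels(headings, tolerance=0.5):
    """Group headings with similar style scores and rank groups from largest to smallest.

    headings: list of (text, font_size, is_bold) tuples (hypothetical shape).
    Returns a list of (level, text) pairs, where level 1 is the top of the hierarchy.
    """
    scores = sorted({style_score(size, bold) for _, size, bold in headings}, reverse=True)
    level_of = {}
    level = 0
    previous = None
    for score in scores:
        # Start a new level whenever the gap to the previous score exceeds the tolerance.
        if previous is None or previous - score > tolerance:
            level += 1
        level_of[score] = level
        previous = score
    return [(level_of[style_score(size, bold)], text) for text, size, bold in headings]

headings = [
    ("Annual Report", 24.0, True),
    ("Methods", 18.0, True),
    ("Results", 18.2, True),      # slightly different size, same visual tier
    ("Sampling", 14.0, True),
    ("Notes", 14.0, False),
]
levels = assign_levels(headings)
```

In this example "Methods" and "Results" land on the same level despite the 0.2 pt size difference, while the bold and regular 14 pt headings are separated.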

What This Means for Your Documents

The heading hierarchy directly affects how your documents are chunked and organized:
  • Better section structure: Chunks are organized under the correct heading hierarchy, making navigation intuitive
  • Improved search: Section context (e.g., “Chapter 3 > Methods > Data Collection”) is included in embeddings, helping the AI understand where content lives
  • Accurate breadcrumbs: When the agent cites content, the citation path reflects the real document structure
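The section context mentioned above can be derived from a list of leveled headings with a simple stack. The sketch below is an assumption about how such paths could be built, not a description of the product's internals; the `breadcrumbs` function and its input shape are hypothetical.

```python
def breadcrumbs(headings, separator=" > "):
    """Build a breadcrumb path for each heading from (level, title) pairs in document order."""
    stack = []   # currently open ancestors as (level, title)
    paths = []
    for level, title in headings:
        # Close every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append(separator.join(t for _, t in stack))
    return paths

headings = [
    (1, "Chapter 3"),
    (2, "Methods"),
    (3, "Data Collection"),
    (2, "Results"),
]
paths = breadcrumbs(headings)
```

A chunk under "Data Collection" would then carry the path "Chapter 3 > Methods > Data Collection". If the heading levels were wrong, the path would be wrong too, which is why the hierarchy matters for both search and citations.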

Format Compatibility

Format | Hierarchy Support
PDF with bookmarks | Best: uses the embedded table of contents
PDF with numbered headings | Good: infers from numbering patterns
PDF with styled headings | Good: infers from font size and weight
Scanned PDFs | Limited: depends on OCR correctly identifying headings; only numbering patterns and bookmarks are usable (font styling data is unavailable)
DOCX, PPTX | Not needed: these formats natively preserve heading levels

Limitations

  • Scanned PDFs provide no usable font styling data, so hierarchy inference relies on bookmarks and numbering patterns only. Results also depend on OCR correctly identifying headings.
  • Non-PDF formats (DOCX, PPTX, HTML, etc.) skip this postprocessing entirely because they typically preserve heading levels natively.
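
Taken together, the priority order and the limitations above amount to a small decision rule. The sketch below restates it in code. The function name, its parameters, and the returned labels are hypothetical and exist only for illustration.

```python
def pick_strategy(file_format, has_bookmarks, has_numbering, has_font_data):
    """Return the name of the heading strategy that would apply, or None if none does.

    All parameters are assumed inputs: the file format as a string and three flags
    describing what the parsed document offers.
    """
    if file_format.lower() != "pdf":
        return None  # non-PDF formats skip this postprocessing
    if has_bookmarks:
        return "bookmarks"      # highest priority
    if has_numbering:
        return "numbering"
    if has_font_data:
        return "font_styling"   # fallback; not usable for scanned PDFs
    return None

# A scanned PDF without bookmarks: only numbering can be used.
scanned = pick_strategy("pdf", has_bookmarks=False, has_numbering=True, has_font_data=False)
# A DOCX file already carries heading levels, so nothing is inferred.
docx = pick_strategy("docx", has_bookmarks=False, has_numbering=True, has_font_data=True)
```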