PDF Heading Hierarchy - Knowledge Stack

Knowledge Stack extends the conversion engine with a postprocessor that infers and corrects heading hierarchy, preserving your document’s structure.

How Headings Are Detected

The postprocessor uses three strategies, applied in priority order:

1. PDF Bookmarks / Table of Contents

If the PDF has a table of contents or bookmark tree embedded in its metadata, the postprocessor extracts this structure and maps it to the parsed headings. This is the most reliable source of heading hierarchy.
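As a rough illustration of the mapping step, the sketch below matches bookmark titles to parsed heading titles. It assumes the bookmark tree has already been flattened into (level, title) pairs; the function names and the title-matching rule are hypothetical and are not Knowledge Stack's actual API.

```python
import re

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so small extraction differences still match."""
    return re.sub(r"\s+", " ", title).strip().lower()

def apply_bookmark_levels(bookmarks, headings):
    """Give each parsed heading the level of the bookmark with the same title.

    bookmarks: list of (level, title) pairs, where level 1 is the top of the outline.
    headings:  list of heading titles as parsed from the page content.
    Returns a list of (level_or_None, title); None means no bookmark matched.
    """
    level_by_title = {}
    for level, title in bookmarks:
        # Keep the first occurrence if a title repeats in the outline.
        level_by_title.setdefault(normalize(title), level)
    return [(level_by_title.get(normalize(h)), h) for h in headings]

bookmarks = [(1, "Introduction"), (1, "Methods"), (2, "Data Collection")]
headings = ["Introduction", "Methods", "Data  Collection", "Appendix"]
leveled = apply_bookmark_levels(bookmarks, headings)
```

Headings left with `None` (here, "Appendix") would presumably fall through to the next strategy.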

2. Numbering Patterns

When bookmarks are unavailable, the postprocessor detects hierarchical numbering in heading text:

Decimal numbering: 1, 1.1, 1.2, 1.2.1
Roman numerals: I, II, III
Lettered outlines: A, B, C

The nesting level is inferred from the numbering depth.
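For decimal numbering, depth inference can be sketched as counting the dot-separated parts of the prefix. The regular expression and the `decimal_depth` helper below are illustrative assumptions; how Roman numerals and lettered outlines are ranked against decimal numbering is not specified here, so the sketch leaves them out.

```python
import re

# Leading decimal numbering such as "1", "1.2", or "1.2.1", optionally followed by a dot.
DECIMAL_PREFIX = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")

def decimal_depth(heading: str):
    """Return the nesting depth implied by a decimal prefix, or None if there is none."""
    match = DECIMAL_PREFIX.match(heading)
    if match is None:
        return None
    # "1" has depth 1, "1.2" has depth 2, "1.2.1" has depth 3.
    return match.group(1).count(".") + 1

depths = [decimal_depth(h) for h in ["1 Introduction", "1.2 Scope", "1.2.1 Details", "Summary"]]
```

A pattern this simple also matches headings that merely start with a number (for example a year), so a real detector would likely need extra checks.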

3. Font Styling

As a fallback when numbering is also absent, the postprocessor clusters headings by font size and style (bold/italic). Larger, bolder text is ranked higher in the hierarchy, that is, closer to the top level. Statistical clustering finds the natural groupings in the document’s typography.
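The idea can be sketched with a deliberately simple stand-in for the clustering step: font sizes that differ by no more than a tolerance are grouped together, and groups are then ranked from largest to smallest, with bold ahead of regular text within a group. The gap-based grouping, the `tolerance` parameter, and the input shape are assumptions for illustration; the postprocessor's actual clustering method is not described here, and italics are ignored.

```python
def levels_from_fonts(headings, tolerance=0.5):
    """headings: list of (title, font_size, is_bold). Returns a list of (level, title)."""
    # Walk the distinct sizes from largest to smallest and start a new group
    # whenever the gap to the previous size exceeds the tolerance.
    sizes = sorted({size for _, size, _ in headings}, reverse=True)
    group_of = {}
    group = 0
    for i, size in enumerate(sizes):
        if i > 0 and sizes[i - 1] - size > tolerance:
            group += 1
        group_of[size] = group
    # Rank (size group, weight) pairs: larger sizes first, bold before regular.
    keys = sorted({(group_of[size], not bold) for _, size, bold in headings})
    level_of = {key: rank + 1 for rank, key in enumerate(keys)}
    return [(level_of[(group_of[size], not bold)], title) for title, size, bold in headings]

styled = [
    ("Report", 24.0, True),
    ("Methods", 18.0, True),
    ("Results", 18.2, True),   # slightly different size, same group as "Methods"
    ("Sampling", 18.0, False),
    ("Notes", 14.0, False),
]
leveled = levels_from_fonts(styled)
```

Here level 1 is the top of the hierarchy, so "Report" ranks first and "Notes" last.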

What This Means for Your Documents

The heading hierarchy directly affects how your documents are chunked and organized:

Better section structure: Chunks are organized under the correct heading hierarchy, making navigation intuitive
Improved search: Section context (e.g., “Chapter 3 > Methods > Data Collection”) is included in embeddings, helping the AI understand where content lives
Accurate breadcrumbs: When the agent cites content, the citation path reflects the real document structure
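To show how heading levels turn into section context, the sketch below builds a breadcrumb path for each heading from a list of (level, title) pairs in document order. The `breadcrumbs` function and the " > " separator are illustrative choices, not the product's internal representation.

```python
def breadcrumbs(headings, separator=" > "):
    """headings: list of (level, title) in document order. Returns one path per heading."""
    stack = []  # (level, title) pairs for the current chain of ancestors
    paths = []
    for level, title in headings:
        # Drop entries at the same or a deeper level before adding this heading.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append(separator.join(t for _, t in stack))
    return paths

paths = breadcrumbs([
    (1, "Chapter 3"),
    (2, "Methods"),
    (3, "Data Collection"),
    (2, "Results"),
])
```

If a heading level is wrong, every path beneath it is wrong too, which is why the corrected hierarchy matters for both search context and citations.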

Format Compatibility

Format	Hierarchy Support
PDF with bookmarks	Best — uses the embedded table of contents
PDF with numbered headings	Good — infers from numbering patterns
PDF with styled headings	Good — infers from font size and weight
Scanned PDFs	Limited — depends on OCR correctly identifying headings; only numbering patterns and bookmarks are usable (font styling data is unavailable)
DOCX, PPTX	Not needed — these formats natively preserve heading levels

Limitations

Scanned PDFs provide little or no usable font styling data, so hierarchy inference relies on bookmarks and numbering patterns only.
Non-PDF formats (DOCX, PPTX, HTML, etc.) skip this postprocessing entirely because they typically preserve heading levels natively.
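Taken together, the priority order and these limitations can be read as a simple fallback chain. The sketch below is one way to express it, under the assumption that a single strategy is chosen per document; the `choose_strategy` function, its flags, and the returned labels are hypothetical.

```python
def choose_strategy(file_format, has_bookmarks, has_numbering, has_font_data):
    """Return the name of the first usable strategy, or None if nothing applies."""
    if file_format.lower() != "pdf":
        return None  # non-PDF formats skip heading postprocessing entirely
    if has_bookmarks:
        return "bookmarks"      # most reliable source
    if has_numbering:
        return "numbering"      # used when bookmarks are unavailable
    if has_font_data:
        return "font_styling"   # fallback when numbering is also absent
    return None

# A scanned PDF without bookmarks: font data is treated as unusable here.
scanned = choose_strategy("pdf", has_bookmarks=False, has_numbering=True, has_font_data=False)
```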
