How It Works
When you upload an Excel file, it goes through a specialized conversion and chunking path before joining the shared enrichment and embedding pipeline:
```
Upload .xlsx/.xlsm
        |
        v
Step 1: Preparation
  -> Download the file from storage
        |
        v
Step 2: Excel Conversion
  -> Parse workbook structure (sheets, cells, formulas, dependencies)
  -> Extract embedded images and convert to WEBP
  -> Store structured parse result as JSON
        |
        v
Step 3: Excel Chunking
  -> Create one section per sheet
  -> Create chunks for each content block (tables, text, images, charts)
  -> Build enriched HTML with formula and formatting data
        |
        v
Steps 4-7: Shared Pipeline
  -> Vector cleanup -> Enrichment -> Embedding -> Completion
```
Content Block to Chunk Mapping
Each block detected in the spreadsheet maps to a specific chunk type:
| Block Type | Chunk Type | Content |
|---|---|---|
| Table, calculation block, assumptions, results, label block, data block | TABLE | Plain-text rendering plus enriched HTML with formulas |
| Chart | IMAGE | Structured chart description (type, series, axes) |
| Embedded image | IMAGE | Alt text or placeholder; original image stored in S3 |
| Header, text block, mixed content, sparse cells | TEXT | Plain-text cell content |
| Empty blocks | Skipped | Not chunked |
Important: A table is always kept as a single chunk and is never split. This preserves structural integrity for AI reasoning over tabular data.
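The mapping above can be sketched as a simple lookup. This is a minimal illustration, not the actual implementation: the block-type identifiers (`calculation_block`, `sparse_cells`, and so on), the `ChunkType` enum, and the `chunk_type_for` helper are hypothetical names chosen to mirror the table.

```python
from enum import Enum
from typing import Optional


class ChunkType(Enum):
    TABLE = "TABLE"
    IMAGE = "IMAGE"
    TEXT = "TEXT"


# Hypothetical block-type identifiers that mirror the table above;
# the real parser's names are not documented on this page.
BLOCK_TO_CHUNK = {
    "table": ChunkType.TABLE,
    "calculation_block": ChunkType.TABLE,
    "assumptions": ChunkType.TABLE,
    "results": ChunkType.TABLE,
    "label_block": ChunkType.TABLE,
    "data_block": ChunkType.TABLE,
    "chart": ChunkType.IMAGE,
    "embedded_image": ChunkType.IMAGE,
    "header": ChunkType.TEXT,
    "text_block": ChunkType.TEXT,
    "mixed_content": ChunkType.TEXT,
    "sparse_cells": ChunkType.TEXT,
}


def chunk_type_for(block_type: str, is_empty: bool = False) -> Optional[ChunkType]:
    """Return the chunk type for a block, or None when the block is skipped."""
    if is_empty:
        return None  # empty blocks are not chunked
    return BLOCK_TO_CHUNK[block_type]
```

For example, `chunk_type_for("results")` yields `ChunkType.TABLE`, while an empty block of any type yields `None`.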
Enriched HTML for Tables
Every TABLE chunk carries an enriched HTML representation that preserves formulas and cell formatting lost in the plain-text version. This enriched HTML is stored as metadata on the chunk.
```html
<table data-sheet="Sheet1" data-range="A1:D10" data-block-type="table">
  <caption>LLM-parsed header: "Q4 Revenue by Region"</caption>
  <thead>
    <tr>
      <th data-cell="A1">Product</th>
      <th data-cell="B1">Q1</th>
      <th data-cell="C1">Q2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td data-cell="A2">Widget A</td>
      <td data-cell="B2" data-formula="=SUM(B3:B5)" data-fmt="bold"
          data-font-color="FF0000">150</td>
      <td data-cell="C2" data-formula="=B2*1.1"
          data-fill-color="FFFF00">165</td>
    </tr>
  </tbody>
</table>
```
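If you consume the enriched HTML yourself, the data attributes can be read back with any HTML parser. The sketch below uses Python's standard `html.parser` module to collect formulas by cell address. The `FormulaCollector` class and `extract_formulas` function are hypothetical helpers, not part of the product, and the sample string is a shortened version of the example above.

```python
from html.parser import HTMLParser


class FormulaCollector(HTMLParser):
    """Collect {cell address: formula} from data-cell / data-formula attributes."""

    def __init__(self):
        super().__init__()
        self.formulas = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        attr_map = dict(attrs)
        # Only cells that carry a formula have a data-formula attribute.
        if "data-formula" in attr_map:
            self.formulas[attr_map.get("data-cell")] = attr_map["data-formula"]


def extract_formulas(enriched_html: str) -> dict:
    collector = FormulaCollector()
    collector.feed(enriched_html)
    collector.close()
    return collector.formulas


sample = (
    '<table data-sheet="Sheet1" data-range="A1:D10" data-block-type="table">'
    '<tr><td data-cell="A2">Widget A</td>'
    '<td data-cell="B2" data-formula="=SUM(B3:B5)" data-fmt="bold">150</td>'
    '<td data-cell="C2" data-formula="=B2*1.1">165</td></tr></table>'
)
print(extract_formulas(sample))  # {'B2': '=SUM(B3:B5)', 'C2': '=B2*1.1'}
```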
Preserved Cell Attributes
| Attribute | When Present | Example |
|---|---|---|
| data-cell | Always | A1 (cell address) |
| data-formula | Cell has a formula | =SUM(B3:B5) |
| data-font-color | Non-default font color | FF0000 |
| data-fill-color | Non-default background color | FFFF00 |
| data-fmt | Cell has formatting | bold, italic, underline |
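The "When Present" column amounts to conditional attribute emission. Here is a minimal sketch of how a cell might be rendered under that rule. The `Cell` record and `render_td` function are hypothetical, and joining multiple `data-fmt` flags with commas is an assumption, since this page shows only a single flag in use.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List, Optional


@dataclass
class Cell:
    # Hypothetical cell record; field names are illustrative only.
    address: str
    value: str
    formula: Optional[str] = None
    font_color: Optional[str] = None
    fill_color: Optional[str] = None
    fmt: List[str] = field(default_factory=list)


def render_td(cell: Cell) -> str:
    """Render one <td>, emitting each optional attribute only when it is set."""
    attrs = [("data-cell", cell.address)]  # always present
    if cell.formula:
        attrs.append(("data-formula", cell.formula))
    if cell.fmt:
        # Assumption: multiple formatting flags are joined with commas.
        attrs.append(("data-fmt", ",".join(cell.fmt)))
    if cell.font_color:
        attrs.append(("data-font-color", cell.font_color))
    if cell.fill_color:
        attrs.append(("data-fill-color", cell.fill_color))
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs
    )
    return f"<td {rendered}>{escape(cell.value)}</td>"
```

A plain cell such as `Cell("A2", "Widget A")` renders with `data-cell` alone, so the HTML stays compact for sheets that are mostly unformatted values.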
Auto-Generated Table Titles
Every TABLE chunk gets a human-readable title generated by the LLM enrichment step:
- The LLM receives the enriched HTML plus surrounding context
- It generates a concise title (max 10 words) describing what the table contains
- The title is stored as a <caption> in the enriched HTML and as metadata on the chunk
This makes every table self-describing for search: the title acts as a semantic anchor that improves both embedding quality and search relevance.
Chart Support
Charts extracted from Excel files are represented as IMAGE chunks with structured text describing the chart:
```
Chart: "Revenue by Region" (bar, 4 series)
Series 1: "North" -> data: Sheet1!B2:B10, categories: Sheet1!A2:A10
Series 2: "South" -> data: Sheet1!C2:C10, categories: Sheet1!A2:A10
X-Axis: "Region" | Y-Axis: "Revenue ($)"
Anchor: Sheet1!F1:L15
```
Chart metadata includes the chart type (bar, line, pie, scatter, etc.), title, data series with range references, and axis information.
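To illustrate how such a description could be produced from chart metadata, here is a minimal sketch. The `Chart` and `Series` dataclasses and the `describe_chart` function are hypothetical; only the output layout follows the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Series:
    # Hypothetical series record with range references.
    name: str
    data_range: str
    category_range: str


@dataclass
class Chart:
    # Hypothetical chart record; field names are illustrative only.
    title: str
    chart_type: str
    series: List[Series]
    x_axis: str
    y_axis: str
    anchor: str


def describe_chart(chart: Chart) -> str:
    """Render chart metadata as the structured text stored on an IMAGE chunk."""
    lines = [
        f'Chart: "{chart.title}" ({chart.chart_type}, {len(chart.series)} series)'
    ]
    for index, s in enumerate(chart.series, start=1):
        lines.append(
            f'Series {index}: "{s.name}" -> data: {s.data_range}, '
            f"categories: {s.category_range}"
        )
    lines.append(f'X-Axis: "{chart.x_axis}" | Y-Axis: "{chart.y_axis}"')
    lines.append(f"Anchor: {chart.anchor}")
    return "\n".join(lines)
```

Because the description is plain text, it can be embedded and searched like any other chunk even though the chunk type is IMAGE.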
Image and Shape Handling
Embedded images and shapes from Excel files are also extracted:
- Images become IMAGE chunks with alt text. The original image is extracted from the file and uploaded to storage as WEBP. LLM enrichment later generates a full description.
- Text boxes become TEXT chunks with the text box content.
Section Structure
Each Excel sheet becomes a section in the document hierarchy, analogous to a heading in a PDF. This means:
- You can browse a workbook sheet-by-sheet using the standard document navigation
- Each sheet’s content is organized under its own section
- The agent can read individual sheets or search across all sheets
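Conceptually, the sheet-to-section structure is a grouping of chunks by their source sheet. A minimal sketch, assuming each chunk is a dictionary with a hypothetical `sheet` key (the real chunk schema is not shown on this page):

```python
from typing import Dict, List


def group_by_sheet(chunks: List[dict]) -> Dict[str, List[dict]]:
    """Group chunks into one section per sheet, keeping first-seen sheet order."""
    sections: Dict[str, List[dict]] = {}
    for chunk in chunks:
        # "sheet" is a hypothetical field naming the chunk's source sheet.
        sections.setdefault(chunk["sheet"], []).append(chunk)
    return sections


chunks = [
    {"sheet": "Summary", "type": "TEXT"},
    {"sheet": "Revenue", "type": "TABLE"},
    {"sheet": "Summary", "type": "IMAGE"},
]
sections = group_by_sheet(chunks)
print(list(sections))  # ['Summary', 'Revenue']
```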
Design Highlights
| Decision | Rationale |
|---|---|
| One section per sheet | Sheets are the natural structural unit in spreadsheets, analogous to headings in documents |
| Atomic table chunks | Tables are never split, which preserves structural integrity for AI reasoning |
| Formula and formatting preservation | Cell formulas, colors, and formatting are stored as HTML data attributes for full fidelity |
| Plain text for embedding, HTML for display | The plain-text rendering is ideal for vector search; the enriched HTML preserves the full spreadsheet context |
| Separate conversion queue | Excel parsing is CPU-intensive and is isolated to prevent blocking other document processing |