How It Works

When you upload an Excel file, it goes through a specialized conversion and chunking path before joining the shared enrichment and embedding pipeline:
Upload .xlsx/.xlsm
  |
  v
Step 1: Preparation
  -> Download the file from storage
  |
  v
Step 2: Excel Conversion
  -> Parse workbook structure (sheets, cells, formulas, dependencies)
  -> Extract embedded images and convert to WEBP
  -> Store structured parse result as JSON
  |
  v
Step 3: Excel Chunking
  -> Create one section per sheet
  -> Create chunks for each content block (tables, text, images, charts)
  -> Build enriched HTML with formula and formatting data
  |
  v
Steps 4-7: Shared Pipeline
  -> Vector cleanup -> Enrichment -> Embedding -> Completion

Content Block to Chunk Mapping

Each block detected in the spreadsheet maps to a specific chunk type:
| Block Type | Chunk Type | Content |
| --- | --- | --- |
| Table, calculation block, assumptions, results, label block, data block | TABLE | Plain-text rendering plus enriched HTML with formulas |
| Chart | IMAGE | Structured chart description (type, series, axes) |
| Embedded image | IMAGE | Alt text or placeholder; original image stored in S3 |
| Header, text block, mixed content, sparse cells | TEXT | Plain-text cell content |
| Empty blocks | Skipped | Not chunked |
Important: Each table is kept as a single chunk and is never split across chunks. This preserves structural integrity for AI reasoning over tabular data.
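
The mapping above can be sketched as a simple lookup. This is a minimal illustration, not the pipeline's actual code: the block-type identifiers (`calculation_block`, `sparse_cells`, and so on), the `ChunkType` enum, and `chunk_type_for` are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Optional


class ChunkType(Enum):
    TABLE = "TABLE"
    IMAGE = "IMAGE"
    TEXT = "TEXT"


# Hypothetical block-type identifiers; the real parser's names may differ.
BLOCK_TO_CHUNK = {
    "table": ChunkType.TABLE,
    "calculation_block": ChunkType.TABLE,
    "assumptions": ChunkType.TABLE,
    "results": ChunkType.TABLE,
    "label_block": ChunkType.TABLE,
    "data_block": ChunkType.TABLE,
    "chart": ChunkType.IMAGE,
    "embedded_image": ChunkType.IMAGE,
    "header": ChunkType.TEXT,
    "text_block": ChunkType.TEXT,
    "mixed_content": ChunkType.TEXT,
    "sparse_cells": ChunkType.TEXT,
}


def chunk_type_for(block_type: str) -> Optional[ChunkType]:
    """Return the chunk type for a block, or None when the block is skipped."""
    if block_type == "empty":
        return None  # empty blocks are not chunked
    return BLOCK_TO_CHUNK[block_type]


print(chunk_type_for("chart"))
print(chunk_type_for("empty"))
```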

Enriched HTML for Tables

Every TABLE chunk carries an enriched HTML representation that preserves the formulas and cell formatting that the plain-text version loses. This enriched HTML is stored as metadata on the chunk.
<table data-sheet="Sheet1" data-range="A1:D10" data-block-type="table">
  <caption>LLM-parsed header: "Q4 Revenue by Region"</caption>
  <thead>
    <tr>
      <th data-cell="A1">Product</th>
      <th data-cell="B1">Q1</th>
      <th data-cell="C1">Q2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td data-cell="A2">Widget A</td>
      <td data-cell="B2" data-formula="=SUM(B3:B5)" data-fmt="bold"
          data-font-color="FF0000">150</td>
      <td data-cell="C2" data-formula="=B2*1.1"
          data-fill-color="FFFF00">165</td>
    </tr>
  </tbody>
</table>
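
Because formulas travel as `data-*` attributes, a consumer can recover them with an ordinary HTML parser. The sketch below uses Python's standard `html.parser` module. `FormulaExtractor` and `extract_formulas` are hypothetical helper names, and the sample is a shortened copy of the markup above.

```python
from html.parser import HTMLParser


class FormulaExtractor(HTMLParser):
    """Collect {cell address: formula} from data-cell / data-formula attributes."""

    def __init__(self):
        super().__init__()
        self.formulas = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        attr_map = dict(attrs)
        cell = attr_map.get("data-cell")
        formula = attr_map.get("data-formula")
        if cell and formula:
            self.formulas[cell] = formula


def extract_formulas(enriched_html: str) -> dict:
    parser = FormulaExtractor()
    parser.feed(enriched_html)
    parser.close()
    return parser.formulas


sample = """<table data-sheet="Sheet1" data-range="A1:D10" data-block-type="table">
  <tbody><tr>
    <td data-cell="A2">Widget A</td>
    <td data-cell="B2" data-formula="=SUM(B3:B5)" data-fmt="bold">150</td>
    <td data-cell="C2" data-formula="=B2*1.1">165</td>
  </tr></tbody>
</table>"""

print(extract_formulas(sample))
```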

Preserved Cell Attributes

| Attribute | When Present | Example |
| --- | --- | --- |
| data-cell | Always | A1 (cell address) |
| data-formula | Cell has a formula | =SUM(B3:B5) |
| data-font-color | Non-default font color | FF0000 |
| data-fill-color | Non-default background color | FFFF00 |
| data-fmt | Cell has formatting | bold, italic, underline |
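
The "when present" rule for these attributes can be illustrated with a small renderer that always emits `data-cell` and adds the others only when a value is set. This is a sketch under assumptions: `render_cell` and its parameters are hypothetical, the attribute order is arbitrary, and `fmt` is passed through as a single string because the source does not specify how multiple formats are encoded.

```python
from html import escape
from typing import Optional


def render_cell(
    address: str,
    value: str,
    formula: Optional[str] = None,
    font_color: Optional[str] = None,
    fill_color: Optional[str] = None,
    fmt: Optional[str] = None,
) -> str:
    """Render one <td>, emitting optional attributes only when they are set."""
    attrs = [("data-cell", address)]  # always present
    optional = [
        ("data-formula", formula),
        ("data-fmt", fmt),
        ("data-font-color", font_color),
        ("data-fill-color", fill_color),
    ]
    attrs += [(name, val) for name, val in optional if val]
    attr_text = " ".join(
        f'{name}="{escape(val, quote=True)}"' for name, val in attrs
    )
    return f"<td {attr_text}>{escape(value)}</td>"


print(render_cell("A2", "Widget A"))
print(render_cell("B2", "150", formula="=SUM(B3:B5)", fmt="bold", font_color="FF0000"))
```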

Auto-Generated Table Titles

Every TABLE chunk gets a human-readable title generated by the LLM enrichment step:
  • The LLM receives the enriched HTML plus surrounding context
  • It generates a concise title (max 10 words) describing what the table contains
  • The title is stored as a <caption> in the enriched HTML and as metadata on the chunk
This makes every table self-describing for search: the title acts as a semantic anchor that improves both embedding quality and search relevance.

Chart Support

Charts extracted from Excel files are represented as IMAGE chunks with structured text describing the chart:
Chart: "Revenue by Region" (bar, 4 series)
  Series 1: "North" -> data: Sheet1!B2:B10, categories: Sheet1!A2:A10
  Series 2: "South" -> data: Sheet1!C2:C10, categories: Sheet1!A2:A10
  X-Axis: "Region" | Y-Axis: "Revenue ($)"
  Anchor: Sheet1!F1:L15
Chart metadata includes the chart type (bar, line, pie, scatter, etc.), title, data series with range references, and axis information.
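
As a rough illustration of how such a description could be produced from chart metadata, the sketch below formats a small data structure into the text layout shown above. The `Series` and `Chart` dataclasses and `describe_chart` are hypothetical; the actual metadata schema is not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Series:
    name: str
    data: str        # range reference for the values
    categories: str  # range reference for the category labels


@dataclass
class Chart:
    title: str
    chart_type: str
    series: List[Series]
    x_axis: str
    y_axis: str
    anchor: str  # cell range the chart occupies on the sheet


def describe_chart(chart: Chart) -> str:
    """Format chart metadata as the structured text stored in an IMAGE chunk."""
    lines = [f'Chart: "{chart.title}" ({chart.chart_type}, {len(chart.series)} series)']
    for i, s in enumerate(chart.series, start=1):
        lines.append(
            f'  Series {i}: "{s.name}" -> data: {s.data}, categories: {s.categories}'
        )
    lines.append(f'  X-Axis: "{chart.x_axis}" | Y-Axis: "{chart.y_axis}"')
    lines.append(f"  Anchor: {chart.anchor}")
    return "\n".join(lines)


chart = Chart(
    title="Revenue by Region",
    chart_type="bar",
    series=[
        Series("North", "Sheet1!B2:B10", "Sheet1!A2:A10"),
        Series("South", "Sheet1!C2:C10", "Sheet1!A2:A10"),
    ],
    x_axis="Region",
    y_axis="Revenue ($)",
    anchor="Sheet1!F1:L15",
)
print(describe_chart(chart))
```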

Image and Shape Handling

Embedded images and shapes from Excel files are also extracted:
  • Images become IMAGE chunks with alt text. The original image is extracted from the file and uploaded to storage as WEBP. LLM enrichment later generates a full description.
  • Text boxes become TEXT chunks with the text box content.

Section Structure

Each Excel sheet becomes a section in the document hierarchy, analogous to a heading in a PDF. This means:
  • You can browse a workbook sheet-by-sheet using the standard document navigation
  • Each sheet’s content is organized under its own section
  • The agent can read individual sheets or search across all sheets
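
The one-section-per-sheet layout can be pictured as a grouping step over parsed blocks. The following is a minimal sketch, assuming each block is a dict with hypothetical `sheet` and `block_type` keys; `build_sections` is an illustrative name, not part of the product's API.

```python
def build_sections(blocks):
    """Group parsed blocks into one section per sheet, in first-seen sheet order."""
    by_sheet = {}
    for block in blocks:
        # dicts preserve insertion order, so sections follow workbook order
        by_sheet.setdefault(block["sheet"], []).append(block)
    return [{"title": sheet, "blocks": items} for sheet, items in by_sheet.items()]


blocks = [
    {"sheet": "Summary", "block_type": "header"},
    {"sheet": "Summary", "block_type": "table"},
    {"sheet": "Data", "block_type": "table"},
    {"sheet": "Data", "block_type": "chart"},
]

sections = build_sections(blocks)
for section in sections:
    print(section["title"], [b["block_type"] for b in section["blocks"]])
```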

Design Highlights

| Decision | Rationale |
| --- | --- |
| One section per sheet | Sheets are the natural structural unit in spreadsheets, analogous to headings in documents |
| Atomic table chunks | Tables are never split; this preserves structural integrity for AI reasoning |
| Formula and formatting preservation | Cell formulas, colors, and formatting are stored as HTML data attributes for full fidelity |
| Plain text for embedding, HTML for display | The plain-text rendering is ideal for vector search; the enriched HTML preserves the full spreadsheet context |
| Separate conversion queue | Excel parsing is CPU-intensive and is isolated to prevent blocking other document processing |