Excel Pipeline - Knowledge Stack

How It Works

When you upload an Excel file, it goes through a specialized conversion and chunking path before joining the shared enrichment and embedding pipeline:

Upload .xlsx/.xlsm
  |
  v
Step 1: Preparation
  -> Download the file from storage
  |
  v
Step 2: Excel Conversion
  -> Parse workbook structure (sheets, cells, formulas, dependencies)
  -> Extract embedded images and convert to WEBP
  -> Store structured parse result as JSON
  |
  v
Step 3: Excel Chunking
  -> Create one section per sheet
  -> Create chunks for each content block (tables, text, images, charts)
  -> Build enriched HTML with formula and formatting data
  |
  v
Steps 4-7: Shared Pipeline
  -> Vector cleanup -> Enrichment -> Embedding -> Completion

Content Block to Chunk Mapping

Each block detected in the spreadsheet maps to a specific chunk type:

Block Type	Chunk Type	Content
Table, calculation block, assumptions, results, label block, data block	TABLE	Plain-text rendering plus enriched HTML with formulas
Chart	IMAGE	Structured chart description (type, series, axes)
Embedded image	IMAGE	Alt text or placeholder; the extracted image is converted to WEBP and stored in S3
Header, text block, mixed content, sparse cells	TEXT	Plain-text cell content
Empty blocks	Skipped	Not chunked

Important: Each table is kept as a single chunk and is never split. Keeping rows, columns, and formulas together preserves the structure that AI reasoning over tabular data depends on.
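
The mapping above can be sketched as a simple lookup. This is a minimal illustration, not the pipeline's actual code: the block-type identifiers, the `ChunkType` enum, and `chunk_type_for` are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Optional


class ChunkType(Enum):
    TABLE = "TABLE"
    IMAGE = "IMAGE"
    TEXT = "TEXT"


# Hypothetical block-type identifiers; the real parser's names may differ.
BLOCK_TO_CHUNK = {
    "table": ChunkType.TABLE,
    "calculation_block": ChunkType.TABLE,
    "assumptions": ChunkType.TABLE,
    "results": ChunkType.TABLE,
    "label_block": ChunkType.TABLE,
    "data_block": ChunkType.TABLE,
    "chart": ChunkType.IMAGE,
    "embedded_image": ChunkType.IMAGE,
    "header": ChunkType.TEXT,
    "text_block": ChunkType.TEXT,
    "mixed_content": ChunkType.TEXT,
    "sparse_cells": ChunkType.TEXT,
}


def chunk_type_for(block_type: str, is_empty: bool = False) -> Optional[ChunkType]:
    """Return the chunk type for a block, or None when the block is skipped."""
    if is_empty:
        return None  # empty blocks are not chunked
    return BLOCK_TO_CHUNK[block_type]
```

Returning `None` for empty blocks keeps the "skipped" case explicit, so a caller can filter those blocks out before any chunk is created.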

Enriched HTML for Tables

Every TABLE chunk carries an enriched HTML representation that preserves formulas and cell formatting lost in the plain-text version. This enriched HTML is stored as metadata on the chunk.

<table data-sheet="Sheet1" data-range="A1:D10" data-block-type="table">
  <caption>LLM-parsed header: "Q4 Revenue by Region"</caption>
  <thead>
    <tr>
      <th data-cell="A1">Product</th>
      <th data-cell="B1">Q1</th>
      <th data-cell="C1">Q2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td data-cell="A2">Widget A</td>
      <td data-cell="B2" data-formula="=SUM(B3:B5)" data-fmt="bold"
          data-font-color="FF0000">150</td>
      <td data-cell="C2" data-formula="=B2*1.1"
          data-fill-color="FFFF00">165</td>
    </tr>
  </tbody>
</table>

Preserved Cell Attributes

Attribute	When Present	Example
`data-cell`	Always	`A1` (cell address)
`data-formula`	Cell has a formula	`=SUM(B3:B5)`
`data-font-color`	Non-default font color	`FF0000`
`data-fill-color`	Non-default background color	`FFFF00`
`data-fmt`	Cell has formatting	`bold`, `italic`, `underline`
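
Because the attributes are plain HTML `data-*` attributes, a consumer can read them back with any HTML parser. The sketch below uses Python's standard-library `html.parser` to recover the formula for each cell. The class and function names are illustrative and are not part of the pipeline's API.

```python
from html.parser import HTMLParser


class CellAttributeParser(HTMLParser):
    """Collect data-* attributes and text for every <th>/<td> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []       # one dict per cell, in document order
        self._current = None  # cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._current = {k: v for k, v in attrs if k.startswith("data-")}
            self._current["text"] = ""

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.cells.append(self._current)
            self._current = None


def formulas_by_cell(enriched_html: str) -> dict:
    """Map cell address -> formula for cells that carry data-formula."""
    parser = CellAttributeParser()
    parser.feed(enriched_html)
    parser.close()
    return {c["data-cell"]: c["data-formula"]
            for c in parser.cells if "data-formula" in c}


sample = (
    '<table data-sheet="Sheet1" data-range="A1:C2" data-block-type="table">'
    '<tr><td data-cell="A2">Widget A</td>'
    '<td data-cell="B2" data-formula="=SUM(B3:B5)" data-fmt="bold">150</td>'
    '<td data-cell="C2" data-formula="=B2*1.1">165</td></tr></table>'
)
print(formulas_by_cell(sample))  # {'B2': '=SUM(B3:B5)', 'C2': '=B2*1.1'}
```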

Auto-Generated Table Titles

Every TABLE chunk gets a human-readable title generated by the LLM enrichment step:

The LLM receives the enriched HTML plus surrounding context
It generates a concise title (max 10 words) describing what the table contains
The title is stored as a `<caption>` in the enriched HTML and as metadata on the chunk

This makes every table self-describing for search: the title acts as a semantic anchor that improves both embedding quality and search relevance.
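
The post-processing around the LLM call can be sketched as follows. The LLM request itself is omitted, and both helpers (`clamp_title`, `add_caption`) are hypothetical. In particular, truncating to ten words is an assumption made for illustration; the pipeline may instead enforce the limit through the prompt.

```python
import html
import re


def clamp_title(raw_title: str, max_words: int = 10) -> str:
    """Collapse whitespace and keep at most max_words words."""
    return " ".join(raw_title.split()[:max_words])


def add_caption(enriched_html: str, title: str) -> str:
    """Insert <caption> right after the first opening <table ...> tag."""
    caption = f"<caption>{html.escape(title)}</caption>"
    # Simplification: assumes no attribute value contains a literal '>'.
    return re.sub(
        r"(<table\b[^>]*>)",
        lambda m: m.group(1) + caption,
        enriched_html,
        count=1,
    )
```

Escaping the title before insertion keeps the enriched HTML well-formed even when the generated title contains characters such as `&` or `<`.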

Chart Support

Charts extracted from Excel files are represented as IMAGE chunks with structured text describing the chart:

Chart: "Revenue by Region" (bar, 2 series)
  Series 1: "North" -> data: Sheet1!B2:B10, categories: Sheet1!A2:A10
  Series 2: "South" -> data: Sheet1!C2:C10, categories: Sheet1!A2:A10
  X-Axis: "Region" | Y-Axis: "Revenue ($)"
  Anchor: Sheet1!F1:L15

Chart metadata includes the chart type (bar, line, pie, scatter, etc.), title, data series with range references, and axis information.
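
A description in this shape could be rendered from parsed chart metadata along the following lines. The `Chart` and `Series` dataclasses and the `describe_chart` function are assumptions made for this sketch and do not reflect the pipeline's internal types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Series:
    name: str
    data_range: str      # e.g. "Sheet1!B2:B10"
    category_range: str  # e.g. "Sheet1!A2:A10"


@dataclass
class Chart:
    title: str
    chart_type: str  # bar, line, pie, scatter, ...
    x_axis: str
    y_axis: str
    anchor: str      # cell range the chart occupies on the sheet
    series: List[Series] = field(default_factory=list)


def describe_chart(chart: Chart) -> str:
    """Render chart metadata as the structured text stored on the IMAGE chunk."""
    lines = [
        f'Chart: "{chart.title}" ({chart.chart_type}, {len(chart.series)} series)'
    ]
    for i, s in enumerate(chart.series, start=1):
        lines.append(
            f'  Series {i}: "{s.name}" -> data: {s.data_range}, '
            f"categories: {s.category_range}"
        )
    lines.append(f'  X-Axis: "{chart.x_axis}" | Y-Axis: "{chart.y_axis}"')
    lines.append(f"  Anchor: {chart.anchor}")
    return "\n".join(lines)
```

Deriving the series count from the list itself means the header line cannot drift out of sync with the series that are actually listed.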

Image and Shape Handling

Embedded images and shapes from Excel files are also extracted:

Images become IMAGE chunks with alt text. The original image is extracted from the file and uploaded to storage as WEBP. LLM enrichment later generates a full description.
Text boxes become TEXT chunks with the text box content.

Section Structure

Each Excel sheet becomes a section in the document hierarchy, analogous to a heading in a PDF. This means:

You can browse a workbook sheet-by-sheet using the standard document navigation
Each sheet’s content is organized under its own section
The agent can read individual sheets or search across all sheets
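
As a rough illustration of the one-section-per-sheet rule, detected blocks can be grouped by the sheet they came from. The dictionary keys (`sheet`, `kind`, `title`, `blocks`) and the `build_sections` function are hypothetical and only serve to show the grouping.

```python
def build_sections(blocks):
    """Group blocks into one section per sheet, in first-seen sheet order."""
    by_sheet = {}  # dicts preserve insertion order in Python 3.7+
    for block in blocks:
        by_sheet.setdefault(block["sheet"], []).append(block)
    return [{"title": sheet, "blocks": items} for sheet, items in by_sheet.items()]


blocks = [
    {"sheet": "Inputs", "kind": "table"},
    {"sheet": "Summary", "kind": "chart"},
    {"sheet": "Inputs", "kind": "text_block"},
]
sections = build_sections(blocks)
print([s["title"] for s in sections])  # ['Inputs', 'Summary']
```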

Design Highlights

Decision	Rationale
One section per sheet	Sheets are the natural structural unit in spreadsheets, analogous to headings in documents
Atomic table chunks	Tables are never split, which preserves structural integrity for AI reasoning
Formula and formatting preservation	Cell formulas, colors, and formatting are stored as HTML data attributes for full fidelity
Plain text for embedding, HTML for display	The plain-text rendering is ideal for vector search; the enriched HTML preserves the full spreadsheet context
Separate conversion queue	Excel parsing is CPU-intensive and is isolated to prevent blocking other document processing
