How It Works
When you upload an Excel file, it goes through a specialized conversion and chunking path before joining the shared enrichment and embedding pipeline.

Content Block to Chunk Mapping
Each block detected in the spreadsheet maps to a specific chunk type:

| Block Type | Chunk Type | Content |
|---|---|---|
| Table, calculation block, assumptions, results, label block, data block | TABLE | Plain-text rendering plus enriched HTML with formulas |
| Chart | IMAGE | Structured chart description (type, series, axes) |
| Embedded image | IMAGE | Alt text or placeholder; original image stored in S3 |
| Header, text block, mixed content, sparse cells | TEXT | Plain-text cell content |
| Empty blocks | Skipped | Not chunked |
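
The mapping above can be read as a simple lookup. The following is a minimal sketch, not the actual implementation: the block-type identifiers (`"calculation_block"`, `"sparse_cells"`, and so on), the `ChunkType` enum, and the `chunk_type_for` helper are all hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Optional


class ChunkType(Enum):
    TABLE = "TABLE"
    IMAGE = "IMAGE"
    TEXT = "TEXT"


# Hypothetical block-type identifiers, mirroring the rows of the table above.
_BLOCK_TO_CHUNK = {
    "table": ChunkType.TABLE,
    "calculation_block": ChunkType.TABLE,
    "assumptions": ChunkType.TABLE,
    "results": ChunkType.TABLE,
    "label_block": ChunkType.TABLE,
    "data_block": ChunkType.TABLE,
    "chart": ChunkType.IMAGE,
    "embedded_image": ChunkType.IMAGE,
    "header": ChunkType.TEXT,
    "text_block": ChunkType.TEXT,
    "mixed_content": ChunkType.TEXT,
    "sparse_cells": ChunkType.TEXT,
}


def chunk_type_for(block_type: str) -> Optional[ChunkType]:
    """Return the chunk type for a block, or None when the block is skipped."""
    if block_type == "empty":
        return None  # empty blocks are not chunked
    return _BLOCK_TO_CHUNK[block_type]
```

For example, `chunk_type_for("chart")` returns `ChunkType.IMAGE`, while `chunk_type_for("empty")` returns `None`, so the caller can drop the block.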
Enriched HTML for Tables
Every TABLE chunk carries an enriched HTML representation that preserves formulas and cell formatting lost in the plain-text version. This enriched HTML is stored as metadata on the chunk.

Preserved Cell Attributes

| Attribute | When Present | Example |
|---|---|---|
| data-cell | Always | A1 (cell address) |
| data-formula | Cell has a formula | =SUM(B3:B5) |
| data-font-color | Non-default font color | FF0000 |
| data-fill-color | Non-default background color | FFFF00 |
| data-fmt | Cell has formatting | bold, italic, underline |
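
To make the attribute table concrete, here is a minimal sketch of how one cell might be rendered with these attributes. The attribute names come from the table above; everything else is an assumption for illustration: the `Cell` dataclass, the `render_cell` helper, the use of a `<td>` element, and joining `data-fmt` values with a comma and space.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional, Tuple


@dataclass
class Cell:
    """Hypothetical in-memory cell; not the pipeline's real data model."""
    address: str
    value: str
    formula: Optional[str] = None
    font_color: Optional[str] = None
    fill_color: Optional[str] = None
    fmt: Tuple[str, ...] = ()


def render_cell(cell: Cell) -> str:
    """Render a cell as a <td>, emitting optional attributes only when set."""
    attrs = [("data-cell", cell.address)]  # always present
    if cell.formula:
        attrs.append(("data-formula", cell.formula))
    if cell.font_color:
        attrs.append(("data-font-color", cell.font_color))
    if cell.fill_color:
        attrs.append(("data-fill-color", cell.fill_color))
    if cell.fmt:
        attrs.append(("data-fmt", ", ".join(cell.fmt)))
    # Escape attribute values and cell text so formulas cannot break the markup.
    attr_text = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs
    )
    return f"<td {attr_text}>{escape(cell.value)}</td>"
```

Under these assumptions, `render_cell(Cell("B6", "1200", formula="=SUM(B3:B5)", fmt=("bold",)))` yields `<td data-cell="B6" data-formula="=SUM(B3:B5)" data-fmt="bold">1200</td>`.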
Auto-Generated Table Titles
Every TABLE chunk gets a human-readable title generated by the LLM enrichment step:

- The LLM receives the enriched HTML plus surrounding context
- It generates a concise title (max 10 words) describing what the table contains
- The title is stored as a <caption> in the enriched HTML and as metadata on the chunk
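
The last step, storing the title as a `<caption>`, could look roughly like the sketch below. It assumes the enriched HTML starts with a `<table ...>` opening tag whose attributes contain no `>` character. The `add_caption` helper is hypothetical, and truncating to ten words is only a defensive guard added here, since the source says the LLM already produces titles within that limit.

```python
from html import escape


def add_caption(table_html: str, title: str, max_words: int = 10) -> str:
    """Insert the title as a <caption> right after the opening <table> tag."""
    if not table_html.lstrip().startswith("<table"):
        raise ValueError("expected HTML beginning with a <table> element")
    # Guard only: keep at most max_words words of the generated title.
    short_title = " ".join(title.split()[:max_words])
    # Split at the end of the opening tag; the caption becomes the first child.
    opening, _, rest = table_html.partition(">")
    return f"{opening}><caption>{escape(short_title)}</caption>{rest}"
```

Placing the caption as the first child of the table follows the HTML rule that `<caption>` must come first inside `<table>`.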
Chart Support
Charts extracted from Excel files are represented as IMAGE chunks with structured text describing the chart (its type, series, and axes).

Image and Shape Handling
Embedded images and shapes from Excel files are also extracted:

- Images become IMAGE chunks with alt text. The original image is extracted from the file and uploaded to storage as WEBP. LLM enrichment later generates a full description.
- Text boxes become TEXT chunks with the text box content.
Section Structure
Each Excel sheet becomes a section in the document hierarchy, analogous to a heading in a PDF. This means:

- You can browse a workbook sheet-by-sheet using the standard document navigation
- Each sheet’s content is organized under its own section
- The agent can read individual sheets or search across all sheets
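
As an illustration of the one-section-per-sheet structure, the sketch below builds sections from a mapping of sheet names to plain-text chunks and then looks across all of them. The `Section` class and both helpers are hypothetical, and the case-insensitive substring match is only a stand-in for whatever search the platform actually performs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Section:
    """One section of the document hierarchy, corresponding to one sheet."""
    title: str  # the sheet name
    chunks: List[str] = field(default_factory=list)  # plain-text chunk contents


def build_sections(sheets: Dict[str, List[str]]) -> List[Section]:
    """Create one section per sheet, keeping the order the sheets were given in."""
    return [Section(title=name, chunks=list(chunks)) for name, chunks in sheets.items()]


def search_all_sheets(sections: List[Section], term: str) -> List[str]:
    """Return the titles of every section with a chunk containing the term."""
    needle = term.lower()
    return [
        section.title
        for section in sections
        if any(needle in chunk.lower() for chunk in section.chunks)
    ]
```

Reading a single sheet then amounts to picking one `Section` by title, while `search_all_sheets` walks every section in the workbook.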
Design Highlights
| Decision | Rationale |
|---|---|
| One section per sheet | Sheets are the natural structural unit in spreadsheets, analogous to headings in documents |
| Atomic table chunks | Tables are never split — this preserves structural integrity for AI reasoning |
| Formula and formatting preservation | Cell formulas, colors, and formatting are stored as HTML data attributes for full fidelity |
| Plain text for embedding, HTML for display | The plain-text rendering is ideal for vector search; the enriched HTML preserves the full spreadsheet context |
| Separate conversion queue | Excel parsing is CPU-intensive and is isolated to prevent blocking other document processing |
