
To convert PDF to Excel with AI you used to upload the invoice above to a website and hope the result was clean. It does not have to be that way anymore. Open the PDF, tell Lapu AI to write it back as Excel, and you get the workbook below — column headers in lower_snake_case, dates Excel reads as real dates, a live =SUM formula where the printed total used to be. The conversion runs locally on your machine. The PDF never leaves your disk. Every write to your filesystem waits for your approval.

That second image is a real .xlsx, produced from the first image by the agent on your own machine. No upload, no third-party converter, no API key. Download Lapu AI for macOS or Windows and try it on your next vendor PDF — or see the Microsoft Excel automation page for the full set of workflows the agent runs on spreadsheets.
The rest of this guide explains what a good PDF-to-Excel conversion actually looks like, where the hard parts hide, and how to run the whole thing without sending the document to a third party.
What "PDF to Excel AI" actually means
The phrase covers two distinct jobs that get conflated. The first is table extraction: a PDF contains one or more grids of data, and you want each grid as a sheet of cells with the right types. The second is structured rewrite: a PDF contains data in a non-tabular layout — labelled fields on an invoice header, line items spread across pages with running totals — and you want a single normalized table at the end.
Hosted PDF-to-Excel converters (Adobe, Smallpdf, iLovePDF, dozens of others) do the first job well for text-native PDFs and the second job by guessing. Microsoft's own Power Query PDF connector added native "Get Data → From PDF" to Excel itself, with the Pdf.Tables M function exposing the detected tables (Microsoft, 2025). Open-source options like Tabula (built by journalists with Knight Foundation support and still actively used in newsrooms) and Python's Camelot library cover the table-extraction case for text PDFs with no cloud round trip (Camelot, 2025).
What an AI agent adds on top is judgment: which page region is the table, what to do when a column header wraps across two lines, when a row that looks like data is actually a sub-total, and what column types to coerce values into. The "AI" part is the layer that decides; the extraction primitives underneath have been solved for years.
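To make the "what column types to coerce values into" decision concrete, here is a minimal heuristic baseline in Python. It is not how Lapu AI or any model decides; the function names, the two date formats, and the 80% agreement threshold are all assumptions chosen for illustration.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed date formats; a real source may need more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def classify_value(text):
    """Return 'date', 'number', or 'text' for a single cell string."""
    s = text.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(s, fmt)
            return "date"
        except ValueError:
            pass
    cleaned = s.replace("$", "").replace(",", "")
    try:
        # Reject NaN/Infinity, which Decimal would otherwise accept.
        return "number" if Decimal(cleaned).is_finite() else "text"
    except InvalidOperation:
        return "text"

def propose_column_type(values, threshold=0.8):
    """Propose a type when enough non-empty cells agree; otherwise fall back to text."""
    kinds = [classify_value(v) for v in values if v.strip()]
    if not kinds:
        return "text"
    for kind in ("date", "number"):
        if kinds.count(kind) / len(kinds) >= threshold:
            return kind
    return "text"
```

A rule like this covers the easy columns. The ambiguous ones, where half the cells parse and half do not, are where a judgment layer and a human review step earn their place.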
Why the upload-to-cloud default is the wrong default
For a public report you downloaded from a government site, uploading to a cloud converter is fine — the document is already public. For an invoice with vendor banking details, a bank statement with account numbers, a benefits summary with employee identifiers, or a vendor reconciliation file that names the customers you serve, the upload is the privacy decision, not the conversion. The conversion is mechanical.
The defensible default on a regulated machine is local-first: read the PDF where it lives, extract the tables in process, write back to disk, never send the file body to a vendor's server. That is the same architecture we wrote up in our earlier guide on AI PDF extraction without third-party services. PDF-to-Excel is a special case of that pattern where the output is a workbook instead of a JSON blob.
The five steps of a good PDF-to-Excel conversion
Strip the marketing claims off any tool in this category and you find the same five-step pipeline.
- Classify the page. Is the page mostly a table, mostly prose, or a mixed layout (header block + table + footer)? Naive converters skip this and try to table-extract everything; the result is hundreds of one-cell "tables" lifted from running prose.
- Detect the table region. Within a page, where does the data grid start and end? Camelot's lattice parser uses ruling lines and its stream parser uses whitespace; you choose the flavor per call, with lattice as the default. AI helps when neither signal is clean: a borderless table on a busy page, or an invoice where the line-item grid is implied by alignment, not lines.
- Recover the column structure. A wrapped header ("Invoice / Number") becomes one column, not two. Currency symbols stay in the value or move into a separate column consistently. A column that contains "$1,240.00" in some rows and "1,240" in others needs to be coerced to a number, not stored as mixed strings.
- Stitch across pages. A 12-page statement is one logical table, not twelve. Headers repeat on each page; running totals appear at the bottom of each page; the final total appears once. A good pipeline merges, deduplicates the repeated headers, and keeps the final total as a check value rather than a row.
- Type-coerce and write. Dates become real Excel dates, amounts become numbers, references become text. The output is an .xlsx that opens cleanly — formulas where you want them, formatting where the eye expects it, no string-typed "12,300.00" amounts that block downstream SUM.
The hard step is the third one. Everything before it is mechanical; everything after it is bookkeeping. Column recovery is where every cheap converter falls down and where AI judgment actually pays off.
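The mechanical half of column recovery can be sketched in a few lines of Python. This assumes something upstream has already decided that the first two extracted rows are both header rows, which is the judgment call; the helper names here are hypothetical.

```python
import re

def to_snake_case(label):
    """Lowercase the label and collapse every run of non-alphanumerics into one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", label.lower()).strip("_")

def merge_header_rows(header_rows):
    """Join header fragments column by column, top to bottom, then normalize the names."""
    width = max(len(row) for row in header_rows)
    merged = []
    for col in range(width):
        parts = [
            row[col].strip()
            for row in header_rows
            if col < len(row) and row[col].strip()
        ]
        merged.append(to_snake_case(" ".join(parts)))
    return merged

# Example: a header that wrapped onto two lines in the PDF.
columns = merge_header_rows([
    ["Invoice", "Invoice", "Amount"],
    ["Number", "Date", "(USD)"],
])
```

The merge itself is trivial once you know which rows belong to the header. Telling a second header line apart from the first data row is the part heuristics get wrong.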
Edge cases that break naive converters
Cheap PDF-to-Excel tools work great on one-page text-PDFs with ruled lines. Real-world PDFs break them in six predictable ways.
- Scanned pages. Tabula explicitly notes it works only on text-based PDFs (Tabula, 2024). A scanned invoice needs an OCR pass first; a good agent runs OCR locally before extraction rather than refusing the file.
- Multi-page tables with repeated headers. Without page-stitching logic, you get the header row N times in the output. Power Query's MultiPageTables option toggles this combine behavior (Microsoft, 2024).
- Wrapped or merged column headers. A two-line header "Invoice / Number" is one logical column, not two. AI judgment beats heuristics here.
- Printed totals that look like data. The footer row "Total — $3,953.44" is not a transaction. A naive extractor includes it as row 8 and corrupts every SUMIFS downstream.
- Mixed types inside one column. Amounts as "$1,240.00", "1,240", "(1,240.00)" for negative. The output column should be numbers; the agent has to know that.
- Number formats that depend on locale. "1.234,56" in European exports, "1,234.56" in US. Detect once, coerce consistently, write the result in the format the user actually wants.
A workflow that handles those six cleanly is the difference between an agent that demos well and one you keep in your toolbox. For the broader data-cleanup mechanics that come after the extraction step, the best AI agent for data cleanup post walks through the dedup, normalization, and review patterns.
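The mixed-type and locale cases above can be sketched as "detect once, coerce consistently". This is one possible approach in Python, not a description of any tool's internals; the function names and the "one or two trailing digits means decimal separator" rule are assumptions.

```python
import re
from decimal import Decimal

def detect_decimal_separator(samples):
    """Vote on '.' vs ',' using the separator that precedes 1-2 trailing digits."""
    votes = {".": 0, ",": 0}
    for s in samples:
        m = re.search(r"([.,])\d{1,2}\)?$", s.strip())
        if m:
            votes[m.group(1)] += 1
    return "," if votes[","] > votes["."] else "."

def parse_amount(text, decimal_sep="."):
    """Return a Decimal, or None when the text cannot be read as an amount."""
    s = text.strip()
    # Accounting style "(1,240.00)" and a leading minus both mean negative.
    negative = (s.startswith("(") and s.endswith(")")) or s.startswith("-")
    group_sep = "," if decimal_sep == "." else "."
    digits = re.sub(r"[^\d.,]", "", s)  # drop currency symbols, parentheses, spaces
    digits = digits.replace(group_sep, "").replace(decimal_sep, ".")
    if not re.fullmatch(r"\d+(\.\d+)?", digits):
        return None  # the caller should flag this row for review
    value = Decimal(digits)
    return -value if negative else value

us_sep = detect_decimal_separator(["$1,240.00", "1,240", "(1,240.00)"])
eu_sep = detect_decimal_separator(["1.234,56", "12,00", "1.000"])
```

Detection runs over the whole column rather than per cell because a value like "1,240" is ambiguous on its own. Returning None instead of guessing keeps unparseable rows visible for review.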
How Lapu AI handles PDF-to-Excel on the desktop
Lapu AI runs the full pipeline natively. The agent opens the PDF where it lives on your disk — Downloads, a project folder, an email attachment you saved — and the workbook is written back to your disk. The file never leaves your machine for storage. The Excel-specific tooling is documented on the Microsoft Excel automation integration page; for the PDF-to-Excel job specifically the flow is:
- doc:pdf reads the PDF locally, runs OCR if needed, and extracts the table regions per page.
- AI judgment classifies the page, recovers the column structure, identifies printed totals and other non-data rows, and proposes the canonical column types.
- You review the plan. The agent shows the proposed schema (column names, types, row count, the SUM formula it intends to insert), and you approve, edit, or reject before any write hits disk.
- doc:excel writes the .xlsx in Office Open XML directly (the same format Excel has used since 2007), so the file opens cleanly with no "recovered file" warning.
- Audit trail. Every step (which PDF, which pages, what was dropped, what was inserted) is recorded locally for replay and review.
An example prompt that exercises the whole chain:
Open ~/Downloads/vendor-statement-may.pdf, extract every line-item table, drop the printed footer total, combine pages into one sheet, and save as ~/finance/vendor-statement-may.xlsx. Use lower_snake_case column names and put a SUM formula on the amount column. Flag any rows where the amount could not be parsed as a number.
The agent reads the PDF, profiles each page, shows you the column schema and the proposed cleanup, applies the transformation after you confirm, and writes the workbook. The whole job runs in process; the file body is never uploaded.
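The review-then-write pattern can be modelled as a plan object plus a gate. The sketch below is purely illustrative Python: ConversionPlan, ColumnSpec, summarize, and execute are hypothetical names, not Lapu AI's actual API, and the row count and cell reference in the example are made up.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnSpec:
    name: str
    dtype: str  # "date", "number", or "text"

@dataclass
class ConversionPlan:
    source_pdf: str
    target_xlsx: str
    columns: list
    row_count: int
    dropped_rows: list = field(default_factory=list)  # e.g. printed totals
    formulas: dict = field(default_factory=dict)      # cell reference -> formula text
    approved: bool = False

def summarize(plan):
    """Render the plan as one line a reviewer can read before approving."""
    cols = ", ".join(f"{c.name}:{c.dtype}" for c in plan.columns)
    return (f"{plan.source_pdf} -> {plan.target_xlsx} | {plan.row_count} rows | "
            f"columns: {cols} | dropped: {len(plan.dropped_rows)} | "
            f"formulas: {plan.formulas}")

def execute(plan, writer):
    """Call the writer only after explicit approval; otherwise refuse."""
    if not plan.approved:
        raise PermissionError("plan has not been approved; nothing was written")
    return writer(plan)

plan = ConversionPlan(
    source_pdf="~/Downloads/vendor-statement-may.pdf",
    target_xlsx="~/finance/vendor-statement-may.xlsx",
    columns=[ColumnSpec("invoice_date", "date"), ColumnSpec("amount", "number")],
    row_count=7,
    dropped_rows=["printed footer total"],
    formulas={"B9": "=SUM(B2:B8)"},
)
```

Putting the approval flag on the plan rather than in the writer means every write path has to pass through the same check.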
When a desktop agent is the right tool — and when it is not
The decision framework is simple.
- Use a desktop AI agent when the PDF is sensitive (invoices, statements, benefits, HR exports), when the conversion has any AI-judgment step (column recovery, total detection, ambiguous types), and when you want the output dropped straight onto your filesystem without round-trips.
- Use Power Query "Get Data → From PDF" when you already live in Excel, the PDF is clean text with explicit table lines, and you want the query to refresh on every new monthly file. The connector is mature and free with Excel.
- Use Tabula or Camelot directly when you want to script the conversion as part of a Python pipeline, the PDFs are reliably text-native, and you do not need AI judgment for column recovery. Both are free and run locally.
- Use a hosted converter for one-off public documents where the file is already public and you value zero-install convenience over data residency.
PDF-to-Excel is one of the tasks that turned into a "you need a SaaS for that" decision somewhere in the 2010s. It does not have to be. With the right desktop tooling, the conversion is a local job you supervise rather than a file you ship to a vendor and hope comes back clean.
FAQ
Can Lapu AI convert a PDF to Excel without uploading the file?
Yes. The PDF is read where it lives on your disk and the workbook is written to your disk. The file body is not uploaded for storage. Only minimal context — column names, a small sample of rows, ambiguous values where the model has to choose a type — is sent to the AI model provider for reasoning, and you can see in the agent's plan exactly what context was used. The result is a real .xlsx file that opens in Excel with no warnings.
What if the PDF is scanned, not text?
The agent runs an OCR pass first using the local OCR engine, then proceeds with extraction. Tabula and most cheap converters refuse scanned PDFs outright; a desktop agent can chain OCR and extraction in one workflow. Accuracy on scans depends on the source quality — a clean office scan reads near-perfectly, a phone photo of a crumpled receipt is harder. The agent flags low-confidence cells so you can review them rather than silently writing wrong values.
Does it handle multi-page tables with repeated headers?
Yes. The agent detects when the header row repeats on every page and stitches the pages into one logical table. This is the same problem Microsoft addresses with the MultiPageTables option in Power Query's Pdf.Tables function — it is well understood, but worth verifying on the first run of any new PDF format you handle regularly.
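A minimal sketch of the stitching step in Python, assuming each page has already been extracted as a list of rows and that the first row of the first non-empty page is the header. The function names are hypothetical, and this is not the agent's actual logic.

```python
def _normalize(row):
    """Compare rows case-insensitively and ignore stray whitespace."""
    return [cell.strip().lower() for cell in row]

def stitch_pages(pages):
    """Merge per-page tables into one (header, body), dropping repeated header rows."""
    tables = [table for table in pages if table]  # skip pages with no rows
    if not tables:
        return [], []
    header = tables[0][0]
    key = _normalize(header)
    body = []
    for table in tables:
        for row in table:
            if _normalize(row) == key:
                continue  # the header row, repeated at the top of a page
            body.append(row)
    return header, body

header, body = stitch_pages([
    [["Date", "Amount"], ["2025-05-01", "10.00"]],
    [],
    [["date ", "Amount"], ["2025-05-02", "5.00"]],
])
```

Exact-match comparison after normalization is deliberately conservative: a header that changes wording between pages will pass through as a data row, which is easier to spot in review than a silently dropped transaction.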
What happens to printed totals at the bottom of a statement?
The agent removes them from the row set and writes a real Excel SUM formula in their place. The intent is that the printed total becomes a check value: if the formula's result does not match what was printed, the agent flags the mismatch in the workbook (e.g. by colouring the cell or adding a comment), rather than silently overwriting the discrepancy.
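The check-value idea can be sketched in Python as follows. The function name, the column letter, and the three line-item amounts are made up for illustration; only the $3,953.44 total comes from the example earlier in this guide.

```python
from decimal import Decimal

def check_total(amounts, printed_total, first_row=2, column="C"):
    """Build the SUM formula for the amount column and compare it with the printed total."""
    if not amounts:
        raise ValueError("no amounts to total")
    last_row = first_row + len(amounts) - 1
    formula = f"=SUM({column}{first_row}:{column}{last_row})"
    computed = sum(amounts, Decimal("0"))
    return {
        "formula": formula,
        "computed": computed,
        "printed": printed_total,
        "matches": computed == printed_total,
    }

# Hypothetical line items that add up to the printed footer total.
amounts = [Decimal("1240.00"), Decimal("2500.00"), Decimal("213.44")]
ok = check_total(amounts, Decimal("3953.44"))
bad = check_total(amounts, Decimal("3953.40"))
```

Decimal is used instead of float so that a one-cent discrepancy shows up as a real mismatch rather than a rounding artifact.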
Can I review the conversion before any file is written?
Yes — that is the default, not a setting. The agent generates a proposed schema (column names, types, row count, any rows it plans to drop, any formulas it plans to insert) and waits for your approval. You can edit the schema (rename a column, change a type, keep or drop the totals row) before the write happens. Nothing touches your disk until you confirm.
How is this different from Microsoft 365 Copilot doing the same job in Excel?
Microsoft 365 Copilot operates on files in your Microsoft 365 tenant (OneDrive, SharePoint) using Microsoft's cloud models. That is a fine fit if you are already standardized on M365 and your data residency requirements are met by Microsoft's enterprise commitments. A desktop AI agent like Lapu AI runs the conversion locally on whatever PDF sits on your filesystem — OneDrive or not — and writes back to whichever folder you specify. The two complement each other: M365 Copilot for the tenant-bound workflow, Lapu AI for everything else on your machine.
Will the resulting workbook keep formulas and formatting I add later?
Yes. The agent writes a standard .xlsx file using the same Office Open XML format Excel ships. You can open it, add formulas, apply formatting, link it to other workbooks, anything Excel supports. The next time you re-run the conversion on a fresh PDF, you can choose to overwrite the data sheet only (preserving your formatting and downstream formulas) or write a new file alongside the old one — your call, not the agent's.
Sources
- Power Query PDF connector — Microsoft (2025-08-01) · accessed 2026-06-05
- Pdf.Tables — Power Query M reference — Microsoft (2024-12-12) · accessed 2026-06-05
- Tabula — Extract Tables from PDFs — Tabula (2024-01-01) · accessed 2026-06-05
- Camelot: PDF Table Extraction for Humans — Camelot Project (2025-06-01) · accessed 2026-06-05




