
Local AI PDF Extraction: Run It on Your Desktop

Lapu AI Team · 10 min read

If your PDFs contain anything sensitive (invoices with bank details, contracts under NDA, medical records, HR files), every cloud "AI PDF extractor" you upload them to becomes part of your data perimeter. This guide explains how to run AI PDF extraction locally on your desktop, what each architecture actually keeps on your machine, and where a desktop AI agent like Lapu AI fits.

What local AI PDF extraction actually means

"Local" is both a marketing word and a technical word. In the technical sense, local AI PDF extraction means three things happen on your computer:

  • The PDF file is read from disk by software running on your CPU or GPU.
  • Layout analysis and OCR (if the file is a scan) run on your machine.
  • The language model that turns the raw text into structured data runs on your machine — or, in a defensible hybrid, only the cleaned text leaves, never the file or the page images.

That third clause is where most "local" tools quietly cheat. A parser that runs on your laptop but then ships the whole PDF base64-encoded to a vendor API is not local — it is a local upload script. The honest test is the network tab: with the tool running, are PDF bytes leaving your machine? If yes, it is hybrid at best and cloud at worst.

This matters for the same reason permission-based execution matters for any other desktop AI workflow: a mistake affects real files, with real consequences.

Why keep PDFs off vendor servers

The OWASP Gen AI Security Project's Top 10 for LLM Applications ranks Sensitive Information Disclosure as LLM02:2025 — second on the list, up from LLM06 in the 2023/24 edition. The 2025 description names the exact categories most business PDFs contain: personally identifiable information, financial details, health records, confidential business data, security credentials, and legal documents.

Three concrete reasons to keep these off vendor servers:

  1. Retention. Anthropic's PDF support documentation explicitly notes that the feature is eligible for Zero Data Retention "only when your organization has a ZDR arrangement." Without one, files are subject to standard retention. Most consumer accounts do not have ZDR.
  2. Surface area. Every cloud upload adds a copy in a place you do not control: the vendor's logs, possibly their evaluation pipelines, sometimes their human-review queues. The 2025 OWASP entry calls this out as a deployment risk, not a theoretical one.
  3. Compliance. GDPR, HIPAA, and most enterprise data-classification policies treat document upload as a transfer. A local pipeline does not transfer; a cloud one does.

None of this is an argument that cloud PDF extraction is wrong. It is an argument that the default of "drop file in chat" is the wrong default for anything regulated.

The three architectures for local PDF AI

There are three working architectures today. Each trades off accuracy, cost, and how much actually stays on your machine.

| Architecture | What runs locally | What leaves the machine | When to use |
|---|---|---|---|
| Fully local | Parser + LLM (Ollama, LM Studio) | Nothing | Sensitive documents, air-gapped work, regulated industries |
| Local-parse, cloud-model | Parser (MinerU, OpenDataLoader, Docling) | Cleaned text only | Most business cases: invoices, reports, research |
| Cloud upload (baseline) | Nothing | Full PDF + image of every page | Public documents, demos, one-off tasks |

The fully-local stack is more accessible than it was even a year ago. MinerU ships a 109-language OCR engine, runs on macOS, Linux, and Windows, and works on CPU — GPU acceleration is optional. OpenDataLoader PDF advertises 60-plus pages per second on a single CPU and Apache 2.0 licensing, and its docs note that there are no API calls and no data transmission. Pair either with a 7B-13B local model in Ollama, and you have a complete pipeline that never touches a vendor.

Local-parse, cloud-model is the pragmatic middle. The PDF and its page images never leave, but a few kilobytes of cleaned, redactable text do. If you redact the obvious PII before the call, you have meaningfully shrunk the disclosure surface compared to uploading the file.
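A minimal sketch of that redaction pass, assuming Python and only two illustrative patterns (email addresses and IBAN-shaped strings). The `PATTERNS` table and the `redact` helper are hypothetical names, not part of any parser mentioned here, and real PII detection needs far broader coverage than two regular expressions.

```python
import re

# Illustrative patterns only (an assumption for this sketch): real redaction
# needs names, addresses, phone numbers, national IDs, and more.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Country code, two check digits, then groups of four with optional spaces.
    "IBAN": re.compile(
        r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
    ),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(redact("Pay to DE89 3704 0044 0532 0130 00, contact billing@example.com"))
```

Running the pass on the parser's cleaned text, before any network call, means the cloud model only ever sees the placeholder labels for the fields you chose to mask.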

The cloud-upload default is fine for public material: a press release, a public 10-K, a paper from arXiv. It is the wrong default for anything sensitive or regulated.

A desktop-agent recipe for local PDF extraction

A desktop AI agent fits this workflow because it is the only layer that can both see the files and run multi-step logic. Here is a prompt you can hand to Lapu AI on macOS or Windows to extract invoice data without the files leaving the machine:

Read every PDF in ~/Documents/invoices/2026/.
For each one, extract the invoice number, vendor name, issue date,
due date, currency, and total amount.
Write the results to ~/Documents/invoices/2026/invoices.xlsx
with one row per invoice. Move processed PDFs into a sibling
"extracted" folder. Use the local model I configured.
Stop and ask before deleting anything.

The agent does the loop on your machine:

  1. Lists PDFs in the folder.
  2. Reads each one with a local parser (or, for short text-only PDFs, the OS preview/text layer).
  3. Sends the cleaned text to the local model with a structured prompt that asks for JSON.
  4. Appends a row to the spreadsheet.
  5. Moves the original after the row is written, and logs every step.
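The five steps above can be sketched in Python as follows. This is an illustration of the loop, not Lapu AI's implementation: `parse_pdf` and `ask_model` are hypothetical callables standing in for whichever local parser and model you configured, and the sketch writes CSV rather than .xlsx so that it needs only the standard library.

```python
import csv
import json
import shutil
from pathlib import Path
from typing import Callable

FIELDS = ["invoice_number", "vendor_name", "issue_date",
          "due_date", "currency", "total_amount"]


def run_batch(folder: Path,
              parse_pdf: Callable[[Path], str],
              ask_model: Callable[[str], str]) -> list[str]:
    """Process every PDF in `folder` and return a log of actions taken."""
    log: list[str] = []
    done = folder / "extracted"
    done.mkdir(exist_ok=True)
    out_csv = folder / "invoices.csv"
    write_header = not out_csv.exists()
    with out_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for pdf in sorted(folder.glob("*.pdf")):         # 1. list PDFs
            text = parse_pdf(pdf)                        # 2. local parse
            try:
                record = json.loads(ask_model(text))     # 3. model returns JSON
            except json.JSONDecodeError:
                record = None
            if not isinstance(record, dict):
                # Leave the file in place so a human can look at it.
                log.append(f"skipped {pdf.name}: model output was not a JSON object")
                continue
            writer.writerow({k: record.get(k, "") for k in FIELDS})  # 4. append row
            fh.flush()
            shutil.move(str(pdf), str(done / pdf.name))  # 5. move only after the row is written
            log.append(f"extracted {pdf.name}")
    return log
```

Passing the parser and model in as functions keeps the loop identical whether `ask_model` calls a local runtime or a hosted API; only the callable changes. Nothing is deleted, and a file whose output cannot be parsed stays where it was.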

If you would rather use a frontier model in the cloud — Claude or GPT — you change one line: the local parser still runs, but only the cleaned text goes out. This is the local-parse, cloud-model row above. You still get a full audit trail of every file the agent opened and every action it took.

For broader file work, the same desktop-agent pattern covers automated file organization and other multi-step jobs across your local apps.

Where the hybrid cloud model is still defensible

There are real reasons to use a frontier model in the cloud even on a local-first stack:

  • Hard layouts. A 40-page scanned contract with handwritten margin notes is still easier for Claude or GPT-4-class models than for a 7B local model on your laptop. The honest call is: run a small redaction pass locally, then hybridize.
  • Volume bursts. A one-time job of 5,000 PDFs is faster on a hosted API than on a quantized 13B model running on a single GPU. Once the burst is over, you can keep the recurring weekly job local.
  • Multi-format. A workflow that mixes PDF, DOCX, scanned images, and audio benefits from one model that handles all of them. The trade-off is the data leaving — make that trade with eyes open.

The desktop AI agent matters here too because it routes the request. The same pipeline can call a local model for normal invoices and a cloud model for the one weird 80-page contract, with the choice logged on a per-file basis.
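One way to picture that per-file routing is a small rule function whose decision is recorded next to the file name. The sketch below is an assumption about how such a rule could look, not a description of Lapu AI's router: `DocInfo`, the `sensitive` flag, and the 20-page threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DocInfo:
    name: str
    pages: int
    scanned: bool = False
    sensitive: bool = False  # hypothetical flag set by the user or a classifier


def choose_backend(doc: DocInfo, max_local_pages: int = 20) -> str:
    """Return "local" or "cloud" for one document (threshold is illustrative)."""
    if doc.sensitive:
        return "local"   # sensitive files never leave, whatever the layout
    if doc.scanned and doc.pages > max_local_pages:
        return "cloud"   # long scanned documents go to the larger model
    return "local"


def route(docs: list[DocInfo]) -> list[dict]:
    """Return one log entry per file recording which backend was chosen."""
    return [{"file": d.name, "backend": choose_backend(d)} for d in docs]


if __name__ == "__main__":
    for entry in route([DocInfo("invoice-001.pdf", 2),
                        DocInfo("contract.pdf", 80, scanned=True)]):
        print(entry)
```

Checking the sensitivity flag first means a hard layout alone can never push a protected file to the cloud; the rule has to be relaxed deliberately.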

Limits and honest trade-offs

Local PDF AI is not a finished product. A few things to expect:

  • Accuracy ceilings. Local 7B models hallucinate fields more often than frontier models. The fix is a constrained-output schema and a second-pass validator (regex on totals, dates, IBANs). Plan for a verification step; do not ship blind extracts into accounting.
  • Hardware cost. Running a 13B model with reasonable latency wants 16-32 GB of RAM on a Mac with Apple Silicon, or a discrete GPU with 12 GB+ of VRAM on Windows. A baseline machine still works, just slower.
  • Setup friction. A fully-local stack means installing a parser, a model runtime, and wiring them up. A desktop agent removes most of the wiring but not all of it — the first hour is real.
  • Not a privacy guarantee. Local extraction prevents upload, not exfiltration. If your machine is compromised, local does not help. Pair this with the usual disk encryption, OS update, and least-privilege hygiene a desktop AI workflow already needs.
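The second-pass validator mentioned under accuracy ceilings can be quite small. The sketch below assumes the model returns a flat record with ISO-formatted dates and an optional `iban` key; those field names and the `check_record` helper are hypothetical. The IBAN test is the standard mod-97 checksum: move the first four characters to the end, map letters to 10 through 35, and require a remainder of 1.

```python
import re
from datetime import date


def valid_iban(iban: str) -> bool:
    """Mod-97 checksum: rotate first 4 chars to the end, letters become 10..35."""
    s = iban.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def check_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    total = str(record.get("total_amount", ""))
    if not re.fullmatch(r"\d+(\.\d{1,2})?", total):
        problems.append("total_amount is not a plain decimal")
    dates = {}
    for key in ("issue_date", "due_date"):
        try:
            dates[key] = date.fromisoformat(str(record.get(key, "")))
        except ValueError:
            problems.append(f"{key} is not an ISO date")
    if len(dates) == 2 and dates["due_date"] < dates["issue_date"]:
        problems.append("due_date precedes issue_date")
    iban = record.get("iban")
    if iban and not valid_iban(iban):
        problems.append("iban fails checksum")
    return problems
```

Records that come back with a non-empty problem list are the ones to send to a human or to a second model pass. A checksum catches a single hallucinated or transposed digit, which a plain format regex would let through.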

If you want the on-the-desktop story without building the stack yourself, download Lapu AI and point it at a PDF folder. The agent runs natively on macOS and Windows, asks for permission before touching files, and lets you choose between a local model and a frontier API per task. If you want the full picture of how it fits with other local AI work, the local-first AI overview covers the broader trade-offs.

FAQ

What is local AI PDF extraction on my desktop?
Local AI PDF extraction means the parsing, OCR, and structured-data extraction happen on your own macOS or Windows computer rather than on a vendor's servers. The PDF file is never uploaded; an open-source parser like MinerU, OpenDataLoader PDF, or Docling reads it from disk, optionally with a local LLM (Ollama, LM Studio) handling the language work. A desktop AI agent like Lapu AI can orchestrate the steps (open the file, extract, write the result to a sheet) with permission and an audit log.
Can Claude or ChatGPT extract data from PDFs without uploading the file?
No, not by themselves. Claude's PDF support and ChatGPT both require sending the document to the vendor. Anthropic's PDF documentation describes three intake paths — URL, base64-encoded inline, or the Files API — and notes that this feature is eligible for Zero Data Retention only when your organization has a ZDR arrangement. Without ZDR, your file is processed and may be retained according to the standard retention policy. To keep the file on your machine you need a local parser plus a local model, or a desktop agent that runs the extraction on disk before any model call.
Is local PDF parsing accurate enough for invoices, contracts, and tables?
Modern open-source parsers handle the common cases well. MinerU extracts text, tables (as HTML), and formulas (as LaTeX) and supports 109-language OCR. OpenDataLoader PDF extracts Markdown, JSON with bounding boxes, and HTML at 60-plus pages per second on a CPU. Complex multi-column layouts, scanned documents with degraded quality, and unusual fonts still need careful prompt-and-check passes — accuracy is not a finished problem at any tier.
Why do desktop AI agents matter for PDF workflows specifically?
A desktop AI agent can do the whole loop: find the PDFs in a folder, run a local parser, send only the cleaned text to a model, write results into the user's spreadsheet, and move the originals to an 'extracted' folder. Cloud agents can do the model call but cannot reach the files, and one-shot parsers cannot reason about what to do with the output. The agent on your desktop is the only layer that can both see the files and apply judgment.
What about hybrid setups — local parsing, cloud model?
Hybrid is the common compromise. The PDF is parsed on the desktop, the resulting text plus a structured prompt is sent to a frontier model in the cloud, and only the extracted JSON comes back. Your raw PDF and the image of the page never leave your machine. This is meaningfully safer than uploading the file, though it still sends the readable content to a third party — review your provider's retention and training-data policies before you ship this on regulated documents.
Does OWASP say anything about PDFs and LLMs?
OWASP raised sensitive-information disclosure from rank LLM06 in the 2023/24 Top 10 for LLM Applications to LLM02 in the 2025 edition. PDFs are a common channel: invoices, medical records, legal contracts, and HR documents are exactly the categories listed in the 2025 description. The recommended controls — minimize what leaves your environment, sanitize, and log — line up with a local-first PDF workflow.
What does Lapu AI do for local PDF extraction?
Lapu AI is a desktop AI agent for macOS and Windows. For PDFs you point it at a folder, give it a prompt like 'pull invoice numbers, dates, and totals from every PDF here and write them to invoices.xlsx,' and it executes the steps on your machine — opens each PDF, extracts text with a local parser or the OS preview, calls the model you configured, writes the sheet — and logs every action. You decide whether the model call goes to a frontier API or a locally-hosted model.

Sources

  1. PDF support, Claude API. Anthropic (2024-11-04) · accessed 2026-06-01
  2. LLM02:2025 Sensitive Information Disclosure. OWASP Gen AI Security Project (2024-11-17) · accessed 2026-06-01
  3. LLM06: Sensitive Information Disclosure (2023/24). OWASP Gen AI Security Project (2023-10-16) · accessed 2026-06-01
  4. MinerU: Document parsing engine for PDFs, images, DOCX, PPTX, XLSX. OpenDataLab (2024-11-15) · accessed 2026-06-01
  5. OpenDataLoader PDF: local PDF parser for AI-ready data. OpenDataLoader Project (2024-12-10) · accessed 2026-06-01