LiteParse: Open-Source PDF Parser by LlamaIndex

The run-llama team, developers of the LlamaIndex framework, has published LiteParse, a standalone open-source document parser built for fully local, offline use without cloud dependencies. The repository had accumulated 3,500 GitHub stars and 221 forks as of early April 2026, suggesting strong early interest among developers building document-processing workflows.

LiteParse runs entirely on-device using PDF.js for spatial text extraction and Tesseract.js for OCR — no internet connection or API keys required
Text output includes precise bounding box coordinates per text block, enabling layout-aware downstream processing
Supports PDF, DOCX, XLSX, PPTX, and image formats, with CLI-based batch processing for entire directories
External OCR engines including EasyOCR and PaddleOCR can replace the built-in Tesseract.js via a standardized HTTP API

What Happened

The run-llama organization, maintainers of the LlamaIndex framework, released LiteParse as a dedicated open-source document parsing tool, hosted at github.com/run-llama/liteparse under the Apache 2.0 license. The project documentation describes it as “a standalone OSS PDF parsing tool focused exclusively on fast and light parsing.” Author details for the initial release were not available at time of publication.

As of early April 2026, the repository contained 294 commits and had 3,500 stars and 221 forks. A dataset_eval_utils module is present in the repository structure, suggesting the team built evaluation tooling to assess parsing quality, though the specific benchmarks used were not detailed in the publicly available documentation.

Why It Matters

PDF parsing is a foundational step in retrieval-augmented generation (RAG) pipelines: poor text extraction degrades chunk quality and retrieval accuracy downstream. Most widely used parsers either call cloud APIs (including the run-llama team’s own LlamaParse service) or require complex server-side dependencies that complicate local deployment.

LiteParse positions itself as a local alternative with no cloud or external-service dependencies. By using PDF.js, Mozilla’s open-source PDF rendering engine, as its text extraction layer, the tool avoids proprietary rendering dependencies and gains broad platform compatibility. For developers operating under data-residency constraints, building air-gapped systems, or seeking to eliminate per-page parsing costs, a fully on-device parser addresses a gap that cloud-first tools leave open.

Technical Details

LiteParse extracts text with bounding box coordinates — per-block x, y, width, and height values — allowing applications to reconstruct document layout rather than treating output as a flat text stream. Output is available in JSON format, which preserves spatial metadata, or as plain text. Users can target specific page ranges rather than parsing entire documents, reducing processing time for large files.
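To illustrate why per-block coordinates matter, here is a minimal TypeScript sketch of how an application might rebuild reading order from such output. The TextBlock shape, its field names, and the assumption that y grows downward from the top of the page are hypothetical and chosen for illustration; LiteParse's actual JSON schema may differ and should be checked against the project documentation.

```typescript
// Hypothetical shape of one extracted block. Field names are assumptions,
// not LiteParse's documented schema.
interface TextBlock {
  text: string;
  x: number; // left edge
  y: number; // top edge, assumed to grow downward from the page top
  width: number;
  height: number;
}

// A visual line: the vertical center of its first block plus its members.
interface Line {
  center: number;
  blocks: TextBlock[];
}

// Group blocks into lines by vertical center, then order each line left to right.
// `tolerance` is the maximum center distance (in page units) for two blocks
// to count as being on the same line.
function toReadingOrder(blocks: TextBlock[], tolerance = 3): string[] {
  const sorted = [...blocks].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: Line[] = [];
  for (const block of sorted) {
    const center = block.y + block.height / 2;
    const current = lines[lines.length - 1];
    if (current !== undefined && Math.abs(current.center - center) <= tolerance) {
      current.blocks.push(block);
    } else {
      lines.push({ center, blocks: [block] });
    }
  }
  return lines.map((line) =>
    [...line.blocks]
      .sort((a, b) => a.x - b.x)
      .map((b) => b.text)
      .join(" "),
  );
}

// Blocks arrive out of order; the coordinates restore the layout.
const sample: TextBlock[] = [
  { text: "world", x: 60, y: 10.5, width: 40, height: 12 },
  { text: "Second line", x: 10, y: 30, width: 80, height: 12 },
  { text: "Hello", x: 10, y: 10, width: 40, height: 12 },
];
console.log(toReadingOrder(sample)); // [ 'Hello world', 'Second line' ]
```

A flat text stream would have no way to recover that "Hello" precedes "world". The same grouping idea extends to column detection or table reconstruction, at the cost of more careful thresholds.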

The built-in OCR pipeline uses Tesseract.js and requires no external installation. The project documentation states that OCR support “works out of the box” with zero setup. Users can disable OCR entirely for text-native PDFs to reduce overhead, select OCR languages, configure the number of parallel worker threads for throughput, and specify custom DPI values for page screenshot generation.
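The DPI setting involves a cost trade-off that a short calculation makes concrete. PDF page dimensions are expressed in points, with 72 points per inch, so the rendered image size scales linearly with DPI and the pixel count scales with its square. The sketch below is generic arithmetic, not LiteParse code; the function name renderSize is hypothetical.

```typescript
// PDF user space defaults to 72 points per inch.
const POINTS_PER_INCH = 72;

// Pixel dimensions of a page rendered at a given DPI.
function renderSize(
  widthPt: number,
  heightPt: number,
  dpi: number,
): { width: number; height: number } {
  const scale = dpi / POINTS_PER_INCH;
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
console.log(renderSize(612, 792, 150)); // { width: 1275, height: 1650 }
console.log(renderSize(612, 792, 300)); // { width: 2550, height: 3300 }
```

Doubling the DPI from 150 to 300 quadruples the number of pixels the OCR engine has to process per page, so a higher value is generally worth choosing only when small or low-contrast text needs the extra resolution.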

Beyond PDF, LiteParse parses DOCX, XLSX, PPTX, and image files. An OCR_API_SPEC.md file in the repository defines a standardized HTTP interface allowing external OCR servers — such as EasyOCR and PaddleOCR — to substitute for Tesseract.js. The CLI also supports batch processing of entire directories in a single command.

Who’s Affected

JavaScript and TypeScript developers building RAG pipelines or document ingestion systems are the primary audience. LiteParse ships as both a standalone CLI and an npm library, enabling programmatic integration into Node.js applications. Teams subject to data-residency requirements — in healthcare, financial services, or government — gain a parsing option that keeps document contents within their own infrastructure rather than transiting a third-party API.

The tool is cross-platform, supporting Linux, macOS (Intel and ARM64), and Windows. macOS and Linux users can install via Homebrew; others can use npm global install or compile from source.

What’s Next

The repository listed six open issues and five open pull requests as of early April 2026, reflecting an active but still-maturing project. The presence of a formal OCR_API_SPEC.md indicates the team intends to support a broader set of pluggable OCR backends beyond Tesseract.js. The project accepts contributions under its CONTRIBUTING.md guidelines and is maintained under the run-llama GitHub organization alongside LlamaIndex and LlamaParse.
