LiteParse: A Fast, Local Document Parser for Developers

Published: 3 days ago (June 7, 2026 at 12:03 AM EDT)

4 min read

Source: Dev.to

LiteParse is a fast, local document parser for extracting text from clean, well-structured files. It handles PDFs, DOCX, HTML, and more, with minimal setup and no API calls. Everything runs locally, so your documents never leave your environment. The project is honest about its scope, which is refreshing. It’s a strong fit when: Your documents are relatively straightforward, without complex tables, mixed layouts, or scanned pages. You want parsing to run locally rather than sending data to an external service. You’re prototyping or building a lightweight pipeline and don’t need enterprise-grade accuracy. For the genuinely hard stuff (dense tables, multi-column layouts, charts, handwriting, scanned PDFs), the maintainers point you toward LlamaParse, their cloud product. LiteParse deliberately stays in the “fast and light” lane rather than trying to be everything. Under the hood, LiteParse leans on PDF.js for spatial text parsing and gives you a few things that matter for AI pipelines: Text extraction with precise bounding boxes, so you know where each piece of text sits on the page. A flexible OCR system: Tesseract.js works out of the box with zero setup, and you can plug in HTTP OCR servers like EasyOCR or PaddleOCR for higher accuracy. Screenshot generation, which produces page images that LLM agents can use to capture visual information text alone misses. Output in either JSON or plain text. A standalone binary that runs across Linux, macOS (Intel and ARM), and Windows. LiteParse ships as both a CLI and a library. Here’s the fast path for each. The recommended approach is a global npm install, which gives you the lit command everywhere: npm i -g @llamaindex/liteparse

On macOS and Linux you can also use Homebrew: brew tap run-llama/liteparse brew install llamaindex-liteparse

Basic parsing (OCR is on by default via Tesseract)

lit parse document.pdf

Output JSON to a file

lit parse document.pdf —format json -o output.md

Parse only specific pages

lit parse document.pdf —target-pages “1-5,10,15-20”

Skip OCR entirely

lit parse document.pdf —no-ocr

For pipelines, batch mode reuses the PDF engine across files for efficiency: lit batch-parse ./input-directory ./output-directory

All pages

lit screenshot document.pdf -o ./screenshots

Specific pages at higher resolution

lit screenshot document.pdf —target-pages “1,3,5” —dpi 300 -o ./screenshots

If you’d rather call it from code, install it as a dependency: npm install @llamaindex/liteparse

or

pnpm add @llamaindex/liteparse

Then parsing is a few lines: import { LiteParse } from ‘@llamaindex/liteparse’;

const parser = new LiteParse({ ocrEnabled: true }); const result = await parser.parse(‘document.pdf’); console.log(result.text);

One thing that sets LiteParse apart from PDF-only tools is automatic format conversion. Point it at an Office document or an image and it will convert to PDF first, provided you have the right helper installed. For Office documents (Word, PowerPoint, spreadsheets), install LibreOffice:

macOS

brew install —cask libreoffice

Ubuntu/Debian

apt-get install libreoffice

For images (JPG, PNG, GIF, BMP, TIFF, WebP, SVG), install ImageMagick:

macOS

brew install imagemagick

Ubuntu/Debian

apt-get install imagemagick

Once these are present, LiteParse handles the conversion behind the scenes. You can drive everything from CLI flags, or set defaults in a liteparse.config.json file: { “ocrLanguage”: “en”, “ocrEnabled”: true, “maxPages”: 1000, “dpi”: 150, “outputFormat”: “json”, “preciseBoundingBox”: true, “preserveVerySmallText”: false }

To point at an external OCR server, add an ocrServerUrl: { “ocrServerUrl”: “http://localhost:8828/ocr”, “ocrLanguage”: “en”, “outputFormat”: “json” }

Then run: lit parse document.pdf —config liteparse.config.json

The default Tesseract.js engine needs no setup, but if you want better accuracy you can wire in any OCR service that implements LiteParse’s simple API specification. The contract is minimal: a POST /ocr endpoint that accepts a file and a language, and returns JSON with each result’s text, bounding box, and confidence score. The repo includes ready-made example wrappers for EasyOCR and PaddleOCR you can use as templates. LiteParse is purpose-built and clear about its boundaries, which makes it easy to reason about. If you need local, fast text extraction from clean documents for a RAG pipeline, an agent, or a quick prototype, it’s a solid, dependency-light choice. If your documents are messy, scanned, or table-heavy, the maintainers are upfront that you’ll want a heavier solution.

LiteParse: A Fast, Local Document Parser for Developers

Basic parsing (OCR is on by default via Tesseract)

Output JSON to a file

Parse only specific pages

Skip OCR entirely

All pages

Specific pages at higher resolution

or

macOS

Ubuntu/Debian

macOS

Ubuntu/Debian

Related posts

Cats Disappear at Sunset. Can You Find Them in Time?

I'm so tired to code. Not even Vibe Coding... D:

I Built a Multi-Platform Publishing CLI for AI Agents

I built an embedded scheduler in Rust because I was tired of adding Redis just to run a background job