LiteParse: A Fast, Local Document Parser for Developers
Source: Dev.to
LiteParse is a fast, local document parser for extracting text from clean, well-structured files. It handles PDFs, DOCX, HTML, and more, with minimal setup and no API calls. Everything runs locally, so your documents never leave your environment.

The project is honest about its scope, which is refreshing. It’s a strong fit when:

- Your documents are relatively straightforward, without complex tables, mixed layouts, or scanned pages.
- You want parsing to run locally rather than sending data to an external service.
- You’re prototyping or building a lightweight pipeline and don’t need enterprise-grade accuracy.

For the genuinely hard stuff (dense tables, multi-column layouts, charts, handwriting, scanned PDFs), the maintainers point you toward LlamaParse, their cloud product. LiteParse deliberately stays in the “fast and light” lane rather than trying to be everything.

Under the hood, LiteParse leans on PDF.js for spatial text parsing and gives you a few things that matter for AI pipelines:

- Text extraction with precise bounding boxes, so you know where each piece of text sits on the page.
- A flexible OCR system: Tesseract.js works out of the box with zero setup, and you can plug in HTTP OCR servers like EasyOCR or PaddleOCR for higher accuracy.
- Screenshot generation, which produces page images that LLM agents can use to capture visual information that text alone misses.
- Output in either JSON or plain text.
- A standalone binary that runs across Linux, macOS (Intel and ARM), and Windows.

LiteParse ships as both a CLI and a library. Here’s the fast path for each.

The recommended approach is a global npm install, which gives you the lit command everywhere:
npm i -g @llamaindex/liteparse
On macOS and Linux you can also use Homebrew:
brew tap run-llama/liteparse
brew install llamaindex-liteparse
Basic parsing (OCR is on by default via Tesseract)
lit parse document.pdf
Output JSON to a file
lit parse document.pdf --format json -o output.json
Parse only specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"
Skip OCR entirely
lit parse document.pdf --no-ocr
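The page selection shown above is a comma-separated list of single pages and inclusive ranges. As a rough illustration of how a spec in that format expands into page numbers, here is a minimal sketch. The function name `expandPageSpec` is hypothetical and this is not LiteParse's own parser; it only mirrors the format shown in the example.

```typescript
// Hypothetical helper: expands a page spec such as "1-5,10,15-20" into a
// sorted list of unique page numbers. Illustrative only, not LiteParse code.
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas

    // Accept either "N" or "N-M".
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page token: "${token}"`);

    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageSpec("1-5,10,15-20"));
```

For the spec used in the CLI example, this yields pages 1 through 5, page 10, and pages 15 through 20. Rejecting reversed ranges such as "5-2" is a choice made for this sketch; the real CLI may behave differently.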
For pipelines, batch mode reuses the PDF engine across files for efficiency:
lit batch-parse ./input-directory ./output-directory
Generate screenshots of all pages
lit screenshot document.pdf -o ./screenshots
Specific pages at higher resolution
lit screenshot document.pdf --target-pages "1,3,5" --dpi 300 -o ./screenshots
If you’d rather call it from code, install it as a dependency:
npm install @llamaindex/liteparse
or
pnpm add @llamaindex/liteparse
Then parsing is a few lines:
import { LiteParse } from '@llamaindex/liteparse';

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);
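If you need to run the library call over several files, a small wrapper can record failures per file instead of stopping at the first bad document. The sketch below relies only on what the snippet above shows, namely a `parse(path)` method that resolves to an object with a `text` field. The `TextParser` interface and the `extractAll` helper are hypothetical names introduced for illustration, and it is an assumption that a `LiteParse` instance can be passed wherever `TextParser` is expected.

```typescript
// Structural view of the only API surface used here: parse() and result.text,
// both taken from the snippet above. Nothing else about the result is assumed.
interface TextParser {
  parse(path: string): Promise<{ text: string }>;
}

interface ExtractOutcome {
  path: string;
  text?: string; // set when parsing succeeded
  error?: string; // set when parsing threw
}

// Hypothetical helper: parse files one at a time and collect per-file outcomes.
async function extractAll(
  parser: TextParser,
  paths: string[],
): Promise<ExtractOutcome[]> {
  const outcomes: ExtractOutcome[] = [];
  for (const path of paths) {
    try {
      const result = await parser.parse(path);
      outcomes.push({ path, text: result.text });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ path, error: message });
    }
  }
  return outcomes;
}
```

A call would look like `await extractAll(parser, ['a.pdf', 'b.pdf'])`, with `parser` being the instance created above. Files are processed sequentially here to keep the sketch simple; whether parallel calls are safe or faster is not something the source states.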
One thing that sets LiteParse apart from PDF-only tools is automatic format conversion. Point it at an Office document or an image and it will convert to PDF first, provided you have the right helper installed. For Office documents (Word, PowerPoint, spreadsheets), install LibreOffice:
macOS
brew install --cask libreoffice
Ubuntu/Debian
apt-get install libreoffice
For images (JPG, PNG, GIF, BMP, TIFF, WebP, SVG), install ImageMagick:
macOS
brew install imagemagick
Ubuntu/Debian
apt-get install imagemagick
Once these are present, LiteParse handles the conversion behind the scenes.

You can drive everything from CLI flags, or set defaults in a liteparse.config.json file:
{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "preserveVerySmallText": false
}
To point at an external OCR server, add an ocrServerUrl:
{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}
Then run:
lit parse document.pdf --config liteparse.config.json
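Before handing a config file to the CLI, it can help to confirm that it is valid JSON and that each value has a plausible type. The sketch below is a minimal pre-flight check: the key names are copied from the config examples above, while the expected types and the `checkConfig` function are assumptions made for illustration, not LiteParse's own schema or validation.

```typescript
// Key names come from the liteparse.config.json examples above.
// The expected types are inferred from those example values (an assumption).
const expectedTypes: Record<string, "string" | "boolean" | "number"> = {
  ocrLanguage: "string",
  ocrEnabled: "boolean",
  ocrServerUrl: "string",
  maxPages: "number",
  dpi: "number",
  outputFormat: "string",
  preciseBoundingBox: "boolean",
  preserveVerySmallText: "boolean",
};

// Hypothetical helper: returns a list of problems, empty if none were found.
// Keys this sketch does not know about are ignored rather than rejected.
function checkConfig(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["not valid JSON (curly quotes pasted from a web page are a common cause)"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["top level must be a JSON object"];
  }
  const problems: string[] = [];
  for (const [key, value] of Object.entries(parsed)) {
    const expected = expectedTypes[key];
    if (expected !== undefined && typeof value !== expected) {
      problems.push(`${key} should be a ${expected}`);
    }
  }
  return problems;
}

console.log(checkConfig('{ "ocrLanguage": "en", "dpi": "150" }'));
```

The example call reports a single problem, because `dpi` is given as a string rather than a number.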
The default Tesseract.js engine needs no setup, but if you want better accuracy you can wire in any OCR service that implements LiteParse’s simple API specification. The contract is minimal: a POST /ocr endpoint that accepts a file and a language, and returns JSON with each result’s text, bounding box, and confidence score. The repo includes ready-made example wrappers for EasyOCR and PaddleOCR you can use as templates.

LiteParse is purpose-built and clear about its boundaries, which makes it easy to reason about. If you need local, fast text extraction from clean documents for a RAG pipeline, an agent, or a quick prototype, it’s a solid, dependency-light choice. If your documents are messy, scanned, or table-heavy, the maintainers are upfront that you’ll want a heavier solution.
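As a closing illustration of the OCR server contract described above, here is a minimal sketch of checking and filtering the results a custom server might return. The article only says each result carries text, a bounding box, and a confidence score, so the field names `text`, `bbox`, and `confidence`, the `[x1, y1, x2, y2]` layout, and the helper names are all assumptions; consult the API specification in the repo for the real shape.

```typescript
// Assumed shape of one OCR result. Field names and bbox layout are guesses
// for illustration; the source only names the three pieces of information.
interface OcrItem {
  text: string;
  bbox: [number, number, number, number]; // assumed [x1, y1, x2, y2]
  confidence: number;
}

// Type guard: true only if the value has all three fields with usable types.
function isOcrItem(value: unknown): value is OcrItem {
  if (typeof value !== "object" || value === null) return false;
  const item = value as Record<string, unknown>;
  return (
    typeof item.text === "string" &&
    typeof item.confidence === "number" &&
    Array.isArray(item.bbox) &&
    item.bbox.length === 4 &&
    item.bbox.every((n) => typeof n === "number")
  );
}

// Hypothetical helper: drop malformed items and those below a threshold.
function keepConfident(items: unknown[], minConfidence: number): OcrItem[] {
  return items
    .filter(isOcrItem)
    .filter((item) => item.confidence >= minConfidence);
}

const sample: unknown[] = [
  { text: "Invoice", bbox: [10, 20, 110, 40], confidence: 0.97 },
  { text: "smudge", bbox: [0, 0, 5, 5], confidence: 0.2 },
];
console.log(keepConfident(sample, 0.5).map((item) => item.text));
```

With the sample data, only the "Invoice" item survives the 0.5 threshold. A check like this is useful when wrapping a third-party OCR engine, since a malformed item is easier to catch at the boundary than downstream.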