[Paper] Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Published: February 18, 2026

Source: arXiv - 2602.16640v1

Overview

The paper presents Quecto‑V1, a 124 M‑parameter small language model (SLM) that’s been trained from scratch on Indian legal texts and then aggressively 8‑bit quantized so it can run offline on a typical laptop or even a low‑power edge device. By marrying domain‑specific training with extreme model compression, the authors show that high‑quality legal retrieval is possible without the massive cloud‑based LLMs that dominate the market today.

Key Contributions

  • Domain‑focused SLM: First‑ever Indian‑law‑only model built on a GPT‑2‑style architecture (124 M parameters).
  • Full‑precision to 8‑bit quantization pipeline: Uses the GGUF format to shrink the model to < 150 MB (≈ 74 % size reduction).
  • Empirical evaluation on legal retrieval: Exact‑match benchmarks on statutes, the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution demonstrate superior performance over generic SLMs.
  • Quantization impact analysis: Ablation shows only a 3.5 % drop in retrieval accuracy after 8‑bit quantization.
  • On‑device inference: Demonstrates real‑time inference on consumer‑grade CPUs with no internet connection, addressing data‑sovereignty concerns.
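The size claim is easy to sanity-check: moving 124 M weights from 4 bytes (FP32) to 1 byte (int8) accounts for essentially all of the reported ~74 % reduction, with the remaining on-disk size coming from per-channel scales and GGUF metadata. A quick back-of-the-envelope sketch (parameter count is from the summary; exact file sizes will differ slightly):

```python
# Rough footprint estimate for a 124 M-parameter model at FP32 vs int8.
# On-disk GGUF files add metadata and scale tables on top of raw weights.

PARAMS = 124_000_000  # GPT-2-small-scale parameter count

def model_size_mb(params: int, bytes_per_weight: float) -> float:
    """Approximate weight-storage size in megabytes."""
    return params * bytes_per_weight / 1_000_000

fp32_mb = model_size_mb(PARAMS, 4.0)  # 32-bit floats
int8_mb = model_size_mb(PARAMS, 1.0)  # 8-bit integers

reduction = 1.0 - int8_mb / fp32_mb
print(f"FP32: {fp32_mb:.0f} MB, int8: {int8_mb:.0f} MB, reduction: {reduction:.0%}")
```

The 75 % theoretical reduction lands close to the ~74 % the paper reports once quantization bookkeeping is included.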

Methodology

  1. Data Curation – The authors scraped and cleaned the full text of Indian statutes, creating a ~2 GB corpus that emphasizes legal terminology and definitions.
  2. Model Architecture – A vanilla GPT‑2 decoder stack (12 layers, 768 hidden size) was trained from scratch, avoiding any pre‑training on general‑purpose corpora to keep the lexical density high for legal language.
  3. Training Regimen – Standard next‑token prediction with AdamW optimizer, learning‑rate warm‑up, and a total of 300 k steps on a single GPU.
  4. Post‑Training Quantization – After convergence, the model weights were quantized to 8‑bit integers using the GGUF toolchain, which includes per‑channel scaling to preserve numeric fidelity.
  5. Evaluation Suite – A set of exact‑match retrieval tasks (e.g., “What is the definition of ‘homicide’ in IPC?”) and a broader zero‑shot QA benchmark to compare against generic SLMs of similar size.
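The per-channel scaling in step 4 can be illustrated with a minimal sketch of symmetric 8-bit quantization. This is an illustrative reimplementation of the general scheme, not the paper's or GGUF's actual code: each channel (row) gets its own scale so its largest-magnitude weight maps to ±127, which keeps small-valued channels from being crushed by one global scale.

```python
# Minimal per-channel symmetric int8 quantization sketch (illustrative only).

def quantize_per_channel(weights):
    """Quantize a 2-D weight matrix (list of channel rows) to int8.

    Returns the integer codes and one scale per channel; the channel's
    largest-magnitude weight maps to +/-127.
    """
    q_rows, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / 127 or 1.0  # avoid zero scale
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate floating-point weights from codes and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

# A toy matrix with one large-valued and one small-valued channel.
w = [[0.5, -1.0, 0.25], [0.01, -0.02, 0.005]]
q, s = quantize_per_channel(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b)
              for row, row_hat in zip(w, w_hat)
              for a, b in zip(row, row_hat))
print(f"max reconstruction error: {max_err:.6f}")
```

Note that the small-valued second channel keeps its own fine-grained scale; with a single global scale its weights would all collapse toward zero.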

Results & Findings

| Model | Size (MB) | Exact‑Match Accuracy | Retrieval Latency (CPU) |
|---|---|---|---|
| Quecto‑V1 (FP32) | 470 | 92.1 % | 1.8 s |
| Quecto‑V1 (8‑bit) | 148 | 88.6 % | 0.9 s |
| Generic GPT‑2 (124 M) | 470 | 71.4 % | 1.9 s |
| TinyBERT‑Legal (30 M) | 115 | 65.2 % | 0.7 s |
  • Size reduction: 8‑bit quantization cuts the footprint by ~74 % while keeping accuracy within 3.5 % of the full‑precision model.
  • Domain advantage: Even the quantized Quecto‑V1 outperforms a generic GPT‑2 by > 17 % absolute accuracy on statutory definition retrieval.
  • Latency: Quantization also speeds up inference roughly 2× on a mid‑range CPU (Intel i5‑10400).

These numbers confirm that aggressive quantization does not cripple a model when the task is tightly scoped to a specialized knowledge base.
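The exact-match metric behind these accuracy figures is simple to sketch: a prediction scores 1 only if it equals the reference answer after light normalization. The normalization details below (lowercasing, punctuation stripping, whitespace collapsing) are an assumption; the paper may normalize differently.

```python
# Sketch of an exact-match accuracy metric with light normalization.
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference exactly after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Section 302, IPC", "section 302 IPC", "Article 21"]
refs  = ["Section 302 IPC",  "Section 302 IPC", "Article 19"]
print(exact_match_accuracy(preds, refs))  # 2 of 3 match after normalization
```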

Practical Implications

  • Offline legal assistants – Law firms, NGOs, or government agencies can embed Quecto‑V1 in desktop tools, mobile apps, or edge devices, guaranteeing that sensitive case data never leaves the premises.
  • Cost‑effective deployment – No need for expensive GPU‑powered inference servers; a single CPU can serve dozens of concurrent users for routine statutory look‑ups.
  • Data sovereignty – Particularly relevant in jurisdictions with strict data‑privacy regulations (e.g., India’s Digital Personal Data Protection Act, 2023), as the model runs entirely locally.
  • Rapid prototyping for niche domains – The workflow (domain‑specific corpus → small transformer → 8‑bit quantization) can be replicated for other regulated fields such as healthcare, finance, or compliance.
  • Open‑source potential – If released under a permissive license, the model could become a community‑maintained legal knowledge base, reducing reliance on proprietary cloud APIs.

Limitations & Future Work

  • Scope of knowledge – Quecto‑V1 only covers statutory text; it lacks case law, commentary, and evolving jurisprudence, limiting its usefulness for complex legal reasoning.
  • Evaluation breadth – Benchmarks focus on exact‑match retrieval; more nuanced QA, reasoning, or multi‑turn dialogue assessments are absent.
  • Quantization trade‑offs – While 8‑bit works well for retrieval, tasks requiring fine‑grained probability estimates (e.g., confidence scoring) may suffer.
  • Future directions – Authors suggest extending the corpus to include judicial opinions, exploring mixed‑precision (4‑bit) quantization, and integrating retrieval‑augmented generation to combine the SLM with external knowledge bases.

Authors

  • Subrit Dikshit

Paper Information

  • arXiv ID: 2602.16640v1
  • Categories: cs.CL
  • Published: February 18, 2026