[Paper] Qualitative Coding Analysis through Open-Source Large Language Models: A User Study and Design Recommendations
Source: arXiv - 2602.18352v1
Overview
The paper presents ChatQDA, an on‑device framework that leverages open‑source large language models (LLMs) to assist researchers with qualitative coding while keeping raw data local. By sidestepping commercial APIs, the system aims to eliminate the privacy concerns that often block the use of powerful LLMs in sensitive, human‑centred research.
Key Contributions
- Privacy‑first architecture: A fully local pipeline that runs open‑source LLMs on the user’s machine, avoiding any network traffic of raw interview or survey text.
- Chat‑style coding interface: An interactive UI that lets analysts pose natural‑language prompts (e.g., “extract themes about user frustration”) and receive suggested codes in real time.
- Mixed‑methods user study: 30 participants from social‑science and HCI backgrounds evaluated the tool, providing quantitative usability scores and qualitative feedback.
- “Conditional trust” insight: Users trusted the system for surface‑level extraction but remained skeptical about deeper interpretive judgments and consistency across runs.
- Design recommendations: Six actionable guidelines for building local‑first, LLM‑augmented analysis tools that balance verifiable privacy with methodological rigor.
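The chat-style interface described above rests on a layer that translates an analyst's natural-language request into a structured model query. The paper does not publish this layer's code, so the sketch below is purely illustrative: the template strings, task names, and `build_prompt` function are assumptions, not ChatQDA's actual implementation.

```python
# Hypothetical sketch of a prompt-engineering layer for local qualitative
# coding. Task names and templates are illustrative, not from the paper.

TASK_TEMPLATES = {
    "open_coding": (
        "You are assisting with qualitative open coding.\n"
        "Suggest 3-5 short codes for the excerpt below.\n"
        "Excerpt:\n{excerpt}"
    ),
    "theme_generation": (
        "Group the following codes into candidate themes.\n"
        "Codes:\n{excerpt}"
    ),
}

def build_prompt(task: str, excerpt: str) -> str:
    """Translate an analysis task plus raw text into a single model query
    that can be sent to a locally hosted open-source LLM."""
    return TASK_TEMPLATES[task].format(excerpt=excerpt)

prompt = build_prompt("open_coding", "I kept clicking but nothing happened.")
```

Because the prompt is assembled and answered entirely on the analyst's machine, no raw excerpt ever leaves the local environment, which is the property the privacy-first architecture depends on.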
Methodology
- System Construction – The authors bundled a lightweight, open‑source transformer (e.g., LLaMA‑7B) with a custom prompt‑engineering layer that translates typical qualitative‑analysis tasks (open coding, memoing, theme generation) into model queries. All components run inside a Docker container on the analyst’s workstation.
- User Study Design – A mixed‑methods approach combined:
- Quantitative: SUS (System Usability Scale) and NASA‑TLX workload questionnaires after a 45‑minute coding session.
- Qualitative: Semi‑structured interviews probing participants’ trust, perceived accuracy, and privacy concerns.
- Data Collection – Participants coded a publicly available interview dataset (≈2,000 words) using both ChatQDA and a baseline manual spreadsheet workflow.
- Analysis – The authors performed statistical comparisons of SUS scores and coded the interview transcripts from the study itself, applying thematic analysis to surface emergent user attitudes.
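The SUS questionnaire used in the study has a standard scoring procedure (not specific to this paper): ten items rated 1-5, where odd-numbered items contribute their response minus 1, even-numbered items contribute 5 minus their response, and the sum is scaled by 2.5 to yield a 0-100 score. A minimal implementation:

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 Likert responses.

    Odd-numbered items (positively worded) contribute (r - 1);
    even-numbered items (negatively worded) contribute (5 - r);
    the total is multiplied by 2.5 to map onto a 0-100 scale.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = sum(
        (r - 1) if i % 2 == 1 else (5 - r)
        for i, r in enumerate(responses, start=1)
    )
    return total * 2.5

# Maximally positive responses (5 on odd items, 1 on even items) score 100.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```

Averaging per-participant scores computed this way yields figures like the 82.4 reported in the Results section.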
Results & Findings
- Usability: ChatQDA achieved an average SUS score of 82.4, indicating “excellent” usability, and participants reported a 30% reduction in perceived workload versus the manual baseline.
- Trust Profile: Users expressed conditional trust—they were comfortable letting the model suggest surface codes (e.g., keyword tags) but doubted its ability to capture nuanced, context‑dependent meanings. Consistency checks (re‑running the same prompt) sometimes yielded divergent code sets, reinforcing this skepticism.
- Privacy Perception: Even though the system never transmitted data, 70% of participants voiced lingering “epistemic uncertainty” about whether their data could be inadvertently exposed, highlighting a gap between technical guarantees and user confidence.
- Efficiency Gains: On average, participants completed the coding task 22 minutes faster with ChatQDA, attributing the speedup to instant suggestion generation and reduced manual scrolling.
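The divergent code sets observed when re-running the same prompt can be quantified with a simple set-overlap measure. The paper does not specify its consistency metric; Jaccard similarity is one common choice, shown here as an assumed example:

```python
def jaccard(codes_a, codes_b):
    """Jaccard similarity between two sets of suggested codes.

    1.0 means the runs produced identical code sets; values near 0
    indicate the divergence that fed participants' skepticism.
    """
    a, b = set(codes_a), set(codes_b)
    if not a and not b:
        return 1.0  # two empty runs are trivially identical
    return len(a & b) / len(a | b)

run_1 = {"frustration", "latency", "trust"}
run_2 = {"frustration", "latency", "onboarding"}
print(jaccard(run_1, run_2))  # 0.5
```

Reporting such a score alongside suggestions would let analysts see at a glance how stable the model's output is for a given prompt.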
Practical Implications
- For Developers of Research Tools – The study demonstrates that local‑first LLM integration is technically feasible and can dramatically improve workflow efficiency without sacrificing data sovereignty.
- Enterprise & Compliance – Industries bound by GDPR, HIPAA, or internal data‑handling policies can adopt similar on‑device LLM pipelines to automate text‑analysis tasks (e.g., customer feedback mining) while staying within strict privacy envelopes.
- Product Design – The “conditional trust” finding suggests that UI/UX should surface confidence scores, version histories, and easy ways to override or edit model‑generated codes, thereby giving analysts a safety net.
- Open‑Source Ecosystem – By relying on community‑maintained models, organizations avoid vendor lock‑in and can audit the model weights, fostering greater transparency for auditors and ethics boards.
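One way to surface the confidence scores recommended above is to derive a per-code score from the model's token log-probabilities, which local LLM runtimes typically expose. This heuristic is an assumption for illustration, not a method from the paper:

```python
import math

def code_confidence(token_logprobs):
    """Heuristic confidence for one model-suggested code: the geometric
    mean of its token probabilities (closer to 1.0 = more certain).

    `token_logprobs` are the log-probabilities of the tokens that make up
    the suggested code, as reported by a local inference runtime.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities for a three-token code suggestion.
confidence = code_confidence([-0.1, -0.2, -0.05])
```

Displayed next to each suggested code, a score like this gives analysts a concrete cue for when to accept, edit, or discard a suggestion.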
Limitations & Future Work
- Model Scale – The study used a 7‑billion‑parameter model; larger models could improve nuance but would strain typical workstation resources.
- Dataset Scope – Only a single, publicly available interview corpus was tested; results may differ with longer, multilingual, or highly domain‑specific texts.
- Trust Calibration – The authors note the need for systematic methods (e.g., calibrated confidence metrics, explainability overlays) to bridge the gap between technical privacy guarantees and user‑perceived security.
- Future Directions – Planned extensions include (1) integrating differential privacy noise to further reassure users, (2) evaluating cross‑run reproducibility mechanisms, and (3) expanding the user study to professional qualitative analysts in health and legal sectors.
Authors
- Tung T. Ngo
- Dai Nguyen Van
- Anh-Minh Nguyen
- Phuong-Anh Do
- Anh Nguyen-Quoc
Paper Information
- arXiv ID: 2602.18352v1
- Categories: cs.HC, cs.CR, cs.SE
- Published: February 20, 2026