[Paper] QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Source: arXiv - 2512.08646v1
Overview
The paper presents QSTN, an open‑source Python framework that lets researchers and developers generate and evaluate questionnaire‑style responses from large language models (LLMs). By treating surveys as a systematic prompting task, QSTN makes it possible to run “in‑silico” surveys at massive scale while keeping the experiments reproducible and the results comparable to human answers.
Key Contributions
- Modular, open‑source library for building, running, and analysing questionnaire prompts with LLMs.
- Systematic evaluation pipeline that isolates the effects of question phrasing, presentation format, and response‑generation strategies.
- Large‑scale empirical study (over 40 M synthetic survey responses) showing how design choices influence alignment with human data.
- No‑code web UI that enables non‑programmers to set up robust LLM‑based survey experiments.
- Guidelines for cost‑effective, reliable LLM‑driven annotation that can replace or augment manual labeling in many workflows.
Methodology
- Prompt Construction – QSTN treats each questionnaire item as a prompt template. Researchers can vary wording, ordering, answer‑option layout, and even inject “noise” (e.g., synonyms, typos) to test robustness; a minimal sketch of this idea follows after this list.
- Response Generation – The framework supports multiple LLM back‑ends (OpenAI, Anthropic, open‑source models) and several decoding strategies (temperature‑controlled sampling, beam search, nucleus/top‑p sampling); see the second sketch below.
- Evaluation Harness – Generated answers are automatically compared against a ground‑truth human dataset using metrics such as exact match, semantic similarity, and calibration error (third sketch below).
- Experiment Orchestration – QSTN’s pipeline runs thousands of prompt‑model combinations in parallel, logs costs, and stores results in a structured JSON/CSV format for downstream analysis (fourth sketch below).
- User Interface – A lightweight Flask‑based UI lets users drag‑and‑drop questionnaire files, select models, and launch experiments without writing code.
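The prompt‑construction step can be pictured with a minimal sketch. This is not QSTN's actual API: the `QuestionnaireItem` class, `render_prompt` helper, and toy synonym table below are illustrative assumptions showing how a single item might be rendered into a template and perturbed via option shuffling and synonym swaps.

```python
import random
from dataclasses import dataclass

# Hypothetical structures (not QSTN's real API): a questionnaire item is
# rendered into a prompt template, and surface-level perturbations
# (option shuffling, synonym swaps) probe robustness.

@dataclass
class QuestionnaireItem:
    question: str
    options: list[str]

SYNONYMS = {"satisfied": "content", "likely": "probable"}  # toy synonym table

def render_prompt(item: QuestionnaireItem, shuffle_options: bool = False,
                  swap_synonyms: bool = False, seed: int = 0) -> str:
    rng = random.Random(seed)  # seeded so each variant is reproducible
    question = item.question
    if swap_synonyms:
        for word, alt in SYNONYMS.items():
            question = question.replace(word, alt)
    options = list(item.options)
    if shuffle_options:
        rng.shuffle(options)
    lines = [question, "Answer with exactly one of the options below."]
    lines += [f"- {opt}" for opt in options]
    return "\n".join(lines)

item = QuestionnaireItem(
    question="How satisfied are you with the product?",
    options=["Very satisfied", "Somewhat satisfied", "Not satisfied"],
)
print(render_prompt(item, shuffle_options=True, swap_synonyms=True))
```

Seeding the perturbations keeps every prompt variant reproducible, in line with the framework's stated emphasis on reproducible experiments.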
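For the response‑generation step, the sketch below shows what a single back‑end call with explicit decoding knobs could look like, using the OpenAI Python client as one possible back‑end. The `generate_answer` wrapper and the model name are illustrative assumptions, not QSTN's adapter interface; `temperature` and `top_p` stand in for the decoding strategies listed above (beam search is back‑end‑specific and omitted here).

```python
from openai import OpenAI  # one possible back-end; QSTN's own adapter layer is not shown

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(prompt: str, model: str = "gpt-4o-mini",
                    temperature: float = 0.2, top_p: float = 1.0) -> str:
    """Return a single survey answer for one rendered prompt.

    temperature <= 0.2 approximates the near-deterministic decoding the paper
    reports working best for factual items; raising temperature or lowering
    top_p switches to more exploratory sampling.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content.strip()
```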
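The evaluation harness compares generated answers against human references. The sketch below implements two of the metrics named above, exact match and embedding‑based semantic similarity (using sentence‑transformers, an assumed choice of encoder); calibration error is left out because it additionally requires answer probabilities.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def exact_match(model_answers: list[str], human_answers: list[str]) -> float:
    """Fraction of items where the model picked exactly the human answer."""
    hits = sum(m.strip().lower() == h.strip().lower()
               for m, h in zip(model_answers, human_answers))
    return hits / len(human_answers)

def mean_semantic_similarity(model_answers: list[str],
                             human_answers: list[str]) -> float:
    """Average cosine similarity between model and human answer embeddings."""
    emb_model = embedder.encode(model_answers, convert_to_tensor=True)
    emb_human = embedder.encode(human_answers, convert_to_tensor=True)
    sims = util.cos_sim(emb_model, emb_human).diagonal()  # item-wise similarity
    return sims.mean().item()
```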
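Finally, the orchestration step can be approximated as a plain grid run. The sketch below reuses the hypothetical helpers from the previous sketches, evaluates prompt‑variant × model × temperature combinations in parallel, and writes structured rows to CSV; QSTN's own cost logging and JSON output are not reproduced here.

```python
import csv
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical grid, reusing item / render_prompt / generate_answer from the
# sketches above; model names are placeholders.
models = ["gpt-4o-mini", "gpt-4o"]
temperatures = [0.0, 0.2, 0.7]
prompt_variants = {
    "plain": render_prompt(item),
    "shuffled": render_prompt(item, shuffle_options=True),
}

def run_one(config):
    variant, model, temperature = config
    answer = generate_answer(prompt_variants[variant], model=model,
                             temperature=temperature)
    return {"variant": variant, "model": model,
            "temperature": temperature, "answer": answer}

grid = list(itertools.product(prompt_variants, models, temperatures))
with ThreadPoolExecutor(max_workers=8) as pool:
    rows = list(pool.map(run_one, grid))  # parallel prompt-model runs

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```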
Results & Findings
- Question structure matters: Simple, single‑sentence questions with explicit answer options yielded the highest alignment (up to 92 % exact match) with human responses, while multi‑sentence or ambiguous phrasing dropped alignment by 15‑20 %.
- Decoding strategy is critical: Low‑temperature, near‑deterministic sampling (temperature ≤ 0.2) consistently outperformed high‑temperature or nucleus sampling for factual survey items.
- Model size vs. cost trade‑off: Mid‑size models (≈13 B parameters) achieved near‑human agreement at a fraction (≈30 %) of the compute cost of the largest models.
- Robustness to perturbations: Introducing minor lexical variations (synonyms, shuffled options) reduced alignment by only ~5 %, indicating that well‑designed prompts are resilient to surface noise.
- No‑code UI usability: Pilot users (social scientists, product managers) could set up a full experiment in under 10 minutes, confirming the accessibility goal.
Practical Implications
- Rapid prototyping of LLM‑based surveys – Product teams can test user‑experience questions or market‑research polls without recruiting participants, saving time and budget.
- Scalable data annotation – When building training sets for classification or sentiment analysis, QSTN can generate high‑quality labeled data at scale, reducing reliance on costly human annotators.
- A/B testing of prompt designs – Developers can systematically compare different wording or UI layouts for chatbots and voice assistants, ensuring the final prompt yields the most reliable model behavior.
- Compliance & reproducibility – The framework’s versioned pipelines and cost logs make it easier to meet audit requirements for AI‑generated content, a growing concern in regulated industries.
- Education & research – Instructors can expose students to real‑world LLM evaluation without deep programming expertise, fostering a more data‑driven curriculum.
Limitations & Future Work
- Domain specificity – The current evaluation focuses on general‑knowledge and opinion surveys; performance on highly technical or niche domains (e.g., medical questionnaires) remains untested.
- Human benchmark quality – Alignment metrics depend on the quality and diversity of the human reference dataset; biases in that data can propagate to the evaluation.
- Model access constraints – While QSTN supports open‑source models, many high‑performing LLM APIs are gated, limiting reproducibility for researchers without commercial access.
- Future directions include extending the framework to multimodal prompts (image‑plus‑text surveys), integrating active‑learning loops for iterative prompt refinement, and publishing a benchmark suite covering more specialized questionnaire domains.
Authors
- Maximilian Kreutner
- Jens Rupprecht
- Georg Ahnert
- Ahmed Salem
- Markus Strohmaier
Paper Information
- arXiv ID: 2512.08646v1
- Categories: cs.CL, cs.CY
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08646v1