[Paper] QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Source: arXiv - 2512.08646v1
Overview
The paper presents QSTN, an open‑source Python framework that lets researchers and developers generate and evaluate questionnaire‑style responses from large language models (LLMs). By treating surveys as a systematic prompting task, QSTN makes it possible to run “in‑silico” surveys at massive scale while keeping the experiments reproducible and the results comparable to human answers.
Key Contributions
- Modular, open‑source library for building, running, and analysing questionnaire prompts with LLMs.
- Systematic evaluation pipeline that isolates the effects of question phrasing, presentation format, and response‑generation strategies.
- Large‑scale empirical study (over 40 M synthetic survey responses) showing how design choices influence alignment with human data.
- No‑code web UI that enables non‑programmers to set up robust LLM‑based survey experiments.
- Guidelines for cost‑effective, reliable LLM‑driven annotation that can replace or augment manual labeling in many workflows.
Methodology
- Prompt Construction – QSTN treats each questionnaire item as a prompt template. Researchers can vary wording, ordering, answer‑option layout, and even inject “noise” (e.g., synonyms, typos) to test robustness; a minimal sketch of this idea follows after this list.
- Response Generation – The framework supports multiple LLM back‑ends (OpenAI, Anthropic, open‑source models) and several decoding strategies (temperature‑controlled sampling, beam search, nucleus/top‑p sampling); see the second sketch below.
- Evaluation Harness – Generated answers are automatically compared against a ground‑truth human dataset using metrics such as exact match, semantic similarity, and calibration error (third sketch below).
- Experiment Orchestration – QSTN’s pipeline runs thousands of prompt‑model combinations in parallel, logs costs, and stores results in a structured JSON/CSV format for downstream analysis (fourth sketch below).
- User Interface – A lightweight Flask‑based UI lets users drag‑and‑drop questionnaire files, select models, and launch experiments without writing code.
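The prompt‑construction step can be pictured with a minimal sketch. This is not QSTN's actual API: the `QuestionnaireItem` class, `render_prompt` helper, and toy synonym table below are illustrative assumptions showing how a single item might be rendered into a template and perturbed via option shuffling and synonym swaps.

```python
import random
from dataclasses import dataclass

# Hypothetical structures (not QSTN's real API): a questionnaire item is
# rendered into a prompt template, and surface-level perturbations
# (option shuffling, synonym swaps) probe robustness.

@dataclass
class QuestionnaireItem:
    question: str
    options: list[str]

SYNONYMS = {"satisfied": "content", "likely": "probable"}  # toy synonym table

def render_prompt(item: QuestionnaireItem, shuffle_options: bool = False,
                  swap_synonyms: bool = False, seed: int = 0) -> str:
    rng = random.Random(seed)  # seeded so each variant is reproducible
    question = item.question
    if swap_synonyms:
        for word, alt in SYNONYMS.items():
            question = question.replace(word, alt)
    options = list(item.options)
    if shuffle_options:
        rng.shuffle(options)
    lines = [question, "Answer with exactly one of the options below."]
    lines += [f"- {opt}" for opt in options]
    return "\n".join(lines)

item = QuestionnaireItem(
    question="How satisfied are you with the product?",
    options=["Very satisfied", "Somewhat satisfied", "Not satisfied"],
)
print(render_prompt(item, shuffle_options=True, swap_synonyms=True))
```

Seeding the perturbations keeps every prompt variant reproducible, in line with the framework's stated emphasis on reproducible experiments.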
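For the response‑generation step, the sketch below shows what a single back‑end call with explicit decoding knobs could look like, using the OpenAI Python client as one possible back‑end. The `generate_answer` wrapper and the model name are illustrative assumptions, not QSTN's adapter interface; `temperature` and `top_p` stand in for the decoding strategies listed above (beam search is back‑end‑specific and omitted here).

```python
from openai import OpenAI  # one possible back-end; QSTN's own adapter layer is not shown

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(prompt: str, model: str = "gpt-4o-mini",
                    temperature: float = 0.2, top_p: float = 1.0) -> str:
    """Return a single survey answer for one rendered prompt.

    temperature <= 0.2 approximates the near-deterministic decoding the paper
    reports working best for factual items; raising temperature or lowering
    top_p switches to more exploratory sampling.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content.strip()
```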
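The evaluation harness compares generated answers against human references. The sketch below implements two of the metrics named above, exact match and embedding‑based semantic similarity (using sentence‑transformers, an assumed choice of encoder); calibration error is left out because it additionally requires answer probabilities.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def exact_match(model_answers: list[str], human_answers: list[str]) -> float:
    """Fraction of items where the model picked exactly the human answer."""
    hits = sum(m.strip().lower() == h.strip().lower()
               for m, h in zip(model_answers, human_answers))
    return hits / len(human_answers)

def mean_semantic_similarity(model_answers: list[str],
                             human_answers: list[str]) -> float:
    """Average cosine similarity between model and human answer embeddings."""
    emb_model = embedder.encode(model_answers, convert_to_tensor=True)
    emb_human = embedder.encode(human_answers, convert_to_tensor=True)
    sims = util.cos_sim(emb_model, emb_human).diagonal()  # item-wise similarity
    return sims.mean().item()
```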
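Finally, the orchestration step can be approximated as a plain grid run. The sketch below reuses the hypothetical helpers from the previous sketches, evaluates prompt‑variant × model × temperature combinations in parallel, and writes structured rows to CSV; QSTN's own cost logging and JSON output are not reproduced here.

```python
import csv
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical grid, reusing item / render_prompt / generate_answer from the
# sketches above; model names are placeholders.
models = ["gpt-4o-mini", "gpt-4o"]
temperatures = [0.0, 0.2, 0.7]
prompt_variants = {
    "plain": render_prompt(item),
    "shuffled": render_prompt(item, shuffle_options=True),
}

def run_one(config):
    variant, model, temperature = config
    answer = generate_answer(prompt_variants[variant], model=model,
                             temperature=temperature)
    return {"variant": variant, "model": model,
            "temperature": temperature, "answer": answer}

grid = list(itertools.product(prompt_variants, models, temperatures))
with ThreadPoolExecutor(max_workers=8) as pool:
    rows = list(pool.map(run_one, grid))  # parallel prompt-model runs

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```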
Results & Findings
- Question structure matters: Simple, single‑sentence questions with explicit answer options yielded the highest alignment (up to 92 % exact match) with human responses, while multi‑sentence or ambiguous phrasing dropped alignment by 15‑20 %.
- Decoding strategy is critical: Low‑temperature, near‑deterministic sampling (temperature ≤ 0.2) consistently outperformed high‑temperature or nucleus sampling for factual survey items.
- Model size vs. cost trade‑off: Mid‑size models (≈13 B parameters) achieved near‑human agreement at a fraction (≈30 %) of the compute cost of the largest models.
- Robustness to perturbations: Introducing minor lexical variations (synonyms, shuffled options) reduced alignment by only ~5 %, indicating that well‑designed prompts are resilient to surface noise.
- No‑code UI usability: Pilot users (social scientists, product managers) could set up a full experiment in under 10 minutes, confirming the accessibility goal.
Practical Implications
- Rapid prototyping of LLM‑based surveys – Product teams can test user‑experience questions or market‑research polls without recruiting participants, saving time and budget.
- Scalable data annotation – When building training sets for classification or sentiment analysis, QSTN can generate high‑quality labeled data at scale, reducing reliance on costly human annotators.
- A/B testing of prompt designs – Developers can systematically compare different wording or UI layouts for chatbots and voice assistants, ensuring the final prompt yields the most reliable model behavior.
- Compliance & reproducibility – The framework’s versioned pipelines and cost logs make it easier to meet audit requirements for AI‑generated content, a growing concern in regulated industries.
- Education & research – Instructors can expose students to real‑world LLM evaluation without deep programming expertise, fostering a more data‑driven curriculum.
Limitations & Future Work
- Domain specificity – The current evaluation focuses on general‑knowledge and opinion surveys; performance on highly technical or niche domains (e.g., medical questionnaires) remains untested.
- Human benchmark quality – Alignment metrics depend on the quality and diversity of the human reference dataset; biases in that data can propagate to the evaluation.
- Model access constraints – While QSTN supports open‑source models, many high‑performing LLM APIs are gated, limiting reproducibility for researchers without commercial access.
- Future directions include extending the framework to multimodal prompts (image‑plus‑text surveys), integrating active‑learning loops for iterative prompt refinement, and publishing a benchmark suite covering more specialized questionnaire domains.
Authors
- Maximilian Kreutner
- Jens Rupprecht
- Georg Ahnert
- Ahmed Salem
- Markus Strohmaier
Paper Information
- arXiv ID: 2512.08646v1
- Categories: cs.CL, cs.CY
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08646v1