[Paper] QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models

Published: December 9, 2025 at 09:35 AM EST
3 min read

Source: arXiv - 2512.08646v1

Overview

The paper presents QSTN, an open‑source Python framework that lets researchers and developers generate and evaluate questionnaire‑style responses from large language models (LLMs). By treating surveys as a systematic prompting task, QSTN makes it possible to run “in‑silico” surveys at massive scale while keeping the experiments reproducible and the results comparable to human answers.

Key Contributions

  • Modular, open‑source library for building, running, and analysing questionnaire prompts with LLMs.
  • Systematic evaluation pipeline that isolates the effects of question phrasing, presentation format, and response‑generation strategies.
  • Large‑scale empirical study (over 40 M synthetic survey responses) showing how design choices influence alignment with human data.
  • No‑code web UI that enables non‑programmers to set up robust LLM‑based survey experiments.
  • Guidelines for cost‑effective, reliable LLM‑driven annotation that can replace or augment manual labeling in many workflows.

Methodology

  1. Prompt Construction – QSTN treats each questionnaire item as a prompt template. Researchers can vary wording, ordering, and answer‑option layout, and even inject “noise” (e.g., synonyms, typos) to test robustness; a minimal sketch of steps 1–2 follows this list.
  2. Response Generation – The framework supports multiple LLM back‑ends (OpenAI, Anthropic, open‑source models) and several decoding strategies (temperature‑controlled sampling, beam search, top‑p).
  3. Evaluation Harness – Generated answers are automatically compared against a ground‑truth human dataset using metrics such as exact match, semantic similarity, and calibration error.
  4. Experiment Orchestration – QSTN’s pipeline runs thousands of prompt‑model combinations in parallel, logs costs, and stores results in a structured JSON/CSV format for downstream analysis.
  5. User Interface – A lightweight Flask‑based UI lets users drag‑and‑drop questionnaire files, select models, and launch experiments without writing code.
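
To make steps 1–2 concrete, here is a minimal Python sketch of the general approach. It is not QSTN’s actual API: the example item, the templates, and the `call_llm` placeholder are illustrative assumptions, and the placeholder should be swapped for whatever backend client (OpenAI, Anthropic, or a local model) is actually used.

```python
import itertools

# Minimal sketch of prompt construction and generation (illustrative only,
# not QSTN's actual API).
ITEM = {
    "question": "How satisfied are you with the product?",
    "options": ["Very satisfied", "Somewhat satisfied", "Neutral",
                "Somewhat dissatisfied", "Very dissatisfied"],
}

# Two phrasings of the same item, used to test sensitivity to wording.
TEMPLATES = [
    "{question}\nAnswer with exactly one of: {options}.",
    "Please tell us: {question} Choose one option: {options}.",
]

def render_variants(item, templates, reverse_options=False):
    """Yield one prompt string per template / option-order combination."""
    orders = [item["options"]]
    if reverse_options:
        orders.append(list(reversed(item["options"])))
    for template, opts in itertools.product(templates, orders):
        yield template.format(question=item["question"],
                              options="; ".join(opts))

def call_llm(prompt, temperature=0.0):
    """Placeholder for a real backend call (OpenAI, Anthropic, or a local model).
    A low temperature keeps answers near-deterministic, as the findings below suggest."""
    raise NotImplementedError("plug in your LLM client here")

if __name__ == "__main__":
    for prompt in render_variants(ITEM, TEMPLATES, reverse_options=True):
        print("---")
        print(prompt)
```

Treating phrasing and option order as explicit parameters is what lets thousands of prompt–model combinations be enumerated and compared systematically.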

Results & Findings

  • Question structure matters: Simple, single‑sentence questions with explicit answer options yielded the highest alignment (up to 92 % exact match) with human responses, while multi‑sentence or ambiguous phrasing dropped alignment by 15‑20 %.
  • Decoding strategy is critical: Low‑temperature, near‑deterministic decoding (temperature ≤ 0.2) consistently outperformed high‑temperature or nucleus sampling for factual survey items.
  • Model size vs. cost trade‑off: Mid‑size models (≈13 B parameters) achieved near‑human agreement at a fraction (≈30 %) of the compute cost of the largest models.
  • Robustness to perturbations: Introducing minor lexical variations (synonyms, shuffled options) reduced alignment by only ~5 %, indicating that well‑designed prompts are resilient to surface noise; a toy illustration of the exact‑match check follows this list.
  • No‑code UI usability: Pilot users (social scientists, product managers) could set up a full experiment in under 10 minutes, confirming the accessibility goal.
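
As a toy illustration of the alignment numbers above, the snippet below computes exact‑match agreement for a clean run and a lightly perturbed run. The answer lists are invented for illustration, and the metric is a simplification of the paper’s full evaluation harness (which also uses semantic similarity and calibration error).

```python
# Toy illustration of the exact-match alignment metric (invented data,
# not the paper's evaluation harness).
HUMAN = ["Neutral", "Very satisfied", "Somewhat dissatisfied", "Neutral"]
MODEL_CLEAN = ["Neutral", "Very satisfied", "Somewhat dissatisfied", "Very satisfied"]
MODEL_PERTURBED = ["Neutral", "Very satisfied", "Neutral", "Very satisfied"]

def exact_match(model, reference):
    """Fraction of items where the model's choice equals the human choice."""
    hits = sum(m.strip().lower() == r.strip().lower()
               for m, r in zip(model, reference))
    return hits / len(reference)

print(f"clean prompts:     {exact_match(MODEL_CLEAN, HUMAN):.0%}")
print(f"perturbed prompts: {exact_match(MODEL_PERTURBED, HUMAN):.0%}")
```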

Practical Implications

  • Rapid prototyping of LLM‑based surveys – Product teams can test user‑experience questions or market‑research polls without recruiting participants, saving time and budget.
  • Scalable data annotation – When building training sets for classification or sentiment analysis, QSTN can generate high‑quality labeled data at scale, reducing reliance on costly human annotators; a minimal annotation sketch follows this list.
  • A/B testing of prompt designs – Developers can systematically compare different wording or UI layouts for chatbots and voice assistants, ensuring the final prompt yields the most reliable model behavior.
  • Compliance & reproducibility – The framework’s versioned pipelines and cost logs make it easier to meet audit requirements for AI‑generated content, a growing concern in regulated industries.
  • Education & research – Instructors can expose students to real‑world LLM evaluation without deep programming expertise, fostering a more data‑driven curriculum.
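
As a hedged example of the annotation workflow, the sketch below labels sentiment with the OpenAI Python SDK (v1.x). The model name, prompt, and label set are assumptions for illustration rather than settings reported in the paper, and any other backend could be substituted.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def annotate(text: str) -> str:
    """Ask the model for a single sentiment label for one document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice, not from the paper
        temperature=0.0,       # deterministic decoding, in line with the findings above
        messages=[
            {"role": "system",
             "content": "Label the sentiment of the user's text as exactly one of: "
                        "positive, neutral, negative. Reply with the label only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    texts = ["Great battery life!", "The app keeps crashing."]
    print([annotate(t) for t in texts])
```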

Limitations & Future Work

  • Domain specificity – The current evaluation focuses on general‑knowledge and opinion surveys; performance on highly technical or niche domains (e.g., medical questionnaires) remains untested.
  • Human benchmark quality – Alignment metrics depend on the quality and diversity of the human reference dataset; biases in that data can propagate to the evaluation.
  • Model access constraints – While QSTN supports open‑source models, many high‑performing LLM APIs are gated, limiting reproducibility for researchers without commercial access.
  • Future directions include extending the framework to multimodal prompts (image‑plus‑text surveys), integrating active‑learning loops for iterative prompt refinement, and publishing a benchmark suite covering more specialized questionnaire domains.

Authors

  • Maximilian Kreutner
  • Jens Rupprecht
  • Georg Ahnert
  • Ahmed Salem
  • Markus Strohmaier

Paper Information

  • arXiv ID: 2512.08646v1
  • Categories: cs.CL, cs.CY
  • Published: December 9, 2025
  • PDF: https://arxiv.org/pdf/2512.08646v1