The problem with dialogue datasets

Published: March 7, 2026 at 07:25 PM EST
4 min read
Source: Dev.to

Most dialogue datasets used to train and evaluate language models contain only text: a speaker label, a message, and sometimes a sentiment tag. This format works for many tasks, but it falls short when building systems that need to reason about people, not just respond to them.

Real conversations are driven by internal states that never appear in the transcript:

  • Beliefs about the other person that evolve with each exchange
  • Goals behind each message (e.g., seek validation, assert control, repair trust)
  • Relationship dynamics that shift across the conversation (trust, tension, connection)
  • Psychological identity that shapes how someone communicates under pressure

When a speaker says:

“I’m not upset about the meeting, I’m upset you didn’t tell me earlier.”

the text is visible, but the underlying drivers are not:

  • Belief that the other person withholds information (confidence: 0.74)
  • Goal to seek validation rather than escalate
  • Relationship state where trust has been eroding over the last four turns
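These hidden drivers can be written down as a small annotation record. A minimal sketch in Python, where the field names and the trend encoding are illustrative rather than any actual dataset schema:

```python
# Hypothetical annotation for the utterance above. Field names are
# illustrative, not an actual schema.
hidden_state = {
    "beliefs": {"other_withholds_info": 0.74},  # belief -> confidence in [0, 1]
    "goal": "seek_validation",                  # rather than "escalate"
    "relationship": {"trust_trend": "eroding", "over_last_turns": 4},
}

print(hidden_state["beliefs"]["other_withholds_info"])  # 0.74
```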

Without this information, a dataset can only tell you what happened, not why.

Training a conversational model on text‑only data leads it to imitate surface patterns—learning what responses look like, not what drives them. This works for simple tasks but creates a ceiling for anything that requires:

  • Tracking beliefs across multi‑turn conversations
  • Understanding how trust changes during conflict
  • Simulating how different personalities handle the same situation
  • Evaluating whether an agent’s internal reasoning matches its output

For these tasks, datasets need the internal structure to be explicit, not inferred after the fact.

StrataSynth: A Structured Approach

We are exploring a different approach with a project called StrataSynth. Instead of prompting an LLM to generate a conversation directly, the system first simulates a minimal cognitive model. The language model is used only at the final step to render decisions into natural language.

Pipeline Overview

PsycheGraph        → identity, attachment style, biases, voice
Belief Engine      → evolving beliefs with confidence scores
Relationship State → trust, tension, connection, dominance
Decision Engine    → intent, goal, communication act
LLM Rendering      → natural language
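The separation between the stages can be sketched in a few lines of Python. Everything here is illustrative: the function names and thresholds are not the StrataSynth API, and a template lookup stands in for the LLM rendering step:

```python
# Sketch of the decide-then-render separation. The state model picks the
# communication act; rendering only turns that decision into text.

def decide(beliefs, relationship):
    """Upstream decision: pick intent/goal/act from state. No LLM involved.

    Thresholds are made up for illustration.
    """
    if relationship["tension"] > 0.4 and beliefs["trust_other"] < 0.7:
        return {"intent": "reveal", "goal": "seek_validation",
                "communication_act": "accusation"}
    return {"intent": "share", "goal": "maintain_connection",
            "communication_act": "disclosure"}

def render(decision):
    """Final step: a template table stands in for the LLM call."""
    templates = {
        "accusation": "I'm not upset about the meeting. I'm upset you didn't tell me.",
        "disclosure": "The meeting caught me off guard, honestly.",
    }
    return templates[decision["communication_act"]]

decision = decide({"trust_other": 0.62}, {"tension": 0.44})
print(decision["communication_act"])  # accusation
print(render(decision))
```

Because the decision dict exists before any text does, the internal state is available as ground truth for every rendered turn.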

Key constraint: The LLM cannot decide what to believe or how to relate to the other agent; those decisions are made upstream by the state model. The LLM merely renders the decision into text. This separation ensures that the internal state is always explicit—it is the input that produced the output.

Example Turn (JSON)

{
  "speaker": "A",
  "text": "I'm not upset about the meeting. I'm upset you didn't tell me.",
  "intent": "reveal",
  "goal": "seek_validation",
  "communication_act": "accusation",
  "belief_delta": {
    "trust_other": -0.07
  },
  "relationship_state": {
    "trust": 0.62,
    "tension": 0.44,
    "connection": 0.38
  }
}
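To make the structure concrete, here is a sketch of consuming such a turn record in Python; the starting belief value of 0.69 is an assumption for illustration:

```python
import json

# Re-parse the example turn and apply its belief_delta to a running belief state.
turn = json.loads("""
{
  "speaker": "A",
  "text": "I'm not upset about the meeting. I'm upset you didn't tell me.",
  "intent": "reveal",
  "goal": "seek_validation",
  "communication_act": "accusation",
  "belief_delta": {"trust_other": -0.07},
  "relationship_state": {"trust": 0.62, "tension": 0.44, "connection": 0.38}
}
""")

beliefs = {"trust_other": 0.69}  # assumed prior value, for illustration
for name, delta in turn["belief_delta"].items():
    beliefs[name] = round(beliefs[name] + delta, 2)

print(beliefs["trust_other"])  # 0.62
```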

Across a full conversation, this produces trajectories such as:

  • Belief trajectory – how each belief changes turn by turn
  • Relationship trajectory – how trust and tension evolve across the arc
  • Behavioral entropy – how varied the speaker’s communication acts are
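Extracting those trajectories from a list of turn records is mechanical; a minimal sketch, assuming turns shaped like the JSON example above (the numbers here are made up):

```python
# Sketch: pulling trajectories out of a turn list. Turn values and the
# 0.69 starting belief are invented for illustration.
turns = [
    {"communication_act": "disclosure",
     "belief_delta": {"trust_other": 0.03},
     "relationship_state": {"trust": 0.72, "tension": 0.30}},
    {"communication_act": "accusation",
     "belief_delta": {"trust_other": -0.07},
     "relationship_state": {"trust": 0.62, "tension": 0.44}},
]

# Belief trajectory: cumulative application of per-turn deltas
belief_traj, trust_other = [], 0.69
for turn in turns:
    trust_other = round(trust_other + turn["belief_delta"].get("trust_other", 0.0), 2)
    belief_traj.append(trust_other)

# Relationship trajectory: the recorded state, turn by turn
relationship_traj = [turn["relationship_state"]["trust"] for turn in turns]

# Act sequence: the input to entropy-style behavioral metrics
acts = [turn["communication_act"] for turn in turns]

print(belief_traj)        # [0.72, 0.65]
print(relationship_traj)  # [0.72, 0.62]
```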

Evaluation Without LLM Self‑Scoring

We wanted to avoid evaluating synthetic data with the same LLM that generated it. LLM self‑evaluation can hide problems; a model that produces structurally inconsistent data may still rate it as high quality.

All quality metrics in StrataSynth are computed deterministically:

  • belief_consistency – correlation between communication acts and belief deltas (NumPy)
  • identity_stability – cosine similarity of communication distributions across turns (sentence‑transformers)
  • behavioral_entropy – Shannon entropy over communication act distributions
  • noise_rejection_rate – fraction of injected noise correctly isolated
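As one concrete example, behavioral entropy reduces to Shannon entropy over the observed distribution of communication acts. A stdlib sketch (the metric names above suggest the actual implementation relies on NumPy and sentence-transformers):

```python
import math
from collections import Counter

def behavioral_entropy(acts):
    """Shannon entropy (bits) of a speaker's communication-act distribution.

    0.0 means the speaker repeats one act; log2(k) means k acts used uniformly.
    """
    counts = Counter(acts)
    n = len(acts)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(behavioral_entropy(["accusation"] * 4))            # 0.0
print(behavioral_entropy(["accusation", "disclosure",
                          "withdrawal", "repair"]))      # 2.0
```

Because the computation is a closed-form function of the logged acts, the score cannot drift with the generator: the same conversation always receives the same number.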

No LLM scoring. No circular evaluation.

Published Datasets

We have released three prototype datasets on Hugging Face (15 conversations each):

  1. stratasynth-social-reasoning – family conflict, romantic trust repair, caregiver stress
  2. stratasynth-agent-stress-test – jealousy escalation, performance reviews, estrangement
  3. stratasynth-belief-dynamics – career transitions, mentorship conflict, relationship dissolution

The structure, not the volume, is the contribution we wanted to share.

Potential Applications

Structured social datasets could be useful for:

  • Evaluating whether an agent tracks belief changes correctly
  • Training models that need to reason about trust and conflict
  • Stress‑testing conversational systems with psychologically defined personas
  • Alignment research that requires explicit internal state as ground truth

Open Questions

The cognitive model is intentionally minimal:

  • 12 beliefs
  • 4 relationship dimensions
  • 10 communication acts

We are not sure whether this abstraction captures enough signal or is too crude an approximation. If you have worked on structured dialogue datasets, agent evaluation, or social reasoning benchmarks, we would be very interested in hearing where this approach seems wrong or could be improved.
