The Problem with Dialogue Datasets
Source: Dev.to
Most dialogue datasets used to train and evaluate language models contain only text: a speaker label, a message, and sometimes a sentiment tag. This format works for many tasks, but it falls short when building systems that need to reason about people, not just respond to them.
Real conversations are driven by internal states that never appear in the transcript:
- Beliefs about the other person that evolve with each exchange
- Goals behind each message (e.g., seek validation, assert control, repair trust)
- Relationship dynamics that shift across the conversation (trust, tension, connection)
- Psychological identity that shapes how someone communicates under pressure
When a speaker says:
“I’m not upset about the meeting, I’m upset you didn’t tell me earlier.”
the text is visible, but the underlying drivers are not:
- Belief that the other person withholds information (confidence: 0.74)
- Goal to seek validation rather than escalate
- Relationship state where trust has been eroding over the last four turns
Without this information, a dataset can only tell you what happened, not why.
Training a conversational model on text‑only data leads it to imitate surface patterns—learning what responses look like, not what drives them. This works for simple tasks but creates a ceiling for anything that requires:
- Tracking beliefs across multi‑turn conversations
- Understanding how trust changes during conflict
- Simulating how different personalities handle the same situation
- Evaluating whether an agent’s internal reasoning matches its output
For these tasks, datasets need the internal structure to be explicit, not inferred after the fact.
StrataSynth: A Structured Approach
We are exploring a different approach with a project called StrataSynth. Instead of prompting an LLM to generate a conversation directly, the system first simulates a minimal cognitive model. The language model is used only at the final step to render decisions into natural language.
Pipeline Overview
PsycheGraph → identity, attachment style, biases, voice
Belief Engine → evolving beliefs with confidence scores
Relationship State → trust, tension, connection, dominance
Decision Engine → intent, goal, communication act
LLM Rendering → natural language
Key constraint: The LLM cannot decide what to believe or how to relate to the other agent; those decisions are made upstream by the state model. The LLM merely renders the decision into text. This separation ensures that the internal state is always explicit—it is the input that produced the output.
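A minimal sketch of this separation follows; the class names, fields, and threshold values are illustrative assumptions, not StrataSynth's actual API. The point is that the decision is fixed before any language model is involved:

```python
from dataclasses import dataclass

@dataclass
class RelationshipState:
    trust: float
    tension: float
    connection: float
    dominance: float

@dataclass
class Decision:
    intent: str
    goal: str
    communication_act: str

def decision_engine(state: RelationshipState) -> Decision:
    # Toy policy: low trust combined with high tension pushes the
    # speaker toward an accusatory act; otherwise they disclose.
    if state.trust < 0.5 and state.tension > 0.4:
        return Decision("reveal", "seek_validation", "accusation")
    return Decision("share", "build_connection", "disclosure")

def render(decision: Decision, state: RelationshipState) -> str:
    # Stand-in for the LLM call: the prompt contains only the upstream
    # decision, so the renderer cannot change beliefs or relationship state.
    return (
        f"Write one line of dialogue performing a '{decision.communication_act}' "
        f"with goal '{decision.goal}' (trust={state.trust:.2f})."
    )

state = RelationshipState(trust=0.42, tension=0.55, connection=0.38, dominance=0.50)
decision = decision_engine(state)
print(decision.communication_act)  # -> accusation
```

Because the decision object is the input that produced the text, it can be stored alongside the rendered turn, which is what makes the internal state explicit in the dataset.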
Example Turn (JSON)
{
  "speaker": "A",
  "text": "I'm not upset about the meeting. I'm upset you didn't tell me.",
  "intent": "reveal",
  "goal": "seek_validation",
  "communication_act": "accusation",
  "belief_delta": {
    "trust_other": -0.07
  },
  "relationship_state": {
    "trust": 0.62,
    "tension": 0.44,
    "connection": 0.38
  }
}
Across a full conversation, this produces trajectories such as:
- Belief trajectory – how each belief changes turn by turn
- Relationship trajectory – how trust and tension evolve across the arc
- Behavioral entropy – how varied the speaker’s communication acts are
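As a minimal sketch, these trajectories fall out of a single pass over the turn records; the example turns below are hypothetical, shaped like the JSON schema above:

```python
from collections import Counter

# Hypothetical turn records shaped like the example JSON above.
turns = [
    {"communication_act": "deflection",
     "relationship_state": {"trust": 0.70, "tension": 0.30}},
    {"communication_act": "accusation",
     "relationship_state": {"trust": 0.62, "tension": 0.44}},
    {"communication_act": "accusation",
     "relationship_state": {"trust": 0.55, "tension": 0.51}},
]

# Relationship trajectory: one value per turn for each dimension.
trust_trajectory = [t["relationship_state"]["trust"] for t in turns]
tension_trajectory = [t["relationship_state"]["tension"] for t in turns]

# Communication-act counts, the raw material for a behavioral measure.
act_counts = Counter(t["communication_act"] for t in turns)

print(trust_trajectory)  # -> [0.7, 0.62, 0.55]
```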
Evaluation Without LLM Self‑Scoring
We wanted to avoid evaluating synthetic data with the same LLM that generated it. LLM self‑evaluation can hide problems; a model that produces structurally inconsistent data may still rate it as high quality.
All quality metrics in StrataSynth are computed deterministically:
- belief_consistency – correlation between communication acts and belief deltas (NumPy)
- identity_stability – cosine similarity of communication distributions across turns (sentence‑transformers)
- behavioral_entropy – Shannon entropy over communication act distributions
- noise_rejection_rate – fraction of injected noise correctly isolated
No LLM scoring. No circular evaluation.
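For illustration, here is how two of these metrics could be computed deterministically with NumPy. The numeric encoding of communication acts passed to belief_consistency is our own simplifying assumption; the actual implementation may encode acts differently:

```python
import numpy as np
from collections import Counter

def behavioral_entropy(acts):
    """Shannon entropy (in bits) of the speaker's communication-act distribution."""
    counts = np.array(list(Counter(acts).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def belief_consistency(act_scores, belief_deltas):
    """Pearson correlation between a numeric encoding of acts and belief deltas."""
    return float(np.corrcoef(act_scores, belief_deltas)[0, 1])

acts = ["accusation", "accusation", "repair", "disclosure"]
print(round(behavioral_entropy(acts), 3))  # -> 1.5
```

Because both functions are pure NumPy, the scores are reproducible and independent of any model, which is the point of avoiding LLM self-scoring.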
Published Datasets
We have released three initial prototype datasets on Hugging Face (each contains 15 conversations):
- stratasynth-social-reasoning – family conflict, romantic trust repair, caregiver stress
- stratasynth-agent-stress-test – jealousy escalation, performance reviews, estrangement
- stratasynth-belief-dynamics – career transitions, mentorship conflict, relationship dissolution
The structure, not the volume, is the contribution we wanted to share.
Potential Applications
Structured social datasets could be useful for:
- Evaluating whether an agent tracks belief changes correctly
- Training models that need to reason about trust and conflict
- Stress‑testing conversational systems with psychologically defined personas
- Alignment research that requires explicit internal state as ground truth
Open Questions
The cognitive model is intentionally minimal:
- 12 beliefs
- 4 relationship dimensions
- 10 communication acts
We are not sure whether this abstraction provides enough signal for the tasks above, or whether it is too crude an approximation of real social cognition. If you have worked on structured dialogue datasets, agent evaluation, or social reasoning benchmarks, we would be very interested in hearing where this approach seems wrong or how it could be improved.