The problem with dialogue datasets

Published: March 7, 2026 at 07:25 PM EST
4 min read
Source: Dev.to

Most dialogue datasets used to train and evaluate language models contain only text: a speaker label, a message, and sometimes a sentiment tag. This format works for many tasks, but it falls short when building systems that need to reason about people, not just respond to them.

Real conversations are driven by internal states that never appear in the transcript:

  • Beliefs about the other person that evolve with each exchange
  • Goals behind each message (e.g., seek validation, assert control, repair trust)
  • Relationship dynamics that shift across the conversation (trust, tension, connection)
  • Psychological identity that shapes how someone communicates under pressure

When a speaker says:

“I’m not upset about the meeting, I’m upset you didn’t tell me earlier.”

the text is visible, but the underlying drivers are not:

  • Belief that the other person withholds information (confidence: 0.74)
  • Goal to seek validation rather than escalate
  • Relationship state where trust has been eroding over the last four turns
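These hidden drivers can be written down as a small annotation record. A minimal sketch in Python, where the field names and the trend encoding are illustrative rather than any actual dataset schema:

```python
# Hypothetical annotation for the utterance above. Field names are
# illustrative, not an actual schema.
hidden_state = {
    "beliefs": {"other_withholds_info": 0.74},  # belief -> confidence in [0, 1]
    "goal": "seek_validation",                  # rather than "escalate"
    "relationship": {"trust_trend": "eroding", "over_last_turns": 4},
}

print(hidden_state["beliefs"]["other_withholds_info"])  # 0.74
```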

Without this information, a dataset can only tell you what happened, not why.

Training a conversational model on text‑only data leads it to imitate surface patterns—learning what responses look like, not what drives them. This works for simple tasks but creates a ceiling for anything that requires:

  • Tracking beliefs across multi‑turn conversations
  • Understanding how trust changes during conflict
  • Simulating how different personalities handle the same situation
  • Evaluating whether an agent’s internal reasoning matches its output

For these tasks, datasets need the internal structure to be explicit, not inferred after the fact.

StrataSynth: A Structured Approach

We are exploring a different approach with a project called StrataSynth. Instead of prompting an LLM to generate a conversation directly, the system first simulates a minimal cognitive model. The language model is used only at the final step to render decisions into natural language.

Pipeline Overview

PsycheGraph        → identity, attachment style, biases, voice
Belief Engine      → evolving beliefs with confidence scores
Relationship State → trust, tension, connection, dominance
Decision Engine    → intent, goal, communication act
LLM Rendering      → natural language
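The separation between the stages can be sketched in a few lines of Python. Everything here is illustrative: the function names and thresholds are not the StrataSynth API, and a template lookup stands in for the LLM rendering step:

```python
# Sketch of the decide-then-render separation. The state model picks the
# communication act; rendering only turns that decision into text.

def decide(beliefs, relationship):
    """Upstream decision: pick intent/goal/act from state. No LLM involved.

    Thresholds are made up for illustration.
    """
    if relationship["tension"] > 0.4 and beliefs["trust_other"] < 0.7:
        return {"intent": "reveal", "goal": "seek_validation",
                "communication_act": "accusation"}
    return {"intent": "share", "goal": "maintain_connection",
            "communication_act": "disclosure"}

def render(decision):
    """Final step: a template table stands in for the LLM call."""
    templates = {
        "accusation": "I'm not upset about the meeting. I'm upset you didn't tell me.",
        "disclosure": "The meeting caught me off guard, honestly.",
    }
    return templates[decision["communication_act"]]

decision = decide({"trust_other": 0.62}, {"tension": 0.44})
print(decision["communication_act"])  # accusation
print(render(decision))
```

Because the decision dict exists before any text does, the internal state is available as ground truth for every rendered turn.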

Key constraint: The LLM cannot decide what to believe or how to relate to the other agent; those decisions are made upstream by the state model. The LLM merely renders the decision into text. This separation ensures that the internal state is always explicit—it is the input that produced the output.

Example Turn (JSON)

{
  "speaker": "A",
  "text": "I'm not upset about the meeting. I'm upset you didn't tell me.",
  "intent": "reveal",
  "goal": "seek_validation",
  "communication_act": "accusation",
  "belief_delta": {
    "trust_other": -0.07
  },
  "relationship_state": {
    "trust": 0.62,
    "tension": 0.44,
    "connection": 0.38
  }
}
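To make the structure concrete, here is a sketch of consuming such a turn record in Python; the starting belief value of 0.69 is an assumption for illustration:

```python
import json

# Re-parse the example turn and apply its belief_delta to a running belief state.
turn = json.loads("""
{
  "speaker": "A",
  "text": "I'm not upset about the meeting. I'm upset you didn't tell me.",
  "intent": "reveal",
  "goal": "seek_validation",
  "communication_act": "accusation",
  "belief_delta": {"trust_other": -0.07},
  "relationship_state": {"trust": 0.62, "tension": 0.44, "connection": 0.38}
}
""")

beliefs = {"trust_other": 0.69}  # assumed prior value, for illustration
for name, delta in turn["belief_delta"].items():
    beliefs[name] = round(beliefs[name] + delta, 2)

print(beliefs["trust_other"])  # 0.62
```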

Across a full conversation, this produces trajectories such as:

  • Belief trajectory – how each belief changes turn by turn
  • Relationship trajectory – how trust and tension evolve across the arc
  • Behavioral entropy – how varied the speaker’s communication acts are
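Extracting those trajectories from a list of turn records is mechanical; a minimal sketch, assuming turns shaped like the JSON example above (the numbers here are made up):

```python
# Sketch: pulling trajectories out of a turn list. Turn values and the
# 0.69 starting belief are invented for illustration.
turns = [
    {"communication_act": "disclosure",
     "belief_delta": {"trust_other": 0.03},
     "relationship_state": {"trust": 0.72, "tension": 0.30}},
    {"communication_act": "accusation",
     "belief_delta": {"trust_other": -0.07},
     "relationship_state": {"trust": 0.62, "tension": 0.44}},
]

# Belief trajectory: cumulative application of per-turn deltas
belief_traj, trust_other = [], 0.69
for turn in turns:
    trust_other = round(trust_other + turn["belief_delta"].get("trust_other", 0.0), 2)
    belief_traj.append(trust_other)

# Relationship trajectory: the recorded state, turn by turn
relationship_traj = [turn["relationship_state"]["trust"] for turn in turns]

# Act sequence: the input to entropy-style behavioral metrics
acts = [turn["communication_act"] for turn in turns]

print(belief_traj)        # [0.72, 0.65]
print(relationship_traj)  # [0.72, 0.62]
```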

Evaluation Without LLM Self‑Scoring

We wanted to avoid evaluating synthetic data with the same LLM that generated it. LLM self‑evaluation can hide problems; a model that produces structurally inconsistent data may still rate it as high quality.

All quality metrics in StrataSynth are computed deterministically:

  • belief_consistency – correlation between communication acts and belief deltas (NumPy)
  • identity_stability – cosine similarity of communication distributions across turns (sentence‑transformers)
  • behavioral_entropy – Shannon entropy over communication act distributions
  • noise_rejection_rate – fraction of injected noise correctly isolated
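As one concrete example, behavioral entropy reduces to Shannon entropy over the observed distribution of communication acts. A stdlib sketch (the metric names above suggest the actual implementation relies on NumPy and sentence-transformers):

```python
import math
from collections import Counter

def behavioral_entropy(acts):
    """Shannon entropy (bits) of a speaker's communication-act distribution.

    0.0 means the speaker repeats one act; log2(k) means k acts used uniformly.
    """
    counts = Counter(acts)
    n = len(acts)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(behavioral_entropy(["accusation"] * 4))            # 0.0
print(behavioral_entropy(["accusation", "disclosure",
                          "withdrawal", "repair"]))      # 2.0
```

Because the computation is a closed-form function of the logged acts, the score cannot drift with the generator: the same conversation always receives the same number.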

No LLM scoring. No circular evaluation.

Published Datasets

We have released three prototype datasets on Hugging Face (15 conversations each):

  1. stratasynth-social-reasoning – family conflict, romantic trust repair, caregiver stress
  2. stratasynth-agent-stress-test – jealousy escalation, performance reviews, estrangement
  3. stratasynth-belief-dynamics – career transitions, mentorship conflict, relationship dissolution

The structure, not the volume, is the contribution we wanted to share.

Potential Applications

Structured social datasets could be useful for:

  • Evaluating whether an agent tracks belief changes correctly
  • Training models that need to reason about trust and conflict
  • Stress‑testing conversational systems with psychologically defined personas
  • Alignment research that requires explicit internal state as ground truth

Open Questions

The cognitive model is intentionally minimal:

  • 12 beliefs
  • 4 relationship dimensions
  • 10 communication acts

We are not sure whether this abstraction captures enough signal or is too crude an approximation. If you have worked on structured dialogue datasets, agent evaluation, or social reasoning benchmarks, we would be very interested in hearing where this approach seems wrong or could be improved.
