[Paper] Kinship Data Benchmark for Multi-hop Reasoning
Source: arXiv - 2601.07794v1
Overview
The paper introduces KinshipQA, a new benchmark that tests large language models (LLMs) on multi‑hop reasoning by asking them to infer relationships within realistic family trees. By generating culture‑specific genealogies on demand, the authors can systematically vary difficulty, depth of reasoning, and cultural assumptions, giving developers a fine‑grained tool to probe where their models succeed or stumble.
Key Contributions
- Generative genealogy pipeline – a fully automated method that creates large, plausible family trees respecting marriage rules of diverse kinship systems (e.g., patrilineal, matrilineal, polygamous).
- Scalable benchmark – KinshipQA can generate an arbitrary number of inference instances, letting researchers stress‑test models at whatever dataset scale they need.
- Controlled difficulty – task parameters (relation depth, cultural constraints, number of hops) are tunable, enabling targeted evaluation of specific reasoning capabilities.
- Zero‑shot evaluation suite – six state‑of‑the‑art LLMs (both open‑source and commercial) are benchmarked under a uniform deterministic decoding protocol, with exact‑match and set‑based metrics.
- Empirical insights – the benchmark reveals systematic performance gaps across models and highlights cultural bias in multi‑hop reasoning.
Methodology
- Genealogy Generation
  - The authors encode marriage and kinship rules for several cultural systems as logical constraints.
  - A constraint‑satisfaction generator samples individuals, assigns genders, creates marriages, and links children, producing a fully connected family tree.
- Task Derivation
  - From each tree, they automatically formulate natural‑language questions such as “Who is the great‑grandmother of X?” or “Is Y a cousin of Z?” that require traversing 1–5 relational hops.
  - Answers are expressed in a canonical form (e.g., “Alice”) and also as a set of acceptable synonyms to accommodate naming variations (a sketch of generation and question derivation follows this list).
- Evaluation Protocol
  - Six LLMs (GPT‑4, Claude‑2, Llama‑2‑70B, Mistral‑7B, etc.) receive the question and the raw genealogical description as context.
  - Models are run zero‑shot (no fine‑tuning) with deterministic decoding (temperature = 0) to ensure reproducibility.
  - Performance is measured using Exact Match (EM) and Set‑Based F1 to capture both strict correctness and partial credit for alternative valid answers (a metric sketch also follows this list).
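To make the generation and derivation steps concrete, here is a minimal Python sketch under a deliberately simplified rule set (monogamy, binary genders, every child linked to one married couple). The `Person` class, helper names, and the one‑to‑two‑children heuristic are illustrative assumptions, not the authors' pipeline; the paper's generator additionally encodes culture‑specific constraints such as matrilineal descent or polygamous marriage.

```python
import random

class Person:
    def __init__(self, name, gender):
        self.name = name
        self.gender = gender        # "F" or "M" in this simplified rule set
        self.spouse = None
        self.parents = []
        self.children = []

def marry(a, b):
    # Constraint: monogamy and, in this simplified rule set, opposite genders.
    assert a.spouse is None and b.spouse is None and a.gender != b.gender
    a.spouse, b.spouse = b, a

def add_child(mother, father, child):
    # Constraint: every child is linked to exactly one married couple.
    assert mother.spouse is father
    child.parents = [mother, father]
    mother.children.append(child)
    father.children.append(child)

def generate_tree(n_generations=4, seed=0):
    """Grow a connected family tree generation by generation."""
    rng = random.Random(seed)
    counter = 0
    def new_person(gender):
        nonlocal counter
        counter += 1
        return Person(f"P{counter}", gender)

    matriarch, patriarch = new_person("F"), new_person("M")
    marry(matriarch, patriarch)
    frontier, everyone = [(matriarch, patriarch)], [matriarch, patriarch]
    for _ in range(n_generations - 1):
        next_frontier = []
        for mother, father in frontier:
            for _ in range(rng.randint(1, 2)):      # 1-2 children per couple
                child = new_person(rng.choice("FM"))
                add_child(mother, father, child)
                spouse = new_person("M" if child.gender == "F" else "F")
                marry(child, spouse)
                # Keep (mother, father) ordering for the next generation.
                next_frontier.append((child, spouse) if child.gender == "F"
                                     else (spouse, child))
                everyone += [child, spouse]
        frontier = next_frontier
    return everyone

def great_grandmothers(person):
    """A 3-hop traversal: parent -> grandparent -> great-grandparent (female)."""
    result = set()
    for parent in person.parents:
        for grandparent in parent.parents:
            result.update(g.name for g in grandparent.parents if g.gender == "F")
    return result

people = generate_tree()
leaf = people[-2]                        # last-born child of the youngest generation
question = f"Who is the great-grandmother of {leaf.name}?"
gold_answers = great_grandmothers(leaf)  # canonical answer set read off the tree
print(question, gold_answers)
```

Because the gold answer is read directly off the generated tree, the same machinery yields questions at any hop depth simply by composing longer traversals.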
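The two reported metrics are also easy to sketch. The normalization below is an assumption (the paper's exact scoring script is not reproduced here); the point is that EM rewards only a strictly correct surface form, while Set‑Based F1 gives partial credit when a question has several valid relatives.

```python
def normalize(text):
    # Assumed normalization: lowercase, strip punctuation and extra whitespace.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ").strip()

def exact_match(prediction, gold_answers):
    """1.0 if the prediction matches any acceptable surface form, else 0.0."""
    return float(normalize(prediction) in {normalize(g) for g in gold_answers})

def set_f1(predicted_set, gold_set):
    """Set-based F1: partial credit when the model names some, but not all,
    valid relatives, or adds spurious ones."""
    pred = {normalize(p) for p in predicted_set}
    gold = {normalize(g) for g in gold_set}
    if not pred or not gold:
        return float(pred == gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: the model names one of two valid cousins.
print(exact_match("Alice", ["Alice", "Alice Smith"]))   # 1.0
print(set_f1({"Alice"}, {"Alice", "Bob"}))              # ~0.67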
Results & Findings
| Model | EM (avg.) | Set‑F1 (avg.) |
|---|---|---|
| GPT‑4 | 68 % | 81 % |
| Claude‑2 | 55 % | 73 % |
| Llama‑2‑70B | 42 % | 60 % |
| Mistral‑7B | 38 % | 57 % |
| … | … | … |
- Wide performance spread: The best commercial model (GPT‑4) outperforms open‑source counterparts by 20‑30 percentage points.
- Depth sensitivity: Accuracy drops sharply after three hops, indicating that current LLMs struggle with deeper relational chains.
- Cultural bias: Models trained primarily on Western text perform noticeably worse on kinship systems with non‑binary gender roles or polygamous marriage rules.
- Hallucination persists under deterministic decoding: Even with temperature = 0, some models invent relatives that do not appear in the tree, highlighting gaps in internal world‑model consistency.
Practical Implications
- Debugging reasoning pipelines: KinshipQA can serve as a synthetic stress test for any system that needs to combine multiple facts (e.g., knowledge‑graph QA, recommendation engines).
- Fine‑tuning data selection: The benchmark’s ability to generate unlimited, culture‑specific examples makes it a valuable source of targeted fine‑tuning data for improving multi‑hop reasoning.
- Bias auditing: By swapping cultural rule sets, developers can expose and quantify cultural blind spots in their models before deployment.
- Prompt engineering: The zero‑shot results suggest that carefully crafted prompts (e.g., explicit “trace the relationship step‑by‑step”) could mitigate depth‑related errors, a useful insight for building robust LLM‑driven assistants.
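As one illustration of that last point, here is a hedged sketch of a “trace the relationship step by step” prompt; the wording and the commented‑out `call_llm` stub are assumptions rather than the paper's prompts or any specific vendor API.

```python
STEP_BY_STEP_TEMPLATE = """You are given a family description and a question.

Family description:
{genealogy}

Question: {question}

Trace the relationship one hop at a time (parent, spouse, sibling, ...)
before answering. End with a single line of the form:
Answer: <name>"""

def build_prompt(genealogy: str, question: str) -> str:
    # Fill the template with the raw genealogical description and the question.
    return STEP_BY_STEP_TEMPLATE.format(genealogy=genealogy, question=question)

# Usage with a hypothetical client; temperature=0 mirrors the paper's
# deterministic-decoding protocol for reproducibility.
# response = call_llm(build_prompt(tree_text, "Who is the great-grandmother of P14?"),
#                     temperature=0)
```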
Limitations & Future Work
- Synthetic realism: Although the genealogies obey logical constraints, they lack the messiness of real‑world family data (e.g., adoption, name changes), which may limit external validity.
- Limited cultural scope: The current implementation covers a handful of kinship systems; expanding to more diverse societies would strengthen bias analyses.
- Zero‑shot focus: The study does not explore few‑shot prompting or fine‑tuning, leaving open the question of how much performance can be recovered with modest adaptation.
- Evaluation metrics: Exact‑match and set‑based scores ignore reasoning process quality; future work could incorporate chain‑of‑thought verification or programmatic checks.
KinshipQA opens a new avenue for rigorously probing LLM reasoning across cultural contexts, giving developers a practical tool to benchmark, debug, and improve their models before they go live.
Authors
- Tianda Sun
- Dimitar Kazakov
Paper Information
- arXiv ID: 2601.07794v1
- Categories: cs.CL, cs.AI
- Published: January 12, 2026