[Paper] Kinship Data Benchmark for Multi-hop Reasoning
Source: arXiv - 2601.07794v1
Overview
The paper introduces KinshipQA, a new benchmark that tests large language models (LLMs) on multi‑hop reasoning by asking them to infer relationships within realistic family trees. By generating culture‑specific genealogies on demand, the authors can systematically vary difficulty, depth of reasoning, and cultural assumptions, giving developers a fine‑grained tool to probe where their models succeed or stumble.
Key Contributions
- Generative genealogy pipeline – a fully automated method that creates large, plausible family trees respecting marriage rules of diverse kinship systems (e.g., patrilineal, matrilineal, polygamous).
- Scalable benchmark – KinshipQA can generate an arbitrary number of inference instances, letting researchers stress‑test models at whatever dataset scale they need.
- Controlled difficulty – task parameters (relation depth, cultural constraints, number of hops) are tunable, enabling targeted evaluation of specific reasoning capabilities.
- Zero‑shot evaluation suite – six state‑of‑the‑art LLMs (both open‑source and commercial) are benchmarked under a uniform deterministic decoding protocol, with exact‑match and set‑based metrics.
- Empirical insights – the benchmark reveals systematic performance gaps across models and highlights cultural bias in multi‑hop reasoning.
Methodology
- Genealogy Generation
  - The authors encode marriage and kinship rules for several cultural systems as logical constraints.
  - A constraint‑satisfaction generator samples individuals, assigns genders, creates marriages, and links children, producing a fully connected family tree.
- Task Derivation
  - From each tree, they automatically formulate natural‑language questions such as “Who is the great‑grandmother of X?” or “Is Y a cousin of Z?” that require traversing 1–5 relational hops.
  - Answers are expressed in a canonical form (e.g., “Alice”) and also as a set of acceptable synonyms to accommodate naming variations (a sketch of generation and question derivation follows this list).
- Evaluation Protocol
  - Six LLMs (GPT‑4, Claude‑2, Llama‑2‑70B, Mistral‑7B, etc.) receive the question and the raw genealogical description as context.
  - Models are run zero‑shot (no fine‑tuning) with deterministic decoding (temperature = 0) to ensure reproducibility.
  - Performance is measured using Exact Match (EM) and Set‑Based F1 to capture both strict correctness and partial credit for alternative valid answers (a metric sketch also follows this list).
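To make the generation and derivation steps concrete, here is a minimal Python sketch under a deliberately simplified rule set (monogamy, binary genders, every child linked to one married couple). The `Person` class, helper names, and the one‑to‑two‑children heuristic are illustrative assumptions, not the authors' pipeline; the paper's generator additionally encodes culture‑specific constraints such as matrilineal descent or polygamous marriage.

```python
import random

class Person:
    def __init__(self, name, gender):
        self.name = name
        self.gender = gender        # "F" or "M" in this simplified rule set
        self.spouse = None
        self.parents = []
        self.children = []

def marry(a, b):
    # Constraint: monogamy and, in this simplified rule set, opposite genders.
    assert a.spouse is None and b.spouse is None and a.gender != b.gender
    a.spouse, b.spouse = b, a

def add_child(mother, father, child):
    # Constraint: every child is linked to exactly one married couple.
    assert mother.spouse is father
    child.parents = [mother, father]
    mother.children.append(child)
    father.children.append(child)

def generate_tree(n_generations=4, seed=0):
    """Grow a connected family tree generation by generation."""
    rng = random.Random(seed)
    counter = 0
    def new_person(gender):
        nonlocal counter
        counter += 1
        return Person(f"P{counter}", gender)

    matriarch, patriarch = new_person("F"), new_person("M")
    marry(matriarch, patriarch)
    frontier, everyone = [(matriarch, patriarch)], [matriarch, patriarch]
    for _ in range(n_generations - 1):
        next_frontier = []
        for mother, father in frontier:
            for _ in range(rng.randint(1, 2)):      # 1-2 children per couple
                child = new_person(rng.choice("FM"))
                add_child(mother, father, child)
                spouse = new_person("M" if child.gender == "F" else "F")
                marry(child, spouse)
                # Keep (mother, father) ordering for the next generation.
                next_frontier.append((child, spouse) if child.gender == "F"
                                     else (spouse, child))
                everyone += [child, spouse]
        frontier = next_frontier
    return everyone

def great_grandmothers(person):
    """A 3-hop traversal: parent -> grandparent -> great-grandparent (female)."""
    result = set()
    for parent in person.parents:
        for grandparent in parent.parents:
            result.update(g.name for g in grandparent.parents if g.gender == "F")
    return result

people = generate_tree()
leaf = people[-2]                        # last-born child of the youngest generation
question = f"Who is the great-grandmother of {leaf.name}?"
gold_answers = great_grandmothers(leaf)  # canonical answer set read off the tree
print(question, gold_answers)
```

Because the gold answer is read directly off the generated tree, the same machinery yields questions at any hop depth simply by composing longer traversals.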
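The two reported metrics are also easy to sketch. The normalization below is an assumption (the paper's exact scoring script is not reproduced here); the point is that EM rewards only a strictly correct surface form, while Set‑Based F1 gives partial credit when a question has several valid relatives.

```python
def normalize(text):
    # Assumed normalization: lowercase, strip punctuation and extra whitespace.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ").strip()

def exact_match(prediction, gold_answers):
    """1.0 if the prediction matches any acceptable surface form, else 0.0."""
    return float(normalize(prediction) in {normalize(g) for g in gold_answers})

def set_f1(predicted_set, gold_set):
    """Set-based F1: partial credit when the model names some, but not all,
    valid relatives, or adds spurious ones."""
    pred = {normalize(p) for p in predicted_set}
    gold = {normalize(g) for g in gold_set}
    if not pred or not gold:
        return float(pred == gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: the model names one of two valid cousins.
print(exact_match("Alice", ["Alice", "Alice Smith"]))   # 1.0
print(set_f1({"Alice"}, {"Alice", "Bob"}))              # ~0.67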
Results & Findings
| Model | EM (avg.) | Set‑F1 (avg.) |
|---|---|---|
| GPT‑4 | 68 % | 81 % |
| Claude‑2 | 55 % | 73 % |
| Llama‑2‑70B | 42 % | 60 % |
| Mistral‑7B | 38 % | 57 % |
| … | … | … |
- Wide performance spread: The best commercial model (GPT‑4) outperforms open‑source counterparts by 20‑30 percentage points.
- Depth sensitivity: Accuracy drops sharply after three hops, indicating that current LLMs struggle with deeper relational chains.
- Cultural bias: Models trained primarily on Western text perform noticeably worse on kinship systems with non‑binary gender roles or polygamous marriage rules.
- Hallucination persists under deterministic decoding: Even with temperature = 0, some models invent relatives that do not appear in the tree, highlighting gaps in internal world‑model consistency.
Practical Implications
- Debugging reasoning pipelines: KinshipQA can serve as a synthetic stress test for any system that needs to combine multiple facts (e.g., knowledge‑graph QA, recommendation engines).
- Fine‑tuning data selection: The benchmark’s ability to generate unlimited, culture‑specific examples makes it a valuable source of targeted fine‑tuning data for improving multi‑hop reasoning.
- Bias auditing: By swapping cultural rule sets, developers can expose and quantify cultural blind spots in their models before deployment.
- Prompt engineering: The zero‑shot results suggest that carefully crafted prompts (e.g., explicit “trace the relationship step‑by‑step”) could mitigate depth‑related errors, a useful insight for building robust LLM‑driven assistants.
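As one illustration of that last point, here is a hedged sketch of a “trace the relationship step by step” prompt; the wording and the commented‑out `call_llm` stub are assumptions rather than the paper's prompts or any specific vendor API.

```python
STEP_BY_STEP_TEMPLATE = """You are given a family description and a question.

Family description:
{genealogy}

Question: {question}

Trace the relationship one hop at a time (parent, spouse, sibling, ...)
before answering. End with a single line of the form:
Answer: <name>"""

def build_prompt(genealogy: str, question: str) -> str:
    # Fill the template with the raw genealogical description and the question.
    return STEP_BY_STEP_TEMPLATE.format(genealogy=genealogy, question=question)

# Usage with a hypothetical client; temperature=0 mirrors the paper's
# deterministic-decoding protocol for reproducibility.
# response = call_llm(build_prompt(tree_text, "Who is the great-grandmother of P14?"),
#                     temperature=0)
```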
Limitations & Future Work
- Synthetic realism: Although the genealogies obey logical constraints, they lack the messiness of real‑world family data (e.g., adoption, name changes), which may limit external validity.
- Limited cultural scope: The current implementation covers a handful of kinship systems; expanding to more diverse societies would strengthen bias analyses.
- Zero‑shot focus: The study does not explore few‑shot prompting or fine‑tuning, leaving open the question of how much performance can be recovered with modest adaptation.
- Evaluation metrics: Exact‑match and set‑based scores ignore reasoning process quality; future work could incorporate chain‑of‑thought verification or programmatic checks.
KinshipQA opens a new avenue for rigorously probing LLM reasoning across cultural contexts, giving developers a practical tool to benchmark, debug, and improve their models before they go live.
Authors
- Tianda Sun
- Dimitar Kazakov
Paper Information
- arXiv ID: 2601.07794v1
- Categories: cs.CL, cs.AI
- Published: January 12, 2026