[Paper] KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Source: arXiv - 2602.20135v1
Overview
The paper presents KNIGHT, a framework that turns any text source (e.g., Wikipedia) into a reusable knowledge graph and then uses large language models (LLMs) to generate high‑quality multiple‑choice questions (MCQs) on demand. By decoupling graph construction from question generation, KNIGHT makes it cheap and fast to produce large, difficulty‑controlled MCQ datasets for evaluating Retrieval‑Augmented Generation (RAG) systems and other LLM‑based applications.
Key Contributions
- Graph‑first pipeline: Builds a compact, topic‑specific knowledge graph from raw documents, enabling fast “read‑only” MCQ generation without repeatedly feeding the whole source text to an LLM.
- Adaptive hardness calibration: Allows the user (or instructor) to specify difficulty levels, including multi‑hop reasoning questions, by navigating the graph’s depth and relation complexity.
- Domain‑agnostic design: Works with any ontology; the authors demonstrate it on Wikipedia/Wikidata but the same code can be applied to corporate knowledge bases, textbooks, or API docs.
- Comprehensive quality evaluation: Introduces a five‑criterion rubric (fluency, unambiguity, relevance, option uniqueness, answerability) and shows that KNIGHT‑generated MCQs meet or exceed human‑crafted baselines.
- Cost‑efficiency analysis: Quantifies token and monetary savings versus naïve LLM prompting, showing that the graph‑reuse strategy cuts generation cost by roughly 70 % in the reported experiments.
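The five‑criterion rubric lends itself to automated filtering. The sketch below illustrates two of the criteria (option uniqueness and answerability) as programmatic checks; the `MCQ` structure and the function names are our own assumptions for illustration, not the paper's implementation.

```python
# Illustrative automatic checks for two of the five rubric criteria:
# option uniqueness and answerability. A question failing either check
# would be sent back to the LLM for regeneration.
from dataclasses import dataclass


@dataclass
class MCQ:
    stem: str
    options: list
    answer: str


def options_unique(mcq: MCQ) -> bool:
    """All four options must be distinct after normalization."""
    normalized = [o.strip().lower() for o in mcq.options]
    return len(set(normalized)) == len(normalized)


def answerable(mcq: MCQ) -> bool:
    """The designated correct answer must appear among the options."""
    return mcq.answer in mcq.options


def passes_checks(mcq: MCQ) -> bool:
    return options_unique(mcq) and answerable(mcq)


q = MCQ(
    stem="In which city was Marie Curie born?",
    options=["Warsaw", "Paris", "Krakow", "Vienna"],
    answer="Warsaw",
)
print(passes_checks(q))  # True: options are distinct and contain the answer
```

Fluency, unambiguity, and relevance are harder to verify mechanically; the paper pairs automated checks like these with LLM-based or human review for those criteria.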
Methodology
- Document Ingestion – Raw text (e.g., a Wikipedia article) is parsed and linked to a structured knowledge base (Wikidata). Entities and their relations are extracted using off‑the‑shelf entity linking and relation extraction models.
- Knowledge Graph Construction – The extracted triples are assembled into a directed graph where nodes are entities (concepts, dates, formulas) and edges are semantic relations (e.g., born‑in, causes, part‑of). The graph is pruned to keep only the most informative connections, yielding a lightweight representation.
- Difficulty Specification – Users select a target difficulty level. For “easy” questions, the generator samples a single‑hop edge (direct fact). For “hard” questions, it walks 2‑3 hops, forcing the LLM to combine multiple facts (multi‑hop reasoning).
- Prompt Engineering – A concise prompt containing the relevant sub‑graph (as a list of triples) and the desired difficulty is sent to an LLM (e.g., GPT‑4). The model returns a stem, four options, and the correct answer.
- Post‑processing & Validation – Automatic checks enforce the five quality criteria; ambiguous or duplicate options are regenerated.
Because the graph is static after step 2, generating thousands of questions only requires sending small sub‑graphs to the LLM, dramatically reducing token usage.
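The read‑only generation step can be sketched as follows: sample a sub‑graph by walking a chosen number of hops (the difficulty knob), then serialize the traversed triples into a short prompt. All names here (`Triple`, `sample_subgraph`, `build_prompt`) and the prompt wording are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of KNIGHT-style "read-only" MCQ generation: walk a
# pre-built knowledge graph for `hops` edges, then assemble a compact
# LLM prompt from only the traversed triples.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


def sample_subgraph(triples, start, hops, rng=random):
    """Walk `hops` edges from `start`, collecting the traversed triples."""
    by_head = {}
    for t in triples:
        by_head.setdefault(t.head, []).append(t)
    path, node = [], start
    for _ in range(hops):
        candidates = by_head.get(node)
        if not candidates:
            break
        step = rng.choice(candidates)
        path.append(step)
        node = step.tail
    return path


def build_prompt(subgraph, difficulty):
    facts = "\n".join(f"({t.head}, {t.relation}, {t.tail})" for t in subgraph)
    return (
        f"Using ONLY these facts:\n{facts}\n"
        f"Write one {difficulty} multiple-choice question with four options "
        f"and mark the correct answer."
    )


graph = [
    Triple("Marie Curie", "born-in", "Warsaw"),
    Triple("Warsaw", "capital-of", "Poland"),
    Triple("Poland", "part-of", "European Union"),
]
sub = sample_subgraph(graph, "Marie Curie", hops=2)  # 2 hops -> "hard"
print(build_prompt(sub, "hard"))
```

Note how the prompt contains only a handful of triples rather than the full source article; this is the mechanism behind the token savings reported below.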
Results & Findings
- Quality Scores: Across six MCQ datasets spanning domains such as History, Biology, and Mathematics, KNIGHT achieved average scores of 4.6/5 on fluency, 4.8/5 on unambiguity, 4.5/5 on relevance, 4.7/5 on option uniqueness, and 4.4/5 on answerability.
- Cost Savings: Compared to a baseline that feeds the full source text for each question, KNIGHT reduced the average token count per question from ~1,200 tokens to ~350 tokens, translating to ~68 % lower API cost.
- Difficulty Calibration: Human evaluators correctly identified the intended difficulty level 82 % of the time, confirming that multi‑hop graph traversal yields genuinely harder questions.
- Benchmark Alignment: When the generated MCQs were used to evaluate LLMs, the resulting rankings matched those from established MMLU‑style benchmarks (±1 rank position), indicating that the synthetic data is a reliable proxy for real‑world assessments.
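A back‑of‑the‑envelope check makes the token and cost figures concrete. Only the 1,200 and 350 prompt‑token counts come from the paper; the assumed 100 output tokens per question is our illustrative guess, which explains why the total‑cost reduction lands a few points below the raw prompt‑token reduction.

```python
# Sanity check of the reported savings, assuming output tokens are
# unchanged between pipelines and only prompt tokens shrink.
baseline_prompt, knight_prompt = 1200, 350  # tokens per question (from the paper)
output_tokens = 100  # assumed constant across both pipelines (our guess)

token_reduction = 1 - knight_prompt / baseline_prompt
cost_reduction = 1 - (knight_prompt + output_tokens) / (baseline_prompt + output_tokens)

print(f"prompt-token reduction: {token_reduction:.0%}")  # 71%
print(f"total-cost reduction:   {cost_reduction:.0%}")   # 65% under these assumptions
```

The result is consistent with the ~68 % API‑cost saving reported, since real per‑token pricing typically weights output tokens differently from input tokens.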
Practical Implications
- Rapid Test Set Creation: Companies can spin up domain‑specific MCQ suites (e.g., for internal knowledge bases, product documentation) in hours instead of weeks, facilitating continuous evaluation of RAG pipelines.
- Adaptive Training Curricula: Educational platforms can automatically generate practice quizzes that adapt to a learner’s proficiency by selecting appropriate graph depths.
- Cost‑Effective Model Auditing: Auditors can probe LLMs with targeted “hard” questions without incurring the high compute cost of re‑processing large corpora each time.
- Cross‑Domain Portability: Because the pipeline only requires an entity‑relation extractor and a knowledge base, developers can apply KNIGHT to niche domains such as legal statutes, medical guidelines, or software APIs.
Limitations & Future Work
- Graph Quality Dependency: The approach inherits any errors from the upstream entity linking and relation extraction steps; noisy graphs can lead to ambiguous or factually incorrect questions.
- Ontology Alignment: Although the framework is described as domain‑agnostic, the current implementation assumes a fairly clean, hierarchical ontology (like Wikidata). Highly unstructured corpora may need custom schema design.
- Scalability of Multi‑Hop Reasoning: As hop count grows, the sub‑graph size expands, eroding some of the token‑saving benefits. Future work could explore graph summarization techniques or hierarchical prompting to keep prompts short.
- Human Validation Loop: The study relied on automated metrics plus limited human review. A larger‑scale user study would solidify claims about difficulty perception and educational effectiveness.
Bottom line: KNIGHT shows that a modest upfront investment in a knowledge graph can pay off handsomely, turning LLMs into inexpensive, on‑demand MCQ generators that keep pace with the rapid iteration cycles of modern AI products.
Authors
- Mohammad Amanlou
- Erfan Shafiee Moghaddam
- Yasaman Amou Jafari
- Mahdi Noori
- Farhan Farsi
- Behnam Bahrak
Paper Information
- arXiv ID: 2602.20135v1
- Categories: cs.CL, cs.AI, cs.IR
- Published: February 23, 2026