[Paper] Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation

Published: May 7, 2026 at 01:39 PM EDT
Source: arXiv - 2605.06625v1

Overview

The paper introduces a lightweight “human‑in‑the‑loop” pipeline for annotating second‑language (L2) Korean with Universal Dependencies (UD). Two specially‑trained parsers each parse every sentence, and the authors show that agreement between them can reliably stand in for manual checks, which sharply reduces the human effort needed to build high‑quality L2‑Korean treebanks.

Key Contributions

  • Agreement‑based quality proxy: Demonstrates that when two domain‑adapted parsers concur, their output matches human judgments with high accuracy.
  • Simplified annotation workflow: Proposes a practical semi‑automatic pipeline that only requires human review of disagreement cases.
  • Error‑type analysis: Shows that most parser disagreements fall into predictable linguistic categories (e.g., grammatical‑relation ambiguities, clause‑boundary decisions).
  • Iterative refinement roadmap: Identifies which disagreement patterns can be resolved by further model training versus those that expose deeper representational limits.

Methodology

  1. Data & Models – The authors start from an existing L2‑Korean corpus and fine‑tune two independent dependency parsers on a small, manually annotated seed set.
  2. Agreement Check – For each new sentence, both parsers produce a full UD parse. If the parses are identical (tokenization, POS tags, and dependency arcs), the sentence is automatically accepted.
  3. Human Validation – Sentences with mismatching parses are sent to linguists for verification. The linguists’ judgments are then compared against the parsers’ consensus decisions to assess how well agreement predicts correctness.
  4. Error Categorization – Disagreement cases are manually grouped into linguistic phenomena (e.g., ambiguous case particles, ellipsis, clause‑boundary splits) to understand systematic weaknesses.

The workflow is deliberately simple. It uses no complex confidence scoring, active learning loops, or crowdsourcing, just a binary “agree/disagree” gate that determines whether a human is needed.
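The agree/disagree gate can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Token` structure, the `route` helper, and the toy sentence with its two analyses are all assumptions made up for the example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str    # surface token (tokenization)
    upos: str    # UD part-of-speech tag
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str  # UD dependency relation label


def parses_agree(parse_a: list[Token], parse_b: list[Token]) -> bool:
    """True only if tokenization, POS tags, and dependency arcs all match."""
    return parse_a == parse_b


def route(batch):
    """Split (sent_id, parse_a, parse_b) triples into two queues."""
    accepted, needs_review = [], []
    for sent_id, parse_a, parse_b in batch:
        if parses_agree(parse_a, parse_b):
            accepted.append((sent_id, parse_a))               # auto-accept
        else:
            needs_review.append((sent_id, parse_a, parse_b))  # send to a human
    return accepted, needs_review


# Hypothetical output of two parsers for one toy sentence.
parse_a = [
    Token("저는", "PRON", 3, "nsubj"),
    Token("학교에", "NOUN", 3, "obl"),
    Token("가요", "VERB", 0, "root"),
]
parse_b = [
    Token("저는", "PRON", 3, "dislocated"),  # topic reading instead of subject
    Token("학교에", "NOUN", 3, "obl"),
    Token("가요", "VERB", 0, "root"),
]

batch = [("s1", parse_a, parse_a), ("s2", parse_a, parse_b)]
accepted, needs_review = route(batch)
print([sent_id for sent_id, _ in accepted])       # ['s1']
print([sent_id for sent_id, *_ in needs_review])  # ['s2']
```

Because the comparison is strict equality over every token, a single differing label is enough to route a sentence to review, which matches the binary nature of the gate described above.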

Results & Findings

  • High correspondence: In more than 90% of cases where the two parsers agreed, human annotators also marked the parse as correct.
  • Disagreement concentration: Over 70% of disagreements clustered around a handful of linguistic issues, such as distinguishing subject from topic relations or handling the omitted subjects common in learner Korean.
  • Iterative gains: Retraining the parsers on a modest set of previously disagreed sentences reduced the overall disagreement rate by roughly 15% after one iteration.
  • Hard cases: Some disagreements persisted even after multiple refinements, pointing to ambiguities that may require changes to the underlying UD schema rather than just better models.
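For readers who want to compute the same kinds of figures on their own data, here is a rough sketch of the three quantities reported above. The function names and all numbers in the example are invented for illustration and are not taken from the paper. The summary does not say whether the 15% drop is relative or absolute, so `relative_reduction` shows only one possible reading.

```python
from collections import Counter


def agreement_precision(judgments):
    """Share of parser-agreed sentences that humans marked correct.

    judgments: iterable of (parsers_agree, human_correct) boolean pairs.
    """
    agreed = [human_ok for parsers_agree, human_ok in judgments if parsers_agree]
    if not agreed:
        raise ValueError("no parser-agreed sentences to evaluate")
    return sum(agreed) / len(agreed)


def category_shares(labels):
    """Fraction of disagreement cases falling into each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


def relative_reduction(rate_before, rate_after):
    """Relative drop in the disagreement rate between two iterations."""
    return (rate_before - rate_after) / rate_before


# Toy data, made up for this example.
judgments = [(True, True)] * 9 + [(True, False)] + [(False, False)] * 5
labels = ["subject_vs_topic"] * 4 + ["omitted_subject"] * 3 + ["other"] * 3

print(agreement_precision(judgments))               # 0.9
print(category_shares(labels))
print(round(relative_reduction(0.40, 0.34), 2))     # 0.15
```

Note that `agreement_precision` needs human judgments on at least a sample of the auto-accepted sentences, so measuring it requires some review beyond the disagreement queue.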

Practical Implications

  • Faster treebank creation: Development teams can bootstrap L2‑Korean UD resources with far fewer annotation hours, accelerating downstream NLP tasks like grammar checking or learner feedback systems.
  • Cost‑effective quality control: The agreement gate acts as an automatic sanity check, allowing project managers to allocate human reviewers only where they add the most value.
  • Transferable recipe: The same “dual‑parser agreement” strategy can be applied to other low‑resource or learner languages, offering a template for semi‑automatic corpus building in multilingual settings.
  • Better learner‑focused tools: High‑quality L2‑Korean parses enable more accurate error detection, automated writing assistance, and adaptive language‑learning platforms.

Limitations & Future Work

  • Domain dependence: The approach relies on having two reasonably strong parsers; building those initial models still requires a seed of manually annotated data.
  • Schema constraints: Some persistent disagreements stem from UD’s representation limits for learner language, suggesting that schema extensions or alternative annotation layers may be needed.
  • Scalability of error analysis: While the paper categorizes disagreement types, automating this categorization for large corpora remains an open challenge.
  • Future directions: The authors propose exploring confidence‑weighted voting, active learning to select the most informative disagreement cases, and extending the workflow to other morphologically rich L2 languages.
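To make the confidence‑weighted voting idea concrete, the sketch below shows one way such a scheme could look. It is speculative and not something the paper specifies: it assumes both parsers share a tokenization and expose a per‑arc confidence between 0 and 1, and the `weighted_vote` function and its `margin` threshold are hypothetical.

```python
def weighted_vote(arcs_a, arcs_b, margin=0.2):
    """Merge two parses arc by arc, preferring the more confident parser.

    arcs_a, arcs_b: one (head, deprel, confidence) tuple per token.
    Returns (merged_arcs, review_indices); a token index is flagged for
    human review when the parsers differ and the confidence gap is small.
    """
    if len(arcs_a) != len(arcs_b):
        raise ValueError("parses must share the same tokenization")
    merged, review = [], []
    for i, (a, b) in enumerate(zip(arcs_a, arcs_b)):
        if a[:2] == b[:2]:          # same head and label: nothing to vote on
            merged.append(a[:2])
            continue
        winner, loser = (a, b) if a[2] >= b[2] else (b, a)
        merged.append(winner[:2])
        if winner[2] - loser[2] < margin:
            review.append(i)        # too close to call automatically
    return merged, review


# Hypothetical arcs and confidences for a three-token sentence.
arcs_a = [(3, "nsubj", 0.90), (3, "obl", 0.80), (0, "root", 0.99)]
arcs_b = [(3, "dislocated", 0.60), (3, "obj", 0.75), (0, "root", 0.98)]

merged, review = weighted_vote(arcs_a, arcs_b)
print(merged)  # [(3, 'nsubj'), (3, 'obl'), (0, 'root')]
print(review)  # [1]
```

Compared with the binary gate, a scheme like this would send fewer items to reviewers but introduces a threshold that has to be tuned, which is the kind of complexity the current workflow avoids.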

Authors

  • Hakyung Sung
  • Gyu-Ho Shin

Paper Information

  • arXiv ID: 2605.06625v1
  • Categories: cs.CL
  • Published: May 7, 2026