[Paper] Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts

Published: February 13, 2026
4 min read
Source: arXiv (2602.13102v1)

Overview

This paper explores how natural language processing (NLP) can be used to automatically assess the proficiency level of learner‑written Estonian texts, mapping them to the CEFR scale (A2‑C1). By focusing on carefully chosen linguistic features, the authors build models that are both highly accurate (≈ 90 % accuracy) and more interpretable for educators and developers of language‑learning tools.

Key Contributions

  • Feature‑driven, interpretable modeling: Demonstrates that a compact set of lexical, morphological, surface‑level, and error‑type features can rival larger, black‑box models in accuracy while offering clearer insight into why a text is classified at a certain level.
  • High‑performing CEFR classifier for Estonian: Achieves ~0.9 accuracy on a modern exam corpus and ~0.8 on a historic corpus spanning a decade, showing robustness across time.
  • Longitudinal language‑development analysis: Shows a measurable increase in text complexity in Estonian learner writing over a 7‑10‑year period.
  • Open‑source integration: The classifier has been embedded into an existing Estonian language‑learning platform, providing real‑time feedback to learners.

Methodology

  1. Data collection: Essays from official Estonian proficiency exams (levels A2, B1, B2, C1) were gathered, along with a smaller, older exam set for temporal validation.
  2. Feature engineering:
    • Lexical: type‑token ratio, average word length, frequency of high‑level vocabulary.
    • Morphological: suffix richness, case/agreement errors.
    • Surface: sentence length, paragraph count, punctuation usage.
    • Error‑type: counts of spelling, grammar, and collocation mistakes detected by rule‑based error taggers.
  3. Model training: Classical machine‑learning classifiers (Logistic Regression, SVM, Random Forest) were trained on the pre‑selected feature set. For comparison, the same classifiers were also trained on a larger “all‑features” set that included raw n‑grams and embeddings.
  4. Evaluation: 5‑fold cross‑validation on the main corpus, plus out‑of‑sample testing on the older exam data. Accuracy, macro‑F1, and confusion matrices were reported.
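The pipeline above can be sketched in a few lines of scikit-learn. The feature function below computes a handful of illustrative surface and lexical measures, not the paper's exact feature set, and the essays are replaced by random stand-in data since the exam corpus is not public:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def surface_lexical_features(text):
    """A few illustrative surface/lexical features (not the paper's exact set)."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    types = {w.lower().strip(".,!?") for w in words}
    n_words = max(len(words), 1)
    return [
        len(types) / n_words,                  # type-token ratio (lexical diversity)
        sum(len(w) for w in words) / n_words,  # average word length
        n_words / max(len(sentences), 1),      # average sentence length
    ]

# Stand-in data in place of real exam essays: 40 feature vectors,
# 10 per CEFR level, evaluated with 5-fold cross-validation as in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.repeat(["A2", "B1", "B2", "C1"], 10)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
```

On real essays, `X` would hold one feature vector per text (extended with the morphological and error-type counts the paper describes) and `y` the exam-assigned CEFR labels.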

Results & Findings

  • Accuracy: The best models (Random Forest & SVM) reached ≈ 0.90 accuracy on the contemporary test set. Using the compact feature set produced virtually the same performance as the full feature set.
  • Stability across genres: The pre‑selected features reduced variance when classifying different essay prompts, indicating better generalization.
  • Temporal shift: When applied to the older exam corpus, the same models still achieved ≈ 0.80 accuracy, while analysis of the feature values revealed a clear trend toward longer sentences, richer morphology, and fewer basic errors in newer writings.
  • Interpretability: Feature importance scores highlighted that error counts (especially agreement errors) and lexical diversity were the strongest predictors of higher CEFR levels.
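Feature-importance rankings of the kind reported above can be read directly off a trained Random Forest via scikit-learn's built-in Gini importances. The feature names and synthetic data below are illustrative only; the first feature is constructed to determine the label, mimicking a strong predictor such as agreement errors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: feature 0 fully determines the label here,
# so it should dominate the importance ranking.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

names = ["agreement_errors", "type_token_ratio", "avg_sentence_len"]
ranked = sorted(zip(names, clf.feature_importances_), key=lambda p: -p[1])
for name, imp in ranked:
    print(f"{name:18s} {imp:.3f}")
```

Because the features are hand-crafted linguistic measures rather than opaque embeddings, each importance score maps to a property an educator can act on.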

Practical Implications

  • Automated assessment pipelines: Developers can plug the lightweight feature‑based classifier into existing learning management systems (LMS) or language‑learning apps to provide instant CEFR‑aligned scoring without heavy GPU‑dependent models.
  • Targeted feedback: Because the model’s decisions are traceable to specific linguistic features, feedback can be phrased in pedagogically meaningful terms (e.g., “increase lexical variety” or “watch case agreement”).
  • Curriculum design: Educators can use the longitudinal findings to adjust teaching materials, focusing on the aspects that historically lag behind (e.g., complex morphology).
  • Resource‑efficient scaling: The approach works well for low‑resource languages like Estonian, where large pretrained language models are scarce, demonstrating a viable path for other under‑represented languages.
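A minimal sketch of how traceable features could be turned into the learner-facing hints mentioned above; the feature names and thresholds are hypothetical, not taken from the paper:

```python
def feedback(features):
    """Map feature values to pedagogical hints; thresholds are hypothetical."""
    tips = []
    if features.get("type_token_ratio", 1.0) < 0.5:
        tips.append("increase lexical variety")
    if features.get("agreement_errors", 0) > 2:
        tips.append("watch case agreement")
    if features.get("avg_sentence_len", 99) < 8:
        tips.append("try combining short sentences")
    return tips

print(feedback({"type_token_ratio": 0.4, "agreement_errors": 3,
                "avg_sentence_len": 12}))
# → ['increase lexical variety', 'watch case agreement']
```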

Limitations & Future Work

  • Feature dependence on error taggers: The quality of error‑type features hinges on the accuracy of rule‑based error detectors, which may miss subtle learner mistakes.
  • Prompt‑specific bias: Although variance was reduced, some residual prompt effects remain; future work could explore prompt‑agnostic representations.
  • Generalization beyond exams: The models were trained on formal exam essays; applying them to informal learner writing (e.g., forum posts) may require additional adaptation.
  • Deep learning comparison: The study focused on classical ML; benchmarking against transformer‑based models (e.g., multilingual BERT) could clarify trade‑offs between interpretability and raw performance.

Bottom line: By marrying linguistically informed feature engineering with solid machine‑learning practices, this research delivers a practical, transparent solution for automated CEFR assessment in Estonian—an approach that can be replicated for other languages and integrated into real‑world language‑learning products.

Authors

  • Kais Allkivi

Paper Information

  • arXiv ID: 2602.13102v1
  • Categories: cs.CL
  • Published: February 13, 2026
