[Paper] Automatic Essay Scoring and Feedback Generation in Basque Language Learning
Source: arXiv - 2512.08713v1
Overview
A new open‑source benchmark for Automatic Essay Scoring (AES) and feedback generation in the Basque language has been released. The authors provide a sizable, expert‑annotated corpus of 3,200 CEFR‑C1 essays and demonstrate that fine‑tuned Basque language models can outperform leading closed‑source LLMs on both scoring consistency and pedagogical feedback quality.
Key Contributions
- First public Basque AES dataset (3,200 essays) with multi‑dimensional scores (correctness, richness, coherence, cohesion, task alignment) plus detailed feedback and error examples; an illustrative record layout is sketched after this list.
- Fine‑tuned Basque models: RoBERTa‑EusCrawl and the large‑scale Latxa (8B and 70B) models adapted for scoring and feedback generation.
- Supervised fine‑tuning (SFT) pipeline that lifts Latxa’s performance above proprietary systems such as GPT‑5 and Claude Sonnet 4.5.
- Novel evaluation framework for feedback: combines automatic consistency checks with expert validation of extracted learner errors.
- Open‑source release of data, code, and trained checkpoints, enabling reproducible research on low‑resource languages.
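For orientation, the record layout below illustrates how one annotated essay could be represented; the field names and values are assumptions made for this sketch, not the released dataset's actual schema.

```python
# Hypothetical layout of one annotated essay record (illustrative only;
# field names are assumptions, consult the released dataset for the real schema).
example_record = {
    "essay": "<learner-written Basque text>",
    "scores": {                      # five criterion-level scores
        "correctness": 4,
        "richness": 3,
        "coherence": 4,
        "cohesion": 3,
        "task_alignment": 5,
    },
    "feedback": "<targeted comments written by the annotator>",
    "error_spans": [                 # specific error examples marked in the text
        {"span": "<erroneous fragment>", "comment": "<what is wrong and why>"},
    ],
}
```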
Methodology
- Data collection & annotation – Essays were sourced from the HABE Basque‑language proficiency exam platform. Trained linguists scored each essay on five criteria and wrote targeted feedback, marking specific error spans.
- Model selection – Two families were explored:
  - Encoder‑only (RoBERTa‑EusCrawl) for pure scoring.
  - Decoder‑only (Latxa 8B & 70B) for joint scoring + feedback generation.
- Supervised fine‑tuning (SFT) – The models were trained on the annotated pairs (essay → scores + feedback) using a multi‑task loss that balances regression (score prediction) and sequence‑to‑sequence generation (feedback); a minimal sketch of this setup appears at the end of this section.
- Evaluation –
  - Scoring: Pearson/Spearman correlation with human scores, Quadratic Weighted Kappa (QWK).
  - Feedback: Automatic consistency (does feedback reference the annotated error spans?) plus a blind expert review of a sampled subset to rate pedagogical relevance and error coverage.
All steps are implemented with Hugging Face 🤗 Transformers and PyTorch, and the training scripts are containerized for easy replication.
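As a rough illustration of how such a multi‑task objective can be wired up with Transformers and PyTorch, the sketch below adds a regression head on top of a causal language model; the pooling choice, the head itself, and the weighting factor `alpha` are assumptions made for this example rather than details taken from the paper.

```python
# Minimal multi-task SFT sketch (illustrative; not the authors' exact code).
# A causal LM learns the feedback text with the usual LM loss, while an added
# linear head predicts the five criterion scores from the final hidden state.
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

class ScoringAndFeedbackModel(nn.Module):
    def __init__(self, base_model_name: str, num_criteria: int = 5):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(base_model_name)
        self.score_head = nn.Linear(self.lm.config.hidden_size, num_criteria)

    def forward(self, input_ids, attention_mask, labels=None, scores=None, alpha=0.5):
        out = self.lm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,              # token targets for feedback generation
            output_hidden_states=True,
        )
        # Pool the last token's final-layer hidden state (simplest choice for a sketch).
        pooled = out.hidden_states[-1][:, -1, :]
        pred_scores = self.score_head(pooled)             # (batch, num_criteria)

        loss = None
        if labels is not None and scores is not None:
            lm_loss = out.loss                            # sequence-to-sequence term
            reg_loss = F.mse_loss(pred_scores, scores.float())  # regression term
            loss = alpha * lm_loss + (1 - alpha) * reg_loss     # multi-task balance
        return loss, pred_scores
```

An equally plausible variant drops the regression head and simply asks the decoder to emit the numeric scores as text; the sketch above just mirrors the regression‑plus‑generation framing described here.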
Results & Findings
| Model | Scoring QWK | Avg. Pearson r | Feedback Consistency (auto) | Expert‑rated Pedagogical Score |
|---|---|---|---|---|
| RoBERTa‑EusCrawl (encoder) | 0.84 | 0.78 | – | – |
| Latxa‑8B (SFT) | 0.88 | 0.82 | 0.71 | 4.3 / 5 |
| Latxa‑70B (SFT) | 0.91 | 0.86 | 0.78 | 4.6 / 5 |
| GPT‑5 (closed) | 0.86 | 0.80 | 0.62 | 3.9 / 5 |
| Claude Sonnet 4.5 (closed) | 0.85 | 0.79 | 0.65 | 4.0 / 5 |
- Scoring: The fine‑tuned Latxa models achieve higher QWK and correlation than the best commercial LLMs, confirming that domain‑specific SFT beats generic prompting for low‑resource languages.
- Feedback: Latxa‑70B not only aligns its comments with the annotated error spans (78 % consistency) but also surfaces a broader variety of error types (grammar, lexical choice, discourse cohesion) that experts rated as highly pedagogically useful.
- Efficiency: Encoder‑only RoBERTa runs inference at ~150 ms per essay on a single V100, while Latxa‑70B needs ~1.2 s on an A100 – still feasible for batch processing in educational platforms.
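The scoring metrics in the table are standard and easy to reproduce with common libraries; the snippet below is a minimal sketch on hypothetical score lists, and the span‑coverage function is only a crude stand‑in for the paper's automatic consistency check, not its actual definition.

```python
# Sketch of the reported scoring metrics (QWK, Pearson r) plus a naive
# stand-in for the automatic feedback-consistency check.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 5, 4, 3]   # hypothetical integer criterion scores
model_scores = [3, 4, 3, 5, 4, 2]

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
r, _ = pearsonr(human_scores, model_scores)
print(f"QWK = {qwk:.3f}, Pearson r = {r:.3f}")

def span_coverage(feedback: str, error_spans: list[str]) -> float:
    """Fraction of annotated error spans mentioned verbatim in the feedback.
    A crude proxy; the paper's consistency metric may be defined differently."""
    if not error_spans:
        return 0.0
    return sum(span in feedback for span in error_spans) / len(error_spans)
```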
Practical Implications
- EdTech platforms can integrate the released Latxa checkpoints to provide real‑time, criterion‑aligned scoring for Basque learners, reducing reliance on costly human raters.
- Feedback generation enables automated, actionable comments that help learners understand why they lost points, a step beyond raw scores.
- The open dataset serves as a training ground for other low‑resource languages; developers can adapt the same pipeline to Spanish, Catalan, or indigenous languages with modest annotation effort.
- Compliance & transparency: Because the models are open‑source, institutions can audit the scoring logic, address bias concerns, and comply with data‑privacy regulations that prohibit sending student text to proprietary APIs.
- Scalable deployment: The encoder model can be used for high‑throughput batch scoring (e.g., nightly grading of thousands of essays), while the larger Latxa model can be reserved for on‑demand feedback where richer explanations are needed; one possible integration pattern is sketched below.
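One way such a two‑tier setup could look with the Transformers pipeline API is shown below; the checkpoint paths and prompt are placeholders, not the authors' published identifiers, so substitute the names from the open‑source release.

```python
# Hypothetical two-tier deployment: cheap batch scoring with the encoder model,
# on-demand feedback with a fine-tuned Latxa checkpoint. Model paths are placeholders.
from transformers import pipeline

# High-throughput path: nightly batch scoring of many essays.
scorer = pipeline("text-classification", model="path/to/roberta-euscrawl-aes")
batch_scores = scorer(["Essay text 1 ...", "Essay text 2 ..."], truncation=True)

# On-demand path: richer, criterion-aligned feedback for a single essay.
feedback_gen = pipeline("text-generation", model="path/to/latxa-8b-aes-sft")
feedback = feedback_gen(
    "Evaluate the following Basque essay and give criterion-aligned feedback:\n<essay text>",
    max_new_tokens=256,
)
```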
Limitations & Future Work
- Domain coverage: Essays are limited to the CEFR‑C1 level and to topics used in the HABE exam; performance on lower‑proficiency or out‑of‑domain prompts remains untested.
- Error taxonomy: While the annotation schema is comprehensive, it may miss nuanced pragmatic errors (e.g., register mismatches) that learners often make.
- Model size vs. latency: The 70 B model delivers the best feedback but still incurs noticeable latency; future work could explore distillation or retrieval‑augmented generation to keep quality while cutting inference time.
- Cross‑lingual transfer: The authors suggest probing whether multilingual fine‑tuning (e.g., with Basque‑Spanish parallel data) could further boost performance, especially for learners who code‑switch.
Overall, this work establishes a solid, reproducible baseline for Basque AES and opens the door for practical, AI‑driven language assessment tools in low‑resource settings.
Authors
- Ekhi Azurmendi
- Xabier Arregi
- Oier Lopez de Lacalle
Paper Information
- arXiv ID: 2512.08713v1
- Categories: cs.CL, cs.AI
- Published: December 9, 2025