[Paper] CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes
Source: arXiv - 2602.16607v1
Overview
The paper introduces CitiLink‑Summ, the first publicly available corpus of European Portuguese municipal meeting minutes paired with thousands of manually written subject‑level summaries. By releasing this resource together with baseline experiments using modern summarization models, the authors open a new avenue for NLP research on dense administrative texts that are otherwise hard for citizens to digest.
Key Contributions
- New Dataset: 100 municipal meeting minutes (≈ 2 M words) annotated with 2,322 high‑quality, hand‑crafted summaries, each aligned to a specific discussion subject.
- First Benchmark: Establishes the inaugural evaluation suite for subject‑level summarization in European Portuguese municipal documents.
- Baseline Experiments: Fine‑tunes and tests state‑of‑the‑art generative models (BART, PRIMERA) and large language models (LLMs) on the corpus.
- Comprehensive Evaluation: Reports results using lexical (ROUGE, BLEU, METEOR) and semantic (BERTScore) metrics, highlighting the gap between current models and human performance.
- Open‑Source Release: Publishes the corpus, preprocessing scripts, and training checkpoints under a permissive license to encourage reproducibility and community contributions.
Methodology
- Data Collection & Annotation
- Minutes were sourced from several Portuguese municipalities and digitized.
- Legal and linguistic experts manually extracted each discussion subject and wrote a concise, self‑contained summary (≈ 30–50 words).
- Pre‑processing
- Texts were cleaned, tokenized with a Portuguese‑specific tokenizer, and split into document → subject → summary triples.
- A train/validation/test split (80/10/10) was created, preserving subject distribution across municipalities.
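The split described above can be sketched as follows. This is a minimal illustration, not the authors' released preprocessing script: the `municipality` field name and the per‑municipality grouping strategy are assumptions about how one might preserve the subject distribution across splits.

```python
import random

def split_triples(triples, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Split (document, subject, summary) triples 80/10/10, grouping by
    municipality so each split keeps a similar municipal distribution.
    Field names are illustrative, not the paper's actual schema."""
    by_muni = {}
    for t in triples:
        by_muni.setdefault(t["municipality"], []).append(t)

    train, val, test = [], [], []
    rng = random.Random(seed)
    for group in by_muni.values():
        rng.shuffle(group)
        n = len(group)
        n_train = int(ratios[0] * n)
        n_val = int(ratios[1] * n)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```

Splitting within each municipality (rather than globally) is what keeps the distribution stable: a global shuffle could by chance put all of a small municipality's subjects into one split.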
- Model Fine‑tuning
- BART‑base and PRIMERA (a multi‑document summarizer) were fine‑tuned on the training set for 3 epochs, using the standard cross‑entropy loss.
- For LLMs, zero‑shot and few‑shot prompting were performed with GPT‑3.5‑turbo and LLaMA‑13B, feeding the full minute and a short instruction to “summarize each discussion subject”.
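The zero‑shot and few‑shot setups can be illustrated with a simple prompt builder. The exact instruction wording and demonstration format used by the authors are not reproduced here; the template below is an assumption for illustration only.

```python
def build_prompt(minute_text, examples=None):
    """Build a zero-shot (examples=None) or few-shot prompt asking an LLM
    to summarize each discussion subject in a municipal meeting minute.
    The instruction wording is illustrative, not the paper's exact prompt."""
    parts = []
    if examples:  # few-shot: prepend (minute, summaries) demonstrations
        for minute, summaries in examples:
            parts.append(f"Minute:\n{minute}\nSummaries:\n{summaries}\n")
    parts.append(
        "Summarize each discussion subject in the following municipal "
        "meeting minute, in roughly 30-50 words per subject.\n"
        f"Minute:\n{minute_text}\nSummaries:"
    )
    return "\n".join(parts)
```

The same builder serves both settings: zero‑shot passes only the target minute, while few‑shot prepends one or more (minute, summaries) demonstration pairs before it.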
- Evaluation
- Generated summaries were compared against the human references using ROUGE‑1/2/L, BLEU, METEOR, and BERTScore (F1).
- Statistical significance was assessed with paired bootstrap resampling.
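Paired bootstrap resampling can be sketched generically as below: given per‑example scores (e.g., per‑summary ROUGE) for two systems, resample the example indices with replacement and count how often system A's mean beats system B's. This is a generic implementation, not the authors' exact test configuration.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap test: resample example indices with replacement
    and return the fraction of samples in which system A does NOT
    outperform system B (an approximate one-sided p-value)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return 1.0 - wins / n_samples
```

Because the same resampled indices are used for both systems, the test accounts for the fact that both are scored on the same examples, which makes it more sensitive than comparing corpus‑level means alone.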
Results & Findings
| Model | ROUGE‑1 | ROUGE‑2 | ROUGE‑L | BERTScore‑F1 |
|---|---|---|---|---|
| BART‑base (fine‑tuned) | 38.7 | 15.2 | 35.9 | 71.4 |
| PRIMERA (fine‑tuned) | 41.3 | 17.0 | 38.2 | 73.1 |
| GPT‑3.5‑turbo (zero‑shot) | 32.5 | 11.8 | 30.1 | 66.2 |
| LLaMA‑13B (few‑shot) | 35.0 | 13.4 | 32.8 | 68.9 |
| Human reference (upper bound) | 100 | 100 | 100 | 100 |
- PRIMERA achieved the best lexical scores, indicating it can capture the salient phrases of a subject more effectively than a standard encoder‑decoder model.
- LLMs lag behind fine‑tuned models, especially on ROUGE‑2, suggesting they struggle with precise phrase overlap in this niche domain.
- All automatic scores are still far from the human upper bound, highlighting the difficulty of summarizing dense administrative language.
Practical Implications
- Civic Tech Platforms: Developers can integrate PRIMERA‑based pipelines to auto‑generate subject‑level digests, making minutes searchable and citizen‑friendly.
- Transparency & Accountability: Municipal websites could automatically publish concise summaries alongside full minutes, lowering the barrier for public oversight.
- Multilingual Extension: The dataset and codebase can serve as a template for building similar resources in other low‑resource languages (e.g., Galician, Catalan).
- Workflow Automation: City clerks can use the model to pre‑populate draft summaries, reducing manual effort and standardizing documentation.
- Search & Retrieval: Summaries improve indexing, enabling developers to build smarter Q&A bots that answer citizen queries like “What decisions were made about waste collection in March?” without scanning entire PDFs.
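The retrieval use case can be illustrated with a minimal inverted index over generated summaries. The function names and whitespace tokenization below are simplifying assumptions; a production system would use a proper Portuguese analyzer or embedding-based retrieval.

```python
from collections import defaultdict

def build_index(summaries):
    """Map each lowercase token to the ids of summaries containing it."""
    index = defaultdict(set)
    for sid, text in summaries.items():
        for token in text.lower().split():
            index[token].add(sid)
    return index

def search(index, query):
    """Return ids of summaries matching every query token (AND semantics)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = index.get(tokens[0], set()).copy()
    for t in tokens[1:]:
        result &= index.get(t, set())
    return result
```

Indexing short subject summaries instead of full minutes keeps the index small and the matches focused on decisions rather than procedural boilerplate.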
Limitations & Future Work
- Size & Diversity: Only 100 minutes from a limited set of municipalities were annotated; scaling to more regions and longer time spans is needed for broader generalization.
- Subject Granularity: Summaries target pre‑identified subjects; automatic subject detection (topic segmentation) remains an open challenge.
- Evaluation Scope: Metrics focus on n‑gram overlap; human evaluation (readability, factual correctness) is required to assess real‑world utility.
- Model Adaptation: Exploring domain‑adapted LLMs (e.g., fine‑tuning GPT‑NeoX on Portuguese legal text) could narrow the performance gap.
- Cross‑Lingual Transfer: Investigating whether models trained on CitiLink‑Summ can help summarize minutes in related Romance languages via multilingual transfer learning.
Authors
- Miguel Marques
- Ana Luísa Fernandes
- Ana Filipa Pacheco
- Rute Rebouças
- Inês Cantante
- José Isidro
- Luís Filipe Cunha
- Alípio Jorge
- Nuno Guimarães
- Sérgio Nunes
- António Leal
- Purificação Silvano
- Ricardo Campos
Paper Information
- arXiv ID: 2602.16607v1
- Categories: cs.CL
- Published: February 18, 2026