[Paper] Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

Published: April 23, 2026 at 01:46 PM EDT

Source: arXiv - 2604.21897v1

Overview

This paper presents a scalable, data‑driven framework for dissecting parliamentary speeches, applied to more than 450 k remarks from Brazil’s Chamber of Deputies (2003‑2025). By looking beyond roll‑call votes, the authors reveal how legislators talk, what they talk about, and who talks alike, offering a richer picture of political dynamics that can be leveraged by developers building civic‑tech, NLP, or analytics tools.

Key Contributions

Multi‑dimensional analysis pipeline that fuses (i) diachronic stylometry, (ii) contextual topic modeling, and (iii) semantic clustering of speakers.
Large‑scale empirical study on a 22‑year corpus of Brazilian legislative speeches, demonstrating the pipeline’s scalability.
Empirical insights:
- A clear stylistic drift toward shorter, more direct utterances over time.
- Rapid agenda reshaping in response to national crises (e.g., economic shocks, pandemics).
- Discursive alignments driven more by region and gender than by party affiliation.
Open‑source toolkit (or at least a reproducible workflow) that can be adapted to other parliaments or deliberative bodies.

Methodology

Data Collection & Pre‑processing
- Scraped official transcripts, cleaned HTML, removed stop‑words, and lemmatized Portuguese text.
- Aligned each speech with metadata: deputy ID, party, state, gender, and timestamp.
Diachronic Stylometric Analysis
- Computed classic style metrics (sentence length, lexical richness, use of passive voice) per year.
- Tracked trends with simple time‑series models to spot long‑term shifts.
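
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the regex-based sentence splitter, the type-token ratio as a stand-in for "lexical richness", and the function names (`yearly_style`, `linear_slope`) are all assumptions made for this example.

```python
import re
from collections import defaultdict
from statistics import mean


def sentence_lengths(text: str) -> list[int]:
    """Split on ., ! and ? and count whitespace-separated words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def type_token_ratio(text: str) -> float:
    """Distinct word forms divided by total words (a crude lexical-richness proxy)."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def yearly_style(speeches):
    """speeches: iterable of (year, text).

    Returns {year: (mean sentence length, mean type-token ratio)}.
    """
    lengths = defaultdict(list)
    ratios = defaultdict(list)
    for year, text in speeches:
        lengths[year].extend(sentence_lengths(text))
        ratios[year].append(type_token_ratio(text))
    return {
        year: (mean(lengths[year]) if lengths[year] else 0.0, mean(ratios[year]))
        for year in sorted(ratios)
    }


def linear_slope(points):
    """Ordinary least-squares slope of y on x; needs at least two distinct x values."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    denominator = sum((x - mx) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("need at least two distinct x values")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denominator


# Toy corpus (hypothetical data, not from the paper).
speeches = [
    (2003, "Um dois três quatro. Cinco seis."),
    (2004, "Um dois. Três."),
]
style = yearly_style(speeches)
trend = linear_slope([(year, values[0]) for year, values in style.items()])
```

A negative `trend` means the average sentence gets shorter per year. A real analysis would likely use a proper Portuguese sentence tokenizer instead of the regex split.
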
Contextual Topic Modeling
- Obtained sentence embeddings from a BERT‑based encoder (e.g., bert-base-portuguese-cased, which is a Portuguese‑specific model rather than a multilingual one).
- Applied a dynamic topic model (BERTopic) that clusters embeddings while allowing topics to evolve across years.
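
Once each speech carries a topic label, tracking how topics evolve reduces to comparing topic shares between years. The sketch below assumes the labels already exist (for instance, produced by a topic model such as BERTopic) and uses a simple year-over-year jump threshold; the `min_jump` value and both function names are illustrative assumptions, not the paper's procedure.

```python
from collections import Counter, defaultdict


def topic_shares(assignments):
    """assignments: iterable of (year, topic). Returns {year: {topic: share}}."""
    counts = defaultdict(Counter)
    for year, topic in assignments:
        counts[year][topic] += 1
    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        shares[year] = {topic: n / total for topic, n in counts[year].items()}
    return shares


def flag_spikes(shares, min_jump=0.10):
    """List (year, topic, jump) where a topic's share rose by at least
    min_jump compared with the previous year present in the data."""
    years = sorted(shares)
    spikes = []
    for previous, current in zip(years, years[1:]):
        for topic, share in shares[current].items():
            jump = share - shares[previous].get(topic, 0.0)
            if jump >= min_jump:
                spikes.append((current, topic, round(jump, 3)))
    return sorted(spikes)


# Toy labels (hypothetical data, not from the paper).
assignments = (
    [(2019, "economy")] * 8 + [(2019, "health")] * 2
    + [(2020, "economy")] * 4 + [(2020, "health")] * 6
)
shares = topic_shares(assignments)
spikes = flag_spikes(shares)
```

Shares are used rather than raw counts so that years with more speeches do not look like spikes on volume alone.
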
Semantic Speaker Clustering
- Aggregated each deputy’s speech embeddings into a single representation (average or attention‑weighted).
- Performed hierarchical density‑based clustering (e.g., HDBSCAN) to discover groups of deputies with similar rhetorical footprints.
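
The aggregation step can be illustrated with plain mean pooling and cosine similarity. This is a sketch of the simple-average variant only; the attention-weighted variant and the clustering itself are not shown, and the helper names (`mean_pool`, `deputy_profiles`, `cosine`) are hypothetical.

```python
import math
from collections import defaultdict


def mean_pool(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def deputy_profiles(speech_embeddings):
    """speech_embeddings: iterable of (deputy_id, vector).

    Returns {deputy_id: mean of that deputy's speech vectors}.
    """
    grouped = defaultdict(list)
    for deputy_id, vector in speech_embeddings:
        grouped[deputy_id].append(vector)
    return {deputy_id: mean_pool(vs) for deputy_id, vs in grouped.items()}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy 2-dimensional embeddings (hypothetical data, not from the paper).
embeddings = [
    ("A", [1.0, 0.0]),
    ("A", [0.0, 1.0]),
    ("B", [1.0, 1.0]),
    ("C", [-1.0, 0.0]),
]
profiles = deputy_profiles(embeddings)
similarity_ab = cosine(profiles["A"], profiles["B"])
similarity_ac = cosine(profiles["A"], profiles["C"])
```

The resulting per-deputy vectors are what a clustering algorithm would then consume. Mean pooling weights every speech equally, so deputies with very few speeches get noisier profiles.
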
Evaluation & Validation
- Compared stylistic trends against external events (e.g., 2014 economic recession, 2020 COVID‑19).
- Validated speaker clusters with known demographic attributes (region, gender) and party lines.
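
One simple way to check clusters against a metadata attribute is cluster purity: the fraction of deputies whose attribute value matches the majority value in their cluster. The summary does not say which measure the authors used, so purity here is an assumed stand-in, and the labels below are made up for illustration.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, attribute_values):
    """Fraction of items whose attribute equals the majority attribute
    of the cluster they were assigned to (1.0 = perfectly aligned)."""
    if len(cluster_labels) != len(attribute_values):
        raise ValueError("inputs must have the same length")
    if not cluster_labels:
        raise ValueError("inputs must not be empty")
    by_cluster = defaultdict(Counter)
    for cluster, value in zip(cluster_labels, attribute_values):
        by_cluster[cluster][value] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(cluster_labels)


# Toy labels (hypothetical data, not from the paper).
clusters = [0, 0, 0, 1, 1, 1]
region = ["NE", "NE", "NE", "S", "S", "S"]
party = ["X", "Y", "Z", "X", "Y", "Z"]

region_purity = purity(clusters, region)
party_purity = purity(clusters, party)
```

In this toy case the clusters line up with region and not with party. Purity rises mechanically as the number of clusters grows, so it is best compared across attributes for a fixed clustering, or against a shuffled-label baseline.
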

The pipeline is modular: swap the embedding model, change the clustering algorithm, or add a sentiment layer without breaking the overall workflow.
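
What "modular" might look like in practice: each stage is a plain callable, so replacing one does not touch the others. The `Pipeline` class and the toy stages below are hypothetical and only illustrate the idea; they are not the toolkit's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = list[float]


@dataclass
class Pipeline:
    """Two swappable stages: texts -> vectors -> cluster labels."""
    embed: Callable[[Sequence[str]], list[Vector]]
    cluster: Callable[[list[Vector]], list[int]]

    def run(self, texts: Sequence[str]) -> list[int]:
        return self.cluster(self.embed(texts))


def toy_embed(texts: Sequence[str]) -> list[Vector]:
    """Placeholder embedding: one dimension holding the word count."""
    return [[float(len(t.split()))] for t in texts]


def toy_cluster(vectors: list[Vector]) -> list[int]:
    """Placeholder clustering: short texts in group 0, longer ones in group 1."""
    return [0 if v[0] < 3 else 1 for v in vectors]


def one_cluster(vectors: list[Vector]) -> list[int]:
    """Alternative stage that assigns everything to a single group."""
    return [0 for _ in vectors]


texts = ["um dois", "um dois três quatro"]
labels = Pipeline(embed=toy_embed, cluster=toy_cluster).run(texts)
swapped = Pipeline(embed=toy_embed, cluster=one_cluster).run(texts)
```

Swapping `toy_cluster` for `one_cluster` changes the output without any change to the embedding stage, which is the property the paragraph above describes.
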

Results & Findings

Dimension	Core Finding	Interpretation
Style	Average sentence length dropped from ~23 words (2003) to ~15 words (2024).	Legislators are delivering briefer, more “tweet‑like” statements, possibly reflecting media pressure.
Topics	Sudden spikes in “public health”, “economic stimulus”, and “environment” topics coinciding with the 2020 pandemic and 2022 floods.	The agenda reacts quickly to crises, which suggests that speech content tracks shifts in policy focus; the analysis is correlational and does not show that speeches precede or drive policy.
Speaker Alignment	Clusters aligned strongly with geographic regions (Northeast vs. South) and gender; party affiliation explained only ~12 % of variance.	Identity cues (regional interests, gendered concerns) dominate rhetorical similarity, suggesting cross‑party coalitions on specific issues.

Overall, the study demonstrates that how deputies speak can be as informative as how they vote, opening a new analytical dimension for political scientists and technologists alike.

Practical Implications

Civic‑Tech Platforms: Real‑time monitoring dashboards can flag emerging topics or stylistic shifts, alerting NGOs, journalists, and the public to policy pivots before votes occur.
Legislative Analytics SaaS: Companies can enrich vote‑based scoring systems with speech‑based similarity scores, offering clients a more nuanced risk assessment of legislative outcomes.
Bias & Representation Audits: The framework can surface under‑represented voices (e.g., women from certain regions) by quantifying discursive participation, supporting diversity initiatives.
NLP Model Benchmarking: The Brazilian parliamentary corpus is a valuable Portuguese‑language, domain‑specific dataset for testing language models on long‑form political text.
Policy Forecasting: Topic‑trend detection can feed into predictive models that anticipate budget allocations or regulatory focus, aiding strategic planning for businesses.

Developers can plug the open‑source pipeline into existing data pipelines (e.g., Apache Beam, Airflow) to process new legislative sessions automatically.

Limitations & Future Work

Language Specificity: The current implementation is tuned for Portuguese; cross‑lingual transfer may need additional tokenization and cultural adaptation.
Speaker Metadata Gaps: Missing or inconsistent demographic data can bias clustering results.
Causality vs. Correlation: While topics align with crises, the model does not prove that speeches drive policy changes.
Scalability to Real‑Time: Processing 450 k speeches took several hours on a modest GPU cluster; optimizing for streaming ingestion remains an open challenge.

Future research could integrate sentiment analysis, network‑based interaction graphs (who replies to whom), and multimodal data (e.g., audio or video recordings of sessions) to build an even richer portrait of parliamentary discourse.

Authors

Flávio Soriano
Victoria F. Mello
Pedro B. Rigueira
Gisele L. Pappa
Wagner Meira
Ana Paula Couto da Silva
Jussara M. Almeida

Paper Information

arXiv ID: 2604.21897v1
Categories: cs.CL, cs.CY
Published: April 23, 2026
