Nemotron-Personas-Brazil: Co-Designed Data for Sovereign AI

Published: 1 week ago (January 27, 2026 at 07:56 PM EST)

4 min read

Source: Hugging Face Blog

Grounding Brazil’s AI with Real Data

A compound AI approach to Brazilian Portuguese personas grounded in real-world distributions

Building AI systems that serve national populations requires data that reflects local language, demographics, and cultural context. For Brazil—home to more than 200 million people across diverse regions—this remains a persistent challenge, as much of today’s high‑quality training data is English‑centric or unavailable for commercial use.

Nemotron-Personas-Brazil helps close that gap. It is an open dataset (CC BY 4.0) of 6 million fully synthetic personas, statistically grounded in official census and labor data from the Brazilian Institute of Geography and Statistics (IBGE). Every persona is aligned to real demographic, geographic, and occupational distributions—but no real person is represented.

This release extends NVIDIA’s growing Nemotron-Personas Collection, which already includes the USA, Japan, India, and Singapore. Like others in the collection, the Brazil dataset covers attributes such as age, sex, education, occupation, and location.

The dataset is designed for Brazilian developers and researchers building sovereign AI, with data that is locally grounded, culturally informed, and commercially usable (CC BY 4.0). It was built in collaboration with WideLabs, an NVIDIA Inception member with deep experience supporting government and regulated‑sector AI deployments across Latin America.

What’s in the Dataset?

Dataset illustration

At a glance

6 million Brazilian personas (1 million records × 6 personas each)
~1.4 billion tokens total, including ~450 million persona tokens
20 fields per record: 6 persona fields + 14 contextual fields grounded in official statistics
Full geographic coverage: all 26 Brazilian states + the Federal District
~457 k unique Portuguese names
1 500+ occupation categories reflecting Brazil’s workforce
Multiple persona types including professional, sports, arts, travel, among others

Each persona is written in natural Brazilian Portuguese and includes cultural background, skills, goals, hobbies, and interests.

How We Built It

Data Generation Pipeline

Nemotron-Personas-Brazil was built using NeMo Data Designer, NVIDIA’s compound AI system for synthetic data generation. The pipeline supports structured generation, validation, and retry mechanisms required to produce large‑scale, population‑aware datasets.

Key components

Probabilistic Graphical Model (Apache‑2.0) for statistical grounding
GPT‑OSS‑120B (Apache‑2.0) for narrative generation in Brazilian Portuguese

An extended version of Nemotron-Personas‑Brazil will be available directly within NeMo Data Designer, enabling developers to generate, refine, and extend Brazilian Portuguese personas as part of their own synthetic data pipelines.

Enhanced Cultural Context

To capture the socio‑demographic and geographic diversity of Brazil’s population, Nemotron-Personas‑Brazil leveraged census and labor data published by the Brazilian Institute of Geography and Statistics (IBGE).

Geography – Personas are anchored at the state and municipality level, reflecting regional variation across Brazil’s five macro‑regions.
Occupation – Expands beyond job titles to include skills, expertise, and career trajectories, covering micro‑entrepreneurs and regional trades.
Life Stages – Incorporates student status, unemployment, and retirement to reflect real population dynamics.
Cultural Traits – Natural‑language personas capture Brazilian social norms, interests, and lifestyle dimensions such as arts, sports, and travel.
Language Fidelity – All personas are written in natural Brazilian Portuguese, reflecting local naming conventions and communication styles.

The result is a dataset that is statistically grounded, culturally representative, and fully synthetic by design.

Private By Design

The dataset contains no personally identifiable information. While we use real‑world distributions of ages, names, and occupations from official public sources, nothing is tied to any real person, living or deceased. Every persona is fully synthetic, so you can train on authentic cultural patterns without compromising privacy.

Who This Data Is For

Nemotron-Personas‑Brazil is designed primarily for Brazilian developers and researchers building sovereign AI systems. By providing high‑quality, population‑representative data in Brazilian Portuguese, the dataset addresses gaps left by predominantly English‑language training corpora.

Global developers may also leverage the dataset to improve model performance and alignment in Brazilian cultural and linguistic contexts.

Practical AI Applications

Multi‑turn conversation – Use personas as seeds to generate authentic dialogue datasets.
Domain‑specific training – Build culturally aware AI assistants.
Bias testing & fairness – Evaluate model performance across rural vs. urban populations, age groups, and education levels, ensuring your AI works fairly across all segments of Brazilian society.

Why It Matters

AI model builders have long struggled with access to diverse, high‑quality training data that reflects real‑world populations. Proprietary datasets dominate enterprise AI, creating barriers for researchers, startups, and developers in under‑represented regions.

Data diversity – Prevents narrow training and model collapse by reflecting Brazil’s full population spectrum.
Cultural authenticity – Reduces reliance on Western‑centric datasets and supports sovereign AI development.
Privacy preservation – Designed to meet Brazil’s data protection requirements and emerging AI governance standards.

By releasing Nemotron-Personas‑Brazil under CC BY 4.0, we’re democratizing access to enterprise‑grade synthetic data—enabling anyone to build culturally authentic AI without barriers of cost, privacy concerns, or geography.

Start Building with Nemotron-Personas-Brazil

from datasets import load_dataset

dataset = load_dataset("nvidia/nemotron-personas-brazil")

Want to learn more about NVIDIA’s open data products, or interested in co‑designing a future dataset? Join the conversation on NVIDIA’s Discord.