Nemotron-Personas-Brazil: Co-Designed Data for Sovereign AI

Published: (January 27, 2026 at 07:56 PM EST)
5 min read

Source: Hugging Face Blog

Back to Articles

Grounding Brazil’s AI with Real Data

							[What’s in the Dataset?](#whats-in-the-dataset)
								

At a glance:

							[How We Built It](#how-we-built-it)
								

Data Generation Pipeline

										[Enhanced Cultural Context](#enhanced-cultural-context)
											

										[Private By Design](#private-by-design)
											

										

							[Who This Data Is For](#who-this-data-is-for)
								

							[Practical AI Applications](#practical-ai-applications)
								

							[Why It Matters](#why-it-matters)
								

							[Start Building with Nemotron-Personas-Brazil](#start-building-with-nemotron-personas-brazil)
								

							

A compound AI approach to Brazilian Portuguese personas grounded in real-world distributions

	Grounding Brazil’s AI with Real Data

Building AI systems that serve national populations requires data that reflects local language, demographics, and cultural context. For Brazil—home to more than 200 million people across diverse regions—this remains a persistent challenge, as much of today’s high-quality training data is English-centric or unavailable for commercial use.

Nemotron-Personas-Brazil helps close that gap. It is an open dataset (CC BY 4.0) of 6 million fully synthetic personas, statistically grounded in official census and labor data from the Brazilian Institute of Geography and Statistics (IBGE). Every persona is aligned to real demographic, geographic, and occupational distributions—but no real person is represented.

This release extends NVIDIA’s growing Nemotron-Personas Collection, which already includes the USA, Japan, India, and Singapore. Like others in the collection, the Brazil dataset covers attributes such as age, sex, education, occupation, and location.

The dataset is designed for Brazilian developers and researchers building sovereign AI, with data that is locally grounded, culturally informed, and commercially usable (CC BY 4.0). It was built in collaboration with WideLabs, an NVIDIA Inception member with deep experience supporting government and regulated-sector AI deployments across Latin America.

	What’s in the Dataset?

	At a glance:
  • 6 million Brazilian personas (1 million records × 6 personas each)

  • ~1.4 billion tokens total, including ~450 million persona tokens

  • 20 fields per record: 6 persona fields + 14 contextual fields grounded in official statistics

  • Full geographic coverage: all 26 Brazilian states + the Federal District

  • ~457k unique Portuguese names

  • 1,500+ occupation categories reflecting Brazil’s workforce

  • Multiple persona types including: professional, sports, arts, travel, among others.

Each persona is written in natural Brazilian Portuguese and includes cultural background, skills, goals, hobbies, and interests.

	How We Built It



	


	Data Generation Pipeline

Nemotron-Personas-Brazil was built using NeMo Data Designer, NVIDIA’s compound AI system for synthetic data generation. The pipeline supports structured generation, validation, and retry mechanisms required to produce large-scale, population-aware datasets.

Key components include:

  • Probabilistic Graphical Model (Apache-2.0) for statistical grounding

  • GPT-OSS-120B (Apache-2.0) for narrative generation in Brazilian Portuguese

An extended version of Nemotron-Personas-Brazil will be available directly within NeMo Data Designer, enabling developers to generate, refine, and extend Brazilian Portuguese personas as part of their own synthetic data pipelines.

	Enhanced Cultural Context

In order to capture the socio-demographic and geographic diversity and complexity of Brazil’s population, Nemotron-Personas-Brazil leveraged population census and labor data published by the Brazilian Institute of Geography and Statistics (IBGE).

  • Geography – Personas are anchored at the state and municipality level, reflecting regional variation across Brazil’s five macro-regions.

  • Occupation – Expands beyond job titles to include skills, expertise, and career trajectories, including micro-entrepreneurs and regional trades.

  • Life Stages – Incorporates student status, unemployment, and retirement to reflect real population dynamics.

  • Cultural Traits – Natural-language personas capture Brazilian social norms, interests, and l

ifestyle dimensions such as arts, sports, and travel.

  • Language Fidelity – All personas are written in natural Brazilian Portuguese, reflecting local naming conventions and communication styles.

The result is a dataset that is statistically grounded, culturally representative, and fully synthetic by design.

	Private By Design

This dataset contains no personally identifiable information. While we use real-world distributions of ages, names, and occupations from official public sources, nothing is tied to any real person, living or deceased. Every persona is fully synthetic, so you can train on authentic cultural patterns without compromising privacy.

	Who This Data Is For

Nemotron-Personas-Brazil is designed primarily for Brazilian developers and researchers building sovereign AI systems. By providing high-quality, population-representative data in Brazilian Portuguese, the dataset addresses gaps left by predominantly English-language training corpora.

Global developers may also leverage the dataset to improve model performance and alignment in Brazilian cultural and linguistic contexts.

	Practical AI Applications
  • Multi-turn conversation: Use personas as seeds to generate authentic dialogue datasets

  • Domain-specific training: Build culturally aware AI assistants

  • Bias testing & fairness: Evaluate model performance across rural vs. urban populations, age groups, and education levels—ensuring your AI works fairly across all segments of Brazilian society

      Why It Matters

AI model builders have long struggled with access to diverse, high-quality training data that reflects real-world populations. Proprietary datasets dominate enterprise AI, creating barriers for researchers, startups, and developers in underrepresented regions.

  • Data diversity: Prevents narrow training and model collapse by reflecting Brazil’s full population spectrum

  • Cultural authenticity: Reduces reliance on Western-centric datasets and supports sovereign AI development

  • Privacy-preservation: Designed to meet Brazil’s data protection requirements and emerging AI governance standards

By releasing Nemotron-Personas-Brazil under CC BY 4.0, we’re democratizing access to enterprise-grade synthetic data—enabling anyone to build culturally authentic AI without barriers of cost, privacy concerns, or geography.

	Start Building with Nemotron-Personas-Brazil

You can load the dataset directly from Hugging Face:

from datasets import load_dataset

dataset = load_dataset("nvidia/nemotron-personas-brazil")

Want to learn more about NVIDIA’s open data products, or interested in co-designing a future dataset? Join the conversation on NVIDIA’s Discord.

Back to Blog

Related posts

Read more »

Introducing Prism

Prism is a free LaTeX-native workspace with GPT-5.2 built in, helping researchers write, collaborate, and reason in one place....