I Built a Brazilian Portuguese LLM from Scratch - Here's What I Learned

Published: December 9, 2025 at 09:18 PM EST
2 min read
Source: Dev.to

The Problem

Most AI models are trained overwhelmingly on English data: English makes up over 90% of typical training corpora, Portuguese accounts for less than 2%, and Brazilian Portuguese is rarer still. This creates real issues:

Customer: "Tô de boa, só quero dar uma olhada"
(Translation: "I'm cool, just browsing")

GPT: "I don't understand. Could you rephrase?"
Customer: "Vocês aceitam PIX?"
(PIX = Brazil's instant payment system, used by 150 M+ people)

GPT: "What is PIX?"

The Solution: Yoshii IA

I fine‑tuned Mistral‑7B on real Brazilian customer‑service conversations to create a model that actually understands:

  • 🇧🇷 Brazilian slang and expressions
  • 💳 Local context (PIX, CPF, CEP)
  • 🗣️ Natural conversational Portuguese
  • 🤝 Customer‑service best practices

The Technical Journey

1. Data Collection

A dataset of 10 000+ real customer‑service conversations in Brazilian Portuguese was assembled, including:

  • WhatsApp business chats
  • Support tickets
  • E‑commerce interactions
  • Various industries (healthcare, retail, restaurants)

The dataset is open source:
HuggingFace Dataset – brazilian‑customer‑service‑conversations
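
Before fine-tuning, each conversation has to be flattened into a single training string in the `Cliente:` / `Atendente:` format the model is later prompted with. As a minimal sketch (the record schema shown here is an assumption, not the actual dataset's):

```python
# Hypothetical record shape -- the actual HuggingFace dataset schema may differ.
def to_training_text(conversation):
    """Flatten a list of {"role", "text"} turns into one prompt/response string."""
    role_map = {"customer": "Cliente", "agent": "Atendente"}
    lines = [f"{role_map[turn['role']]}: {turn['text']}" for turn in conversation]
    return "\n".join(lines)

sample = [
    {"role": "customer", "text": "Vocês aceitam PIX?"},
    {"role": "agent", "text": "Aceitamos sim! Pode pagar na hora do checkout."},
]
print(to_training_text(sample))
# Cliente: Vocês aceitam PIX?
# Atendente: Aceitamos sim! Pode pagar na hora do checkout.
```

Keeping the training format identical to the inference prompt format is what lets the model pick up the conversational register directly.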

2. Training Setup

Fine‑tuning was performed with QLoRA (Quantized LoRA) on consumer hardware:

# 4-bit quantization + LoRA
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Load the base model in 4-bit NF4, compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

# Train small low-rank adapters on the attention projections only
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05
)
  • Hardware: Single RTX 3090 (24 GB VRAM)
  • Training time: ~4 hours
  • VRAM usage: ~6 GB with 4‑bit quantization

3. Results

Before (GPT‑4):

User: "E aí, tudo certo?"
(Translation: "Hey, everything good?")
Bot: "Olá! Como posso ajudá-lo hoje?"
(Translation: "Hello! How may I assist you today?")

After (Yoshii IA):

User: "E aí, tudo certo?"
Bot: "E aí! Tudo certinho sim! E você? 😊 Em que posso ajudar?"
(Translation: "Hey! All good here! And you? 😊 How can I help?")

Open Source Everything

What’s Next

  • 📢 Voice support (STT + TTS)
  • 📊 Sentiment analysis for Portuguese
  • 🔮 Predictive customer support
  • 🤖 Multi‑agent orchestration

Lessons Learned

  • Data quality > quantity – 10 K well‑curated samples beat 100 K noisy ones
  • Cultural context matters – PIX, CPF, CEP are institutions, not just words
  • QLoRA is magic – Fine‑tuning 7 B models on consumer GPUs is feasible
  • Portuguese isn’t one language – Brazilian Portuguese ≠ European Portuguese

Try It

# Quick inference
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("yoshii-ai/Yoshii-7B-BR")
tokenizer = AutoTokenizer.from_pretrained("yoshii-ai/Yoshii-7B-BR")

# Prompt in the same Cliente/Atendente format used during fine-tuning
prompt = "Cliente: Oi, quero saber do meu pedido\nAtendente:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Building for your local market? Open source your work. The community will thank you. 🙌

🔗 sakaguchi.ia.br
💬 WhatsApp
