🔥Finally, I was able to build the model from scratch🔥
After multiple iterations, experiments, and lessons learned, I finally built a 550 M‑parameter model completely from scratch.
This isn’t my first time building a small language model. I’ve built a few before, but they were trained on toy datasets like TinyStories. Some of the earlier projects are:
- Qwen
- Gemma
- Rnj‑1
- Meta LLaMA
This time, I made a deliberate choice: to build something meaningful, using real data, not a toy dataset.
Dataset
- Pretraining:
- Mid‑training:
- Supervised fine‑tuning:
Tokenizer
Tokenizers are often overlooked, but they play a critical role in building effective language models. I created a video to share my journey of understanding and choosing the right tokenizer.
Picking the right tokenizer – the reason behind the choice:
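To make the tokenizer discussion concrete, here is a minimal sketch of training a byte-level BPE tokenizer with the Hugging Face `tokenizers` library. The corpus path, vocabulary size, and special tokens are illustrative placeholders, not the exact settings used for this model.

```python
# Minimal sketch: train a byte-level BPE tokenizer with the Hugging Face
# `tokenizers` library. The file path, vocab size, and special tokens
# below are placeholders, not this project's actual configuration.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train on one or more plain-text files from the pretraining corpus.
tokenizer.train(
    files=["corpus.txt"],                        # placeholder path
    vocab_size=32_000,                           # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>", "<|pad|>"],
)

# Quick sanity check: round-trip a sentence through encode/decode.
ids = tokenizer.encode("Attention is all you need.").ids
print(ids)
print(tokenizer.decode(ids))

# Persist the tokenizer for use in the training pipeline.
tokenizer.save("tokenizer.json")                 # placeholder output file
```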
Attention
Attention is one of those concepts that sounds complex, but its idea is simple: focus on what matters. Understanding attention completely changed how I looked at language models.
- Picking the right attention mechanism – what actually works in practice:
- Self‑attention under the hood – step‑by‑step breakdown:
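As a companion to those breakdowns, here is a minimal, self-contained sketch of causal scaled dot-product self-attention in PyTorch. The shapes, projection weights, and mask follow the standard decoder-only setup rather than this project's exact code.

```python
# Minimal sketch of causal scaled dot-product self-attention in PyTorch.
# Naming and shapes are generic; this is not the project's exact code.
import math
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q = x @ w_q                      # project tokens to queries
    k = x @ w_k                      # ... to keys
    v = x @ w_v                      # ... to values

    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)    # (B, T, T)

    # Causal mask: each position may attend only to itself and the past.
    t = scores.size(-1)
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    weights = torch.softmax(scores, dim=-1)                  # attention weights
    return weights @ v                                       # weighted sum of values

# Tiny usage example with random projection weights.
B, T, d_model, d_head = 2, 8, 64, 16
x = torch.randn(B, T, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([2, 8, 16])
```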
Architecture
The architecture I followed is a modern pre‑normalized Transformer block, optimized for efficiency, stability, and scalability, especially for mid‑sized models like a 550 M‑parameter SLM.
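To illustrate what "pre-normalized" means in practice, here is a minimal PyTorch sketch of a pre-norm Transformer block, using RMSNorm and a SwiGLU-style gated MLP as stand-ins for common modern choices. The actual block in this model may differ in its normalization, activation, attention, and dimension details.

```python
# Minimal sketch of a pre-norm Transformer block in PyTorch.
# RMSNorm + multi-head attention + gated MLP are assumed "modern" choices;
# the real 550M model's block may differ in these details.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 2816):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = RMSNorm(d_model)
        # SwiGLU-style gated MLP: gate * up, then project back down.
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x, attn_mask=None):
        # Pre-norm: normalize *before* each sub-layer, add the residual after.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        h = self.mlp_norm(x)
        x = x + self.w_down(nn.functional.silu(self.w_gate(h)) * self.w_up(h))
        return x

# Usage: one block over a batch of 8 tokens.
block = PreNormBlock()
x = torch.randn(2, 8, 1024)
print(block(x).shape)  # torch.Size([2, 8, 1024])
```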

Training Cost
For training I used RunPod and rented 8 × A100 GPUs for 1.5 days, with a total cost of approximately $405.
Note: Make sure your root disk has enough space. I had to cancel one training run because I ran out of disk space.
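As a rough sanity check on that figure, the implied per-GPU-hour rate works out as below (a back-of-the-envelope calculation, assuming the quoted 1.5 days and $405 are exact).

```python
# Back-of-the-envelope check on the training cost figures above.
gpus = 8
hours = 1.5 * 24                      # 1.5 days of wall-clock time
gpu_hours = gpus * hours              # 288 GPU-hours in total
rate = 405 / gpu_hours                # implied price per A100-hour
print(gpu_hours, round(rate, 2))      # 288.0  ~1.41 $/GPU-hour
```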

Final Output
After training and setup, the model is now up and running, ready to answer questions.
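As a simple illustration of what "ready to answer questions" looks like, here is a hedged sketch of querying a trained checkpoint through the Hugging Face `transformers` causal-LM interface. The checkpoint path and generation settings are placeholders, since the model has not been published yet.

```python
# Sketch of querying a trained checkpoint for a completion.
# The checkpoint path and sampling settings are placeholders; this assumes
# a standard Hugging Face `transformers` causal-LM interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./slm-550m-checkpoint"                       # placeholder local path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Explain what a tokenizer does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```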

Book

Throughout this journey, one resource that consistently helped me was my own book, Building a Small Language Model from Scratch. Writing the book forced me to slow down and deeply understand every component—tokenizers, attention mechanisms, architecture choices, training pipelines, and debugging failures. When building this 550 M‑parameter model, I often returned to my own explanations, diagrams, and code walkthroughs to validate decisions and avoid shortcuts.
- Gumroad:
- Amazon:
- Leanpub:
Summary
Over the last four months I’ve fully dedicated myself to building small language models from scratch. Along the way I’ve learned a tremendous amount, and I’ll be sharing those lessons through upcoming YouTube videos and blog posts.
Can this model compete with frontier‑lab models? Absolutely not—and that was never the goal. What truly matters are the lessons learned at every step of the journey. The model is still being tested, and once validation is complete across all datasets, I’ll make it available on Hugging Face.