🔥Finally, I was able to build the model from scratch🔥
After multiple iterations, experiments, and lessons learned, I finally built a 550 M‑parameter model completely from scratch.
This isn’t my first time building a small language model. I’ve built a few before, but they were trained on toy datasets like TinyStories. Some of the earlier projects are:
- Qwen
- Gemma
- Rnj‑1
- Meta LLaMA
This time, I made a deliberate choice: to build something meaningful, using real data, not a toy dataset.
Dataset
- Pretraining:
- Mid‑training:
- Supervised fine‑tuning:
Tokenizer
Tokenizers are often overlooked, but they play a critical role in building effective language models. I created a video to share my journey of understanding and choosing the right tokenizer.
Picking the right tokenizer – the reason behind the choice:
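To make the tokenizer discussion concrete, here is a minimal sketch of training a byte-level BPE tokenizer with the Hugging Face `tokenizers` library. The corpus path, vocabulary size, and special tokens are illustrative placeholders, not the exact settings used for this model.

```python
# Minimal sketch: train a byte-level BPE tokenizer with the Hugging Face
# `tokenizers` library. The file path, vocab size, and special tokens
# below are placeholders, not this project's actual configuration.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train on one or more plain-text files from the pretraining corpus.
tokenizer.train(
    files=["corpus.txt"],                        # placeholder path
    vocab_size=32_000,                           # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>", "<|pad|>"],
)

# Quick sanity check: round-trip a sentence through encode/decode.
ids = tokenizer.encode("Attention is all you need.").ids
print(ids)
print(tokenizer.decode(ids))

# Persist the tokenizer for use in the training pipeline.
tokenizer.save("tokenizer.json")                 # placeholder output file
```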
Attention
Attention is one of those concepts that sounds complex, but its idea is simple: focus on what matters. Understanding attention completely changed how I looked at language models.
- Picking the right attention mechanism – what actually works in practice:
- Self‑attention under the hood – step‑by‑step breakdown:
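As a companion to those breakdowns, here is a minimal, self-contained sketch of causal scaled dot-product self-attention in PyTorch. The shapes, projection weights, and mask follow the standard decoder-only setup rather than this project's exact code.

```python
# Minimal sketch of causal scaled dot-product self-attention in PyTorch.
# Naming and shapes are generic; this is not the project's exact code.
import math
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q = x @ w_q                      # project tokens to queries
    k = x @ w_k                      # ... to keys
    v = x @ w_v                      # ... to values

    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)    # (B, T, T)

    # Causal mask: each position may attend only to itself and the past.
    t = scores.size(-1)
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    weights = torch.softmax(scores, dim=-1)                  # attention weights
    return weights @ v                                       # weighted sum of values

# Tiny usage example with random projection weights.
B, T, d_model, d_head = 2, 8, 64, 16
x = torch.randn(B, T, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([2, 8, 16])
```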
Architecture
The architecture I followed is a modern pre‑normalized Transformer block, optimized for efficiency, stability, and scalability, especially for mid‑sized models like a 550 M‑parameter SLM.
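To illustrate what "pre-normalized" means in practice, here is a minimal PyTorch sketch of a pre-norm Transformer block, using RMSNorm and a SwiGLU-style gated MLP as stand-ins for common modern choices. The actual block in this model may differ in its normalization, activation, attention, and dimension details.

```python
# Minimal sketch of a pre-norm Transformer block in PyTorch.
# RMSNorm + multi-head attention + gated MLP are assumed "modern" choices;
# the real 550M model's block may differ in these details.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 2816):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = RMSNorm(d_model)
        # SwiGLU-style gated MLP: gate * up, then project back down.
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x, attn_mask=None):
        # Pre-norm: normalize *before* each sub-layer, add the residual after.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        h = self.mlp_norm(x)
        x = x + self.w_down(nn.functional.silu(self.w_gate(h)) * self.w_up(h))
        return x

# Usage: one block over a batch of 8 tokens.
block = PreNormBlock()
x = torch.randn(2, 8, 1024)
print(block(x).shape)  # torch.Size([2, 8, 1024])
```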

Training Cost
For training I used RunPod and rented 8 × A100 GPUs for 1.5 days, with a total cost of approximately $405.
Note: Make sure your root disk has enough space. I had to cancel one training run because I ran out of disk space.
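As a rough sanity check on that figure, the implied per-GPU-hour rate works out as below (a back-of-the-envelope calculation, assuming the quoted 1.5 days and $405 are exact).

```python
# Back-of-the-envelope check on the training cost figures above.
gpus = 8
hours = 1.5 * 24                      # 1.5 days of wall-clock time
gpu_hours = gpus * hours              # 288 GPU-hours in total
rate = 405 / gpu_hours                # implied price per A100-hour
print(gpu_hours, round(rate, 2))      # 288.0  ~1.41 $/GPU-hour
```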

Final Output
After training and setup, the model is now up and running, ready to answer questions.
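As a simple illustration of what "ready to answer questions" looks like, here is a hedged sketch of querying a trained checkpoint through the Hugging Face `transformers` causal-LM interface. The checkpoint path and generation settings are placeholders, since the model has not been published yet.

```python
# Sketch of querying a trained checkpoint for a completion.
# The checkpoint path and sampling settings are placeholders; this assumes
# a standard Hugging Face `transformers` causal-LM interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./slm-550m-checkpoint"                       # placeholder local path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Explain what a tokenizer does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```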

Book

Throughout this journey, one resource that consistently helped me was my own book, Building a Small Language Model from Scratch. Writing the book forced me to slow down and deeply understand every component—tokenizers, attention mechanisms, architecture choices, training pipelines, and debugging failures. When building this 550 M‑parameter model, I often returned to my own explanations, diagrams, and code walkthroughs to validate decisions and avoid shortcuts.
- Gumroad:
- Amazon:
- Leanpub:
Summary
Over the last four months I’ve fully dedicated myself to building small language models from scratch. Along the way I’ve learned a tremendous amount, and I’ll be sharing those lessons through upcoming YouTube videos and blog posts.
Can this model compete with frontier‑lab models? Absolutely not—and that was never the goal. What truly matters are the lessons learned at every step of the journey. The model is still being tested, and once validation is complete across all datasets, I’ll make it available on Hugging Face.