🔥 Finally, I was able to build the model from scratch 🔥
Source: Dev.to
After multiple iterations, experiments, and lessons learned, I finally built a 550M-parameter model completely from scratch.
This isn't my first time building a small language model. I've built a few before, but they were trained on toy datasets like TinyStories. Some of the earlier projects are:
- Qwen
- Gemma
- Rnj-1
- Meta LLaMA
This time, I made a deliberate choice: to build something meaningful, using real data, not a toy dataset.
Dataset
- Pretraining:
- Mid-training:
- Supervised fine-tuning:
Tokenizer
Tokenizers are often overlooked, but they play a critical role in building effective language models. I created a video to share my journey of understanding and choosing the right tokenizer.
Picking the right tokenizer - the reason behind the choice:
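The video covers the reasoning rather than the code, so as a rough companion, here is a minimal sketch of byte-pair encoding (BPE), the merge-based algorithm behind most modern LLM tokenizers. The corpus, merge count, and `</w>` end-of-word marker are illustrative choices, not the tokenizer actually used for the 550M model:

```python
from collections import Counter

def most_frequent_pair(words):
    # count adjacent symbol pairs, weighted by how often each word occurs
    pairs = Counter()
    for symbols, count in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # replace every occurrence of the pair with its merged symbol
    a, b = pair
    merged = {}
    for symbols, count in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + count
    return merged

def train_bpe(corpus, num_merges):
    # start from character-level symbols plus an end-of-word marker
    words = Counter(tuple(w) + ("</w>",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

merges = train_bpe("low lower lowest low low", 3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```

Real tokenizers add byte-level fallback, pre-tokenization rules, and special tokens on top of this core loop, which is where most of the practical decisions live.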
Attention
Attention is one of those concepts that sounds complex, but its idea is simple: focus on what matters. Understanding attention completely changed how I looked at language models.
- Picking the right attention mechanism - what actually works in practice:
- Self-attention under the hood - step-by-step breakdown:
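To make the "focus on what matters" idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention (single head; the shapes and random projection matrices are illustrative, not the model's actual ones):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # 1. project each token into a query, key, and value vector
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # 2. score every token against every other, scaled by sqrt(d_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # 3. softmax turns each row of scores into weights summing to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 4. each output is a weighted mix of the value vectors
    return weights @ v, weights

rng = np.random.default_rng(42)
x = rng.normal(size=(4, 8))  # 4 tokens, model dimension 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)             # (4, 8)
print(weights.sum(axis=-1))  # each row sums to 1
```

The "focus" lives entirely in `weights`: rows with mass concentrated on one position mean that token is attending mostly to a single other token.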
Architecture
The architecture I followed is a modern pre-normalized Transformer block, optimized for efficiency, stability, and scalability, especially for mid-sized models like a 550M-parameter SLM.
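As a toy illustration of the pre-norm idea (normalize *before* each sublayer, add the residual after), here is a heavily simplified block. The use of RMSNorm and SiLU is an assumption for the sketch; the post doesn't specify which norm or activation the 550M model uses, and the attention here is a single-head stand-in with identity projections:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm (assumed here): rescale by root-mean-square, no mean subtraction
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def toy_attention(x):
    # stand-in for multi-head attention: one head, identity projections
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def toy_mlp(x, w_up, w_down):
    # feed-forward sublayer: expand, apply SiLU, project back down
    h = x @ w_up
    h = h * (1.0 / (1.0 + np.exp(-h)))  # SiLU: h * sigmoid(h)
    return h @ w_down

def pre_norm_block(x, w_up, w_down):
    # pre-norm: each sublayer sees a normalized input,
    # and the residual stream is never normalized in place
    x = x + toy_attention(rms_norm(x))
    x = x + toy_mlp(rms_norm(x), w_up, w_down)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))        # (seq_len, d_model)
w_up = rng.normal(size=(16, 64))    # expansion weights
w_down = rng.normal(size=(64, 16))  # projection weights
out = pre_norm_block(x, w_up, w_down)
print(out.shape)  # (4, 16)
```

Keeping the residual stream un-normalized is what gives pre-norm blocks their training stability at scale, which matters most when you can't afford many failed runs.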

Training Cost
For training I used RunPod () and rented 8 × A100 GPUs for 1.5 days, with a total cost of approximately $405.
Note: Make sure your root disk has enough space. I had to cancel one training run because I ran out of disk space.
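As a quick sanity check on those figures, 8 GPUs for 1.5 days works out to 288 GPU-hours, which implies an effective rate of roughly $1.41 per A100-hour (a derived number, not a quoted RunPod price):

```python
gpus = 8
days = 1.5
total_cost = 405.0  # approximate total from the run above

gpu_hours = gpus * days * 24       # 288 GPU-hours
rate = total_cost / gpu_hours      # implied $/GPU-hour
print(f"{gpu_hours:.0f} GPU-hours at ~${rate:.2f}/GPU-hour")
```

A back-of-the-envelope like this is worth doing before renting: a failed run (like the disk-space cancellation above) costs real money at these rates.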

Final Output
After training and setup, the model is now up and running, ready to answer questions.

Book

Throughout this journey, one resource that consistently helped me was my own book, Building a Small Language Model from Scratch. Writing the book forced me to slow down and deeply understand every component: tokenizers, attention mechanisms, architecture choices, training pipelines, and debugging failures. When building this 550M-parameter model, I often returned to my own explanations, diagrams, and code walkthroughs to validate decisions and avoid shortcuts.
- Gumroad:
- Amazon:
- Leanpub:
Summary
Over the last four months I've fully dedicated myself to building small language models from scratch. Along the way I've learned a tremendous amount, and I'll be sharing those lessons through upcoming YouTube videos and blog posts.
Can this model compete with frontier-lab models? Absolutely not, and that was never the goal. What truly matters are the lessons learned at every step of the journey. The model is still being tested, and once validation is complete across all datasets, I'll make it available on Hugging Face.