LLM from scratch, part 28 – training a base model from scratch on an RTX 3090

Published: December 9, 2025 at 09:49 AM EST
2 min read
Source: Dev.to

The Setup: Why I Chose the RTX 3090

Let’s kick things off with the hardware. I’ve been using an RTX 3090 for a while now; with 24 GB of VRAM it’s a popular choice for deep learning. That power comes at a price in heat and power draw, though, so I had to make sure my setup had enough cooling to handle long training runs without overheating.

The first step was installing the necessary libraries. I opted for PyTorch because it balances performance and ease of use.

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
pip install transformers datasets
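
Before going further, a quick sanity check like the following confirms that PyTorch can actually see the GPU and its 24 GB of VRAM:

import torch

# Confirm CUDA is available and report the device name and total memory.
if torch.cuda.is_available():
    device = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("CUDA not available - check your driver and PyTorch build")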

With the environment ready, the real work begins: actually training the model.

The Architecture: Building from the Ground Up

Onto the fun part: defining the architecture of the base model. I chose a standard transformer architecture.

import torch
import torch.nn as nn

class SimpleTransformer(nn.Module):
    """A thin wrapper around PyTorch's transformer encoder stack.

    A full language model would also need a token embedding and an
    output head on top of this.
    """

    def __init__(self, input_dim, num_heads, num_layers):
        super().__init__()
        # d_model is the embedding width; num_heads must divide it evenly.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads),
            num_layers=num_layers
        )

    def forward(self, x):
        # Expects input of shape (seq_len, batch, input_dim), the default
        # layout when batch_first is not set on the encoder layer.
        return self.encoder(x)

# 512-dimensional model with 8 attention heads and 6 encoder layers.
model = SimpleTransformer(input_dim=512, num_heads=8, num_layers=6)
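
A quick forward pass with dummy data is a cheap way to confirm the shapes line up before training; remember that nn.TransformerEncoderLayer defaults to sequence-first input:

# Dummy batch: (seq_len, batch_size, input_dim), the default layout.
dummy = torch.randn(128, 4, 512)
out = model(dummy)
print(out.shape)  # torch.Size([128, 4, 512])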

Hyperparameter tuning—selecting the right number of layers and heads—proved crucial.
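
To get a feel for how these choices affect model size, counting parameters for a few configurations is cheap. Note that the head count splits attention across heads rather than adding parameters, so the totals below are driven by the layer count and the feed-forward width (2048 by default in nn.TransformerEncoderLayer):

# Compare parameter counts across a few layer/head configurations.
for num_layers, num_heads in [(4, 4), (6, 8), (12, 8)]:
    candidate = SimpleTransformer(input_dim=512, num_heads=num_heads, num_layers=num_layers)
    n_params = sum(p.numel() for p in candidate.parameters())
    print(f"{num_layers} layers, {num_heads} heads: {n_params / 1e6:.1f}M parameters")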

Training: The Good, the Bad, and the Computationally Intensive

With the model in place, the next step was data. I experimented with text scraped from various sources and quickly learned that not all data is created equal: cleaning the dataset was essential. For a reproducible baseline, WikiText-2 from the datasets library is an easy starting point.

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
train_data = dataset["train"]
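
Loading the data is only half the job: the raw text still has to be cleaned and tokenized before it can be batched. Here is a minimal sketch, assuming the GPT-2 tokenizer from the transformers library (any tokenizer with the same interface would work):

from transformers import AutoTokenizer

# GPT-2's byte-level BPE tokenizer; swap in any tokenizer you prefer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Drop empty lines, then map raw text to token IDs.
cleaned = train_data.filter(lambda example: example["text"].strip() != "")
tokenized = cleaned.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
print(tokenized[0]["input_ids"][:10])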

After a few epochs the model began generating coherent text, though occasional nonsensical outputs highlighted the limitations of training from scratch and the importance of data quality.

Troubleshooting: When Things Go South

A major issue I encountered was exploding gradients, where the loss suddenly spikes instead of decreasing. Gradient clipping helped stabilize training.

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

This simple line prevented loss values from skyrocketing and made the training process more manageable.
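
For the clipping to do anything, it has to sit between the backward pass and the optimizer step. Here is a rough sketch of where it fits in a generic training step (the dataloader, loss function, and learning rate are placeholders, not the exact loop from this project):

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # learning rate is illustrative

for batch, targets in dataloader:    # placeholder dataloader
    optimizer.zero_grad()
    logits = model(batch)            # forward pass
    loss = loss_fn(logits, targets)  # placeholder loss, e.g. cross-entropy over next tokens
    loss.backward()                  # compute gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before stepping
    optimizer.step()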

Real‑World Applications: Where This All Leads

The trained model can generate code snippets for small projects, acting like an always‑on assistant. However, it also raises ethical considerations: models can perpetuate biases if trained on problematic data. Developers should be mindful of these risks.
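
On the generation side, a trained checkpoint can be sampled with a simple greedy decoding loop. A sketch, assuming a tokenizer and a language_model that maps token IDs to next-token logits (both hypothetical here; the SimpleTransformer above would first need a token embedding and an output head):

@torch.no_grad()
def greedy_generate(language_model, tokenizer, prompt, max_new_tokens=50):
    # Assumes language_model(ids) returns logits of shape
    # (batch, seq_len, vocab_size) and the tokenizer follows the
    # Hugging Face encode/decode interface.
    ids = torch.tensor([tokenizer.encode(prompt)])
    for _ in range(max_new_tokens):
        logits = language_model(ids)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())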

Future Thoughts: The Path Ahead

Looking forward, I plan to explore advanced techniques such as fine‑tuning on specific tasks and mixed‑precision training for efficiency. If you’re on a similar path, feel free to share your tools and challenges.
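
For the curious, mixed-precision training in PyTorch mostly comes down to wrapping the forward pass in autocast and scaling the loss. A minimal sketch, reusing the placeholder dataloader and loss function from the training step above:

scaler = torch.cuda.amp.GradScaler()

for batch, targets in dataloader:        # placeholder dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # run the forward pass in mixed precision
        logits = model(batch)
        loss = loss_fn(logits, targets)  # placeholder loss function
    scaler.scale(loss).backward()        # scale the loss to avoid gradient underflow
    scaler.unscale_(optimizer)           # unscale so clipping sees the true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)               # skips the step if gradients contain inf/nan
    scaler.update()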

Building an LLM from scratch has been a rewarding experience—embracing failures, celebrating small victories, and continuously pushing boundaries. Grab your RTX 3090 and let’s build something amazing together!
