Linear Regression: Code (a) Line

Published: May 2, 2026 at 02:19 PM EDT
3 min read
Source: Dev.to

It’s time to write your first ML model and predict house prices.

To follow along, take a look at the complete project:

https://github.com/yotambelgoroski/ml_unchained-house_pricing

Step 1: It’s all about data

ML is all about data—you can’t create a model without training it, and you can’t train it without data.

Our dataset is typically split into two parts:

  • Training data – data used to train a model
  • Test data – once a model is trained, we take input (x) from the test set, predict the output (ŷ), and compare that prediction to the real value (y). This tells us how well our model performs.

In more advanced setups, you might also see a validation set, which is used to tune the model before testing it.

Where does data come from?

The answer depends on your business and use case. For learning purposes, Kaggle is a great source for datasets and ML resources. To keep things simple, I use a script that generates synthetic data.
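
The repo's actual generator script may differ, but a minimal synthetic-data sketch could look like this (the column names `sqm` and `price`, the price-per-sqm rate, and the noise level are all assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the data is reproducible

def make_houses(n: int = 10) -> pd.DataFrame:
    """Generate n synthetic houses: size in sqm and a noisy linear price."""
    sqm = rng.uniform(40, 200, size=n)
    # Assumed base rate of 3,000 per sqm, plus Gaussian noise
    price = 3_000 * sqm + rng.normal(0, 20_000, size=n)
    return pd.DataFrame({"sqm": sqm.round(1), "price": price.round(0)})

df = make_houses(10)
print(df.head())
```

Because the data is generated from a known linear relationship, we also know roughly what coefficients a good model should recover, which is handy when sanity-checking results.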

How much data do I need for training?

There is no fixed number; as model complexity increases, more data is required. A common rule of thumb is:

Have 10×–20× more data points than features (independent variables)

We currently have one feature (sqm), so I used 10 records to train the model—the bare minimum to keep things simple.

How much data do I need for testing?

A simple approach is to split your dataset using an 80:20 ratio:

  • 80% for training
  • 20% for testing
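
With scikit-learn, this split is a one-liner. Here is a sketch using a small made-up dataset (the numbers are illustrative, not the repo's actual data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical records: one feature (sqm) and the target (price).
df = pd.DataFrame({
    "sqm":   [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
    "price": [155e3, 180e3, 210e3, 240e3, 265e3,
              300e3, 330e3, 355e3, 390e3, 420e3],
})

# 80% for training, 20% for testing; random_state makes the split reproducible.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
print(len(train_df), len(test_df))  # → 8 2
```

`train_test_split` shuffles before splitting, which matters if your rows are ordered (e.g., sorted by price).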

Step 2: Training the model

Now that we have our dataset, it’s time to train a model.

Training involves three steps:

  1. Load the training data
  2. Train the model in memory based on that data
  3. Serialize – save the trained model to disk so it can be reused without retraining

import joblib
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression

FEATURE_COLS = ["sqm"]
TARGET_COL = "price"
MODEL_FILENAME = "house_price_model.joblib"

def load_training_data(train_path: Path) -> pd.DataFrame:
    return pd.read_csv(train_path)

def train_model(df: pd.DataFrame) -> LinearRegression:
    model = LinearRegression()
    model.fit(df[FEATURE_COLS], df[TARGET_COL])
    return model

def save_model(model: LinearRegression, dest_path: Path) -> None:
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, dest_path)
    print(f"Model saved → {dest_path}")

def train(train_path: Path, model_dir: Path) -> LinearRegression:
    df = load_training_data(train_path)
    model = train_model(df)
    save_model(model, model_dir / MODEL_FILENAME)
    print(f"Model trained on {len(df)} samples.")
    return model

This is it—our first model!

Our Dependencies

  • Pandas – a data handling library for working with tabular data. Its core structure, the DataFrame, allows us to easily access and manipulate data.
  • scikit-learn – a machine learning library for Python. LinearRegression is one of its models, used to learn the best linear relationship between input features and a target value.
  • Joblib – a utility library used here for serialization. It allows us to save a trained model to disk and load it later for inference.
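
To see the serialization round trip in action, here is a minimal sketch: fit a model on a few made-up points, dump it with joblib, and load it back for a prediction (the file name and data are assumptions, not the repo's values):

```python
import joblib
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression

# Fit a tiny model on exactly linear data: price = 3000 * sqm.
df = pd.DataFrame({"sqm": [50, 100, 150], "price": [150_000, 300_000, 450_000]})
model = LinearRegression().fit(df[["sqm"]], df["price"])

# Serialize to disk, then load it back as a fresh object.
path = Path("house_price_model.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)

# The loaded model predicts without any retraining.
pred = loaded.predict(pd.DataFrame({"sqm": [120]}))[0]
print(round(pred))  # → 360000 (the data is exactly linear)
```

One caveat worth knowing: joblib pickles the model, so you should load it with the same (or a compatible) scikit-learn version used to save it.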

Congratulations — you’ve created your first model! However, it’s not production‑ready yet. Next, we’ll use the test data to evaluate how good our model really is.
