Linear Regression: Code (a) Line

Published: May 2, 2026 at 02:19 PM EDT
3 min read
Source: Dev.to

It’s time to write your first ML model and predict house prices.

To follow along, take a look at the complete project:

https://github.com/yotambelgoroski/ml_unchained-house_pricing

Step 1: It’s all about data

ML is all about data—you can’t create a model without training it, and you can’t train it without data.

Our dataset is typically split into two parts:

  • Training data – data used to train a model
  • Test data – once a model is trained, we take input (x) from the test set, predict the output (ŷ), and compare that prediction to the real value (y). This tells us how well our model performs.

In more advanced setups, you might also see a validation set, which is used to tune the model before testing it.

Where does data come from?

The answer depends on your business and use case. For learning purposes, Kaggle is a great source for datasets and ML resources. To keep things simple, I use a script that generates synthetic data.
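
The repo's actual generator script may differ, but a minimal synthetic-data sketch could look like this (the column names `sqm` and `price`, the price-per-sqm rate, and the noise level are all assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the data is reproducible

def make_houses(n: int = 10) -> pd.DataFrame:
    """Generate n synthetic houses: size in sqm and a noisy linear price."""
    sqm = rng.uniform(40, 200, size=n)
    # Assumed base rate of 3,000 per sqm, plus Gaussian noise
    price = 3_000 * sqm + rng.normal(0, 20_000, size=n)
    return pd.DataFrame({"sqm": sqm.round(1), "price": price.round(0)})

df = make_houses(10)
print(df.head())
```

Because the data is generated from a known linear relationship, we also know roughly what coefficients a good model should recover, which is handy when sanity-checking results.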

How much data do I need for training?

There is no fixed number; as model complexity increases, more data is required. A common rule of thumb is:

Have 10×–20× more data points than features (independent variables)

We currently have one feature (sqm), so I used 10 records to train the model—the bare minimum to keep things simple.

How much data do I need for testing?

A simple approach is to split your dataset using an 80:20 ratio:

  • 80% for training
  • 20% for testing
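
With scikit-learn, this split is a one-liner. Here is a sketch using a small made-up dataset (the numbers are illustrative, not the repo's actual data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical records: one feature (sqm) and the target (price).
df = pd.DataFrame({
    "sqm":   [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
    "price": [155e3, 180e3, 210e3, 240e3, 265e3,
              300e3, 330e3, 355e3, 390e3, 420e3],
})

# 80% for training, 20% for testing; random_state makes the split reproducible.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
print(len(train_df), len(test_df))  # → 8 2
```

`train_test_split` shuffles before splitting, which matters if your rows are ordered (e.g., sorted by price).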

Step 2: Training the model

Now that we have our dataset, it’s time to train a model.

Training involves three steps:

  1. Load the training data
  2. Train the model in memory based on that data
  3. Serialize – save the trained model to disk so it can be reused without retraining

import joblib
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression

FEATURE_COLS = ["sqm"]
TARGET_COL = "price"
MODEL_FILENAME = "house_price_model.joblib"

def load_training_data(train_path: Path) -> pd.DataFrame:
    return pd.read_csv(train_path)

def train_model(df: pd.DataFrame) -> LinearRegression:
    model = LinearRegression()
    model.fit(df[FEATURE_COLS], df[TARGET_COL])
    return model

def save_model(model: LinearRegression, dest_path: Path) -> None:
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, dest_path)
    print(f"Model saved → {dest_path}")

def train(train_path: Path, model_dir: Path) -> LinearRegression:
    df = load_training_data(train_path)
    model = train_model(df)
    save_model(model, model_dir / MODEL_FILENAME)
    print(f"Model trained on {len(df)} samples.")
    return model

This is it—our first model!

Our Dependencies

  • Pandas – a data handling library for working with tabular data. Its core structure, the DataFrame, allows us to easily access and manipulate data.
  • scikit-learn – a machine learning library for Python. LinearRegression is one of its models, used to learn the best linear relationship between input features and a target value.
  • Joblib – a utility library used here for serialization. It allows us to save a trained model to disk and load it later for inference.
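
To see the serialization round trip in action, here is a minimal sketch: fit a model on a few made-up points, dump it with joblib, and load it back for a prediction (the file name and data are assumptions, not the repo's values):

```python
import joblib
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression

# Fit a tiny model on exactly linear data: price = 3000 * sqm.
df = pd.DataFrame({"sqm": [50, 100, 150], "price": [150_000, 300_000, 450_000]})
model = LinearRegression().fit(df[["sqm"]], df["price"])

# Serialize to disk, then load it back as a fresh object.
path = Path("house_price_model.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)

# The loaded model predicts without any retraining.
pred = loaded.predict(pd.DataFrame({"sqm": [120]}))[0]
print(round(pred))  # → 360000 (the data is exactly linear)
```

One caveat worth knowing: joblib pickles the model, so you should load it with the same (or a compatible) scikit-learn version used to save it.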

Congratulations — you’ve created your first model! However, it’s not production‑ready yet. Next, we’ll use the test data to evaluate how good our model really is.
