Customer Lifetime Value (CLV) Prediction with Machine Learning

Published: February 23, 2026
6 min read
Source: Dev.to

Introduction

Customer acquisition is expensive. But do you know which customers will actually generate long‑term revenue? That’s where Customer Lifetime Value (CLV) comes in.

Instead of focusing on one‑off transactions, CLV estimates the total revenue a business expects from a customer over their entire relationship.
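Before any machine learning, it helps to see the idea as arithmetic. A common back-of-the-envelope heuristic multiplies average transaction value, purchase frequency, and expected lifespan; the figures below are purely illustrative, and the model we build later learns this relationship from data instead of assuming it:

```python
# Back-of-the-envelope CLV heuristic; all figures are illustrative.
avg_transaction_value = 60.0    # dollars per purchase
purchases_per_year = 8          # purchase frequency
expected_lifespan_years = 3     # expected length of the relationship

clv = avg_transaction_value * purchases_per_year * expected_lifespan_years
print(clv)  # 1440.0
```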

In this project I built an end‑to‑end CLV prediction model and then deployed it as a production‑ready API.

In this article we’ll cover:

  • Business problem
  • Data preprocessing
  • Model development
  • Model evaluation
  • Model deployment with FastAPI
  • Production‑ready setup

The Business Problem

Businesses want to answer:

  • Which customers are most valuable?
  • Who should receive retention incentives?
  • Where should marketing budgets be allocated?

Predicting CLV helps with:

  • Customer segmentation
  • Revenue forecasting
  • Budget optimization
  • Retention strategies

This is a regression problem since CLV is a continuous value.


Step 1: Data Preprocessing

The dataset includes:

  • Purchase frequency
  • Recency
  • Average transaction value
  • Tenure
  • Demographic features
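If you are starting from a raw transactions table rather than pre-built features, most of these columns can be derived with a pandas groupby. A minimal sketch, assuming hypothetical `customer_id`, `order_date`, and `amount` columns:

```python
import pandas as pd

# Hypothetical raw transactions table; column names are illustrative
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'order_date': pd.to_datetime(
        ['2025-01-05', '2025-03-10', '2025-02-01', '2025-02-20', '2025-04-01']),
    'amount': [120.0, 80.0, 40.0, 60.0, 50.0],
})

snapshot = transactions['order_date'].max()  # reference date for recency/tenure

features = transactions.groupby('customer_id').agg(
    frequency=('order_date', 'count'),                              # purchase frequency
    recency_days=('order_date', lambda d: (snapshot - d.max()).days),
    avg_transaction_value=('amount', 'mean'),
    tenure_days=('order_date', lambda d: (snapshot - d.min()).days),
).reset_index()

print(features)
```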

Data preparation

Before training any model we need to separate the features from the target variable. In this case CLV is what we’re trying to predict, and everything else serves as input:

x = df.drop('CLV', axis=1)
y = df['CLV']

We also check for missing values:

x.isnull().sum()

Clean data is non‑negotiable. Missing values can silently corrupt a model’s performance if left unaddressed.
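One simple remedy, sketched here on a toy frame (the column names are invented), is to impute numeric gaps with the column median, which is robust to the skewed distributions typical of spend data:

```python
import pandas as pd

# Toy stand-in for the real feature frame; x.isnull().sum() tells you which columns need this
x_demo = pd.DataFrame({
    'Tenure_Months': [12, 24, None, 36],
    'Monthly_Spend': [50.0, None, 80.0, 65.0],
})

# Replace each missing value with that column's median
x_clean = x_demo.fillna(x_demo.median(numeric_only=True))
print(x_clean.isnull().sum().sum())  # 0
```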

Splitting the dataset

We divide the data into training and testing sets (80% for training, 20% for evaluation):

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

Setting random_state=42 ensures reproducibility, so results remain consistent across runs.


Step 2: Model Development

Linear Regression

We start with linear regression, a simple but interpretable baseline. It assumes a linear relationship between the features and the target, making it fast to train and easy to explain to stakeholders.

from sklearn.linear_model import LinearRegression

linear = LinearRegression()
linear.fit(x_train, y_train)
predictions = linear.predict(x_test)

Random Forest Regressor

Next we train a Random Forest – an ensemble method that builds 200 decision trees and averages their predictions. This approach is more robust to non‑linear patterns and typically outperforms linear models on complex real‑world data.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(x_train, y_train)
rf_predictions = rf.predict(x_test)

Step 3: Model Evaluation

We evaluate both models using Root Mean Squared Error (RMSE) and R² Score. RMSE tells us the average prediction error in the same units as CLV, while R² tells us how much of the variance in CLV our model explains (1 = perfect, 0 = no better than guessing the mean).

from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

rmse_linear = sqrt(mean_squared_error(y_test, predictions))
r2_linear   = r2_score(y_test, predictions)

rmse_rf = sqrt(mean_squared_error(y_test, rf_predictions))
r2_rf   = r2_score(y_test, rf_predictions)

print(f'RMSE_linear: {rmse_linear}')
print(f'R2_linear:   {r2_linear}')
print(f'RMSE_rf:     {rmse_rf}')
print(f'R2_rf:       {r2_rf}')

In most real‑world CLV scenarios, the Random Forest will outperform Linear Regression due to its ability to capture complex, non‑linear relationships between customer features and lifetime value.
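Rather than trusting a single train/test split, you can make this comparison more robust with k-fold cross-validation. A sketch on synthetic data (which, being generated from a linear process, will actually favor the linear model; the point here is the procedure, not the winner):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the CLV feature matrix and target
X, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=42)

for name, est in [('linear', LinearRegression()),
                  ('random_forest', RandomForestRegressor(n_estimators=100, random_state=42))]:
    # Five-fold cross-validated R² for each candidate model
    r2_scores = cross_val_score(est, X, y, cv=5, scoring='r2')
    print(f'{name}: mean R2 = {r2_scores.mean():.3f}')
```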

Saving the model

Once we’re satisfied with performance, we persist the trained model and the feature schema using joblib. This makes reloading the model later straightforward:

import joblib

# Capture the exact column order the model was trained on
feature_name = list(x_train.columns)

# Save
joblib.dump(rf, 'CLV_model.joblib')
joblib.dump(feature_name, 'modelfeatures.joblib')

# Load (example)
model = joblib.load('CLV_model.joblib')
feature_name = joblib.load('modelfeatures.joblib')

Saving the feature set alongside the model documents exactly what columns and structure the model expects at inference time, preventing subtle bugs when deploying.
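To sketch why the saved schema matters, a hypothetical helper can order incoming values to match it and reject incomplete requests (the feature names here are invented):

```python
import numpy as np

# Hypothetical schema, as would be loaded from modelfeatures.joblib
feature_name = ['Tenure_Months', 'Monthly_Spend']

def build_feature_vector(payload: dict) -> np.ndarray:
    # Fail loudly if the request lacks a column the model was trained on
    missing = [f for f in feature_name if f not in payload]
    if missing:
        raise ValueError(f'missing features: {missing}')
    # Order the values exactly as the model expects, regardless of dict order
    return np.array([[payload[f] for f in feature_name]])

vec = build_feature_vector({'Monthly_Spend': 80.0, 'Tenure_Months': 24})
print(vec)  # [[24. 80.]]
```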


Step 4: Model Deployment with FastAPI

Training a model is only half the work. To put it into production you need an API that other systems can call. Below is a minimal REST endpoint built with FastAPI.

1. Install dependencies

pip install fastapi uvicorn joblib scikit-learn pandas

2. Create the API

from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title='Customer Lifetime Value Prediction API')

# Load the saved model and feature schema
model = joblib.load('CLV_model.joblib')
feature_name = joblib.load('modelfeatures.joblib')

# Define the input schema (adjust fields to match your actual dataset columns)
class CLVInput(BaseModel):
    Customer_Age: int
    Annual_Income: float
    Tenure_Months: int
    Monthly_Spend: float
    Visits_Per_Month: int
    Avg_Basket_Value: float
    Support_Tickets: int

@app.get("/")
def health_check():
    return {"status": "API is running"}

@app.post("/predict-CLV")
def predict_clv(data: CLVInput):
    # Build a feature vector in the same order as the model expects
    x = np.array([[getattr(data, f) for f in feature_name]])
    prediction = model.predict(x)[0]
    # Cast the numpy float to a plain Python float so it serializes to JSON
    return {"predicted_CLV": float(prediction)}

Run the service:

uvicorn your_script_name:app --reload

You now have a production‑ready endpoint that can be called by downstream applications, dashboards, or batch jobs. While the server runs, FastAPI also serves automatically generated interactive documentation at the /docs path, which is handy for manual testing.


Summary

  • Defined the business problem and why CLV matters.
  • Preprocessed the data, handling missing values and splitting into train/test sets.
  • Trained two models (Linear Regression, Random Forest) and compared them using RMSE & R².
  • Saved the best model and its feature schema with joblib.
  • Wrapped the model in a FastAPI service for real‑time inference.

With this pipeline you can continuously retrain, version, and serve CLV predictions, enabling data‑driven decisions around acquisition, retention, and budgeting.


Deploying to the Cloud

For production, deploy the API to a cloud provider. Here’s a quick overview:

Railway or Render (simplest): Push your code to GitHub and connect the repository. Both platforms auto‑detect Python apps and handle deployment with minimal configuration.

Requirements

Create a requirements.txt file with the following packages:

fastapi
uvicorn
joblib
scikit-learn
pandas

Predicting Customer Lifetime Value turns raw customer data into a strategic business asset. With a deployed model, your sales and marketing teams can make real‑time decisions based on predicted value, not just historical behaviour.
