# Customer Lifetime Value (CLV) Prediction with Machine Learning
Source: Dev.to
## Introduction
Customer acquisition is expensive. But do you know which customers will actually generate long‑term revenue? That’s where Customer Lifetime Value (CLV) comes in.
Instead of focusing on one‑off transactions, CLV estimates the total revenue a business expects from a customer over their entire relationship.
In this project I built an end‑to‑end CLV prediction model and then deployed it as a production‑ready API.
In this article we’ll cover:
- Business problem
- Data preprocessing
- Model development
- Model evaluation
- Model deployment with FastAPI
- Production‑ready setup
## The Business Problem
Businesses want to answer:
- Which customers are most valuable?
- Who should receive retention incentives?
- Where should marketing budgets be allocated?
Predicting CLV helps with:
- Customer segmentation
- Revenue forecasting
- Budget optimization
- Retention strategies
This is a regression problem since CLV is a continuous value.
## Step 1: Data Preprocessing
The dataset includes:
- Purchase frequency
- Recency
- Average transaction value
- Tenure
- Demographic features
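To make the following snippets concrete, here is a tiny synthetic frame with the kinds of columns listed above. The column names and values are hypothetical, not from the original dataset:

```python
import pandas as pd

# Hypothetical toy dataset mirroring the feature types described above;
# real column names will depend on your data source.
df = pd.DataFrame({
    "Purchase_Frequency": [12, 3, 25, 7],
    "Recency_Days":       [14, 90, 2, 45],
    "Avg_Transaction":    [55.0, 20.5, 80.0, 33.3],
    "Tenure_Months":      [24, 6, 36, 12],
    "Age":                [34, 52, 41, 29],
    "CLV":                [1800.0, 250.0, 5200.0, 900.0],  # target
})
print(df.shape)  # (4, 6)
```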
### Data preparation
Before training any model we need to separate the features from the target variable. In this case CLV is what we’re trying to predict, and everything else serves as input:
```python
x = df.drop('CLV', axis=1)
y = df['CLV']
```
We also check for missing values:
```python
x.isnull().sum()
```
Clean data is non‑negotiable. Missing values can silently corrupt a model’s performance if left unaddressed.
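The original pipeline only inspects the counts; one common remedy (an assumption on my part, not shown in the source) is median imputation for numeric columns, since the median is robust to outliers:

```python
import numpy as np
import pandas as pd

# Toy features with gaps (hypothetical column names)
x = pd.DataFrame({
    "Tenure_Months": [24, np.nan, 36, 12],
    "Monthly_Spend": [50.0, 20.0, np.nan, 33.0],
})

# Fill each numeric column's missing values with that column's median
x = x.fillna(x.median(numeric_only=True))
print(x.isnull().sum().sum())  # 0
```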
### Splitting the dataset
We divide the data into training and testing sets (80% for training, 20% for evaluation):
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
```
Setting `random_state=42` ensures reproducibility, so results remain consistent across runs.
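The claim is easy to verify: two calls with the same `random_state` return identical partitions. A small self-contained check:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two splits with the same random_state produce identical partitions
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(x, y, test_size=0.2, random_state=42)
print((a_test == b_test).all())  # True
```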
## Step 2: Model Development

### Linear Regression
We start with linear regression, a simple but interpretable baseline. It assumes a linear relationship between the features and the target, making it fast to train and easy to explain to stakeholders.
```python
from sklearn.linear_model import LinearRegression

linear = LinearRegression()
linear.fit(x_train, y_train)
predictions = linear.predict(x_test)
```
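That interpretability comes from the fitted coefficients: each one estimates how much CLV changes per unit change in a feature. A sketch on synthetic data (the feature names and the generating rule are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = pd.DataFrame({
    "Monthly_Spend": rng.uniform(10, 100, 200),
    "Tenure_Months": rng.integers(1, 48, 200),
})
# Synthetic target: CLV grows with spend and tenure, plus a little noise
y = 12 * x["Monthly_Spend"] + 30 * x["Tenure_Months"] + rng.normal(0, 5, 200)

linear = LinearRegression().fit(x, y)
for name, coef in zip(x.columns, linear.coef_):
    print(f"{name}: {coef:.1f}")  # recovers roughly 12 and 30
```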
### Random Forest Regressor
Next we train a Random Forest – an ensemble method that builds 200 decision trees and averages their predictions. This approach is more robust to non‑linear patterns and typically outperforms linear models on complex real‑world data.
```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(x_train, y_train)
rf_predictions = rf.predict(x_test)
```
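Random Forests also expose `feature_importances_`, which is useful for the segmentation and budgeting questions raised earlier. A self-contained sketch on synthetic data (names and the generating rule are invented):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = pd.DataFrame({
    "Monthly_Spend":   rng.uniform(10, 100, 300),
    "Tenure_Months":   rng.integers(1, 48, 300),
    "Support_Tickets": rng.integers(0, 5, 300),  # deliberately irrelevant
})
y = 12 * x["Monthly_Spend"] + 30 * x["Tenure_Months"]

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(x, y)
# Importances sum to 1; the irrelevant feature should land near zero
for name, imp in sorted(zip(x.columns, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```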
## Step 3: Model Evaluation
We evaluate both models using Root Mean Squared Error (RMSE) and R² Score. RMSE tells us the average prediction error in the same units as CLV, while R² tells us how much of the variance in CLV our model explains (1 = perfect, 0 = no better than guessing the mean).
```python
from math import sqrt

from sklearn.metrics import mean_squared_error, r2_score

rmse_linear = sqrt(mean_squared_error(y_test, predictions))
r2_linear = r2_score(y_test, predictions)
rmse_rf = sqrt(mean_squared_error(y_test, rf_predictions))
r2_rf = r2_score(y_test, rf_predictions)

print(f'RMSE_linear: {rmse_linear}')
print(f'R2_linear: {r2_linear}')
print(f'RMSE_rf: {rmse_rf}')
print(f'R2_rf: {r2_rf}')
```
In most real‑world CLV scenarios, the Random Forest will outperform Linear Regression due to its ability to capture complex, non‑linear relationships between customer features and lifetime value.
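A single train/test split can be noisy; cross-validation gives a steadier comparison. A sketch on synthetic data: note that because this toy target is almost perfectly linear, Linear Regression will actually win here, while real CLV data with interactions and non-linearities tends to favour the forest:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data purely for illustration
x, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=42)

for name, model in [("linear", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=50,
                                                            random_state=42))]:
    # Negated RMSE averaged over 5 folds (sklearn scorers are "higher is better")
    scores = cross_val_score(model, x, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {-scores.mean():.1f}")
```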
### Saving the model
Once we’re satisfied with performance, we persist the trained model and the feature schema using joblib. This makes reloading the model later straightforward:
```python
import joblib

# The feature schema: the column names (and order) the model was trained on
feature_name = list(x_train.columns)

# Save
joblib.dump(rf, 'CLV_model.joblib')
joblib.dump(feature_name, 'modelfeatures.joblib')

# Load (example)
model = joblib.load('CLV_model.joblib')
feature_name = joblib.load('modelfeatures.joblib')
```
Saving the feature set alongside the model documents exactly what columns and structure the model expects at inference time, preventing subtle bugs when deploying.
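At inference time, the saved schema lets you force incoming data into the exact column order the model saw during training. For example (hypothetical fields):

```python
import pandas as pd

feature_name = ["Monthly_Spend", "Tenure_Months"]  # as persisted with joblib

# An incoming payload may arrive with fields in a different order;
# reindex guarantees the column order the model was trained on.
payload = {"Tenure_Months": 12, "Monthly_Spend": 50.0}
row = pd.DataFrame([payload]).reindex(columns=feature_name)
print(list(row.columns))  # ['Monthly_Spend', 'Tenure_Months']
```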
## Step 4: Model Deployment with FastAPI
Training a model is only half the work. To put it into production you need an API that other systems can call. Below is a minimal REST endpoint built with FastAPI.
### 1. Install dependencies

```bash
pip install fastapi uvicorn joblib scikit-learn pandas
```
### 2. Create the API
```python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title='Customer Lifetime Value Prediction API')

# Load the saved model and feature schema
model = joblib.load('CLV_model.joblib')
feature_name = joblib.load('modelfeatures.joblib')

# Define the input schema (adjust fields to match your actual dataset columns)
class CLVInput(BaseModel):
    Customer_Age: int
    Annual_Income: float
    Tenure_Months: int
    Monthly_Spend: float
    Visits_Per_Month: int
    Avg_Basket_Value: float
    Support_Tickets: int

@app.get("/")
def health_check():
    return {"status": "API is running"}

@app.post("/predict-CLV")
def predict_clv(data: CLVInput):
    # Build a feature vector in the same order as the model expects
    x = np.array([[getattr(data, f) for f in feature_name]])
    prediction = model.predict(x)[0]
    # Cast to a plain float so the response is JSON-serialisable
    return {"predicted_CLV": float(prediction)}
```
### 3. Run the server locally

```bash
uvicorn app:app --reload
```

Assuming your code lives in `app.py`, the API will be live at http://127.0.0.1:8000. You can test it at http://127.0.0.1:8000/docs, where FastAPI generates interactive API documentation automatically.
### 4. Deploy to the Cloud
For production, deploy the API to a cloud provider. Here’s a quick overview:
Railway or Render (simplest): Push your code to GitHub and connect the repository. Both platforms auto‑detect Python apps and handle deployment with minimal configuration.
### Requirements

Create a `requirements.txt` file with the following packages:

```text
fastapi
uvicorn
joblib
scikit-learn
pandas
```
## Summary
Here’s the end‑to‑end workflow we covered:
- Load and explore the customer dataset.
- Prepare features by separating inputs from the CLV target.
- Train two models, Linear Regression and Random Forest, and compare them using RMSE and R².
- Save the best model and its feature schema using joblib.
- Deploy via FastAPI with a `/predict-CLV` endpoint that accepts customer data and returns a CLV estimate.
Predicting Customer Lifetime Value turns raw customer data into a strategic business asset. With a deployed model, your sales and marketing teams can make real‑time decisions based on predicted value, not just historical behaviour.