Predicting Tea Sales With ML: Linear Regression, Gradient Descent & Regularization (Beginner Friendly + Code)

Published: December 20, 2025 at 10:33 AM EST
6 min read
Source: Dev.to

📚 What You’ll Learn

  • Linear Regression (tea sales vs. temperature)
  • Cost Function (how wrong your predictions are)
  • Gradient Descent (how to improve step‑by‑step)
  • Overfitting (memorizing vs. learning patterns)
  • Regularization (keeping models simple)
  • Regularized Cost Function (Ridge/Lasso)
  • Practical code examples with NumPy & scikit‑learn

🧪 Setup (Run These First)

# Install if needed:
# pip install numpy pandas scikit-learn matplotlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)

⭐ Scenario 1 – Linear Regression (Tea Sales vs. Temperature)

Idea: Lower temperature → higher tea sales. Draw a straight line to predict sales from temperature.

# Synthetic dataset: temperature (°C) → tea cups sold
temps = np.array([10, 12, 15, 18, 20, 22, 24, 26, 28]).reshape(-1, 1)
tea_sales = np.array([100, 95, 85, 70, 60, 55, 50, 45, 40])

# Fit a basic linear regression
lin = LinearRegression()
lin.fit(temps, tea_sales)

print("Slope (m):", lin.coef_[0])          # cups change per 1 °C
print("Intercept (c):", lin.intercept_)   # base demand when temp = 0 °C

# Predict for tomorrow (e.g., 21 °C)
tomorrow_temp = np.array([[21]])
pred_sales = lin.predict(tomorrow_temp)
print("Predicted tea cups at 21 °C:", int(pred_sales[0]))

# Plot
plt.scatter(temps, tea_sales, color="teal", label="Actual")
plt.plot(temps, lin.predict(temps), color="orange", label="Fitted line")
plt.xlabel("Temperature (°C)")
plt.ylabel("Tea cups sold")
plt.title("Linear Regression: Tea Sales vs. Temperature")
plt.legend()
plt.show()

⭐ Scenario 2 – Cost Function (Measuring Wrongness)

Idea: Cost is the average of squared errors — big mistakes hurt more.
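
To see why big mistakes hurt more, here is a tiny standalone example with made-up error values (not the tea data): squaring lets one 10-cup miss dominate the average.

# Toy example: two prediction errors, +5 and -10 cups
errors = np.array([5.0, -10.0])
print("Mean squared error:", np.mean(errors ** 2))       # (25 + 100) / 2 = 62.5
print("Mean absolute error:", np.mean(np.abs(errors)))   # (5 + 10) / 2 = 7.5, for comparison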

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_pred = lin.predict(temps)
print("Mean Squared Error (MSE):", mse(tea_sales, y_pred))

⭐ Scenario 3 – Gradient Descent (Step‑by‑Step Improvement)

Idea: Adjust slope m and intercept c gradually to reduce cost — like tuning a tea recipe.

# Gradient Descent for y = m*x + c (from scratch)
X = temps.flatten()
y = tea_sales.astype(float)

m, c = 0.0, 0.0          # initial guesses
lr = 0.0005              # learning rate (step size)
epochs = 5000

def predictions(m, c, X):
    return m * X + c

def gradients(m, c, X, y):
    y_hat = predictions(m, c, X)
    dm = (-2 / len(X)) * np.sum(X * (y - y_hat))
    dc = (-2 / len(X)) * np.sum(y - y_hat)
    return dm, dc

history = []
for _ in range(epochs):
    dm, dc = gradients(m, c, X, y)
    m -= lr * dm
    c -= lr * dc
    history.append(mse(y, predictions(m, c, X)))

print(f"GD learned slope m={m:.3f}, intercept c={c:.3f}, final MSE={history[-1]:.2f}")

# Plot loss curve
plt.plot(history)
plt.xlabel("Epoch")
plt.ylabel("MSE (Cost)")
plt.title("Gradient Descent: Cost vs. Epochs")
plt.show()

Tip: If lr is too large, the loss will bounce or explode. If it’s too small, learning will be very slow.
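
You can check this yourself with a quick experiment that reuses X, y, gradients, predictions, and mse from above; the three learning rates are just illustrative picks for this dataset, and the exact numbers will vary.

# Short bursts of gradient descent at different learning rates
for test_lr in [0.000001, 0.0005, 0.01]:
    m_t, c_t = 0.0, 0.0
    for _ in range(50):
        dm_t, dc_t = gradients(m_t, c_t, X, y)
        m_t -= test_lr * dm_t
        c_t -= test_lr * dc_t
    print(f"lr={test_lr}: MSE after 50 epochs = {mse(y, predictions(m_t, c_t, X)):.4g}")

Expect the tiny rate to barely move, the middle rate to make steady progress (it still needs many more epochs to converge), and the large rate to explode.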

⭐ Scenario 4 – Overfitting (Memorizing Noise)

We’ll simulate a richer dataset with useful and noisy features.

# Build a dataset with signal + noise
n = 300
temp      = np.random.uniform(5, 35, size=n)               # useful
rain      = np.random.binomial(1, 0.3, size=n)             # somewhat useful
festival  = np.random.binomial(1, 0.1, size=n)             # sometimes useful
traffic   = np.random.normal(0, 1, size=n)                # weak/noisy
dog_barks = np.random.normal(0, 1, size=n)                # pure noise

# True relationship (unknown to the model)
true_sales = (120 - 2.5 * temp + 10 * rain + 15 * festival
              + 1.0 * np.random.normal(0, 3, size=n))   # added noise

# Feature matrix
X = np.column_stack([temp, rain, festival, traffic, dog_barks])
feature_names = ["temp", "rain", "festival", "traffic", "dog_barks"]

X_train, X_test, y_train, y_test = train_test_split(
    X, true_sales, test_size=0.25, random_state=42
)

# Plain Linear Regression (can overfit)
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

print("Linear Regression Coefficients:")
for name, coef in zip(feature_names, lr_model.coef_):
    print(f"  {name}: {coef:.3f}")

print("Train MSE:", mean_squared_error(y_train, lr_model.predict(X_train)))
print("Test  MSE:", mean_squared_error(y_test,  lr_model.predict(X_test)))

If you see large coefficients on obviously noisy features (e.g., dog_barks) or a train MSE much lower than test MSE, that’s overfitting.
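
One quick, informal check is to compare the two errors directly; treating a ratio well above 1.0 as a warning sign is just a rule of thumb, not a standard metric.

# Rough overfitting check: how much worse is the model on unseen data?
train_mse = mean_squared_error(y_train, lr_model.predict(X_train))
test_mse  = mean_squared_error(y_test,  lr_model.predict(X_test))
print(f"Test/Train MSE ratio: {test_mse / train_mse:.2f}  (well above 1.0 hints at overfitting)")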

⭐ Scenario 5 – Fixing Overfitting

Strategies

  1. Remove useless features (manual feature selection; see the sketch after this list).
  2. Gather more data (the classic remedy).
  3. Use regularization (systematic penalty on large weights).
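
Here is a minimal sketch of strategy 1 on the synthetic data above: drop the columns we suspect are noise (traffic, dog_barks) and refit. The kept column indices simply follow the order in which we stacked the features.

# Strategy 1 sketch: keep only the features we believe matter
keep = [0, 1, 2]                      # temp, rain, festival (by construction of X)
lr_small = LinearRegression()
lr_small.fit(X_train[:, keep], y_train)

print("Train MSE (3 features):", mean_squared_error(y_train, lr_small.predict(X_train[:, keep])))
print("Test  MSE (3 features):", mean_squared_error(y_test,  lr_small.predict(X_test[:, keep])))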

⭐ Scenario 6 – Regularization (Penalty for Complexity)

Regularization adds a penalty term to the cost that shrinks large coefficients — like telling your tea‑maker to use fewer ingredients or lose a bonus.
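
As a rough sketch of what that penalized cost looks like: Ridge adds alpha times the sum of squared weights to the MSE, while Lasso adds alpha times the sum of absolute weights. This is for intuition only; scikit-learn's actual Ridge objective scales the terms slightly differently, and the intercept is not penalized.

# Regularized cost = data fit (MSE) + alpha * size of the weights
def ridge_style_cost(model, X_mat, y_vec, alpha):
    data_fit = np.mean((y_vec - model.predict(X_mat)) ** 2)
    penalty = alpha * np.sum(model.coef_ ** 2)   # Lasso: alpha * np.sum(np.abs(model.coef_))
    return data_fit + penalty

print("Plain MSE:             ", mean_squared_error(y_train, lr_model.predict(X_train)))
print("With alpha=1.0 penalty:", ridge_style_cost(lr_model, X_train, y_train, alpha=1.0))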

⭐ Scenario 7 – Regularized Linear Regression (Ridge & Lasso)

# Ridge (L2) – penalizes squared weights
ridge = Ridge(alpha=1.0)          # alpha = regularization strength
ridge.fit(X_train, y_train)

# Lasso (L1) – penalizes absolute weights, can zero‑out features
lasso = Lasso(alpha=0.5, max_iter=10000)
lasso.fit(X_train, y_train)

def show_results(model, name):
    print(f"\n{name} Coefficients:")
    for feat, coef in zip(feature_names, model.coef_):
        print(f"  {feat}: {coef:.3f}")
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse  = mean_squared_error(y_test,  model.predict(X_test))
    print(f"Train MSE: {train_mse:.2f}")
    print(f"Test  MSE: {test_mse:.2f}")

show_results(ridge, "Ridge")
show_results(lasso, "Lasso")

What to look for

| Model | Effect on Coefficients | Typical Outcome |
| --- | --- | --- |
| Ridge | Shrinks all coefficients toward zero but keeps them all | Reduces variance, improves test‑set performance |
| Lasso | Can drive some coefficients exactly to zero | Performs both regularization and feature selection |


Now let's apply stronger penalties and watch what happens to the useless features.

# Ridge: L2 penalty
ridge = Ridge(alpha=10.0)   # alpha = λ (higher = stronger penalty)
ridge.fit(X_train, y_train)

print("\nRidge Coefficients (alpha=10):")
for name, coef in zip(feature_names, ridge.coef_):
    print(f"  {name}: {coef:.3f}")

print("Ridge Train MSE:", mean_squared_error(y_train, ridge.predict(X_train)))
print("Ridge Test  MSE:", mean_squared_error(y_test,  ridge.predict(X_test)))

# Lasso: L1 penalty
lasso = Lasso(alpha=1.0)    # try different alphas like 0.1, 0.5, 2.0
lasso.fit(X_train, y_train)

print("\nLasso Coefficients (alpha=1.0):")
for name, coef in zip(feature_names, lasso.coef_):
    print(f"  {name}: {coef:.3f}")

print("Lasso Train MSE:", mean_squared_error(y_train, lasso.predict(X_train)))
print("Lasso Test  MSE:", mean_squared_error(y_test,  lasso.predict(X_test)))

What to look for with the stronger penalties

  • Ridge should shrink noisy coefficients closer to zero.
  • Lasso may set truly useless features exactly to zero (feature selection).
  • Test MSE should improve vs. plain Linear Regression.

⭐ Scenario 8 – How Regularization Fixes Overfitting (Deep Dive)

Let’s compare models across different penalties and visualize coefficient shrinkage.

alphas = [0.0, 0.1, 1.0, 10.0, 50.0]  # 0.0 ~ plain linear regression for comparison
coef_paths_ridge = []
train_mse_ridge, test_mse_ridge = [], []

for a in alphas:
    if a == 0.0:
        model = LinearRegression()
    else:
        model = Ridge(alpha=a)
    model.fit(X_train, y_train)
    coef_paths_ridge.append(model.coef_)
    train_mse_ridge.append(mean_squared_error(y_train, model.predict(X_train)))
    test_mse_ridge.append(mean_squared_error(y_test, model.predict(X_test)))

coef_paths_ridge = np.array(coef_paths_ridge)

# Plot Ridge coefficient paths
plt.figure(figsize=(8, 5))
for i, name in enumerate(feature_names):
    plt.plot(alphas, coef_paths_ridge[:, i], marker="o", label=name)
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("Coefficient value")
plt.title("Ridge: Coefficient Shrinkage with Increasing Penalty")
plt.legend()
plt.show()

# Plot Train vs Test MSE for Ridge
plt.figure(figsize=(8, 5))
plt.plot(alphas, train_mse_ridge, marker="o", label="Train MSE")
plt.plot(alphas, test_mse_ridge, marker="o", label="Test MSE")
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("MSE")
plt.title("Ridge: Train vs Test MSE Across Penalties")
plt.legend()
plt.show()

Interpretation

  • At low alpha, coefficients stay large → risk of overfitting (low train MSE, higher test MSE).
  • As alpha increases, coefficients shrink → simpler model, better generalization.
  • If alpha is too high, the model becomes too simple → underfitting (both MSEs rise).
  • Look for the alpha where test MSE is lowest; that’s the sweet spot (the snippet below picks it out from the sweep).
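
A tiny helper to pick that sweet spot from the sweep we just ran; in a real project you would typically cross-validate (for example with scikit-learn's RidgeCV) instead of trusting a single train/test split.

# Pick the alpha with the lowest test MSE from the sweep above
best_idx = int(np.argmin(test_mse_ridge))
print(f"Best alpha in this sweep: {alphas[best_idx]}  (test MSE = {test_mse_ridge[best_idx]:.2f})")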

🧠 Bonus: Simple Tea Forecast Function

def forecast_tea_cups(temp_c, rain=0, festival=0, model=ridge):
    """Quick helper using your fitted model (default: ridge)."""
    x = np.array([[temp_c, rain, festival, 0.0, 0.0]])  # ignore traffic/dog_barks at prediction time
    return float(model.predict(x)[0])

print("Forecast for 18°C, raining, festival day:",
      round(forecast_tea_cups(18, rain=1, festival=1)))
print("Forecast for 30°C, no rain, normal day:",
      round(forecast_tea_cups(30, rain=0, festival=0)))

✅ Final Takeaways

  • Linear Regression: Draws the best straight line between features and target.
  • Cost Function (MSE): Penalizes prediction errors, especially big ones.
  • Gradient Descent: Iteratively improves parameters to minimize cost.
  • Overfitting: Model learns noise; great on training, poor on new data.
  • Regularization (Ridge/Lasso): Shrinks weights, removes noise, improves generalization.
  • Choose α (lambda) carefully: Too small → overfit; too large → underfit.