Failed Machine Learning Experiment: Training XGBoost Classifier with 1.5m signals

Published: February 8, 2026 at 02:15 PM EST
3 min read
Source: Dev.to

In 2022 I started building trading strategies in Python, aiming for ML-based approaches despite my limited machine-learning knowledge. For this experiment I let AI write the code (Sonnet 4.5) and suggest model parameters (Grok Thinking).

After reviewing many market price charts, I hypothesized that certain patterns could be exploited with a suitable trading strategy and position optimization to generate a few percent of return. The experiment was set up in two Jupyter notebooks: one for XGBoost model training and another for strategy backtesting.

Data Collection & Labeling

  • Downloaded 5 years of 15‑minute price data for the top 30 crypto tokens and stored them as Parquet files.
  • Identified every price point that was followed by a drop larger than 3 % within the next ten 15‑minute intervals.
  • Extracted the preceding ten price points together with technical‑analysis indicators as features.
  • Generated 500 k “drop” signals and added 1 M random non‑drop samples, yielding 1.5 M training instances (20 % reserved for testing).
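
The labeling step above can be sketched as follows with NumPy (the function name and the toy price series are illustrative; the 3 % threshold and ten-interval horizon come from the post):

```python
import numpy as np

def label_drops(close: np.ndarray, horizon: int = 10, threshold: float = 0.03) -> np.ndarray:
    """Label a bar 1 if price falls more than `threshold` within the
    next `horizon` bars, else 0 (the 3 % / ten-interval rule)."""
    n = len(close)
    labels = np.zeros(n, dtype=int)
    for i in range(n - horizon):
        future_min = close[i + 1 : i + 1 + horizon].min()
        if (future_min - close[i]) / close[i] <= -threshold:
            labels[i] = 1
    return labels

prices = np.array([100, 101, 100.5, 96.0, 97, 98, 99, 100, 100, 100, 100, 100])
print(label_drops(prices))  # → [1 1 0 0 0 0 0 0 0 0 0 0]
```

Bars near the end of the series have no full ten-bar lookahead and stay unlabeled, which mirrors why the usable signal count is smaller than the raw bar count.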

To make drops comparable across assets, I normalized them using a Z‑score approach:

drop_zscore = drop_pct / volatility   # volatility = std deviation of returns
# Threshold: drop_zscore <= -2  (i.e., a drop twice the typical volatility)
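
As a runnable version of the snippet above (the sample return series is made up for illustration):

```python
import numpy as np

def drop_zscore(drop_pct: float, returns: np.ndarray) -> float:
    """Scale a drop by the asset's typical volatility (std of returns),
    so drops are comparable across assets."""
    volatility = returns.std()
    return drop_pct / volatility

returns = np.array([0.01, -0.02, 0.015, -0.01, 0.005])
z = drop_zscore(-0.03, returns)
print(z <= -2)  # signal only if the drop is at least twice the typical volatility
```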

Feature Engineering

Features were derived from momentum, volatility, and price‑difference indicators. After preprocessing, the dataset was fed into XGBoost with hyperparameters recommended by Grok:

# Recommended hyperparameters
max_depth: 3-7          # prevents memorizing noise
learning_rate: 0.01-0.1  # smaller = better with more trees
n_estimators: 200-500   # use early stopping
subsample: 0.6-0.9
colsample_bytree: 0.6-0.9
scale_pos_weight: 3-10  # handles class imbalance
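
Expressed as an XGBoost parameter dict, the ranges above might be instantiated like this (the post gives only ranges, so these point values are illustrative midpoints, not the values Grok actually suggested):

```python
# Illustrative point values inside the recommended ranges
xgb_params = {
    "max_depth": 5,           # shallow trees to avoid memorizing noise
    "learning_rate": 0.05,    # small step size, compensated by more trees
    "n_estimators": 300,      # combined with early stopping in practice
    "subsample": 0.8,         # row sampling per tree
    "colsample_bytree": 0.8,  # feature sampling per tree
    "scale_pos_weight": 5,    # upweight the rare "drop" class
}
# model = xgboost.XGBClassifier(**xgb_params)  # training call, if xgboost is available
```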

Model Performance

Train Set

ROC-AUC Score: 0.6899

Classification Report
----------------------
               precision    recall  f1-score   support
No Signal          0.93      0.62      0.74   3,149,036
Signal             0.19      0.66      0.29     426,220

accuracy                                 0.62   3,575,256
macro avg          0.56      0.64      0.52   3,575,256
weighted avg       0.84      0.62      0.69   3,575,256

Confusion Matrix
[[1,938,267 1,210,769]
 [  144,995   281,225]]

Test Set (Unseen Data)

ROC-AUC Score: 0.6761

Classification Report
...
Train AUC: 0.6899
Test AUC:  0.6761
Difference: 0.0138
✓ Good generalization – minimal overfitting
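
The overfitting check above boils down to comparing train and test AUC against a tolerance; a minimal sketch (the 0.05 tolerance is my assumption, not stated in the post):

```python
def generalization_gap_ok(train_auc: float, test_auc: float, tol: float = 0.05) -> bool:
    """Flag good generalization when the train/test AUC gap is small."""
    return abs(train_auc - test_auc) <= tol

print(generalization_gap_ok(0.6899, 0.6761))  # gap of 0.0138 → True
```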

Feature Importance

The most influential feature for predicting drops was the basic returns series. However, the model produced many false positives, which would be costly in a live portfolio.

Confusion Matrix by Feature Space

Backtesting

I loaded the trained model into a backtesting simulation, defining position parameters such as TP, SL, delay, and cooldown. A grid‑search over 900 parameter combinations (≈3 hours on a local machine) was performed to find the optimal configuration.
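
The grid search can be sketched with itertools.product (the parameter grids and the backtest stub are hypothetical; the real notebook simulates entries, exits, and position sizing, and its grid had 900 combinations rather than this toy 81):

```python
from itertools import product

# Hypothetical grids for take-profit, stop-loss, entry delay, and cooldown
tp_grid = [0.01, 0.02, 0.03]
sl_grid = [0.005, 0.01, 0.02]
delay_grid = [0, 1, 2]
cooldown_grid = [4, 8, 16]

def backtest(tp, sl, delay, cooldown):
    """Stub standing in for the real simulation; returns final equity."""
    return 0.0  # in the experiment, every combination ended at a 100 % loss

results = {
    params: backtest(*params)
    for params in product(tp_grid, sl_grid, delay_grid, cooldown_grid)
}
best = max(results, key=results.get)
print(len(results))  # 3 * 3 * 3 * 3 = 81 combinations in this toy grid
```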

Result: every one of the configurations ended in a 100 % loss; the strategy failed completely in backtesting.

Reflections

Working step-by-step with Cursor + Sonnet 4.5 felt seamless for code generation; the assistant produced working notebooks after minimal debugging. The Jupyter integration was cumbersome, though: each change required manually closing, reopening, and rerunning the notebook, so I eventually switched to Ask Mode and pasted code blocks by hand.

The experiment demonstrated that, despite decent AUC scores, the classifier’s false‑positive rate and the inability to translate predictions into profitable trades render the approach ineffective in its current form. Further work would need to address signal precision, risk management, and more realistic market frictions.
