Failed Machine Learning Experiment: Training an XGBoost Classifier on 1.5M Signals
Source: Dev.to

In 2022 I started creating trading strategies in Python, aiming for powerful ML-based approaches despite my limited knowledge. I decided to let AI write the code (Sonnet 4.5) and suggest model parameters (Grok Thinking).
After reviewing many market price charts, I hypothesized that certain patterns could be exploited with a suitable trading strategy and position optimization to generate a few percent of return. The experiment was set up in two Jupyter notebooks: one for XGBoost model training and another for strategy backtesting.
Data Collection & Labeling
- Downloaded 5 years of 15-minute price data for the top 30 crypto tokens and stored them as Parquet files.
- Identified every price point that was followed by a drop larger than 3% within the next ten 15-minute intervals.
- Extracted the preceding ten price points together with technical-analysis indicators as features.
- Generated 500K "drop" signals and added 1M random non-drop samples, yielding 1.5M training instances (20% reserved for testing).
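The labeling step above can be sketched in pandas; the function name `label_drops` and its signature are my own, not from the original notebooks:

```python
import pandas as pd

def label_drops(close: pd.Series, horizon: int = 10, threshold: float = -0.03) -> pd.Series:
    """Mark 1 where the price falls more than |threshold| within the next `horizon` bars."""
    # Rolling min over the reversed series gives, per bar, the minimum of that bar
    # and the following `horizon - 1` bars in forward order; shift(-1) then
    # excludes the current bar so only the *next* `horizon` bars are considered.
    future_min = close[::-1].rolling(horizon, min_periods=1).min()[::-1].shift(-1)
    fwd_return = future_min / close - 1.0
    return (fwd_return <= threshold).astype(int)
```

Applied per token to 15-minute closes with `horizon=10` and `threshold=-0.03`, this reproduces the ">3% drop within ten intervals" rule.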
To make drops comparable across assets, I normalized them using a Z-score approach:

```python
drop_zscore = drop_pct / volatility  # volatility = std. deviation of returns
# Threshold: drop_zscore <= -2 (i.e., a drop twice the typical volatility)
```
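One way to compute those inputs, as a sketch; the rolling window length (96 bars, one day of 15-minute data) and the helper name are assumptions of mine:

```python
import numpy as np
import pandas as pd

def drop_zscore(close: pd.Series, drop_pct: pd.Series, vol_window: int = 96) -> pd.Series:
    """Normalize raw drop sizes by the asset's own recent return volatility."""
    returns = close.pct_change()
    volatility = returns.rolling(vol_window).std()  # std. deviation of returns
    return drop_pct / volatility
```

Dividing by each asset's own volatility is what makes a fixed `-2` threshold meaningful across both quiet and volatile tokens.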
Feature Engineering
Features were derived from momentum, volatility, and price‑difference indicators. After preprocessing, the dataset was fed into XGBoost with hyperparameters recommended by Grok:
```
# Recommended hyperparameters
max_depth: 3-7           # prevents memorizing noise
learning_rate: 0.01-0.1  # smaller = better with more trees
n_estimators: 200-500    # use early stopping
subsample: 0.6-0.9
colsample_bytree: 0.6-0.9
scale_pos_weight: 3-10   # handles class imbalance
```
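For context on that last parameter: the conventional starting point for `scale_pos_weight` is the negative-to-positive ratio, which for the 1M non-drop / 500K drop split described earlier is 2, so Grok's 3-10 range weights the minority class more aggressively than the raw imbalance alone would suggest. A quick check:

```python
import numpy as np

# Labels matching the dataset described above: 500K drop signals, 1M non-drop samples.
y = np.concatenate([np.ones(500_000, dtype=int), np.zeros(1_000_000, dtype=int)])

# Conventional baseline for XGBoost's scale_pos_weight: count(neg) / count(pos).
baseline_spw = (y == 0).sum() / (y == 1).sum()
print(baseline_spw)  # 2.0
```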
Model Performance
Train Set

```
ROC-AUC Score: 0.6899

Classification Report
---------------------
              precision  recall  f1-score    support
No Signal          0.93    0.62      0.74  3,149,036
Signal             0.19    0.66      0.29    426,220

accuracy                            0.62  3,575,256
macro avg          0.56    0.64      0.52  3,575,256
weighted avg       0.84    0.62      0.69  3,575,256

Confusion Matrix
[[1,938,267  1,210,769]
 [  144,995    281,225]]
```
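The Signal-row figures follow directly from the confusion matrix (sklearn-style layout, `[[TN, FP], [FN, TP]]`); a quick sanity check:

```python
tn, fp = 1_938_267, 1_210_769  # No Signal: correctly ignored / wrongly flagged
fn, tp = 144_995, 281_225      # Signal: missed drops / correctly flagged drops

precision = tp / (tp + fp)  # of all flagged drops, how many were real
recall = tp / (tp + fn)     # of all real drops, how many were flagged
print(round(precision, 2), round(recall, 2))  # 0.19 0.66
```

The roughly 1.2M false positives behind that 0.19 precision are what later sink the backtest.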
Test Set (Unseen Data)

```
ROC-AUC Score: 0.6761

Classification Report
...

Train AUC:  0.6899
Test AUC:   0.6761
Difference: 0.0138
✓ Good generalization – minimal overfitting
```
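ROC-AUC itself reduces to the Mann-Whitney U statistic: the probability that a randomly chosen positive outscores a randomly chosen negative. A minimal rank-based implementation (it assumes no tied scores; in practice `sklearn.metrics.roc_auc_score` handles ties):

```python
import numpy as np

def roc_auc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """AUC via the Mann-Whitney U formulation (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

This view makes the ~0.68 score easy to interpret: a random drop gets a higher model score than a random non-drop only about 68% of the time.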
Feature Importance
The most influential feature for predicting drops was the basic return series. However, the model produced many false positives, which would be costly in a live portfolio.

Backtesting
I loaded the trained model into a backtesting simulation, defining position parameters such as TP, SL, delay, and cooldown. A grid‑search over 900 parameter combinations (≈3 hours on a local machine) was performed to find the optimal configuration.
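The search space itself is cheap to enumerate with `itertools.product`; the parameter values below are illustrative placeholders (not the ones actually used), chosen only to reproduce a 900-combination grid (5 × 6 × 6 × 5):

```python
import itertools

# Illustrative parameter grid; names mirror the post (TP, SL, delay, cooldown).
grid = {
    "tp": [0.01, 0.02, 0.03, 0.05, 0.08],        # take-profit levels
    "sl": [0.01, 0.02, 0.03, 0.05, 0.08, 0.10],  # stop-loss levels
    "delay": [0, 1, 2, 4, 8, 16],                # bars to wait before entering
    "cooldown": [0, 4, 8, 16, 32],               # bars between consecutive trades
}
combos = list(itertools.product(*grid.values()))
print(len(combos))  # 900
```

Each combination then drives one full backtest run, which is why the sweep took about three hours locally.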
Result: every scenario ended in a 100% loss. The backtest failed completely.
Reflections
Working step-by-step with Cursor + Sonnet 4.5 felt seamless for code generation; the assistant produced functional notebooks after minimal debugging. However, the Jupyter integration was cumbersome: each change required manually closing, reopening, and rerunning the notebook, so I eventually switched to Ask Mode and pasted code blocks in by hand.
The experiment demonstrated that, despite decent AUC scores, the classifier’s false‑positive rate and the inability to translate predictions into profitable trades render the approach ineffective in its current form. Further work would need to address signal precision, risk management, and more realistic market frictions.