SmartKNN Regression Benchmarks on High-Dimensional Datasets
Source: Dev.to
Overview
This release presents initial regression benchmarks for SmartKNN, evaluated on high‑dimensional datasets with a focus on single‑prediction p95 latency and R² under realistic production constraints (a minimal sketch of the latency measurement appears at the end of this overview). All benchmarks share the same setup:
- CPU‑only
- Single‑query inference
- Non‑parametric, nonlinear models
- Large‑scale datasets
Additional benchmarks (higher‑dimensional datasets, classification tasks, mixed feature spaces) will be released soon.
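For context on what the latency columns below mean, here is a minimal, illustrative sketch of how single‑prediction median and p95 latency can be measured on CPU. It is not the exact harness behind the published numbers; `model` is assumed to be any fitted estimator with a scikit‑learn‑style `predict`.

```python
import time
import numpy as np

def single_query_latency(model, X_test, n_queries=1000, seed=0):
    """Time one-row predict() calls and return (median_ms, p95_ms)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_test), size=n_queries)
    timings_ms = []
    for i in idx:
        row = X_test[i : i + 1]  # slicing keeps the 2-D shape expected by predict()
        start = time.perf_counter()
        model.predict(row)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(timings_ms)), float(np.percentile(timings_ms, 95))
```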
Datasets
| Dataset | OpenML ID | Approx. Rows | Features (D) | Task | Source |
|---|---|---|---|---|---|
| Buzzinsocialmedia_Twitter | 4549 | 466,600 | 77 | Regression | OpenML |
| Allstate_Claims_Severity | 44045 | 150,654 | 124 | Regression | OpenML |
| College Scorecard | 46674 | 99,759 | 118 | Regression | OpenML |
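Each dataset can be fetched directly from OpenML by the IDs listed above, for example via scikit‑learn. The train/test split and preprocessing behind the published numbers are not specified in this post, so treat the snippet below only as a starting point.

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch Allstate_Claims_Severity by the OpenML ID from the table above.
ds = fetch_openml(data_id=44045, as_frame=True)
X, y = ds.data, ds.target

# An 80/20 split with a fixed seed is an assumption; the split behind the
# published numbers is not specified in this post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```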
Benchmark Results
Buzzinsocialmedia_Twitter
| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Median (ms) | Single p95 (ms) |
|---|---|---|---|---|---|---|
| XGBoost | 254.43 | 0.8274 | 22.21 | 0.005 | 0.228 | 0.280 |
| LightGBM | 214.79 | 0.8770 | 25.67 | 0.008 | 0.511 | 0.650 |
| CatBoost | 231.43 | 0.8572 | 39.53 | 0.000 | 0.809 | 1.021 |
| SmartKNN (wt=0.0) | 167.15 | 0.9255 | 214.39 | 0.060 | 0.383 | 0.561 |
Allstate_Claims_Severity
| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Median (ms) | Single p95 (ms) |
|---|---|---|---|---|---|---|
| XGBoost | 0.5355 | 0.5604 | 11.20 | 0.005 | 0.211 | 0.272 |
| LightGBM | 0.5356 | 0.5603 | 8.40 | 0.020 | 0.511 | 0.630 |
| CatBoost | 0.5408 | 0.5516 | 22.84 | 0.043 | 1.035 | 1.308 |
| SmartKNN (wt=0.0) | 0.6219 | 0.4071 | 51.51 | 0.062 | 0.305 | 0.366 |
College Scorecard
| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Median (ms) | Single p95 (ms) |
|---|---|---|---|---|---|---|
| XGBoost | 0.1855 | 0.6935 | 8.36 | 0.006 | 0.237 | 0.329 |
| LightGBM | 0.1864 | 0.6905 | 5.77 | 0.010 | 0.505 | 0.635 |
| CatBoost | 0.1946 | 0.6626 | 14.25 | 0.001 | 0.879 | 0.950 |
| SmartKNN (wt=0.0) | 0.2300 | 0.5290 | 27.31 | 0.054 | 0.248 | 0.286 |
Key Findings
- SmartKNN achieves competitive p95 single‑prediction latency on CPU among non‑parametric, nonlinear models. On Buzzinsocialmedia_Twitter it delivers the highest R² of the four models while its p95 latency (0.561 ms) undercuts LightGBM (0.650 ms) and CatBoost (1.021 ms); only XGBoost is faster.
- Tree‑based models (XGBoost, LightGBM, CatBoost) generally provide better accuracy and lower average latency, but SmartKNN narrows the tail‑latency gap, which is often the dominant concern in production systems.
- All results are reproducible using the publicly available OpenML datasets.
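As a starting point for reproduction, the accuracy columns (RMSE, R²) can be recomputed for any baseline with standard scikit‑learn metrics. The snippet continues from the data‑loading sketch above and uses default XGBoost settings, which is an assumption; the tuned configurations behind the published numbers are not listed in this post.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

# Default hyperparameters are an assumption; the tuned settings behind the
# published numbers are not given here. Categorical columns, if any, need
# to be encoded to numeric values first.
model = XGBRegressor(n_jobs=-1)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = float(r2_score(y_test, pred))
print(f"RMSE={rmse:.4f}  R2={r2:.4f}")
```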
Community Involvement
We encourage the community to:
- Run these benchmarks on different hardware
- Test alternative ANN configurations
- Compare against additional models (a starter sketch is included at the end of this section)
- Share results publicly
If you encounter a performance regression, please open a GitHub Issue. For questions, ideas, or improvements, start a GitHub Discussion. New benchmark results can be posted as issues or discussions as well.
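For anyone comparing against additional models, any estimator with a scikit‑learn‑style interface can be dropped into the same harness. As an illustrative example only (not part of the published results), here is a plain k‑NN baseline reusing the `single_query_latency` helper sketched in the overview:

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A plain k-NN baseline; k=10 is arbitrary, and the pipeline assumes the
# features have already been encoded to numeric values.
baseline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
baseline.fit(X_train, y_train)

med_ms, p95_ms = single_query_latency(baseline, X_test)
print(f"single-query median={med_ms:.3f} ms  p95={p95_ms:.3f} ms")
```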
Resources
- Website: SmartKNN Documentation
- Code: SmartKNN Repository