SmartKNN Regression Benchmarks on High-Dimensional Datasets

Published: December 29, 2025 at 01:58 AM EST
1 min read
Source: Dev.to

Overview

This release presents initial regression benchmarks for SmartKNN, evaluated on high‑dimensional datasets with a focus on single‑prediction p95 latency and R² under real production constraints. All benchmarks are:

  • CPU‑only
  • Single‑query inference
  • Non‑parametric, nonlinear models
  • Large‑scale datasets
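
The post does not include the timing harness itself. Below is a minimal sketch, under the constraints above, of how single‑query median and p95 latency could be measured on CPU; the function name, warm‑up count, and query count are illustrative assumptions rather than the benchmark's actual code.

```python
import time
import numpy as np

def single_query_latency(model, X_test, n_warmup=100, n_queries=2000, seed=0):
    """Time one-row-at-a-time predictions and return (median_ms, p95_ms).

    `model` is any fitted regressor exposing a scikit-learn-style .predict();
    the warm-up and query counts are illustrative, not the ones used in the
    original benchmark.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_test), size=n_warmup + n_queries)

    # Warm-up: populate caches / lazy structures so they do not skew the tail.
    for i in idx[:n_warmup]:
        model.predict(X_test[i : i + 1])

    timings_ms = []
    for i in idx[n_warmup:]:
        row = X_test[i : i + 1]   # single-query inference: one row per call
        t0 = time.perf_counter()
        model.predict(row)
        timings_ms.append((time.perf_counter() - t0) * 1e3)

    return float(np.median(timings_ms)), float(np.percentile(timings_ms, 95))
```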

Additional benchmarks (higher‑dimensional datasets, classification tasks, mixed feature spaces) will be released soon.


Datasets

| Dataset | OpenML ID | Approx. Rows | Features (D) | Task | Source |
| --- | --- | --- | --- | --- | --- |
| Buzzinsocialmedia_Twitter | 4549 | 466,600 | 77 | Regression | OpenML |
| Allstate_Claims_Severity | 44045 | 150,654 | 124 | Regression | OpenML |
| College Scorecard | 46674 | 99,759 | 118 | Regression | OpenML |
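
All three datasets can be pulled by ID with scikit-learn's `fetch_openml`. The IDs in the sketch below are the ones listed in the table (reconstructed from the flattened source), so treat them as transcribed rather than independently verified.

```python
from sklearn.datasets import fetch_openml

# OpenML IDs as listed in the table above (transcribed from the post, not re-verified).
DATASETS = {
    "Buzzinsocialmedia_Twitter": 4549,
    "Allstate_Claims_Severity": 44045,
    "College Scorecard": 46674,
}

for name, data_id in DATASETS.items():
    # Downloads can be large; scikit-learn caches the result locally.
    bunch = fetch_openml(data_id=data_id, as_frame=True)
    X, y = bunch.data, bunch.target
    print(f"{name}: {X.shape[0]} rows, {X.shape[1]} features")
```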

Benchmark Results

Buzzinsocialmedia_Twitter

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Med (ms) | Single p95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 254.43 | 0.8274 | 22.21 | 0.005 | 0.228 | 0.280 |
| LightGBM | 214.79 | 0.8770 | 25.67 | 0.008 | 0.511 | 0.650 |
| CatBoost | 231.43 | 0.8572 | 39.53 | 0.000 | 0.809 | 1.021 |
| SmartKNN (wt=0.0) | 167.15 | 0.9255 | 214.39 | 0.060 | 0.383 | 0.561 |

Allstate_Claims_Severity

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Med (ms) | Single p95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 0.5355 | 0.5604 | 11.20 | 0.005 | 0.211 | 0.272 |
| LightGBM | 0.5356 | 0.5603 | 8.40 | 0.020 | 0.511 | 0.630 |
| CatBoost | 0.5408 | 0.5516 | 22.84 | 0.043 | 1.035 | 1.308 |
| SmartKNN (wt=0.0) | 0.6219 | 0.4071 | 51.51 | 0.062 | 0.305 | 0.366 |

College Scorecard

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Med (ms) | Single p95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 0.1855 | 0.6935 | 8.36 | 0.006 | 0.237 | 0.329 |
| LightGBM | 0.1864 | 0.6905 | 5.77 | 0.010 | 0.505 | 0.635 |
| CatBoost | 0.1946 | 0.6626 | 14.25 | 0.001 | 0.879 | 0.950 |
| SmartKNN (wt=0.0) | 0.2300 | 0.5290 | 27.31 | 0.054 | 0.248 | 0.286 |

Key Findings

  • SmartKNN achieves competitive p95 single‑prediction latency on CPU among non‑parametric, nonlinear models. On Buzzinsocialmedia_Twitter it delivers the highest R² of any model while beating LightGBM and CatBoost on single‑prediction p95 latency.
  • Tree‑based models (XGBoost, LightGBM, CatBoost) generally provide better accuracy and lower average latency, but SmartKNN narrows the tail‑latency gap, which is often the dominant concern in production systems.
  • All results are reproducible using the publicly available OpenML datasets.
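
To make the reproducibility claim concrete, here is a hedged sketch of how the tree‑based baselines could be refit and scored on one of these datasets. The train/test split, random seed, and library‑default hyperparameters are assumptions; the post does not document the settings actually used, so results will not match the tables exactly.

```python
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def benchmark_baselines(X, y, test_size=0.2, seed=42):
    """Fit the tree-based baselines with library defaults; report RMSE, R², train time.

    Assumes X is fully numeric (categorical columns would need encoding first).
    Split ratio, seed, and hyperparameters are illustrative assumptions.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
    models = {
        "XGBoost": XGBRegressor(n_jobs=-1, random_state=seed),
        "LightGBM": LGBMRegressor(n_jobs=-1, random_state=seed),
        "CatBoost": CatBoostRegressor(verbose=0, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        t0 = time.perf_counter()
        model.fit(X_tr, y_tr)          # CPU-only training with library defaults
        train_s = time.perf_counter() - t0
        pred = model.predict(X_te)
        rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
        results[name] = {"rmse": rmse, "r2": float(r2_score(y_te, pred)), "train_s": train_s}
    return results
```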

Community Involvement

We encourage the community to:

  • Run these benchmarks on different hardware
  • Test alternative ANN configurations
  • Compare against additional models
  • Share results publicly

If you encounter a performance regression, please open a GitHub Issue. For questions, ideas, or improvements, start a GitHub Discussion. New benchmark results can be posted as issues or discussions as well.
