SmartKNN Regression Benchmarks on High-Dimensional Datasets

Published: December 29, 2025 at 01:58 AM EST
1 min read
Source: Dev.to

Overview

This release presents initial regression benchmarks for SmartKNN, evaluated on high‑dimensional datasets with a focus on single‑prediction p95 latency and R² under real production constraints. All benchmarks are:

  • CPU‑only
  • Single‑query inference
  • Non‑parametric, nonlinear models
  • Large‑scale datasets
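
The post does not include the timing harness itself. Below is a minimal sketch, under the constraints above, of how single‑query median and p95 latency could be measured on CPU; the function name, warm‑up count, and query count are illustrative assumptions rather than the benchmark's actual code.

```python
import time
import numpy as np

def single_query_latency(model, X_test, n_warmup=100, n_queries=2000, seed=0):
    """Time one-row-at-a-time predictions and return (median_ms, p95_ms).

    `model` is any fitted regressor exposing a scikit-learn-style .predict();
    the warm-up and query counts are illustrative, not the ones used in the
    original benchmark.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_test), size=n_warmup + n_queries)

    # Warm-up: populate caches / lazy structures so they do not skew the tail.
    for i in idx[:n_warmup]:
        model.predict(X_test[i : i + 1])

    timings_ms = []
    for i in idx[n_warmup:]:
        row = X_test[i : i + 1]   # single-query inference: one row per call
        t0 = time.perf_counter()
        model.predict(row)
        timings_ms.append((time.perf_counter() - t0) * 1e3)

    return float(np.median(timings_ms)), float(np.percentile(timings_ms, 95))
```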

Additional benchmarks (higher‑dimensional datasets, classification tasks, mixed feature spaces) will be released soon.


Datasets

| Dataset | OpenML ID | Approx. Rows | Features (D) | Task | Source |
| --- | --- | --- | --- | --- | --- |
| Buzzinsocialmedia_Twitter | 4549 | 466,600 | 77 | Regression | OpenML |
| Allstate_Claims_Severity | 44045 | 150,654 | 124 | Regression | OpenML |
| College Scorecard | 46674 | 99,759 | 118 | Regression | OpenML |
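
All three datasets can be pulled by ID with scikit-learn's `fetch_openml`. The IDs in the sketch below are the ones listed in the table (reconstructed from the flattened source), so treat them as transcribed rather than independently verified.

```python
from sklearn.datasets import fetch_openml

# OpenML IDs as listed in the table above (transcribed from the post, not re-verified).
DATASETS = {
    "Buzzinsocialmedia_Twitter": 4549,
    "Allstate_Claims_Severity": 44045,
    "College Scorecard": 46674,
}

for name, data_id in DATASETS.items():
    # Downloads can be large; scikit-learn caches the result locally.
    bunch = fetch_openml(data_id=data_id, as_frame=True)
    X, y = bunch.data, bunch.target
    print(f"{name}: {X.shape[0]} rows, {X.shape[1]} features")
```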

Benchmark Results

Buzzinsocialmedia_Twitter

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Med (ms) | Single p95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 254.43 | 0.8274 | 22.21 | 0.005 | 0.228 | 0.280 |
| LightGBM | 214.79 | 0.8770 | 25.67 | 0.008 | 0.511 | 0.650 |
| CatBoost | 231.43 | 0.8572 | 39.53 | 0.000 | 0.809 | 1.021 |
| SmartKNN (wt=0.0) | 167.15 | 0.9255 | 214.39 | 0.060 | 0.383 | 0.561 |

Allstate_Claims_Severity

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Med (ms) | Single p95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 0.5355 | 0.5604 | 11.20 | 0.005 | 0.211 | 0.272 |
| LightGBM | 0.5356 | 0.5603 | 8.40 | 0.020 | 0.511 | 0.630 |
| CatBoost | 0.5408 | 0.5516 | 22.84 | 0.043 | 1.035 | 1.308 |
| SmartKNN (wt=0.0) | 0.6219 | 0.4071 | 51.51 | 0.062 | 0.305 | 0.366 |

College Scorecard

| Model | RMSE ↓ | R² ↑ | Train (s) | Batch (ms) | Single Med (ms) | Single p95 (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 0.1855 | 0.6935 | 8.36 | 0.006 | 0.237 | 0.329 |
| LightGBM | 0.1864 | 0.6905 | 5.77 | 0.010 | 0.505 | 0.635 |
| CatBoost | 0.1946 | 0.6626 | 14.25 | 0.001 | 0.879 | 0.950 |
| SmartKNN (wt=0.0) | 0.2300 | 0.5290 | 27.31 | 0.054 | 0.248 | 0.286 |

Key Findings

  • SmartKNN achieves competitive p95 single‑prediction latency on CPU among non‑parametric, nonlinear models. On Buzzinsocialmedia_Twitter it delivers the highest R² of any model while beating LightGBM and CatBoost on single‑prediction p95 latency.
  • Tree‑based models (XGBoost, LightGBM, CatBoost) generally provide better accuracy and lower average latency, but SmartKNN narrows the tail‑latency gap, which is often the dominant concern in production systems.
  • All results are reproducible using the publicly available OpenML datasets.
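
To make the reproducibility claim concrete, here is a hedged sketch of how the tree‑based baselines could be refit and scored on one of these datasets. The train/test split, random seed, and library‑default hyperparameters are assumptions; the post does not document the settings actually used, so results will not match the tables exactly.

```python
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def benchmark_baselines(X, y, test_size=0.2, seed=42):
    """Fit the tree-based baselines with library defaults; report RMSE, R², train time.

    Assumes X is fully numeric (categorical columns would need encoding first).
    Split ratio, seed, and hyperparameters are illustrative assumptions.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
    models = {
        "XGBoost": XGBRegressor(n_jobs=-1, random_state=seed),
        "LightGBM": LGBMRegressor(n_jobs=-1, random_state=seed),
        "CatBoost": CatBoostRegressor(verbose=0, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        t0 = time.perf_counter()
        model.fit(X_tr, y_tr)          # CPU-only training with library defaults
        train_s = time.perf_counter() - t0
        pred = model.predict(X_te)
        rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
        results[name] = {"rmse": rmse, "r2": float(r2_score(y_te, pred)), "train_s": train_s}
    return results
```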

Community Involvement

We encourage the community to:

  • Run these benchmarks on different hardware
  • Test alternative ANN configurations
  • Compare against additional models
  • Share results publicly

If you encounter a performance regression, please open a GitHub Issue. For questions, ideas, or improvements, start a GitHub Discussion. New benchmark results can be posted as issues or discussions as well.
