model evaluation

2 days ago · ai

Why Reinforcement Learning Plateaus Without Representation Depth (and Other Key Takeaways from NeurIPS 2025)

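Every year, NeurIPS produces hundreds of impressive papers, and a handful of them subtly reshape how practitioners think about scaling, evaluation, and system design…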

#reinforcement learning #representation depth #NeurIPS 2025 #scaling laws #model evaluation #system design #machine learning research
4 days ago · ai

Introducing Community Benchmarks on Kaggle

#Kaggle #community benchmarks #model evaluation #AI research #machine learning #benchmarking #datasets #AI community
1 week ago · ai

Your Model Choice Doesn't Matter as Much as You Think… and That's Actually Good News

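Introduction: I came across this study on Twitter and couldn't stop thinking about it. In 2009, neuroscientists put a dead Atlantic salmon into an fMRI scanner, …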

#model evaluation #LLM benchmarks #null models #AlpacaEval #machine learning reproducibility #baseline comparisons
1 week ago · ai

Measuring What Matters with NeMo Agent Toolkit

A practical guide to observability, evaluation, and model comparison. The post "Measuring What Matters with NeMo Agent Toolkit" first appeared on Towards Data Science.

#NeMo #AI agents #model evaluation #observability #NVIDIA
1 week ago · ai

Artificial Analysis Overhauls Its AI Intelligence Index, Replacing Popular Benchmarks with "Real-World" Tests

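The arms race to build smarter AI models has a measurement problem: the tests used to rank them are becoming obsolete almost as fast as the models improve. O...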

#AI benchmarking #Artificial Analysis #Intelligence Index #real‑world tests #model evaluation #AI metrics
2 weeks ago · ai

The Sustainable AI Benchmarks Developers Will Be Asked About in 2026

#sustainable AI #AI benchmarks #model evaluation #AI ethics #carbon footprint #AI development #2026 trends
3 weeks ago · ai

Data Leakage in Machine Learning

Data Leakage in Machine Learning. Mentees often make a basic mistake in the machine learning workflow: Exploratory Data Analysis (EDA) → preprocessing…

#data leakage #machine learning #train-test contamination #data preprocessing #standardization #model evaluation
3 weeks ago · ai

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning

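Model Evaluation: start with basic model evaluation, a quick test of whether a model is honest or merely lucky. When data is scarce, use methods designed specifically for…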

#model evaluation #model selection #algorithm selection #cross-validation #bootstrap #small datasets #machine learning
3 weeks ago · ai

On Evaluating Adversarial Robustness

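Why Some AI Defenses Fail: A Brief Look at Testing and Security. People build systems that learn from data, but small, tricky changes can cause them to fail. Research…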

#adversarial attacks #robustness #AI safety #model evaluation #security testing #best practices
3 weeks ago · ai

ML Models: Why Your Predictions Are Good… Until They Aren't

#machine learning #feature engineering #ML pipelines #model evaluation #business metrics #data science #production ML #model monitoring
3 weeks ago · ai

Can Eval Setup Be Scaffolded Automatically?

Why eval feels painful and why it always gets skipped 🔥 Eval is supposed to keep you safe, but setting it up often feels like punishment: - You copy prompts into…

#model evaluation #AI testing #prompt engineering #automation #scaffolding #metrics #LLM #evaluation pipelines
0 months ago · ai

Running Evals on a Bloated RAG Pipeline

Comparing metrics across datasets and models. The post "Running Evals on a Bloated RAG Pipeline" first appeared on Towards Data Science…

#RAG #retrieval-augmented generation #model evaluation #pipeline performance #metrics #LLM #AI evaluation
