AI evaluation

1 week ago · ai

Why 90% Accuracy in Text-to-SQL is 100% Useless

The eternal promise of self-service analytics. The post "Why 90% Accuracy in Text-to-SQL is 100% Useless" first appeared on Towards Data Science....

#text-to-sql #natural-language-processing #SQL #accuracy-metrics #self-service-analytics #LLM #AI-evaluation
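3 weeks ago · ai

How to Build a Scalable AI Agent Evaluation Framework

The scaling problem: So, you've built a great AI agent. You tested it on a few dozen examples and the results were flawless. Now you're ready to deploy it to production...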

#AI evaluation #agent monitoring #scalable testing #automated scoring #LLM performance
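0 months ago · ai

When AI Learns to Admit Its Mistakes, Trust Becomes the Real Responsibility

Introduction: OpenAI's latest research direction marks a significant evolution in how advanced AI systems are trained and evaluated, raising fundamental questions...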

#AI transparency #confession mechanism #OpenAI #model hallucination #responsible AI #AI evaluation
0 months ago · ai

Running Evals on a Bloated RAG Pipeline

Comparing metrics across different datasets and models. The post "Running Evals on a Bloated RAG Pipeline" first appeared on Towards Data Science...

#RAG #retrieval-augmented generation #model evaluation #pipeline performance #metrics #LLM #AI evaluation
0 months ago · ai

Measuring AI Ability to Complete Long Tasks

#AI evaluation #long-context tasks #benchmarking #LLM performance #AI metrics
0 months ago · ai

Measuring AI Ability to Complete Long Tasks: Opus 4.5 Has a 50% Time Horizon of 4h49m

#AI evaluation #long-context tasks #Opus 4.5 #task horizon #benchmarking
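1 month ago · ai

AI agents fail 63% of the time on complex tasks. Patronus AI says its new "living" training worlds can fix that.

Patronus AI, the AI evaluation startup backed by $20 million in funding from investors including Lightspeed Venture Partners and Datadog, has launched a…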

#AI agents #reinforcement learning #training environments #synthetic worlds #Patronus AI #complex task performance #AI evaluation
1 month ago · ai

Auto-grading decade-old Hacker News discussions with hindsight

Yesterday I stumbled on this HN thread: Show HN: Gemini Pro 3 hallucinates...

#LLM #auto-grading #Hacker News #ChatGPT #Gemini #retrospective analysis #AI evaluation
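1 month ago · ai

How to Use System Prompts as Ground Truth for Evaluation

The problem: no clear ground truth. Most teams struggle to evaluate their AI agents because they have no clearly defined ground truth. Typical workflow:...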

#system prompts #ground truth #AI evaluation #prompt engineering #LLM evaluation #evaluation metrics
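1 month ago · ai

Gemini 3 Pro scores 69% trust in blinded testing, up from 16% for Gemini 2.5: the case for evaluating AI on real-world trust, not academic benchmarks

Just a few weeks ago, Google launched its Gemini 3 models, claiming a leading position across multiple AI benchmarks. But the challenge vendors face is...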

#Gemini 3 #trustworthiness #AI evaluation #benchmarking #large language models #Google AI #Prolific study
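1 month ago · ai

I ran a test, and 5 out of 6 SOTA LLMs fell flat on their faces

The hypothesis I've been working on concerns what makes an entity "deeply" intelligent: not just clever or capable, but understanding reality in a way that goes beyond pa…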

#LLM #prompt engineering #AI evaluation #persona prompting #sales pitch test #analogical reasoning