LLM evaluation

2 days ago · ai

A Geometric Method for Detecting Hallucinations Without an LLM Judge

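Imagine a flock of birds in flight. There is no leader, no central command. Each bird aligns with its neighbors: matching direction, adjusting speed, maintaining …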

#hallucination detection #LLM evaluation #geometric method #AI safety #natural language processing
4 days ago · ai

No, AI Can't Program. Anyone Who Says Otherwise Is Just Selling Smoke.

An indictment of AI hype in programming > A few weeks ago, after seeing yet another "expert" claim that "Gemini 3 Pro revolutionizes …"

#AI code generation #LLM evaluation #software development #programming hype #code automation
1 month ago · ai

How to Evaluate LLM Prompts with Synthetic Data: A Step-by-Step Guide

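Overview: Deploying large language models (LLMs) in production has shifted the software engineering bottleneck from code syntax to data quality. - In t...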

#synthetic data #LLM evaluation #prompt engineering #generative AI #RAG #hallucination mitigation #AI testing
1 month ago · ai

LLM Evaluation Guide: When to Add Online Evals to Your AI Application

Original: https://launchdarkly.com/docs/tutorials/when-to-add-online-evals – Published November 13, 2025

#LLM evaluation #online evals #AI monitoring #quality scoring #LLM-as-a-judge #LaunchDarkly #production traffic #AI Configs
1 month ago · ai

A Low-Code LLM Evaluation Framework (n8n): A Guide to Automated Testing

Introduction: In today's fast-paced technology landscape, ensuring the quality, accuracy, and consistency of language models is more important than ever. At t...

#low-code #n8n #LLM evaluation #automation #AI testing #workflow automation #quality assurance
1 month ago · ai

How to Use System Prompts as Ground Truth for Evaluation

Problem: No clear ground truth. Most teams struggle to evaluate their AI agents because they have no clearly defined ground truth. Typical workflow: ...

#system prompts #ground truth #AI evaluation #prompt engineering #LLM evaluation #evaluation metrics
1 month ago · ai

Binary Weighted Evaluation... How To

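1. What is binary weighted evaluation? At a high level: - Define a set of binary criteria for the task. Each criterion is a question that can be answered with …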

#LLM evaluation #binary weighted evaluation #agent testing #AI metrics #prompt engineering
1 month ago · ai

[Paper] EvilGenie: A Reward Hacking Benchmark

We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents …

#reward hacking #code generation #benchmark #LLM evaluation #AI safety