EUNO.NEWS
  • All (20931) +237
  • AI (3154) +13
  • DevOps (932) +6
  • Software (11018) +167
  • IT (5778) +50
  • Education (48)
  • Notice
Sources Tags Search
  • 1 week ago · ai

    Why 90% Accuracy in Text-to-SQL is 100% Useless

    The eternal promise of self-service analytics...

    #text-to-sql #natural-language-processing #SQL #accuracy-metrics #self-service-analytics #LLM #AI-evaluation
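    The headline's point is easy to reproduce with back-of-the-envelope arithmetic: if per-query errors are independent, 90% accuracy compounds away quickly over a multi-query analytics session. A minimal sketch (the numbers are illustrative, not taken from the article):

```python
# Why 90% per-query accuracy can still be "useless": assuming independent
# errors, the chance that EVERY query in a session is correct decays fast.
# Figures below are illustrative, not from the article.
per_query_accuracy = 0.90

for n_queries in (1, 5, 10, 20):
    p_all_correct = per_query_accuracy ** n_queries
    print(f"{n_queries:2d} queries -> all correct {p_all_correct:.0%} of the time")
```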
  • 3 weeks ago · ai

    How to Build an AI Agent Evaluation Framework That Scales

    The Scaling Problem So, you've built a great AI agent. You've tested it with a few dozen examples, and it works perfectly. Now, you're ready to deploy it to pr...

    #AI evaluation #agent monitoring #scalable testing #automated scoring #LLM performance
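    The teaser above stops at the scaling problem, but the shape of such a framework is a batch of test cases run through the agent and scored automatically. A minimal sketch, with all names (EvalCase, run_agent, score_response) hypothetical stand-ins rather than the article's code:

```python
# Minimal sketch of an automated agent-evaluation loop. A real framework
# would swap the keyword scorer for an LLM judge or task-specific checks.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected_keywords: list[str]  # crude stand-in for a real rubric

def run_agent(prompt: str) -> str:
    """Placeholder for the agent under test."""
    return f"Refund policy: 30 days. You asked: {prompt}"

def score_response(response: str, case: EvalCase) -> float:
    """Fraction of expected keywords present in the agent's response."""
    hits = sum(kw.lower() in response.lower() for kw in case.expected_keywords)
    return hits / len(case.expected_keywords)

cases = [
    EvalCase("What is the refund window?", ["30 days", "refund"]),
    EvalCase("Can I return an opened item?", ["return", "policy"]),
]

scores = [score_response(run_agent(c.prompt), c) for c in cases]
print(f"mean score: {sum(scores) / len(scores):.2f} over {len(cases)} cases")
```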
  • Less than a month ago · ai

    When AI Learns to Admit Its Mistakes, Trust Becomes a Real Responsibility

    Introduction OpenAI’s latest research direction marks a significant evolution in how advanced AI systems are trained and evaluated, raising fundamental questio...

    #AI transparency #confession mechanism #OpenAI #model hallucination #responsible AI #AI evaluation
  • Less than a month ago · ai

    Running Evals on a Bloated RAG Pipeline

    Comparing metrics across datasets and models...

    #RAG #retrieval-augmented generation #model evaluation #pipeline performance #metrics #LLM #AI evaluation
  • Less than a month ago · ai

    Measuring AI Ability to Complete Long Tasks

    Article URL: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ Comments URL: https://news.ycombinator.com/item?id=46342166 Points: 1...

    #AI evaluation #long-context tasks #benchmarking #LLM performance #AI metrics
  • Less than a month ago · ai

    Measuring AI Ability to Complete Long Tasks: Opus 4.5 has a 50% horizon of 4h49m

    Article URL: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ Comments URL: https://news.ycombinator.com/item?id=46342166 Points: 3...

    #AI evaluation #long-context tasks #Opus 4.5 #task horizon #benchmarking
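    METR's "50% time horizon" comes from fitting a logistic curve of model success against the log of human task-completion time, then solving for the length where the curve crosses 0.5. A minimal sketch of that last step; the coefficients below are invented so the output lands near the 4h49m headline figure, and are not METR's actual fit:

```python
# Sketch of the 50% time-horizon computation: given a logistic fit
# P(success) = sigmoid(a + b * log2(minutes)), the horizon is the task
# length where P = 0.5. Coefficients are illustrative, not METR's.
import math

a, b = 7.77, -0.95  # hypothetical fitted intercept and slope

# P = 0.5 exactly where the linear term is zero: a + b * log2(t50) = 0
t50_minutes = 2 ** (-a / b)
print(f"50% horizon = {t50_minutes / 60:.1f} hours")

# Sanity check: success probability at that task length is 0.5
p = 1 / (1 + math.exp(-(a + b * math.log2(t50_minutes))))
print(f"P(success) at horizon: {p:.2f}")
```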
  • 1 month ago · ai

    AI agents fail 63% of the time on complex tasks. Patronus AI says its new 'living' training worlds can fix that.

    Patronus AI, the artificial intelligence evaluation startup backed by $20 million from investors including Lightspeed Venture Partners and Datadog, unveiled a n...

    #AI agents #reinforcement learning #training environments #synthetic worlds #Patronus AI #complex task performance #AI evaluation
  • 1 month ago · ai

    Auto-grading decade-old Hacker News discussions with hindsight

    Yesterday I stumbled on this HN thread — Show HN: Gemini Pro 3 hallucinates the...

    #LLM #auto-grading #Hacker News #ChatGPT #Gemini #retrospective analysis #AI evaluation
  • 1 month ago · ai

    How to Use System Prompts as Ground Truth for Evaluation

    The Problem: Lack of Clear Ground Truth Most teams struggle to evaluate their AI agents because they don’t have a well‑defined ground truth. Typical workflow:...

    #system prompts #ground truth #AI evaluation #prompt engineering #LLM evaluation #evaluation metrics
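    The workflow is truncated above, but one plausible reading of the idea is to treat each rule in the agent's system prompt as a checkable assertion and grade transcripts against it. A minimal sketch, assuming that reading; the rule extraction and trivial keyword checks stand in for what would really be LLM-judge calls:

```python
# "System prompt as ground truth": each rule in the prompt becomes one
# pass/fail check over a transcript. All names and checks are hypothetical.
SYSTEM_PROMPT = """\
You are a support agent.
- Always greet the customer by name.
- Never promise refunds over $100.
"""

def extract_rules(system_prompt: str) -> list[str]:
    """Naively treat each bullet line as one evaluable rule."""
    return [ln.lstrip("- ").strip() for ln in system_prompt.splitlines()
            if ln.lstrip().startswith("-")]

def judge(rule: str, transcript: str) -> bool:
    """Stand-in for an LLM judge; here, trivial keyword checks."""
    if "greet" in rule:
        return transcript.lower().startswith("hi ")
    if "refund" in rule:
        return "$500 refund" not in transcript
    return True

transcript = "Hi Dana, I can offer a $50 refund for the damaged item."
results = {rule: judge(rule, transcript) for rule in extract_rules(SYSTEM_PROMPT)}
print(results)  # each system-prompt rule becomes a pass/fail ground truth
```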
  • 1 month ago · ai

    Gemini 3 Pro scores 69% trust in blinded testing up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks

    Just a few short weeks ago, Google debuted its Gemini 3 model, claiming it scored a leadership position in multiple AI benchmarks. But the challenge with vendor...

    #Gemini 3 #trustworthiness #AI evaluation #benchmarking #large language models #Google AI #Prolific study
  • 1 month ago · ai

    I Drop a Test, 5 Out of 6 SOTA LLMs Drop Their Pants Off

    The Hypothesis I've been researching what makes an entity 'deeply' intelligent—not just smart or capable, but understanding reality in a way that transcends pa...

    #LLM #prompt engineering #AI evaluation #persona prompting #sales pitch test #analogical reasoning
EUNO.NEWS
RSS GitHub © 2026