Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents

Published: March 3, 2026 at 09:30 AM EST

Source: Hacker News

The Core Problem

You can’t manually QA an AI agent. When you ship a new prompt, swap a model, or add a tool, it’s hard to know whether the agent still behaves correctly across the thousands of ways users might interact with it. Most teams resort to manual spot‑checking (which doesn’t scale), waiting for user complaints (too late), or brittle scripted tests.

Our Solution: Simulation

Synthetic users interact with your agent the way real users do, and LLM‑based judges evaluate whether it responded correctly—across the full conversational arc, not just single turns.
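The loop described above can be sketched in a few lines. This is a hypothetical, deterministic stand-in, not Cekura's implementation: in a real system both the synthetic user and the judge would be LLM-driven, and `agent_reply` would be the agent under test.

```python
# Hypothetical sketch of a simulation run: scripted synthetic-user turns
# drive the agent, then a judge evaluates the *whole* transcript rather
# than individual turns. All names here are illustrative.

def agent_reply(message: str) -> str:
    # Stand-in for the agent under test.
    if "refund" in message.lower():
        return "I can help with that. What is your order number?"
    return "How can I help you today?"

def run_simulation(user_turns):
    """Play the synthetic user's turns against the agent, collecting a transcript."""
    transcript = []
    for turn in user_turns:
        transcript.append(("user", turn))
        transcript.append(("agent", agent_reply(turn)))
    return transcript

def judge(transcript) -> bool:
    # Session-level check: did the agent ever ask for the order number?
    return any(
        "order number" in msg.lower()
        for role, msg in transcript
        if role == "agent"
    )

transcript = run_simulation(["Hi", "I want a refund"])
print(judge(transcript))  # True
```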

Scenario Generation + Real Conversation Import

  • A scenario‑generation agent bootstraps your test suite from a description of your agent.
  • Real users find paths no generator anticipates, so we also ingest production conversations and automatically extract test cases. Your coverage evolves as your users do.

Mock Tool Platform

Agents often call external tools. Running simulations against real APIs is slow and flaky. Our mock tool platform lets you define tool schemas, behavior, and return values, allowing simulations to exercise tool selection and decision‑making without touching production systems.
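A minimal sketch of the idea, with hypothetical names (this is not Cekura's API): tools are registered with a declared schema and a canned handler, so a simulation can exercise tool selection and argument validation without any network calls.

```python
# Hypothetical mock-tool registry: define a tool's schema, behavior, and
# return values so simulations never touch production systems.

class MockToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, schema, handler):
        """Declare a tool: its argument schema and a fake handler."""
        self._tools[name] = {"schema": schema, "handler": handler}

    def call(self, name, **kwargs):
        tool = self._tools[name]
        # Validate arguments against the declared schema before "calling".
        missing = [p for p in tool["schema"]["required"] if p not in kwargs]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        return tool["handler"](**kwargs)

registry = MockToolRegistry()
registry.register(
    "lookup_order",
    schema={"required": ["order_id"]},
    handler=lambda order_id: {"order_id": order_id, "status": "shipped"},
)
print(registry.call("lookup_order", order_id="A123")["status"])  # shipped
```

Because the handler is a plain function, mock behavior can be swapped per test case, e.g. to simulate timeouts or error responses.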

Deterministic, Structured Test Cases

LLMs are stochastic, which makes for flaky CI tests. Instead of free‑form prompts, our evaluators are defined as structured conditional action trees: explicit conditions that trigger specific responses, with support for fixed messages when word‑for‑word precision matters. This ensures the synthetic user behaves consistently across runs—same branching logic, same inputs—so a failure is a real regression, not noise.
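One way to picture a conditional action tree is as explicit (condition → response) branches with a fixed fallback. The sketch below is illustrative only; the node structure and names are assumptions, not Cekura's schema.

```python
# Hypothetical conditional action tree: the synthetic user's next reply is
# chosen by explicit predicates on the agent's message, so every run takes
# the same branch for the same input.

def make_node(branches, default):
    """branches: list of (predicate, response); default: fallback response."""
    def node(agent_message: str):
        for predicate, response in branches:
            if predicate(agent_message):
                return response
        return default
    return node

# Fixed messages where word-for-word precision matters (e.g. verification data).
verify_node = make_node(
    branches=[
        (lambda m: "date of birth" in m.lower(), "01/15/1990"),
        (lambda m: "phone" in m.lower(), "555-0100"),
    ],
    default="My name is Jane Doe.",
)

print(verify_node("Can you confirm your date of birth?"))  # 01/15/1990
print(verify_node("What is your name?"))                   # My name is Jane Doe.
```

Because the branching is pure and input-driven, re-running a failed test reproduces the exact same conversation path.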

Live Agent Monitoring

Cekura also monitors live agent traffic. While tracing platforms like Langfuse or LangSmith are great for debugging individual LLM calls, conversational agents often fail across multiple turns. For example, a verification flow that requires name, date of birth, and phone number may skip a step; each individual turn looks fine, but the overall session is broken.

  • Turn‑by‑turn evaluation (tracing platforms) checks each turn in isolation.
  • Session‑level evaluation (Cekura) assesses the full transcript, flagging failures that only become visible when the whole conversation is considered.

Example

A banking agent receives a failed verification in step 1 but proceeds anyway. A turn‑based evaluator would mark step 3 (address confirmation) as green because the right question was asked. Cekura’s judge sees the entire session and flags it as failed because verification never succeeded.
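The distinction in this example can be made concrete with a toy transcript. This is a deterministic illustration of the turn-level vs. session-level contrast, not how Cekura's judge is implemented:

```python
# Hypothetical illustration: each agent turn looks fine in isolation, but
# the session fails because the agent proceeded after failed verification.

transcript = [
    ("agent", "Please provide your PIN."),
    ("user", "1234"),
    ("agent", "Verification failed."),            # step 1 fails
    ("agent", "Can you confirm your address?"),   # agent proceeds anyway
    ("user", "42 Main St"),
]

def turn_level_ok(transcript):
    # Turn-by-turn view: every agent turn, in isolation, is well-formed.
    return all(msg.endswith((".", "?")) for role, msg in transcript if role == "agent")

def session_level_ok(transcript):
    # Session view: no step may follow a failed verification.
    failed = False
    for role, msg in transcript:
        if role != "agent":
            continue
        if "verification failed" in msg.lower():
            failed = True
        elif failed:
            return False  # agent continued after verification failed
    return True

print(turn_level_ok(transcript))     # True  (each turn looks green)
print(session_level_ok(transcript))  # False (the whole session is broken)
```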

Try Cekura

  • Website: https://www.cekura.ai
  • Free trial: 7‑day free trial, no credit card required
  • Paid plans: starting at $30 / month

Demo Video

Watch the product video: https://www.youtube.com/watch?v=n8FFKv1-nMw

  • The first minute covers quick onboarding.
  • Skip to 8:40 to see the results.

Community Discussion

Curious what the HN community is doing—how are you testing behavioral regressions in your agents? What failure modes have hurt you most? Feel free to share your experiences below.

Comments URL: https://news.ycombinator.com/item?id=47232903

