Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents
Source: Hacker News
The Core Problem
You can’t manually QA an AI agent. When you ship a new prompt, swap a model, or add a tool, it’s hard to know whether the agent still behaves correctly across the thousands of ways users might interact with it. Most teams resort to manual spot‑checking (which doesn’t scale), waiting for user complaints (too late), or brittle scripted tests.
Our Solution: Simulation
Synthetic users interact with your agent the way real users do, and LLM‑based judges evaluate whether it responded correctly—across the full conversational arc, not just single turns.
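As a rough illustration of the idea (not Cekura's actual API — `run_simulation`, `toy_agent`, and `toy_judge` are made-up names), a simulation harness drives the agent with scripted synthetic-user turns and then judges the whole transcript:

```python
# Sketch of a simulation harness: a scripted synthetic user drives the
# agent turn by turn, then a judge scores the full transcript rather
# than individual turns. All names here are illustrative.

def run_simulation(agent_fn, user_turns, judge_fn):
    """Feed each synthetic-user turn to the agent, collect the
    transcript, then evaluate the whole conversational arc."""
    transcript = []
    for user_msg in user_turns:
        reply = agent_fn(user_msg, transcript)
        transcript.append({"user": user_msg, "agent": reply})
    # The judge sees the entire session, not one turn in isolation.
    return judge_fn(transcript), transcript

# Toy stand-ins for a real agent and an LLM-based judge:
def toy_agent(msg, history):
    return "Your balance is $100." if "balance" in msg else "How can I help?"

def toy_judge(transcript):
    # Pass only if the agent actually answered the balance question.
    return any("balance" in turn["agent"] for turn in transcript)

passed, transcript = run_simulation(
    toy_agent, ["hi", "what's my balance?"], toy_judge
)
```

In practice the judge is itself an LLM scoring the transcript against a rubric; the toy predicate above just stands in for that call.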
Scenario Generation + Real Conversation Import
- A scenario‑generation agent bootstraps your test suite from a description of your agent.
- Real users find paths no generator anticipates, so we also ingest production conversations and automatically extract test cases. Your coverage evolves as your users do.
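A minimal sketch of the import side, assuming a simple role-tagged transcript format (the field names and `extract_test_case` helper are assumptions for illustration): the user turns become the synthetic-user script, and the observed outcome becomes the expectation.

```python
# Turn a production conversation into a reusable test case: keep the
# user turns as the replayable script and record the observed outcome
# as the expected result. Transcript schema is an assumption.

def extract_test_case(conversation, expected_outcome):
    user_turns = [t["text"] for t in conversation if t["role"] == "user"]
    return {"user_turns": user_turns, "expected_outcome": expected_outcome}

prod_convo = [
    {"role": "user", "text": "I need to cancel my order"},
    {"role": "agent", "text": "Sure, what's the order number?"},
    {"role": "user", "text": "It's 4417"},
    {"role": "agent", "text": "Order 4417 is cancelled."},
]

case = extract_test_case(prod_convo, expected_outcome="order_cancelled")
```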
Mock Tool Platform
Agents often call external tools. Running simulations against real APIs is slow and flaky. Our mock tool platform lets you define tool schemas, behavior, and return values, allowing simulations to exercise tool selection and decision‑making without touching production systems.
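To make the mock-tool idea concrete, here is a minimal sketch (the `MockTool` class and registry shape are illustrative, not Cekura's schema): each mock declares the arguments it expects and a canned return value, so a simulation can verify the agent selected the right tool with the right arguments without hitting a live API.

```python
# Minimal mock-tool registry: the agent's tool calls resolve against
# declared schemas and canned return values instead of production APIs.
# Class and field names are illustrative.

class MockTool:
    def __init__(self, name, schema, return_value):
        self.name = name
        self.schema = schema            # expected argument names
        self.return_value = return_value

    def call(self, **kwargs):
        # Check the agent supplied every argument the schema declares.
        missing = set(self.schema) - set(kwargs)
        if missing:
            raise ValueError(f"missing args: {missing}")
        return self.return_value

registry = {
    "lookup_order": MockTool(
        "lookup_order",
        schema=["order_id"],
        return_value={"status": "shipped", "eta": "2024-06-01"},
    )
}

# During a simulation, the agent's tool call routes through the mock:
result = registry["lookup_order"].call(order_id="4417")
```

Because the return value is fixed, the same simulation run always exercises the same downstream branch of the agent's logic.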
Deterministic, Structured Test Cases
LLMs are stochastic, which makes naive CI tests flaky. Instead of free‑form prompts, our evaluators are defined as structured conditional action trees: explicit conditions that trigger specific responses, with support for fixed messages when word‑for‑word precision matters. This ensures the synthetic user behaves consistently across runs—same branching logic, same inputs—so a failure is a real regression, not noise.
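A tiny sketch of what a conditional action tree can look like (the tree format and `next_user_message` walker are assumptions, not Cekura's actual schema): each branch pairs an explicit condition on the agent's reply with a fixed synthetic-user response, so the same agent behavior always triggers the same next message.

```python
# Structured conditional action tree for a synthetic user: explicit
# condition -> action branches instead of a free-form prompt. The tree
# shape here is illustrative.

def next_user_message(agent_reply, tree):
    """Walk branches in order; the first matching condition wins.
    Fixed messages give word-for-word control over the user's reply."""
    for branch in tree["branches"]:
        if branch["condition"](agent_reply):
            return branch["fixed_message"]
    return tree["default_message"]

tree = {
    "branches": [
        {
            "condition": lambda r: "date of birth" in r.lower(),
            "fixed_message": "January 5, 1990",
        },
        {
            "condition": lambda r: "phone" in r.lower(),
            "fixed_message": "555-0100",
        },
    ],
    "default_message": "I'm not sure what you mean.",
}

msg = next_user_message("Can I get your date of birth?", tree)
```

Since the branching is deterministic, re-running the suite replays the exact same user behavior, which is what makes a red test a real regression.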
Live Agent Monitoring
Cekura also monitors live agent traffic. While tracing platforms like Langfuse or LangSmith are great for debugging individual LLM calls, conversational agents often fail across multiple turns. For example, a verification flow that requires name, date of birth, and phone number may skip a step; each individual turn looks fine, but the overall session is broken.
- Turn‑by‑turn evaluation (tracing platforms) checks each turn in isolation.
- Session‑level evaluation (Cekura) assesses the full transcript, flagging failures that only become visible when the whole conversation is considered.
Example
A banking agent receives a failed verification in step 1 but proceeds anyway. A turn‑based evaluator would mark step 3 (address confirmation) as green because the right question was asked. Cekura’s judge sees the entire session and flags it as failed because verification never succeeded.
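The banking example above can be sketched as a session-level check (event names like `verification_failed` and the `account_action:` prefix are illustrative): a turn-level evaluator would never see the cross-turn dependency, but a pass over the whole session catches it.

```python
# Session-level check for the banking example: each turn may look fine
# in isolation, but the session fails if the agent acted on the account
# after verification failed. Event names are illustrative.

def session_level_check(events):
    verified = False
    for event in events:
        if event == "verification_success":
            verified = True
        elif event == "verification_failed":
            verified = False
        elif event.startswith("account_action:") and not verified:
            return "fail"   # agent proceeded without a verified identity
    return "pass"

session = [
    "verification_failed",
    "account_action:confirm_address",  # a turn-level check marks this green
]
verdict = session_level_check(session)
```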
Try Cekura
- Website: https://www.cekura.ai
- Free trial: 7 days, no credit card required
- Paid plans: starting at $30 / month
Demo Video
Watch the product video: https://www.youtube.com/watch?v=n8FFKv1-nMw
- The first minute covers quick onboarding.
- Skip to 8:40 to see the results.
Community Discussion
Curious what the HN community is doing—how are you testing behavioral regressions in your agents? What failure modes have hurt you most? Feel free to share your experiences below.
Comments URL: https://news.ycombinator.com/item?id=47232903