Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents
Source: Hacker News
The Core Problem
You can’t manually QA an AI agent. When you ship a new prompt, swap a model, or add a tool, it’s hard to know whether the agent still behaves correctly across the thousands of ways users might interact with it. Most teams resort to manual spot‑checking (which doesn’t scale), waiting for user complaints (too late), or brittle scripted tests.
Our Solution: Simulation
Synthetic users interact with your agent the way real users do, and LLM‑based judges evaluate whether it responded correctly—across the full conversational arc, not just single turns.
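As a rough illustration of the idea (not Cekura's actual API — `run_simulation`, `toy_agent`, and `toy_judge` are made-up names), a simulation harness drives the agent with scripted synthetic-user turns and then judges the whole transcript:

```python
# Sketch of a simulation harness: a scripted synthetic user drives the
# agent turn by turn, then a judge scores the full transcript rather
# than individual turns. All names here are illustrative.

def run_simulation(agent_fn, user_turns, judge_fn):
    """Feed each synthetic-user turn to the agent, collect the
    transcript, then evaluate the whole conversational arc."""
    transcript = []
    for user_msg in user_turns:
        reply = agent_fn(user_msg, transcript)
        transcript.append({"user": user_msg, "agent": reply})
    # The judge sees the entire session, not one turn in isolation.
    return judge_fn(transcript), transcript

# Toy stand-ins for a real agent and an LLM-based judge:
def toy_agent(msg, history):
    return "Your balance is $100." if "balance" in msg else "How can I help?"

def toy_judge(transcript):
    # Pass only if the agent actually answered the balance question.
    return any("balance" in turn["agent"] for turn in transcript)

passed, transcript = run_simulation(
    toy_agent, ["hi", "what's my balance?"], toy_judge
)
```

In practice the judge is itself an LLM scoring the transcript against a rubric; the toy predicate above just stands in for that call.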
Scenario Generation + Real Conversation Import
- A scenario‑generation agent bootstraps your test suite from a description of your agent.
- Real users find paths no generator anticipates, so we also ingest production conversations and automatically extract test cases. Your coverage evolves as your users do.
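A minimal sketch of the import side, assuming a simple role-tagged transcript format (the field names and `extract_test_case` helper are assumptions for illustration): the user turns become the synthetic-user script, and the observed outcome becomes the expectation.

```python
# Turn a production conversation into a reusable test case: keep the
# user turns as the replayable script and record the observed outcome
# as the expected result. Transcript schema is an assumption.

def extract_test_case(conversation, expected_outcome):
    user_turns = [t["text"] for t in conversation if t["role"] == "user"]
    return {"user_turns": user_turns, "expected_outcome": expected_outcome}

prod_convo = [
    {"role": "user", "text": "I need to cancel my order"},
    {"role": "agent", "text": "Sure, what's the order number?"},
    {"role": "user", "text": "It's 4417"},
    {"role": "agent", "text": "Order 4417 is cancelled."},
]

case = extract_test_case(prod_convo, expected_outcome="order_cancelled")
```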
Mock Tool Platform
Agents often call external tools. Running simulations against real APIs is slow and flaky. Our mock tool platform lets you define tool schemas, behavior, and return values, allowing simulations to exercise tool selection and decision‑making without touching production systems.
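To make the mock-tool idea concrete, here is a minimal sketch (the `MockTool` class and registry shape are illustrative, not Cekura's schema): each mock declares the arguments it expects and a canned return value, so a simulation can verify the agent selected the right tool with the right arguments without hitting a live API.

```python
# Minimal mock-tool registry: the agent's tool calls resolve against
# declared schemas and canned return values instead of production APIs.
# Class and field names are illustrative.

class MockTool:
    def __init__(self, name, schema, return_value):
        self.name = name
        self.schema = schema            # expected argument names
        self.return_value = return_value

    def call(self, **kwargs):
        # Check the agent supplied every argument the schema declares.
        missing = set(self.schema) - set(kwargs)
        if missing:
            raise ValueError(f"missing args: {missing}")
        return self.return_value

registry = {
    "lookup_order": MockTool(
        "lookup_order",
        schema=["order_id"],
        return_value={"status": "shipped", "eta": "2024-06-01"},
    )
}

# During a simulation, the agent's tool call routes through the mock:
result = registry["lookup_order"].call(order_id="4417")
```

Because the return value is fixed, the same simulation run always exercises the same downstream branch of the agent's logic.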
Deterministic, Structured Test Cases
LLMs are stochastic, which makes naive CI tests flaky. Instead of free‑form prompts, our evaluators are defined as structured conditional action trees: explicit conditions that trigger specific responses, with support for fixed messages when word‑for‑word precision matters. This ensures the synthetic user behaves consistently across runs—same branching logic, same inputs—so a failure is a real regression, not noise.
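A tiny sketch of what a conditional action tree can look like (the tree format and `next_user_message` walker are assumptions, not Cekura's actual schema): each branch pairs an explicit condition on the agent's reply with a fixed synthetic-user response, so the same agent behavior always triggers the same next message.

```python
# Structured conditional action tree for a synthetic user: explicit
# condition -> action branches instead of a free-form prompt. The tree
# shape here is illustrative.

def next_user_message(agent_reply, tree):
    """Walk branches in order; the first matching condition wins.
    Fixed messages give word-for-word control over the user's reply."""
    for branch in tree["branches"]:
        if branch["condition"](agent_reply):
            return branch["fixed_message"]
    return tree["default_message"]

tree = {
    "branches": [
        {
            "condition": lambda r: "date of birth" in r.lower(),
            "fixed_message": "January 5, 1990",
        },
        {
            "condition": lambda r: "phone" in r.lower(),
            "fixed_message": "555-0100",
        },
    ],
    "default_message": "I'm not sure what you mean.",
}

msg = next_user_message("Can I get your date of birth?", tree)
```

Since the branching is deterministic, re-running the suite replays the exact same user behavior, which is what makes a red test a real regression.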
Live Agent Monitoring
Cekura also monitors live agent traffic. While tracing platforms like Langfuse or LangSmith are great for debugging individual LLM calls, conversational agents often fail across multiple turns. For example, a verification flow that requires name, date of birth, and phone number may skip a step; each individual turn looks fine, but the overall session is broken.
- Turn‑by‑turn evaluation (tracing platforms) checks each turn in isolation.
- Session‑level evaluation (Cekura) assesses the full transcript, flagging failures that only become visible when the whole conversation is considered.
Example
A banking agent receives a failed verification in step 1 but proceeds anyway. A turn‑based evaluator would mark step 3 (address confirmation) as green because the right question was asked. Cekura’s judge sees the entire session and flags it as failed because verification never succeeded.
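The banking example above can be sketched as a session-level check (event names like `verification_failed` and the `account_action:` prefix are illustrative): a turn-level evaluator would never see the cross-turn dependency, but a pass over the whole session catches it.

```python
# Session-level check for the banking example: each turn may look fine
# in isolation, but the session fails if the agent acted on the account
# after verification failed. Event names are illustrative.

def session_level_check(events):
    verified = False
    for event in events:
        if event == "verification_success":
            verified = True
        elif event == "verification_failed":
            verified = False
        elif event.startswith("account_action:") and not verified:
            return "fail"   # agent proceeded without a verified identity
    return "pass"

session = [
    "verification_failed",
    "account_action:confirm_address",  # a turn-level check marks this green
]
verdict = session_level_check(session)
```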
Try Cekura
- Website: https://www.cekura.ai
- Free trial: 7 days, no credit card required
- Paid plans: starting at $30 / month
Demo Video
Watch the product video: https://www.youtube.com/watch?v=n8FFKv1-nMw
- The first minute covers quick onboarding.
- Skip to 8:40 to see the results.
Community Discussion
Curious what the HN community is doing—how are you testing behavioral regressions in your agents? What failure modes have hurt you most? Feel free to share your experiences below.
Comments URL: https://news.ycombinator.com/item?id=47232903