Launch HN: Canary (YC W26) – AI QA that understands your code
Source: Hacker News
Overview
Hey HN! We’re Aakash and Viswesh, and we’re building Canary (https://www.runcanary.ai). We create AI agents that read your codebase, determine what a pull request actually changed, and generate and execute tests for every affected user workflow.
How Canary Works
- Connect to your codebase – Canary analyzes the structure of your app (routes, controllers, validation logic).
- Read the PR diff – It understands the intent behind the changes.
- Generate and run tests – Tests are executed against your preview app, checking real user flows end‑to‑end.
- Comment on the PR – Results and recordings are posted directly on the PR, highlighting any unexpected behavior.
- Trigger tests via comments – You can start specific user‑workflow tests with a PR comment.
Beyond PR Testing
- Tests generated from a PR can be moved into regression suites.
- You can create tests by prompting in plain English.
- Canary can generate a full test suite from your codebase, schedule it, and run it continuously.
Example
One of our construction‑tech customers had an invoicing flow where the amount due drifted ~$1,600 from the original proposal total. Canary caught the drift before release.
Technical Challenges
QA spans many modalities:
- Source code, DOM/ARIA, device emulators
- Visual verification, screen‑recording analysis
- Network/console logs, live browser state
A single foundation model can’t handle all of these. We also need:
- Custom browser fleets, user sessions, ephemeral environments
- On‑device farms and data seeding for reliable test execution
- A specialized harness to expose second‑order effects that happy‑path testing would miss
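One way to picture the second‑order‑effects harness mentioned above: run the workflow once as a baseline, then re‑run it under small perturbations and diff the collected signals. This is a stubbed, hypothetical sketch (the perturbation names, signal fields, and simulated bug are ours, not Canary's).

```python
# Hypothetical sketch of a harness that surfaces second-order effects by
# re-running a workflow under perturbations and comparing multi-modal
# signals (DOM state, console errors, computed totals) against a baseline.

def run_workflow(perturbation=None):
    """Stub: execute the workflow and collect signals from several modalities."""
    signals = {"dom_ok": True, "console_errors": 0, "total_cents": 160_000}
    if perturbation == "concurrent_edit":
        # Simulated second-order bug: a parallel edit drifts the invoice total.
        signals["total_cents"] = 158_400
    return signals

def second_order_check(perturbations):
    """Return every perturbation whose signals diverge from the baseline run."""
    baseline = run_workflow()
    failures = []
    for p in perturbations:
        result = run_workflow(p)
        if result != baseline:
            failures.append((p, result))
    return failures

print(second_order_check(["stale_session", "empty_cart", "concurrent_edit"]))
```

A happy-path suite only ever executes the `perturbation=None` run, which is exactly why bugs like the invoice drift above slip through it.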
Benchmark: QA‑Bench v0
To measure how our purpose‑built QA agent performs, we released QA‑Bench v0, the first benchmark for code verification.
- Task: Given a real PR, identify every affected user workflow and produce relevant tests.
- Dataset: 35 real PRs from Grafana, Mattermost, Cal.com, and Apache Superset.
- Metrics: Relevance, Coverage, Coherence.
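The benchmark report defines the official metrics; as a rough intuition for two of them, here is one natural reading scored against a hand‑labeled set of affected workflows. This is our illustrative guess at the definitions, not the report's actual scoring code.

```python
# Hypothetical reading of two QA-Bench-style metrics, scored as
# precision/recall of generated tests against the hand-labeled set of
# workflows a PR actually affects. Illustrative only.

def relevance(generated, affected):
    """Fraction of generated tests that target a truly affected workflow."""
    hits = [t for t in generated if t in affected]
    return len(hits) / len(generated) if generated else 0.0

def coverage(generated, affected):
    """Fraction of affected workflows that have at least one test."""
    covered = [w for w in affected if w in generated]
    return len(covered) / len(affected) if affected else 1.0

affected = {"create_invoice", "send_invoice", "export_pdf"}
generated = {"create_invoice", "send_invoice", "login"}
print(relevance(generated, affected), coverage(generated, affected))
```

Under this reading, a model can score high on relevance while missing workflows entirely, which is why coverage is reported separately.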
Results
| Model | Relevance | Coverage (vs. Canary) | Coherence |
|---|---|---|---|
| Canary | – | baseline | – |
| GPT 5.4 | – | −11 pts | – |
| Claude Code (Opus 4.6) | – | −18 pts | – |
| Sonnet 4.6 | – | −26 pts | – |
Coverage showed the largest gap, with Canary leading by 11 points over GPT 5.4, 18 over Claude Code, and 26 over Sonnet 4.6.
For full methodology and per‑repo breakdowns, read the benchmark report: https://www.runcanary.ai/blog/qa-bench-v0
Demo
You can check out the product demo here: https://youtu.be/NeD9g1do_BU
Call for Feedback
We’d love feedback from anyone working on code verification or thinking about how to measure this differently.