Launch HN: Canary (YC W26) – AI QA that understands your code

Published: March 19, 2026 at 12:01 PM EDT
3 min read

Source: Hacker News

Overview

Hey HN! We’re Aakash and Viswesh, and we’re building Canary (https://www.runcanary.ai). We create AI agents that read your codebase, determine what a pull request actually changed, and generate and execute tests for every affected user workflow.

How Canary Works

  1. Connect to your codebase – Canary analyzes the structure of your app (routes, controllers, validation logic).
  2. Read the PR diff – It understands the intent behind the changes.
  3. Generate and run tests – Tests are executed against your preview app, checking real user flows end‑to‑end.
  4. Comment on the PR – Results and recordings are posted directly on the PR, highlighting any unexpected behavior.
  5. Trigger tests via comments – You can start specific user‑workflow tests with a PR comment.
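The loop above can be sketched in a few lines. This is a hypothetical illustration of the workflow, not Canary's actual API; the function names, the file-to-workflow mapping, and the `known_broken` stand-in for a real browser run are all assumptions.

```python
# Illustrative sketch of a PR-testing loop: map a diff to affected user
# workflows, run one end-to-end check per workflow, report pass/fail.
# Names and data shapes are hypothetical, not Canary's API.

def affected_workflows(diff: dict) -> list[str]:
    """Map changed files to the user workflows they touch (toy routing table)."""
    routes = {"app/invoices.py": "invoice-checkout", "app/auth.py": "login"}
    return sorted({routes[f] for f in diff["files"] if f in routes})

def run_pr_checks(diff: dict) -> list[str]:
    """Generate one end-to-end check per affected workflow and collect results."""
    results = []
    for wf in affected_workflows(diff):
        # Stand-in for executing a real browser test against a preview app.
        passed = wf != diff.get("known_broken")
        results.append(f"{wf}: {'pass' if passed else 'FAIL'}")
    return results

# A diff touching the invoicing code, with a seeded regression:
print(run_pr_checks({"files": ["app/invoices.py"],
                     "known_broken": "invoice-checkout"}))
# → ['invoice-checkout: FAIL']
```

The point of the structure is that tests are keyed to *user workflows* derived from the diff, rather than to files or functions, so the PR comment can report failures in terms a reviewer recognizes.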

Beyond PR Testing

  • Tests generated from a PR can be moved into regression suites.
  • You can create tests by prompting in plain English.
  • Canary can generate a full test suite from your codebase, schedule it, and run it continuously.

Example

One of our construction‑tech customers had an invoicing flow where the amount due drifted from the original proposal total by ~$1,600. Canary caught the regression before release.

Technical Challenges

QA spans many modalities:

  • Source code, DOM/ARIA, device emulators
  • Visual verification, screen‑recording analysis
  • Network/console logs, live browser state

A single foundation model can’t handle all of these. We also need:

  • Custom browser fleets, user sessions, ephemeral environments
  • On‑device farms and data seeding for reliable test execution
  • A specialized harness to expose second‑order effects that happy‑path testing would miss

Benchmark: QA‑Bench v0

To measure our purpose‑built QA agent, we released QA‑Bench v0, the first benchmark for code verification.

  • Task: Given a real PR, identify every affected user workflow and produce relevant tests.
  • Dataset: 35 real PRs from Grafana, Mattermost, Cal.com, and Apache Superset.
  • Metrics: Relevance, Coverage, Coherence.
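To make two of these metrics concrete, here is one plausible way to score them; this is our own assumption about the formulas, not the benchmark's published definition (see the linked report for that). Coverage asks "of the workflows the PR actually affected, how many did the generated tests exercise?"; Relevance asks "of the generated tests, how many target an actually affected workflow?"

```python
# Hypothetical set-based scoring for two QA-Bench-style metrics.
# The formulas are illustrative assumptions, not the benchmark's definitions.

def coverage(affected: set[str], tested: set[str]) -> float:
    """Fraction of truly affected workflows exercised by generated tests."""
    return len(affected & tested) / len(affected) if affected else 1.0

def relevance(affected: set[str], tested: set[str]) -> float:
    """Fraction of generated tests that target an actually affected workflow."""
    return len(affected & tested) / len(tested) if tested else 1.0

# Example: a PR affects three workflows; the agent tests two of them
# plus one irrelevant workflow.
affected = {"invoice-checkout", "proposal-edit", "pdf-export"}
tested = {"invoice-checkout", "proposal-edit", "login"}
print(round(coverage(affected, tested), 3), round(relevance(affected, tested), 3))
# → 0.667 0.667
```

Under this framing, a low Coverage score means the agent missed workflows the PR broke, while a low Relevance score means it wasted runs on workflows the PR never touched.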

Results

  Model                     Coverage gap vs. Canary
  Canary                    Lead
  GPT 5.4                   -11 pts
  Claude Code (Opus 4.6)    -18 pts
  Sonnet 4.6                -26 pts

Coverage showed the largest gap, with Canary leading by 11 points over GPT 5.4, 18 over Claude Code, and 26 over Sonnet 4.6.
For full methodology and per‑repo breakdowns, read the benchmark report: https://www.runcanary.ai/blog/qa-bench-v0

Demo

You can check out the product demo here: https://youtu.be/NeD9g1do_BU

Call for Feedback

We’d love feedback from anyone working on code verification or thinking about how to measure this differently.
