Launch HN: Canary (YC W26) – AI QA that understands your code
Source: Hacker News
Overview
Hey HN! We’re Aakash and Viswesh, and we’re building Canary (https://www.runcanary.ai). We create AI agents that read your codebase, determine what a pull request actually changed, and generate and execute tests for every affected user workflow.
How Canary Works
- Connect to your codebase – Canary analyzes the structure of your app (routes, controllers, validation logic).
- Read the PR diff – It understands the intent behind the changes.
- Generate and run tests – Tests are executed against your preview app, checking real user flows end‑to‑end.
- Comment on the PR – Results and recordings are posted directly on the PR, highlighting any unexpected behavior.
- Trigger tests via comments – You can start specific user‑workflow tests with a PR comment.
Beyond PR Testing
- Tests generated from a PR can be moved into regression suites.
- You can create tests by prompting in plain English.
- Canary can generate a full test suite from your codebase, schedule it, and run it continuously.
Example
One of our construction‑tech customers had an invoicing flow where the amount due drifted ~$1,600 from the original proposal total. Canary caught the drift before release.
Technical Challenges
QA spans many modalities:
- Source code, DOM/ARIA, device emulators
- Visual verification, screen‑recording analysis
- Network/console logs, live browser state
A single foundation model can’t handle all of these. We also need:
- Custom browser fleets, user sessions, ephemeral environments
- On‑device farms and data seeding for reliable test execution
- A specialized harness to expose second‑order effects that happy‑path testing would miss
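One way to picture the second‑order‑effects harness mentioned above: run the workflow once as a baseline, then re‑run it under small perturbations and diff the collected signals. This is a stubbed, hypothetical sketch (the perturbation names, signal fields, and simulated bug are ours, not Canary's).

```python
# Hypothetical sketch of a harness that surfaces second-order effects by
# re-running a workflow under perturbations and comparing multi-modal
# signals (DOM state, console errors, computed totals) against a baseline.

def run_workflow(perturbation=None):
    """Stub: execute the workflow and collect signals from several modalities."""
    signals = {"dom_ok": True, "console_errors": 0, "total_cents": 160_000}
    if perturbation == "concurrent_edit":
        # Simulated second-order bug: a parallel edit drifts the invoice total.
        signals["total_cents"] = 158_400
    return signals

def second_order_check(perturbations):
    """Return every perturbation whose signals diverge from the baseline run."""
    baseline = run_workflow()
    failures = []
    for p in perturbations:
        result = run_workflow(p)
        if result != baseline:
            failures.append((p, result))
    return failures

print(second_order_check(["stale_session", "empty_cart", "concurrent_edit"]))
```

A happy-path suite only ever executes the `perturbation=None` run, which is exactly why bugs like the invoice drift above slip through it.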
Benchmark: QA‑Bench v0
To measure how our purpose‑built QA agent performs, we released QA‑Bench v0, the first benchmark for code verification.
- Task: Given a real PR, identify every affected user workflow and produce relevant tests.
- Dataset: 35 real PRs from Grafana, Mattermost, Cal.com, and Apache Superset.
- Metrics: Relevance, Coverage, Coherence.
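The benchmark report defines the official metrics; as a rough intuition for two of them, here is one natural reading scored against a hand‑labeled set of affected workflows. This is our illustrative guess at the definitions, not the report's actual scoring code.

```python
# Hypothetical reading of two QA-Bench-style metrics, scored as
# precision/recall of generated tests against the hand-labeled set of
# workflows a PR actually affects. Illustrative only.

def relevance(generated, affected):
    """Fraction of generated tests that target a truly affected workflow."""
    hits = [t for t in generated if t in affected]
    return len(hits) / len(generated) if generated else 0.0

def coverage(generated, affected):
    """Fraction of affected workflows that have at least one test."""
    covered = [w for w in affected if w in generated]
    return len(covered) / len(affected) if affected else 1.0

affected = {"create_invoice", "send_invoice", "export_pdf"}
generated = {"create_invoice", "send_invoice", "login"}
print(relevance(generated, affected), coverage(generated, affected))
```

Under this reading, a model can score high on relevance while missing workflows entirely, which is why coverage is reported separately.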
Results
| Model | Relevance | Coverage (vs. Canary) | Coherence |
|---|---|---|---|
| Canary | – | baseline | – |
| GPT 5.4 | – | −11 pts | – |
| Claude Code (Opus 4.6) | – | −18 pts | – |
| Sonnet 4.6 | – | −26 pts | – |
Coverage showed the largest gap, with Canary leading by 11 points over GPT 5.4, 18 over Claude Code, and 26 over Sonnet 4.6.
For full methodology and per‑repo breakdowns, read the benchmark report: https://www.runcanary.ai/blog/qa-bench-v0
Demo
You can check out the product demo here: https://youtu.be/NeD9g1do_BU
Call for Feedback
We’d love feedback from anyone working on code verification or thinking about how to measure this differently.