We built test mode. Then discovered it was broken.

Published: June 8, 2026 at 05:00 AM EDT


Source: Dev.to

Part of building jhansi.io in public.

Test mode sounded simple. Upload code, pass a command, jhansi runs it + your test suite. Done.

Except it wasn't done. First run: empty output. No errors. Just silence.

Here's what broke, and how it changed how we think about AI-generated code.

AI writes code. Scripts, APIs, full backends. But code without proof is liability. Test mode is the proof.

You upload a project to a jhansi sandbox, pass the command that starts your app, and jhansi:

1. Runs the command
2. Waits for the server to come up
3. Executes your test suite against it
4. Returns results
5. Kills everything

All inside an isolated container. Nothing escapes. Nothing persists.

This is the verification layer missing from Cursor, Claude Code, Windsurf. They generate. We verify.

v0.4 of test mode accepted a filename. Upload app.py, call exec with filename: "app.py", jhansi figures out how to run it.

The problem: real projects aren't single files. A Flask app is app.py + tests/ + requirements.txt. When we uploaded them separately, they landed flat in the workspace. pytest couldn't find tests/. The installer couldn't find requirements.txt.

We built test mode for the toy world. But AI doesn't generate toys. It generates projects. AI agents don't write hello_world.py. They write repos.

Obvious once you see it. Upload the whole project as a zip.

From inside your project

cd my_project && zip -r ../my_project.zip .

Upload to sandbox

curl -X POST http://localhost:8000/v1/sandboxes/sb_abc123/upload \
  -F "file=@my_project.zip"

jhansi extracts it preserving structure. tests/ lands where pytest expects it. requirements.txt lands where the installer looks.

This also killed the filename param. You now pass the actual command:

curl -X POST http://localhost:8000/v1/sandboxes/sb_abc123/exec \
  -H "Content-Type: application/json" \
  -d '{"command": "python app.py", "test": true}'
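The same exec call can be sketched in Python with the standard library. The URL, sandbox ID, and JSON fields come from the curl example. The helper name `build_exec_request` is hypothetical. The response format isn't described in this post, so the sketch only builds the request and leaves sending it as a commented-out step.

```python
import json
import urllib.request


def build_exec_request(base_url, sandbox_id, command, test=True):
    """Build the POST request for the exec endpoint (hypothetical helper)."""
    url = f"{base_url}/v1/sandboxes/{sandbox_id}/exec"
    body = json.dumps({"command": command, "test": test}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_exec_request("http://localhost:8000", "sb_abc123", "python app.py")

# Sending needs a running jhansi instance; uncomment to try it:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Only the command string changes between languages. For a Node project it might be "node server.js", with the rest of the request unchanged.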

Language-agnostic. Python, Node, Go, Java. Same API. jhansi handles the runtime.

When test: true:

1. Install deps: blocking. Wait for pip install to finish. This was bug #2.
2. Start your app: detached, in the background.
3. Wait 2s for the server to bind to its port.
4. Run tests: pytest, jest, go test, mvn test. Auto-detected.
5. Return output: stdout, stderr, test summary.
6. Kill container: no state leaks.

Test runner needs zero config. If pytest finds it locally, we find it in the sandbox.

v1 ran install + app start in one Docker command. Container starts → pip install begins → python app.py tries to start → pytest fires 2s later. But pip install flask was still downloading. Server wasn't up. Tests hit ConnectionRefused.

The fix: serialize it. Install deps. Block until done. Start app. Detach. Sleep 2s. Test.

Obvious in hindsight. You only learn this by shipping and watching it fail.

We shipped test mode in v0.4. It works. All four languages tested end-to-end. But it took discovering that AI generates projects, not scripts, to get there.

The first design was for the demo. The second design is for the world AI actually creates.

This is why building in public matters. Not to announce features. To document how the problem reveals itself when you touch it.

v0.5 is serve mode: start a server, get a temporary preview URL, share it with your team, kill it when you're done. The last verification step before you deploy anywhere real. No more "works on my machine" from an LLM.

Code is open source at github.com/jhansi-io/petri. Apache 2.0. Self-host today.

Building AI tooling at a bank or fintech and this sounds familiar? I want to hear from you.

jhansi.io: the missing runtime layer for AI-generated code.
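For readers who want to see the serialized order from the bug #2 fix in code, here is a minimal local sketch using Python's subprocess module. It is not jhansi's implementation. The function name `run_test_mode` and its parameters are made up, and it runs plain local processes where jhansi uses a container. It only shows the ordering: a blocking install, a detached app start, a fixed wait, then the tests, then teardown.

```python
import subprocess
import time


def run_test_mode(install_cmd, app_cmd, test_cmd, wait_seconds=2.0):
    """Hypothetical re-creation of the serialized sequence; not jhansi's code."""
    # 1. Install deps. subprocess.run blocks until the installer exits,
    #    so the app can never start while packages are still downloading.
    subprocess.run(install_cmd, check=True)

    # 2. Start the app detached. Popen returns immediately.
    app = subprocess.Popen(app_cmd)
    try:
        # 3. Fixed wait so the server can bind to its port.
        time.sleep(wait_seconds)

        # 4. Run the tests and capture their output.
        tests = subprocess.run(test_cmd, capture_output=True, text=True)
        return tests.returncode, tests.stdout, tests.stderr
    finally:
        # 5. Tear the app down so nothing keeps running afterwards.
        app.terminate()
        try:
            app.wait(timeout=5)
        except subprocess.TimeoutExpired:
            app.kill()
            app.wait()
```

For a Flask project a call might look like `run_test_mode(["pip", "install", "-r", "requirements.txt"], ["python", "app.py"], ["pytest"])`. The return code, stdout, and stderr stand in for the "return output" step.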
