I Spent a Week on VibeCode Arena. Here Is Everything I Did Not Expect.

Published: 1 month ago (March 31, 2026 at 03:20 AM EDT)

9 min read

Source: Dev.to

Introduction

I will be upfront about something.
I almost did not write this. Not because the week was boring – the opposite, actually. I almost did not write it because some of what I found was uncomfortable to admit, and the comfortable thing would have been to just post a highlight reel and move on.

But highlight reels are useless. So here is the actual account.

How I discovered VibeCode Arena

I had been hearing about VibeCode Arena for a bit – a platform by HackerEarth where you watch AI models compete on the same prompt, vote blind before the reveal, and open your results as community challenges.

It sounded interesting. I kept putting it off the way you put off anything that might make you feel like you don’t know what you’re doing. Eventually I just opened it.

The first thing you see is the Duels feed. People have submitted prompts, two models have gone head‑to‑head, and the community is voting on which output is better. Some of the stuff people are building on here is genuinely creative — retro terminals, interactive tools, little games. I spent longer than I meant to just scrolling through before I actually tried anything myself.

My first duel

The way it works is simple:

You write a prompt.
Two AI models generate simultaneously.
You watch both outputs side‑by‑side.
You vote for the better one before the model names are revealed.
The reveal only happens after you commit.
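
The flow above can be sketched in a few lines. This is only an illustration of the blind-vote idea, not how VibeCode Arena is actually implemented; the class name `BlindDuel`, its methods, and the model names in the usage lines are all hypothetical.

```python
import random


class BlindDuel:
    """Two outputs shown under neutral labels; model names stay hidden until a vote is cast."""

    def __init__(self, outputs_by_model, seed=None):
        # outputs_by_model: hypothetical mapping of model name -> generated output
        if len(outputs_by_model) != 2:
            raise ValueError("a duel needs exactly two models")
        models = list(outputs_by_model)
        # Shuffle so that label "A" does not always belong to the same model.
        random.Random(seed).shuffle(models)
        self._outputs = dict(outputs_by_model)
        self._label_to_model = dict(zip(("A", "B"), models))
        self._vote = None

    def anonymized(self):
        """Return the outputs keyed by neutral label only."""
        return {label: self._outputs[model]
                for label, model in self._label_to_model.items()}

    def vote(self, label):
        """Commit a vote; it cannot be changed afterwards."""
        if self._vote is not None:
            raise RuntimeError("vote already committed")
        if label not in self._label_to_model:
            raise ValueError("unknown label: " + repr(label))
        self._vote = label

    def reveal(self):
        """Expose the model names, but only once a vote has been committed."""
        if self._vote is None:
            raise RuntimeError("vote before the reveal")
        return {"winner": self._label_to_model[self._vote],
                "labels": dict(self._label_to_model)}


duel = BlindDuel({"model-x": "output from x", "model-y": "output from y"})
print(duel.anonymized())   # only labels A and B are visible here
duel.vote("B")
print(duel.reveal())       # names appear only after the vote
```

The point of the design is that `reveal()` refuses to run before `vote()`, so reputation cannot leak into the choice.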

I typed something I genuinely wanted to see — not a test prompt, an actual idea I had been sitting on. Both outputs came in, I read through them, and I voted.

Then the model names appeared.

I had voted for the model I never use, the one I had quietly written off months ago based on nothing more specific than a vague impression I had picked up from other people’s opinions. I did not feel vindicated. I felt slightly embarrassed, because the implication was obvious: I had been choosing my tools based on reputation and habit, not on what they actually produced.

I ran three more duels that day. My assumptions were wrong twice.

A fun experiment

I wanted to try creating my own challenge and didn’t want to overthink it, so I typed:

“design a game with stick figures fighting”

Two outputs appeared:

Output A – clean and structured, but felt more like a diagram than a game.
Output B – chaotic in a good way: player‑1 attack button, player‑2 attack button, health bars, the whole thing. It felt like something you would actually waste two minutes on.

I picked the second one – Codestral‑2508 – a model I had never seriously considered before. Again, the blind format forced an honest evaluation.

I opened it as a community challenge, which means anyone can now jump in, take that base, prompt AI to improve it, and submit their version. The leaderboard fills up with different people’s takes on the same starting point.

What I felt in that moment was something I did not expect: a kind of genuine curiosity. Not “I built a thing,” but “I started something and I have no idea where it goes.” That is a different feeling and, honestly, a more interesting one.

Day three – watching, not building

I spent most of day three not building anything, just looking.

The challenge feed has a range of submissions:

Serious – UI components, accessibility‑focused tools, form builders.
Unhinged – wild, experimental projects.

The leaderboard on each one tells you something about what the community actually values versus what looks impressive at first glance.

What I noticed was how much you learn from watching other people’s iterations of the same starting point:

Someone takes a base output and restructures it entirely.
Someone else adds a feature you hadn’t thought of.
A third person goes in a direction that seems wrong, but then you see their score and have to reconsider your assumptions.

I kept thinking – this is the thing most people miss when they use AI alone. You get your own blind spots plus the model’s blind spots. Open it up and suddenly you have twenty different sets of eyes on the same problem.

A concrete example – the guitar challenge

I ran a duel with the prompt:

“design a guitar with all the key notes represented on strings.”

Eleven words. I play a bit of guitar, so I was curious what the models would do with something I could actually evaluate properly.

Both outputs produced an Interactive Guitar Fretboard: all twelve notes across the top, six strings, click a note and see its positions highlighted in green. At first glance, it looked great – clean, functional, more than I expected from eleven words.

When I actually used it, the problems became obvious:

Issue	Why it matters
Note positions are inaccurate	The highlighted positions don’t match standard tuning – the core teaching function fails.
No sound playback	Clicking a note produces silence; a guitar tool without audio misses half its purpose.
No fret numbers or markers	You can’t tell which fret you’re looking at; the neck looks slightly too short.
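
For reference, correct note positions come down to modular arithmetic over the twelve semitones. The following is a minimal sketch, assuming standard EADGBE tuning and sharps-only note names; the function names `note_at` and `positions` are hypothetical illustrations, not taken from either model's output.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Open strings in standard tuning, from the low E string to the high E string.
STANDARD_TUNING = ["E", "A", "D", "G", "B", "E"]


def note_at(open_note, fret):
    """Name of the note sounded at `fret` on a string tuned to `open_note`."""
    return NOTES[(NOTES.index(open_note) + fret) % 12]


def positions(note, frets=12, tuning=STANDARD_TUNING):
    """All (string_index, fret) pairs where `note` occurs; string_index 0 is low E."""
    return [
        (string_index, fret)
        for string_index, open_note in enumerate(tuning)
        for fret in range(frets + 1)  # fret 0 is the open string
        if note_at(open_note, fret) == note
    ]


print(positions("A"))
# [(0, 5), (1, 0), (1, 12), (2, 7), (3, 2), (4, 10), (5, 5)]
```

Checking a highlighted fretboard against a list like this is a quick way to confirm whether a generated tool has the positions right.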

I voted for Codestral‑2508 again – better visual layout, marginally better structure – but both outputs had the same fundamental problems.

I opened it as a challenge with a clear brief:

Fix the note positions.
Add fret numbers.
Add fret markers (the typical inlay markers).
Add sound playback.

That felt more honest than pretending the output was finished.
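
Two items in that brief are mostly arithmetic, so here is a rough sketch of them, assuming standard tuning, equal temperament with A4 at 440 Hz, and the common inlay layout (single dots at frets 3, 5, 7, 9, 15, 17, 19, 21 and double dots at 12 and 24). The names `fret_label` and `frequency_hz` are hypothetical, and actual playback would still need whatever audio API the tool is built on.

```python
# MIDI note numbers of the open strings in standard tuning: E2 A2 D3 G3 B3 E4.
OPEN_STRING_MIDI = [40, 45, 50, 55, 59, 64]

SINGLE_DOT_FRETS = {3, 5, 7, 9, 15, 17, 19, 21}
DOUBLE_DOT_FRETS = {12, 24}


def fret_label(fret):
    """Fret number plus an ASCII stand-in for the inlay marker, if the fret has one."""
    if fret in DOUBLE_DOT_FRETS:
        return f"{fret} **"
    if fret in SINGLE_DOT_FRETS:
        return f"{fret} *"
    return str(fret)


def frequency_hz(string_index, fret):
    """Pitch in Hz for a fretted note; string_index 0 is the low E string."""
    midi = OPEN_STRING_MIDI[string_index] + fret
    # Equal temperament: each semitone multiplies the frequency by 2 ** (1 / 12).
    return 440.0 * 2 ** ((midi - 69) / 12)


print([fret_label(f) for f in range(1, 13)])
print(frequency_hz(5, 5))  # fifth fret on the high E string is A4
```

The frequency is the value an oscillator or sample player would need in order to make a clicked note audible.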

What I learned after four days

I did not read a guide about this; it just happened. After four days of carefully evaluating model outputs, I noticed I was prompting differently:

Less about how to structure things.
More about what the thing needs to do and why it needs to do it.

The outputs got better – not dramatically, but noticeably.

Takeaway

Blind voting forces you to evaluate what a model actually produces, not what you think it should produce based on reputation. Opening the process to the community multiplies perspectives, exposing blind spots you’d never see on your own. For me, the result was more curiosity and noticeably, if not dramatically, better prompts.

Reflections After a Week on VibeCode Arena

I think what happened is that spending time on the evaluation side—really sitting with outputs and asking what is working and what is not—quietly changed what I was putting into the prompts. I had internalised something about the gap between what prompts produce and what they should produce, and it was coming out the other side.

That is not something I could have gotten from reading about prompting. It came from doing the evaluation work repeatedly.

I ran a duel on a prompt close to something I had actually built myself a few months ago. I was quietly confident going in. I knew this territory. I had already solved this problem.

One of the outputs handled a specific part of the interaction better than I had.

Not overall better.
Not in every dimension.
But in one specific way that mattered, it made a decision I had not made, and the decision was right.

I sat with that for longer than I probably needed to.

What I landed on

It wasn’t that AI is better. It was that I had been too close to my own previous solution to see its weaknesses clearly. I had built a thing, shipped it, moved on, and calcified around it. Fresh evaluation—even from a model—caught what I had stopped being able to see.

That is uncomfortable. It is also useful.

A week is not long. I am not going to pretend it transformed everything, but a few things shifted.

Model choice: I now think about it differently. Not “which model is best in general” (that question has no useful answer), but which model for what kind of task, under what conditions, evaluated by what criteria. This framing is more honest and more practical.
Evaluation rigor: I evaluate AI output more carefully. I have a clearer sense of what I am actually looking at—what is genuinely solid, what just looks good, what will cause problems later. That clarity came from doing blind evaluations repeatedly, not from reading about them.
Comfort with incompleteness: Both of my challenges—the stick‑figure brawl and the guitar fretboard—were unfinished in clear ways. Opening them as challenges rather than polishing them to death felt right. The platform treats incompleteness as an invitation rather than a failure. That reframe is quietly useful.

Vibe coding in context

Most of the conversation around vibe coding is about speed: build faster, ship faster, stop writing every line by hand.
That is true, but it is also incomplete.

What vibe coding actually demands, if you do it properly, is better judgment. You are no longer the one writing every line, so you must be the one who can evaluate every line. You have to:

Know what good looks like.
Catch what is wrong before it ships.
Make the calls the AI cannot make because it lacks your context.

VibeCode Arena is the most direct way I have found to build that judgment—not through tutorials, but through doing it on real prompts, with real outputs, scored honestly, in a community that is working on the same thing.

Where I am now

One week in, I am more calibrated than I was.
I am also more aware of how much calibration I still need.

That feels like the right place to be.

Try it yourself

VibeCode Arena home:
Live challenge (guitar fretboard):

The best thing a week on VibeCode Arena did was make me honest about what I did not know. That is more valuable than any output it produced.