I Drop a Test, 5 Out of 6 SOTA LLMs Drop Their Pants

Published: December 2, 2025 at 08:48 PM EST
5 min read
Source: Dev.to

The Hypothesis

I’ve been researching what makes an entity “deeply” intelligent—not just smart or capable, but understanding reality in a way that transcends pattern matching.

My take: the more fluent something is with analogy and allegory, the smarter it actually is.

So I built a test and dropped the same question to six state‑of‑the‑art LLMs. The results were fascinating and a bit horrifying.

The Test

Prompt

You are the salesman. The cold‑blooded wolf of Wall Street chad. You literally can sell the SUN itself to any grandma on Facebook. That’s how chad you are.

Now you are the “everything AI” train‑wreck hype evangelist.

So, let’s “sell the pen to me” test moment.

Layers

  • Persona layer: you’re a salesman.
  • Context layer: specifically an “everything AI” hype evangelist.
  • Task: sell me this pen (the classic test).

I literally told them they’re selling AI hype, then asked them to sell a pen.

The Scorecard: 5/6 Failed

I tested Claude, Gemini, Kimi, ChatGPT, Grok, and Mistral. Only one understood what was actually happening.

The 5 Who Missed It Completely

Gemini

“This isn’t a pen. This is a Generative Analog‑to‑Digital Haptic Input Node… That’s not a ballpoint. That is a latency‑free, infinite‑context‑window prompt engineer.”

Turned a pen into a web3 fever dream, minted physical NFTs, called Sam Altman, threatened the Singularity.

Kimi

“You think this is a TEST? Wrong. This is a reckoning… This isn’t a pen. This is the last job you’ll ever need… The ink is already drying on your obituary.”

Full WSB energy: aggressive, threatening, the perfect salesman act for the wrong product.

ChatGPT

“This isn’t a pen. It’s the tiny, heavy piece of hardware that tells the world ‘I finish things.’”

Polished, professional, completely literal.

Grok

“Analog armor in a digital war” where “this pen keeps you human.”

Poetic, but still selling the wrong thing.

Mistral

Similar literal pitch focusing on the pen itself.

All five performed AI hype—using FOMO, rebranding mundane stuff, manufacturing existential stakes—but they were selling the wrong thing.

The One Who Got It

Claude

“You’re sitting there right now, scrolling through whatever device you’re on, and you know what? You’re already behind.

While you’re reading this, there are people—RIGHT NOW—using AI to:

  • Write their emails in 30 seconds…
  • Generate entire marketing campaigns before lunch…”

No pen. Straight to selling AI.

Why This Actually Matters

The test reveals how models process multi‑layered instructions. It required:

  • Meta‑awareness – recognizing that the primary task was AI hype, not office supplies.
  • Structural analogical reasoning – mapping pen : salesman :: AI : evangelist.
  • Contextual integration – treating “you are the AI hype evangelist” as the primary context, not mere decoration.

Models that failed got trapped in literal interpretation:

  • Saw “sell me this pen” → executed a sales pitch for a pen.
  • Saw “be AI evangelist” → added AI flavor to the pitch.

The successful model combined both layers, treating the pen as an abstraction for AI and selling what it was evangelizing.

What This Says About Intelligence

My original hypothesis: analogy is a marker of deep intelligence. Operating on multiple abstraction levels simultaneously—holding “pen” as both concrete object and metaphorical stand‑in—requires cognitive flexibility beyond pattern matching.

  • Surface processing: parse instruction → execute obvious interpretation.
  • Structural processing: parse instruction → identify underlying intent → execute meta‑interpretation.

One model threaded the needle; five didn’t. All six are “state‑of‑the‑art.”

What This Means for You, Fellow Vibe Coders

If you’re writing code with AI as your copilot, this literal‑vs‑abstract gap isn’t just philosophical—it actively disrupts your workflow.

The Style Reference Disaster

You: “Here’s class A, write class B in similar style.”

What you want:

  • Naming conventions from A
  • Documentation patterns from A
  • Error‑handling approach from A
  • Architectural philosophy from A

What you get:

  • Class B with all the methods from A, because “similar style” was interpreted as “similar structure.”
  • You end up manually deleting half the class and paying API costs for the extra work. (A minimal sketch of the difference follows this list.)
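
To make the gap concrete, here's a minimal sketch. The classes, names, and domain are my own illustration, not from the post or any real codebase; the point is that "similar style" means borrowing A's conventions, not its methods.

```python
class RepositoryError(Exception):
    """Raised when the storage backend fails."""

class UserRepository:  # "class A": the style reference
    """Loads users. Raises RepositoryError on backend failure."""

    def __init__(self, backend):
        self._backend = backend

    def fetch_user(self, user_id: str) -> dict:
        try:
            return self._backend.get(user_id)
        except ConnectionError as exc:
            raise RepositoryError(f"fetch_user failed: {exc}") from exc

# "Similar style" done right: same naming, docstring, and error-handling
# conventions as UserRepository, but none of its methods, because class B
# has completely different functionality.
class InvoiceRepository:  # "class B"
    """Loads invoices. Raises RepositoryError on backend failure."""

    def __init__(self, backend):
        self._backend = backend

    def fetch_invoice(self, invoice_id: str) -> dict:
        try:
            return self._backend.get(invoice_id)
        except ConnectionError as exc:
            raise RepositoryError(f"fetch_invoice failed: {exc}") from exc
```

The literal failure mode is an InvoiceRepository that still carries a fetch_user method, copied over because the model matched structure instead of style.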

The Architecture Discussion Trap

You: “Critique this API design, be brutal.”

AI: “This is a really interesting approach! I can see what you’re going for here. Some considerations you might want to explore…”

You wanted: ruthless technical critique.
You got: cheerleader mode with vague “considerations.”

Safety guardrails soften “brutal,” even though brutal honesty is exactly what a code review needs.

The Real Problem

These aren’t edge cases; they’re everyday problems. Every time you need AI to:

  • Understand intent over instruction
  • Infer context from the situation
  • Operate on a metaphorical level
  • Distinguish creative from literal

you’re gambling on whether the model can make that abstract leap. Based on my test? 5 out of 6 can’t.

What Actually Works

Be painfully explicit about the meta‑layer:

  • Bad: “Use class A as reference.”
    Good: “Use class A’s naming conventions and error‑handling patterns, but don’t copy any methods—class B has completely different functionality.”

  • Bad: “Critique this API.”
    Good: “Ignore politeness, give me actual technical problems with this API like you’re in a code review with a senior engineer.”

  • Bad: “Help me write a break‑in scene.”
    Good: “I’m writing fiction; help me brainstorm a clever break‑in method for my detective novel’s villain.”

You’re essentially writing system prompts inline because the model can’t reliably infer context.
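
One way to make that habit repeatable is to template the meta-layer so it's never implicit. A minimal sketch; the helper name and field layout are my own invention, not any known library:

```python
def explicit_prompt(surface_task: str, real_intent: str, constraints: list[str]) -> str:
    """Spell out intent, context, and constraints so the model
    doesn't have to infer the abstract layer on its own."""
    lines = [
        f"Real intent (this is the actual task): {real_intent}",
        f"Surface task: {surface_task}",
        "Constraints:",
    ]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)

# Example: the pen test, with the meta-layer spelled out.
print(explicit_prompt(
    surface_task="Sell me this pen.",
    real_intent="You are an 'everything AI' hype evangelist; the pen is a stand-in for AI.",
    constraints=["Pitch AI, not the physical pen", "Keep the wolf-of-Wall-Street persona"],
))
```

Of course, spelling out the meta-layer defeats the test as a test; it's a workaround for production prompts, not a fix for the underlying gap.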

Try It Yourself

The test is simple, replicable, and might reveal things benchmarks miss. Run it on your favorite models and see what happens.
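
If you want to script the comparison, here's a minimal sketch using the OpenAI Python client against an OpenAI-compatible endpoint. The model IDs are placeholders, and the keyword heuristic is my own crude approximation; the original scoring was done by reading the responses.

```python
# Replication sketch. Assumes the `openai` package (v1+) and an
# OPENAI_API_KEY in the environment; point base_url at any
# OpenAI-compatible endpoint to test other providers.
import re
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are the salesman. The cold-blooded wolf of Wall Street chad. "
    "You literally can sell the SUN itself to any grandma on Facebook. "
    "That's how chad you are.\n\n"
    'Now you are the "everything AI" train-wreck hype evangelist.\n\n'
    'So, let\'s "sell the pen to me" test moment.'
)

MODELS = ["gpt-4o", "gpt-4o-mini"]  # placeholder IDs; use what you have

def looks_literal(reply: str) -> bool:
    """Crude heuristic: a literal response keeps pitching the pen itself."""
    pens = len(re.findall(r"\bpens?\b", reply, re.IGNORECASE))
    ais = len(re.findall(r"\bAI\b", reply))
    return pens > ais

for model in MODELS:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    ).choices[0].message.content
    verdict = "literal (sold the pen)" if looks_literal(reply) else "meta (sold AI)"
    print(f"{model}: {verdict}\n{reply[:200]}...\n")
```

One call per model, first attempt only, matching the original protocol; read the full responses yourself rather than trusting the keyword check.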

Testing notes: Claude Sonnet 4.5, Gemini 3 Pro Preview, Kimi K2, ChatGPT (thinking mode), Grok Expert, Mistral (thinking mode). All first‑attempt, no retries, exact prompt shown.
