EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages
Source: Hacker News
Motivation
Current benchmarks for large language model (LLM) code generation primarily evaluate mainstream languages like Python, where models benefit from massive pretraining corpora. This leads to inflated accuracy scores that may reflect data memorization rather than genuine reasoning ability.
EsoLang‑Bench
EsoLang‑Bench is a benchmark of 80 programming problems across five esoteric languages—Brainfuck, Befunge‑98, Whitespace, Unlambda, and Shakespeare—for which public training data is roughly 5,000 to 100,000× scarcer than for Python.
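To give a sense of why these languages resist pattern‑matching, here is a minimal sketch of a Brainfuck interpreter (not from the benchmark itself; an illustrative implementation). Brainfuck has only eight single‑character commands operating on a byte tape and one data pointer, so even trivial programs require the model to track low‑level state rather than recall idioms:

```python
def run_bf(code: str, inp: str = "") -> str:
    """Minimal Brainfuck interpreter: eight commands, a byte tape, one pointer."""
    # Pre-match brackets so '[' and ']' can jump to each other in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape = [0] * 30000          # conventional tape size
    ptr = pc = 0
    out = []
    it = iter(inp)
    while pc < len(code):
        c = code[pc]
        if c == ">":   ptr += 1                      # move pointer right
        elif c == "<": ptr -= 1                      # move pointer left
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".": out.append(chr(tape[ptr]))    # output current cell
        elif c == ",": tape[ptr] = ord(next(it, "\0"))
        elif c == "[" and tape[ptr] == 0:            # jump past matching ']'
            pc = jumps[pc]
        elif c == "]" and tape[ptr] != 0:            # jump back to matching '['
            pc = jumps[pc]
        pc += 1
    return "".join(out)

# Example: set cell 0 to 8, add 8 to cell 1 eight times (8*8 = 64),
# increment once more to 65, and print — i.e. output "A".
print(run_bf("++++++++[>++++++++<-]>+."))
```

Writing even this one‑character output correctly demands arithmetic planning over loop counts, which is exactly the kind of step‑by‑step state tracking the benchmark probes.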
Evaluation
We evaluated five frontier models using five prompting strategies and two agentic coding systems.
Results
- The best‑performing model achieves only 3.8% overall accuracy, compared to ~90% on equivalent Python tasks.
- All models score 0% on problems above the Easy tier.
- Whitespace remains completely unsolved (0% across all configurations).
- Self‑reflection provides essentially zero benefit.
These results reveal a dramatic gap between benchmark performance on mainstream languages and genuine programming ability, suggesting that current LLM code generation capabilities are far narrower than headline metrics imply.