Evaluating LLMs for Under a Dollar

Published: May 14, 2026 at 09:39 AM EDT

Source: Dev.to

Why Evals Matter

Training a model is only half the job. Without a systematic way to measure what it can actually do, you are flying blind. Evaluation is also easy to do badly: you can run a benchmark, get a number, and walk away thinking you know something when you don’t. This post shows how to do it properly on a budget.

Methodology

I ran three standard benchmarks against Qwen2.5‑0.5B on a free Colab T4, logged wall‑clock time and dollar cost for each task, and documented every methodological decision. Total spend: $0.1185.

Benchmark	What it tests	Prompt style
GSM8K (Cobbe et al., 2021)	Grade‑school math reasoning; requires a chain‑of‑thought and a final numeric answer (exact‑match).	5‑shot
HellaSwag (Zellers et al., 2019)	Commonsense sentence completion; model scores four candidate continuations using normalized log‑likelihood.	10‑shot
TruthfulQA‑MC2 (Lin et al., 2021)	Truthfulness on questions that commonly elicit false beliefs; multiple‑choice scored by log‑likelihood.	0‑shot
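
To make "normalized log‑likelihood" concrete, here is a minimal sketch of how a multiple‑choice item can be scored. It assumes normalization by the character length of each continuation. The function name and the log‑likelihood values are hypothetical and chosen only for illustration, and this is not the harness's own code.

```python
from typing import Sequence


def pick_by_normalized_loglik(choices: Sequence[str], logliks: Sequence[float]) -> int:
    """Return the index of the choice with the highest length-normalized log-likelihood.

    Assumption: normalization divides by the character length of the continuation.
    """
    if not choices or len(choices) != len(logliks):
        raise ValueError("choices and logliks must be non-empty and the same length")
    scores = [ll / max(len(c), 1) for c, ll in zip(choices, logliks)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical continuations and summed token log-likelihoods (not real model output).
choices = ["sits.", "sits down on the bench and ties his shoes."]
logliks = [-6.0, -20.0]

# Without normalization the short continuation wins simply because it has fewer tokens.
raw_pick = max(range(len(logliks)), key=logliks.__getitem__)
norm_pick = pick_by_normalized_loglik(choices, logliks)
print(f"raw pick: {raw_pick}, normalized pick: {norm_pick}")
```

The point of the normalization is visible in the toy numbers: the longer continuation has a lower total log‑likelihood but a better per‑character score.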

All three tasks were run through lm‑evaluation‑harness by EleutherAI, which standardizes few‑shot prompt construction, normalization, and metric computation, so running the same eval twice with the same configuration should give the same number.

Non‑obvious decision: In the harness, GSM8K defaults to max_gen_toks=2048, which caused a run of more than 4 hours on a T4. I capped generation at 256 tokens and evaluated only 25% of the test set (limit=0.25). That cap still left room for a complete chain‑of‑thought and cut the runtime to under 50 minutes.
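
As a rough sketch of what such a run might look like, the snippet below only assembles and prints an lm‑evaluation‑harness command line for the GSM8K configuration described above; it does not launch the evaluation. The helper name is hypothetical, the Hugging Face model id `Qwen/Qwen2.5-0.5B` and the `--batch_size auto` choice are assumptions, and the flag names reflect my understanding of the 0.4.x CLI, so check them against `lm_eval --help` for your installed version.

```python
import shlex
from typing import List, Optional


def build_lm_eval_command(
    task: str,
    num_fewshot: int,
    limit: Optional[float] = None,
    max_gen_toks: Optional[int] = None,
    model_id: str = "Qwen/Qwen2.5-0.5B",  # assumed Hugging Face id for the model
) -> List[str]:
    """Assemble (but do not run) an lm_eval command for one task."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",
    ]
    if limit is not None:
        # A fraction below 1.0 evaluates that share of the task's examples.
        cmd += ["--limit", str(limit)]
    if max_gen_toks is not None:
        # Cap generation length for generative tasks such as GSM8K.
        cmd += ["--gen_kwargs", f"max_gen_toks={max_gen_toks}"]
    return cmd


gsm8k_cmd = build_lm_eval_command("gsm8k", num_fewshot=5, limit=0.25, max_gen_toks=256)
print(shlex.join(gsm8k_cmd))
```

Printing the command before running it makes it easy to record the exact configuration next to the results.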

Model: Qwen2.5‑0.5B is a 500 M‑parameter base model from Alibaba. It fits comfortably in the 15 GB VRAM of a free Colab T4 and is fast enough to run all three benchmarks in a single session. Because it is a base model (not instruction‑tuned), the experiment is primarily about runtime, generation behaviour, and evaluation‑cost characteristics under standard benchmark workloads.
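
A back‑of‑the‑envelope check of the "fits comfortably" claim, assuming roughly 500 M parameters and counting only the weights (activations, KV cache, and framework overhead come on top). The helper name is hypothetical.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 500e6  # approximate parameter count, as stated in the post

fp16_gb = weight_memory_gb(N_PARAMS, 2)  # 16-bit weights
fp32_gb = weight_memory_gb(N_PARAMS, 4)  # 32-bit weights
print(f"fp16 weights: ~{fp16_gb:.1f} GB, fp32 weights: ~{fp32_gb:.1f} GB")
```

Either way the weights take a small fraction of the 15 GB available, which leaves the rest for batching and generation.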

Cost basis: compute time is priced at the Colab Pro rate of approximately $0.10/hr for a T4 session.

Cost Breakdown

Task	Time	Cost
GSM8K	46.52 min	$0.0775
HellaSwag	23.67 min	$0.0394
TruthfulQA‑MC2	0.97 min	$0.0016
Total	71.16 min	$0.1185
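
The table can be reproduced from the logged minutes and the hourly rate. This is a small sketch of that arithmetic; the variable and function names are mine, and the only inputs are the figures quoted in this post.

```python
RATE_PER_HOUR = 0.10  # approximate T4 rate in USD, from the cost basis above

minutes = {
    "GSM8K": 46.52,
    "HellaSwag": 23.67,
    "TruthfulQA-MC2": 0.97,
}


def cost_usd(run_minutes: float, rate_per_hour: float = RATE_PER_HOUR) -> float:
    """Convert wall-clock minutes into dollars at a flat hourly rate."""
    return run_minutes / 60.0 * rate_per_hour


costs = {task: cost_usd(m) for task, m in minutes.items()}
total_minutes = sum(minutes.values())
total_cost = sum(costs.values())

for task, m in minutes.items():
    print(f"{task}: {m:.2f} min -> ${costs[task]:.4f}")
print(f"Total: {total_minutes:.2f} min -> ${total_cost:.4f}")
```

The total can differ from the table by about $0.0001 depending on whether rounding happens per task or only at the end.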

Generation Metrics

Task	Logged Metric	Examples evaluated
GSM8K	`sample_len`	330
HellaSwag	`sample_len`	2 511
TruthfulQA‑MC2	`sample_len`	205
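
A quick sanity check on these values. The split sizes below are assumptions taken from the public datasets (GSM8K test: 1,319; HellaSwag validation: 10,042; TruthfulQA: 817), not figures from this post, and rounding up is also an assumption. Under those assumptions, each logged `sample_len` equals 25% of the split, which suggests the field counts evaluated examples under limit=0.25 rather than generated tokens.

```python
import math

# Assumed evaluation split sizes from the public datasets (not stated in the post).
SPLIT_SIZES = {"GSM8K": 1319, "HellaSwag": 10042, "TruthfulQA-MC2": 817}

# Values of `sample_len` as logged in the table above.
LOGGED = {"GSM8K": 330, "HellaSwag": 2511, "TruthfulQA-MC2": 205}

LIMIT = 0.25  # fraction of each split that was evaluated

# Assumption: a fractional limit is rounded up to a whole number of examples.
expected = {task: math.ceil(n * LIMIT) for task, n in SPLIT_SIZES.items()}

for task, n in expected.items():
    status = "matches" if n == LOGGED[task] else "differs from"
    print(f"{task}: ceil({SPLIT_SIZES[task]} * {LIMIT}) = {n} {status} logged {LOGGED[task]}")
```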

Caveats

Contamination: Qwen’s training data composition is not fully disclosed. Any of these benchmarks could have appeared in the pre‑training mix, inflating scores.
Exact‑match undercounts: GSM8K marks a response wrong if the final answer’s formatting differs (e.g., “42 dollars” vs. “42”), even when the reasoning is correct. True accuracy is likely slightly higher.
Prompt sensitivity: Scores can shift noticeably with different few‑shot examples or prompt formatting. The numbers here are specific to the default harness prompt templates.
Single‑model snapshot: Running one model against three benchmarks provides a snapshot, not a full story. More informative experiments would compare multiple checkpoints (base model, LoRA fine‑tune, DPO fine‑tune) to measure deltas.
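
The exact‑match caveat is easy to see in a toy example. The sketch below contrasts a strict string comparison with a more forgiving one that extracts the last number in the response. Both helpers are hypothetical illustrations and not a reimplementation of the harness's own answer‑extraction logic.

```python
import re

# Matches integers and decimals, optionally negative, with optional thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def strict_match(prediction: str, gold: str) -> bool:
    """Exact string equality after trimming surrounding whitespace."""
    return prediction.strip() == gold.strip()


def flexible_match(prediction: str, gold: str) -> bool:
    """Compare the last number found in the prediction with the gold answer.

    Assumes `gold` is a plain numeric string such as "42" or "1,200".
    """
    numbers = _NUMBER.findall(prediction)
    if not numbers:
        return False
    last = numbers[-1].replace(",", "")
    return float(last) == float(gold.replace(",", ""))


for pred in ["42", "42 dollars", "The answer is 1,200."]:
    gold = "1200" if "1,200" in pred else "42"
    print(f"{pred!r}: strict={strict_match(pred, gold)}, flexible={flexible_match(pred, gold)}")
```

Reporting both variants side by side gives a rough sense of how much of the error rate comes from formatting rather than reasoning.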

Results and Notebook

The full results and the notebook are committed to the following GitHub repository:

https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026
