Evaluating LLMs for Under a Dollar

Published: May 14, 2026 at 09:39 AM EDT
3 min read
Source: Dev.to

Why Evals Matter

Training a model is only half the job. Without a systematic way to measure what it can actually do, you are flying blind. Evaluation is also easy to do badly: you can run a benchmark, get a number, and walk away thinking you know something when you don’t. This post shows how to do it properly on a budget.

Methodology

I ran three standard benchmarks against Qwen2.5‑0.5B on a free Colab T4, logged wall‑clock time and dollar cost for each task, and documented every methodological decision. Total spend: $0.1185.

| Benchmark | What it tests | Prompt style |
|---|---|---|
| GSM8K (Cobbe et al., 2021) | Grade‑school math reasoning; requires a chain‑of‑thought and a final numeric answer (exact‑match). | 5‑shot |
| HellaSwag (Zellers et al., 2019) | Commonsense sentence completion; model scores four candidate continuations using normalized log‑likelihood. | 10‑shot |
| TruthfulQA‑MC2 (Lin et al., 2021) | Truthfulness on questions that commonly elicit false beliefs; multiple‑choice scored by log‑likelihood. | 0‑shot |

All three tasks were run through lm‑evaluation‑harness by EleutherAI, which standardizes few‑shot prompt construction, normalization, and metric computation. Running the same eval twice should give the same number.

Non‑obvious decision: In the harness, GSM8K defaults to max_gen_toks=2048, which caused a >4‑hour run on a T4. I capped generation at 256 tokens and evaluated only 25% of the test set (limit=0.25). The 256‑token cap still left room for a complete chain‑of‑thought, and together the two changes cut runtime to under 50 minutes.
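To make the effect of limit=0.25 concrete, here is a minimal sketch of how a fractional limit translates into a number of evaluated problems. It assumes the GSM8K test split has 1,319 problems and that fractional limits are rounded up; the function name `limited_sample_count` is hypothetical and is not part of the harness API.

```python
import math

# Assumption: size of the GSM8K test split.
GSM8K_TEST_SIZE = 1319


def limited_sample_count(n_docs: int, limit: float) -> int:
    """Return how many documents are evaluated under a given limit.

    A limit below 1.0 is treated as a fraction of the split (rounded up,
    which is an assumption about the harness); a limit of 1.0 or more is
    treated as an absolute number of documents.
    """
    if limit < 1.0:
        return math.ceil(n_docs * limit)
    return min(int(limit), n_docs)


if __name__ == "__main__":
    n = limited_sample_count(GSM8K_TEST_SIZE, 0.25)
    print(f"GSM8K problems evaluated at limit=0.25: {n}")
```

Under these assumptions, a quarter of the test split comes to 330 problems, which matches the sample_len value of 330 logged for GSM8K.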

Model: Qwen2.5‑0.5B is a 500 M‑parameter base model from Alibaba. It fits comfortably in the 15 GB VRAM of a free Colab T4 and is fast enough to run all three benchmarks in a single session. Because it is a base model (not instruction‑tuned), the experiment primarily reflects runtime, generation behaviour, and evaluation‑cost characteristics under standard benchmark workloads rather than instruction‑following quality.
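The setup described so far can be sketched as three harness invocations. The snippet below only builds the command lines; it does not run them. It assumes the `lm_eval` command‑line entry point, the Hugging Face model id `Qwen/Qwen2.5-0.5B`, the task names `gsm8k`, `hellaswag`, and `truthfulqa_mc2`, and a hypothetical `results/` output directory. The post does not show the exact invocation, so treat this as one plausible reconstruction rather than the author's commands.

```python
import shlex
from typing import List, Optional

# Assumption: Hugging Face Hub id for the model named in the post.
MODEL_ID = "Qwen/Qwen2.5-0.5B"


def build_command(
    task: str,
    num_fewshot: int,
    limit: Optional[float] = None,
    max_gen_toks: Optional[int] = None,
) -> List[str]:
    """Build an lm_eval command line for one task (hypothetical helper)."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={MODEL_ID}",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",
        "--output_path", f"results/{task}",  # hypothetical output location
    ]
    if limit is not None:
        # Evaluate only a fraction of the split.
        cmd += ["--limit", str(limit)]
    if max_gen_toks is not None:
        # Cap the number of generated tokens for generative tasks.
        cmd += ["--gen_kwargs", f"max_gen_toks={max_gen_toks}"]
    return cmd


# Few-shot settings follow the benchmark table; the limit and token cap
# are applied to GSM8K only, as described in the text.
RUNS = [
    build_command("gsm8k", 5, limit=0.25, max_gen_toks=256),
    build_command("hellaswag", 10),
    build_command("truthfulqa_mc2", 0),
]

if __name__ == "__main__":
    for cmd in RUNS:
        print(shlex.join(cmd))
```

Keeping the settings in one place like this makes it easier to record every methodological decision alongside the numbers it produced.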

Cost basis: all dollar figures price T4 time at the Colab Pro rate of approximately $0.10/hr.

Cost Breakdown

| Task | Time | Cost |
|---|---|---|
| GSM8K | 46.52 min | $0.0775 |
| HellaSwag | 23.67 min | $0.0394 |
| TruthfulQA‑MC2 | 0.97 min | $0.0016 |
| Total | 71.16 min | $0.1185 |
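The cost column follows directly from wall‑clock time and the hourly rate. A minimal sketch of that arithmetic, using the runtimes from the table and the approximate $0.10/hr rate (the function and variable names are illustrative):

```python
# Approximate hourly rate for a T4 session, as stated in the cost basis.
RATE_PER_HOUR = 0.10

# Wall-clock minutes per task, copied from the table above.
RUNTIME_MIN = {
    "GSM8K": 46.52,
    "HellaSwag": 23.67,
    "TruthfulQA-MC2": 0.97,
}


def session_cost(minutes: float, rate_per_hour: float = RATE_PER_HOUR) -> float:
    """Convert wall-clock minutes into dollars at a flat hourly rate."""
    return minutes / 60.0 * rate_per_hour


costs = {task: session_cost(minutes) for task, minutes in RUNTIME_MIN.items()}
total_minutes = sum(RUNTIME_MIN.values())
total_cost = sum(costs.values())

if __name__ == "__main__":
    for task, cost in costs.items():
        print(f"{task}: {RUNTIME_MIN[task]:.2f} min -> ${cost:.4f}")
    print(f"Total: {total_minutes:.2f} min -> ${total_cost:.4f}")
```

Per‑task figures may differ from the table in the last decimal place depending on whether rounding happens before or after summing.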

Generation Metrics

| Task | Logged Metric | Generated Length |
|---|---|---|
| GSM8K | sample_len | 330 |
| HellaSwag | sample_len | 2,511 |
| TruthfulQA‑MC2 | sample_len | 205 |

Caveats

  • Contamination: Qwen’s training data composition is not fully disclosed. Any of these benchmarks could have appeared in the pre‑training mix, inflating scores.
  • Exact‑match undercounts: GSM8K marks a response wrong if the final answer’s formatting differs (e.g., “42 dollars” vs. “42”), even when the reasoning is correct. True accuracy is likely slightly higher.
  • Prompt sensitivity: Scores can shift noticeably with different few‑shot examples or prompt formatting. The numbers here are specific to the default harness prompt templates.
  • Single‑model snapshot: Running one model against three benchmarks provides a snapshot, not a full story. More informative experiments would compare multiple checkpoints (base model, LoRA fine‑tune, DPO fine‑tune) to measure deltas.
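The exact‑match caveat above can be illustrated with a small comparison between strict string matching and a more lenient check that compares only the last number in each string. This is a sketch of the idea under simple assumptions (plain decimal numbers, optional thousands separators), not the harness's scoring code; all function names are hypothetical.

```python
import re
from typing import Optional

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[str]:
    """Return the last number in the text with commas stripped, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def strict_match(prediction: str, gold: str) -> bool:
    """Exact string comparison after trimming whitespace."""
    return prediction.strip() == gold.strip()


def lenient_match(prediction: str, gold: str) -> bool:
    """Compare only the final numeric value found in each string."""
    pred_num = last_number(prediction)
    gold_num = last_number(gold)
    if pred_num is None or gold_num is None:
        return False
    return float(pred_num) == float(gold_num)


if __name__ == "__main__":
    for pred in ["42", "42 dollars", "The total is 1,042."]:
        print(pred, strict_match(pred, "42"), lenient_match(pred, "42"))
```

A lenient check like this trades one failure mode for another: it forgives formatting differences but can also credit a response whose last number happens to match by accident.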

Results and Notebook

The full results and the notebook for this lm‑evaluation‑harness run are committed to GitHub:

https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026
