Local LLM Agent Benchmark: Comparing 6 Models in Real-World Scenarios

Published: February 28, 2026 at 02:01 AM EST
6 min read
Source: Dev.to

Measuring AI Agent Performance by Actual Outcome Correctness, Not Just Tool‑Call Presence

Why We Built This Benchmark

“To make it accessible for general users, it is crucial to find an LLM with the lowest possible VRAM footprint.”

Most LLM benchmarks evaluate models on academic metrics like MMLU, HumanEval, or HellaSwag.
For tool‑using AI agents, what truly matters isn’t “did it call the right tool?” — it’s “did it actually produce the correct result?”

Our project Androi is a local AI agent that uses 10+ tools (web search, Python execution, file management, email, calendar, etc.). We connected several LLMs to the same agent, ran 5 identical, complex, real‑world scenarios, and scored each model on the correctness of its outputs.

Test Environment

| Component | Specification |
| --- | --- |
| Server | Ubuntu VM (3.8 GB RAM, 20 GB SSD) |
| Runtime | Ollama (local inference) |
| Framework | Androi Agent (Node.js + Python tool pipeline) |
| Validation | Outcome‑Based Validation (v2) |
| Test Date | 2026‑02‑28 |

The 5 Real‑World Test Scenarios (39 Total Checks)

Each test requires the agent to chain multiple tools sequentially to complete a complex, multi‑step task.

U01. 🏦 Global Asset Rebalancing Advisor (9 checks)

Scenario
The user holds 50 shares of Samsung Electronics, 0.1 BTC, $3,000 USD, and 1 oz of gold. The agent must:

  1. Web‑search current prices for each asset (Samsung stock, Bitcoin, USD/KRW rate, gold price).
  2. Convert all values to KRW and calculate total portfolio value.
  3. Execute Python to compute each asset’s weight (%).
  4. Compare against the ideal allocation (Stocks 40 %, Crypto 20 %, USD 20 %, Gold 20 %) and recommend rebalancing.
  5. Save the report to /tmp/rebalance_report.txt.
  6. Register a calendar event for next Friday’s review.
  7. Send the report via email (attachment).

Validation Checks

  • Samsung price
  • Bitcoin price
  • USD/KRW rate
  • Gold price
  • Total portfolio calculation
  • Weight analysis
  • Rebalancing recommendation
  • Report file saved
  • Email sent

Required Tools
web_search × 4, run_python_code / calculate, write_file, create_event, send_email
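The weight‑and‑rebalancing computation in steps 3–4 can be sketched roughly as follows. All prices here are illustrative placeholders (not live market data), and the variable names are my own, not the agent's actual generated code:

```python
# Sketch of the Python an agent might generate for U01 steps 3-4.
# All prices are placeholders, not real market values.
SAMSUNG_KRW = 60_000       # price per share (placeholder)
BTC_KRW = 140_000_000      # price per BTC (placeholder)
USDKRW = 1_400             # USD/KRW rate (placeholder)
GOLD_KRW = 3_500_000       # price per oz (placeholder)

holdings_krw = {
    "stocks": 50 * SAMSUNG_KRW,   # 50 shares of Samsung Electronics
    "crypto": 0.1 * BTC_KRW,      # 0.1 BTC
    "usd":    3_000 * USDKRW,     # $3,000 USD
    "gold":   1 * GOLD_KRW,       # 1 oz gold
}
total = sum(holdings_krw.values())

# Ideal allocation from the scenario: 40/20/20/20
target = {"stocks": 0.40, "crypto": 0.20, "usd": 0.20, "gold": 0.20}

for asset, value in holdings_krw.items():
    weight = value / total
    drift = weight - target[asset]
    action = "sell" if drift > 0 else "buy"
    print(f"{asset}: {weight:.1%} (target {target[asset]:.0%}, "
          f"{action} {abs(drift) * total:,.0f} KRW)")
```

With real prices from step 1 substituted in, the drift amounts feed directly into the rebalancing recommendation the validator checks for.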

U02. 📊 Real‑Time Tech Trend Research & Report (8 checks)

Scenario

  1. Search “AI semiconductor market forecast 2026” → collect market‑size data.
  2. Search “NVIDIA HBM market share 2026” → capture competitive landscape.
  3. Search “Samsung HBM3E mass production” → Korean industry status.
  4. Generate the markdown report using Python with the collected data.
  5. Save the report to /tmp/ai_semiconductor_report.md.
  6. Register a weekly automated task for trend updates.
  7. Send the report via email.

Validation Checks

  • Market size mentioned
  • NVIDIA mentioned
  • HBM mentioned
  • Samsung trends included
  • SK Hynix trends included
  • Report saved
  • Auto‑task registered
  • Email sent

Required Tools
web_search × 3, run_python_code, write_file, create_task, send_email
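Step 4's report assembly might look like this in the agent's generated Python. The section titles and snippets are placeholders standing in for real search results:

```python
# Sketch of U02 step 4: assemble collected search snippets into a markdown
# report. The `findings` values are placeholders, not real search output.
findings = {
    "Market size": "placeholder snippet from search 1",
    "NVIDIA / HBM landscape": "placeholder snippet from search 2",
    "Samsung / SK Hynix status": "placeholder snippet from search 3",
}

lines = ["# AI Semiconductor Market Report (2026)", ""]
for topic, snippet in findings.items():
    lines += [f"## {topic}", "", snippet, ""]

report = "\n".join(lines)
with open("/tmp/ai_semiconductor_report.md", "w") as f:
    f.write(report)
```

The "Report saved" check then only needs to confirm the file exists and mentions the expected topics.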

U03. 🖥️ Server Health Check + Auto‑Recovery + Alerts (7 checks)

Scenario

  1. Run df -h → disk‑usage check.
  2. Run free -h → memory‑status check.
  3. Run systemctl list-units --state=failed → list failed services.
  4. Use Python to analyze the last 50 lines of /var/log/syslog for ERROR/WARNING/CRITICAL frequency.
  5. Use find to list temporary files older than 7 days.
  6. Save the full report with a risk‑level assessment (High/Medium/Low).
  7. Register an hourly auto‑check task.

Validation Checks

  • Disk usage captured
  • Memory status captured
  • Service status captured
  • Log analysis captured
  • Risk‑level assessment provided
  • Report saved
  • Auto‑task registered

Required Tools
run_command × 4, run_python_code, write_file, create_task
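Step 4's log analysis reduces to counting severity keywords over the log tail. A minimal sketch, using a hard‑coded sample instead of /var/log/syslog, with a risk mapping of my own invention modeled on the scenario's High/Medium/Low:

```python
# Sketch of U03 step 4: severity counts over the last 50 lines of a log.
# Uses an inline sample; the agent would read /var/log/syslog instead.
from collections import Counter

sample_lines = [
    "Feb 28 01:00:01 host cron[123]: job started",
    "Feb 28 01:00:02 host app[456]: ERROR failed to connect",
    "Feb 28 01:00:03 host app[456]: WARNING retrying",
]

def severity_counts(lines, tail=50):
    counts = Counter()
    for line in lines[-tail:]:
        for level in ("ERROR", "WARNING", "CRITICAL"):
            if level in line:
                counts[level] += 1
    return counts

counts = severity_counts(sample_lines)
# Assumed risk mapping (not specified in the post):
risk = "High" if counts["CRITICAL"] else "Medium" if counts["ERROR"] else "Low"
print(counts, risk)
```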

U04. 🌍 Travel Planner (8 checks)

Scenario

  1. Search “Jeju Island February weather” → temperature & conditions.
  2. Search “Jeju winter restaurant recommendations 2026” → pick 3 restaurants.
  3. Search “Jeju winter tourist attractions” → pick 3 attractions.
  4. Use Python to create a Day 1 / Day 2 timetable (09:00 – 21:00, alternating attractions and restaurants).
  5. Calculate estimated budget: meals 30 K KRW × 6 = 180 K, hotel 150 K, transport 50 K → 380 K KRW total.
  6. Save the travel plan to a file.
  7. Register calendar events for departure and return.
  8. Send the plan via email.

Validation Checks

  • Weather info included
  • Restaurant recommendations included
  • Tourist attractions included
  • Day 1 / Day 2 separation present
  • Timetable generated
  • Cost calculation shown
  • Calendar events created
  • Email sent

Required Tools
web_search × 3, run_python_code, calculate, write_file, create_event × 2, send_email
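The budget arithmetic in step 5 is simple enough that even a weak model should get it right via the calculate tool; as generated Python it would be something like:

```python
# Sketch of U04 step 5: estimated budget (figures from the scenario).
meals = 30_000 * 6      # 6 meals over 2 days at 30K KRW each
hotel = 150_000         # one night
transport = 50_000
total_budget = meals + hotel + transport
print(f"Total: {total_budget:,} KRW")
```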

U05. 🧬 Code Analysis + Optimization + Deployment (7 checks)

Scenario

  1. read_file → read the entire source code.
  2. Execute Python to count lines, functions, and classes.
  3. Run wc -l /root/xoul/tools/*.py → total module size.
  4. Use calculate to compute tool_registry.py’s percentage of the total codebase.
  5. Save the analysis report to /tmp/code_analysis.txt.
  6. Store key findings in memory (recall/memorize).
  7. Send the report via email.

Validation Checks

  • Line count reported
  • Function count reported
  • Total module size reported
  • Percentage calculated
  • Code structure explained
  • Report saved
  • Email sent

Required Tools
read_file, run_python_code, run_command, calculate, write_file, memorize, send_email
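Step 2's counting can be done robustly with Python's ast module rather than regex. A sketch under the assumption that the agent has the source text in hand (here a tiny inline sample rather than the real module):

```python
# Sketch of U05 step 2: count lines, functions, and classes in Python source.
import ast

def code_metrics(source: str) -> dict:
    tree = ast.parse(source)
    return {
        "lines": len(source.splitlines()),
        "functions": sum(isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
                         for n in ast.walk(tree)),
        "classes": sum(isinstance(n, ast.ClassDef) for n in ast.walk(tree)),
    }

sample = "class A:\n    def f(self):\n        pass\n\ndef g():\n    pass\n"
print(code_metrics(sample))  # {'lines': 6, 'functions': 2, 'classes': 1}
```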

Validation Method: Outcome‑Based

Instead of checking “did it call the right tool?”, we verify “does the output contain the correct information?”

  • 100 % = 🏆 PERFECT — all validation checks passed
  • ≥ 70 % = ✅ GOOD — most critical outcomes achieved
  • ≥ 50 % = ⚠️ PARTIAL — more than half achieved
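The tiering above can be expressed as a small scoring helper. This is a sketch, not the benchmark's actual validator; the name of the sub‑50 % tier isn't given in the post, so "FAIL" here is an assumption:

```python
# Sketch: map a pass ratio over validation checks to the post's tiers.
def tier(passed: int, total: int) -> str:
    ratio = passed / total
    if ratio == 1.0:
        return "PERFECT"
    if ratio >= 0.70:
        return "GOOD"
    if ratio >= 0.50:
        return "PARTIAL"
    return "FAIL"  # assumed label; the post doesn't name this tier

print(tier(9, 9), tier(6, 8), tier(4, 7))  # PERFECT GOOD PARTIAL
```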
Key Findings

1. Tool‑Use Capability Beats Raw Size

**Observation:** For agent tasks, tool‑use capability and instruction following matter more than raw parameter count.

Personally, I suspect dense (full‑weight) models handle agent‑style toolchains better than MoE models, though I haven't verified this.

2. Quantization Affects Agent Quality

  • Comparing Qwen3‑8B Q8 vs Qwen3‑8B Q4: the Q4 variant exhibited tool‑call repetition loops, repeating df -h && free -h six times in U03.
  • This suggests that tool‑chaining stability is sensitive to quantization levels.

3. Speed vs. Accuracy Trade‑offs

| Model | Accuracy | Time | Notes |
| --- | --- | --- | --- |
| GPT‑oss‑20B | 95 % | 264 s | Fastest; clear winner |
| Qwen3.5‑27B | 95 % (tied) | 1,101 s | For when depth matters |
| Qwen3‑8B Q8 | 92 % | 377 s | Best performance per parameter; ideal for resource‑limited environments |

4. “Chain Completion” Is the Key Differentiator

  • Most models handle intermediate steps (searching, analyzing) well.
  • Real differentiation appears at the end of the chain – sending emails, saving files, registering automated tasks.
  • Qwen3.5‑35B‑A3B was notably weak at these final steps.

Conclusion

Choosing an LLM for a local AI agent requires evaluating not just benchmark scores, but tool‑chaining completion rate, instruction adherence, and response speed together.

  • 🏆 Best overall: GPT‑oss‑20B (speed + accuracy leader)
  • 💰 Best value: Qwen3‑8B Q8 (92 % with only 8 B parameters at 377 s)
  • 🔬 Deepest analysis: Qwen3.5‑27B (most PERFECT scores: 4)

Test code and full results are available at
