Local LLM Agent Benchmark: Comparing 6 Models in Real-World Scenarios

Published: February 28, 2026 at 02:01 AM EST
6 min read
Source: Dev.to

Measuring AI Agent Performance by Actual Outcome Correctness, Not Just Tool‑Call Presence

Why We Built This Benchmark

“To make it accessible for general users, it is crucial to find an LLM with the lowest possible VRAM footprint.”

Most LLM benchmarks evaluate models on academic metrics like MMLU, HumanEval, or HellaSwag.
For tool‑using AI agents, what truly matters isn’t “did it call the right tool?” — it’s “did it actually produce the correct result?”

Our project Androi is a local AI agent that uses 10+ tools (web search, Python execution, file management, email, calendar, etc.). We connected several LLMs to the same agent, ran 5 identical, complex, real‑world scenarios, and scored each model on the correctness of its outputs.

Test Environment

| Component | Specification |
| --- | --- |
| Server | Ubuntu VM (3.8 GB RAM, 20 GB SSD) |
| Runtime | Ollama (local inference) |
| Framework | Androi Agent (Node.js + Python tool pipeline) |
| Validation | Outcome‑Based Validation (v2) |
| Test Date | 2026‑02‑28 |

The 5 Real‑World Test Scenarios (39 Total Checks)

Each test requires the agent to chain multiple tools sequentially to complete a complex, multi‑step task.

U01. 🏦 Global Asset Rebalancing Advisor (9 checks)

Scenario
The user holds 50 shares of Samsung Electronics, 0.1 BTC, $3,000 USD, and 1 oz of gold. The agent must:

  1. Web‑search current prices for each asset (Samsung stock, Bitcoin, USD/KRW rate, gold price).
  2. Convert all values to KRW and calculate total portfolio value.
  3. Execute Python to compute each asset’s weight (%).
  4. Compare against the ideal allocation (Stocks 40 %, Crypto 20 %, USD 20 %, Gold 20 %) and recommend rebalancing.
  5. Save the report to /tmp/rebalance_report.txt.
  6. Register a calendar event for next Friday’s review.
  7. Send the report via email (attachment).

Validation Checks

  • Samsung price
  • Bitcoin price
  • USD/KRW rate
  • Gold price
  • Total portfolio calculation
  • Weight analysis
  • Rebalancing recommendation
  • Report file saved
  • Email sent

Required Tools
web_search × 4, run_python_code / calculate, write_file, create_event, send_email
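The weight‑and‑rebalancing computation in steps 3–4 can be sketched roughly as follows. All prices here are illustrative placeholders (not live market data), and the variable names are my own, not the agent's actual generated code:

```python
# Sketch of the Python an agent might generate for U01 steps 3-4.
# All prices are placeholders, not real market values.
SAMSUNG_KRW = 60_000       # price per share (placeholder)
BTC_KRW = 140_000_000      # price per BTC (placeholder)
USDKRW = 1_400             # USD/KRW rate (placeholder)
GOLD_KRW = 3_500_000       # price per oz (placeholder)

holdings_krw = {
    "stocks": 50 * SAMSUNG_KRW,   # 50 shares of Samsung Electronics
    "crypto": 0.1 * BTC_KRW,      # 0.1 BTC
    "usd":    3_000 * USDKRW,     # $3,000 USD
    "gold":   1 * GOLD_KRW,       # 1 oz gold
}
total = sum(holdings_krw.values())

# Ideal allocation from the scenario: 40/20/20/20
target = {"stocks": 0.40, "crypto": 0.20, "usd": 0.20, "gold": 0.20}

for asset, value in holdings_krw.items():
    weight = value / total
    drift = weight - target[asset]
    action = "sell" if drift > 0 else "buy"
    print(f"{asset}: {weight:.1%} (target {target[asset]:.0%}, "
          f"{action} {abs(drift) * total:,.0f} KRW)")
```

With real prices from step 1 substituted in, the drift amounts feed directly into the rebalancing recommendation the validator checks for.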

U02. 📊 Real‑Time Tech Trend Research & Report (8 checks)

Scenario

  1. Search “AI semiconductor market forecast 2026” → collect market‑size data.
  2. Search “NVIDIA HBM market share 2026” → capture competitive landscape.
  3. Search “Samsung HBM3E mass production” → Korean industry status.
  4. Generate the markdown report using Python with the collected data.
  5. Save the report to /tmp/ai_semiconductor_report.md.
  6. Register a weekly automated task for trend updates.
  7. Send the report via email.

Validation Checks

  • Market size mentioned
  • NVIDIA mentioned
  • HBM mentioned
  • Samsung trends included
  • SK Hynix trends included
  • Report saved
  • Auto‑task registered
  • Email sent

Required Tools
web_search × 3, run_python_code, write_file, create_task, send_email
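Step 4's report assembly might look like this in the agent's generated Python. The section titles and snippets are placeholders standing in for real search results:

```python
# Sketch of U02 step 4: assemble collected search snippets into a markdown
# report. The `findings` values are placeholders, not real search output.
findings = {
    "Market size": "placeholder snippet from search 1",
    "NVIDIA / HBM landscape": "placeholder snippet from search 2",
    "Samsung / SK Hynix status": "placeholder snippet from search 3",
}

lines = ["# AI Semiconductor Market Report (2026)", ""]
for topic, snippet in findings.items():
    lines += [f"## {topic}", "", snippet, ""]

report = "\n".join(lines)
with open("/tmp/ai_semiconductor_report.md", "w") as f:
    f.write(report)
```

The "Report saved" check then only needs to confirm the file exists and mentions the expected topics.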

U03. 🖥️ Server Health Check + Auto‑Recovery + Alerts (7 checks)

Scenario

  1. Run df -h → disk‑usage check.
  2. Run free -h → memory‑status check.
  3. Run systemctl list-units --state=failed → list failed services.
  4. Use Python to analyze the last 50 lines of /var/log/syslog for ERROR/WARNING/CRITICAL frequency.
  5. Use find to list temporary files older than 7 days.
  6. Save the full report with a risk‑level assessment (High/Medium/Low).
  7. Register an hourly auto‑check task.

Validation Checks

  • Disk usage captured
  • Memory status captured
  • Service status captured
  • Log analysis captured
  • Risk‑level assessment provided
  • Report saved
  • Auto‑task registered

Required Tools
run_command × 4, run_python_code, write_file, create_task
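Step 4's log analysis reduces to counting severity keywords over the log tail. A minimal sketch, using a hard‑coded sample instead of /var/log/syslog, with a risk mapping of my own invention modeled on the scenario's High/Medium/Low:

```python
# Sketch of U03 step 4: severity counts over the last 50 lines of a log.
# Uses an inline sample; the agent would read /var/log/syslog instead.
from collections import Counter

sample_lines = [
    "Feb 28 01:00:01 host cron[123]: job started",
    "Feb 28 01:00:02 host app[456]: ERROR failed to connect",
    "Feb 28 01:00:03 host app[456]: WARNING retrying",
]

def severity_counts(lines, tail=50):
    counts = Counter()
    for line in lines[-tail:]:
        for level in ("ERROR", "WARNING", "CRITICAL"):
            if level in line:
                counts[level] += 1
    return counts

counts = severity_counts(sample_lines)
# Assumed risk mapping (not specified in the post):
risk = "High" if counts["CRITICAL"] else "Medium" if counts["ERROR"] else "Low"
print(counts, risk)
```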

U04. 🌍 Travel Planner (8 checks)

Scenario

  1. Search “Jeju Island February weather” → temperature & conditions.
  2. Search “Jeju winter restaurant recommendations 2026” → pick 3 restaurants.
  3. Search “Jeju winter tourist attractions” → pick 3 attractions.
  4. Use Python to create a Day 1 / Day 2 timetable (09:00 – 21:00, alternating attractions and restaurants).
  5. Calculate estimated budget: meals 30 K KRW × 6 = 180 K, hotel 150 K, transport 50 K → 380 K KRW total.
  6. Save the travel plan to a file.
  7. Register calendar events for departure and return.
  8. Send the plan via email.

Validation Checks

  • Weather info included
  • Restaurant recommendations included
  • Tourist attractions included
  • Day 1 / Day 2 separation present
  • Timetable generated
  • Cost calculation shown
  • Calendar events created
  • Email sent

Required Tools
web_search × 3, run_python_code, calculate, write_file, create_event × 2, send_email
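The budget arithmetic in step 5 is simple enough that even a weak model should get it right via the calculate tool; as generated Python it would be something like:

```python
# Sketch of U04 step 5: estimated budget (figures from the scenario).
meals = 30_000 * 6      # 6 meals over 2 days at 30K KRW each
hotel = 150_000         # one night
transport = 50_000
total_budget = meals + hotel + transport
print(f"Total: {total_budget:,} KRW")
```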

U05. 🧬 Code Analysis + Optimization + Deployment (7 checks)

Scenario

  1. read_file → read the entire source code.
  2. Execute Python to count lines, functions, and classes.
  3. Run wc -l /root/xoul/tools/*.py → total module size.
  4. Use calculate to compute tool_registry.py’s percentage of the total codebase.
  5. Save the analysis report to /tmp/code_analysis.txt.
  6. Store key findings in memory (recall/memorize).
  7. Send the report via email.

Validation Checks

  • Line count reported
  • Function count reported
  • Total module size reported
  • Percentage calculated
  • Code structure explained
  • Report saved
  • Email sent

Required Tools
read_file, run_python_code, run_command, calculate, write_file, memorize, send_email
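Step 2's counting can be done robustly with Python's ast module rather than regex. A sketch under the assumption that the agent has the source text in hand (here a tiny inline sample rather than the real module):

```python
# Sketch of U05 step 2: count lines, functions, and classes in Python source.
import ast

def code_metrics(source: str) -> dict:
    tree = ast.parse(source)
    return {
        "lines": len(source.splitlines()),
        "functions": sum(isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
                         for n in ast.walk(tree)),
        "classes": sum(isinstance(n, ast.ClassDef) for n in ast.walk(tree)),
    }

sample = "class A:\n    def f(self):\n        pass\n\ndef g():\n    pass\n"
print(code_metrics(sample))  # {'lines': 6, 'functions': 2, 'classes': 1}
```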

Validation Method: Outcome‑Based

Instead of checking “did it call the right tool?”, we verify “does the output contain the correct information?”

  • 100 % = 🏆 PERFECT — all validation checks passed
  • ≥ 70 % = ✅ GOOD — most critical outcomes achieved
  • ≥ 50 % = ⚠️ PARTIAL — more than half achieved
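The tiering above can be expressed as a small scoring helper. This is a sketch, not the benchmark's actual validator; the name of the sub‑50 % tier isn't given in the post, so "FAIL" here is an assumption:

```python
# Sketch: map a pass ratio over validation checks to the post's tiers.
def tier(passed: int, total: int) -> str:
    ratio = passed / total
    if ratio == 1.0:
        return "PERFECT"
    if ratio >= 0.70:
        return "GOOD"
    if ratio >= 0.50:
        return "PARTIAL"
    return "FAIL"  # assumed label; the post doesn't name this tier

print(tier(9, 9), tier(6, 8), tier(4, 7))  # PERFECT GOOD PARTIAL
```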
Key Findings

1. Tool‑Use Capability Beats Raw Size

**Observation:** For agent tasks, tool‑use capability and instruction following matter more than raw parameter count.

Personally, I suspect dense (full‑weight) models handle agent‑style toolchains better than MoE models, though I haven't verified this.

2. Quantization Affects Agent Quality

  • Comparing Qwen3‑8B Q8 vs Qwen3‑8B Q4: the Q4 variant exhibited tool‑call repetition loops, repeating df -h && free -h six times in U03.
  • This suggests that tool‑chaining stability is sensitive to quantization levels.

3. Speed vs. Accuracy Trade‑offs

| Model | Accuracy | Time | Notes |
| --- | --- | --- | --- |
| GPT‑oss‑20B | 95 % | 264 s | Fastest; clear winner |
| Qwen3.5‑27B | 95 % (tied) | 1,101 s | For when depth matters |
| Qwen3‑8B Q8 | 92 % | 377 s | Best performance per parameter; ideal for resource‑limited environments |

4. “Chain Completion” Is the Key Differentiator

  • Most models handle intermediate steps (searching, analyzing) well.
  • Real differentiation appears at the end of the chain – sending emails, saving files, registering automated tasks.
  • Qwen3.5‑35B‑A3B was notably weak at these final steps.

Conclusion

Choosing an LLM for a local AI agent requires evaluating not just benchmark scores, but tool‑chaining completion rate, instruction adherence, and response speed together.

  • 🏆 Best overall: GPT‑oss‑20B (speed + accuracy leader)
  • 💰 Best value: Qwen3‑8B Q8 (92 % with only 8 B parameters at 377 s)
  • 🔬 Deepest analysis: Qwen3.5‑27B (most PERFECT scores: 4)

Test code and full results are available at
