Benchmarking AI Agent Frameworks in 2026: AutoAgents (Rust) vs LangChain, LangGraph, LlamaIndex, PydanticAI, and more

Published: February 18, 2026 at 05:16 PM EST
6 min read
Source: Dev.to

Production‑Ready AI Agent Framework Benchmark

We built AutoAgents – a Rust‑native framework for tool‑using AI agents – and measured it against the established Python and Rust players under identical conditions.

📋 Overview

  • Task – ReAct‑style agent that receives a question, decides whether to call a tool, parses a Parquet file, computes the average trip duration, and returns a formatted answer.
  • Scope – Single‑step tool call (no long‑horizon multi‑agent workflow).
  • Model – gpt‑5.1 (used across all frameworks).
  • Requests – 50 total, with 10 concurrent (TPM‑rate limited).
  • Hardware – Identical machine for every run; no process‑affinity pinning.
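To make the task concrete, here is a minimal sketch of the tool's core computation. This is illustrative Python, not the benchmark code; the function name and answer format are our own, and the real tool also handles the Parquet parsing step.

```python
# Illustrative core of the benchmark's tool: average a list of trip
# durations (in the benchmark these come from a parsed Parquet file)
# and return the formatted answer the agent hands back.

def average_trip_duration(durations_min: list[float]) -> str:
    """Return the mean trip duration as a formatted answer string."""
    if not durations_min:
        return "No trips found."
    avg = sum(durations_min) / len(durations_min)
    return f"Average trip duration: {avg:.2f} minutes"

print(average_trip_duration([12.5, 8.0, 21.5]))  # toy data
```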

Measured Metrics

| Metric | Description |
|---|---|
| End‑to‑end latency | P50, P95, P99 (ms) |
| Throughput | Requests per second (rps) |
| Peak RSS memory | MB |
| CPU usage | % of a single core |
| Cold‑start time | ms (time to first request after process start) |
| Determinism rate | % of runs producing identical output |
| Success rate | % of successful completions (all frameworks 100 % except CrewAI, which was excluded after a 44 % failure rate) |
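For reference, P50/P95/P99 can be computed with a simple nearest-rank rule. The sketch below is our own; the post does not state which percentile method the harness uses.

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of all samples are less than or equal to it."""
    s = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100 * len(s)))  # 1-based rank
    return s[rank - 1]

latencies = [5200.0, 5400.0, 5700.0, 6100.0, 9800.0]  # toy samples
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```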

Benchmark code and raw JSON are in the repository.

📊 Results

Raw Numbers

| Framework | Language | Avg Latency | P95 Latency | Throughput | Peak Memory | CPU % | Cold‑Start | Score |
|---|---|---|---|---|---|---|---|---|
| AutoAgents | Rust | 5,714 ms | 9,652 ms | 4.97 rps | 1,046 MB | 29.2 % | 4 ms | 98.03 |
| Rig | Rust | 6,065 ms | 10,131 ms | 4.44 rps | 1,019 MB | 24.3 % | 4 ms | 90.06 |
| LangChain | Python | 6,046 ms | 10,209 ms | 4.26 rps | 5,706 MB | 64.0 % | 62 ms | 48.55 |
| PydanticAI | Python | 6,592 ms | 11,311 ms | 4.15 rps | 4,875 MB | 53.9 % | 56 ms | 48.95 |
| LlamaIndex | Python | 6,990 ms | 11,960 ms | 4.04 rps | 4,860 MB | 59.7 % | 54 ms | 43.66 |
| GraphBit | JS/TS | 8,425 ms | 14,388 ms | 3.14 rps | 4,718 MB | 44.6 % | 138 ms | 22.53 |
| LangGraph | Python | 10,155 ms | 16,891 ms | 2.70 rps | 5,570 MB | 39.7 % | 63 ms | 0.85 |

Composite score – weighted, min‑max normalized aggregate across all dimensions (latency 27.8 %, throughput 33.3 %, memory 22.2 %, CPU efficiency 16.7 %).


Memory Impact

| Framework | Peak Memory (MB) | Approx. RAM for 50 instances |
|---|---|---|
| AutoAgents | 1,046 | ~51 GB |
| Rig | 1,019 | ~50 GB |
| LangChain | 5,706 | ~279 GB |
| LangGraph | 5,570 | ~272 GB |
| PydanticAI | 4,875 | ~238 GB |
| LlamaIndex | 4,860 | ~237 GB |
| GraphBit | 4,718 | ~230 GB |

Python frameworks carry a baseline memory cost (interpreter, dependency tree, GC). Rust’s ownership model frees memory immediately, eliminating a persistent GC heap.
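The fleet-sizing column is plain arithmetic (peak RSS times instance count), which is easy to sanity-check:

```python
# Reproduce the "RAM for 50 instances" column from peak RSS (MB).
peak_mb = {"AutoAgents": 1046, "Rig": 1019, "LangChain": 5706}
INSTANCES = 50

ram_gb = {fw: round(mb * INSTANCES / 1024) for fw, mb in peak_mb.items()}
```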


Throughput & Latency

  • Throughput: AutoAgents 4.97 rps vs. an average of 3.66 rps across the five non‑Rust frameworks (+36 %).
  • Latency (P95): AutoAgents 9,652 ms vs. LangGraph 16,891 ms – the gap widens at the tail, exactly where reliability matters most.
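The +36 % figure follows directly from the raw-numbers table:

```python
autoagents_rps = 4.97
# The five non-Rust frameworks from the results table:
# LangChain, PydanticAI, LlamaIndex, GraphBit, LangGraph
other_rps = [4.26, 4.15, 4.04, 3.14, 2.70]

avg_rps = sum(other_rps) / len(other_rps)
gain_pct = (autoagents_rps / avg_rps - 1) * 100
```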

Cold‑Start

| Framework | Cold‑Start (ms) | Relative to AutoAgents |
|---|---|---|
| AutoAgents | 4 ms | baseline |
| LangChain | 62 ms | 15× slower |
| PydanticAI | 56 ms | 14× slower |
| LlamaIndex | 54 ms | 14× slower |
| GraphBit | 138 ms | 34× slower |
| LangGraph | 63 ms | 16× slower |

Near‑zero initialization shines in serverless or auto‑scaling environments.


CPU Utilization

| Framework | CPU % | Interpretation |
|---|---|---|
| Rig | 24.3 % | Most efficient (Rust) |
| AutoAgents | 29.2 % | Good efficiency |
| LangChain | 64.0 % | Highest CPU demand (Python) |

Higher CPU usage reduces headroom for traffic bursts.

🧮 Composite Score Formula

The composite score is calculated using min‑max normalization so that each dimension lies on a 0 – 1 scale (best = 1, worst = 0).

Formula

\[
\text{score} = 0.278\,\text{mmLow}(\text{latency}) + 0.222\,\text{mmLow}(\text{memory}) + 0.333\,\text{mmHigh}(\text{throughput}) + 0.167\,\text{mmHigh}(\text{cpu\_eff})
\]

where

\[
\begin{aligned}
\text{mmHigh}(v,\ \text{min},\ \text{max}) &= \frac{v - \text{min}}{\text{max} - \text{min}} \\
\text{mmLow}(v,\ \text{min},\ \text{max}) &= \frac{\text{max} - v}{\text{max} - \text{min}}
\end{aligned}
\]

Weight Breakdown

| Dimension | Weight | Normalization direction |
|---|---|---|
| Latency | 27.8 % | Low is better (mmLow) |
| Memory usage | 22.2 % | Low is better (mmLow) |
| Throughput | 33.3 % | High is better (mmHigh) |
| CPU efficiency | 16.7 % | High is better (mmHigh) |

Note: The weights reflect production priorities:

  • Throughput (capacity) – highest priority
  • Latency (user experience) – second priority
  • Memory (infrastructure cost) – third priority
  • CPU efficiency (burst headroom) – fourth priority
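Putting the formula to work on a subset of the raw numbers gives a feel for how the ranking emerges. One caveat: the post does not spell out the exact `cpu_eff` metric, so the sketch below assumes `cpu_eff = 100 - CPU%`; the resulting values will not match the published scores exactly, but the normalization and weighting are as described.

```python
# Composite score sketch on a subset of the raw-numbers table.
# Assumption (ours): cpu_eff = 100 - CPU%.

raw = {  # framework: (avg latency ms, throughput rps, peak memory MB, CPU %)
    "AutoAgents": (5714, 4.97, 1046, 29.2),
    "Rig":        (6065, 4.44, 1019, 24.3),
    "LangChain":  (6046, 4.26, 5706, 64.0),
    "LangGraph":  (10155, 2.70, 5570, 39.7),
}

def mm_high(v, lo, hi):
    """Min-max normalize where higher raw values are better."""
    return (v - lo) / (hi - lo)

def mm_low(v, lo, hi):
    """Min-max normalize where lower raw values are better."""
    return (hi - v) / (hi - lo)

lat = [r[0] for r in raw.values()]
thr = [r[1] for r in raw.values()]
mem = [r[2] for r in raw.values()]
eff = [100 - r[3] for r in raw.values()]  # assumed cpu_eff

scores = {
    fw: 100 * (0.278 * mm_low(l, min(lat), max(lat))
               + 0.222 * mm_low(m, min(mem), max(mem))
               + 0.333 * mm_high(t, min(thr), max(thr))
               + 0.167 * mm_high(100 - c, min(eff), max(eff)))
    for fw, (l, t, m, c) in raw.items()
}
```

Even under this assumed `cpu_eff`, the ordering comes out the same as in the results table: the Rust frameworks on top, LangGraph last.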

⚠️ Limitations & Scope

| Aspect | Note |
|---|---|
| Agent complexity | Only single‑step tool calls were measured. Multi‑step or long‑horizon planning may shift the balance. |
| Multi‑agent orchestration | Frameworks like LangGraph or CrewAI are optimized for complex orchestration, which we did not benchmark. |
| Answer quality | Determinism rate tracks output consistency, not correctness. |
| Streaming | All runs used blocking responses; streaming latency profiles differ. |
| Model | Benchmarks used gpt‑5.1 (similar to gpt‑4o‑mini). Different models will change the LLM‑dominated latency portion. |
| Hardware | Results are tied to the specific hardware used; absolute numbers will vary on other machines. |

📌 Takeaway

  • Memory is the biggest differentiator: Rust‑based AutoAgents uses ~5× less RAM than the average Python framework on the same workload.
  • Cold‑start latency is an order of magnitude lower for AutoAgents – a qualitative win for serverless or autoscaling deployments.
  • Throughput per instance is higher, meaning fewer instances are needed to serve a given load.
  • The overall composite score places AutoAgents clearly ahead of the Python ecosystem for this single‑tool benchmark.

If these gaps matter for your production use case, we welcome contributions that extend the benchmark suite (e.g., multi‑step agents, different LLMs, streaming, or additional languages).


Prepared by the AutoAgents team – February 2026

Benchmark Summary

| Metric | AutoAgents / Rig | Python frameworks (average) |
|---|---|---|
| Peak memory | ≤ 1.1 GB | ≥ 4.7 GB |
| Cold‑start latency | 10× lower | – |
| Throughput | Higher per instance | – |
| Composite score | Leading | – |

  • The memory advantage (≈ 5×) is structural; it cannot be eliminated by configuration tweaks.
  • Throughput and latency improvements are meaningful, though less dramatic for single‑agent tasks.

Ongoing Work

We’re extending the benchmark with:

  • More task types
  • Multi‑step workflows
  • Streaming measurements

Issues and pull requests are welcome.

⭐️ Give us a star on GitHub: https://github.com/liquidos-ai/AutoAgents

Thanks!
