Benchmarking AI Agent Frameworks in 2026: AutoAgents (Rust) vs LangChain, LangGraph, LlamaIndex, PydanticAI, and more
Source: Dev.to
Production‑Ready AI Agent Framework Benchmark
We built AutoAgents – a Rust‑native framework for tool‑using AI agents – and measured it against the established Python and Rust players under identical conditions.
📋 Overview
- Task – ReAct‑style agent that receives a question, decides whether to call a tool, parses a Parquet file, computes the average trip duration, and returns a formatted answer.
- Scope – Single‑step tool call (no long‑horizon multi‑agent workflow).
- Model – gpt‑5.1 (used across all frameworks).
- Requests – 50 total, with 10 concurrent (TPM rate‑limited).
- Hardware – Identical machine for every run; no process‑affinity pinning.
Measured Metrics
| Metric | Description |
|---|---|
| End‑to‑end latency | P50, P95, P99 (ms) |
| Throughput | Requests per second (rps) |
| Peak RSS memory | MB |
| CPU usage | % of a single core |
| Cold‑start time | ms (time to first request after process start) |
| Determinism rate | % of runs producing identical output |
| Success rate | % of successful completions (all frameworks 100 % except CrewAI, which was excluded after a 44 % failure rate) |
Benchmark code and raw JSON are in the repository.
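The latency percentiles reported below (P50/P95/P99) can be computed from raw per-request timings with a nearest-rank percentile. A minimal sketch — the sample list is illustrative, not the benchmark's raw data:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value such that at least
    p% of the samples are less than or equal to it."""
    ordered = sorted(samples)
    # ceil(n * p / 100), clamped to at least rank 1
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]

# Illustrative end-to-end latencies in ms (not the actual run data)
latencies_ms = [5100, 5400, 5600, 5714, 5900, 6200, 7100, 8300, 9500, 9652]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Note that with only 50 requests, P99 is effectively the slowest one or two samples, so tail percentiles carry high variance at this sample size.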
📊 Results
Raw Numbers
| Framework | Language | Avg Latency | P95 Latency | Throughput | Peak Memory | CPU % | Cold‑Start | Score |
|---|---|---|---|---|---|---|---|---|
| AutoAgents | Rust | 5,714 ms | 9,652 ms | 4.97 rps | 1,046 MB | 29.2 % | 4 ms | 98.03 |
| Rig | Rust | 6,065 ms | 10,131 ms | 4.44 rps | 1,019 MB | 24.3 % | 4 ms | 90.06 |
| LangChain | Python | 6,046 ms | 10,209 ms | 4.26 rps | 5,706 MB | 64.0 % | 62 ms | 48.55 |
| PydanticAI | Python | 6,592 ms | 11,311 ms | 4.15 rps | 4,875 MB | 53.9 % | 56 ms | 48.95 |
| LlamaIndex | Python | 6,990 ms | 11,960 ms | 4.04 rps | 4,860 MB | 59.7 % | 54 ms | 43.66 |
| GraphBit | JS/TS | 8,425 ms | 14,388 ms | 3.14 rps | 4,718 MB | 44.6 % | 138 ms | 22.53 |
| LangGraph | Python | 10,155 ms | 16,891 ms | 2.70 rps | 5,570 MB | 39.7 % | 63 ms | 0.85 |
Composite score – weighted, min‑max normalized aggregate across all dimensions (latency 27.8 %, throughput 33.3 %, memory 22.2 %, CPU efficiency 16.7 %).
Memory Impact
| Framework | Peak Memory (MB) | Approx. RAM for 50 instances |
|---|---|---|
| AutoAgents | 1,046 | ~51 GB |
| Rig | 1,019 | ~50 GB |
| LangChain | 5,706 | ~279 GB |
| LangGraph | 5,570 | ~272 GB |
| PydanticAI | 4,875 | ~238 GB |
| LlamaIndex | 4,860 | ~237 GB |
| GraphBit | 4,718 | ~230 GB |
Python frameworks carry a baseline memory cost (interpreter, dependency tree, GC). Rust frees memory deterministically as values go out of scope, so there is no persistent GC heap to carry between requests.
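The "RAM for 50 instances" column is straightforward scaling arithmetic — peak RSS per instance times the instance count, converted to GB (1 GB = 1024 MB):

```python
# Peak RSS per framework instance, in MB (from the table above)
peak_mb = {
    "AutoAgents": 1046,
    "Rig": 1019,
    "LangChain": 5706,
    "LangGraph": 5570,
    "PydanticAI": 4875,
    "LlamaIndex": 4860,
    "GraphBit": 4718,
}

# Approximate RAM needed to run 50 instances, rounded to whole GB
fleet_gb = {name: round(mb * 50 / 1024) for name, mb in peak_mb.items()}
```

This is a linear extrapolation; real deployments may share some memory between processes (e.g. copy-on-write pages after fork), so treat the column as an upper-bound estimate.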
Throughput & Latency
- Throughput: AutoAgents 4.97 rps vs. an average of 3.66 rps across the non‑Rust frameworks (+36 %).
- Latency (P95): AutoAgents 9,652 ms vs. LangGraph 16,891 ms – the gap is widest at the tail, which is precisely where reliability targets are set.
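The +36 % figure can be reproduced from the per-framework numbers in the results table. Note that the 3.66 rps average only works out if GraphBit (JS/TS) is included alongside the four Python frameworks:

```python
autoagents_rps = 4.97

# Throughput of the non-Rust frameworks (LangChain, PydanticAI,
# LlamaIndex, GraphBit, LangGraph), from the results table
other_rps = [4.26, 4.15, 4.04, 3.14, 2.70]

avg_other = sum(other_rps) / len(other_rps)        # ~3.66 rps
gain_pct = (autoagents_rps / avg_other - 1) * 100  # ~+36 %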
Cold‑Start
| Framework | Cold‑Start (ms) | Relative to AutoAgents |
|---|---|---|
| AutoAgents | 4 ms | 1× |
| LangChain | 62 ms | 15× slower |
| PydanticAI | 56 ms | 14× slower |
| LlamaIndex | 54 ms | 14× slower |
| GraphBit | 138 ms | 34× slower |
| LangGraph | 63 ms | 16× slower |
Near‑zero initialization shines in serverless or auto‑scaling environments.
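The "Relative to AutoAgents" column is each framework's cold start divided by AutoAgents' 4 ms; the table rounds the result to a whole multiple. The unrounded ratios:

```python
# Cold-start times in ms, from the table above
cold_start_ms = {
    "AutoAgents": 4,
    "LangChain": 62,
    "PydanticAI": 56,
    "LlamaIndex": 54,
    "GraphBit": 138,
    "LangGraph": 63,
}

baseline = cold_start_ms["AutoAgents"]
# Ratio of each framework's cold start to the AutoAgents baseline
relative = {name: ms / baseline for name, ms in cold_start_ms.items()}
```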
CPU Utilization
| Framework | CPU % | Interpretation |
|---|---|---|
| Rig | 24.3 % | Most efficient (Rust) |
| AutoAgents | 29.2 % | Good efficiency |
| LangChain | 64.0 % | Highest CPU demand (Python) |
Higher CPU usage reduces headroom for traffic bursts.
🧮 Composite Score Formula
The composite score is calculated using min‑max normalization so that each dimension lies on a 0 – 1 scale (best = 1, worst = 0).
Formula
[ \text{score} = 0.278\,\text{mmLow}(\text{latency}) + 0.222\,\text{mmLow}(\text{memory}) + 0.333\,\text{mmHigh}(\text{throughput}) + 0.167\,\text{mmHigh}(\text{cpu\_eff}) ]
where
[ \begin{aligned} \text{mmHigh}(v, \text{min}, \text{max}) &= \frac{v-\text{min}}{\text{max}-\text{min}} \\ \text{mmLow}(v, \text{min}, \text{max}) &= \frac{\text{max}-v}{\text{max}-\text{min}} \end{aligned} ]
Weight Breakdown
| Dimension | Weight | Normalisation direction |
|---|---|---|
| Latency | 27.8 % | Low is better (mmLow) |
| Memory usage | 22.2 % | Low is better (mmLow) |
| Throughput | 33.3 % | High is better (mmHigh) |
| CPU efficiency | 16.7 % | High is better (mmHigh) |
Note: The weights reflect production priorities:
- Throughput (capacity) – highest priority
- Latency (user experience) – second priority
- Memory (infrastructure cost) – third priority
- CPU efficiency (burst headroom) – fourth priority
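The scoring scheme can be sketched in code from the results table. One assumption here: the write-up does not define how "CPU efficiency" is derived from CPU %, so this sketch treats lower CPU % as better via `mm_low` on the raw percentage. With that reading, the script lands close to the published scores for the top-ranked frameworks, though the lower-ranked ones diverge by a few points (the published run may have used slightly different inputs), so this is an approximation rather than an exact reconstruction:

```python
# Weighted min-max composite score, following the formula above.
# Assumption: CPU efficiency = mm_low over raw CPU %, since the exact
# transform is not specified in the write-up.

WEIGHTS = {"latency": 0.278, "memory": 0.222, "throughput": 0.333, "cpu": 0.167}

# (avg latency ms, peak memory MB, throughput rps, CPU %)
data = {
    "AutoAgents": (5714, 1046, 4.97, 29.2),
    "Rig":        (6065, 1019, 4.44, 24.3),
    "LangChain":  (6046, 5706, 4.26, 64.0),
    "PydanticAI": (6592, 4875, 4.15, 53.9),
    "LlamaIndex": (6990, 4860, 4.04, 59.7),
    "GraphBit":   (8425, 4718, 3.14, 44.6),
    "LangGraph":  (10155, 5570, 2.70, 39.7),
}

def mm_high(v, lo, hi):
    """Min-max normalize where higher raw values are better."""
    return (v - lo) / (hi - lo)

def mm_low(v, lo, hi):
    """Min-max normalize where lower raw values are better."""
    return (hi - v) / (hi - lo)

# Per-dimension (min, max) bounds across all frameworks
columns = list(zip(*data.values()))
bounds = [(min(c), max(c)) for c in columns]

def score(name):
    lat, mem, rps, cpu = data[name]
    return 100 * (
        WEIGHTS["latency"] * mm_low(lat, *bounds[0])
        + WEIGHTS["memory"] * mm_low(mem, *bounds[1])
        + WEIGHTS["throughput"] * mm_high(rps, *bounds[2])
        + WEIGHTS["cpu"] * mm_low(cpu, *bounds[3])
    )
```

Because min-max normalization pins the best and worst framework per dimension to 1 and 0, a single outlier (e.g. LangGraph's latency) stretches the scale for everyone else — worth keeping in mind when comparing composite scores across benchmark runs.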
⚠️ Limitations & Scope
| Aspect | Note |
|---|---|
| Agent complexity | Only single‑step tool calls were measured. Multi‑step or long‑horizon planning may shift the balance. |
| Multi‑agent orchestration | Frameworks like LangGraph or CrewAI are optimized for complex orchestration, which we did not benchmark. |
| Answer quality | Determinism rate tracks output consistency, not correctness. |
| Streaming | All runs used blocking responses; streaming latency profiles differ. |
| Model | Benchmarks used gpt‑5.1 (similar to gpt‑4o‑mini). Different models will change the LLM‑dominated latency portion. |
| Hardware | Results are tied to the specific hardware used; absolute numbers will vary on other machines. |
📌 Takeaway
- Memory is the biggest differentiator: Rust‑based AutoAgents uses ~5× less RAM than the average Python framework on the same workload.
- Cold‑start latency is an order of magnitude lower for AutoAgents – a qualitative win for serverless or autoscaling deployments.
- Throughput per instance is higher, meaning fewer instances are needed to serve a given load.
- The overall composite score places AutoAgents clearly ahead of the Python ecosystem for this single‑tool benchmark.
If these gaps matter for your production use case, we welcome contributions that extend the benchmark suite (e.g., multi‑step agents, different LLMs, streaming, or additional languages).
Prepared by the AutoAgents team – February 2026
Benchmark Summary
| Metric | AutoAgents / Rig | Python frameworks (average) |
|---|---|---|
| Peak memory | ≤ 1.1 GB | ≥ 4.7 GB |
| Cold‑start latency | 14–34× lower | — |
| Throughput | Higher per instance | — |
| Composite score | Leading | — |
- The memory advantage (≈ 5×) is structural; it cannot be eliminated by configuration tweaks.
- Throughput and latency improvements are meaningful, though less dramatic for single‑agent tasks.
Ongoing Work
We’re extending the benchmark with:
- More task types
- Multi‑step workflows
- Streaming measurements
Issues and pull requests are welcome.
⭐️ Give us a star on GitHub: https://github.com/liquidos-ai/AutoAgents
Thanks!