[Paper] Latency and Cost of Multi-Agent Intelligent Tutoring at Scale
Source: arXiv - 2604.24110v1
Overview
The paper evaluates ITAS, a multi‑agent tutoring platform that stitches together several specialized large language model (LLM) agents to answer student questions. By measuring latency and cost across three Google Vertex AI pricing tiers and up to 50 concurrent users, the authors show how different deployment choices affect performance—from a single‑lecture demo to a campus‑wide rollout.
Key Contributions
- Empirical latency benchmark for a four‑agent LLM tutoring system across three pricing tiers (Standard PayGo, Priority PayGo, Provisioned Throughput).
- Scalability analysis covering 11 concurrency levels (1–50 simultaneous users) using real‑world graduate‑level STEM queries.
- Cost‑performance trade‑off model that compares per‑token pay‑as‑you‑go pricing with reserved‑capacity pricing, expressed in terms of textbook‑equivalent cost per student per semester.
- Tier‑selection guidance that maps typical educational use‑cases (seminar, classroom, university) to the most economical and responsive pricing tier.
Methodology
- System under test – ITAS orchestrates four specialized agents (e.g., conceptual explanation, problem‑solving, code debugging, feedback) on top of Gemini 2.5 Flash via Google Vertex AI.
- Workload generation – Over 3,000 real queries harvested from a live graduate STEM course were replayed to the system.
- Throughput tiers – three Vertex AI service classes were compared:
  - Standard PayGo: baseline on‑demand, pay‑per‑token pricing with no priority queue.
  - Priority PayGo: the same pay‑per‑token model, but with a higher‑priority service class that reduces queuing delays.
  - Provisioned Throughput: a fixed number of “tokens per second” reserved for the tenant, billed regardless of actual usage.
- Concurrency sweep – simultaneous user sessions were ramped from 1 up to 50, measuring end‑to‑end response time, including the parallel “max‑latency” effect of waiting on multiple agents (a replay‑harness sketch follows this list).
- Cost accounting – Token consumption per request was logged, then multiplied by each tier’s per‑token price. Results were normalized to a semester‑long textbook cost for easy interpretation.
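A minimal replay‑harness sketch of this methodology, assuming a hypothetical async client: each query fans out to all four agents, end‑to‑end latency is the wall‑clock time of the slowest call, and per‑request token counts are rolled up into a PayGo cost. The agent call, its latency distribution, the token counts, and the per‑token rates are all illustrative stand‑ins, not the paper's harness or actual Vertex AI pricing.

```python
import asyncio
import random
import time

# Hypothetical per-token rates (USD per 1M tokens); actual Vertex AI pricing
# is not reproduced from the paper.
PRICE_PER_M_TOKENS = {"standard_paygo": 0.30, "priority_paygo": 0.60}
AGENTS = ["concept", "problem_solving", "debugging", "feedback"]

async def call_agent(agent: str, query: str) -> int:
    """Stand-in for one Vertex AI request; returns simulated token usage."""
    await asyncio.sleep(random.uniform(0.5, 3.0))  # simulated model latency
    return random.randint(200, 1200)               # simulated tokens consumed

async def answer(query: str) -> tuple[float, int]:
    """Fan a query out to all four agents; latency is set by the slowest."""
    start = time.perf_counter()
    tokens = await asyncio.gather(*(call_agent(a, query) for a in AGENTS))
    return time.perf_counter() - start, sum(tokens)

async def sweep(queries: list[str], concurrency: int) -> None:
    """Replay the query log with a bounded number of simultaneous sessions."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(q: str) -> tuple[float, int]:
        async with sem:
            return await answer(q)

    results = await asyncio.gather(*(bounded(q) for q in queries))
    latencies = sorted(lat for lat, _ in results)
    total_tokens = sum(tok for _, tok in results)
    cost = total_tokens / 1e6 * PRICE_PER_M_TOKENS["priority_paygo"]
    print(f"{concurrency:>3} users | median {latencies[len(latencies) // 2]:.2f} s "
          f"| {total_tokens:,} tokens | ${cost:.4f}")

queries = [f"query {i}" for i in range(100)]  # stand-in for the ~3,000 real queries
for users in (1, 5, 10, 20, 30, 50):
    asyncio.run(sweep(queries, users))
```

The textbook normalization is then a single division: per‑student token cost over a semester divided by a reference textbook price ($120 in the paper's comparison).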
Results & Findings
| Tier | Median latency (1–50 users) | Scaling behavior | Worst‑case cost per student per semester |
|---|---|---|---|
| Priority PayGo | < 4 s across all loads | Flat, virtually no degradation | ≈ $12 (≈ 1/10 of a $120 textbook) |
| Standard PayGo | ~2 s (1–5 users), rising above 10 s at ≥ 30 users | Significant slowdown once > 20 concurrent users | ≈ $15 (still below textbook cost) |
| Provisioned Throughput | 1.2 s (≤ 20 users); saturates and latency spikes beyond 20 users | Best at low concurrency, hits a hard ceiling at ~20 users | $20–$30 if continuously reserved; becomes cheaper than PayGo when traffic is bursty and predictable |
- Parallel‑phase max effect: because each query spawns four simultaneous API calls, the overall response time is dominated by the slowest agent. Priority PayGo’s reduced queuing mitigates this effect (a simulation sketch follows this list).
- Cost comparison: Even the most expensive provisioned scenario stays well below the cost of a single STEM textbook per semester, making LLM tutoring financially viable for most institutions.
- Tier‑selection matrix:
- Seminar / pilot: Provisioned Throughput (low concurrency, best latency).
- Classroom (20‑30 students): Priority PayGo (stable sub‑4 s).
- University‑wide (≥ 30 concurrent users): Priority PayGo remains the only tier that avoids severe latency spikes; Standard PayGo is only suitable for low‑traffic labs.
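The tail amplification behind this bottleneck is easy to reproduce with a small Monte Carlo sketch (illustrative numbers, not the paper's measurements): if one agent occasionally stalls in a queue, the maximum over four parallel agents stalls far more often, since every call must clear the queue for the response to be fast.

```python
import random

def agent_latency() -> float:
    """Hypothetical single-agent latency: 1-2 s typical, 5% chance of a 6 s queue stall."""
    return 6.0 if random.random() < 0.05 else random.uniform(1.0, 2.0)

N = 100_000
single = [agent_latency() for _ in range(N)]
fanout = [max(agent_latency() for _ in range(4)) for _ in range(N)]

def frac_over(xs: list[float], limit: float = 4.0) -> float:
    return sum(x > limit for x in xs) / len(xs)

print(f"P(latency > 4 s), one agent:   {frac_over(single):.3f}")  # ~0.05
print(f"P(latency > 4 s), max of four: {frac_over(fanout):.3f}")  # ~1 - 0.95**4 ≈ 0.19
```

This is consistent with why a lower‑queuing tier helps disproportionately in a fan‑out architecture.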
Practical Implications
- Deployers can pick a pricing tier based on expected concurrent load rather than defaulting to the cheapest pay‑as‑you‑go plan.
- Latency guarantees (sub‑4 s) are achievable at scale with Priority PayGo, which is critical for maintaining student engagement in live tutoring sessions.
- Budget planning: Institutions can budget tutoring services as a fraction of textbook costs, freeing up funds for other instructional resources.
- Predictable traffic patterns (e.g., scheduled office‑hours, exam‑prep weeks) can be matched with Provisioned Throughput to lock in lower effective per‑token rates, reducing overall spend (a break‑even sketch follows this list).
- Architecture insight: Multi‑agent designs must account for the “slowest‑agent” bottleneck; developers may consider dynamic agent selection or early‑exit strategies to trim response times (sketched below).
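On the Provisioned‑vs‑PayGo choice, the break‑even point falls out of simple arithmetic. The rates below are hypothetical placeholders (the paper's actual Vertex AI prices are not reproduced here); the point is the shape of the comparison: a reservation pays off only when its capacity stays highly utilized, which is exactly the predictable‑traffic case.

```python
# Hypothetical rates for illustration only; not the paper's Vertex AI pricing.
PAYGO_PER_M_TOKENS = 0.60      # USD per 1M tokens, pay-as-you-go
PROVISIONED_PER_HOUR = 1.00    # USD per hour for the reservation
RESERVED_TOKENS_PER_SEC = 500  # throughput the reservation guarantees

# Provisioned becomes cheaper once sustained usage exceeds this volume.
break_even_tokens_per_hour = PROVISIONED_PER_HOUR / PAYGO_PER_M_TOKENS * 1e6
capacity_tokens_per_hour = RESERVED_TOKENS_PER_SEC * 3600

utilization_needed = break_even_tokens_per_hour / capacity_tokens_per_hour
print(f"break-even volume:    {break_even_tokens_per_hour:,.0f} tokens/hour")
print(f"reservation capacity: {capacity_tokens_per_hour:,.0f} tokens/hour")
print(f"utilization needed:   {utilization_needed:.0%}")  # ~93% at these rates
```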
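And a minimal sketch of the early‑exit idea, assuming a hypothetical adequacy check `sufficient()` (a confidence score, rubric match, or router decision, none of which the paper specifies): return the first answer that is good enough and cancel the stragglers, rather than always paying for the slowest agent.

```python
import asyncio
import random

AGENTS = ["concept", "problem_solving", "debugging", "feedback"]

async def call_agent(agent: str, query: str) -> str:
    """Stand-in for one agent's LLM call."""
    await asyncio.sleep(random.uniform(0.5, 4.0))  # simulated model latency
    return f"{agent} answer"

def sufficient(answer: str) -> bool:
    """Hypothetical adequacy check; a real system might use a confidence score."""
    return answer.startswith(("concept", "problem_solving"))

async def answer_early_exit(query: str) -> str:
    tasks = {asyncio.create_task(call_agent(a, query)) for a in AGENTS}
    try:
        while tasks:
            done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if sufficient(task.result()):
                    return task.result()  # good enough: stop waiting on slower agents
        return "fallback: no agent produced a sufficient answer"
    finally:
        for task in tasks:
            task.cancel()  # cancel stragglers to save latency and tokens

print(asyncio.run(answer_early_exit("What is backpropagation?")))
```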
Limitations & Future Work
- Single LLM provider – Experiments are limited to Gemini 2.5 Flash; results may differ with other models or providers.
- Fixed agent count – The study uses four agents; scaling the number of specialized agents could exacerbate the max‑latency effect.
- Workload representativeness – Queries stem from a graduate STEM course; other domains (humanities, K‑12) may exhibit different token usage patterns.
- Cost model granularity – Real‑world contracts often include volume discounts or enterprise‑level SLA tiers not captured in the three tested plans.
Future research could explore adaptive agent orchestration (e.g., skipping unnecessary agents), cross‑provider cost‑latency trade‑offs, and long‑term field studies measuring learning outcomes alongside system performance.
Authors
- Iizalaarab Elhaimeur
- Nikos Chrisochoides
Paper Information
- arXiv ID: 2604.24110v1
- Categories: cs.CY, cs.AI, cs.DC, cs.LG
- Published: April 27, 2026