Multiple Independent Questions: Batch Into One Request or Split Into Many? — An Analysis of LLM Concurrent Processing
Source: Dev.to
Which Is Faster?
Short answer:
Splitting the questions into multiple independent parallel requests is almost always faster.
Why? – A First‑Principles Look at How LLMs “Write”
| Step | What Happens | Latency Impact |
|---|---|---|
| Autoregressive generation | The model produces one token at a time, appends it to the prompt, then generates the next token. | N tokens → N forward passes |
| Prefill | The whole input prompt is processed once to build the KV‑cache. | Linear in input length |
| Decode | Tokens are generated sequentially (one‑by‑one). | Dominates total latency |
Consequences
- A 100‑token answer ≈ 100 inference steps.
- A 500‑token answer ≈ 500 inference steps.
- Total output length directly determines total latency.
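To make this concrete, here is a minimal back-of-envelope latency model in Python. The per-token timings are made-up placeholders, not measurements from any particular model or GPU; the point is only the shape of the formula: prefill cost grows with the prompt, decode cost grows with the output.

```python
# Toy latency model: total ≈ prefill (cheap, parallel over prompt tokens)
#                          + decode (one sequential step per output token).
# The constants below are illustrative placeholders, not benchmarks.
PREFILL_MS_PER_TOKEN = 0.2
DECODE_MS_PER_TOKEN = 20.0

def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency estimate for a single request."""
    return prompt_tokens * PREFILL_MS_PER_TOKEN + output_tokens * DECODE_MS_PER_TOKEN

print(estimate_latency_ms(50, 100))  # ≈ 2,010 ms for a 100-token answer
print(estimate_latency_ms(50, 500))  # ≈ 10,010 ms for a 500-token answer
```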
Scenario: 5 Independent Questions, ~200 Tokens Each
Approach A – Combine All Questions into One Request
Please answer the following questions separately:
1. …
2. …
3. …
4. …
5. …
What the model must do
- Prefill: Process a long concatenated prompt (all 5 questions).
- Decode: Generate ≈ 5 × 200 = 1000 tokens sequentially.
- Extra overhead:
- Context switches (“now answering question 3”).
- Larger KV‑cache → more attention computation per step.
- Formatting/transition text often pushes the token count above 1000.
Estimated latency ≈ 1000 × (per‑token generation time).
Approach B – Send 5 Independent Requests in Parallel
Each request contains a single question and generates ≈ 200 tokens.
What the server does
- Prefill: Shorter prompt for each request.
- Decode: 5 separate decode streams run concurrently (or are batched together).
- Modern inference engines (vLLM, TensorRT‑LLM, TGI, etc.) use continuous batching: a single GPU forward pass can emit one token for each of the 5 requests at the same time.
Estimated latency ≈ max(individual request latency) ≈ 200 × (per‑token generation time).
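Plugging the scenario's numbers into the same toy model (placeholder timings again, plus an assumed ~40 prompt tokens per question) makes the gap explicit:

```python
PREFILL_MS_PER_TOKEN = 0.2   # same placeholder timings as the earlier sketch
DECODE_MS_PER_TOKEN = 20.0
QUESTION_TOKENS = 40         # assumed prompt length per question
ANSWER_TOKENS = 200          # ~200 output tokens per answer, as in the scenario

# Approach A: one request, five questions, ~1000 output tokens decoded sequentially.
latency_combined = (5 * QUESTION_TOKENS) * PREFILL_MS_PER_TOKEN \
                 + (5 * ANSWER_TOKENS) * DECODE_MS_PER_TOKEN

# Approach B: five requests in flight at once; latency ≈ the slowest single request.
latency_parallel = QUESTION_TOKENS * PREFILL_MS_PER_TOKEN \
                 + ANSWER_TOKENS * DECODE_MS_PER_TOKEN

print(f"Combined: {latency_combined / 1000:.1f}s")  # ≈ 20.0s
print(f"Parallel: {latency_parallel / 1000:.1f}s")  # ≈ 4.0s
```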
Direct Comparison
| Approach | Total Output Tokens | Approx. Latency (relative) |
|---|---|---|
| Combined request | ~1000+ | ~1000 decode steps (sequential) |
| 5 parallel requests | ~200 each | ~200 decode steps (parallel) |
Theoretical speed‑up: ~5× (equal to the number of questions).
Why Parallel Requests Are Faster on the Server Side
- Continuous Batching – GPUs excel at parallel matrix ops.
  - 5 short requests → a 5‑way batched forward pass, producing 5 tokens per step.
  - 1 long request → a single‑sequence pass, producing only 1 token per step.
- Higher GPU Utilization – Batching many short sequences keeps the GPU busy, whereas a single long sequence wastes parallel capacity.
- Prefill vs. Decode –
  - Combined: longer prefill + longer decode.
  - Split: shorter prefill for each request; all prefills can be pipelined or run concurrently, and each decode is short.
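A toy simulation of the decode loop illustrates the batching effect. Each loop iteration stands in for one GPU forward pass, under the simplifying assumption that a batched pass emits one token for every unfinished request at no extra per-step cost:

```python
# Simplified decode-loop simulation: one iteration = one GPU forward pass.
# Assumption: every unfinished request in the batch gets exactly one token per pass.

def simulate_decode_steps(output_lengths: list[int]) -> int:
    """Return the number of forward passes needed to finish all requests
    when they are decoded together as one continuous batch."""
    remaining = list(output_lengths)
    steps = 0
    while any(tokens_left > 0 for tokens_left in remaining):
        remaining = [max(tokens_left - 1, 0) for tokens_left in remaining]
        steps += 1
    return steps

print(simulate_decode_steps([1000]))     # combined request: 1000 steps
print(simulate_decode_steps([200] * 5))  # 5 parallel requests: 200 steps
```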
Quality Considerations (Beyond Speed)
| Issue | Combined Prompt | Split Requests |
|---|---|---|
| Attention dilution | Irrelevant context can lower answer quality (“Lost in the Middle”). | Full focus on a single question. |
| Formatting errors | Numbering/omission mistakes more likely. | Isolated output → cleaner formatting. |
| Error propagation | Mistake in Q2 can affect Q3‑Q5 (autoregressive inertia). | Errors stay confined to the offending request. |
When Combining Might Still Be Reasonable
| Situation | Reason |
|---|---|
| Hidden correlations | If questions are related (e.g., parts of the same report), a shared context can improve consistency. |
| Strict API rate limits | If you can only make 3 calls/min, you may need to bundle (or simply cap client‑side concurrency, as sketched after this table). |
| Network latency dominates | Very high round‑trip latency (e.g., > 2 s) could make 5 separate calls slower than one combined call. Modern APIs are usually 100‑300 ms, so this is rare. |
| Extremely short answers | When each answer is only a word or two, the prefill overhead dominates; a single request reduces redundant prefills. |
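When a rate limit is the only obstacle, an alternative to bundling is to keep the requests separate but cap how many are in flight at once. Here is a minimal sketch using asyncio.Semaphore; it assumes the same ask_single helper defined in the benchmark script below, and the concurrency cap is a hypothetical value:

```python
import asyncio

MAX_CONCURRENT = 3  # hypothetical cap imposed by the provider's rate limit

async def ask_with_limit(session, prompt, semaphore):
    async with semaphore:  # wait for a free slot before sending
        return await ask_single(session, prompt)  # ask_single: latency helper from the benchmark below

async def ask_all(session, questions):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [ask_with_limit(session, q, semaphore) for q in questions]
    return await asyncio.gather(*tasks)  # still concurrent, just throttled
```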
Quick Empirical Benchmark (Async Python)
import asyncio
import time

import aiohttp

API_URL = "https://api.your-llm.com/v1/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}


async def ask_single(session, prompt, max_tokens=300):
    """Send one completion request and return its wall-clock latency in seconds."""
    start = time.time()
    async with session.post(
        API_URL,
        json={"model": "gpt-4o-mini", "prompt": prompt, "max_tokens": max_tokens},
        headers=HEADERS,
    ) as resp:
        await resp.json()  # ignore the content, just wait for the full response
    return time.time() - start


async def benchmark():
    questions = [
        "Question 1: …",
        "Question 2: …",
        "Question 3: …",
        "Question 4: …",
        "Question 5: …",
    ]
    async with aiohttp.ClientSession() as session:
        # ---- Approach A: Combined -----------------------------------------
        # One prompt with all five questions; give it a larger max_tokens
        # budget so the combined answer is not cut off mid-way.
        combined_prompt = "Please answer each question separately:\n" + "\n".join(questions)
        t_combined = await ask_single(session, combined_prompt, max_tokens=1500)

        # ---- Approach B: Parallel -----------------------------------------
        # Five single-question requests in flight at once; total latency is
        # the slowest individual request.
        tasks = [ask_single(session, q) for q in questions]
        t_parallel = max(await asyncio.gather(*tasks))

    print(f"Combined request latency : {t_combined:.2f}s")
    print(f"Parallel requests latency: {t_parallel:.2f}s")
    print(f"Speed-up factor          : {t_combined / t_parallel:.2f}×")


if __name__ == "__main__":
    asyncio.run(benchmark())
Run the script a few times and you’ll typically see the parallel version ~4‑5× faster for independent, medium‑length answers.
TL;DR
- Speed: 5 parallel requests ≈ 5× faster than one combined request (assuming the service can batch them).
- Quality: Parallel requests keep the model’s attention focused and avoid cross‑question contamination.
- Exceptions: Only consider combining when questions are truly inter‑dependent, you’re throttled by strict rate limits, or network latency dwarfs generation time.
Bottom line: When the questions are unrelated, fire off separate, concurrent requests. 🚀
Parallel vs. Combined Requests
# Combined execution (single prompt containing all five questions)
start = time.time()
await ask_single(session, "Please answer each question separately:\n" + "\n".join(questions))
time_combined = time.time() - start

# Parallel execution (five single-question requests at once)
start = time.time()
await asyncio.gather(*[ask_single(session, q) for q in questions])
time_parallel = time.time() - start

print(f"Combined: {time_combined:.2f}s")
print(f"Parallel: {time_parallel:.2f}s")
print(f"Speedup: {time_combined / time_parallel:.1f}x")
In practice, 5 moderately complex independent questions typically achieve a 3–5× speedup with parallel requests.
Comparison
| Dimension | Combined request | Split parallel requests |
|---|---|---|
| Generation speed | Slow (sequential output of all answers) | Fast (parallel generation, latency = slowest) |
| GPU utilization | Low (single‑sequence inference) | High (batched parallel inference) |
| Answer quality | May degrade (attention dilution) | Better (isolated context) |
| API calls | 1 | N (one per question) |
| Best for | Rate‑limited / extremely short answers | Independent questions needing detailed answers |
Core principle (one sentence)
An LLM's autoregressive mechanism means output is generated sequentially; combining requests forces all answers into a single serial stream, whereas splitting them leverages server‑side parallelism to generate multiple outputs simultaneously: the classic trade‑off of spending more concurrent slots (space) to save time.