Multiple Independent Questions: Batch Into One Request or Split Into Many? — An Analysis of LLM Concurrent Processing
Source: Dev.to
Which Is Faster?
Short answer:
Splitting the questions into multiple independent parallel requests is almost always faster.
Why? – A First‑Principles Look at How LLMs “Write”
| Step | What Happens | Latency Impact |
|---|---|---|
| Autoregressive generation | The model produces one token at a time, appends it to the prompt, then generates the next token. | N tokens → N forward passes |
| Prefill | The whole input prompt is processed once to build the KV‑cache. | Linear in input length |
| Decode | Tokens are generated sequentially (one‑by‑one). | Dominates total latency |
Consequences
- A 100‑token answer ≈ 100 inference steps.
- A 500‑token answer ≈ 500 inference steps.
- Total output length directly determines total latency.
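To make this concrete, here is a minimal back-of-envelope latency model in Python. The per-token timings are made-up placeholders, not measurements from any particular model or GPU; the point is only the shape of the formula: prefill cost grows with the prompt, decode cost grows with the output.

```python
# Toy latency model: total ≈ prefill (cheap, parallel over prompt tokens)
#                          + decode (one sequential step per output token).
# The constants below are illustrative placeholders, not benchmarks.
PREFILL_MS_PER_TOKEN = 0.2
DECODE_MS_PER_TOKEN = 20.0

def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency estimate for a single request."""
    return prompt_tokens * PREFILL_MS_PER_TOKEN + output_tokens * DECODE_MS_PER_TOKEN

print(estimate_latency_ms(50, 100))  # ≈ 2,010 ms for a 100-token answer
print(estimate_latency_ms(50, 500))  # ≈ 10,010 ms for a 500-token answer
```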
Scenario: 5 Independent Questions, ~200 Tokens Each
Approach A – Combine All Questions into One Request
Please answer the following questions separately:
1. …
2. …
3. …
4. …
5. …
What the model must do
- Prefill: Process a long concatenated prompt (all 5 questions).
- Decode: Generate ≈ 5 × 200 = 1000 tokens sequentially.
- Extra overhead:
- Context switches (“now answering question 3”).
- Larger KV‑cache → more attention computation per step.
- Formatting/transition text often pushes the token count above 1000.
Estimated latency ≈ 1000 × (per‑token generation time).
Approach B – Send 5 Independent Requests in Parallel
Each request contains a single question and generates ≈ 200 tokens.
What the server does
- Prefill: Shorter prompt for each request.
- Decode: 5 separate decode streams run concurrently (or are batched together).
- Modern inference engines (vLLM, TensorRT‑LLM, TGI, etc.) use continuous batching: a single GPU forward pass can emit one token for each of the 5 requests at the same time.
Estimated latency ≈ max(individual request latency) ≈ 200 × (per‑token generation time).
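Plugging the scenario's numbers into the same toy model (placeholder timings again, plus an assumed ~40 prompt tokens per question) makes the gap explicit:

```python
PREFILL_MS_PER_TOKEN = 0.2   # same placeholder timings as the earlier sketch
DECODE_MS_PER_TOKEN = 20.0
QUESTION_TOKENS = 40         # assumed prompt length per question
ANSWER_TOKENS = 200          # ~200 output tokens per answer, as in the scenario

# Approach A: one request, five questions, ~1000 output tokens decoded sequentially.
latency_combined = (5 * QUESTION_TOKENS) * PREFILL_MS_PER_TOKEN \
                 + (5 * ANSWER_TOKENS) * DECODE_MS_PER_TOKEN

# Approach B: five requests in flight at once; latency ≈ the slowest single request.
latency_parallel = QUESTION_TOKENS * PREFILL_MS_PER_TOKEN \
                 + ANSWER_TOKENS * DECODE_MS_PER_TOKEN

print(f"Combined: {latency_combined / 1000:.1f}s")  # ≈ 20.0s
print(f"Parallel: {latency_parallel / 1000:.1f}s")  # ≈ 4.0s
```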
Direct Comparison
| Approach | Total Output Tokens | Approx. Latency (relative) |
|---|---|---|
| Combined request | ~1000+ | ~1000 decode steps (sequential) |
| 5 parallel requests | ~200 each | ~200 decode steps (parallel) |
Theoretical speed‑up: ~5× (equal to the number of questions).
Why Parallel Requests Are Faster on the Server Side
- Continuous Batching – GPUs excel at parallel matrix ops.
  - 5 short requests → a 5‑way batched forward pass, producing 5 tokens per step.
  - 1 long request → a single‑sequence pass, producing only 1 token per step.
- Higher GPU Utilization – Batching many short sequences keeps the GPU busy, whereas a single long sequence wastes parallel capacity.
- Prefill vs. Decode –
  - Combined: longer prefill + longer decode.
  - Split: shorter prefill for each request; all prefills can be pipelined or run concurrently, and each decode is short.
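A toy simulation of the decode loop illustrates the batching effect. Each loop iteration stands in for one GPU forward pass, under the simplifying assumption that a batched pass emits one token for every unfinished request at no extra per-step cost:

```python
# Simplified decode-loop simulation: one iteration = one GPU forward pass.
# Assumption: every unfinished request in the batch gets exactly one token per pass.

def simulate_decode_steps(output_lengths: list[int]) -> int:
    """Return the number of forward passes needed to finish all requests
    when they are decoded together as one continuous batch."""
    remaining = list(output_lengths)
    steps = 0
    while any(tokens_left > 0 for tokens_left in remaining):
        remaining = [max(tokens_left - 1, 0) for tokens_left in remaining]
        steps += 1
    return steps

print(simulate_decode_steps([1000]))     # combined request: 1000 steps
print(simulate_decode_steps([200] * 5))  # 5 parallel requests: 200 steps
```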
Quality Considerations (Beyond Speed)
| Issue | Combined Prompt | Split Requests |
|---|---|---|
| Attention dilution | Irrelevant context can lower answer quality (“Lost in the Middle”). | Full focus on a single question. |
| Formatting errors | Numbering/omission mistakes more likely. | Isolated output → cleaner formatting. |
| Error propagation | Mistake in Q2 can affect Q3‑Q5 (autoregressive inertia). | Errors stay confined to the offending request. |
When Combining Might Still Be Reasonable
| Situation | Reason |
|---|---|
| Hidden correlations | If questions are related (e.g., parts of the same report), a shared context can improve consistency. |
| Strict API rate limits | If you can only make 3 calls/min, you may need to bundle (or simply cap client‑side concurrency, as sketched after this table). |
| Network latency dominates | Very high round‑trip latency (e.g., > 2 s) could make 5 separate calls slower than one combined call. Modern APIs are usually 100‑300 ms, so this is rare. |
| Extremely short answers | When each answer is only a word or two, the prefill overhead dominates; a single request reduces redundant prefills. |
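When a rate limit is the only obstacle, an alternative to bundling is to keep the requests separate but cap how many are in flight at once. Here is a minimal sketch using asyncio.Semaphore; it assumes the same ask_single helper defined in the benchmark script below, and the concurrency cap is a hypothetical value:

```python
import asyncio

MAX_CONCURRENT = 3  # hypothetical cap imposed by the provider's rate limit

async def ask_with_limit(session, prompt, semaphore):
    async with semaphore:  # wait for a free slot before sending
        return await ask_single(session, prompt)  # ask_single: latency helper from the benchmark below

async def ask_all(session, questions):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [ask_with_limit(session, q, semaphore) for q in questions]
    return await asyncio.gather(*tasks)  # still concurrent, just throttled
```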
Quick Empirical Benchmark (Async Python)
import asyncio
import time

import aiohttp

API_URL = "https://api.your-llm.com/v1/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}


async def ask_single(session, prompt, max_tokens=300):
    """Send one completion request and return its wall-clock latency in seconds."""
    start = time.time()
    async with session.post(
        API_URL,
        json={"model": "gpt-4o-mini", "prompt": prompt, "max_tokens": max_tokens},
        headers=HEADERS,
    ) as resp:
        await resp.json()  # ignore the content, just wait for the full response
    return time.time() - start


async def benchmark():
    questions = [
        "Question 1: …",
        "Question 2: …",
        "Question 3: …",
        "Question 4: …",
        "Question 5: …",
    ]
    async with aiohttp.ClientSession() as session:
        # ---- Approach A: Combined -----------------------------------------
        # One prompt with all five questions; give it a larger max_tokens
        # budget so the combined answer is not cut off mid-way.
        combined_prompt = "Please answer each question separately:\n" + "\n".join(questions)
        t_combined = await ask_single(session, combined_prompt, max_tokens=1500)

        # ---- Approach B: Parallel -----------------------------------------
        # Five single-question requests in flight at once; total latency is
        # the slowest individual request.
        tasks = [ask_single(session, q) for q in questions]
        t_parallel = max(await asyncio.gather(*tasks))

    print(f"Combined request latency : {t_combined:.2f}s")
    print(f"Parallel requests latency: {t_parallel:.2f}s")
    print(f"Speed-up factor          : {t_combined / t_parallel:.2f}×")


if __name__ == "__main__":
    asyncio.run(benchmark())
Run the script a few times and you’ll typically see the parallel version ~4‑5× faster for independent, medium‑length answers.
TL;DR
- Speed: 5 parallel requests ≈ 5× faster than one combined request (assuming the service can batch them).
- Quality: Parallel requests keep the model’s attention focused and avoid cross‑question contamination.
- Exceptions: Only consider combining when questions are truly inter‑dependent, you’re throttled by strict rate limits, or network latency dwarfs generation time.
Bottom line: When the questions are unrelated, fire off separate, concurrent requests. 🚀
Parallel vs. Combined Requests
# Combined execution (single prompt containing all five questions)
start = time.time()
await ask_single(session, "Please answer each question separately:\n" + "\n".join(questions))
time_combined = time.time() - start

# Parallel execution (five single-question requests at once)
start = time.time()
await asyncio.gather(*[ask_single(session, q) for q in questions])
time_parallel = time.time() - start

print(f"Combined: {time_combined:.2f}s")
print(f"Parallel: {time_parallel:.2f}s")
print(f"Speedup: {time_combined / time_parallel:.1f}x")
In practice, 5 moderately complex independent questions typically achieve a 3–5× speedup with parallel requests.
Comparison
| Dimension | Combined request | Split parallel requests |
|---|---|---|
| Generation speed | Slow (sequential output of all answers) | Fast (parallel generation, latency = slowest) |
| GPU utilization | Low (single‑sequence inference) | High (batched parallel inference) |
| Answer quality | May degrade (attention dilution) | Better (isolated context) |
| API calls | 1 | N (one per question) |
| Best for | Rate‑limited / extremely short answers | Independent questions needing detailed answers |
Core principle (one sentence)
An LLM's autoregressive mechanism means output is generated sequentially; combining requests forces all answers into a single serial stream, whereas splitting them leverages server‑side parallelism to generate multiple outputs simultaneously: the classic trade‑off of spending more concurrent slots (space) to save time.