Multiple Independent Questions: Batch Into One Request or Split Into Many? — An Analysis of LLM Concurrent Processing

Published: May 2, 2026 at 08:19 PM EDT
6 min read
Source: Dev.to

Which Is Faster?

Short answer:
Splitting the questions into multiple independent parallel requests is almost always faster.

Why? – A First‑Principles Look at How LLMs “Write”

Step | What Happens | Latency Impact
Autoregressive generation | The model produces one token at a time, appends it to the prompt, then generates the next token. | N tokens → N forward passes
Prefill | The whole input prompt is processed once to build the KV-cache. | Linear in input length
Decode | Tokens are generated sequentially, one by one. | Dominates total latency

Consequences

  • A 100‑token answer ≈ 100 inference steps.
  • A 500‑token answer ≈ 500 inference steps.
  • Total output length directly determines total latency.
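
To make these numbers concrete, here is a minimal back-of-envelope latency model. The prefill throughput and the ~30 ms-per-token decode time are illustrative assumptions, not measurements of any particular model or provider:

def estimate_latency_s(prompt_tokens, output_tokens,
                       prefill_tokens_per_s=2_000,   # assumed prefill throughput
                       decode_s_per_token=0.03):     # assumed ~30 ms per generated token
    # Prefill cost grows with the prompt; decode cost grows with the answer.
    return prompt_tokens / prefill_tokens_per_s + output_tokens * decode_s_per_token

# Same 50-token prompt, different answer lengths:
print(estimate_latency_s(50, 100))   # ~3.0 s  (100 decode steps dominate)
print(estimate_latency_s(50, 500))   # ~15.0 s (500 decode steps dominate)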

Scenario: 5 Independent Questions, ~200 Tokens Each

Approach A – Combine All Questions into One Request

Please answer the following questions separately:
1. …
2. …
3. …
4. …
5. …

What the model must do

  • Prefill: Process a long concatenated prompt (all 5 questions).
  • Decode: Generate ≈ 5 × 200 = 1000 tokens sequentially.
  • Extra overhead:
    • Context switches (“now answering question 3”).
    • Larger KV‑cache → more attention computation per step.
    • Formatting/transition text often pushes the token count above 1000.

Estimated latency ≈ 1000 × (per‑token generation time).

Approach B – Send 5 Independent Requests in Parallel

Each request contains a single question and generates ≈ 200 tokens.

What the server does

  • Prefill: Shorter prompt for each request.
  • Decode: 5 separate decode streams run concurrently (or are batched together).
  • Modern inference engines (vLLM, TensorRT‑LLM, TGI, etc.) use continuous batching: a single GPU forward pass can emit one token for each of the 5 requests at the same time.

Estimated latency ≈ max(individual request latency) ≈ 200 × (per‑token generation time).
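
If you host an open model yourself, the continuous batching described above is the engine's default behavior. A minimal vLLM sketch, where the model name and token budget are assumptions and all five prompts are assumed to fit in one batch:

from vllm import LLM, SamplingParams  # assumes a local vLLM install and a GPU

questions = [f"Question {i}: ..." for i in range(1, 6)]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # model choice is an assumption
params = SamplingParams(max_tokens=200, temperature=0.7)

# One call, five prompts: the engine batches the decode steps, so each GPU
# forward pass can emit one token for every unfinished request at once.
outputs = llm.generate(questions, params)
for out in outputs:
    print(out.outputs[0].text[:80])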

Direct Comparison

Approach | Total Output Tokens | Approx. Latency (relative)
Combined request | ~1000+ | ~1000 decode steps (sequential)
5 parallel requests | ~200 each | ~200 decode steps (parallel)

Theoretical speed‑up: ~5× (equal to the number of questions).
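
Plugging an assumed 30 ms-per-token decode time into the table gives a feel for the absolute numbers (the per-token time is illustrative; real values vary by model and hardware):

per_token_s = 0.03              # assumed decode time per token (illustrative)
combined = 1000 * per_token_s   # ≈ 30 s: all five answers in one serial stream
parallel = 200 * per_token_s    # ≈ 6 s: wall time ≈ the longest single answer
print(f"{combined:.0f}s vs {parallel:.0f}s -> {combined / parallel:.0f}x")   # 30s vs 6s -> 5x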

Why Parallel Requests Are Faster on the Server Side

  1. Continuous Batching – GPUs excel at parallel matrix ops.

    • 5 short requests → a 5‑way batched forward pass, producing 5 tokens per step.
    • 1 long request → a single‑sequence pass, producing only 1 token per step.
  2. Higher GPU Utilization – Batching many short sequences keeps the GPU busy, whereas a single long sequence wastes parallel capacity (see the toy illustration after this list).

  3. Prefill vs. Decode

    • Combined: Longer prefill + longer decode.
    • Split: Shorter prefill for each request; all prefills can be pipelined or run concurrently, and each decode is short.
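
A toy NumPy illustration of points 1 and 2, with a single weight matrix standing in for a transformer layer (not a real model): batching five decode streams advances five tokens for roughly the cost of one matrix multiply.

import numpy as np

hidden = 4096
weight = np.random.randn(hidden, hidden).astype(np.float32)  # stand-in for one layer's weights

# Decode step for a single long sequence: 1 row in, activations for 1 new token out.
single = np.random.randn(1, hidden).astype(np.float32) @ weight

# Decode step for 5 short sequences batched together: 5 rows in, 5 tokens advanced,
# using the same weights in one pass.
batched = np.random.randn(5, hidden).astype(np.float32) @ weight

print(single.shape)   # (1, 4096) -> 1 token advanced this step
print(batched.shape)  # (5, 4096) -> 5 tokens advanced this step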

Quality Considerations (Beyond Speed)

Issue | Combined Prompt | Split Requests
Attention dilution | Irrelevant context can lower answer quality ("Lost in the Middle"). | Full focus on a single question.
Formatting errors | Numbering/omission mistakes are more likely (see the sketch after this table). | Isolated output → cleaner formatting.
Error propagation | A mistake in Q2 can affect Q3–Q5 (autoregressive inertia). | Errors stay confined to the offending request.
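
The formatting-errors row has a practical cost: a combined response needs post-processing like the hypothetical helper below, which breaks as soon as the model skips, merges, or renumbers an item, whereas split requests give you one clean answer per response.

import re

def split_numbered_answers(text, expected=5):
    # Hypothetical helper: split a combined response on leading "1.", "2.", ... markers.
    parts = re.split(r"\n(?=\d+\.\s)", text.strip())
    answers = [re.sub(r"^\d+\.\s*", "", p).strip() for p in parts if p.strip()]
    if len(answers) != expected:
        # Brittle by design: any renumbering or omission by the model lands here.
        raise ValueError(f"expected {expected} answers, got {len(answers)}")
    return answers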

When Combining Might Still Be Reasonable

Situation | Reason
Hidden correlations | If the questions are related (e.g., parts of the same report), a shared context can improve consistency.
Strict API rate limits | If you can only make 3 calls/min, you may need to bundle; if the limit is on concurrent requests rather than throughput, see the sketch after this table.
Network latency dominates | Very high round-trip latency (e.g., > 2 s) could make 5 separate calls slower than one combined call. Modern APIs usually respond in 100–300 ms, so this is rare.
Extremely short answers | When each answer is only a word or two, prefill overhead dominates; a single request avoids redundant prefills.
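
When the limit is on concurrent requests rather than total throughput, you can usually keep the questions split and simply cap how many are in flight at once. A minimal sketch with asyncio.Semaphore; the cap of 3 is an illustrative assumption, and ask_single comes from the benchmark below.

import asyncio

async def ask_all_with_cap(session, questions, max_in_flight=3):
    # Issue one request per question, but never more than max_in_flight at once.
    sem = asyncio.Semaphore(max_in_flight)

    async def one(question):
        async with sem:
            return await ask_single(session, question)   # helper from the benchmark below

    return await asyncio.gather(*(one(q) for q in questions))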

Quick Empirical Benchmark (Async Python)

import asyncio
import time
import aiohttp

API_URL = "https://api.your-llm.com/v1/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

async def ask_single(session, prompt, max_tokens=300):
    start = time.time()
    async with session.post(
        API_URL,
        json={"model": "gpt-4o-mini", "prompt": prompt, "max_tokens": max_tokens},
        headers=HEADERS,
    ) as resp:
        await resp.json()          # ignore the content, just wait for the full response
    return time.time() - start

async def benchmark():
    questions = [
        "Question 1: …",
        "Question 2: …",
        "Question 3: …",
        "Question 4: …",
        "Question 5: …",
    ]

    async with aiohttp.ClientSession() as session:
        # ---- Approach A: Combined -------------------------------------------------
        combined_prompt = "Please answer each question separately:\n" + "\n".join(questions)
        t_combined = await ask_single(session, combined_prompt, max_tokens=1500)  # room for all 5 answers

        # ---- Approach B: Parallel -------------------------------------------------
        tasks = [ask_single(session, q) for q in questions]
        t_parallel = max(await asyncio.gather(*tasks))

        print(f"Combined request latency : {t_combined:.2f}s")
        print(f"Parallel requests latency: {t_parallel:.2f}s")
        print(f"Speed‑up factor          : {t_combined / t_parallel:.2f}×")

if __name__ == "__main__":
    asyncio.run(benchmark())

Run the script a few times and you’ll typically see the parallel version ~4‑5× faster for independent, medium‑length answers.

TL;DR

  • Speed: 5 parallel requests ≈ 5× faster than one combined request (assuming the service can batch them).
  • Quality: Parallel requests keep the model’s attention focused and avoid cross‑question contamination.
  • Exceptions: Only consider combining when questions are truly inter‑dependent, you’re throttled by strict rate limits, or network latency dwarfs generation time.

Bottom line: When the questions are unrelated, fire off separate, concurrent requests. 🚀

Parallel vs. Combined Requests

# Reuses ask_single, session, questions, and combined_prompt from the benchmark above.

# Combined execution: one request carrying all five questions
start = time.time()
await ask_single(session, combined_prompt, max_tokens=1500)
time_combined = time.time() - start

# Parallel execution: one request per question, issued concurrently
start = time.time()
await asyncio.gather(*[ask_single(session, q) for q in questions])
time_parallel = time.time() - start

print(f"Combined: {time_combined:.2f}s")
print(f"Parallel: {time_parallel:.2f}s")
print(f"Speedup: {time_combined / time_parallel:.1f}x")

In practice, 5 moderately complex independent questions typically achieve a 3–5× speedup with parallel requests.

Comparison

Dimension | Combined request | Split parallel requests
Generation speed | Slow (sequential output of all answers) | Fast (parallel generation; latency ≈ slowest request)
GPU utilization | Low (single-sequence inference) | High (batched parallel inference)
Answer quality | May degrade (attention dilution) | Better (isolated context)
API calls | 1 | N (one per question)
Best for | Rate-limited scenarios / extremely short answers | Independent questions needing detailed answers

Core principle (one sentence)
An LLM's autoregressive decoding means output is generated sequentially; combining requests forces all answers into a single serial stream, whereas splitting requests leverages server-side parallelism to generate multiple outputs simultaneously: the classic trade-off of spending more concurrent slots (space) to save time.
