Scaling AI Agents: Mastering Elasticity, State, and Throughput with C#

Published: February 9, 2026 at 03:00 PM EST
7 min read
Source: Dev.to

High‑Performance AI Agent Architecture

Imagine a high‑end restaurant during the Friday night rush. The kitchen is chaos: orders pile up, chefs sweat, and a dropped plate means a whole table’s order is lost.

Now map that scenario onto your AI infrastructure. If your GPU cluster is the kitchen and your AI agents are the chefs, what happens when the “dinner rush” of user requests hits?

  • No elastic scaling → system crashes.
  • No state persistence → lost conversations.
  • No throughput optimization → high latency and skyrocketing cloud costs.

Deploying containerized AI agents at scale isn’t just about wrapping a model in Docker; it’s about orchestrating a dynamic dance of resources. This guide breaks down the architectural pillars needed to turn a simple AI model into a resilient, cloud‑native service using modern C# and Kubernetes.

Architectural Blueprint

| Pillar | Analogy | Goal |
| --- | --- | --- |
| Elastic Scaling | The Manager | React to fluctuating demand. |
| State Persistence | The Memory | Keep conversations alive across pod crashes. |
| Throughput Optimization | The Assembly Line | Maximize hardware usage via batching. |

1️⃣ Elastic Scaling (Intent‑Based)

In standard Kubernetes deployments you scale on CPU or RAM usage. For AI agents those metrics are misleading:

  • A GPU can show 100 % utilization while processing a massive batch, or sit idle waiting for a network response.
  • The real bottlenecks are queue depth (how many requests are waiting for the GPU) and inference latency (Time‑to‑First‑Token, TTFT).

Instrumentation with System.Diagnostics.Metrics

```csharp
using System.Diagnostics.Metrics;

public class InferenceMetrics
{
    private static readonly Meter _meter = new("AI.Agent.Inference");

    // Latency of generating a response (ms)
    private static readonly Histogram<double> _generationLatency =
        _meter.CreateHistogram<double>("agent.generation.latency.ms", "ms",
            "Time taken to generate a response");

    // Number of requests waiting for inference
    // (RequestQueue stands in for your application's actual request queue)
    private static readonly ObservableGauge<int> _queueDepth =
        _meter.CreateObservableGauge<int>("agent.queue.depth",
            () => RequestQueue.Count, // callback sampled on each metrics export
            "requests",
            "Number of requests waiting for inference");

    public void RecordLatency(double latencyMs) => _generationLatency.Record(latencyMs);
}
```
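To see the histogram in action, the latency recording wraps the model call. A minimal, runnable sketch (the `Task.Delay` is a stand-in for real inference, not the article's model call):

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Threading.Tasks;

var meter = new Meter("AI.Agent.Inference");
var latency = meter.CreateHistogram<double>("agent.generation.latency.ms", "ms");

var sw = Stopwatch.StartNew();
await Task.Delay(5); // stand-in for the real model call
sw.Stop();

// Record the measured wall-clock time into the histogram
latency.Record(sw.Elapsed.TotalMilliseconds);
Console.WriteLine($"recorded {sw.Elapsed.TotalMilliseconds:F0} ms");
```

In production this measurement sits inside the inference endpoint, so every request contributes a sample that the metrics exporter (e.g., OpenTelemetry) can turn into the HPA's custom metric.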

Win: By decoupling scaling triggers from generic CPU usage to domain‑specific metrics (latency / queue depth), the Horizontal Pod Autoscaler (HPA) can scale proactively to maintain user experience.

2️⃣ State Persistence (Short‑Term Memory)

AI agents are stateful during a session: they rely on prior messages, tool outputs, and memory. Containers, however, are ephemeral. If Pod A crashes, its in‑memory conversation history disappears.

Distributed Cache with IDistributedCache

```csharp
using Microsoft.Extensions.Caching.Distributed;
using System.Text.Json;

public interface IAgentStateStore
{
    Task<T?> GetStateAsync<T>(string sessionId, CancellationToken ct);
    Task SetStateAsync<T>(string sessionId, T state, CancellationToken ct);
}

public class RedisAgentStateStore : IAgentStateStore
{
    private readonly IDistributedCache _cache;
    public RedisAgentStateStore(IDistributedCache cache) => _cache = cache;

    public async Task<T?> GetStateAsync<T>(string sessionId, CancellationToken ct)
    {
        byte[]? data = await _cache.GetAsync(sessionId, ct);
        if (data == null) return default;

        // Deserialize; for lower overhead, pass a source-generated
        // JsonSerializerContext (System.Text.Json, .NET 8+)
        return JsonSerializer.Deserialize<T>(data);
    }

    public async Task SetStateAsync<T>(string sessionId, T state, CancellationToken ct)
    {
        byte[] data = JsonSerializer.SerializeToUtf8Bytes(state);
        var options = new DistributedCacheEntryOptions
        {
            SlidingExpiration = TimeSpan.FromMinutes(30) // evict inactive sessions
        };
        await _cache.SetAsync(sessionId, data, options, ct);
    }
}
```

Win: Pods become stateless – they only host logic and model weights. If a pod crashes, the next request pulls the session state from Redis and continues without loss.
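The store needs an `IDistributedCache` implementation registered at startup. One common wiring (an assumption, not from the article) uses the Microsoft.Extensions.Caching.StackExchangeRedis package; the connection string and instance name below are illustrative:

```csharp
// Program.cs sketch — assumes the Microsoft.Extensions.Caching.StackExchangeRedis package
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddStackExchangeRedisCache(options =>
{
    // Illustrative values: point Configuration at your Redis service
    options.Configuration = builder.Configuration.GetConnectionString("Redis") ?? "redis:6379";
    options.InstanceName = "agent:"; // key prefix keeps session keys namespaced
});
builder.Services.AddSingleton<IAgentStateStore, RedisAgentStateStore>();

var app = builder.Build();
```

Because every pod reads the same Redis instance, the load balancer can route a session's next request to any replica.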

3️⃣ Throughput Optimization (Batching)

Processing AI requests one‑by‑one is like plating dishes one at a time – the expensive GPU stays under‑utilized. Request batching aggregates multiple requests into a single forward pass.

Producer‑Consumer with System.Threading.Channels

```csharp
using System.Threading.Channels;

public class BatchingService
{
    private readonly Channel<InferenceRequest> _channel;
    private readonly TimeSpan _maxBatchWait = TimeSpan.FromMilliseconds(20);
    private readonly int _maxBatchSize = 32;

    public BatchingService()
    {
        var options = new BoundedChannelOptions(_maxBatchSize * 2)
        {
            // Writers wait when full: back-pressure instead of unbounded memory growth
            FullMode = BoundedChannelFullMode.Wait
        };
        _channel = Channel.CreateBounded<InferenceRequest>(options);
    }

    // Producer: called by the HTTP endpoint
    public async ValueTask EnqueueAsync(InferenceRequest request, CancellationToken ct)
        => await _channel.Writer.WriteAsync(request, ct);

    // Consumer: background worker
    public async Task RunAsync(CancellationToken ct)
    {
        var batch = new List<InferenceRequest>(_maxBatchSize);
        // WaitToReadAsync returns false once the channel is completed
        while (await _channel.Reader.WaitToReadAsync(ct))
        {
            // Give late arrivals a short window to join the batch
            var flushDeadline = Task.Delay(_maxBatchWait, ct);
            while (batch.Count < _maxBatchSize)
            {
                if (_channel.Reader.TryRead(out var item))
                {
                    batch.Add(item);
                    continue;
                }
                // Nothing buffered: wait for more input or the flush deadline
                var more = _channel.Reader.WaitToReadAsync(ct).AsTask();
                if (await Task.WhenAny(flushDeadline, more) == flushDeadline) break;
            }

            // Execute the batch (full, or flushed by timeout)
            await ProcessBatchAsync(batch, ct);
            batch.Clear();
        }
    }

    private Task ProcessBatchAsync(List<InferenceRequest> batch, CancellationToken ct)
    {
        // TODO: Call the model with the aggregated inputs
        // Record latency metrics, update queue depth, etc.
        return Task.CompletedTask;
    }
}
```

Win: The GPU processes a single large tensor instead of many tiny ones, dramatically improving throughput and reducing per‑request cost.
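The article leaves `InferenceRequest` undefined. One way to get each caller its own answer back out of a shared batch is to carry a `TaskCompletionSource` per request; the shape below is a hypothetical sketch, not the author's definition:

```csharp
using System;
using System.Threading.Tasks;

// Demo: the caller awaits its own result even though a batch worker produced it
var req = new InferenceRequest("hello");
req.Completion.TrySetResult("world"); // in ProcessBatchAsync, the worker sets each result
Console.WriteLine(await req.Completion.Task);

// Hypothetical request shape: prompt in, per-request completion out
public sealed record InferenceRequest(string Prompt)
{
    // RunContinuationsAsynchronously keeps the batch loop from running
    // caller continuations inline on the worker thread
    public TaskCompletionSource<string> Completion { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
}
```

With this shape, the HTTP endpoint enqueues the request and simply `await`s `request.Completion.Task`; batching stays invisible to the client.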

Putting It All Together

  1. Expose metrics (InferenceMetrics) → HPA scales pods based on latency/queue depth.
  2. Persist session state (RedisAgentStateStore) → pods stay stateless, enabling rapid recovery.
  3. Batch incoming requests (BatchingService + Channels) → maximize GPU utilization.

With these pillars in place, your AI service can handle the “Friday night rush” gracefully, delivering low‑latency responses, preserving conversation context, and keeping cloud spend under control. 🚀

A fuller variant of the same service wires the channel to a model runner and streams batches with `IAsyncEnumerable`:

```csharp
// Requires: using System.Runtime.CompilerServices; (for [EnumeratorCancellation])

private readonly Channel<InferenceRequest> _channel;
private readonly ModelRunner _modelRunner;

public BatchingService(ModelRunner modelRunner)
{
    // Bounded channel prevents memory exhaustion (back-pressure)
    _channel = Channel.CreateBounded<InferenceRequest>(new BoundedChannelOptions(1000)
    {
        FullMode = BoundedChannelFullMode.Wait
    });
    _modelRunner = modelRunner;
}

public async ValueTask EnqueueAsync(InferenceRequest request)
{
    await _channel.Writer.WriteAsync(request);
}

public async Task ProcessBatchesAsync(CancellationToken stoppingToken)
{
    await foreach (var batch in ReadBatchesAsync(stoppingToken))
    {
        await _modelRunner.ExecuteBatchAsync(batch);
    }
}

private async IAsyncEnumerable<List<InferenceRequest>> ReadBatchesAsync(
    [EnumeratorCancellation] CancellationToken ct)
{
    var batch = new List<InferenceRequest>(capacity: 32);
    var timer = Task.Delay(TimeSpan.FromMilliseconds(10), ct);

    await foreach (var request in _channel.Reader.ReadAllAsync(ct))
    {
        batch.Add(request);

        // Condition 1: the batch is full.
        // Condition 2: the flush timer elapsed (latency optimization).
        if (batch.Count >= 32 || timer.IsCompleted)
        {
            yield return batch;
            batch = new List<InferenceRequest>(capacity: 32);
            timer = Task.Delay(TimeSpan.FromMilliseconds(10), ct);
        }
    }

    // Flush any partial batch left when the channel completes
    if (batch.Count > 0)
        yield return batch;
}
```

The Architectural Win

  • Throughput – Batching maximizes the work done per GPU cycle, reducing the number of pods required and lowering costs.
  • Trade-off – It introduces a latency vs. throughput balance that is tuned via the batch size and timeout parameters.

These three concepts form a cohesive, self‑healing system:

  1. Traffic enters and is enqueued via System.Threading.Channels.
  2. The Batching Service groups requests and retrieves Agent State from Redis.
  3. The model processes the batch; Metrics record the latency.
  4. The HPA Controller reads the custom metric. If latency spikes, it scales out pods.
  5. New pods start, connect to Redis, and join the queue processing.
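Step 2 implies the batch consumer runs for the pod's whole lifetime. A sketch of hosting it as a `BackgroundService` (the `BatchWorker` name and the registration lines are assumptions, not from the article):

```csharp
using Microsoft.Extensions.Hosting;

// Runs the batch loop until the host's shutdown token fires,
// so in-flight batches can drain during Kubernetes rollouts
public sealed class BatchWorker : BackgroundService
{
    private readonly BatchingService _batcher;

    public BatchWorker(BatchingService batcher) => _batcher = batcher;

    protected override Task ExecuteAsync(CancellationToken stoppingToken)
        => _batcher.ProcessBatchesAsync(stoppingToken);
}

// Registration sketch (Program.cs):
// builder.Services.AddSingleton<BatchingService>();
// builder.Services.AddHostedService<BatchWorker>();
```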

Scaling AI Agents

Moving past simple containerization requires:

  • Elastic scaling driven by custom metrics.
  • State persistence via distributed caching (e.g., Redis).
  • Throughput optimization through request batching.

Mastering these transforms a brittle prototype into a robust, cloud‑native powerhouse.

Modern .NET Practices

  • Leveraging System.Threading.Channels for back‑pressure‑aware queues.
  • Using System.Diagnostics.Metrics for idiomatic, low‑overhead telemetry.

Discussion Prompts

  1. Throughput vs. latency – In your experience, is the trade‑off between batching (throughput) and real‑time processing (latency) worth it for user‑facing chat agents, or should low latency be prioritised at all costs?
  2. State persistence – How do you currently handle state in containerised environments? Do you rely on external databases, or have you found ways to keep state within the pod lifecycle effectively?

The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook Cloud‑Native AI & Microservices: Containerizing Agents and Scaling Inference (Leanpub).
