On-Prem vs. Proxy — How to Deploy LLMs Without Leaking Sensitive Data
Source: Dev.to

Your SOC 2 certification covers the vendor’s infrastructure—not the data your users paste into prompts. The moment client data is sent to a cloud model, the liability rests with you. The fix is architectural.
Below are three deployment options and guidance on when to use each.
## On‑Premise
The model runs on your own hardware. Nothing leaves your network, satisfying air‑gap requirements and strict data‑residency mandates.
### Use it when
- Air‑gap or strict residency mandates apply
- Government, defense, or intelligence data is involved
- You process > 2 M tokens/day, making infrastructure TCO competitive with API spend
### Reality check
- Up‑front cost: $80 K – $250 K+
- Time to production: 3 – 6 months
- Ongoing ops: 0.5 – 1 FTE DevOps
- Hardware refresh: every 3 – 4 years
An OpenAI‑compatible endpoint on your own hardware:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 4
```
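Because the server speaks the standard OpenAI chat-completions schema, any OpenAI-compatible client can point at it. A minimal standard-library sketch (the URL mirrors the `--host`/`--port` flags above; `ask` assumes the server is actually running on this host):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # matches --host/--port above
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

def build_payload(prompt: str, temperature: float = 0.2) -> dict:
    # Standard OpenAI chat-completions request body; vLLM accepts the same schema.
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask(prompt: str) -> str:
    # Sends the request to the local vLLM server; no data leaves your network.
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```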
## Proxy / Gateway
The model stays in the cloud, but a central gateway owns the control plane: every request passes through it, where PII is redacted, access policies are enforced, and interactions are logged before anything reaches the cloud model.
### Use it when
- Shadow AI (employees using AI without governance) is a primary risk
- Governance is needed this quarter, not next year
- You prefer OPEX over CAPEX
### Good options
| Solution | Type | Highlights |
|---|---|---|
| LiteLLM | Open‑source | Built‑in Presidio PII guardrails, 100+ providers |
| Portkey | Managed | Analytics, fallback routing |
| Kong AI Gateway | Enterprise | Full‑featured API layer |
LiteLLM with PII guardrails (`litellm_config.yaml`):

```yaml
guardrails:
  - guardrail_name: pii-masking
    litellm_params:
      guardrail: presidio
      mode: pre_call   # redact before the request reaches the model
      # add your provider-specific parameters here
```
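Conceptually, a `pre_call` guardrail rewrites the prompt before it leaves the gateway. A deliberately simplified sketch of that masking step (Presidio uses trained recognizers; the two regexes here are illustrative stand-ins, not its real detection logic):

```python
import re

# Stand-in patterns for demonstration only; Presidio's recognizers are far
# more robust than a pair of regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text: str) -> str:
    # Replace every detected entity with a typed placeholder before the
    # prompt is forwarded to the cloud model.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(mask_pii("Contact jane.doe@example.com, SSN 123-45-6789."))
# -> Contact <EMAIL>, SSN <SSN>.
```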
## Hybrid — Local Redaction + Cloud Inference
Sensitive data is masked locally, then anonymized text is sent to a cloud model. This delivers frontier model quality without violating residency requirements—a pattern adopted by many regulated enterprises.
- Local Presidio agent anonymizes all data before it leaves your infrastructure.
- LLM Gateway enforces RBAC and logs every interaction.
- Cloud model processes the clean, anonymized text and never sees PII.
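The round trip can be sketched in a few lines: sensitive values are replaced with placeholders locally, only the placeholder text goes to the cloud, and the mapping (which never leaves your infrastructure) restores the originals in the reply. The regex is an illustrative stand-in for Presidio's recognizers:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # stand-in for Presidio recognizers

def anonymize(text: str) -> tuple[str, dict]:
    # Replace each match with a placeholder; the mapping stays local.
    mapping: dict[str, str] = {}
    def repl(m: re.Match) -> str:
        token = f"<EMAIL_{len(mapping)}>"
        mapping[token] = m.group()
        return token
    return EMAIL.sub(repl, text), mapping

def deanonymize(text: str, mapping: dict) -> str:
    # Restore originals in the cloud model's reply before showing the user.
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

clean, mapping = anonymize("Email bob@corp.com about the renewal.")
# The cloud model only ever sees `clean`: "Email <EMAIL_0> about the renewal."
```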
Presidio configuration (redact before the model sees the prompt):

```yaml
mode: pre_call  # redact BEFORE the model sees the prompt
```
## At a Glance
| Feature | On‑Premise | Proxy / Gateway | Hybrid |
|---|---|---|---|
| Data leaves? | Never | Anonymized only | Anonymized only |
| Air‑gap safe? | Yes | No | No |
| Setup time | 3 – 6 months | 2 – 6 weeks | 4 – 10 weeks |
| Cost | $80 K – $250 K+ | Low (software) | Medium |
| Frontier models? | No | Yes | Yes |
| Best for | Strict residency | Shadow AI / governance | Regulated + cloud |
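The table collapses into a small amount of decision logic. A toy helper encoding it (the parameter names and the ordering of the checks are my reading of the table, not a formal framework):

```python
def recommend(air_gap: bool, regulated: bool, need_frontier_models: bool) -> str:
    # Mirrors the "At a Glance" table: an air-gap requirement forces
    # on-premise; regulated data plus frontier-model quality points to
    # hybrid; otherwise a proxy/gateway is the fastest path to governance.
    if air_gap:
        return "on-premise"
    if regulated and need_frontier_models:
        return "hybrid"
    return "proxy/gateway"
```

For example, an air-gapped defense workload returns `on-premise` regardless of the other flags, while a regulated enterprise that needs frontier-model quality lands on `hybrid`.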
## Further Reading
- Full decision framework & infrastructure specs: LinkedIn Pulse
- Leadership/compliance version: Substack
- Technical deep‑dive with full code: Hashnode