Powerful LLMs Are Not the Problem — Using Them “Raw” Is
A systems-engineering view for builders
Source: Dev.to
Large Language Models are no longer just tools for writing text or generating code. Increasingly, they participate in decisions.
And that’s where a systems problem begins.
This post is not about which model is better, faster, or cheaper.
It asks:
What is the correct system form of AI when it starts participating in decisions, not just producing output?
Many AI systems today are used “raw.”
By “raw” I don’t mean unsafe, unethical, or non‑compliant. I mean this:
We are embedding high‑capability, non‑deterministic reasoning systems directly into environments that require stable, repeatable, auditable decisions — without a real system‑level control layer in between.
Prompt engineering, RAG, rules, and agent frameworks increase capability, but none of them is a system‑level control layer.
For low‑stakes tasks, this distinction barely matters. For decisions with real consequences, it is the whole problem.
LLMs behave more like engines than finished systems
From a systems perspective, LLMs look less like complete products and more like extremely powerful engines. They offer:
- strong generalization
- flexible reasoning paths
- impressive expressive power
But they do not inherently manage:
- stability
- permissions
- responsibility
- long‑term state consistency
In classical computing terms:
- LLM ≈ CPU
- Prompt ≈ instruction stream
Which naturally raises the real question: Where is the operating system?
The real risk isn’t hallucinations
Hallucinations get most of the attention, but they’re not the core issue.
The deeper risks are structural.
Non‑repeatability
The same inputs, under nearly identical conditions, can produce different conclusions.
Illusion of control
LLMs can convincingly explain almost any result, which makes wrong outputs easy to rationalize and hard to question.
Poor debuggability
When decisions matter, we need to answer:
- What triggered this decision?
- Which path was taken?
- Would it happen again?
If we can’t answer these, the system isn’t production‑grade.
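To make this concrete, here is a minimal sketch (the names are mine, not from any particular framework) of a decision record that keeps those three questions answerable after the fact:

```python
from dataclasses import dataclass, field
import hashlib
import json
import time


@dataclass(frozen=True)
class DecisionRecord:
    """Audit entry for a single gated decision."""
    trigger: str    # what event or request initiated the decision
    inputs: dict    # the exact inputs the decision was made on
    path: tuple     # which checks were evaluated, in order
    outcome: str    # "allowed" or "denied"
    timestamp: float = field(default_factory=time.time)

    def fingerprint(self) -> str:
        # Stable hash of the inputs, so "would it happen again?" can be
        # checked by replaying any record with the same fingerprint.
        canonical = json.dumps(self.inputs, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()
```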
The paradox: LLMs aren’t too weak — they’re too free
The problem isn’t a lack of intelligence; it’s a lack of constraints.
Powerful components without system‑level constraints inevitably lead to:
- behavior drift
- accumulated risk
- unclear accountability
This is not an AI problem. It is a systems‑engineering problem.
Why “AI operating systems” keep coming up
We’ve seen this pattern before. CPUs alone were never enough:
| Missing Feature | Consequence |
|---|---|
| No scheduling | Chaos |
| No isolation | Insecurity |
| No state management | Instability |
Operating systems didn’t weaken CPUs; they made them usable.
For AI, the equivalent challenge is decision rights.
Decision models are not ML models
When we talk about decision models here, we don’t mean another trained model.
We mean a system layer that:
- does not predict
- does not generate
- does not optimize creatively
It answers one question only:
Is this decision allowed under the current system state?
The requirement is simple, but rare in practice:
Same conditions → same decision.
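A minimal sketch of that idea, with invented names (`SystemState` and `decide` are illustrations, not an existing API). The gate is a pure function of the request and the current state, with no model call, no sampling, and no I/O:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    """Snapshot of the conditions a decision is judged against."""
    caller: str
    risk_limit: float
    maintenance_mode: bool


def decide(action: str, amount: float, state: SystemState) -> bool:
    """Deterministic gate: same (action, amount, state) -> same verdict."""
    if state.maintenance_mode:
        return False
    if action == "execute_trade" and amount > state.risk_limit:
        return False
    return state.caller in {"scheduler", "operator"}
```

The specific rules are placeholders; the point is that nothing in the gate generates, predicts, or samples, so its behavior can be frozen, replayed, and reviewed.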
Companion models need a hard boundary
Long‑lived systems (AI phones, robots, vehicles) need continuity — preferences, habits, context.
This motivates the idea of companion models, but a strict rule is required:
- Companion models may provide state – never authority.
Once long‑term preference gains decision power, control erodes.
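One way to hold that boundary, sketched with illustrative classes (not a real library): the companion layer exposes its memory as data, and only the kernel turns data into a verdict.

```python
class CompanionModel:
    """Remembers long-lived context. It can describe, never decide."""

    def __init__(self):
        self._preferences = {}

    def observe(self, key, value):
        self._preferences[key] = value

    def state(self) -> dict:
        # Read-only snapshot handed to the decision layer as input.
        return dict(self._preferences)


def kernel_decide(request: str, companion_state: dict, policy: dict) -> bool:
    """Preferences may narrow behavior; only policy grants authority."""
    if request not in policy["allowed_requests"]:
        return False  # no preference can override the policy boundary
    if request in companion_state.get("opted_out", set()):
        return False  # preferences can opt out, never opt in
    return True
```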
Closing: this is a systems problem, not a model race
The next phase of AI isn’t about making models smarter.
It’s about making systems:
- controllable
- repeatable
- auditable
- trustworthy over time
Intelligence without a decision kernel doesn’t scale reliability — it scales risk.
Author note
AI Decision Systems · Core Q&A (v1.0)
Q1: Where does LLM‑based AI genuinely outperform traditional industry software?
A: Traditional industry software excels when:
- rules are explicit
- boundaries are clear
- conditions are enumerable
LLM‑based AI becomes powerful when:
- information is incomplete
- requirements are vaguely expressed
- real‑world variables constantly change
This is a capability advantage, not an engineering maturity advantage.
Q2: You argue that “constraining LLMs” improves safety and reliability. Doesn’t that weaken their power?
A:
- Unconstrained LLMs: appear powerful, behave inconsistently, cannot be reliably audited.
- System‑governed LLMs: retain intelligence, act only under permitted conditions, with decisions that can be traced, frozen, and reviewed.
In engineering, capability without control has no production value.
Q2 (Extended): You compare LLMs to powerful car engines. Does that imply most people are using LLMs “raw”? Why is that dangerous?
A:
A high‑performance engine without transmission, brakes, or stability control becomes more dangerous as horsepower increases.
LLMs behave similarly:
- stronger reasoning
- better articulation
- larger impact radius when things go wrong
The danger is not that LLMs make mistakes, but that those mistakes can’t be contained or audited.
Q3: So like a PC needs Windows before the CPU is useful, AI needs an OS? Is that why you’re building EDCA OS?
A:
A CPU does not manage:
- task scheduling
- permission isolation
- state persistence
- fault recovery
That’s the operating system’s role.
When AI participates in decisions, it needs similar structure:
- who may decide
- under what conditions
- whether a decision is allowed
- whether it can be reproduced
EDCA OS focuses on turning decisions into system behavior, not making AI “smarter.”
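As a toy illustration (the field names are assumptions, not the actual EDCA OS format), those four questions can be expressed as plain, reviewable data:

```python
# Who may decide, under what conditions, whether the decision is allowed,
# and whether it must be reproducible -- declared as data, not prose.
DECISION_RIGHTS = {
    "send_notification": {
        "who": {"assistant"},            # callers permitted to request this
        "conditions": {"user_awake"},    # flags that must all be true
        "reproducible": True,            # must replay identically from a log
    },
    "transfer_funds": {
        "who": {"operator"},
        "conditions": {"user_confirmed", "within_daily_limit"},
        "reproducible": True,
    },
}


def is_allowed(action: str, caller: str, active_flags: set) -> bool:
    rule = DECISION_RIGHTS.get(action)
    if rule is None:
        return False  # undeclared actions are denied by default
    return caller in rule["who"] and rule["conditions"] <= active_flags
```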
Q4: Why did you choose the GPT client as your runtime environment? Is this your own standard?
A:
We prioritize:
- session stability
- built‑in behavioral boundaries
- consistent execution characteristics
At present, only a few LLM runtimes allow serious discussion of:
- decision stability
- repeatability
- “same input → same outcome” validation
This is not a model benchmark — it’s a systems prerequisite.
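The kind of validation this prerequisite implies, sketched against a deterministic gate like the `decide` function above:

```python
def test_same_input_same_outcome():
    """Replay one request against one frozen state many times;
    any drift means the decision layer is not production-grade."""
    state = SystemState(caller="scheduler", risk_limit=100.0,
                        maintenance_mode=False)
    outcomes = {decide("execute_trade", 50.0, state) for _ in range(100)}
    assert len(outcomes) == 1, "decision drifted under identical conditions"
```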
Q5: What’s the real difference between traditional quantitative systems and AI‑based quant systems? Where does AI quant fail?
A:
- Traditional quant systems: fixed strategies, explicit paths, auditable and back‑testable behavior.
- AI quant systems often suffer from:
  - decision drift
  - inconsistent behavior under identical conditions
  - weak auditability
The issue is not intelligence, but missing decision‑stability structure.
Q5 (Extended): Does this mean you aim for scikit‑learn compatibility, or are you abandoning it?
A:
- scikit‑learn handles training and prediction.
- EDCA‑style decision models handle whether predictions are allowed to be acted upon.
The two can coexist: use scikit‑learn for the predictive layer, then wrap it with an EDCA decision kernel to enforce repeatability, auditability, and permission checks.
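A hedged sketch of that coexistence (the gate shown here is illustrative, not EDCA’s actual interface): scikit-learn produces the prediction, and a separate, explicit check decides whether anything is done with it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Predictive layer: ordinary scikit-learn, trained as usual.
X_train = np.array([[0.1], [0.4], [0.6], [0.9]])
y_train = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_train, y_train)


def gated_prediction(x: np.ndarray, confidence_floor: float = 0.8) -> dict:
    """Decision layer: the prediction is acted upon only if it clears
    explicit, auditable conditions; otherwise the action is refused."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    label, confidence = int(proba.argmax()), float(proba.max())
    return {"label": label, "confidence": confidence,
            "act": confidence >= confidence_floor}


print(gated_prediction(np.array([0.75])))
```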
Q6: Why did you build CMRE? What were you trying to validate?
A:
Medical scenarios combine:
- high risk
- high responsibility
- strong temptation to overstep
If a system can:
- distinguish information from judgment
- resist unauthorized decision‑making
- remain stable under pressure
then it will be safer in less critical domains.
Q7: What’s your breakthrough in LLM‑based research assistants? Why do you disconnect online retrieval during testing?
A:
Online retrieval often causes:
- retrieval to be mistaken for reasoning
- existing conclusions to masquerade as discovery
Disconnecting search forces the model to:
- expose its reasoning structure
- operate within known constraints
- reveal gaps instead of hiding them behind citations
AI’s role in research is not to replace scientists, but to expand the space they can reason over.
Q7 (Extended): If data scarcity is no longer the bottleneck, what do you still rely on scientists for? And doesn’t AI, unlike humans, lack cognitive bias?
A:
What scientists uniquely provide is not data volume, but:
- which variables matter
- which assumptions deserve challenge
- which questions are worth asking
AI expands reasoning space. Humans define research direction.