Powerful LLMs Are Not the Problem — Using Them “Raw” Is
A systems-engineering view for builders
Source: Dev.to
Large Language Models are no longer just tools for writing text or generating code. Increasingly, they participate in decisions.
And that’s where a systems problem begins.
This post is not about which model is better, faster, or cheaper.
It asks:
What is the correct system form of AI when it starts participating in decisions, not just producing output?
Many AI systems today are used “raw.”
By “raw” I don’t mean unsafe, unethical, or non‑compliant. I mean this:
We are embedding high‑capability, non‑deterministic reasoning systems directly into environments that require stable, repeatable, auditable decisions — without a real system‑level control layer in between.
Prompt engineering, RAG, rules, and agent frameworks increase capability, but none of them is a system‑level control layer.
For low‑stakes tasks, this distinction barely matters. For decisions with real consequences, it is the whole problem.
LLMs behave more like engines than finished systems
From a systems perspective, LLMs look less like complete products and more like extremely powerful engines. They offer:
- strong generalization
- flexible reasoning paths
- impressive expressive power
But they do not inherently manage:
- stability
- permissions
- responsibility
- long‑term state consistency
In classical computing terms:
- LLM ≈ CPU
- Prompt ≈ instruction stream
Which naturally raises the real question: Where is the operating system?
The real risk isn’t hallucinations
Hallucinations get most of the attention, but they’re not the core issue.
The deeper risks are structural.
Non‑repeatability
The same inputs, under nearly identical conditions, can produce different conclusions.
Illusion of control
LLMs can convincingly explain almost any result, which makes wrong outputs easy to rationalize and hard to question.
Poor debuggability
When decisions matter, we need to answer:
- What triggered this decision?
- Which path was taken?
- Would it happen again?
If we can’t answer these, the system isn’t production‑grade.
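To make this concrete, here is a minimal sketch (the names are mine, not from any particular framework) of a decision record that keeps those three questions answerable after the fact:

```python
from dataclasses import dataclass, field
import hashlib
import json
import time


@dataclass(frozen=True)
class DecisionRecord:
    """Audit entry for a single gated decision."""
    trigger: str    # what event or request initiated the decision
    inputs: dict    # the exact inputs the decision was made on
    path: tuple     # which checks were evaluated, in order
    outcome: str    # "allowed" or "denied"
    timestamp: float = field(default_factory=time.time)

    def fingerprint(self) -> str:
        # Stable hash of the inputs, so "would it happen again?" can be
        # checked by replaying any record with the same fingerprint.
        canonical = json.dumps(self.inputs, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()
```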
The paradox: LLMs aren’t too weak — they’re too free
The problem isn’t a lack of intelligence; it’s a lack of constraints.
Powerful components without system‑level constraints inevitably lead to:
- behavior drift
- accumulated risk
- unclear accountability
This is not an AI problem. It is a systems‑engineering problem.
Why “AI operating systems” keep coming up
We’ve seen this pattern before. CPUs alone were never enough:
| Missing Feature | Consequence |
|---|---|
| No scheduling | Chaos |
| No isolation | Insecurity |
| No state management | Instability |
Operating systems didn’t weaken CPUs; they made them usable.
For AI, the equivalent challenge is decision rights.
Decision models are not ML models
When we talk about decision models here, we don’t mean another trained model.
We mean a system layer that:
- does not predict
- does not generate
- does not optimize creatively
It answers one question only:
Is this decision allowed under the current system state?
The requirement is simple, but rare in practice:
Same conditions → same decision.
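A minimal sketch of that idea, with invented names (`SystemState` and `decide` are illustrations, not an existing API). The gate is a pure function of the request and the current state, with no model call, no sampling, and no I/O:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    """Snapshot of the conditions a decision is judged against."""
    caller: str
    risk_limit: float
    maintenance_mode: bool


def decide(action: str, amount: float, state: SystemState) -> bool:
    """Deterministic gate: same (action, amount, state) -> same verdict."""
    if state.maintenance_mode:
        return False
    if action == "execute_trade" and amount > state.risk_limit:
        return False
    return state.caller in {"scheduler", "operator"}
```

The specific rules are placeholders; the point is that nothing in the gate generates, predicts, or samples, so its behavior can be frozen, replayed, and reviewed.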
Companion models need a hard boundary
Long‑lived systems (AI phones, robots, vehicles) need continuity — preferences, habits, context.
This motivates the idea of companion models, but a strict rule is required:
- Companion models may provide state – never authority.
Once long‑term preference gains decision power, control erodes.
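One way to hold that boundary, sketched with illustrative classes (not a real library): the companion layer exposes its memory as data, and only the kernel turns data into a verdict.

```python
class CompanionModel:
    """Remembers long-lived context. It can describe, never decide."""

    def __init__(self):
        self._preferences = {}

    def observe(self, key, value):
        self._preferences[key] = value

    def state(self) -> dict:
        # Read-only snapshot handed to the decision layer as input.
        return dict(self._preferences)


def kernel_decide(request: str, companion_state: dict, policy: dict) -> bool:
    """Preferences may narrow behavior; only policy grants authority."""
    if request not in policy["allowed_requests"]:
        return False  # no preference can override the policy boundary
    if request in companion_state.get("opted_out", set()):
        return False  # preferences can opt out, never opt in
    return True
```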
Closing: this is a systems problem, not a model race
The next phase of AI isn’t about making models smarter.
It’s about making systems:
- controllable
- repeatable
- auditable
- trustworthy over time
Intelligence without a decision kernel doesn’t scale reliability — it scales risk.
Author note
AI Decision Systems · Core Q&A (v1.0)
Q1: Where does LLM‑based AI genuinely outperform traditional industry software?
A: Traditional industry software excels when:
- rules are explicit
- boundaries are clear
- conditions are enumerable
LLM‑based AI becomes powerful when:
- information is incomplete
- requirements are vaguely expressed
- real‑world variables constantly change
This is a capability advantage, not an engineering maturity advantage.
Q2: You argue that “constraining LLMs” improves safety and reliability. Doesn’t that weaken their power?
A:
- Unconstrained LLMs: appear powerful, behave inconsistently, cannot be reliably audited.
- System‑governed LLMs: retain intelligence, act only under permitted conditions, with decisions that can be traced, frozen, and reviewed.
In engineering, capability without control has no production value.
Q2 (Extended): You compare LLMs to powerful car engines. Does that imply most people are using LLMs “raw”? Why is that dangerous?
A:
A high‑performance engine without transmission, brakes, or stability control becomes more dangerous as horsepower increases.
LLMs behave similarly:
- stronger reasoning
- better articulation
- larger impact radius when things go wrong
The danger is not that LLMs make mistakes, but that those mistakes can’t be contained or audited.
Q3: So like a PC needs Windows before the CPU is useful, AI needs an OS? Is that why you’re building EDCA OS?
A:
A CPU does not manage:
- task scheduling
- permission isolation
- state persistence
- fault recovery
That’s the operating system’s role.
When AI participates in decisions, it needs similar structure:
- who may decide
- under what conditions
- whether a decision is allowed
- whether it can be reproduced
EDCA OS focuses on turning decisions into system behavior, not making AI “smarter.”
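As a toy illustration (the field names are assumptions, not the actual EDCA OS format), those four questions can be expressed as plain, reviewable data:

```python
# Who may decide, under what conditions, whether the decision is allowed,
# and whether it must be reproducible -- declared as data, not prose.
DECISION_RIGHTS = {
    "send_notification": {
        "who": {"assistant"},            # callers permitted to request this
        "conditions": {"user_awake"},    # flags that must all be true
        "reproducible": True,            # must replay identically from a log
    },
    "transfer_funds": {
        "who": {"operator"},
        "conditions": {"user_confirmed", "within_daily_limit"},
        "reproducible": True,
    },
}


def is_allowed(action: str, caller: str, active_flags: set) -> bool:
    rule = DECISION_RIGHTS.get(action)
    if rule is None:
        return False  # undeclared actions are denied by default
    return caller in rule["who"] and rule["conditions"] <= active_flags
```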
Q4: Why did you choose the GPT client as your runtime environment? Is this your own standard?
A:
We prioritize:
- session stability
- built‑in behavioral boundaries
- consistent execution characteristics
At present, only a few LLM runtimes allow serious discussion of:
- decision stability
- repeatability
- “same input → same outcome” validation
This is not a model benchmark — it’s a systems prerequisite.
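The kind of validation this prerequisite implies, sketched against a deterministic gate like the `decide` function above:

```python
def test_same_input_same_outcome():
    """Replay one request against one frozen state many times;
    any drift means the decision layer is not production-grade."""
    state = SystemState(caller="scheduler", risk_limit=100.0,
                        maintenance_mode=False)
    outcomes = {decide("execute_trade", 50.0, state) for _ in range(100)}
    assert len(outcomes) == 1, "decision drifted under identical conditions"
```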
Q5: What’s the real difference between traditional quantitative systems and AI‑based quant systems? Where does AI quant fail?
A:
- Traditional quant systems: fixed strategies, explicit paths, auditable and back‑testable behavior.
- AI quant systems often suffer from:
  - decision drift
  - inconsistent behavior under identical conditions
  - weak auditability
The issue is not intelligence, but missing decision‑stability structure.
Q5 (Extended): Does this mean you aim for scikit‑learn compatibility, or are you abandoning it?
A:
- scikit‑learn handles training and prediction.
- EDCA‑style decision models handle whether predictions are allowed to be acted upon.
The two can coexist: use scikit‑learn for the predictive layer, then wrap it with an EDCA decision kernel to enforce repeatability, auditability, and permission checks.
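A hedged sketch of that coexistence (the gate shown here is illustrative, not EDCA’s actual interface): scikit-learn produces the prediction, and a separate, explicit check decides whether anything is done with it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Predictive layer: ordinary scikit-learn, trained as usual.
X_train = np.array([[0.1], [0.4], [0.6], [0.9]])
y_train = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_train, y_train)


def gated_prediction(x: np.ndarray, confidence_floor: float = 0.8) -> dict:
    """Decision layer: the prediction is acted upon only if it clears
    explicit, auditable conditions; otherwise the action is refused."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    label, confidence = int(proba.argmax()), float(proba.max())
    return {"label": label, "confidence": confidence,
            "act": confidence >= confidence_floor}


print(gated_prediction(np.array([0.75])))
```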
Q6: Why did you build CMRE? What were you trying to validate?
A:
Medical scenarios combine:
- high risk
- high responsibility
- strong temptation to overstep
If a system can:
- distinguish information from judgment
- resist unauthorized decision‑making
- remain stable under pressure
then it will be safer in less critical domains.
Q7: What’s your breakthrough in LLM‑based research assistants? Why do you disconnect online retrieval during testing?
A:
Online retrieval often causes:
- retrieval to be mistaken for reasoning
- existing conclusions to masquerade as discovery
Disconnecting search forces the model to:
- expose its reasoning structure
- operate within known constraints
- reveal gaps instead of hiding them behind citations
AI’s role in research is not to replace scientists, but to expand the space they can reason over.
Q7 (Extended): If data scarcity is no longer the bottleneck, what do you still rely on scientists for? And doesn’t AI, unlike humans, lack cognitive bias?
A:
What scientists uniquely provide is not data volume, but:
- which variables matter
- which assumptions deserve challenge
- which questions are worth asking
AI expands reasoning space. Humans define research direction.