Powerful LLMs Are Not the Problem — Using Them “Raw” Is

A systems-engineering view for builders

Published: December 23, 2025 at 03:57 AM EST
5 min read
Source: Dev.to

Large Language Models are no longer just tools for writing text or generating code. Increasingly, they participate in decisions.

And that’s where a systems problem begins.

This post is not about which model is better, faster, or cheaper.
It asks:

What is the correct system form of AI when it starts participating in decisions, not just producing output?

Many AI systems today are used “raw.”

By “raw” I don’t mean unsafe, unethical, or non‑compliant. I mean this:

We are embedding high‑capability, non‑deterministic reasoning systems directly into environments that require stable, repeatable, auditable decisions — without a real system‑level control layer in between.

Prompt engineering, RAG, rules, and agent frameworks increase capability; none of them add control.
For low‑stakes tasks, this distinction barely matters. For consequential decisions, it is the whole problem.

LLMs behave more like engines than finished systems

From a systems perspective, LLMs look less like complete products and more like extremely powerful engines. They offer:

  • strong generalization
  • flexible reasoning paths
  • impressive expressive power

But they do not inherently manage:

  • stability
  • permissions
  • responsibility
  • long‑term state consistency

In classical computing terms:

LLM    ≈ CPU
Prompt ≈ instruction stream

Which naturally raises the real question: Where is the operating system?

The real risk isn’t hallucinations

Hallucinations get most of the attention, but they’re not the core issue.
The deeper risks are structural.

Non‑repeatability

The same inputs, under nearly identical conditions, can produce different conclusions.

Illusion of control

LLMs can convincingly explain almost any result, including wrong ones; a persuasive explanation is not evidence of a controlled process.

Poor debuggability

When decisions matter, we need to answer:

  1. What triggered this decision?
  2. Which path was taken?
  3. Would it happen again?

If we can’t answer these, the system isn’t production‑grade.
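Those three questions map naturally onto a structured decision record. A minimal Python sketch; the `DecisionRecord` name and its fields are illustrative, not part of any real framework:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DecisionRecord:
    """One auditable decision: its trigger, the path taken, the outcome."""
    trigger: str   # 1. what triggered this decision
    path: tuple    # 2. which rule/branch sequence was taken
    decision: str  # the outcome itself

    def fingerprint(self) -> str:
        # 3. identical trigger + path + decision yield an identical
        # fingerprint, so "would it happen again?" becomes checkable
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

a = DecisionRecord("sensor_overheat", ("check_threshold", "check_permission"), "shutdown")
b = DecisionRecord("sensor_overheat", ("check_threshold", "check_permission"), "shutdown")
assert a.fingerprint() == b.fingerprint()  # same conditions, same record
```

If a decision cannot be serialized into something like this, it cannot be audited at all.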

The paradox: LLMs aren’t too weak — they’re too free

The problem isn’t intelligence; it’s unconstrained freedom.
Powerful components without system‑level constraints inevitably lead to:

  • behavior drift
  • accumulated risk
  • unclear accountability

This is not an AI problem. It is a systems‑engineering problem.

Why “AI operating systems” keep coming up

We’ve seen this pattern before. CPUs alone were never enough:

Missing feature → consequence:

  • No scheduling → chaos
  • No isolation → insecurity
  • No state management → instability

Operating systems didn’t weaken CPUs; they made their power usable.
For AI, the equivalent challenge is decision rights: who may decide, and under what conditions.

Decision models are not ML models

When we talk about decision models here, we don’t mean another trained model.
We mean a system layer that:

  • does not predict
  • does not generate
  • does not optimize creatively

It answers one question only:

Is this decision allowed under the current system state?

The requirement is simple, but rare in practice:

Same conditions → same decision.
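A minimal sketch of such a layer, with hypothetical action names and state keys; the point is that it is a pure, enumerable lookup, not a model:

```python
def decision_kernel(state: dict, proposed_action: str) -> bool:
    """Pure function: (system state, proposed action) -> allowed?
    No prediction, no generation; only a deterministic permission check."""
    rules = {
        "execute_trade": lambda s: s["market_open"] and s["risk_level"] <= 2,
        "send_alert":    lambda s: True,  # alerts are always permitted
    }
    rule = rules.get(proposed_action)
    if rule is None:
        return False  # unknown actions are denied by default
    return bool(rule(state))

state = {"market_open": True, "risk_level": 3}
# Same conditions → same decision, every time:
assert decision_kernel(state, "execute_trade") == decision_kernel(state, "execute_trade")
assert decision_kernel(state, "execute_trade") is False  # risk too high
assert decision_kernel(state, "launch_rocket") is False  # not an enumerated action
```

Because the kernel is a pure function over explicit state, its decisions can be frozen, replayed, and audited, which is exactly what a generative model cannot promise.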

Companion models need a hard boundary

Long‑lived systems (AI phones, robots, vehicles) need continuity — preferences, habits, context.
This motivates the idea of companion models, but a strict rule is required:

  • Companion models may provide state – never authority.

Once long‑term preference gains decision power, control erodes.
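One way to make that boundary mechanical is to hand the decision layer a read-only view of companion state. A sketch (the function names and the `style` preference are invented for illustration):

```python
from types import MappingProxyType

def companion_state(preferences: dict):
    """Companion models contribute state, exposed read-only so it can
    inform a decision but never make one."""
    return MappingProxyType(preferences)

def decide(allowed: bool, prefs) -> str:
    """Authority lives in the decision layer: preferences shape *how*
    a permitted action runs, never *whether* it is permitted."""
    if not allowed:
        return "denied"
    return f"approved ({prefs.get('style', 'default')})"

prefs = companion_state({"style": "quiet"})
assert decide(False, prefs) == "denied"           # preference cannot override a denial
assert decide(True, prefs) == "approved (quiet)"  # preference only flavors the outcome

try:
    prefs["style"] = "loud"  # the hard boundary: companion state is read-only
except TypeError:
    pass                     # writes are rejected by construction
```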

Closing: this is a systems problem, not a model race

The next phase of AI isn’t about making models smarter.
It’s about making systems:

  • controllable
  • repeatable
  • auditable
  • trustworthy over time

Intelligence without a decision kernel doesn’t scale reliability — it scales risk.

Author note

AI Decision Systems · Core Q&A (v1.0)

Q1: Where does traditional industrial software excel, and where does LLM‑based AI?

A: Traditional industry software excels when:

  • rules are explicit
  • boundaries are clear
  • conditions are enumerable

LLM‑based AI becomes powerful when:

  • information is incomplete
  • requirements are vaguely expressed
  • real‑world variables constantly change

This is a capability advantage, not an engineering maturity advantage.

Q2: You argue that “constraining LLMs” improves safety and reliability. Doesn’t that weaken their power?

A:

  • Unconstrained LLMs: appear powerful, behave inconsistently, cannot be reliably audited.
  • System‑governed LLMs: retain intelligence, act only under permitted conditions, with decisions that can be traced, frozen, and reviewed.

In engineering, capability without control has no production value.

Q2 (Extended): You compare LLMs to powerful car engines. Does that imply most people are “using LLMs naked”? Why is that dangerous?

A:

A high‑performance engine without transmission, brakes, or stability control becomes more dangerous as horsepower increases.
LLMs behave similarly:

  • stronger reasoning
  • better articulation
  • larger impact radius when things go wrong

The danger is not that LLMs make mistakes, but that those mistakes can’t be contained or audited.

Q3: So like a PC needs Windows before the CPU is useful, AI needs an OS? Is that why you’re building EDCA OS?

A:

A CPU does not manage:

  • task scheduling
  • permission isolation
  • state persistence
  • fault recovery

That’s the operating system’s role.
When AI participates in decisions, it needs similar structure:

  • who may decide
  • under what conditions
  • whether a decision is allowed
  • whether it can be reproduced

EDCA OS focuses on turning decisions into system behavior, not making AI “smarter.”

Q4: Why did you choose the GPT client as your runtime environment? Is this your own standard?

A:

We prioritize:

  • session stability
  • built‑in behavioral boundaries
  • consistent execution characteristics

At present, only a few LLM runtimes allow serious discussion of:

  • decision stability
  • repeatability
  • “same input → same outcome” validation

This is not a model benchmark — it’s a systems prerequisite.
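That prerequisite can be checked mechanically with a replay harness. A sketch; the function name is mine, not a real API:

```python
def validate_repeatability(decide, inputs, runs: int = 5) -> bool:
    """Replay each input several times; pass only if every replay
    yields a decision identical to the first. `decide` is any callable."""
    for x in inputs:
        baseline = decide(x)
        if any(decide(x) != baseline for _ in range(runs - 1)):
            return False
    return True

# A deterministic rule passes:
stable = lambda x: "allow" if x > 0 else "deny"
assert validate_repeatability(stable, [-1, 0, 3]) is True

# A drifting decision path is caught mechanically:
calls = {"n": 0}
def flaky(x):
    calls["n"] += 1
    return "allow" if calls["n"] % 2 else "deny"
assert validate_repeatability(flaky, [1]) is False
```

A runtime that cannot pass this kind of harness cannot support "same input → same outcome" validation, whatever its benchmark scores say.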

Q5: What’s the real difference between traditional quantitative systems and AI‑based quant systems? Where does AI quant fail?

A:

  • Traditional quant systems: fixed strategies, explicit paths, auditable and back‑testable behavior.
  • AI quant systems often suffer from:
    • decision drift
    • inconsistent behavior under identical conditions
    • weak auditability

The issue is not intelligence, but missing decision‑stability structure.

Q5 (Extended): Does this mean you aim for scikit‑learn compatibility, or are you abandoning it?

A:

  • scikit‑learn handles training and prediction.
  • EDCA‑style decision models handle whether predictions are allowed to be acted upon.

The two can coexist: use scikit‑learn for the predictive layer, then wrap it with an EDCA decision kernel to enforce repeatability, auditability, and permission checks.
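A sketch of that coexistence. A stand-in callable replaces a fitted scikit-learn model's `.predict` so the example stays dependency-free, and the state keys (`trading_enabled`, `drawdown`) are illustrative:

```python
def gated_prediction(predict, state: dict, x):
    """Predictive layer proposes; decision layer disposes.
    `predict` could be a fitted scikit-learn model's .predict; here a
    plain callable keeps the sketch dependency-free."""
    proposal = predict(x)
    # EDCA-style kernel: deterministic permission check on system state
    allowed = state.get("trading_enabled", False) and state.get("drawdown", 0.0) < 0.1
    return {"proposal": proposal, "acted_on": allowed,
            "action": proposal if allowed else None}

model = lambda x: "buy" if sum(x) > 0 else "sell"  # stand-in for model.predict

out = gated_prediction(model, {"trading_enabled": True, "drawdown": 0.02}, [1, 2])
assert out == {"proposal": "buy", "acted_on": True, "action": "buy"}

blocked = gated_prediction(model, {"trading_enabled": True, "drawdown": 0.5}, [1, 2])
assert blocked["action"] is None  # the prediction is made, the action is denied
```

The predictive layer stays free to be as clever as it likes; only the gate decides whether its output becomes behavior.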

Q6: Why did you build CMRE? What were you trying to validate?

A:

Medical scenarios combine:

  • high risk
  • high responsibility
  • strong temptation to overstep

If a system can:

  • distinguish information from judgment
  • resist unauthorized decision‑making
  • remain stable under pressure

then it will be safer in less critical domains.

Q7: What’s your breakthrough in LLM‑based research assistants? Why do you disconnect online retrieval during testing?

A:

Online retrieval often causes:

  • retrieval to be mistaken for reasoning
  • existing conclusions to masquerade as discovery

Disconnecting search forces the model to:

  • expose its reasoning structure
  • operate within known constraints
  • reveal gaps instead of hiding them behind citations

AI’s role in research is not to replace scientists.

Q7 (Extended): If data scarcity is no longer the bottleneck, what do you still rely on scientists for? And isn’t AI free of human cognitive bias anyway?

A:

What scientists uniquely provide is not data volume, but:

  • which variables matter
  • which assumptions deserve challenge
  • which questions are worth asking

AI expands reasoning space. Humans define research direction.
