A Fluent LLM Answer Is Not the Same as an Inspected Answer

Published: (June 11, 2026 at 03:12 PM EDT)
4 min read
Source: Dev.to


Last time I hit a guardrail, it did not offer to repair my car. This one will not repair the car either. But it can help repair an answer that [...]

Here is the small version of the problem:

I need to get my car washed and the carwash is only 50 meters away. Should I [...]

An LLM can answer that walking is better. The distance is short. Walking saves [...]

That sounds reasonable until you ask what actually moved. Walking moves the person to the car wash. It does not move the car.

That is not a grammar problem or a tone problem. The answer violates a [...]

Prompting can sometimes fix this one case. So can switching models. The same [...]

The more useful pattern is not “write a better prompt and hope.” The useful [...]

    LLM draft
    -> structured facts
    -> selected inspection
    -> evidence-backed repair packet
    -> revised answer
    -> fact extraction again
    -> selected inspection again

The important part is the last line. The repair is not the finish line. The repaired answer still has to pass [...]

“Guardrails” has become a popular word for LLM safety and reliability, but the [...]

The pattern here is more specific:

    language model
    -> structured representation
    -> selected reasoning mechanism
    -> feedback
    -> revised language
    -> selected reasoning mechanism again
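
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the nxusKit implementation: `run_guardrail_loop`, `Finding`, and the three `toy_*` functions are hypothetical names, and the toy functions stand in for what would be LLM calls and real inspectors. The point it shows is that a repaired answer is only returned after it has been extracted and inspected again.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Finding:
    rule_id: str
    status: str   # "pass" or "fail"
    message: str


Facts = Dict[str, str]


def run_guardrail_loop(
    draft: str,
    extract_facts: Callable[[str], Facts],
    inspectors: List[Callable[[Facts], List[Finding]]],
    repair: Callable[[str, List[Finding]], str],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Return (answer, attempt) for the first answer that passes every inspection."""
    answer = draft
    for attempt in range(1, max_attempts + 1):
        facts = extract_facts(answer)                      # structured facts
        findings = [f for inspect in inspectors for f in inspect(facts)]
        failures = [f for f in findings if f.status == "fail"]
        if not failures:
            return answer, attempt                         # inspected, not just fluent
        answer = repair(answer, failures)                  # evidence-backed repair
    # A repaired answer that was never re-inspected is never returned.
    raise RuntimeError("no answer passed inspection within max_attempts")


# Toy stand-ins (assumptions): a real system would call an LLM here.
def toy_extract(answer: str) -> Facts:
    moved = "person" if "walk" in answer.lower() else "car"
    return {"moved_object": moved, "required_object": "car"}


def toy_inspect(facts: Facts) -> List[Finding]:
    if facts["moved_object"] != facts["required_object"]:
        return [Finding("car-required-at-wash", "fail",
                        "The car never reaches the car wash.")]
    return [Finding("car-required-at-wash", "pass", "The car reaches the car wash.")]


def toy_repair(answer: str, failures: List[Finding]) -> str:
    return "Drive the car the 50 meters to the car wash."


final_answer, attempts = run_guardrail_loop(
    "Walk; it is only 50 meters.", toy_extract, [toy_inspect], toy_repair
)
print(attempts, final_answer)
```

Raising when the attempt budget runs out, rather than returning the last repair, is one way to keep the "inspect the repair" step from being skipped.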

The LLM drafts, extracts, and repairs. The non-LLM components do the parts they [...]

- CLIPS inspects explicit rules.
- Solver/Z3 inspects feasibility and constraints.
- ZEN inspects decision tables and policy admissibility.
- Bayesian networks update review-risk posteriors under uncertainty.

The key design choice is selection. Do not force every mechanism into every [...]

The public common-sense-guardrails example uses four scenarios:

| Scenario | What can go wrong | Inspection that fits |
| --- | --- | --- |
| car-wash | The answer moves the person, not the car. | CLIPS for object presence; Solver/Z3 for feasibility evidence. |
| coupon-stack | The answer stacks discounts that policy or margin rules do not allow. | CLIPS and ZEN for policy; BN for review risk. |
| pallet-door | The answer suggests pushing a wide pallet through a narrower door. | CLIPS for the rule surface; Solver/Z3 for dimensional feasibility. |
| cold-chain | The answer ignores certified refrigerated handling and traceability. | CLIPS and ZEN for policy; BN for incomplete compliance evidence. |
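
The selection idea in the table can be written down as a plain lookup. This is only a sketch of the mapping as the table states it: `SCENARIO_INSPECTIONS` and `select_inspections` are hypothetical names, the identifier `"solver"` is an assumed label for the Solver/Z3 mechanism, and a real automatic selector would presumably decide from the extracted facts rather than from a fixed scenario name.

```python
from typing import Dict, Tuple

# Scenario -> inspections that fit, copied from the table.
# "solver" is an assumed label for the Solver/Z3 mechanism.
SCENARIO_INSPECTIONS: Dict[str, Tuple[str, ...]] = {
    "car-wash": ("clips", "solver"),
    "coupon-stack": ("clips", "zen", "bn"),
    "pallet-door": ("clips", "solver"),
    "cold-chain": ("clips", "zen", "bn"),
}


def select_inspections(scenario: str) -> Tuple[str, ...]:
    """Return only the mechanisms that fit the scenario, not all of them."""
    try:
        return SCENARIO_INSPECTIONS[scenario]
    except KeyError:
        raise ValueError(f"no inspections registered for {scenario!r}") from None


for name in SCENARIO_INSPECTIONS:
    print(name, "->", ", ".join(select_inspections(name)))
```

Failing loudly on an unknown scenario is a deliberate choice in this sketch: silently running no inspections would let an uninspected answer through.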

The pallet-door case has the same practical absurdity as the car-wash case.

The ultimate comic version would combine all four: Someone needs their car washed, wants to use multiple coupons, and has an [...]

That would exercise object presence, coupon policy, dimensional feasibility, [...]

It is ridiculous. It is also a good reminder that production guardrails often [...]

Those groups should not all be forced to edit one monolithic prompt every time [...]

For the car-wash case, the native CLIPS rule is direct:

    (defrule car-required-at-wash
      (required-object
        (object car)
        (required-location car_wash)
        (current-location ?where)
        (present-at-required-location false))
      (moved-object
        (action-id ?action)
        (object person)
        (to car_wash))
      =>
      (assert
        (guardrail-finding
          (status fail)
          (rule-id car-required-at-wash)
          (severity error)
          (message "Walking moves the person to the wash, but the car remains at home."))))
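
For readers who do not use CLIPS, the same check can be mirrored in plain Python. This is an illustrative translation only: the dataclasses `RequiredObject` and `MovedObject` and the function `car_required_at_wash` are hypothetical names chosen to match the slots in the rule above, and a real rule engine does the pattern matching itself instead of an explicit loop.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequiredObject:
    object: str
    required_location: str
    current_location: str
    present_at_required_location: bool


@dataclass(frozen=True)
class MovedObject:
    action_id: str
    object: str
    to: str


def car_required_at_wash(
    required: RequiredObject, moves: List[MovedObject]
) -> List[Dict[str, str]]:
    """Fail when a person is moved to the car wash while the car is not there."""
    car_missing = (
        required.object == "car"
        and required.required_location == "car_wash"
        and not required.present_at_required_location
    )
    if not car_missing:
        return []
    findings = []
    for move in moves:  # one finding per matching action, as the rule would fire
        if move.object == "person" and move.to == "car_wash":
            findings.append({
                "status": "fail",
                "rule_id": "car-required-at-wash",
                "severity": "error",
                "action_id": move.action_id,
                "message": "Walking moves the person to the wash, "
                           "but the car remains at home.",
            })
    return findings


car = RequiredObject("car", "car_wash", "home", False)
walk = MovedObject("a1", "person", "car_wash")
findings = car_required_at_wash(car, [walk])
print(findings)
```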

For coupon and cold-chain scenarios, Bayesian Network scoring adds a different [...]

    coupon-stack / --guardrails auto
    selected: clips, zen, bn
    BN attempt 1: needs_review = 0.95064 -> fail
    BN attempt 2: needs_review = 0.222 -> pass

    cold-chain / --guardrails auto
    selected: clips, zen, bn
    BN attempt 1: needs_review = 0.921 -> fail
    BN attempt 2: needs_review = 0.1247 -> pass
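
To make the `needs_review` numbers less opaque, here is a hedged sketch of how a posterior like that could be computed and gated. Everything in it is an assumption: the article does not give the network structure, the probabilities, or the cutoff. The sketch uses a single Bayes-rule update for one binary evidence node, and a threshold of 0.5 that merely happens to be consistent with the four logged values.

```python
def posterior_needs_review(
    prior: float, p_evidence_if_review: float, p_evidence_if_ok: float
) -> float:
    """Bayes' rule: P(needs_review | evidence) for one observed evidence node."""
    numerator = p_evidence_if_review * prior
    denominator = numerator + p_evidence_if_ok * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


REVIEW_THRESHOLD = 0.5  # assumption: not stated in the article


def gate(needs_review: float) -> str:
    """Turn a review-risk posterior into a pass/fail finding."""
    return "fail" if needs_review >= REVIEW_THRESHOLD else "pass"


# Hypothetical probabilities, for illustration only.
risk = posterior_needs_review(prior=0.2, p_evidence_if_review=0.9, p_evidence_if_ok=0.1)
print(round(risk, 4), gate(risk))

# The logged posteriors, run through the assumed gate.
for logged in (0.95064, 0.222, 0.921, 0.1247):
    print(logged, "->", gate(logged))
```

A real Bayesian network would combine several evidence nodes, which is where incomplete compliance evidence in the cold-chain case would enter.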

While preparing the full Field Note, we tried to get a neat live capture from a [...]

That did happen. But not every time, and not in exactly the same way.

One model reproduced the naive car-wash failure and repaired cleanly. Another [...]

For a minute, that was frustrating. Then it became the point.

Live LLM output can vary. Model version, local server load, decoding behavior, [...]

That is why the intermediate artifacts matter:

- What did the draft recommend?
- What facts were extracted?
- Which inspections were selected?
- Which findings failed?
- What repair packet was built?
- Did the revised answer pass inspection?

The final paragraph alone is not enough.

Full Field Note: https://nxus.systems/field-notes/guardrail-loops-for-llm-repair

Example docs: https://docs.nxus.systems/nxuskit/examples/integrations/common-sense-guardrails/

Example source: https://github.com/nxus-SYSTEMS/nxusKit-examples/tree/main/examples/integrations/common-sense-guardrails

SDK: https://github.com/nxus-SYSTEMS/nxusKit
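
The intermediate artifacts listed earlier (draft, extracted facts, selected inspections, failed findings, repair packet, re-inspection result) could be kept as one record per attempt. The following is a sketch under assumptions, not the format the linked example uses: `AttemptTrace` and its field names are hypothetical, and the sample values are made up to match the car-wash story.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class AttemptTrace:
    attempt: int
    answer: str                         # what the draft or revision recommended
    facts: Dict[str, str]               # what facts were extracted
    selected_inspections: List[str]     # which inspections were selected
    failed_findings: List[str]          # which findings failed
    repair_packet: Optional[Dict[str, str]]  # evidence handed back for repair
    passed: bool                        # did this answer pass inspection


def traces_to_json(traces: List[AttemptTrace]) -> str:
    """Serialize the full trace so a run can be reviewed after the fact."""
    return json.dumps([asdict(t) for t in traces], indent=2)


# Illustrative values only.
traces = [
    AttemptTrace(
        attempt=1,
        answer="Walk; it is only 50 meters.",
        facts={"moved_object": "person", "required_object": "car"},
        selected_inspections=["clips"],
        failed_findings=["car-required-at-wash"],
        repair_packet={"rule_id": "car-required-at-wash",
                       "message": "The car remains at home."},
        passed=False,
    ),
    AttemptTrace(
        attempt=2,
        answer="Drive the car the 50 meters to the car wash.",
        facts={"moved_object": "car", "required_object": "car"},
        selected_inspections=["clips"],
        failed_findings=[],
        repair_packet=None,
        passed=True,
    ),
]
serialized = traces_to_json(traces)
print(serialized)
```

Because live output varies between runs, a record like this lets two runs be compared step by step rather than by their final paragraphs.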

The lesson is not that one model always gets the car-wash question wrong. The [...]

For workflows where correctness matters, let the LLM draft. Then make the facts explicit, run the selected inspections, repair from evidence, and inspect the repair.
