An LLM Walks Into General Relativity - Lessons from a Devoxx Talk
Source: Dev.to
Why fluent AI-generated technical content can still be fundamentally incorrect, and how to fix it with system design.
At Devoxx, I presented a simple experiment:
What happens if you ask an LLM to generate an entire technical presentation on General Relativity?
The model produces something impressive:
well-structured slides
correct terminology
equations
citations
a coherent narrative
It looks like something you could present. And yet, parts of it are fundamentally wrong. Not obviously wrong, but convincingly wrong.
This is the real problem with AI-generated technical content.
Large Language Models are extremely good at:
structure
storytelling
pedagogy
But they are not built to preserve:
physical constraints
invariants
measurement consistency
In physics, that becomes “obvious” very quickly 😉.
A model might say:
“Light slows down in gravity, so time slows down.”
This sounds reasonable. But it’s wrong, or at best deeply misleading, because:
measured locally, the speed of light is always c
time dilation is defined through clock comparisons, not metaphors
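To make the clock-comparison point concrete, here is the standard textbook result for static clocks outside a non-rotating mass (the Schwarzschild solution). The symbols r_A, r_B, M and r_s are introduced here for illustration and do not come from the talk.

```latex
% Schwarzschild radius of a mass M
r_s = \frac{2GM}{c^2}

% Ratio of proper-time intervals ticked by static clocks at radii r_A and r_B
\frac{d\tau_A}{d\tau_B} = \sqrt{\frac{1 - r_s/r_A}{1 - r_s/r_B}}

% Coordinate speed of a radial light ray in Schwarzschild coordinates
\left|\frac{dr}{dt}\right| = c\left(1 - \frac{r_s}{r}\right)
```

The second line is a statement about comparing two clocks, for example by exchanging light signals, which is what time dilation means operationally. The third line is a coordinate quantity, not something a local observer measures: any measurement made with local rods and clocks still gives c. “Light slows down” blurs the two.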
This is what I call:
Frame confusion
The model mixes:
different observers
different measurement definitions
intuitive metaphors
…into a single explanation.
Everything reads smoothly. But the reasoning is broken.
General Relativity is unforgiving.
You can’t get away with:
vague explanations
metaphor-only reasoning
mixing frames of reference
Every statement must answer:
“How would you measure that?”
If you can’t answer that, the explanation is incomplete, or wrong. This makes physics an ideal domain to expose LLM weaknesses.
Instead of trying to “prompt better”, I built a system around the model.
The goal:
Not to make the model smarter, but to make the output auditable and correctable.
The system is a multi-agent pipeline:
Sources → Chunking
→ Retrieval (RAG)
→ Author Agent (generate slides)
→ Schema Validation
→ Post-processing
→ Physics Rule Engine
→ Critic Agent
→ Refinement Loop
→ PowerPoint Rendering
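As a rough illustration of how such a chain of stages can stay auditable, here is a minimal sketch in Python. It is not the code from the talk: `run_pipeline`, `chunk`, `author` and the dict-based state are hypothetical names chosen for this example, and the two placeholder stages stand in for the real retrieval and LLM calls.

```python
from typing import Callable

# Hypothetical stage type: each stage takes the working state and returns a new one.
Stage = Callable[[dict], dict]


def run_pipeline(stages: list[tuple[str, Stage]], state: dict) -> tuple[dict, list[str]]:
    """Run the stages in order and record which ones ran, for later audit."""
    trace = []
    for name, stage in stages:
        state = stage(state)
        trace.append(name)
    return state, trace


# Placeholder stages; real ones would call a retriever, an LLM, a renderer, etc.
def chunk(state: dict) -> dict:
    words = state["source"].split()
    chunks = [" ".join(words[i:i + 4]) for i in range(0, len(words), 4)]
    return {**state, "chunks": chunks}


def author(state: dict) -> dict:
    return {**state, "slides": [{"title": c} for c in state["chunks"]]}


stages = [("chunking", chunk), ("author", author)]
result, trace = run_pipeline(
    stages, {"source": "clocks at different heights tick at different rates"}
)
print(trace)
print(result["slides"])
```

Keeping every stage behind the same signature means a validation or critique step can be added, removed, or replayed without touching the others.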
The model doesn’t output free text. It must generate strict JSON:
slide types
bullet constraints
equations
citations
Validated with Pydantic. If it doesn’t parse, it doesn’t ship.
I implemented rule-based checks like:
Time dilation must reference clocks or measurements
Gravitational waves must reference strain or detectors
No “black holes suck everything in” explanations
Distinguish event horizon vs singularity
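The two deterministic stages described above, strict schema first and physics rules second, could look roughly like the sketch below. Several things here are assumptions: the talk uses Pydantic, while this sketch swaps in a standard-library dataclass so that it runs without dependencies. The field names, `ALLOWED_TYPES`, `MAX_BULLETS` and the regular expressions are hypothetical. The event horizon vs singularity rule is left out because a keyword match is too crude for it.

```python
import json
import re
from dataclasses import dataclass, field

ALLOWED_TYPES = {"title", "content", "equation"}  # hypothetical slide types
MAX_BULLETS = 5                                    # hypothetical bullet limit


@dataclass
class Slide:
    type: str
    title: str
    bullets: list = field(default_factory=list)
    citations: list = field(default_factory=list)


def parse_slide(raw: str) -> Slide:
    """Parse one slide from JSON; raise ValueError if it violates the schema."""
    try:
        slide = Slide(**json.loads(raw))
    except (json.JSONDecodeError, TypeError) as exc:
        raise ValueError(f"schema error: {exc}") from exc
    if not isinstance(slide.type, str) or slide.type not in ALLOWED_TYPES:
        raise ValueError(f"schema error: unknown slide type {slide.type!r}")
    if not isinstance(slide.title, str):
        raise ValueError("schema error: title must be a string")
    if not isinstance(slide.bullets, list) or not all(
        isinstance(b, str) for b in slide.bullets
    ):
        raise ValueError("schema error: bullets must be a list of strings")
    if len(slide.bullets) > MAX_BULLETS:
        raise ValueError(f"schema error: more than {MAX_BULLETS} bullets")
    return slide


# (topic pattern, pattern that must also appear, failure message)
REQUIRE_RULES = [
    (r"time dilation", r"clock|measur",
     "time dilation must reference clocks or measurements"),
    (r"gravitational wave", r"strain|detector",
     "gravitational waves must reference strain or detectors"),
]
# (banned pattern, failure message)
FORBID_RULES = [
    (r"black holes? suck", 'no "black holes suck everything in" explanations'),
]


def check_rules(slide: Slide) -> list:
    """Return the messages of all rules this slide violates."""
    text = " ".join([slide.title, *slide.bullets]).lower()
    failures = []
    for topic, required, message in REQUIRE_RULES:
        if re.search(topic, text) and not re.search(required, text):
            failures.append(message)
    for banned, message in FORBID_RULES:
        if re.search(banned, text):
            failures.append(message)
    return failures


raw = (
    '{"type": "content", "title": "Time dilation", '
    '"bullets": ["Light slows down in gravity, so time slows down."]}'
)
slide = parse_slide(raw)
print(check_rules(slide))
```

Keyword rules like these are cheap and deterministic, but a slide can pass them just by mentioning “clocks” without actually reasoning about a measurement, which is why they are a filter and not a proof of correctness.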
These rules catch systematic failure patterns instantly.
A second LLM reviews the output:
checks clarity
checks reasoning
suggests corrections
But importantly, it runs after deterministic validation.
Generate → Validate → Critique → Revise
This loop runs until:
errors are reduced
or a maximum number of iterations is reached
From a real run:
Draft deck → 6 failing slides
After refinement → 4 failing slides
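The loop described above can be sketched as follows. This is one possible reading, not the talk's implementation: the stop conditions (no failures left, no improvement over the previous round, or the iteration budget spent) are an assumption, and `refine`, `validate`, `critique` and `revise` are hypothetical names with toy stand-ins for the rule engine and the two LLM calls.

```python
from typing import Callable


def refine(
    deck: list,
    validate: Callable[[list], list],
    critique: Callable[[list, list], list],
    revise: Callable[[list, list], list],
    max_iterations: int = 3,
):
    """Validate, critique, revise; stop when clean, stalled, or out of budget.

    Returns the final deck and the failing-slide count seen at each validation.
    """
    failing = validate(deck)              # deterministic checks run first
    history = [len(failing)]
    for _ in range(max_iterations):
        if not failing:
            break
        notes = critique(deck, failing)   # in the real pipeline: a second LLM
        deck = revise(deck, notes)
        new_failing = validate(deck)
        history.append(len(new_failing))
        stalled = len(new_failing) >= len(failing)
        failing = new_failing
        if stalled:
            break
    return deck, history


# Toy stand-ins: a slide "fails" if it never mentions a clock.
def validate(deck: list) -> list:
    return [i for i, text in enumerate(deck) if "clock" not in text]


def critique(deck: list, failing: list) -> list:
    return failing[:1]                    # flag only one slide per round


def revise(deck: list, notes: list) -> list:
    return [
        text + " (compared between two clocks)" if i in notes else text
        for i, text in enumerate(deck)
    ]


draft = ["time slows down", "gravity bends time", "clocks tick slower lower down"]
final_deck, history = refine(draft, validate, critique, revise)
print(history)
```

Recording the failing count per round is what turns “the deck got better” into a number that can be compared across runs.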
We didn’t achieve perfection. But we achieved something more important:
We made correctness measurable.
Even with this pipeline:
Citations may look correct but not truly support claims
Subtle reasoning errors remain
Frame confusion is hard to eliminate
Models can satisfy rules while staying vague
Human review is still necessary.
Reliable AI is not a prompting problem.
These patterns generalize beyond physics:
legal documents
financial reports
medical summaries
architecture decisions
Use:
structured outputs
deterministic validation
domain-specific rules
critique loops
GitHub Repo: https://github.com/tase-nikol/gr-deck-agent
Example commands:
gr-deck-agent index
gr-deck-agent draft
gr-deck-agent review
gr-deck-agent refine
gr-deck-agent replay
You can watch the full talk here:
YouTube: https://www.youtube.com/watch?v=NanGs7ZMQEE
LLMs don’t “understand” systems. They generate plausible descriptions of them.
If your domain has constraints, invariants, or correctness requirements:
You need to build those constraints into the system, not hope the model learns them.
If you’re working with AI-generated technical content, I’d love to hear:
what failure modes you’ve seen
how you validate outputs
what worked (or didn’t)
AI is fluent.
But reality is not optional.