An LLM Walks Into General Relativity - Lessons from a Devoxx Talk

Published: May 10, 2026 at 01:18 PM EDT


Source: Dev.to


Why fluent AI-generated technical content can still be fundamentally incorrect, and how to fix it with system design.

At Devoxx, I presented a simple experiment: what happens if you ask an LLM to generate an entire technical presentation on General Relativity?

The model produces something impressive:

- well-structured slides
- correct terminology
- equations
- citations
- a coherent narrative

It looks like something you could present. And yet, parts of it are fundamentally wrong. Not obviously wrong, convincingly wrong. This is the real problem with AI-generated technical content.

Large Language Models are extremely good at:

- structure
- storytelling
- pedagogy

But they are not built to preserve:

- physical constraints
- invariants
- measurement consistency

In physics, that becomes “obvious” very quickly 😉.

A model might say: “Light slows down in gravity, so time slows down.”

This sounds reasonable. But it’s wrong, or at best deeply misleading, because:

- locally, the speed of light is always the constant c
- time dilation is defined through clock comparisons, not metaphors

This is what I call frame confusion. The model mixes:

- different observers
- different measurement definitions
- intuitive metaphors

…into a single explanation. Everything reads smoothly. But the reasoning is broken.

General Relativity is unforgiving. You can’t get away with:

- vague explanations
- metaphor-only reasoning
- mixing frames of reference

Every statement must answer: “How would you measure that?” If you can’t answer that, the explanation is incomplete, or wrong. This makes physics an ideal domain to expose LLM weaknesses.

Instead of trying to “prompt better”, I built a system around the model. The goal: not to make the model smarter, but to make the output auditable and correctable.

The system is a multi-agent pipeline:

Sources → Chunking → Retrieval (RAG) → Author Agent (generate slides) → Schema Validation → Post-processing → Physics Rule Engine → Critic Agent → Refinement Loop → PowerPoint Rendering

The model doesn’t output free text. It must generate strict JSON:

- slide types
- bullet constraints
- equations
- citations

The JSON is validated with Pydantic. If it doesn’t parse, it doesn’t ship.

I implemented rule-based checks like:

- Time dilation must reference clocks or measurements
- Gravitational waves must reference strain or detectors
- No “black holes suck everything in” explanations
- Distinguish event horizon vs singularity
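To make the "strict JSON, then deterministic rules" idea concrete, here is a minimal sketch. It is not the code from the talk or the repo: the names `Slide`, `Rule`, `parse_slide`, and `check_slide`, the field names, and the keyword patterns are all assumptions made for illustration. The talk validates with Pydantic; this sketch swaps in standard-library dataclasses so it runs without dependencies. Only three of the four rules are encoded, because telling an event horizon from a singularity likely needs more than keyword matching.

```python
import json
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slide:
    """Hypothetical slide shape; the real schema may differ."""
    title: str
    bullets: List[str]
    citations: List[str] = field(default_factory=list)

    def text(self) -> str:
        # Lowercased title + bullets, used for keyword rules.
        return " ".join([self.title, *self.bullets]).lower()


def parse_slide(raw: str) -> Slide:
    """Parse strict JSON; raise ValueError on anything malformed."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("slide must be a JSON object")
    extra = set(data) - {"title", "bullets", "citations"}
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    title = data.get("title")
    bullets = data.get("bullets")
    citations = data.get("citations", [])
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title must be a non-empty string")
    for name, value in (("bullets", bullets), ("citations", citations)):
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{name} must be a list of strings")
    return Slide(title=title, bullets=bullets, citations=citations)


@dataclass(frozen=True)
class Rule:
    name: str
    trigger: str                     # regex: the rule applies when this matches
    required: Optional[str] = None   # regex that must then also match
    forbidden: Optional[str] = None  # regex that must then not match


# Keyword patterns are illustrative assumptions, not the talk's rule set.
RULES = [
    Rule("time-dilation-needs-clocks", r"time dilation", required=r"clock|measur"),
    Rule("gw-needs-strain-or-detector", r"gravitational wave", required=r"strain|detector"),
    Rule("no-black-holes-suck", r"black hole", forbidden=r"\bsuck"),
]


def check_slide(slide: Slide) -> List[str]:
    """Return the names of all rules the slide violates."""
    text = slide.text()
    violations = []
    for rule in RULES:
        if not re.search(rule.trigger, text):
            continue  # rule is about a topic this slide does not mention
        if rule.required and not re.search(rule.required, text):
            violations.append(rule.name)
        elif rule.forbidden and re.search(rule.forbidden, text):
            violations.append(rule.name)
    return violations
```

Calling `check_slide(parse_slide(raw))` on a model response either raises on malformed JSON or returns a list of violated rule names, which is what makes a slide countable as "failing". Keyword rules like these are cheap and deterministic, but a slide can satisfy them while staying vague.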

These rules catch systematic failure patterns instantly.

A second LLM reviews the output. It:

- checks clarity
- checks reasoning
- suggests corrections

But importantly, it runs after deterministic validation:

Generate → Validate → Critique → Revise

This loop runs until:

- errors are reduced, or
- a maximum number of iterations is reached

From a real run:

Draft deck → 6 failing slides
After refinement → 4 failing slides
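The Generate → Validate → Critique → Revise loop above could be sketched roughly as follows. This is an illustration under assumptions, not the repo's implementation: `refinement_loop` and the three `stub_*` functions are hypothetical names, the deck is simplified to one string per slide, and the stubs stand in for the rule engine and the two model calls. The sketch reads "errors are reduced" as "no failing slides remain", which is one possible interpretation.

```python
from typing import Callable, Dict, List, Tuple

Deck = List[str]        # simplification: one string per slide
Failures = List[int]    # indices of failing slides
Notes = Dict[int, str]  # slide index -> critic suggestion


def refinement_loop(
    deck: Deck,
    validate: Callable[[Deck], Failures],
    critique: Callable[[Deck, Failures], Notes],
    revise: Callable[[Deck, Notes], Deck],
    max_iterations: int = 3,
) -> Tuple[Deck, List[int]]:
    """Run validate -> critique -> revise until clean or out of iterations.

    Returns the final deck and the failing-slide count after each pass.
    """
    failing = validate(deck)
    history = [len(failing)]
    for _ in range(max_iterations):
        if not failing:
            break  # nothing left to fix
        notes = critique(deck, failing)  # critic runs only after validation
        deck = revise(deck, notes)
        failing = validate(deck)
        history.append(len(failing))
    return deck, history


# --- Stubs standing in for the rule engine and the model calls ---

def stub_validate(deck: Deck) -> Failures:
    # Toy rule: time dilation must mention clocks.
    return [
        i for i, slide in enumerate(deck)
        if "time dilation" in slide.lower() and "clock" not in slide.lower()
    ]


def stub_critique(deck: Deck, failing: Failures) -> Notes:
    return {i: "Compare two clocks." for i in failing}


def stub_revise(deck: Deck, notes: Notes) -> Deck:
    # Returns a new deck; a real reviser would call the author model.
    return [
        f"{slide} {notes[i]}" if i in notes else slide
        for i, slide in enumerate(deck)
    ]
```

The returned history is the useful part: a sequence of failing-slide counts per pass is what turns "6 failing slides, then 4" into something you can log and compare across runs.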

We didn’t achieve perfection. But we achieved something more important: we made correctness measurable.

Even with this pipeline:

- Citations may look correct but not truly support claims
- Subtle reasoning errors remain
- Frame confusion is hard to eliminate
- Models can satisfy rules while staying vague

Human review is still necessary.

Reliable AI is not a prompting problem. These patterns generalize beyond physics:

- legal documents
- financial reports
- medical summaries
- architecture decisions

Use:

- structured outputs
- deterministic validation
- domain-specific rules
- critique loops

GitHub Repo: https://github.com/tase-nikol/gr-deck-agent

Example commands:

gr-deck-agent index
gr-deck-agent draft
gr-deck-agent review
gr-deck-agent refine
gr-deck-agent replay

You can watch the full talk here:

YouTube: https://www.youtube.com/watch?v=NanGs7ZMQEE

LLMs don’t “understand” systems. They generate plausible descriptions of them. If your domain has constraints, invariants, or correctness requirements, you need to build those constraints into the system, not hope the model learns them.

If you’re working with AI-generated technical content, I’d love to hear:

- what failure modes you’ve seen
- how you validate outputs
- what worked (or didn’t)

AI is fluent. But reality is not optional.
