Building an AI-Powered Code Editor (Part 2): The LLM as Interpreter

Published: (December 29, 2025 at 05:42 PM EST)
6 min read
Source: Dev.to

The Insight

While building LLM CodeForge, an agentic editor that lets LLMs read, modify, and test code autonomously, I realized something after writing 5,000 tokens of instructions:

I wasn’t writing a prompt. I was building a domain‑specific language (DSL) embedded in natural language.

This article analyzes how and why this distinction is fundamental—and what you can learn for your own agentic systems.

How the Model Behaves in CodeForge

| Aspect | What the model does | What the model doesn't |
|---|---|---|
| Decision making | Selects which branch of the protocol to follow | Does not decide what to do |
| Problem solving | Executes a procedure described in natural language | Does not solve problems creatively |
| Nature | Functions like a bytecode interpreter, a text‑driven finite‑state machine, or a planner with closed actions | Relies on deterministic control flow, not on open‑ended reasoning |
| Reliability | Works because I accept the LLM is fundamentally unreliable | N/A |

DSL Control Flow

Every request follows four steps:

[UNDERSTAND] → [GATHER] → [EXECUTE] → [RESPOND]

Step 1 – UNDERSTAND

Classify request type

| Type | Keywords | Next Step |
|---|---|---|
| Explanation | "what is", "explain" | [RESPOND] (text) |
| Modification | "add", "change" | [GATHER] → [EXECUTE] |
| Analysis | "analyze", "show" | [GATHER] → [RESPOND] |

This is not chain‑of‑thought in the classic sense. It is deterministic task routing—a decision table mapping input → workflow. The model doesn’t “think”; it executes a conditional jump.
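The decision table above can be sketched as ordinary code. This is a minimal illustration of keyword-based task routing, not CodeForge's actual implementation; the names `ROUTES` and `routeRequest` are my own:

```javascript
// Decision table: keywords map deterministically to a workflow.
// Classification is a lookup, not open-ended reasoning.
const ROUTES = [
  { keywords: ["what is", "explain"], workflow: ["RESPOND"] },
  { keywords: ["add", "change"],      workflow: ["GATHER", "EXECUTE"] },
  { keywords: ["analyze", "show"],    workflow: ["GATHER", "RESPOND"] },
];

function routeRequest(request) {
  const text = request.toLowerCase();
  for (const route of ROUTES) {
    if (route.keywords.some((kw) => text.includes(kw))) {
      return route.workflow;
    }
  }
  return ["RESPOND"]; // safe default: answer in text, modify nothing
}
```

The same table can live in the prompt (for the model) and in code (for the host), so both sides agree on which branch a request takes.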

Invariant Rules

🚨 CRITICAL RULE: You CANNOT use update_file on a file you haven’t read in this conversation.

Self‑check before ANY update_file

  • Did I receive the content from the system?
  • Do I know the exact current state?
  • Am I modifying based on actual code?

If any answer is NO → OUTPUT a read_file ACTION, then STOP.

This is an attempt to define pre‑conditions in natural language. It’s akin to:

def update_file(path, content):
    assert path in conversation_state.read_files
    # ... actual update

Without a type system or automatic runtime enforcement, the rule reduces (but does not eliminate) the probability of the LLM modifying a file without having read it first. In tests I observed ~85‑90 % stability, but server‑side validation remains essential for critical cases.
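That server-side validation can be sketched as a guard the host runs before applying any action. The names (`createConversationState`, `validateAction`) are illustrative, not from the CodeForge repository:

```javascript
// Host-side enforcement of the read-before-write invariant:
// track which files the LLM has actually received, and reject
// update_file actions on files that were never read.
function createConversationState() {
  return { readFiles: new Set() };
}

function recordRead(state, path) {
  state.readFiles.add(path);
}

function validateAction(state, action) {
  if (action.type === "update_file" && !state.readFiles.has(action.path)) {
    // Mirror the prompt rule: force a read_file before the update.
    return { ok: false, requiredAction: { type: "read_file", path: action.path } };
  }
  return { ok: true };
}
```

The prompt rule lowers the failure rate; this guard makes the remaining failures harmless, because a violating action is bounced back as a forced `read_file`.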

Multi‑File Tasks: Prompt Regeneration

The most effective technique I implemented is dynamically regenerating the prompt to force the LLM to follow a multi‑step plan.

Concrete Scenario

User request: “Add authentication to the project.”

  1. Planning – LLM generates a plan:
{
  "plan": "I will modify these files in order:",
  "files_to_modify": ["Auth.js", "Login.jsx", "App.jsx"]
}
  2. First file – LLM works on Auth.js and completes it.

  3. Trick – Instead of asking the LLM to “remember the plan”, I regenerate the prompt with an explicit “next‑action” block:

### ⚠️ MULTI‑FILE TASK IN PROGRESS

You completed: Auth.js  
Remaining files: Login.jsx, App.jsx

### 🚨 REQUIRED ACTION
Your next output MUST be:
{"action":"continue_multi_file","next_file":{"path":"Login.jsx"}}

Do NOT do anything else. Do NOT deviate from the plan.

The LLM no longer needs to “remember” anything. The only possible action is baked into the prompt.

Why It’s Powerful

  • The state (which files are done, which remain) lives in external JavaScript, not in the LLM’s memory.
  • At every step I regenerate the prompt with the updated state.
  • The LLM always sees a single, unambiguous instruction.

Implementation Sketch

// In the code
function buildPrompt(multiFileState) {
  let prompt = BASE_PROMPT;

  if (multiFileState) {
    prompt += `
### TASK IN PROGRESS
Completed: ${multiFileState.completed.join(', ')}
Next: ${multiFileState.next}
Your ONLY valid action: continue_multi_file with ${multiFileState.next}
`;
  }

  return prompt;
}

This is state injection: external state completely controls what the LLM can do next.
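The missing piece is how that external state advances between steps. A minimal sketch, assuming a helper I am calling `advanceMultiFileState` (not from the article's repo): after each completed file, the host, not the LLM, computes what comes next.

```javascript
// After each completed file, the host updates the external state.
// The LLM never carries this state; it is re-injected into the prompt.
function advanceMultiFileState(state, completedFile) {
  const remaining = state.remaining.filter((f) => f !== completedFile);
  return {
    completed: [...state.completed, completedFile],
    remaining,
    next: remaining[0] ?? null, // null means the multi-file task is done
  };
}
```

Each turn, the updated state is fed back through `buildPrompt`, so the model's "memory" is just the latest prompt.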

Custom Delimiters for Mixed Content

To handle different “types” (JSON, code, metadata) the DSL uses custom delimiters:

#[json-data]
{"action":"create_file","file":{"path":"App.jsx"}}
#[end-json-data]

#[file-message]
This file implements the main app component.
#[end-file-message]

#[content-file]
export default function App() {
  return <h1>Hello World</h1>;
}
#[end-content-file]

Why Not Plain JSON or XML?

  • The content may contain {}, <>, etc., requiring complex escaping.
  • #[tag]…#[end-tag] is syntactically unique, easy to parse with regex, and independent of the embedded language.
  • It behaves like a context‑free grammar separating semantic layers.
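The regex-based parsing the delimiters enable can be sketched in a few lines; `parseBlocks` is an illustrative name, not the repository's actual parser:

```javascript
// Extract #[tag]...#[end-tag] blocks. The backreference \1 ties each
// opening tag to its own closing tag, and the embedded JSON/JSX needs
// no escaping because the delimiter syntax never occurs in it.
function parseBlocks(output) {
  const blocks = {};
  const pattern = /#\[([a-z-]+)\]\n?([\s\S]*?)#\[end-\1\]/g;
  let match;
  while ((match = pattern.exec(output)) !== null) {
    blocks[match[1]] = match[2].trim();
  }
  return blocks;
}
```

Because each block type is captured independently, the JSON layer can be `JSON.parse`d while the code layer is passed through verbatim.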

Error‑Example Guidance

Embedding “error examples” directly in the DSL teaches the model the common failure modes—an inline unit‑test for the language.

| ❌ Incorrect | ✅ Correct |
|---|---|
| `{"action":"start_multi_file","plan":{},"first_file":{...}}` | `{"action":"start_multi_file","plan":{},"first_file":{...}}` |
| `#[json-data]{...}#[file-message]...` | `#[json-data]{...}#[end-json-data]#[file-message]...` |

Trade‑offs of a Natural‑Language DSL

| Limitation | Consequence |
|---|---|
| ❌ No verifiable types | No static type checking |
| ❌ No automatic syntactic validation | Errors must be caught at runtime |
| ❌ No AST for transformations | No compile‑time optimizations |

Compensations

  • Huge validation checklist (8+ points)
  • Semantic redundancy – same rule expressed in multiple ways
  • Extensive anti‑pattern documentation

When building a DSL in natural language, these trade‑offs are inevitable, but the resulting system can still be robust, transparent, and controllable.

Overview

The parser is a probabilistic LLM rather than a deterministic compiler.
If I were to evolve CodeForge in the future, a true mini‑DSL (JSON Schema + codegen) would reduce the prompt size by 30‑40 %. In the browser sandbox, however, the current choice is justified.

Pre‑Send Checklist

Before EVERY response, verify:

| # | Check | Fix If Failed |
|---|---|---|
| 1 | JSON valid | Correct structure |
| 2 | Tags complete | Add missing #[end-*] tags |

Alone, this yields 40‑60 % reliability. In my system it reaches 80‑90 % because it acts as a stability multiplier when:

  • The model is already channeled (decision protocol)
  • The format is rigid (custom delimiters)
  • The next action is deterministic (state injection)

Meta‑validation is not the main feature – it is the final safety net in an already constrained system.
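A host-side mirror of that checklist can be sketched as follows; `validateOutput` is an assumed helper, not the repository's actual validator:

```javascript
// Check the two checklist items before accepting the model's output:
// every opening tag has its #[end-*] partner, and the JSON block parses.
function validateOutput(output) {
  const errors = [];
  const openTags = [...output.matchAll(/#\[(?!end-)([a-z-]+)\]/g)].map((m) => m[1]);
  for (const tag of openTags) {
    if (!output.includes(`#[end-${tag}]`)) {
      errors.push(`Missing #[end-${tag}]`);
    }
  }
  const json = output.match(/#\[json-data\]([\s\S]*?)#\[end-json-data\]/);
  if (json) {
    try { JSON.parse(json[1]); } catch { errors.push("Invalid JSON in #[json-data]"); }
  }
  return errors;
}
```

When `validateOutput` returns errors, the host can regenerate the prompt with the error list and ask for a corrected output, closing the loop.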

Model Compatibility

✅ Works well with Claude 3.5, GPT‑4  
❌ Smaller models will fail  
❌ Less‑aligned models will ignore sections

I am implicitly saying: this system requires “serious” models.
It is an architectural constraint I accepted — like saying “this library requires Python 3.10+.”

Contextual Re‑Anchoring

Take the “read‑before‑write” rule:

  • Appears in the Decision Protocol (when planning)
  • Appears in Available Actions (when executing)
  • Appears in Pre‑Send Validation (when verifying)
  • Appears in Golden Rules (as a general principle)

This is strategic repetition, not random redundancy. It mirrors safety‑critical systems:

  • Same invariant
  • Verified at multiple levels
  • With specific phrasing for each context

Pattern Examples

Bad vs. Good Patterns

// BAD: Relying on the model's "memory"
"Remember that you have already read these files..."

// GOOD: Injecting explicit state
prompt += `Files already read: ${readFiles.join(', ')}`

// BAD: Giving open choices
"Decide which operation to perform"

// GOOD: Forcing the only legal move
"Your NEXT action MUST be: continue_multi_file"

Input → Action → Next State

| Input Pattern | Action | Next State |
|---|---|---|
| "add X" | GATHER | EXECUTE |
| "explain Y" | RESPOND | END |

Instead of “think what to do”, use “if X then Y”.

Embedding Arbitrary Content

  • Don’t use JSON – escaping nightmare
  • Don’t use XML – conflicts with HTML/JSX
  • Use unique tags: #[content]…#[end-content]

Pattern #5 – Redundancy = Coverage, Not Noise

  • Repeat critical rules
    • In different formulations (semantic reinforcement)
    • In different contexts (contextual re‑anchoring)
    • With different rationales (why, not just what)

After 5 000 tokens and months of iterations, the most important lesson is:

This prompt is not “beautiful.” It is effective.

Optimization Shift

| ❌ Stopped looking for | ✅ Started optimizing for |
|---|---|
| The shortest possible prompt | Robustness in edge cases |
| The most elegant formulation | Failure‑mode coverage |
| The most general abstraction | Debugging clarity when it fails |

Result:

  • Redundant? Yes.
  • Verbose? Absolutely.
  • Works? Consistently.

Future Directions: Where I Could Go

If I were to evolve CodeForge 2.0, I would explore a two‑agent architecture instead of a single 5 000‑token agent:

| Agent | Token Budget | Role |
|---|---|---|
| Planner Agent | 2,000 | Decides strategy |
| Executor Agent | 2,000 | Implements actions |

Benefits

  • Separation of concerns
  • Less context per agent
  • Parallel execution possible
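As a rough sketch of that split, assuming a `callLLM` wrapper around the model API (not part of the article's repository), the orchestration could look like:

```javascript
// Speculative two-agent orchestration: the planner only plans,
// the executor only edits, and the host loop carries all state.
function runTwoAgents(userRequest, callLLM) {
  // Planner agent: small prompt, produces only a plan, never edits files.
  const plan = JSON.parse(
    callLLM({ role: "planner", prompt: `Plan files for: ${userRequest}` })
  );
  // Executor agent: one file per call; state is injected by the host,
  // exactly as in the single-agent state-injection pattern.
  return plan.files_to_modify.map((file) =>
    callLLM({ role: "executor", prompt: `Modify ${file}` })
  );
}
```

Each agent sees only its own small prompt, which is where the per-agent token savings come from.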

Key Techniques that Work

| Rating | Technique |
|---|---|
| ⭐⭐⭐⭐⭐ | State injection + forced next action |
| ⭐⭐⭐⭐⭐ | Decision tables for task routing |
| ⭐⭐⭐⭐⭐ | Custom delimiters for structured output |
| ⭐⭐⭐⭐⭐ | Contextual re‑anchoring of invariants |
| ⭐⭐⭐⭐ | Meta‑validation as a safety net |
| ⭐⭐⭐ | Visual hierarchy (useful but not critical) |

Fundamental principle:
Don’t ask the LLM to “understand” — force it to “execute”.

Treat the protocol as a DSL, not a conversation. External state must constrain possible actions, and validation must happen server‑ or client‑side, always, without exceptions. Redundancy can be a feature, not a bug.

Call to Action

Try CodeForge:

The project is open source – the full prompt and validation system implementation are available in the repository.

Questions for the Community

  1. Have you ever built embedded DSLs in natural language?
  2. What is the cost of “cognitive overhead” in your prompts?
  3. Two‑agent architecture vs. single‑agent: experiences?

Share in the comments — this is still largely unexplored territory.
