Building an AI-Powered Code Editor (Part 2): The LLM as an Interpreter
Source: Dev.to
The Insight
While building LLM CodeForge, an agentic editor that lets LLMs read, modify, and test code autonomously, I realized something after writing 5 000 tokens of instructions:
I wasn’t writing a prompt. I was building a domain‑specific language (DSL) embedded in natural language.
This article analyzes how and why this distinction is fundamental—and what you can learn for your own agentic systems.
How the Model Behaves in CodeForge
| Aspect | What the model does | What the model doesn’t |
|---|---|---|
| Decision making | Selects which branch of the protocol to follow | Does not decide what to do |
| Problem solving | Executes a procedure described in natural language | Does not solve problems creatively |
| Nature | Functions like a bytecode interpreter, a text‑driven finite‑state machine, or a planner with closed actions | Relies on deterministic control flow, not on open‑ended reasoning |
| Reliability | Works because I accept the LLM is fundamentally unreliable | N/A |
DSL Control Flow
Every request follows four steps:
[UNDERSTAND] → [GATHER] → [EXECUTE] → [RESPOND]
Step 1 – UNDERSTAND
Classify request type
| Type | Keywords | Next Step |
|---|---|---|
| Explanation | “what is”, “explain” | [RESPOND] (text) |
| Modification | “add”, “change” | [GATHER] → [EXECUTE] |
| Analysis | “analyze”, “show” | [GATHER] → [RESPOND] |
This is not chain‑of‑thought in the classic sense. It is deterministic task routing—a decision table mapping input → workflow. The model doesn’t “think”; it executes a conditional jump.
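The decision table above can be sketched as ordinary code. This is a hypothetical reconstruction, not CodeForge's actual router: the keyword lists mirror the table, while the helper names (`ROUTES`, `routeRequest`) are assumptions for illustration.

```javascript
// Deterministic task routing: a keyword decision table mapping input → workflow.
// Keywords mirror the table above; names and fallback behavior are assumptions.
const ROUTES = [
  { type: "explanation",  keywords: ["what is", "explain"],  next: ["RESPOND"] },
  { type: "modification", keywords: ["add", "change"],       next: ["GATHER", "EXECUTE"] },
  { type: "analysis",     keywords: ["analyze", "show"],     next: ["GATHER", "RESPOND"] },
];

function routeRequest(userInput) {
  const text = userInput.toLowerCase();
  for (const route of ROUTES) {
    if (route.keywords.some((kw) => text.includes(kw))) return route;
  }
  // No keyword matched: fall back to the read-only branch (analysis).
  return ROUTES[2];
}
```

The point is that the branch is chosen by string matching, not by asking the model to "decide": the model's output only selects a row, and the host code executes the jump.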
Invariant Rules
🚨 CRITICAL RULE: You CANNOT use update_file on a file you haven’t read in this conversation.
Self‑check before ANY update_file
- Did I receive the content from the system?
- Do I know the exact current state?
- Am I modifying based on actual code?
If any answer is NO → OUTPUT read_file ACTION, then STOP.
This is an attempt to define pre‑conditions in natural language. It’s akin to:
```python
def update_file(path, content):
    assert path in conversation_state.read_files
    # ... actual update
```
Without a type system or automatic runtime enforcement, the rule reduces (but does not eliminate) the probability of the LLM modifying a file without having read it first. In tests I observed ~85‑90 % stability, but server‑side validation remains essential for critical cases.
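The server-side validation mentioned above can be sketched in a few lines. This is an assumption-laden illustration, not CodeForge's actual API: the action shape (`type`, `path`) and the `readFiles` set are hypothetical names.

```javascript
// Hypothetical server-side guard for the read-before-write invariant:
// reject update_file unless the file was read earlier in this conversation.
function validateAction(action, conversationState) {
  if (action.type === "update_file" && !conversationState.readFiles.has(action.path)) {
    return { ok: false, error: `Read ${action.path} before updating it.` };
  }
  if (action.type === "read_file") {
    // Record successful reads so later updates can be verified.
    conversationState.readFiles.add(action.path);
  }
  return { ok: true };
}
```

Unlike the natural-language rule, this check is deterministic: the ~10-15% of cases where the LLM skips the self-check are caught before any file is touched.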
Multi‑File Tasks: Prompt Regeneration
The most effective technique I implemented is dynamically regenerating the prompt to force the LLM to follow a multi‑step plan.
Concrete Scenario
User request: “Add authentication to the project.”
- Planning – The LLM generates a plan:

```json
{
  "plan": "I will modify these files in order:",
  "files_to_modify": ["Auth.js", "Login.jsx", "App.jsx"]
}
```

- First file – The LLM works on Auth.js and completes it.
- The trick – Instead of asking the LLM to “remember the plan”, I regenerate the prompt with an explicit “next‑action” block:
```
### ⚠️ MULTI‑FILE TASK IN PROGRESS
You completed: Auth.js
Remaining files: Login.jsx, App.jsx

### 🚨 REQUIRED ACTION
Your next output MUST be:
{"action":"continue_multi_file","next_file":{"path":"Login.jsx"}}

Do NOT do anything else. Do NOT deviate from the plan.
```
The LLM no longer needs to “remember” anything. The only possible action is baked into the prompt.
Why It’s Powerful
- The state (which files are done, which remain) lives in external JavaScript, not in the LLM’s memory.
- At every step I regenerate the prompt with the updated state.
- The LLM always sees a single, unambiguous instruction.
Implementation Sketch
```javascript
// In the code
function buildPrompt(multiFileState) {
  let prompt = BASE_PROMPT;
  if (multiFileState) {
    prompt += `
### TASK IN PROGRESS
Completed: ${multiFileState.completed.join(', ')}
Next: ${multiFileState.next}
Your ONLY valid action: continue_multi_file with ${multiFileState.next}
`;
  }
  return prompt;
}
```
This is state injection: external state completely controls what the LLM can do next.
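The missing piece next to the `buildPrompt` sketch is how the external state advances. Here is one hedged way to write it; the state shape (`completed`, `remaining`, `next`) is an assumption consistent with the sketch, not the actual CodeForge implementation.

```javascript
// After each completed file, the host code (not the LLM) advances the state.
// buildPrompt is then called again with this new state, so the model never
// has to remember the plan. Field names are assumptions.
function advanceState(state, completedFile) {
  const remaining = state.remaining.filter((f) => f !== completedFile);
  return {
    completed: [...state.completed, completedFile],
    remaining,
    next: remaining[0] ?? null, // null means the multi-file task is done
  };
}
```

Because the transition function is plain JavaScript, it is testable and deterministic; only the per-file edits remain probabilistic.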
Custom Delimiters for Mixed Content
To handle different “types” (JSON, code, metadata) the DSL uses custom delimiters:
```
#[json-data]
{"action":"create_file","file":{"path":"App.jsx"}}
#[end-json-data]

#[file-message]
This file implements the main app component.
#[end-file-message]

#[content-file]
export default function App() {
  return <h1>Hello World</h1>;
}
#[end-content-file]
```
Why Not Plain JSON or XML?
- The content may contain `{}`, `<>`, etc., requiring complex escaping.
- `#[tag]…#[end-tag]` is syntactically unique, easy to parse with a regex, and independent of the embedded language.
- It behaves like a context‑free grammar separating semantic layers.
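The "easy to parse with a regex" claim can be demonstrated directly. This is a minimal sketch, not the article's actual parser; the function name and return shape are assumptions.

```javascript
// Hypothetical regex parser for the #[tag]…#[end-tag] delimiters. Because the
// delimiter syntax never occurs inside JSON, JSX, or HTML, the embedded
// content needs no escaping. The backreference \1 matches each opening tag
// to its own #[end-tag].
function parseBlocks(output) {
  const blocks = {};
  const re = /#\[([\w-]+)\]([\s\S]*?)#\[end-\1\]/g;
  let match;
  while ((match = re.exec(output)) !== null) {
    blocks[match[1]] = match[2].trim();
  }
  return blocks;
}
```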
Error‑Example Guidance
Embedding “error examples” directly in the DSL teaches the model the common failure modes—an inline unit‑test for the language.
| ❌ Incorrect | ✅ Correct |
|---|---|
{"action":"start_multi_file","first_file":{...}} | {"action":"start_multi_file","plan":{},"first_file":{...}} |
#[json-data]{...}#[file-message]... | #[json-data]{...}#[end-json-data]#[file-message]... |
Trade‑offs of a Natural‑Language DSL
| Limitation | Consequence |
|---|---|
| ❌ No verifiable types | No static type checking |
| ❌ No automatic syntactic validation | Errors must be caught at runtime |
| ❌ No AST for transformations | No compile‑time optimizations |
Compensations
- ✅ Huge validation checklist (8+ points)
- ✅ Semantic redundancy – same rule expressed in multiple ways
- ✅ Extensive anti‑pattern documentation
When building a DSL in natural language, these trade‑offs are inevitable, but the resulting system can still be robust, transparent, and controllable.
Overview
The parser is a probabilistic LLM rather than a deterministic compiler.
If I were to evolve CodeForge in the future, a true mini‑DSL (JSON Schema + codegen) would reduce the prompt size by 30‑40 %. In the browser sandbox, however, the current choice is justified.
Pre‑Send Checklist
Before EVERY response, verify:
| # | Check | Fix If Failed |
|---|---|---|
| 1 | JSON valid | Correct structure |
| 2 | Tags complete | Add missing #[end-*] tags |
Alone, this yields 40‑60 % reliability. In my system it reaches 80‑90 % because it acts as a stability multiplier when:
- The model is already channeled (decision protocol)
- The format is rigid (custom delimiters)
- The next action is deterministic (state injection)
Meta‑validation is not the main feature – it is the final safety net in an already constrained system.
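The two checks in the table can also run client-side as a deterministic safety net. This is a sketch under assumptions: the function name is invented, and it covers only the two checks shown above, not the full 8+-point checklist.

```javascript
// Hypothetical client-side version of the pre-send checklist:
// 1) the #[json-data] block must parse as JSON,
// 2) every opening #[tag] must have a matching #[end-tag].
function preSendCheck(output) {
  const failures = [];
  const jsonBlock = output.match(/#\[json-data\]([\s\S]*?)#\[end-json-data\]/);
  if (jsonBlock) {
    try { JSON.parse(jsonBlock[1]); } catch { failures.push("JSON invalid"); }
  }
  for (const [, tag] of output.matchAll(/#\[(?!end-)([\w-]+)\]/g)) {
    if (!output.includes(`#[end-${tag}]`)) failures.push(`Missing #[end-${tag}]`);
  }
  return failures; // empty array → safe to apply the response
}
```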
Model Compatibility
✅ Works well with Claude 3.5, GPT‑4
❌ Smaller models will fail
❌ Less‑aligned models will ignore sections
I am implicitly saying: this system requires “serious” models.
It is an architectural constraint I accepted — like saying “this library requires Python 3.10+.”
Contextual Re‑Anchoring
Take the “read‑before‑write” rule:
- Appears in the Decision Protocol (when planning)
- Appears in Available Actions (when executing)
- Appears in Pre‑Send Validation (when verifying)
- Appears in Golden Rules (as a general principle)
This is strategic repetition, not random redundancy. It mirrors safety‑critical systems:
- Same invariant
- Verified at multiple levels
- With specific phrasing for each context
Pattern Examples
Bad vs. Good Patterns
```javascript
// BAD: Relying on the model's "memory"
"Remember that you have already read these files..."

// GOOD: Injecting explicit state
prompt += `Files already read: ${readFiles.join(', ')}`

// BAD: Giving open choices
"Decide which operation to perform"

// GOOD: Forcing the only legal move
"Your NEXT action MUST be: continue_multi_file"
```
Input → Action → Next State
| Input Pattern | Action | Next State |
|---|---|---|
"add X" | GATHER | EXECUTE |
"explain Y" | RESPOND | END |
Instead of “think what to do”, use “if X then Y”.
Embedding Arbitrary Content
- Don’t use JSON – escaping nightmare
- Don’t use XML – conflicts with HTML/JSX
- Use unique tags: `#[content]…#[end-content]`
Redundancy = Coverage, Not Noise
- Repeat critical rules
- In different formulations (semantic reinforcement)
- In different contexts (contextual re‑anchoring)
- With different rationales (why, not just what)
After 5 000 tokens and months of iterations, the most important lesson is:
This prompt is not “beautiful.” It is effective.
Optimization Shift
| ❌ Stopped looking for | ✅ Started optimizing for |
|---|---|
| The shortest possible prompt | Robustness in edge cases |
| The most elegant formulation | Failure‑mode coverage |
| The most general abstraction | Debugging clarity when it fails |
Result:
- Redundant? Yes.
- Verbose? Absolutely.
- Works? Consistently.
Future Directions: Where I Could Go
If I were to evolve CodeForge 2.0, I would explore a two‑agent architecture instead of a single 5 000‑token agent:
| Agent | Token Budget | Role |
|---|---|---|
| Planner Agent | 2 000 | Decides strategy |
| Executor Agent | 2 000 | Implements actions |
Benefits
- Separation of concerns
- Less context per agent
- Parallel execution possible
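The two-agent split above can be sketched as a thin orchestration loop. This is speculative, matching a design I have not built: `callLLM`, the prompt constants, and the plan shape are all assumptions standing in for any chat-completion client.

```javascript
// Speculative sketch of the planner/executor split. The planner sees only the
// request and emits a JSON plan; the executor sees one step at a time, never
// the whole plan, keeping each agent's context small.
const PLANNER_PROMPT = 'Return ONLY JSON: {"steps": [...]}';
const EXECUTOR_PROMPT = "Implement exactly one step. Output the result.";

async function runTask(userRequest, callLLM) {
  const plan = JSON.parse(await callLLM(PLANNER_PROMPT, userRequest));
  const results = [];
  for (const step of plan.steps) {
    results.push(await callLLM(EXECUTOR_PROMPT, JSON.stringify(step)));
  }
  return results;
}
```

Because the loop lives in host code, the same state-injection and validation techniques from the single-agent design apply unchanged to each executor call.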
Key Techniques that Work
| Rating | Technique |
|---|---|
| ⭐⭐⭐⭐⭐ | State injection + forced next action |
| ⭐⭐⭐⭐⭐ | Decision tables for task routing |
| ⭐⭐⭐⭐⭐ | Custom delimiters for structured output |
| ⭐⭐⭐⭐⭐ | Contextual re‑anchoring of invariants |
| ⭐⭐⭐⭐ | Meta‑validation as a safety net |
| ⭐⭐⭐ | Visual hierarchy (useful but not critical) |
Fundamental principle:
Don’t ask the LLM to “understand” — force it to “execute”.
Treat the protocol as a DSL, not a conversation. External state must constrain possible actions, and validation must happen server‑ or client‑side, always, without exceptions. Redundancy can be a feature, not a bug.
Call to Action
Try CodeForge:
The project is open source – the full prompt and validation system implementation are available in the repository.
Questions for the Community
- Have you ever built embedded DSLs in natural language?
- What is the cost of “cognitive overhead” in your prompts?
- Two‑agent architecture vs. single‑agent: experiences?
Share in the comments — this is still largely unexplored territory.