How My Team Aligns on Prompting for Production
Source: Dev.to
My team at Google is automating sample code generation and maintenance. Part of that is using Generative AI to produce and assess instructional code. This introduces a challenge: how do we trust the system to meet our specific standards when core components are non‑deterministic?
Establishing Trust
We need to isolate and understand each large language model (LLM) request. That means knowing exactly what goes into the model and having a guarantee of what comes out. This challenge isn’t different from other feature development; the key is to stop treating prompting like casual chatting and start treating it like coding.
Prompting as “Natural Language Programming”
Prompting an LLM is effectively natural language programming: we are programming in English. English is ambiguous, subjective, and open to interpretation.
- In C++, a missing semicolon breaks the build.
- In English, a missing comma can change the meaning entirely: “I don’t know, John” vs. “I don’t know John”.
In a programming language, syntax is binary: it works or it doesn't. In English, the difference between "Ensure variables are immutable" and "Make sure variables never change" might yield different results depending on the model's training data. When you combine the fuzziness of human language with the "black box" probabilistic processing of an LLM, you face a forecasting problem: what is the weather going to be today in the land of AI?
The answer is to make the intentions behind your prompts explicit.
Pairing and Review
Writing a prompt is an exploratory process of finding words that trigger the best response. Relying on a single writer is risky because:
- One engineer’s “clear instruction” may be another’s loophole.
- Blind spots can lead to unintended model behavior.
Prompt pairing—having two engineers collaborate on a prompt—covers gaps that a single brain might miss.
Prompt reviews go beyond typo checks; they are logic checks:
- Does the prompt align with business requirements?
- Does the prompt surface disagreements about requirements, forcing the team to align before shipping?
Because English is fuzzy, the intent behind specific word choices isn’t always obvious. Document the “why” behind each decision.
Documentation
- Avoid treating prompts as canonical business requirements.
- An LLM request combines system instructions, user input, context, and deterministic post‑processing; the prompt alone isn’t sufficient for onboarding a developer.
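As a sketch, those components can be made explicit in code so that "what goes into the model" is never implicit. The class and field names below are illustrative, not from any real framework:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: the pieces that make up one LLM request.
# Field names are illustrative, not from our codebase.
@dataclass
class LlmRequest:
    system_instructions: str  # the reviewed, versioned prompt
    user_input: str           # the dynamic payload for this call
    context: list[str] = field(default_factory=list)  # retrieved docs, examples, etc.

    def render(self) -> str:
        """Assemble the full text sent to the model, making inputs explicit."""
        parts = [self.system_instructions, *self.context, self.user_input]
        return "\n\n".join(p for p in parts if p)

req = LlmRequest(
    system_instructions="You are a code reviewer. Respond in JSON.",
    user_input="Review this snippet: print('hi')",
)
prompt_text = req.render()
```

A structure like this also gives the deterministic post-processing step a single place to hook in, rather than scattering string concatenation across the app.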
Comments
Comment complex prompts just as you would complex code. Spotlight specific constraints or punctuation and explain the problem they solve. The model is a moving target, so unintentional prompt changes can make troubleshooting hard.
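One way to do this, sketched below, is to keep each constraint next to a comment recording the problem it solves. The constraint wording and the rationale are illustrative examples, not a production prompt:

```python
# Sketch: commenting a prompt the way we comment code.
# Each constraint carries a note explaining why it exists.
CONSTRAINTS = [
    # Models sometimes wrap code in prose; this line suppresses filler.
    "Output only the code, with no introduction or explanation.",
    # Hypothetical example of documenting a word choice:
    # "immutable" proved more reliable than "never changes".
    "Declare every variable as immutable.",
]

PROMPT = "Generate a sample function.\n" + "\n".join(f"- {c}" for c in CONSTRAINTS)
```

When the model's behavior shifts, the comments tell the next engineer which lines are safe to rewrite and which ones were hard-won fixes.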
Commit Messages
Use commit messages to explain what was wrong with a prompt, e.g.:
fix: missing comma caused “John” to be dropped
Separation of Concerns (Use Dedicated Files)
Writing code and writing prompts require distinct mindsets:
- Code focuses on syntax and execution flow.
- Prompts focus on semantics and intent.
Embedding long English instructions inside code creates distraction. We keep prompts in dedicated files to disentangle application logic from LLM interaction configuration, which requires frequent tuning.
Tools like dotprompt treat these files as first‑class artifacts containing text, model parameters, and schema definitions. This highlights that invoking a model isn’t just a function call; it’s an integration with a distinct system that needs its own configuration.
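For illustration, a dotprompt-style file bundles all three in one artifact. The model name, parameters, and fields below are hypothetical placeholders, not our actual configuration:

```
---
model: googleai/gemini-1.5-flash
config:
  temperature: 0.2
input:
  schema:
    language: string
    topic: string
output:
  format: json
  schema:
    code: string, the generated sample
---
Generate a sample in {{language}} that demonstrates {{topic}}.
```

Because the file is data rather than application code, prompt tuning becomes a config change that can be diffed and reviewed on its own.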
Structured Output
To bridge unpredictable LLM output and deterministic computers, we rely on structured output to guide the model to emit JSON according to a schema. Even if only a single field is needed, defining a schema provides a guardrail that helps the model output conform to a shape we can validate programmatically. This is critical for code generation, where models often add unwanted preambles, conversational filler, or inconsistent markdown fences.
If the output doesn’t match the schema, we fail fast or retry. This allows us to integrate the LLM output into our process with the same confidence we have in detecting a bad API response.
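The fail-fast/retry loop can be sketched as follows, assuming a `call_model()` callable that returns raw text as a stand-in for a real LLM client, and an illustrative two-field schema:

```python
import json

# Illustrative schema: the fields and types we expect back.
REQUIRED_FIELDS = {"code": str, "language": str}

def parse_structured(raw: str) -> dict:
    """Validate model output against the expected shape; raise on mismatch."""
    data = json.loads(raw)  # fails fast on preambles or markdown fences
    for name, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), ftype):
            raise ValueError(f"field {name!r} missing or wrong type")
    return data

def generate(call_model, max_attempts: int = 3) -> dict:
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_structured(call_model())
        except (json.JSONDecodeError, ValueError) as err:
            last_error = err  # retry, as we would on a bad API response
    raise RuntimeError(f"no valid structured output: {last_error}")

# Usage with a fake model that emits filler first, then valid JSON:
responses = iter(["Sure! Here is the code:", '{"code": "print(1)", "language": "python"}'])
result = generate(lambda: next(responses))
```

Frameworks with schema support do much of this for you; the point is that the boundary between probabilistic output and deterministic code is checked in exactly one place.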
Moving Toward Reliable Systems
Moving from one successful prompt to a reliable system requires acknowledging that prompts are code. You need to manage, review, and test them with the same rigor applied to the rest of your stack. While we are still working on better ways to benchmark quality, treating prompts as first‑class codebase assets is our first step toward building confidence in AI‑assisted automation.
Discussion prompts
- How does your team handle the fuzzy nature of LLMs in production?
- How do you review and test prompts?
- Do you treat prompts as configuration, code, or something else entirely?
- Thanks to Jennifer Davis & Shawn Jones for review and contributions.
Note: "structured output" above refers to the feature as implemented in the Genkit framework, which my team uses; its documentation includes a succinct example.