I stopped writing prompts and started writing Python
Source: Dev.to
The Prompt Chaos
For a year I treated LLMs like a command line: type instructions, pray for output, tweak wording, add “IMPORTANT:”, move sentences around like a ritual. I ended up with folders of prompts:
v1.txt
v2_final.txt
v2_final_REALLY_final.txt
None of them documented why they worked. When something broke, I couldn’t tell if the issue was the prompt, the model, or the data. There was no version control and no tests, just vibes.
Enter DSPy
DSPy (from Stanford NLP) flips the approach: you don’t write prompts, you write Python.
import dspy

class AnalyzeStartup(dspy.Signature):
    """Analyze a startup pitch."""

    pitch: str = dspy.InputField()
    viability_score: int = dspy.OutputField()
    strengths: list[str] = dspy.OutputField()
    weaknesses: list[str] = dspy.OutputField()
    verdict: str = dspy.OutputField()

That’s it: no “You are an expert startup analyst…”, no “Respond in JSON format…”. DSPy compiles this signature into a prompt. When you need better prompts, you run an optimizer, and DSPy rewrites them based on examples that work.
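To make "compiles this signature into a prompt" less abstract, here is a toy sketch in plain Python. It is not DSPy's actual implementation, and `FieldSpec`, `ToySignature`, and `render_prompt` are made-up names for illustration only. The point is just that typed field declarations carry enough information to generate prompt text mechanically.

```python
from dataclasses import dataclass, field


@dataclass
class FieldSpec:
    """One declared field: a name plus a human-readable type hint."""
    name: str
    type_hint: str


@dataclass
class ToySignature:
    """Hypothetical stand-in for a signature (not DSPy's real class)."""
    instructions: str
    inputs: list[FieldSpec] = field(default_factory=list)
    outputs: list[FieldSpec] = field(default_factory=list)


def render_prompt(sig: ToySignature, **values: str) -> str:
    """Turn a declared interface plus input values into prompt text."""
    lines = [sig.instructions, "", "Inputs:"]
    for spec in sig.inputs:
        lines.append(f"- {spec.name} ({spec.type_hint}): {values[spec.name]}")
    lines += ["", "Respond with these fields:"]
    for spec in sig.outputs:
        lines.append(f"- {spec.name} ({spec.type_hint})")
    return "\n".join(lines)


# Same fields as the AnalyzeStartup signature, declared in the toy format.
analyze_startup = ToySignature(
    instructions="Analyze a startup pitch.",
    inputs=[FieldSpec("pitch", "str")],
    outputs=[
        FieldSpec("viability_score", "int"),
        FieldSpec("strengths", "list[str]"),
        FieldSpec("weaknesses", "list[str]"),
        FieldSpec("verdict", "str"),
    ],
)

prompt = render_prompt(analyze_startup, pitch="AI for dog grooming")
print(prompt)
```

Because the prompt is generated from the declaration, changing its format means changing `render_prompt` once rather than editing every prompt file by hand.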
From Prompt Tricks to Signatures
Before:
“If I put examples before instructions, it works better. Sometimes. Unless it’s GPT‑4o.”
After:
I write a signature. DSPy figures out the best prompt format. The prompt becomes an implementation detail; I care about inputs, outputs, and behavior—not phrasing.
Testable LLM Code
Before:
Manually checking output.
After:
def test_startup_analyzer():
    result = startup_analyzer(pitch="We're building AI for dog grooming...")
    assert 1 <= result.viability_score
    assert len(result.strengths) > 0
    assert len(result.weaknesses) > 0

Real tests live in my test suite with assertions.
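One way to keep such tests fast and deterministic is to check the shape of a result separately from the model call. The sketch below assumes a result object with the four output fields from the signature. `check_analysis` and `fake_analyzer` are hypothetical helpers, and the stub stands in for the real LLM-backed analyzer so nothing touches the network.

```python
from types import SimpleNamespace


def check_analysis(result) -> list[str]:
    """Return a list of problems found in an analysis result (empty if valid)."""
    problems = []
    if not isinstance(result.viability_score, int):
        problems.append("viability_score is not an int")
    if not result.strengths:
        problems.append("no strengths listed")
    if not result.weaknesses:
        problems.append("no weaknesses listed")
    if not result.verdict.strip():
        problems.append("empty verdict")
    return problems


def fake_analyzer(pitch: str):
    """Hypothetical stub standing in for the real LLM-backed analyzer."""
    return SimpleNamespace(
        viability_score=6,
        strengths=["clear niche"],
        weaknesses=["small market"],
        verdict="Worth a prototype.",
    )


def test_check_analysis_accepts_well_formed_result():
    result = fake_analyzer(pitch="We're building AI for dog grooming...")
    assert check_analysis(result) == []


def test_check_analysis_flags_missing_weaknesses():
    result = fake_analyzer(pitch="anything")
    result.weaknesses = []
    assert check_analysis(result) == ["no weaknesses listed"]
```

The same `check_analysis` function can then be pointed at real model output in a slower integration test.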
Swapping Models in One Line
Before:
Each model required its own prompt tuning (GPT‑4, Claude, Gemini, etc.).
After:
# Swap models
lm = dspy.LM("openai/gpt-4o-mini")
# lm = dspy.LM("anthropic/claude-3-sonnet")
# lm = dspy.LM("gemini/gemini-2.0-flash")
dspy.configure(lm=lm)

Same code, different model. DSPy handles the prompt translation.
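If the swap should not need a code edit at all, the model identifier can come from configuration. This is a minimal sketch under assumptions: `LLM_MODEL` is a hypothetical environment variable name and `resolve_model_name` is a made-up helper. Its return value would be passed to `dspy.LM(...)` in place of the hard-coded string.

```python
import os

DEFAULT_MODEL = "openai/gpt-4o-mini"


def resolve_model_name(env=None) -> str:
    """Pick the model identifier from LLM_MODEL (a hypothetical variable name)."""
    env = os.environ if env is None else env
    name = env.get("LLM_MODEL", "").strip()
    # Fall back to the default when the variable is unset or blank.
    return name or DEFAULT_MODEL


print(resolve_model_name({}))
print(resolve_model_name({"LLM_MODEL": "gemini/gemini-2.0-flash"}))
```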
Optimizers Do the Tuning
Instead of manually tweaking prompts, I give DSPy examples of good outputs and let it figure out the best prompt:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=my_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(StartupAnalyzer(), trainset=train_examples)

DSPy runs experiments, finds examples that work, and builds the prompt. I simply review the results.
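The snippet above leaves `my_metric` undefined, so here is one possible shape for it. The `(example, pred, trace=None)` argument order follows the convention DSPy metrics use. The acceptance rules, including the one-point tolerance, are illustrative assumptions rather than anything from the original setup. Plain `SimpleNamespace` objects stand in for a labeled example and model predictions so the sketch runs without a model.

```python
from types import SimpleNamespace


def my_metric(example, pred, trace=None) -> bool:
    """Accept a prediction when it is well-formed and close to the labeled score."""
    # Reject outputs that skip either list entirely.
    if not pred.strengths or not pred.weaknesses:
        return False
    # Tolerate a one-point difference from the hand-labeled score (assumed tolerance).
    return abs(pred.viability_score - example.viability_score) <= 1


# Stand-ins for a labeled example and two model predictions.
labeled = SimpleNamespace(pitch="AI for dog grooming", viability_score=6)
good = SimpleNamespace(viability_score=7, strengths=["niche"], weaknesses=["small market"])
bad = SimpleNamespace(viability_score=2, strengths=["niche"], weaknesses=["small market"])

print(my_metric(labeled, good), my_metric(labeled, bad))
```

The optimizer uses a metric like this to decide which bootstrapped demonstrations are worth keeping, so a stricter metric tends to mean fewer but cleaner examples in the final prompt.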
The Paradigm Shift
- Old way: LLMs are magic boxes you talk to in English; success depends on prompting skill.
- DSPy way: LLMs are function calls. You declare the interface, and the framework handles the implementation.
It’s the difference between scattering raw SQL queries across a codebase and using an ORM: one is brittle, untyped, and hard to refactor; the other is structured, testable, and maintainable.
Going Deeper
I wrote a full guide on building with DSPy—practical chapters, real code, and the hard‑won lessons. It’s called Harmless DSPy; Chapter 1 is free if you want to see if it’s your thing.
DSPy is developed by Omar Khattab and the Stanford NLP team. It’s open source, actively maintained, and has genuinely changed how I build with LLMs.