Prompt Engineering From First Principles: The Mechanics They Don't Teach You (Part 1)

Published: December 28, 2025 at 03:42 AM EST
7 min read
Source: Dev.to

Why this series is different

Most blogs hand you ready‑made templates:

  • “Try this prompt!”
  • “Use these 10 techniques!”

That’s not what we’re doing.

We’re digging deep into the mechanics of large language models (LLMs):

  • How do LLMs actually process your prompts?
  • What makes a prompt effective at the mechanical level?
  • Where do LLMs fail and why?

The goal is to give you mental models that let you engineer prompts yourself, instead of copying someone else’s examples.

Series outline (so far)

  1. The Foundation – How LLMs Really Work
  2. The Art & Science of Prompting
  3. Prompting Techniques and Optimization
  4. Prompt Evaluation and Scaling
  5. Tips, Tricks, and Experience

We may add more parts later.

Let’s jump into Part 1

A quick reality check

Do you think LLMs are intelligent like humans?
Do they have a “brain” that understands your questions and thinks through answers?

If you answered yes, you’re wrong.

LLMs don’t think. They don’t understand. They are next‑token predictors—sophisticated autocomplete machines that guess the next token based on patterns they have seen.

“Wait, how can simple next‑word prediction answer complex questions, write code, or hold conversations?”

That’s a great question. The answer involves fascinating engineering, but we’ll keep the focus on what you need to know to write better prompts: nothing more, nothing less.

If you’re really interested in how machines learn, check out my longer write‑up here (link placeholder).

The basic workflow of a prompt

When you type a prompt and hit Enter, the model performs the following simplified steps:

  1. Tokenization – Your text is split into tokens (chunks of text).
  2. Embedding – Each token is turned into a vector of numbers.
  3. Transformer processing – Layers of attention compute relationships between tokens.
  4. Probability distribution – The model assigns a probability to every possible next token.
  5. Sampling / Decoding – A token is selected, appended to the sequence, and the process repeats until a stop condition is met.

Below we unpack each step.

Tokenization

A token is roughly a chunk of text—sometimes a whole word, sometimes part of a word, sometimes punctuation.

"Hello world" → ["Hello", " world"]          # 2 tokens
"apple"       → ["apple"]                    # 1 token
"12345"       → ["123", "45"]                # 2 tokens

Why tokenization matters

  • Context windows are measured in tokens, not words.
  • The way text is tokenized determines how the model “sees” it.
  • Some words are single tokens (handled more efficiently); others are split into multiple tokens.
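You can poke at tokenization yourself with OpenAI's open-source tiktoken library. A minimal sketch (exact splits vary from tokenizer to tokenizer, so your output may differ from the examples above):

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models

for text in ["Hello world", "apple", "12345"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces} ({len(ids)} tokens)")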

Embedding

Models can’t work with raw text; they only understand numbers.
Each token is converted into a high‑dimensional vector (an embedding) that captures its semantic properties.
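In code terms, this step is just a table lookup: one row of a big learned matrix per token. A toy numpy sketch (the sizes and token ids below are illustrative, not from any real model):

import numpy as np

vocab_size, d_model = 50_000, 768                        # illustrative sizes
embedding_table = np.random.randn(vocab_size, d_model)   # learned during training

token_ids = [15496, 995]                # hypothetical ids for "Hello", " world"
vectors = embedding_table[token_ids]    # one 768-dimensional vector per token
print(vectors.shape)                    # (2, 768)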

Transformer layers & attention

The token embeddings pass through a stack of Transformer layers.
The attention mechanism lets the model decide which tokens are relevant to each other.

Example: In the sentence “The bank of the river was muddy,” attention links “bank” to “river” and “muddy,” helping the model interpret “bank” as a riverbank rather than a financial institution.
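Here is a toy numpy sketch of single-head self-attention, the core computation inside each layer (real models add learned projections, multiple heads, and many stacked layers on top of this):

import numpy as np

def attention(Q, K, V):
    # Each token scores every other token, then takes a
    # probability-weighted average of their value vectors.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

n_tokens, d = 7, 64                  # e.g. "The bank of the river was muddy"
x = np.random.randn(n_tokens, d)     # toy token embeddings
out = attention(x, x, x)             # self-attention: Q, K, V from the same tokens
print(out.shape)                     # (7, 64)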

Probability distribution over the vocabulary

After the attention passes, the model produces a probability for every token in its vocabulary (often more than 50,000 tokens).

Prompt: "The capital of France is"

Paris      → 0.85   (85 %)
the        → 0.05   (5 %)
beautiful  → 0.03   (3 %)
London     → 0.02   (2 %)
… (thousands of other tokens with tiny probabilities)

The model does not “know” that Paris is the capital of France; it simply learned that, in the training data, “Paris” follows the phrase “The capital of France is” with high frequency.
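Under the hood, the model emits raw scores (logits) and a softmax turns them into this distribution. A toy sketch with made-up numbers:

import numpy as np

def softmax(logits):
    # Exponentiate and normalize so the scores sum to 1.
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Hypothetical logits for a tiny 5-token vocabulary
# after the prompt "The capital of France is".
vocab  = ["Paris", "the", "beautiful", "London", "Berlin"]
logits = np.array([5.0, 2.2, 1.7, 1.3, 0.5])
for token, p in zip(vocab, softmax(logits)):
    print(f"{token:<10} {p:.1%}")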

Sampling / decoding

A token is chosen according to the probability distribution (e.g., greedy, top‑k, nucleus sampling), appended to the output, and the whole process repeats until an end‑of‑sequence token or another stop condition is reached.
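A minimal sketch of that loop (model here is a stand-in that returns a probability distribution over the vocabulary given the tokens so far; real APIs differ):

import numpy as np

def sample_token(probs, greedy=False):
    # Greedy decoding always takes the most likely token; sampling
    # draws from the distribution, so repeated runs can differ.
    if greedy:
        return int(np.argmax(probs))
    return int(np.random.choice(len(probs), p=probs))

def generate(model, prompt_ids, max_tokens=100, eos_id=0):
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        probs = model(ids)             # distribution over the whole vocabulary
        next_id = sample_token(probs)  # greedy, top-k, nucleus, ...
        if next_id == eos_id:          # stop condition: end-of-sequence token
            break
        ids.append(next_id)
    return ids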

Bottom line: An LLM is a repeated next‑token predictor driven by learned statistical patterns.

From “guess the next word” to useful behavior

If LLMs are just probability machines, how can they:

  • Answer questions correctly?
  • Write functional code?
  • Hold coherent conversations?
  • Follow complex instructions?

The answer lies in two major training phases that shape model behavior.

Pre‑training (unsupervised)

The model reads trillions of tokens from diverse sources:

  • Websites (Wikipedia, forums, blogs)
  • Books
  • Code repositories (GitHub)
  • Research papers
  • Social media

What it learns: statistical co‑occurrence patterns.
What it doesn’t learn: how to act as a helpful assistant.

Example: Ask a raw base model, “What is the capital of France?” It might reply with a list of other capitals because it’s merely continuing a pattern it saw in quiz‑style data.

Instruction‑tuning (supervised + RLHF)

  1. Supervised Fine‑Tuning (SFT) – Humans write thousands of example conversations (question → good answer). The model learns to map a question to a helpful answer rather than just continuing the prompt.
  2. Reinforcement Learning from Human Feedback (RLHF) – Humans rank multiple model responses from best to worst. The model is optimized toward responses that are helpful, harmless, and honest.

These steps shift the probability mass: the model now predicts tokens that look like helpful assistant replies because that pattern has become the most probable one.

Recap

  • LLMs are next‑token predictors built on tokenization → embedding → transformer → probability distribution → sampling.
  • Pre‑training gives them raw knowledge (statistical patterns).
  • SFT + RLHF turn that knowledge into useful, aligned behavior.

Understanding this pipeline gives you the foundation to engineer prompts that work with the model’s mechanics rather than trying to “out‑smart” a black box.

Controlling generation

It’s all still just next‑token prediction. The training only shapes which predictions have high probability.

Remember that probability distribution we talked about? Here’s where you get control. The model gives you probabilities, but configuration parameters decide how tokens are actually selected from those probabilities.

Temperature

Controls how “random” the model’s choices are.

Example scenario – the model predicts:

Token       Probability
Paris       85 %
beautiful    3 %
London       2 %

  • Low Temperature (e.g., 0.2) – the model becomes more “confident” and almost always picks the top choice. Paris might effectively become 95 %+ likely.

    • Result: Deterministic, focused, repetitive outputs.
    • Use for: Code generation, data extraction, factual answers.
  • High Temperature (e.g., 0.8) – the probability curve flattens. Paris might drop to ~60 %, beautiful rises to ~10 %, London to ~8 %.

    • Result: More varied, creative, unpredictable outputs.
    • Use for: Creative writing, brainstorming, multiple perspectives.

Real example

Prompt: "The sky is"
Temperature 0.2 → "blue" (almost always)
Temperature 0.8 → "blue" or "cloudy" or "vast" or "filled with stars" (varies)
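Mechanically, temperature rescales the distribution before sampling. A sketch (the numbers reuse the Paris example, with the rest of the vocabulary lumped into a fifth bucket):

import numpy as np

def apply_temperature(probs, temperature):
    # T < 1 sharpens the distribution (more deterministic);
    # T > 1 flattens it (more varied); T = 1 leaves it unchanged.
    logits = np.log(probs) / temperature
    z = np.exp(logits - logits.max())
    return z / z.sum()

probs = np.array([0.85, 0.05, 0.03, 0.02, 0.05])  # Paris, the, beautiful, London, rest
print(apply_temperature(probs, 0.2).round(3))     # top choice dominates even more
print(apply_temperature(probs, 0.8).round(3))     # probability mass spreads out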

Top‑P (Nucleus Sampling)

Sets a probability threshold.

If you set Top‑P = 0.9, the model only considers tokens that together make up the top 90 % of probability, ignoring everything else.

Why this matters

  • Without Top‑P, even with a reasonable temperature, the model might occasionally pick a token with a 0.001 % probability → complete nonsense.
  • With Top‑P = 0.9, those ultra‑low‑probability tokens are never considered. The model stays coherent while still being creative.
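A sketch of the filtering step: sort tokens by probability, keep the smallest set of top tokens whose cumulative probability reaches p, and renormalize the survivors.

import numpy as np

def top_p_filter(probs, p=0.9):
    order = np.argsort(probs)[::-1]                 # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]    # keep only the "nucleus"
    return mask / mask.sum()                        # renormalize survivors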

Practical combinations

Temperature   Top‑P   Effect
0.7           0.9     Creative but coherent
0.2           1.0     Deterministic and focused

Top‑K

Limits the model to the K most likely tokens.

Example: Top‑K = 50 → only the 50 highest‑probability tokens are considered; the rest are ignored.
This is a simpler version of Top‑P and is less commonly used in modern systems.
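The corresponding sketch is even simpler than Top‑P:

import numpy as np

def top_k_filter(probs, k=50):
    # Keep only the k most likely tokens, zero out the rest, renormalize.
    order = np.argsort(probs)[::-1][:k]
    mask = np.zeros_like(probs)
    mask[order] = probs[order]
    return mask / mask.sum()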

Full walk‑through example

Prompt: “Explain photosynthesis in simple terms”

  1. Tokenization
    ["Explain", " photosynthesis", " in", " simple", " terms"]

  2. Model processing – the transformer calculates relationships between tokens.

  3. Probability distribution for the next token

Token            Probability
Photosynthesis   40 %
It               15 %
The              12 %
In                8 %

  4. Configuration applied – Temperature 0.3, Top‑P 0.9

    • Low temperature sharpens the distribution: Photosynthesis → 65 %
    • Model picks Photosynthesis.
  5. Repeat – Now the sequence is

    Explain photosynthesis in simple terms Photosynthesis

    Calculate probabilities for the next token, pick based on the same configuration, and continue until the answer is complete.
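Tying the pieces together, one decode step looks roughly like this, reusing the apply_temperature, top_p_filter, and sample_token helpers sketched earlier (the numbers mirror the table above, with the remaining 25 % lumped into one bucket):

import numpy as np

vocab = ["Photosynthesis", "It", "The", "In", "<other>"]
probs = np.array([0.40, 0.15, 0.12, 0.08, 0.25])

probs = apply_temperature(probs, 0.3)   # sharpen: top token dominates
probs = top_p_filter(probs, 0.9)        # discard the low-probability tail
print(vocab[sample_token(probs)])       # most likely "Photosynthesis"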

The model never “understood” photosynthesis. It predicted tokens that statistically form explanations based on patterns from its training data.

Key takeaway

LLMs are probability engines, not reasoning machines. Every response is a statistical prediction of the next token, shaped by training data and controlled by configuration parameters.

Engineering the probabilities

Your prompt does more than ask a question—it shapes the entire probability landscape the model uses. By tweaking wording, reordering instructions, or adding examples, you can make different tokens more likely, leading to different outputs.

In the next part, we’ll explore The Art & Science of Prompting—how to deliberately craft prompts that steer those probabilities in your favor.

Resources

I’ve set up a GitHub repository for this series, where I’ll share code and additional resources. Check it out and give it a star!

Feel free to share your thoughts, comments, and insights below. Let’s learn and grow together!
