Prompt Engineering From First Principles: The Mechanics They Don't Teach You (Part 1)

Published: December 28, 2025 at 03:42 AM EST
7 min read
Source: Dev.to

Why this series is different

Most blogs hand you ready‑made templates:

  • “Try this prompt!”
  • “Use these 10 techniques!”

That’s not what we’re doing.

We’re digging deep into the mechanics of large language models (LLMs):

  • How do LLMs actually process your prompts?
  • What makes a prompt effective at the mechanical level?
  • Where do LLMs fail and why?

The goal is to give you mental models that let you engineer prompts yourself, instead of copying someone else’s examples.

Series outline (so far)

  1. The Foundation – How LLMs Really Work
  2. The Art & Science of Prompting
  3. Prompting Techniques and Optimization
  4. Prompt Evaluation and Scaling
  5. Tips, Tricks, and Experience

We may add more parts later.

Let’s jump into Part 1

A quick reality check

Do you think LLMs are intelligent like humans?
Do they have a “brain” that understands your questions and thinks through answers?

If you answered yes, you’re wrong.

LLMs don’t think. They don’t understand. They are next‑token predictors—sophisticated autocomplete machines that guess the next token based on patterns they have seen.

“Wait, how can simple next‑word prediction answer complex questions, write code, or hold conversations?”

That’s a great question. The answer involves fascinating engineering, but we’ll keep the focus on what you need to know to write better prompts: nothing more, nothing less.

If you’re really interested in how machines learn, check out my longer write‑up here (link placeholder).

The basic workflow of a prompt

When you type a prompt and hit Enter, the model performs the following simplified steps:

  1. Tokenization – Your text is split into tokens (chunks of text).
  2. Embedding – Each token is turned into a vector of numbers.
  3. Transformer processing – Layers of attention compute relationships between tokens.
  4. Probability distribution – The model assigns a probability to every possible next token.
  5. Sampling / Decoding – A token is selected, appended to the sequence, and the process repeats until a stop condition is met.

Below we unpack each step.

Tokenization

A token is roughly a chunk of text—sometimes a whole word, sometimes part of a word, sometimes punctuation.

"Hello world" → ["Hello", " world"]          # 2 tokens
"apple"       → ["apple"]                    # 1 token
"12345"       → ["123", "45"]                # 2 tokens

Why tokenization matters

  • Context windows are measured in tokens, not words.
  • The way text is tokenized determines how the model “sees” it.
  • Some words are single tokens (handled more efficiently); others are split into multiple tokens.
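You can poke at tokenization yourself with OpenAI's open-source tiktoken library. A minimal sketch (exact splits vary from tokenizer to tokenizer, so your output may differ from the examples above):

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models

for text in ["Hello world", "apple", "12345"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces} ({len(ids)} tokens)")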

Embedding

Models can’t work with raw text; they only understand numbers.
Each token is converted into a high‑dimensional vector (an embedding) that captures its semantic properties.
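In code terms, this step is just a table lookup: one row of a big learned matrix per token. A toy numpy sketch (the sizes and token ids below are illustrative, not from any real model):

import numpy as np

vocab_size, d_model = 50_000, 768                        # illustrative sizes
embedding_table = np.random.randn(vocab_size, d_model)   # learned during training

token_ids = [15496, 995]                # hypothetical ids for "Hello", " world"
vectors = embedding_table[token_ids]    # one 768-dimensional vector per token
print(vectors.shape)                    # (2, 768)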

Transformer layers & attention

The token embeddings pass through a stack of Transformer layers.
The attention mechanism lets the model decide which tokens are relevant to each other.

Example: In the sentence “The bank of the river was muddy,” attention links “bank” to “river” and “muddy,” helping the model interpret “bank” as a riverbank rather than a financial institution.
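Here is a toy numpy sketch of single-head self-attention, the core computation inside each layer (real models add learned projections, multiple heads, and many stacked layers on top of this):

import numpy as np

def attention(Q, K, V):
    # Each token scores every other token, then takes a
    # probability-weighted average of their value vectors.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

n_tokens, d = 7, 64                  # e.g. "The bank of the river was muddy"
x = np.random.randn(n_tokens, d)     # toy token embeddings
out = attention(x, x, x)             # self-attention: Q, K, V from the same tokens
print(out.shape)                     # (7, 64)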

Probability distribution over the vocabulary

After the attention passes, the model produces a probability for every token in its vocabulary (often more than 50,000 tokens).

Prompt: "The capital of France is"

Paris      → 0.85   (85 %)
the        → 0.05   (5 %)
beautiful  → 0.03   (3 %)
London     → 0.02   (2 %)
… (thousands of other tokens with tiny probabilities)

The model does not “know” that Paris is the capital of France; it simply learned that, in the training data, “Paris” follows the phrase “The capital of France is” with high frequency.
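Under the hood, the model emits raw scores (logits) and a softmax turns them into this distribution. A toy sketch with made-up numbers:

import numpy as np

def softmax(logits):
    # Exponentiate and normalize so the scores sum to 1.
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Hypothetical logits for a tiny 5-token vocabulary
# after the prompt "The capital of France is".
vocab  = ["Paris", "the", "beautiful", "London", "Berlin"]
logits = np.array([5.0, 2.2, 1.7, 1.3, 0.5])
for token, p in zip(vocab, softmax(logits)):
    print(f"{token:<10} {p:.1%}")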

Sampling / decoding

A token is chosen according to the probability distribution (e.g., greedy, top‑k, nucleus sampling), appended to the output, and the whole process repeats until an end‑of‑sequence token or another stop condition is reached.
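A minimal sketch of that loop (model here is a stand-in that returns a probability distribution over the vocabulary given the tokens so far; real APIs differ):

import numpy as np

def sample_token(probs, greedy=False):
    # Greedy decoding always takes the most likely token; sampling
    # draws from the distribution, so repeated runs can differ.
    if greedy:
        return int(np.argmax(probs))
    return int(np.random.choice(len(probs), p=probs))

def generate(model, prompt_ids, max_tokens=100, eos_id=0):
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        probs = model(ids)             # distribution over the whole vocabulary
        next_id = sample_token(probs)  # greedy, top-k, nucleus, ...
        if next_id == eos_id:          # stop condition: end-of-sequence token
            break
        ids.append(next_id)
    return ids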

Bottom line: An LLM is a repeated next‑token predictor driven by learned statistical patterns.

From “guess the next word” to useful behavior

If LLMs are just probability machines, how can they:

  • Answer questions correctly?
  • Write functional code?
  • Hold coherent conversations?
  • Follow complex instructions?

The answer lies in two major training phases that shape model behavior.

Pre‑training (unsupervised)

The model reads trillions of tokens from diverse sources:

  • Websites (Wikipedia, forums, blogs)
  • Books
  • Code repositories (GitHub)
  • Research papers
  • Social media

What it learns: statistical co‑occurrence patterns.
What it doesn’t learn: how to act as a helpful assistant.

Example: Ask a raw base model, “What is the capital of France?” It might reply with a list of other capitals because it’s merely continuing a pattern it saw in quiz‑style data.

Instruction‑tuning (supervised + RLHF)

  1. Supervised Fine‑Tuning (SFT) – Humans write thousands of example conversations (question → good answer). The model learns to map a question to a helpful answer rather than just continuing the prompt.
  2. Reinforcement Learning from Human Feedback (RLHF) – Humans rank multiple model responses from best to worst. The model is optimized toward responses that are helpful, harmless, and honest.

These steps shift the probability mass: the model now predicts tokens that look like helpful assistant replies because that pattern has become the most probable one.

Recap

  • LLMs are next‑token predictors built on tokenization → embedding → transformer → probability distribution → sampling.
  • Pre‑training gives them raw knowledge (statistical patterns).
  • SFT + RLHF turn that knowledge into useful, aligned behavior.

Understanding this pipeline gives you the foundation to engineer prompts that work with the model’s mechanics rather than trying to “out‑smart” a black box.

Controlling generation

It’s all still just next‑token prediction. The training only shapes which predictions have high probability.

Remember that probability distribution we talked about? Here’s where you get control. The model gives you probabilities, but configuration parameters decide how tokens are actually selected from those probabilities.

Temperature

Controls how “random” the model’s choices are.

Example scenario – the model predicts:

Token       Probability
Paris       85 %
beautiful    3 %
London       2 %

  • Low Temperature (e.g., 0.2) – the model becomes more “confident” and almost always picks the top choice. Paris might effectively become 95 %+ likely.

    • Result: Deterministic, focused, repetitive outputs.
    • Use for: Code generation, data extraction, factual answers.
  • High Temperature (e.g., 0.8) – the probability curve flattens. Paris might drop to ~60 %, beautiful rises to ~10 %, London to ~8 %.

    • Result: More varied, creative, unpredictable outputs.
    • Use for: Creative writing, brainstorming, multiple perspectives.

Real example

Prompt: "The sky is"
Temperature 0.2 → "blue" (almost always)
Temperature 0.8 → "blue" or "cloudy" or "vast" or "filled with stars" (varies)
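Mechanically, temperature rescales the distribution before sampling. A sketch (the numbers reuse the Paris example, with the rest of the vocabulary lumped into a fifth bucket):

import numpy as np

def apply_temperature(probs, temperature):
    # T < 1 sharpens the distribution (more deterministic);
    # T > 1 flattens it (more varied); T = 1 leaves it unchanged.
    logits = np.log(probs) / temperature
    z = np.exp(logits - logits.max())
    return z / z.sum()

probs = np.array([0.85, 0.05, 0.03, 0.02, 0.05])  # Paris, the, beautiful, London, rest
print(apply_temperature(probs, 0.2).round(3))     # top choice dominates even more
print(apply_temperature(probs, 0.8).round(3))     # probability mass spreads out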

Top‑P (Nucleus Sampling)

Sets a probability threshold.

If you set Top‑P = 0.9, the model only considers tokens that together make up the top 90 % of probability, ignoring everything else.

Why this matters

  • Without Top‑P, even with a reasonable temperature, the model might occasionally pick a token with a 0.001 % probability → complete nonsense.
  • With Top‑P = 0.9, those ultra‑low‑probability tokens are never considered. The model stays coherent while still being creative.
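A sketch of the filtering step: sort tokens by probability, keep the smallest set of top tokens whose cumulative probability reaches p, and renormalize the survivors.

import numpy as np

def top_p_filter(probs, p=0.9):
    order = np.argsort(probs)[::-1]                 # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]    # keep only the "nucleus"
    return mask / mask.sum()                        # renormalize survivors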

Practical combinations

Temperature   Top‑P   Effect
0.7           0.9     Creative but coherent
0.2           1.0     Deterministic and focused

Top‑K

Limits the model to the K most likely tokens.

Example: Top‑K = 50 → only the 50 highest‑probability tokens are considered; the rest are ignored.
This is a simpler version of Top‑P and is less commonly used in modern systems.
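The corresponding sketch is even simpler than Top‑P:

import numpy as np

def top_k_filter(probs, k=50):
    # Keep only the k most likely tokens, zero out the rest, renormalize.
    order = np.argsort(probs)[::-1][:k]
    mask = np.zeros_like(probs)
    mask[order] = probs[order]
    return mask / mask.sum()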

Full walk‑through example

Prompt: “Explain photosynthesis in simple terms”

  1. Tokenization
    ["Explain", " photosynthesis", " in", " simple", " terms"]

  2. Model processing – the transformer calculates relationships between tokens.

  3. Probability distribution for the next token

Token            Probability
Photosynthesis   40 %
It               15 %
The              12 %
In                8 %

  4. Configuration applied – Temperature 0.3, Top‑P 0.9

    • Low temperature sharpens the distribution: Photosynthesis → 65 %
    • Model picks Photosynthesis.
  5. Repeat – Now the sequence is

    Explain photosynthesis in simple terms Photosynthesis

    Calculate probabilities for the next token, pick based on the same configuration, and continue until the answer is complete.
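Tying the pieces together, one decode step looks roughly like this, reusing the apply_temperature, top_p_filter, and sample_token helpers sketched earlier (the numbers mirror the table above, with the remaining 25 % lumped into one bucket):

import numpy as np

vocab = ["Photosynthesis", "It", "The", "In", "<other>"]
probs = np.array([0.40, 0.15, 0.12, 0.08, 0.25])

probs = apply_temperature(probs, 0.3)   # sharpen: top token dominates
probs = top_p_filter(probs, 0.9)        # discard the low-probability tail
print(vocab[sample_token(probs)])       # most likely "Photosynthesis"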

The model never “understood” photosynthesis. It predicted tokens that statistically form explanations based on patterns from its training data.

Key takeaway

LLMs are probability engines, not reasoning machines. Every response is a statistical prediction of the next token, shaped by training data and controlled by configuration parameters.

Engineering the probabilities

Your prompt does more than ask a question—it shapes the entire probability landscape the model uses. By tweaking wording, reordering instructions, or adding examples, you can make different tokens more likely, leading to different outputs.

In the next part, we’ll explore The Art & Science of Prompting—how to deliberately craft prompts that steer those probabilities in your favor.

Resources

I’ve set up a GitHub repository for this series, where I’ll share code and additional resources. Check it out and give it a star!

Feel free to share your thoughts, comments, and insights below. Let’s learn and grow together!
