Rethinking Learning Dynamics in AI Models: An Early Theory from Experimentation
Source: Dev.to
Core Assumption in Deep Learning
- Typical view: Minimize loss → improve performance.
- Observation: Loss minimization alone does not always correlate with meaningful representation learning, especially in the early phases of training.
Proposed Idea
Representation Instability Phase – gradients first optimize surface‑level patterns before stable internal abstractions emerge.
I’m not sure whether this phenomenon already has a name, or if I’m mistaking training noise for structure.
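To make "instability" something I can plot rather than eyeball, here is a minimal sketch (my own framing, not an established metric) that tracks the relative change of the encoder weights between consecutive epochs; the cutoff for calling an epoch "unstable" is an arbitrary choice.

```python
import torch

def relative_drift(prev_weights: torch.Tensor, curr_weights: torch.Tensor) -> float:
    """Relative change of a weight matrix between two consecutive epochs.

    Large values early in training that drop toward ~0 later would match
    the instability -> stabilization pattern described above.
    """
    return ((curr_weights - prev_weights).norm() / prev_weights.norm()).item()

# Hypothetical usage: flag an epoch as "unstable" above some cutoff (0.05 is arbitrary).
# unstable = relative_drift(prev_w, curr_w) > 0.05
```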
Empirical Observations
| Training Phase | Observed Behavior |
|---|---|
| Early epochs | Highly volatile embeddings |
| Mid‑training | Sudden clustering of representations |
| Late training | Stabilization of embeddings even as loss improvement slows |
These dynamics were observed by logging intermediate-layer activations while training a small transformer-like model on synthetic data.
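The "sudden clustering" row is the part I trust least, since loss says nothing about it. Below is a rough sketch of one way it could be quantified, assuming scikit-learn is available: hook an intermediate layer, collect activations for a probe batch, and score their cluster structure with a silhouette score over k-means labels. The hooked layer (`model.encoder`), the probe batch, and `n_clusters=5` are placeholders, not my exact setup.

```python
import torch
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Capture activations from one intermediate layer via a forward hook.
captured = {}

def save_activation(module, inputs, output):
    captured["acts"] = output.detach()

hook = model.encoder.register_forward_hook(save_activation)
with torch.no_grad():
    model(inputs)  # one forward pass over a held-out probe batch
hook.remove()

# Flatten to (num_samples, feature_dim) for clustering.
acts = captured["acts"].reshape(captured["acts"].shape[0], -1).cpu().numpy()

# Crude clustering score: higher silhouette = more separated clusters.
# n_clusters is a guess; the trend across epochs matters more than the value.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(acts)
print("silhouette:", silhouette_score(acts, labels))
```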
Training Loop Example
```python
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    # Track the encoder weight norm without building a computation graph
    with torch.no_grad():
        embedding_norm = model.encoder.weight.norm().item()

    print(f"Epoch {epoch} | Loss: {loss.item():.4f} | Embedding Norm: {embedding_norm:.2f}")
```
Note: Embedding norms and cosine similarities changed more drastically than loss values, especially early on.
Cosine Similarity Tracker
```python
import torch.nn.functional as F

prev_embedding = None
for epoch in range(num_epochs):
    # ... forward/backward/step as in the training loop above ...

    current_embedding = model.encoder.weight.clone().detach()
    if prev_embedding is not None:
        # Cosine similarity between flattened encoder weights of consecutive epochs
        similarity = F.cosine_similarity(
            prev_embedding.view(-1),
            current_embedding.view(-1),
            dim=0,
        )
        print(f"Epoch {epoch} | Embedding Stability: {similarity.item():.4f}")
    prev_embedding = current_embedding
```
The similarity score jumps erratically at first, then converges toward ~0.98–0.99 even when loss improvement becomes marginal.
Working Hypothesis
- Gradient descent initially prioritizes optimization shortcuts over semantic structure.
- Only later does it converge toward representations that are robust and generalizable (a linear-probe check of this is sketched after this list).
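One way to pressure-test the second bullet would be to freeze the model at several checkpoints and fit a linear probe on its representations: if probe accuracy keeps rising after the loss curve has flattened, that would support "semantics arrive late". The sketch below assumes features and labels have already been extracted from a checkpoint; nothing in it is specific to my actual setup.

```python
import torch
import torch.nn as nn

def probe_accuracy(train_feats, train_labels, test_feats, test_labels,
                   num_classes, epochs=100, lr=1e-2):
    """Fit a linear probe on frozen features and report test accuracy.

    Features are assumed to be 2-D float tensors (num_samples, feature_dim)
    extracted from a frozen checkpoint; labels are class indices.
    """
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(train_feats), train_labels)
        loss.backward()
        opt.step()

    with torch.no_grad():
        preds = probe(test_feats).argmax(dim=1)
        return (preds == test_labels).float().mean().item()
```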
Potential Implications
- Early stopping might prevent meaningful abstraction (a stopping-criterion sketch that accounts for this follows this list).
- Some overfitting phases could be necessary for representation formation.
- Regularization may delay, rather than prevent, representation collapse.
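If the first bullet holds, a stopping rule should probably look at representation stability as well as loss. A minimal sketch of such a rule follows; the patience, tolerance, and stability threshold are arbitrary placeholders, not tuned values.

```python
def should_stop(val_losses, stability_scores,
                patience=5, loss_tol=1e-3, stability_threshold=0.98):
    """Stop only when loss has plateaued AND representations have stabilized.

    val_losses: list of validation losses, one per epoch.
    stability_scores: list of epoch-to-epoch cosine similarities (as above).
    All thresholds are illustrative.
    """
    if len(val_losses) <= patience:
        return False
    # Loss improvement over the last `patience` epochs is below tolerance
    loss_plateaued = (val_losses[-patience - 1] - min(val_losses[-patience:])) < loss_tol
    # Embeddings barely move between consecutive epochs
    representations_stable = stability_scores[-1] >= stability_threshold
    return loss_plateaued and representations_stable
```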
Open Questions
- Dataset size: Is this an artifact of small datasets?
- Existing theory: Does this align with concepts like loss‑landscape flatness, mode connectivity, or the information bottleneck?
- Metrics: Are there better metrics than loss for tracking learning quality, e.g. representation-similarity measures such as CKA (sketched after this list)?
- Noise vs. structure: Am I confusing emergent structure with random alignment?
- Formal recognition: Is the “instability → abstraction → stabilization” pattern formally recognized in the literature?
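On the metrics question, one candidate I have seen referenced is Centered Kernel Alignment (CKA), which compares two sets of activations computed on the same inputs. Below is a minimal sketch of linear CKA, re-implemented from the published formula, so treat it as an assumption rather than a vetted tool.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: (num_samples, feature_dim) activations for the SAME inputs,
    e.g. the same layer at two different checkpoints. Returns a value
    in [0, 1]; values near 1 indicate highly similar representations.
    """
    # Center features over the sample dimension
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)

    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = (Y.T @ X).norm(p="fro") ** 2
    denominator = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
    return (numerator / denominator).item()
```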
I’m treating this more as a question than a claim. Any insights into where the reasoning might break—or how to frame it more rigorously—are greatly appreciated.