Rethinking Learning Dynamics in AI Models: An Early Theory from Experimentation

Published: January 14, 2026 at 10:39 PM EST
2 min read
Source: Dev.to

Core Assumption in Deep Learning

  • Typical view: Minimize loss → improve performance.
  • Observation: Loss minimization alone does not always correlate with meaningful representation learning, especially in the early phases of training.

Proposed Idea

Representation Instability Phase – gradients first optimize surface‑level patterns before stable internal abstractions emerge.
I’m not sure whether this phenomenon already has a name, or if I’m mistaking training noise for structure.

Empirical Observations

Training Phase | Observed Behavior
Early epochs   | Highly volatile embeddings
Mid‑training   | Sudden clustering of representations
Late training  | Stabilization of embeddings even as loss improvement slows

These dynamics were observed while training a small transformer‑like model on synthetic data and logging intermediate layer activations.
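
For context, here is a minimal sketch of how intermediate activations can be logged with a PyTorch forward hook. The assumption that the model exposes an encoder submodule (as in the training loop below) and the buffer name are illustrative, not the exact experimental setup.

import torch

# Cache the most recent output of a chosen layer via a forward hook.
activations = {}

def save_activation(name):
    def hook(module, hook_inputs, output):
        activations[name] = output.detach()
    return hook

# Assumes the model has an `encoder` submodule, as in the loop below.
hook_handle = model.encoder.register_forward_hook(save_activation("encoder"))

# After each forward pass, activations["encoder"] holds the layer output
# and can be logged per epoch; call hook_handle.remove() when done.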

Training Loop Example

import torch

# Assumes model, criterion, optimizer, inputs, targets, and num_epochs
# are defined elsewhere (omitted in the original post).
for epoch in range(num_epochs):
    optimizer.zero_grad()

    outputs = model(inputs)
    loss = criterion(outputs, targets)

    loss.backward()
    optimizer.step()

    # Track the encoder weight norm without building a computation graph
    with torch.no_grad():
        embedding_norm = model.encoder.weight.norm().item()

    print(f"Epoch {epoch} | Loss: {loss.item():.4f} | Embedding Norm: {embedding_norm:.2f}")

Note: Embedding norms and cosine similarities changed more drastically than loss values, especially early on.

Cosine Similarity Tracker

import torch.nn.functional as F

prev_embedding = None
for epoch in range(num_epochs):
    # (training step from the loop above omitted here for brevity)
    current_embedding = model.encoder.weight.clone().detach()

    if prev_embedding is not None:
        # Cosine similarity between flattened weights from consecutive epochs
        similarity = F.cosine_similarity(
            prev_embedding.view(-1),
            current_embedding.view(-1),
            dim=0
        )
        print(f"Epoch {epoch} | Embedding Stability: {similarity.item():.4f}")

    prev_embedding = current_embedding

The similarity score jumps erratically at first, then converges toward ~0.98–0.99 even when loss improvement becomes marginal.
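
To make the phase boundaries less subjective, one rough heuristic (my addition, not something from the original experiment) is to mark the first epoch after which the epoch‑to‑epoch similarity stays above a threshold for several consecutive epochs:

def stabilization_epoch(similarities, threshold=0.98, patience=3):
    # Return the first epoch index from which similarity stays >= threshold
    # for `patience` consecutive epochs, or None if it never stabilizes.
    # Threshold and patience values are illustrative, not tuned.
    run = 0
    for epoch, sim in enumerate(similarities):
        run = run + 1 if sim >= threshold else 0
        if run >= patience:
            return epoch - patience + 1
    return None

# Example with hypothetical similarity scores from the tracker above:
# stabilization_epoch([0.41, 0.63, 0.85, 0.97, 0.985, 0.99, 0.991])  # -> 4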

Working Hypothesis

  • Gradient descent initially prioritizes optimization shortcuts rather than semantic structure.
  • Later it converges toward representations that are robust and generalizable.

Potential Implications

  • Early stopping might prevent meaningful abstraction (a stability‑aware stopping rule is sketched after this list).
  • Some overfitting phases could be necessary for representation formation.
  • Regularization may delay, rather than prevent, representation collapse.
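
If that reasoning holds, a stopping rule could require both a loss plateau and stabilized embeddings before halting, rather than relying on loss alone. The criterion below is purely hypothetical, combining a loss history with the similarity scores from the tracker above:

def should_stop(loss_history, stability_history,
                loss_patience=5, stability_threshold=0.98, min_delta=1e-4):
    # Hypothetical rule: stop only when the loss has plateaued for
    # `loss_patience` epochs AND the latest embedding similarity suggests
    # the representations have stopped reorganizing.
    if len(loss_history) <= loss_patience or not stability_history:
        return False
    recent_best = min(loss_history[-loss_patience:])
    earlier_best = min(loss_history[:-loss_patience])
    loss_plateaued = recent_best > earlier_best - min_delta
    representations_stable = stability_history[-1] >= stability_threshold
    return loss_plateaued and representations_stable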

Open Questions

  1. Dataset size: Is this an artifact of small datasets?
  2. Existing theory: Does this align with concepts like loss‑landscape flatness, mode connectivity, or the information bottleneck?
  3. Metrics: Are there better metrics than loss for tracking learning quality? (One candidate is sketched after this list.)
  4. Noise vs. structure: Am I confusing emergent structure with random alignment?
  5. Formal recognition: Is the “instability → abstraction → stabilization” pattern formally recognized in the literature?
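
On question 3, one candidate metric worth trying (my suggestion, not something tested here) is the effective rank of the encoder weights: the exponential of the entropy of the normalized singular values. It rises when representations spread across more directions and falls when they collapse.

import torch

def effective_rank(weight, eps=1e-12):
    # Effective rank of a 2D weight/embedding matrix: exp of the Shannon
    # entropy of its normalized singular values. Higher values indicate a
    # richer span of directions in the representation.
    singular_values = torch.linalg.svdvals(weight.detach().float())
    p = singular_values / (singular_values.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

# Hypothetical usage, logged alongside the loss:
# print(f"Encoder effective rank: {effective_rank(model.encoder.weight):.2f}")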

I’m treating this more as a question than a claim. Any insights into where the reasoning might break—or how to frame it more rigorously—are greatly appreciated.
