Rethinking Learning Dynamics in AI Models: An Early Theory from Experimentation
Source: Dev.to
Core Assumption in Deep Learning
- Typical view: Minimize loss → improve performance.
- Observation: Loss minimization alone does not always correlate with meaningful representation learning, especially in the early phases of training.
Proposed Idea
Representation Instability Phase – gradients first optimize surface‑level patterns before stable internal abstractions emerge.
I’m not sure whether this phenomenon already has a name, or if I’m mistaking training noise for structure.
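To make "instability" something I can plot rather than eyeball, here is a minimal sketch (my own framing, not an established metric) that tracks the relative change of the encoder weights between consecutive epochs; the cutoff for calling an epoch "unstable" is an arbitrary choice.

```python
import torch

def relative_drift(prev_weights: torch.Tensor, curr_weights: torch.Tensor) -> float:
    """Relative change of a weight matrix between two consecutive epochs.

    Large values early in training that drop toward ~0 later would match
    the instability -> stabilization pattern described above.
    """
    return ((curr_weights - prev_weights).norm() / prev_weights.norm()).item()

# Hypothetical usage: flag an epoch as "unstable" above some cutoff (0.05 is arbitrary).
# unstable = relative_drift(prev_w, curr_w) > 0.05
```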
Empirical Observations
| Training Phase | Observed Behavior |
|---|---|
| Early epochs | Highly volatile embeddings |
| Mid‑training | Sudden clustering of representations |
| Late training | Stabilization of embeddings even as loss improvement slows |
These dynamics were observed by logging intermediate-layer activations while training a small transformer-like model on synthetic data.
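The "sudden clustering" row is the part I trust least, since loss says nothing about it. Below is a rough sketch of one way it could be quantified, assuming scikit-learn is available: hook an intermediate layer, collect activations for a probe batch, and score their cluster structure with a silhouette score over k-means labels. The hooked layer (`model.encoder`), the probe batch, and `n_clusters=5` are placeholders, not my exact setup.

```python
import torch
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Capture activations from one intermediate layer via a forward hook.
captured = {}

def save_activation(module, inputs, output):
    captured["acts"] = output.detach()

hook = model.encoder.register_forward_hook(save_activation)
with torch.no_grad():
    model(inputs)  # one forward pass over a held-out probe batch
hook.remove()

# Flatten to (num_samples, feature_dim) for clustering.
acts = captured["acts"].reshape(captured["acts"].shape[0], -1).cpu().numpy()

# Crude clustering score: higher silhouette = more separated clusters.
# n_clusters is a guess; the trend across epochs matters more than the value.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(acts)
print("silhouette:", silhouette_score(acts, labels))
```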
Training Loop Example
```python
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    # Track the encoder weight norm without building a computation graph
    with torch.no_grad():
        embedding_norm = model.encoder.weight.norm().item()

    print(f"Epoch {epoch} | Loss: {loss.item():.4f} | Embedding Norm: {embedding_norm:.2f}")
```
Note: Embedding norms and cosine similarities changed more drastically than loss values, especially early on.
Cosine Similarity Tracker
```python
import torch.nn.functional as F

prev_embedding = None
for epoch in range(num_epochs):
    # ... forward/backward/step as in the training loop above ...

    current_embedding = model.encoder.weight.clone().detach()
    if prev_embedding is not None:
        # Cosine similarity between flattened encoder weights of consecutive epochs
        similarity = F.cosine_similarity(
            prev_embedding.view(-1),
            current_embedding.view(-1),
            dim=0,
        )
        print(f"Epoch {epoch} | Embedding Stability: {similarity.item():.4f}")
    prev_embedding = current_embedding
```
The similarity score jumps erratically at first, then converges toward ~0.98–0.99 even when loss improvement becomes marginal.
Working Hypothesis
- Gradient descent initially prioritizes optimization shortcuts over semantic structure.
- Only later does it converge toward representations that are robust and generalizable (a linear-probe check of this is sketched after this list).
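One way to pressure-test the second bullet would be to freeze the model at several checkpoints and fit a linear probe on its representations: if probe accuracy keeps rising after the loss curve has flattened, that would support "semantics arrive late". The sketch below assumes features and labels have already been extracted from a checkpoint; nothing in it is specific to my actual setup.

```python
import torch
import torch.nn as nn

def probe_accuracy(train_feats, train_labels, test_feats, test_labels,
                   num_classes, epochs=100, lr=1e-2):
    """Fit a linear probe on frozen features and report test accuracy.

    Features are assumed to be 2-D float tensors (num_samples, feature_dim)
    extracted from a frozen checkpoint; labels are class indices.
    """
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(train_feats), train_labels)
        loss.backward()
        opt.step()

    with torch.no_grad():
        preds = probe(test_feats).argmax(dim=1)
        return (preds == test_labels).float().mean().item()
```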
Potential Implications
- Early stopping might prevent meaningful abstraction (a stopping-criterion sketch that accounts for this follows this list).
- Some overfitting phases could be necessary for representation formation.
- Regularization may delay, rather than prevent, representation collapse.
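If the first bullet holds, a stopping rule should probably look at representation stability as well as loss. A minimal sketch of such a rule follows; the patience, tolerance, and stability threshold are arbitrary placeholders, not tuned values.

```python
def should_stop(val_losses, stability_scores,
                patience=5, loss_tol=1e-3, stability_threshold=0.98):
    """Stop only when loss has plateaued AND representations have stabilized.

    val_losses: list of validation losses, one per epoch.
    stability_scores: list of epoch-to-epoch cosine similarities (as above).
    All thresholds are illustrative.
    """
    if len(val_losses) <= patience:
        return False
    # Loss improvement over the last `patience` epochs is below tolerance
    loss_plateaued = (val_losses[-patience - 1] - min(val_losses[-patience:])) < loss_tol
    # Embeddings barely move between consecutive epochs
    representations_stable = stability_scores[-1] >= stability_threshold
    return loss_plateaued and representations_stable
```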
Open Questions
- Dataset size: Is this an artifact of small datasets?
- Existing theory: Does this align with concepts like loss‑landscape flatness, mode connectivity, or the information bottleneck?
- Metrics: Are there better metrics than loss for tracking learning quality, e.g. representation-similarity measures such as CKA (sketched after this list)?
- Noise vs. structure: Am I confusing emergent structure with random alignment?
- Formal recognition: Is the “instability → abstraction → stabilization” pattern formally recognized in the literature?
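On the metrics question, one candidate I have seen referenced is Centered Kernel Alignment (CKA), which compares two sets of activations computed on the same inputs. Below is a minimal sketch of linear CKA, re-implemented from the published formula, so treat it as an assumption rather than a vetted tool.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: (num_samples, feature_dim) activations for the SAME inputs,
    e.g. the same layer at two different checkpoints. Returns a value
    in [0, 1]; values near 1 indicate highly similar representations.
    """
    # Center features over the sample dimension
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)

    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = (Y.T @ X).norm(p="fro") ** 2
    denominator = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
    return (numerator / denominator).item()
```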
I’m treating this more as a question than a claim. Any insights into where the reasoning might break—or how to frame it more rigorously—are greatly appreciated.