[Paper] Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been ...