[Paper] Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study
Source: arXiv - 2602.08996v1
Overview
The paper tackles a surprisingly hard problem for modern video‑language models: automatically generating useful coaching feedback for athletes. Using rock‑climbing as a testbed, the authors show how to boost performance without gathering costly sport‑specific annotations, and they introduce new ways to evaluate feedback that go beyond generic BLEU‑style scores.
Key Contributions
- Cross‑domain data augmentation: Leverages freely available climbing competition videos and coaching manuals to supplement a small set of expert feedback annotations drawn from a different sport.
- Two novel evaluation metrics:
  - Specificity – measures how detailed and sport‑relevant the feedback is.
  - Actionability – measures whether the feedback suggests concrete, executable improvements.
- Demonstrated generalization: Shows that a video‑LLM fine‑tuned on one sport can be adapted to another (rock climbing) with minimal extra supervision.
- Open‑source pipeline: Provides code and data processing scripts that can be reused for other sports or activity‑based domains.
Methodology
- Base Model: Starts from a state‑of‑the‑art video‑LLM (e.g., Flamingo, Video‑ChatGPT) pre‑trained on large web video‑text corpora.
- Source‑domain feedback: Uses an existing dataset of expert feedback from a different sport (e.g., gymnastics) to give the model a notion of “what good feedback looks like.”
- Target‑domain auxiliary data: Collects two kinds of publicly available climbing material:
  - Competition footage (raw video clips with timestamps).
  - Coaching manuals / guidebooks (textual descriptions of technique, common mistakes, and drills).
- Multi‑modal alignment: The model is jointly trained to (a) associate video frames with relevant textual snippets from manuals and (b) imitate the style of the source‑domain feedback. This is done via a contrastive loss that encourages correct video‑text pairs and a language‑model loss that shapes the output style.
- Evaluation suite: In addition to standard NLG metrics, the authors compute specificity (via a domain‑specific TF‑IDF score) and actionability (via a classifier trained on manually labeled "actionable" vs. "generic" feedback). Human judges also rate a subset of outputs.
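The joint objective described above can be sketched as a weighted sum of a contrastive term (correct video–text pairs score higher than mismatched ones) and a language‑model loss. The NumPy snippet below is a minimal illustration using a symmetric InfoNCE-style loss; the temperature, the weighting coefficient `alpha`, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along an axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of video_emb should match
    row i of text_emb (a correct video / manual-snippet pair)."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature              # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_v2t = -(logits - _logsumexp(logits, axis=1))[idx, idx].mean()
    loss_t2v = -(logits.T - _logsumexp(logits.T, axis=1))[idx, idx].mean()
    return (loss_v2t + loss_t2v) / 2

def joint_loss(lm_loss, video_emb, text_emb, alpha=0.5):
    """Total objective: LM loss shaping the feedback style, plus a
    contrastive term grounding video frames in manual text."""
    return lm_loss + alpha * contrastive_loss(video_emb, text_emb)
```

Correctly paired embeddings drive the contrastive term toward zero, while mismatched pairs are penalized; `alpha` trades off style imitation against visual grounding.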
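As a rough illustration of the specificity idea, the sketch below scores feedback by how much of its vocabulary is climbing‑specific, weighting each domain term by its rarity in a generic background corpus. The term list, tokenization, and weighting are hypothetical; the paper's actual formula is not reproduced here.

```python
import math

def idf(term, background_docs):
    """Smoothed inverse document frequency of a term in a generic
    background corpus: rare-in-background terms get higher weight."""
    df = sum(term in doc.lower().split() for doc in background_docs)
    return math.log((1 + len(background_docs)) / (1 + df)) + 1

def specificity(feedback, domain_terms, background_docs):
    """Average IDF-weighted mass of domain vocabulary in the feedback.
    Generic praise scores near zero; technique-laden feedback scores high."""
    tokens = feedback.lower().split()
    if not tokens:
        return 0.0
    score = sum(idf(t, background_docs) for t in tokens if t in domain_terms)
    return score / len(tokens)
```

Under this sketch, "keep your hips close to the wall on the crux move" scores well above "good job", mirroring the contrast the human evaluators reported.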
Results & Findings
| Metric | Baseline (source‑only) | + Auxiliary climbing data | Notes |
|---|---|---|---|
| BLEU‑4 | 12.3 | 14.8 | — |
| BERTScore | 0.71 | 0.78 | — |
| Specificity | 0.42 | 0.68 | — |
| Actionability | 0.35 | 0.71 | — |
| Human overall quality | 2.8 | 3.9 | 1–5 scale, 5 = perfect |
- Adding competition videos and manuals raises specificity and actionability dramatically, confirming that the model learns domain‑relevant details rather than generic praise.
- Human evaluators note that the augmented model produces feedback such as “keep your hips close to the wall on the crux move” instead of vague statements like “good job”.
- The approach works even when only ≈5 % of the target‑domain data is annotated, highlighting strong data efficiency.
Practical Implications
- Coaching platforms: Companies building AI‑assisted sports apps can bootstrap feedback for new disciplines by crawling public competition streams and rulebooks, avoiding expensive expert labeling.
- Real‑time video analysis: The method can be integrated into live‑streaming pipelines to give climbers on‑the‑fly tips, similar to “instant replay” analysis used in broadcasting.
- Cross‑sport transfer: The same recipe could be applied to gymnastics, skiing, or e‑sports, where abundant broadcast footage exists but expert commentary is scarce.
- Metric adoption: Specificity and actionability provide more meaningful signals for product teams than BLEU; they can be incorporated into A/B testing loops to monitor AI coach quality.
Limitations & Future Work
- Domain bias: The auxiliary data is limited to English‑language manuals and high‑production competition footage; niche climbing styles (e.g., bouldering in remote gyms) may be under‑represented.
- Evaluation scope: While specificity/actionability correlate with human judgments, they are still proxy metrics; a larger-scale user study with actual climbers is needed.
- Scalability to multimodal sensors: The current pipeline only consumes RGB video; integrating depth or wearable IMU data could further improve feedback granularity.
- Generalization beyond sports: Extending the framework to non‑sport activities (e.g., musical instrument practice) remains an open question.
Bottom line: By cleverly mixing freely available competition videos and coaching literature, the authors demonstrate a practical path to AI‑driven, sport‑specific feedback without the usual data‑collection nightmare—an insight that could accelerate intelligent coaching tools across many physical domains.
Authors
- Arushi Rai
- Adriana Kovashka
Paper Information
- arXiv ID: 2602.08996v1
- Categories: cs.CV
- Published: February 9, 2026