[Paper] Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study
Source: arXiv - 2602.08996v1
Overview
The paper tackles a surprisingly hard problem for modern video‑language models: automatically generating useful coaching feedback for athletes. Using rock‑climbing as a testbed, the authors show how to boost performance without gathering costly sport‑specific annotations, and they introduce new ways to evaluate feedback that go beyond generic BLEU‑style scores.
Key Contributions
- Cross‑domain data augmentation: Leverages freely available climbing competition videos and coaching manuals to supplement a small set of expert feedback annotations drawn from a different sport.
- Two novel evaluation metrics:
  - Specificity – measures how detailed and sport‑relevant the feedback is.
  - Actionability – measures whether the feedback suggests concrete, executable improvements.
- Demonstrated generalization: Shows that a video‑LLM fine‑tuned on one sport can be adapted to another (rock climbing) with minimal extra supervision.
- Open‑source pipeline: Provides code and data processing scripts that can be reused for other sports or activity‑based domains.
Methodology
- Base Model: Starts from a state‑of‑the‑art video‑LLM (e.g., Flamingo, Video‑ChatGPT) pre‑trained on large web video‑text corpora.
- Source‑domain feedback: Uses an existing dataset of expert feedback from a different sport (e.g., gymnastics) to give the model a notion of “what good feedback looks like.”
- Target‑domain auxiliary data: Collects two kinds of publicly available climbing material:
  - Competition footage (raw video clips with timestamps).
  - Coaching manuals / guidebooks (textual descriptions of technique, common mistakes, and drills).
- Multi‑modal alignment: The model is jointly trained to (a) associate video frames with relevant textual snippets from manuals and (b) imitate the style of the source‑domain feedback. This is done via a contrastive loss that encourages correct video‑text pairs and a language‑model loss that shapes the output style.
- Evaluation suite: In addition to standard NLG metrics, the authors compute specificity (via a domain‑specific TF‑IDF score) and actionability (via a classifier trained on manually labeled "actionable" vs. "generic" feedback). Human judges also rate a subset of outputs.
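The joint objective described above can be sketched as a weighted sum of a contrastive term (correct video–text pairs score higher than mismatched ones) and a language‑model loss. The NumPy snippet below is a minimal illustration using a symmetric InfoNCE-style loss; the temperature, the weighting coefficient `alpha`, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along an axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of video_emb should match
    row i of text_emb (a correct video / manual-snippet pair)."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature              # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_v2t = -(logits - _logsumexp(logits, axis=1))[idx, idx].mean()
    loss_t2v = -(logits.T - _logsumexp(logits.T, axis=1))[idx, idx].mean()
    return (loss_v2t + loss_t2v) / 2

def joint_loss(lm_loss, video_emb, text_emb, alpha=0.5):
    """Total objective: LM loss shaping the feedback style, plus a
    contrastive term grounding video frames in manual text."""
    return lm_loss + alpha * contrastive_loss(video_emb, text_emb)
```

Correctly paired embeddings drive the contrastive term toward zero, while mismatched pairs are penalized; `alpha` trades off style imitation against visual grounding.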
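As a rough illustration of the specificity idea, the sketch below scores feedback by how much of its vocabulary is climbing‑specific, weighting each domain term by its rarity in a generic background corpus. The term list, tokenization, and weighting are hypothetical; the paper's actual formula is not reproduced here.

```python
import math

def idf(term, background_docs):
    """Smoothed inverse document frequency of a term in a generic
    background corpus: rare-in-background terms get higher weight."""
    df = sum(term in doc.lower().split() for doc in background_docs)
    return math.log((1 + len(background_docs)) / (1 + df)) + 1

def specificity(feedback, domain_terms, background_docs):
    """Average IDF-weighted mass of domain vocabulary in the feedback.
    Generic praise scores near zero; technique-laden feedback scores high."""
    tokens = feedback.lower().split()
    if not tokens:
        return 0.0
    score = sum(idf(t, background_docs) for t in tokens if t in domain_terms)
    return score / len(tokens)
```

Under this sketch, "keep your hips close to the wall on the crux move" scores well above "good job", mirroring the contrast the human evaluators reported.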
Results & Findings
| Metric | Baseline (source‑only) | + Auxiliary climbing data | Notes |
|---|---|---|---|
| BLEU‑4 | 12.3 | 14.8 | — |
| BERTScore | 0.71 | 0.78 | — |
| Specificity | 0.42 | 0.68 | — |
| Actionability | 0.35 | 0.71 | — |
| Human overall quality | 2.8 | 3.9 | 1–5 scale, 5 = perfect |
- Adding competition videos and manuals raises specificity and actionability dramatically, confirming that the model learns domain‑relevant details rather than generic praise.
- Human evaluators note that the augmented model produces feedback such as “keep your hips close to the wall on the crux move” instead of vague statements like “good job”.
- The approach works even when only ≈5 % of the target‑domain data is annotated, highlighting strong data efficiency.
Practical Implications
- Coaching platforms: Companies building AI‑assisted sports apps can bootstrap feedback for new disciplines by crawling public competition streams and rulebooks, avoiding expensive expert labeling.
- Real‑time video analysis: The method can be integrated into live‑streaming pipelines to give climbers on‑the‑fly tips, similar to “instant replay” analysis used in broadcasting.
- Cross‑sport transfer: The same recipe could be applied to gymnastics, skiing, or e‑sports, where abundant broadcast footage exists but expert commentary is scarce.
- Metric adoption: Specificity and actionability provide more meaningful signals for product teams than BLEU; they can be incorporated into A/B testing loops to monitor AI coach quality.
Limitations & Future Work
- Domain bias: The auxiliary data is limited to English‑language manuals and high‑production competition footage; niche climbing styles (e.g., bouldering in remote gyms) may be under‑represented.
- Evaluation scope: While specificity/actionability correlate with human judgments, they are still proxy metrics; a larger-scale user study with actual climbers is needed.
- Scalability to multimodal sensors: The current pipeline only consumes RGB video; integrating depth or wearable IMU data could further improve feedback granularity.
- Generalization beyond sports: Extending the framework to non‑sport activities (e.g., musical instrument practice) remains an open question.
Bottom line: By cleverly mixing freely available competition videos and coaching literature, the authors demonstrate a practical path to AI‑driven, sport‑specific feedback without the usual data‑collection nightmare—an insight that could accelerate intelligent coaching tools across many physical domains.
Authors
- Arushi Rai
- Adriana Kovashka
Paper Information
- arXiv ID: 2602.08996v1
- Categories: cs.CV
- Published: February 9, 2026