[Paper] Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study

Published: February 9, 2026 at 01:41 PM EST
4 min read
Source: arXiv


Overview

The paper tackles a surprisingly hard problem for modern video‑language models: automatically generating useful coaching feedback for athletes. Using rock‑climbing as a testbed, the authors show how to boost performance without gathering costly sport‑specific annotations, and they introduce new ways to evaluate feedback that go beyond generic BLEU‑style scores.

Key Contributions

  • Cross‑domain data augmentation: Leverages freely available climbing competition videos and coaching manuals to supplement a small set of existing feedback from a completely different sport.
  • Two novel evaluation metrics:
    1. Specificity – measures how detailed and sport‑relevant the feedback is.
    2. Actionability – measures whether the feedback suggests concrete, executable improvements.
  • Demonstrated generalization: Shows that a video‑LLM fine‑tuned on one sport can be adapted to another (rock climbing) with minimal extra supervision.
  • Open‑source pipeline: Provides code and data processing scripts that can be reused for other sports or activity‑based domains.
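The paper computes specificity with a domain-specific TF-IDF score. The exact formula is not given in this summary, so the following is a minimal sketch under assumed details: a term that is frequent in climbing manuals but rare in generic text contributes positively, so domain phrases outscore generic praise. All function names and corpora here are illustrative.

```python
import math
from collections import Counter

def idf(term, corpus):
    """Inverse document frequency of `term` over a list of token lists."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0

def specificity(feedback_tokens, domain_corpus, generic_corpus):
    """Mean domain-weighted TF-IDF of the feedback tokens (assumed scoring)."""
    tf = Counter(feedback_tokens)
    total = sum(tf.values())
    score = 0.0
    for term, count in tf.items():
        # Terms rare in generic text but common in climbing manuals score
        # high; generic praise ("good", "job") contributes nothing.
        weight = idf(term, generic_corpus) - idf(term, domain_corpus)
        score += (count / total) * max(weight, 0.0)
    return score

generic = [["good", "job", "nice", "work"], ["great", "effort"]]
domain = [["keep", "hips", "close", "wall", "crux"],
          ["heel", "hook", "flag", "drop", "knee"]]
print(specificity(["keep", "hips", "close", "wall"], domain, generic))
print(specificity(["good", "job"], domain, generic))  # → 0.0
```

Under this toy scoring, "keep your hips close to the wall" style feedback scores well above a bare "good job", matching the behavior the metric is meant to capture.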

Methodology

  1. Base Model: Starts from a state‑of‑the‑art video‑LLM (e.g., Flamingo, Video‑ChatGPT) pre‑trained on large web video‑text corpora.
  2. Source‑domain feedback: Uses an existing dataset of expert feedback from a different sport (e.g., gymnastics) to give the model a notion of “what good feedback looks like.”
  3. Target‑domain auxiliary data: Collects two kinds of publicly available climbing material:
    • Competition footage (raw video clips with timestamps).
    • Coaching manuals / guidebooks (textual descriptions of technique, common mistakes, and drills).
  4. Multi‑modal alignment: The model is jointly trained to (a) associate video frames with relevant textual snippets from manuals and (b) imitate the style of the source‑domain feedback. This is done via a contrastive loss that encourages correct video‑text pairs and a language‑model loss that shapes the output style.
  5. Evaluation suite: In addition to standard NLG metrics, the authors compute specificity (using a domain‑specific term frequency‑inverse document frequency score) and actionability (via a classifier trained on manually labeled “actionable vs. generic” feedback). Human judges also rate a subset of outputs.
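Step 4's joint objective can be sketched as an InfoNCE-style contrastive loss over video-text pairs plus a token-level language-model loss. This is not the authors' code; the loss weighting `alpha`, the temperature, and the NumPy formulation are all assumptions made for illustration.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Contrastive loss: matching video/text pairs sit on the diagonal."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature                # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # correct pair = diagonal

def lm_loss(token_log_probs):
    """Negative log-likelihood of the reference feedback tokens."""
    return -np.mean(token_log_probs)

def joint_loss(video_emb, text_emb, token_log_probs, alpha=0.5):
    """Weighted sum of alignment and style-imitation losses (alpha assumed)."""
    return alpha * info_nce(video_emb, text_emb) + (1 - alpha) * lm_loss(token_log_probs)

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 16))
t = v + 0.01 * rng.normal(size=(4, 16))   # nearly aligned pairs
loss = joint_loss(v, t, token_log_probs=np.array([-1.2, -0.8, -2.0]))
print(loss)
```

The contrastive term pulls each clip toward its matching manual snippet and away from the other snippets in the batch, while the LM term shapes the output toward the source-domain feedback style.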

Results & Findings

| Metric | Baseline (source‑only) | + Auxiliary climbing data |
| --- | --- | --- |
| BLEU‑4 | 12.3 | 14.8 |
| BERTScore | 0.71 | 0.78 |
| Specificity | 0.42 | 0.68 |
| Actionability | 0.35 | 0.71 |
| Human overall quality (out of 5; 5 = perfect) | 2.8 | 3.9 |

  • Adding competition videos and manuals raises specificity and actionability dramatically, confirming that the model learns domain‑relevant details rather than generic praise.
  • Human evaluators note that the augmented model produces feedback such as “keep your hips close to the wall on the crux move” instead of vague statements like “good job”.
  • The approach works even when only ≈5 % of the target‑domain data is annotated, highlighting strong data efficiency.

Practical Implications

  • Coaching platforms: Companies building AI‑assisted sports apps can bootstrap feedback for new disciplines by crawling public competition streams and rulebooks, avoiding expensive expert labeling.
  • Real‑time video analysis: The method can be integrated into live‑streaming pipelines to give climbers on‑the‑fly tips, similar to “instant replay” analysis used in broadcasting.
  • Cross‑sport transfer: The same recipe could be applied to gymnastics, skiing, or e‑sports, where abundant broadcast footage exists but expert commentary is scarce.
  • Metric adoption: Specificity and actionability provide more meaningful signals for product teams than BLEU; they can be incorporated into A/B testing loops to monitor AI coach quality.
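To make the A/B-monitoring idea concrete, here is a hypothetical sketch. The paper trains a classifier on labeled "actionable vs. generic" feedback; a simple imperative-verb heuristic stands in for that classifier below, and the verb list and function names are invented for illustration.

```python
# Stand-in for the paper's trained actionability classifier: treat feedback
# that opens with an imperative movement verb as actionable (assumption).
ACTION_VERBS = {"keep", "move", "place", "shift", "drop", "push", "pull",
                "bend", "flag", "rotate"}

def is_actionable(feedback: str) -> bool:
    words = feedback.lower().split()
    return bool(words) and words[0] in ACTION_VERBS

def actionability_rate(batch):
    """Fraction of feedback strings judged actionable, for A/B dashboards."""
    return sum(is_actionable(f) for f in batch) / len(batch)

variant_a = ["Good job!", "Nice climbing."]
variant_b = ["Keep your hips close to the wall on the crux move.",
             "Drop your knee to reach the sidepull."]
print(actionability_rate(variant_a), actionability_rate(variant_b))
# → 0.0 1.0
```

In a real pipeline the heuristic would be replaced by the trained classifier, but the monitoring loop, scoring each variant's feedback batch and comparing rates, stays the same.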

Limitations & Future Work

  • Domain bias: The auxiliary data is limited to English‑language manuals and high‑production competition footage; niche climbing styles (e.g., bouldering in remote gyms) may be under‑represented.
  • Evaluation scope: While specificity/actionability correlate with human judgments, they are still proxy metrics; a larger-scale user study with actual climbers is needed.
  • Scalability to multimodal sensors: The current pipeline only consumes RGB video; integrating depth or wearable IMU data could further improve feedback granularity.
  • Generalization beyond sports: Extending the framework to non‑sport activities (e.g., musical instrument practice) remains an open question.

Bottom line: By cleverly mixing freely available competition videos and coaching literature, the authors demonstrate a practical path to AI‑driven, sport‑specific feedback without the usual data‑collection nightmare—an insight that could accelerate intelligent coaching tools across many physical domains.

Authors

  • Arushi Rai
  • Adriana Kovashka

Paper Information

  • arXiv ID: 2602.08996v1
  • Categories: cs.CV
  • Published: February 9, 2026
