Pose as Data: Building a Real-Time AI Physical Therapist with MediaPipe and GPT-4o

Published: January 30, 2026 at 08:00 PM EST
4 min read
Source: Dev.to

Beck_Moulton

Doing physical therapy (PT) at home is a double‑edged sword. On one hand, you’re in your pajamas; on the other, you have no idea if your “squat” looks more like a graceful crane or a folding lawn chair. Bad form isn’t just ineffective—it’s dangerous.

In this tutorial, we’re bridging the gap between raw Computer Vision and Generative AI. We will build a system that uses MediaPipe for real‑time pose estimation, serializes skeletal data into JSON, and pipes it through WebSockets to GPT‑4o‑mini to provide professional‑grade corrective feedback. Whether you are interested in AI‑driven fitness, human‑computer interaction, or real‑time multimodal LLMs, this guide covers the full stack of modern vision‑to‑text pipelines.

The Architecture: From Pixels to Prescription

How do we turn a video stream into actionable medical advice? The secret lies in treating “pose as data.” Instead of sending raw video frames to an LLM (which is expensive and slow), we extract the 3D coordinates of skeletal joints and send the mathematical representation of the movement.

graph TD
    A[User Webcam] -->|Video Frame| B(MediaPipe Pose)
    B -->|3D Landmarks| C{Data Processor}
    C -->|JSON Skeleton| D[FastAPI WebSocket]
    D -->|Contextual Prompt| E[GPT‑4o‑mini]
    E -->|Corrective Text/Speech| F[Frontend UI]
    F -->|Real‑time Feedback| A
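To see why "pose as data" matters for cost and latency, compare the size of a serialized skeleton against a raw video frame. A quick back-of-the-envelope sketch (the landmark values here are random stand-ins for real MediaPipe output):

```python
import json
import random

# Simulate one frame of MediaPipe Pose output: 33 landmarks,
# each with normalized x/y/z plus a visibility score.
landmarks = [
    {"x": random.random(), "y": random.random(),
     "z": random.random(), "visibility": random.random()}
    for _ in range(33)
]

payload = json.dumps(landmarks)
print(f"Skeleton payload: {len(payload)} bytes")  # a few KB at most
```

A single 640x480 JPEG frame typically weighs tens of kilobytes, so the skeleton is an order of magnitude smaller, and the LLM can reason over it directly as text.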

Prerequisites

To follow along, you’ll need:

  • MediaPipe – high‑fidelity body tracking.
  • OpenCV – video‑stream handling.
  • FastAPI / WebSockets – low‑latency communication.
  • OpenAI SDK – access to GPT‑4o‑mini’s reasoning capabilities.

Step 1: Extracting Skeletal Landmarks with MediaPipe

First, we need to extract MediaPipe's 33 pose landmarks (shoulders, knees, ankles, etc.) from the video feed. MediaPipe returns normalized (x, y, z) coordinates for each.

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
pose = mp_pose.Pose(static_image_mode=False, min_detection_confidence=0.5)

def get_skeletal_data(frame):
    # Convert the BGR image to RGB
    image_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = pose.process(image_rgb)

    if not results.pose_landmarks:
        return None

    # Extract key points relevant to the exercise (e.g., squat).
    # MediaPipe indices: 25 = left knee, 26 = right knee, 24 = right hip
    landmarks = results.pose_landmarks.landmark
    data = {
        "left_knee":  {"x": landmarks[25].x, "y": landmarks[25].y, "z": landmarks[25].z},
        "right_knee": {"x": landmarks[26].x, "y": landmarks[26].y, "z": landmarks[26].z},
        "hip":        {"x": landmarks[24].x, "y": landmarks[24].y, "z": landmarks[24].z}
    }
    return data
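Raw coordinates work, but LLMs tend to reason more reliably over derived features. A common refinement (not part of the article's snippet; the points below are made-up illustrations) is to precompute joint angles, so the model sees a single number like "knee flexion: 90°":

```python
import math

def joint_angle(a, b, c):
    """Angle at point b (in degrees) between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    mag = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / mag))))

# Straight leg: hip, knee, ankle collinear
print(joint_angle((0.5, 0.2), (0.5, 0.5), (0.5, 0.8)))  # 180.0
# Right-angle bend at the knee
print(joint_angle((0.5, 0.5), (0.5, 0.8), (0.8, 0.8)))  # 90.0
```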

Step 2: The Real‑Time Pipeline (WebSockets)

Since we want “real‑time” feedback, we can’t wait for a standard REST request. We’ll use WebSockets to stream landmark data to our backend.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
import json

app = FastAPI()

@app.websocket("/ws/rehab")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            # Receive skeletal data from the frontend/client
            client_data = await websocket.receive_text()
            skeleton = json.loads(client_data)

            # Analyze the pose. In production, gate this call (e.g., every
            # 30 frames or on specific movement triggers) rather than per message.
            feedback = analyze_pose_with_gpt(skeleton)
            await websocket.send_text(feedback)
    except WebSocketDisconnect:
        pass  # Client closed the connection
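Calling the LLM on every incoming message would be slow and expensive. One way to implement the "every 30 frames" idea is a simple counter; a minimal sketch (in a real app you would likely also trigger on movement events, such as reaching the bottom of a squat):

```python
class FrameGate:
    """Lets an expensive action through only once every `interval` frames."""
    def __init__(self, interval=30):
        self.interval = interval
        self.count = 0

    def should_fire(self):
        self.count += 1
        if self.count >= self.interval:
            self.count = 0
            return True
        return False

gate = FrameGate(interval=30)
fired = [i for i in range(90) if gate.should_fire()]
print(fired)  # [29, 59, 89] -- one LLM call per 30 frames
```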

Step 3: Feeding the “Physical Therapist” (GPT‑4o‑mini)

The magic happens in the prompt. We provide the LLM with the coordinate data and tell it to act as a Doctor of Physical Therapy.

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def analyze_pose_with_gpt(skeleton_data):
    prompt = f"""
    You are a professional Physical Therapist.
    Analyze these 3D coordinates of a patient performing a squat
    (normalized image coordinates, where y increases downward):
    {skeleton_data}

    If the hip y-value is noticeably smaller than the knee y-value
    (the hips are still well above the knees on screen), they aren't deep enough.
    If the knees are converging (check x-coordinates), warn about valgus stress.
    Provide a 1-sentence concise correction.
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50
    )
    return response.choices[0].message.content
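Since the prompt encodes hard geometric rules, a complementary pattern (my suggestion, not part of the article's code) is to evaluate those same rules deterministically in Python first and only spend tokens on GPT-4o-mini when a frame actually looks wrong. A sketch, with thresholds that are purely illustrative:

```python
def quick_form_check(skeleton):
    """Deterministic squat checks mirroring the prompt's rules.
    Returns a list of issues; an empty list means the frame looks fine."""
    issues = []
    left, right = skeleton["left_knee"], skeleton["right_knee"]
    hip = skeleton["hip"]

    # Depth: y grows downward, so a shallow squat leaves the hip
    # well above (smaller y than) the knees.
    knee_y = (left["y"] + right["y"]) / 2
    if hip["y"] < knee_y - 0.10:  # 0.10 is an illustrative threshold
        issues.append("not_deep_enough")

    # Valgus: knees collapsing inward shrink the x-distance between them.
    if abs(left["x"] - right["x"]) < 0.05:  # illustrative threshold
        issues.append("knee_valgus")
    return issues

shallow = {
    "left_knee": {"x": 0.40, "y": 0.70}, "right_knee": {"x": 0.60, "y": 0.70},
    "hip": {"x": 0.50, "y": 0.45},
}
print(quick_form_check(shallow))  # ['not_deep_enough']
```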

Pro‑Tip: Advanced Implementation Patterns

While the above gets you an MVP, a production‑ready medical or fitness app must handle:

  • Jitter & occlusion (when a body part is hidden)
  • Temporal analysis (comparing the current frame to the last 10 frames)

Consider implementing Kalman filters for smoothing landmarks or optimizing LLM token usage for vision tasks. For deeper dives, check out the technical articles at WellAlly Blog—a fantastic resource for developers bridging the gap between “cool demo” and “robust AI product.”
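Before reaching for a full Kalman filter, an exponential moving average already removes most frame-to-frame jitter. A minimal sketch (the `alpha` value is an assumption to tune per use case):

```python
class LandmarkSmoother:
    """Exponential moving average over a stream of (x, y) landmark values."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha  # Lower alpha = smoother output, but more lag
        self.state = None

    def update(self, point):
        if self.state is None:
            self.state = point
        else:
            self.state = tuple(
                self.alpha * new + (1 - self.alpha) * old
                for new, old in zip(point, self.state)
            )
        return self.state

smoother = LandmarkSmoother(alpha=0.3)
noisy = [(0.50, 0.50), (0.58, 0.42), (0.49, 0.51), (0.52, 0.48)]
for p in noisy:
    smoothed = smoother.update(p)
print(smoothed)  # hovers near (0.5, 0.5) despite the jitter
```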

Conclusion

By treating pose as data, we decouple the “vision” from the “intelligence.” MediaPipe handles the heavy lifting of spatial geometry, while GPT‑4o‑mini delivers nuanced human instruction.

What’s next?

  • Add TTS (Text‑to‑Speech) – use OpenAI’s text‑to‑speech API or ElevenLabs so the “AI Coach” can talk to the user in real‑time.
  • Temporal logic – send a sequence of coordinates so the AI can analyze the tempo of the exercise, not just a static frame.
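The temporal idea can be prototyped without an LLM at all: buffer recent hip heights and count threshold crossings to detect reps (and, from timestamps, tempo). A sketch with synthetic data (the `deep`/`shallow` thresholds are illustrative assumptions):

```python
def count_reps(hip_ys, deep=0.65, shallow=0.55):
    """Count squat reps from a series of hip y-values (y grows downward).
    A rep = hip crosses below the 'deep' line, then back above 'shallow'."""
    reps, in_squat = 0, False
    for y in hip_ys:
        if not in_squat and y > deep:    # descended past the deep line
            in_squat = True
        elif in_squat and y < shallow:   # came back up: one full rep
            in_squat = False
            reps += 1
    return reps

# Two reps of synthetic hip motion (standing ~0.5, bottom of squat ~0.7)
trace = [0.50, 0.60, 0.70, 0.60, 0.50, 0.62, 0.71, 0.58, 0.50]
print(count_reps(trace))  # 2
```

The two-threshold (hysteresis) design prevents jitter around a single line from being counted as extra reps.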

Are you building something in the AI‑driven fitness or health space? Drop a comment below or share your repo! 🚀
Let’s build the future of movement together.
