Beyond Just a Photo: Building a Pixel-Perfect Calorie Estimator with SAM and GPT-4o

Published: January 27, 2026 at 07:45 PM EST
3 min read
Source: Dev.to

Introduction

We’ve all been there: staring at a delicious plate of pasta, trying to manually log every gram into a fitness app. It’s tedious, prone to “optimistic” human error, and frankly, ruins the meal. What if we could turn those pixels directly into nutritional data?

In this tutorial we build a Multimodal Dietary Analysis Engine by combining Meta’s Segment Anything Model (SAM) with the reasoning power of GPT‑4o. The system isolates food items, uses reference‑based scaling to estimate volume, and outputs a detailed nutritional breakdown.

Architecture Overview

graph TD
    A[User Uploads Image] --> B[OpenCV Preprocessing]
    B --> C[SAM: Segment Anything Model]
    C --> D{Mask Generation}
    D -->|Isolate Food| E[GPT-4o Multimodal Analysis]
    D -->|Reference Object| E
    E --> F[Nutritional Estimation Engine]
    F --> G[FastAPI Response: Calories, Macros, Confidence Score]

Required Stack

  • PyTorch – for running SAM weights.
  • Segment Anything (SAM) – Meta’s pre‑trained vision model.
  • GPT‑4o API – the multimodal “brain.”
  • FastAPI – to expose a production‑ready microservice.
  • OpenCV – for image manipulation.
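
Assuming a fresh Python environment, the stack above can be installed roughly like this (package names are per PyPI; SAM ships from GitHub, and the checkpoint URL is the one published in Meta's SAM README — verify before relying on it):

```shell
# Core dependencies
pip install torch opencv-python openai fastapi uvicorn numpy

# Segment Anything installs directly from the repository
pip install git+https://github.com/facebookresearch/segment-anything.git

# ViT-H checkpoint referenced in the code below (several GB)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```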

Food Segmentation with SAM

import torch
from segment_anything import sam_model_registry, SamPredictor
import cv2
import numpy as np

# Load the SAM model
sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"
device = "cuda" if torch.cuda.is_available() else "cpu"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam)

def get_food_segment(image_path):
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # Simple prompt: center of the image
    input_point = np.array([[image.shape[1] // 2, image.shape[0] // 2]])
    input_label = np.array([1])

    masks, scores, logits = predictor.predict(
        point_coords=input_point,
        point_labels=input_label,
        multimask_output=True,
    )
    # multimask_output=True returns several candidates; pick the highest-scoring one
    return masks[np.argmax(scores)]

Nutritional Analysis with GPT‑4o

import base64
import json

from openai import OpenAI

client = OpenAI()

def analyze_nutrition(image_path, mask_data):
    # NOTE: mask_data is received but not yet applied here; the full photo is sent.
    # Encode the image as base64
    with open(image_path, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a professional nutritionist. Analyze the food in the segmented area. Use surrounding objects (forks, plates) to estimate volume. Respond in JSON."
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Estimate the calories and macronutrients for the food highlighted in this image. Return a JSON object with calories, protein_g, carbs_g, fat_g, and confidence."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ],
        # json_object mode requires the word "JSON" to appear in the prompt
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
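
As written, `analyze_nutrition` accepts a mask but sends the unmodified photo, so GPT‑4o has to guess which region is meant. One way to make the "highlighted" phrasing literal is to dim everything outside the SAM mask before encoding. A minimal sketch (the helper name and dim factor are my choices, not part of the original pipeline):

```python
import numpy as np

def highlight_food(image: np.ndarray, mask: np.ndarray, dim: float = 0.35) -> np.ndarray:
    """Dim pixels outside the boolean mask so the segmented food stands out."""
    out = image.astype(np.float32)
    out[~mask] *= dim          # darken the background only
    return np.round(out).astype(np.uint8)

# Toy check: a flat gray image, mask covering the top-left quadrant
img = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
out = highlight_food(img, mask, dim=0.5)
```

Passing the result of `highlight_food` to `cv2.imencode(".jpg", ...)` and base64-encoding that buffer, instead of re-reading the file from disk, would close the loop.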

FastAPI Endpoint

from fastapi import FastAPI, UploadFile, File
import shutil
import os

app = FastAPI()

@app.post("/analyze-meal")
async def analyze_meal(file: UploadFile = File(...)):
    # 1. Save the upload to a temporary file
    temp_path = f"temp_{file.filename}"
    with open(temp_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)

    try:
        # 2. Run SAM segmentation
        mask = get_food_segment(temp_path)

        # 3. Call GPT-4o for nutritional analysis
        nutrition_data = analyze_nutrition(temp_path, mask)
    finally:
        # Clean up the temp file even if segmentation or analysis fails
        os.remove(temp_path)

    return {"status": "success", "data": nutrition_data}
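
With the service running under uvicorn, the endpoint can be exercised like this (the module name `main` and the filename are illustrative):

```shell
# start the service
uvicorn main:app --reload

# in another terminal, send a meal photo
curl -X POST -F "file=@pasta.jpg" http://localhost:8000/analyze-meal
```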

Production Considerations

While the code works for a hobby project, production‑grade health apps need:

  • Robust error handling (e.g., low‑light images, overlapping foods).
  • Pydantic models for request/response validation.
  • Real‑time feedback loops for user corrections.
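
To make the Pydantic bullet concrete, here is one possible response model. The field names and ranges are my assumptions, not a fixed schema; GPT‑4o would need to be prompted to emit exactly these keys:

```python
from pydantic import BaseModel, Field

class NutritionEstimate(BaseModel):
    food_name: str
    calories_kcal: float = Field(ge=0)
    protein_g: float = Field(ge=0)
    carbs_g: float = Field(ge=0)
    fat_g: float = Field(ge=0)
    confidence: float = Field(ge=0, le=1)  # model's certainty about the portion size

# Validate a (hypothetical) GPT-4o JSON payload
payload = {"food_name": "spaghetti bolognese", "calories_kcal": 640,
           "protein_g": 28, "carbs_g": 75, "fat_g": 22, "confidence": 0.7}
est = NutritionEstimate(**payload)
```

Returning `NutritionEstimate` from the FastAPI route would also give the endpoint automatic response validation and OpenAPI docs for free.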

For deeper architectural patterns and AI observability in health tech, see the WellAlly Tech Blog (a great resource for production‑ready AI health solutions).

Next Steps

  • Add a Reference Object Detection step (e.g., YOLOv8) to improve scaling accuracy.
  • Implement a feedback loop where users can confirm or adjust the estimated portion size.
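
The reference-object idea reduces to simple scale arithmetic: once a detector finds an object of known size, its pixel length yields a cm-per-pixel factor, and mask area scales by the square of that factor. A sketch, assuming a standard ~19 cm dinner fork as the reference (all numbers illustrative):

```python
def cm_per_pixel(ref_pixel_length: float, ref_real_cm: float = 19.0) -> float:
    """Scale factor derived from a reference object of known real-world size."""
    return ref_real_cm / ref_pixel_length

def mask_area_cm2(mask_pixel_count: int, scale: float) -> float:
    """Approximate real-world area covered by a SAM mask."""
    return mask_pixel_count * scale ** 2

scale = cm_per_pixel(380.0)          # a fork spanning 380 px gives 0.05 cm/px
area = mask_area_cm2(40_000, scale)  # 40k mask pixels is roughly 100 cm^2
```

Area still has to be turned into volume (and volume into grams) with a per-food heuristic, which is exactly where GPT‑4o's prior knowledge of typical food shapes earns its keep.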

What are you building with multimodal AI? Share your project or ask questions in the comments!
