# Beyond Just a Photo: Building a Pixel-Perfect Calorie Estimator with SAM and GPT-4o
Source: Dev.to
## Introduction
We’ve all been there: staring at a delicious plate of pasta, trying to manually log every gram into a fitness app. It’s tedious, prone to “optimistic” human error, and frankly, ruins the meal. What if we could turn those pixels directly into nutritional data?
In this tutorial we build a Multimodal Dietary Analysis Engine by combining Meta’s Segment Anything Model (SAM) with the reasoning power of GPT‑4o. The system isolates food items, uses reference‑based scaling to estimate volume, and outputs a detailed nutritional breakdown.
## Architecture Overview

```mermaid
graph TD
    A[User Uploads Image] --> B[OpenCV Preprocessing]
    B --> C[SAM: Segment Anything Model]
    C --> D{Mask Generation}
    D -->|Isolate Food| E[GPT-4o Multimodal Analysis]
    D -->|Reference Object| E
    E --> F[Nutritional Estimation Engine]
    F --> G[FastAPI Response: Calories, Macros, Confidence Score]
```
## Required Stack
- PyTorch – for running SAM weights.
- Segment Anything (SAM) – Meta’s pre‑trained vision model.
- GPT‑4o API – the multimodal “brain.”
- FastAPI – to expose a production‑ready microservice.
- OpenCV – for image manipulation.
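A minimal environment setup for this stack might look like the following (a sketch: exact package versions are up to you, and `python-multipart` is assumed because FastAPI needs it for file uploads; the checkpoint URL is the one published in Meta's segment-anything repository):

```shell
# Core dependencies (pin versions for production)
pip install torch torchvision opencv-python fastapi uvicorn openai python-multipart

# SAM is installed directly from Meta's repository
pip install git+https://github.com/facebookresearch/segment-anything.git

# Download the ViT-H checkpoint (~2.4 GB)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```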
## Food Segmentation with SAM

```python
import torch
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the SAM model (ViT-H checkpoint, ~2.4 GB)
sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"
device = "cuda" if torch.cuda.is_available() else "cpu"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam)

def get_food_segment(image_path):
    # OpenCV loads BGR; SAM expects RGB
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # Simple prompt: a single foreground point at the center of the image
    input_point = np.array([[image.shape[1] // 2, image.shape[0] // 2]])
    input_label = np.array([1])  # 1 = foreground

    masks, scores, logits = predictor.predict(
        point_coords=input_point,
        point_labels=input_label,
        multimask_output=True,
    )
    # With multimask_output=True the masks are not sorted by confidence,
    # so pick the candidate with the highest predicted score
    return masks[np.argmax(scores)]
```
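Since `multimask_output=True` returns three candidate masks at different granularities, it also helps to sanity-check how much of the frame each mask covers: a mask spanning 90% of the image is probably the table, not the food. A small helper along those lines (a sketch, assuming masks arrive as boolean NumPy arrays; `pick_food_mask` and the 0.6 threshold are illustrative, not part of the SAM API):

```python
import numpy as np

def pick_food_mask(masks, scores, max_coverage=0.6):
    """Pick the highest-scoring mask that covers a plausible
    fraction of the frame (very large masks are likely background)."""
    order = np.argsort(scores)[::-1]      # best score first
    for idx in order:
        coverage = masks[idx].mean()      # fraction of pixels inside the mask
        if coverage <= max_coverage:
            return masks[idx], float(scores[idx])
    # Fall back to the top-scoring mask if every candidate is too large
    best = order[0]
    return masks[best], float(scores[best])

# Synthetic example: two 4x4 candidate masks
masks = np.array([
    np.ones((4, 4), dtype=bool),          # covers 100% -> rejected
    np.pad(np.ones((2, 2), bool), 1),     # covers 25%  -> accepted
])
mask, score = pick_food_mask(masks, np.array([0.9, 0.8]))
```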
## Nutritional Analysis with GPT‑4o

```python
import base64
from openai import OpenAI

client = OpenAI()

def analyze_nutrition(image_path, mask_data):
    # Encode the image as base64 for the multimodal API.
    # mask_data can be used to crop or highlight the segmented region
    # before encoding; here we send the full frame for context.
    with open(image_path, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a professional nutritionist. Analyze the food in "
                    "the segmented area. Use surrounding objects (forks, plates) "
                    "to estimate volume. Respond in JSON."
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Estimate the calories and macronutrients for the food highlighted in this image."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
                ],
            },
        ],
        # json_object mode requires the word "JSON" to appear in the prompt
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content
```
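The `mask_data` argument is a natural place to make the segmentation visible to the model. One simple approach (a sketch, not part of the original code; `highlight_mask` and the dim factor are illustrative) is to dim everything outside the mask before encoding, so GPT-4o cannot confuse the target dish with neighboring plates:

```python
import numpy as np

def highlight_mask(image_bgr, mask, dim_factor=0.25):
    """Darken pixels outside the boolean mask so the segmented
    food region visually stands out in the image sent to GPT-4o."""
    out = image_bgr.astype(np.float32)
    out[~mask] *= dim_factor              # dim the background
    return out.astype(np.uint8)

# Synthetic 4x4 image: uniform gray, mask covers the left half
img = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
highlighted = highlight_mask(img, mask)
# Write the result with cv2.imwrite(...) and base64-encode that file instead
```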
## FastAPI Endpoint

```python
import os
import shutil
from fastapi import FastAPI, UploadFile, File

app = FastAPI()

@app.post("/analyze-meal")
async def analyze_meal(file: UploadFile = File(...)):
    # 1. Save the uploaded file temporarily
    temp_path = f"temp_{file.filename}"
    with open(temp_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    try:
        # 2. Run SAM segmentation
        mask = get_food_segment(temp_path)
        # 3. Call GPT-4o for nutritional analysis
        nutrition_data = analyze_nutrition(temp_path, mask)
    finally:
        os.remove(temp_path)  # Clean up the temp file even on failure
    return {"status": "success", "data": nutrition_data}
```
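With the service running under Uvicorn, the endpoint can be exercised with a multipart upload. A quick sketch (the module name `main.py` and the file `pasta.jpg` are placeholders for your own):

```shell
uvicorn main:app --reload --port 8000

# In another terminal: upload a meal photo
curl -X POST http://localhost:8000/analyze-meal \
  -F "file=@pasta.jpg"
```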
## Production Considerations
While the code works for a hobby project, production‑grade health apps need:
- Robust error handling (e.g., low‑light images, overlapping foods).
- Pydantic models for request/response validation.
- Real‑time feedback loops for user corrections.
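For the Pydantic point above, a response schema might look like the following (field names are illustrative, not from the original article). Parsing GPT-4o's JSON string through a model turns silent format drift into a loud validation error:

```python
from pydantic import BaseModel, Field

class MacroBreakdown(BaseModel):
    protein_g: float = Field(ge=0)
    carbs_g: float = Field(ge=0)
    fat_g: float = Field(ge=0)

class NutritionEstimate(BaseModel):
    food_name: str
    calories: float = Field(ge=0)
    macros: MacroBreakdown
    confidence: float = Field(ge=0, le=1)

# Validate a raw GPT-4o JSON payload before returning it to the client
raw = ('{"food_name": "spaghetti bolognese", "calories": 620, '
       '"macros": {"protein_g": 28, "carbs_g": 75, "fat_g": 22}, '
       '"confidence": 0.7}')
estimate = NutritionEstimate.model_validate_json(raw)
```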
For deeper architectural patterns and AI observability in health tech, see the WellAlly Tech Blog (a great resource for production‑ready AI health solutions).
## Next Steps
- Add a Reference Object Detection step (e.g., YOLOv8) to improve scaling accuracy.
- Implement a feedback loop where users can confirm or adjust the estimated portion size.
What are you building with multimodal AI? Share your project or ask questions in the comments!