Building a Multimodal Food Analysis System on Qubrid AI

Published: February 12, 2026, 03:37 AM EST
4 min read
Source: Dev.to

NutriVision AI

NutriVision AI is an example application from the Qubrid AI Cookbook that demonstrates how to build a multimodal vision‑language nutrition analyzer from the ground up. It uses a multimodal model to provide comprehensive nutritional insights from a food image, then lets users query those insights conversationally.

NutriVision demo

This app is more than a playful tool: it is a reference implementation showing how to integrate real multimodal inference into a practical interface, with structured outputs you can extend and build upon.


Why NutriVision Matters

A lot of nutrition and diet‑tracking applications still rely on manually entered text. NutriVision removes that friction by letting users take or upload a photo and receive a meaningful, structured analysis automatically.

Behind the scenes, a multimodal model analyzes the image and generates a clean representation of calories, macronutrients, health score, dish name, and more. That structured data is then used for both display and grounded follow‑up conversation.

This pattern—strict structured inference + grounded chat—is powerful and generalizable beyond nutrition. It shows how vision + language models can be applied to everyday tasks.


Overview

NutriVision supports two core capabilities:

  • Image‑based nutritional analysis using a multimodal model
  • Context‑aware follow‑up conversation grounded in structured nutrition data

The system enforces strict JSON output during analysis and uses streaming for conversational interaction.
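Strict JSON output is typically enforced in the prompt itself: the model is told to respond with nothing but a JSON object matching an explicit schema. The sketch below is illustrative only, not the app's actual `DETAILED_NUTRITION_PROMPT` (whose exact wording lives in the repository):

```python
# Illustrative strict-JSON analysis prompt (NOT the app's actual
# DETAILED_NUTRITION_PROMPT). An explicit schema plus a low temperature
# keeps the model's output machine-parseable.
NUTRITION_PROMPT = (
    "Analyze the food in this image. Respond with ONLY a JSON object, "
    "no prose, matching this schema: "
    '{"dish_name": str, "calories": int, '
    '"macronutrients": {"protein_g": float, "carbs_g": float, "fat_g": float}, '
    '"health_score": int}'
)
```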


Prerequisites

Before running the application, ensure you have:

  • Python 3.9 or higher
  • pip installed
  • Your API key from the Qubrid dashboard (required to access the models)

Clone the Repository

git clone https://github.com/QubridAI-Inc/qubrid-cookbook.git
cd qubrid-cookbook/Multimodal/nutri_vision_app

Create a Virtual Environment

python -m venv venv
# macOS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activate

Install Dependencies

pip install -r requirements.txt

Configure Environment Variables

Set your Qubrid API key so the app can authenticate inference requests.

macOS / Linux

export QUBRID_API_KEY="your_api_key_here"

Windows

setx QUBRID_API_KEY "your_api_key_here"

Note that `setx` persists the variable for future sessions but does not affect the current terminal; open a new terminal (or additionally run `set QUBRID_API_KEY=your_api_key_here` for the current session) before launching the app.

Run the Application

streamlit run app.py

The application will launch locally in your browser.


Multimodal API Integration

NutriVision integrates Qubrid’s multimodal endpoint for image‑based nutrition analysis.

Image Analysis Call (Non‑Streaming)

import os
import requests

QUBRID_API_KEY = os.getenv("QUBRID_API_KEY")
BASE_URL = "https://platform.qubrid.com/v1/chat/completions"

def call_qubrid_api(messages):
    payload = {
        "model": "your-multimodal-model-name",
        "messages": messages,
        "temperature": 0.2
    }

    headers = {
        "Authorization": f"Bearer {QUBRID_API_KEY}",
        "Content-Type": "application/json"
    }

    response = requests.post(BASE_URL, json=payload, headers=headers)
    response.raise_for_status()

    return response.json()["choices"][0]["message"]["content"]

Inside app.py, the request is constructed as:

messages = [{
    "role": "user",
    "content": DETAILED_NUTRITION_PROMPT,
    "image": st.session_state.image_base64
}]

response_text = call_qubrid_api(messages)

The call returns structured JSON containing dish name, calories, macronutrients, and health score.
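Because the response is plain text, the app needs a parsing step before it can display those fields. A minimal sketch of such a parsing layer, assuming Python 3.9+ (the field names here are illustrative, not necessarily the app's exact schema):

```python
import json

# Illustrative field names; adjust to match the schema your prompt requests.
REQUIRED_FIELDS = {"dish_name", "calories", "macronutrients", "health_score"}

def parse_nutrition_response(response_text: str) -> dict:
    """Parse the model's JSON output and verify the expected fields exist."""
    # Models sometimes wrap JSON in markdown fences; strip them defensively.
    cleaned = response_text.strip().removeprefix("```json").removesuffix("```").strip()
    data = json.loads(cleaned)  # raises json.JSONDecodeError on malformed output
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Missing expected fields: {missing}")
    return data
```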

Streaming Chat Integration

After analysis, the structured nutrition data is injected into the system prompt and streamed for conversational reasoning.

Recommended Model: Qwen3-VL-30B – a high‑capacity vision‑language model optimized for advanced image understanding, structured extraction, OCR, and multimodal reasoning tasks.

import json

def call_qubrid_api_stream(messages):
    payload = {
        "model": "your-chat-model-name",
        "messages": messages,
        "temperature": 0.4,
        "stream": True
    }

    headers = {
        "Authorization": f"Bearer {QUBRID_API_KEY}",
        "Content-Type": "application/json"
    }

    with requests.post(BASE_URL, json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                decoded = line.decode("utf-8")
                if decoded.startswith("data: "):
                    chunk = decoded[len("data: "):]
                    if chunk != "[DONE]":
                        # Parse each SSE chunk as JSON; never eval() model output.
                        yield json.loads(chunk)["choices"][0]["delta"].get("content", "")

Used in the chat layer:

full_response = ""
for chunk in call_qubrid_api_stream(api_messages):
    full_response += chunk

This enables real‑time, token‑by‑token streaming of the assistant’s replies, grounded in the previously extracted nutrition data.
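The grounding itself can be as simple as serializing the parsed nutrition data into the system message before each chat turn. A hedged sketch of that step (prompt wording and function names here are illustrative, not the app's exact code):

```python
import json

def build_chat_messages(nutrition_data: dict, history: list, user_question: str) -> list:
    """Inject the structured nutrition analysis into the system prompt so
    chat answers stay grounded in the data extracted from the image."""
    system_prompt = (
        "You are a nutrition assistant. Answer only using this analysis:\n"
        + json.dumps(nutrition_data, indent=2)
    )
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": user_question}]
    )
```

Keeping the analysis in the system message (rather than re-sending the image) makes each follow-up turn cheap while still anchoring answers to the original extraction.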



Design Approach

NutriVision follows a deterministic inference pipeline:

  • Structured constrained generation for reliable JSON output
  • Dedicated parsing layer for validation
  • Context injection to reduce hallucination
  • Streaming for conversational UX

The model performs multimodal reasoning, while the application layer ensures reliability and usability.
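One common way the application layer adds that reliability is a retry loop around the analysis call: if the model's reply fails to parse as JSON, simply ask again. A minimal sketch under that assumption (the retry helper is illustrative; `call_fn` stands in for a function like `call_qubrid_api`):

```python
import json

def analyze_with_retry(call_fn, messages, max_attempts=3):
    """Call the model until its reply parses as JSON, up to max_attempts.

    call_fn is the inference function (e.g. call_qubrid_api); injecting it
    as a parameter keeps this helper easy to test.
    """
    for _ in range(max_attempts):
        text = call_fn(messages)
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed output; retry the request
    raise RuntimeError(f"No valid JSON after {max_attempts} attempts")
```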


Real‑World Applications

Although NutriVision focuses on nutrition, the general pattern it implements—vision input + structured generation + context‑aware chat—can be applied to many domains:

  • Health and fitness tracking tools
  • Diet coaching assistants
  • Industrial quality inspection
  • Medical image interpretation
  • Educational visual assistants

The Qubrid Cookbook contains other multimodal examples that apply this same pattern to different use cases.


Where to Learn More

This app is part of a broader set of cookbooks provided by Qubrid AI, offering examples ranging from OCR agents to reasoning chatbots.

  • 👉 Explore the full source code and related projects in our cookbooks.
  • 👉 Watch implementation tutorials and walkthroughs on YouTube for step‑by‑step demos and model integrations.

Thanks for reading!


If you found this helpful, feel free to like the post 👍, star ⭐ the repository, try the app, and experiment with your own multimodal builds using Qubrid AI. We’d love to see what you create!
