Privacy First: Chat with Your Medical Reports Locally using Llama-3 and MLX on Mac 🍎
Introduction
Your health data is probably the most sensitive information you own. Yet, in the age of AI, most people blindly upload their blood work and MRI results to cloud‑based LLMs just to get a summary. Stop right there! 🛑
In this tutorial, we will build a local RAG (Retrieval‑Augmented Generation) system. We will leverage Apple Silicon’s unified memory, the high‑performance MLX framework, and Llama‑3 to create a private medical assistant that never leaks a single byte to the internet. With a local RAG pipeline and an MLX‑optimized Llama‑3, you can run semantic search and data extraction over your medical PDFs while keeping everything strictly on‑device.
The Architecture: Why MLX?
Traditional RAG stacks often rely on heavy Docker containers or cloud APIs. However, if you are on a Mac (M1/M2/M3), the MLX framework (developed by Apple Machine Learning Research) allows you to run Llama‑3 with incredible efficiency by utilizing the GPU and unified memory architecture.
Here is how the data flows from your dusty PDF report to a meaningful conversation:
```mermaid
graph TD
    A[Medical PDF Report] -->|PyMuPDF| B(Text Extraction & Cleaning)
    B --> C{Chunking Strategy}
    C -->|Sentence Splitting| D[ChromaDB Vector Store]
    E[User Query: 'Is my cholesterol high?'] -->|MLX Embedding| F(Vector Search)
    D -->|Retrieve Relevant Context| G[Prompt Augmentation]
    G -->|Context + Query| H[Llama-3-8B via MLX]
    H --> I[Private Local Answer]
    style H fill:#f96,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px
```
Prerequisites
Before we dive into the code, ensure you have an Apple Silicon Mac and the following stack installed:
- Llama‑3‑8B – 4‑bit quantized version for speed.
- MLX – Apple’s native array framework.
- ChromaDB – Lightweight vector database.
- PyMuPDF (fitz) – High‑accuracy PDF parsing.
```bash
pip install mlx-lm chromadb pymupdf sentence-transformers
```
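Optionally, you can run a quick sanity check to confirm that MLX is installed correctly and defaults to the Apple Silicon GPU before loading any models (a minimal sketch; the exact device string may vary by MLX version):

```python
import mlx.core as mx

# On Apple Silicon this should report the GPU as the default device,
# e.g. Device(gpu, 0). If it falls back to the CPU, check your install.
print(mx.default_device())
```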
Step 1: Parsing Sensitive PDFs with PyMuPDF
Medical reports are notoriously messy—tables, signatures, and odd formatting. We use PyMuPDF for its speed and reliability in extracting clean text.
```python
import fitz  # PyMuPDF

def extract_medical_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text("text") + "\n"
    # Simple cleaning: remove extra whitespaces
    clean_text = " ".join(text.split())
    return clean_text

# Usage
raw_data = extract_medical_text("my_blood_report_2024.pdf")
print(f"Extracted {len(raw_data)} characters.")
```
Step 2: Vector Embeddings and Local Storage
To find relevant information (e.g., “What was my Glucose level?”), we convert text into vectors and store them in ChromaDB.
💡 Pro‑Tip: For more production‑ready examples and advanced RAG patterns, check out the detailed guides on the WellAlly Tech Blog, where we dive deep into optimizing local inference.
```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize local ChromaDB
client = chromadb.PersistentClient(path="./medical_db")

# Use a local embedding model
emb_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.get_or_create_collection(
    name="medical_reports",
    embedding_function=emb_fn,
)

def add_to_vector_store(text, metadata):
    # Chunking text into 500-character pieces
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
    ids = [f"id_{i}" for i in range(len(chunks))]
    collection.add(
        documents=chunks,
        ids=ids,
        metadatas=[metadata] * len(chunks)
    )

add_to_vector_store(raw_data, {"source": "annual_checkup_2024"})
```
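The fixed 500‑character split above is the simplest thing that works, but it can cut a lab value in half mid‑sentence. The architecture diagram mentions sentence splitting; here is a rough sketch of a sentence‑aware chunker with a small overlap (the `max_chars` and `overlap_sentences` values are arbitrary assumptions, adjust them for your reports):

```python
import re

def sentence_chunks(text, max_chars=500, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current)) + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) over so context isn't lost at boundaries
            current = current[-overlap_sentences:]
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

You can then pass `sentence_chunks(raw_data)` to `collection.add()` instead of the raw 500‑character slices.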
Step 3: Local Inference with Llama‑3 & MLX
Now for the magic. We use mlx‑lm to load a quantized Llama‑3‑8B. This allows the model to run comfortably even on a MacBook Air with 16 GB of RAM. 🚀
```python
from mlx_lm import load, generate

# Load the model and tokenizer
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

def query_private_ai(user_question):
    # 1. Retrieve context from ChromaDB
    results = collection.query(query_texts=[user_question], n_results=3)
    context = "\n".join(results["documents"][0])

    # 2. Construct the prompt
    prompt = f"""
You are a private medical assistant. Use the provided medical report context to answer the user's question.
If you don't know the answer based on the context, say so.

Context: {context}
---
Question: {user_question}
Answer:
"""

    # 3. Generate response using MLX
    response = generate(
        model,
        tokenizer,
        prompt=prompt,
        verbose=False,
        max_tokens=500,
    )
    return response

# Example Query
print(query_private_ai("What are the key concerns in my blood report?"))
```
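To actually chat with your report rather than fire one‑off queries, a tiny REPL loop on top of `query_private_ai` is enough. This is a minimal sketch with no conversation memory; each question is answered independently against the vector store:

```python
if __name__ == "__main__":
    print("Ask about your report (type 'quit' to exit).")
    while True:
        question = input("You: ").strip()
        if question.lower() in {"quit", "exit"}:
            break
        print(f"Assistant: {query_private_ai(question)}\n")
```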
Taking it Further: The “Official” Way
While this script gets you started, building a production‑grade medical AI requires handling multimodal data (e.g., X‑rays) and ensuring rigorous HIPAA‑like compliance even on local edge devices.
The team at WellAlly has been pioneering “Privacy‑First AI” architectures. If you’re interested in scaling this to multiple users or integrating it into a secure healthcare workflow, reach out to us or explore our deeper technical posts.
I highly recommend reading their latest deep‑dives on the WellAlly Blog. They cover how to fine‑tune Llama‑3 specifically for clinical terminology, which significantly reduces hallucinations.
Conclusion 🥑
You just built a private, high‑performance medical RAG system! By combining Llama‑3, MLX, and ChromaDB, you’ve achieved:
- Zero Data Leakage – Your health data never leaves your Mac.
- High Performance – MLX makes local LLMs feel snappy.
- Intelligence – Llama‑3 provides reasoning that simple keyword searches can’t match.
What’s next? 🛠️
- Try implementing a Table Parser for more accurate lab‑result extraction (a rough sketch follows below).
- Add a Streamlit UI to make it look like a real app.
- Let me know in the comments: What’s your biggest concern with Cloud AI?
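For the table‑parser idea above, recent PyMuPDF releases (1.23+) ship a built‑in `find_tables()` helper. A rough sketch of pulling lab‑result rows out of a report might look like this (the example output in the comment is purely illustrative):

```python
import fitz  # PyMuPDF >= 1.23 for find_tables()

def extract_lab_tables(pdf_path):
    """Return every detected table as a list of rows (lists of cell strings)."""
    doc = fitz.open(pdf_path)
    tables = []
    for page in doc:
        for table in page.find_tables().tables:
            tables.append(table.extract())
    doc.close()
    return tables

for rows in extract_lab_tables("my_blood_report_2024.pdf"):
    for row in rows:
        print(row)  # e.g. ['Glucose', '92', 'mg/dL', '70 - 99']
```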
Stay private, stay healthy! 💻🛡️