Privacy First: Chat with Your Medical Reports Locally using Llama-3 and MLX on Mac 🍎
Introduction
Your health data is probably the most sensitive information you own. Yet, in the age of AI, most people blindly upload their blood work and MRI results to cloud‑based LLMs just to get a summary. Stop right there! 🛑
In this tutorial, we will build a local RAG (Retrieval‑Augmented Generation) system. We will leverage Apple Silicon’s unified memory, the high‑performance MLX framework, and Llama‑3 to create a private medical assistant that never leaks a single byte to the internet. With a local RAG pipeline and an MLX‑optimized Llama‑3, you can run semantic search and data extraction over your medical PDFs while keeping everything strictly on‑device.
The Architecture: Why MLX?
Traditional RAG stacks often rely on heavy Docker containers or cloud APIs. However, if you are on a Mac (M1/M2/M3), the MLX framework (developed by Apple Machine Learning Research) allows you to run Llama‑3 with incredible efficiency by utilizing the GPU and unified memory architecture.
Here is how the data flows from your dusty PDF report to a meaningful conversation:
```mermaid
graph TD
    A[Medical PDF Report] -->|PyMuPDF| B(Text Extraction & Cleaning)
    B --> C{Chunking Strategy}
    C -->|Sentence Splitting| D[ChromaDB Vector Store]
    E[User Query: 'Is my cholesterol high?'] -->|MLX Embedding| F(Vector Search)
    D -->|Retrieve Relevant Context| G[Prompt Augmentation]
    G -->|Context + Query| H[Llama-3-8B via MLX]
    H --> I[Private Local Answer]
    style H fill:#f96,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px
```
Prerequisites
Before we dive into the code, ensure you have an Apple Silicon Mac and the following stack installed:
- Llama‑3‑8B – 4‑bit quantized version for speed.
- MLX – Apple’s native array framework.
- ChromaDB – Lightweight vector database.
- PyMuPDF (fitz) – High‑accuracy PDF parsing.
```bash
pip install mlx-lm chromadb pymupdf sentence-transformers
```
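Optionally, you can run a quick sanity check to confirm that MLX is installed correctly and defaults to the Apple Silicon GPU before loading any models (a minimal sketch; the exact device string may vary by MLX version):

```python
import mlx.core as mx

# On Apple Silicon this should report the GPU as the default device,
# e.g. Device(gpu, 0). If it falls back to the CPU, check your install.
print(mx.default_device())
```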
Step 1: Parsing Sensitive PDFs with PyMuPDF
Medical reports are notoriously messy—tables, signatures, and odd formatting. We use PyMuPDF for its speed and reliability in extracting clean text.
```python
import fitz  # PyMuPDF

def extract_medical_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text("text") + "\n"
    # Simple cleaning: remove extra whitespaces
    clean_text = " ".join(text.split())
    return clean_text

# Usage
raw_data = extract_medical_text("my_blood_report_2024.pdf")
print(f"Extracted {len(raw_data)} characters.")
```
Step 2: Vector Embeddings and Local Storage
To find relevant information (e.g., “What was my Glucose level?”), we convert text into vectors and store them in ChromaDB.
💡 Pro‑Tip: For more production‑ready examples and advanced RAG patterns, check out the detailed guides on the WellAlly Tech Blog, where we dive deep into optimizing local inference.
```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize local ChromaDB
client = chromadb.PersistentClient(path="./medical_db")

# Use a local embedding model
emb_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.get_or_create_collection(
    name="medical_reports",
    embedding_function=emb_fn,
)

def add_to_vector_store(text, metadata):
    # Chunking text into 500-character pieces
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
    ids = [f"id_{i}" for i in range(len(chunks))]
    collection.add(
        documents=chunks,
        ids=ids,
        metadatas=[metadata] * len(chunks)
    )

add_to_vector_store(raw_data, {"source": "annual_checkup_2024"})
```
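The fixed 500‑character split above is the simplest thing that works, but it can cut a lab value in half mid‑sentence. The architecture diagram mentions sentence splitting; here is a rough sketch of a sentence‑aware chunker with a small overlap (the `max_chars` and `overlap_sentences` values are arbitrary assumptions, adjust them for your reports):

```python
import re

def sentence_chunks(text, max_chars=500, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current)) + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) over so context isn't lost at boundaries
            current = current[-overlap_sentences:]
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

You can then pass `sentence_chunks(raw_data)` to `collection.add()` instead of the raw 500‑character slices.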
Step 3: Local Inference with Llama‑3 & MLX
Now for the magic. We use mlx‑lm to load a quantized Llama‑3‑8B. This allows the model to run comfortably even on a MacBook Air with 16 GB of RAM. 🚀
```python
from mlx_lm import load, generate

# Load the model and tokenizer
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

def query_private_ai(user_question):
    # 1. Retrieve context from ChromaDB
    results = collection.query(query_texts=[user_question], n_results=3)
    context = "\n".join(results["documents"][0])

    # 2. Construct the prompt
    prompt = f"""
You are a private medical assistant. Use the provided medical report context to answer the user's question.
If you don't know the answer based on the context, say so.

Context: {context}
---
Question: {user_question}
Answer:
"""

    # 3. Generate response using MLX
    response = generate(
        model,
        tokenizer,
        prompt=prompt,
        verbose=False,
        max_tokens=500,
    )
    return response

# Example Query
print(query_private_ai("What are the key concerns in my blood report?"))
```
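To actually chat with your report rather than fire one‑off queries, a tiny REPL loop on top of `query_private_ai` is enough. This is a minimal sketch with no conversation memory; each question is answered independently against the vector store:

```python
if __name__ == "__main__":
    print("Ask about your report (type 'quit' to exit).")
    while True:
        question = input("You: ").strip()
        if question.lower() in {"quit", "exit"}:
            break
        print(f"Assistant: {query_private_ai(question)}\n")
```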
Taking it Further: The “Official” Way
While this script gets you started, building a production‑grade medical AI requires handling multimodal data (e.g., X‑rays) and ensuring rigorous HIPAA‑like compliance even on local edge devices.
The team at WellAlly has been pioneering “Privacy‑First AI” architectures. If you’re interested in scaling this to multiple users or integrating it into a secure healthcare workflow, reach out to us or explore our deeper technical posts.
I highly recommend reading their latest deep‑dives on the WellAlly Blog. They cover how to fine‑tune Llama‑3 specifically for clinical terminology, which significantly reduces hallucinations.
Conclusion 🥑
You just built a private, high‑performance medical RAG system! By combining Llama‑3, MLX, and ChromaDB, you’ve achieved:
- Zero Data Leakage – Your health data never leaves your Mac.
- High Performance – MLX makes local LLMs feel snappy.
- Intelligence – Llama‑3 provides reasoning that simple keyword searches can’t match.
What’s next? 🛠️
- Try implementing a Table Parser for more accurate lab‑result extraction (a rough sketch follows below).
- Add a Streamlit UI to make it look like a real app.
- Let me know in the comments: What’s your biggest concern with Cloud AI?
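For the table‑parser idea above, recent PyMuPDF releases (1.23+) ship a built‑in `find_tables()` helper. A rough sketch of pulling lab‑result rows out of a report might look like this (the example output in the comment is purely illustrative):

```python
import fitz  # PyMuPDF >= 1.23 for find_tables()

def extract_lab_tables(pdf_path):
    """Return every detected table as a list of rows (lists of cell strings)."""
    doc = fitz.open(pdf_path)
    tables = []
    for page in doc:
        for table in page.find_tables().tables:
            tables.append(table.extract())
    doc.close()
    return tables

for rows in extract_lab_tables("my_blood_report_2024.pdf"):
    for row in rows:
        print(row)  # e.g. ['Glucose', '92', 'mg/dL', '70 - 99']
```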
Stay private, stay healthy! 💻🛡️