Biohack Your Health: Building a Science-Backed Supplement Advisor with Qdrant & PubMed 🧪
Source: Dev.to
If you’ve ever spent hours scrolling through Reddit or fitness forums trying to figure out if NMN or Creatine actually works, you know the struggle. There is a massive gap between bro‑science and peer‑reviewed clinical data. In the world of Biohacking, information is power, but only if it’s accurate.
Today, we are building a production‑grade RAG architecture (Retrieval‑Augmented Generation) to bridge that gap. We will use a Vector Database to store high‑fidelity embeddings from PubMed, allowing us to perform Semantic Search across thousands of medical abstracts. By the end of this guide, you’ll have a local knowledge base that answers your supplement questions with real scientific citations. 🚀
The Architecture 🏗️
To build a reliable biohacking tool, we need a pipeline that handles data ingestion, embedding, and retrieval. Here is how the data flows from a PubMed research paper to your terminal:
graph TD
A[PubMed Search Query] --> B[BeautifulSoup Scraper]
B --> C[Text Chunking - LangChain]
C --> D[Sentence Transformers - Embeddings]
D --> E[(Qdrant Vector DB)]
F[User Question] --> G[Query Embedding]
G --> H{Similarity Search}
E --> H
H --> I[Context + Prompt]
I --> J[LLM Response with Citations]
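Before diving into the real libraries, the flow above can be sketched as a thin orchestration function. The stage functions here are placeholders of my own (not part of the article's stack) wired in the order the diagram shows:

```python
from typing import Callable, List

def build_pipeline(
    scrape: Callable[[str], List[str]],
    chunk: Callable[[List[str]], List[str]],
    embed: Callable[[List[str]], list],
    store: Callable[[list], None],
) -> Callable[[str], None]:
    """Compose the ingestion stages from the diagram into one callable."""
    def ingest(search_query: str) -> None:
        abstracts = scrape(search_query)   # PubMed -> raw text
        chunks = chunk(abstracts)          # text splitting
        vectors = embed(chunks)            # Sentence Transformers (here: fake vectors)
        store(vectors)                     # Qdrant upsert (here: a plain list)
    return ingest

# Toy wiring to show the shape of the data flow
stored = []
ingest = build_pipeline(
    scrape=lambda q: [f"abstract about {q}"],
    chunk=lambda docs: [c for d in docs for c in d.split(". ")],
    embed=lambda chunks: [(c, [0.0] * 3) for c in chunks],  # fake 3-dim vectors
    store=stored.extend,
)
ingest("NMN")
print(len(stored))  # → 1
```

Each real component in the steps below slots into one of these four roles.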
Prerequisites 🛠️
Make sure you have the following in your tech stack:
- Python 3.9+
- Qdrant – high‑performance vector database
- Sentence Transformers – for generating local embeddings
- LangChain – the glue for our RAG pipeline
- BeautifulSoup – for parsing PubMed’s HTML
pip install qdrant-client sentence-transformers beautifulsoup4 langchain langchain-community
Step 1: Scraping PubMed Research 📄
PubMed is the gold standard for medical research. While they have an API (Entrez), sometimes we need to scrape specific metadata or handle dynamic queries. Here’s a robust snippet to get us started.
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (biohacking-rag-tutorial)"}

def fetch_pubmed_abstracts(query: str, max_results: int = 10):
    # Let requests URL-encode the query (spaces, special characters)
    response = requests.get(
        "https://pubmed.ncbi.nlm.nih.gov/",
        params={"term": query},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Grab article links (limit to max_results)
    links = [
        f"https://pubmed.ncbi.nlm.nih.gov{a['href']}"
        for a in soup.select(".docsum-title", limit=max_results)
    ]

    abstracts = []
    for link in links:
        page = requests.get(link, headers=HEADERS, timeout=30)
        page_soup = BeautifulSoup(page.text, "html.parser")
        abstract_div = page_soup.find("div", id="eng-abstract")
        if abstract_div:
            abstracts.append({
                "source": link,
                "content": abstract_div.get_text().strip(),
            })
    return abstracts

# Example: Fetching data for NMN
data = fetch_pubmed_abstracts("NMN supplement longevity", max_results=5)
print(f"Fetched {len(data)} abstracts!")
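The Entrez API mentioned above is usually the more stable route than HTML scraping. Here is a minimal sketch against the NCBI E-utilities endpoints (`esearch` to look up PMIDs, `efetch` to pull abstracts); the endpoints and parameters are the documented ones, but the helper names are mine:

```python
import requests
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_eutils_url(endpoint: str, **params) -> str:
    """Assemble an E-utilities URL (kept pure so it's easy to test)."""
    return f"{EUTILS}/{endpoint}.fcgi?{urlencode(sorted(params.items()))}"

def esearch_pmids(query: str, retmax: int = 10) -> list:
    """Look up PubMed IDs for a query via the esearch endpoint."""
    url = build_eutils_url("esearch", db="pubmed", term=query,
                           retmax=retmax, retmode="json")
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

def efetch_abstracts(pmids: list) -> str:
    """Pull plain-text abstracts for a list of PMIDs via efetch."""
    url = build_eutils_url("efetch", db="pubmed", id=",".join(pmids),
                           rettype="abstract", retmode="text")
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

# Usage: abstracts = efetch_abstracts(esearch_pmids("NMN supplement longevity", retmax=5))
```

NCBI asks heavy users to register for an API key and rate-limit their requests, so the scraper above is best kept for one-off metadata that the API doesn't expose.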
Step 2: Vectorizing the Evidence with Qdrant 🧠
Storing raw text isn’t enough; we need to store the meaning of the text. This is where Qdrant shines. We’ll use Sentence Transformers to turn our abstracts into 384‑dimensional vectors.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.schema import Document

# Initialise a local, file-backed instance (or QdrantClient(":memory:") for testing)
client = QdrantClient(path="./qdrant_db")

# Create a collection for our supplements
# (all-MiniLM-L6-v2 produces 384-dimensional vectors)
client.recreate_collection(
    collection_name="biohacking_science",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Initialise embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Prepare documents for LangChain, keeping the PubMed URL as citation metadata
docs = [
    Document(page_content=item["content"], metadata={"source": item["source"]})
    for item in data
]

# Upload to Qdrant
vectorstore = Qdrant(
    client=client,
    collection_name="biohacking_science",
    embeddings=embeddings,
)
vectorstore.add_documents(docs)

print("Vector database is ready! 🥑")
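The architecture diagram lists a text-chunking step. PubMed abstracts are short enough to embed whole, but if you later ingest full-text papers you'll want to split them first. Here is a dependency-free sliding-window sketch of the idea; LangChain's `RecursiveCharacterTextSplitter` does the same job more carefully (splitting on paragraph and sentence boundaries), and the sizes below are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into overlapping character windows so context
    isn't lost at chunk boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

pieces = chunk_text("a" * 1200, chunk_size=500, overlap=50)
print(len(pieces))  # → 3
```

The overlap matters: a sentence cut in half at a chunk boundary would otherwise be unretrievable from either side.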
Step 3: The RAG Implementation 🤖
Now we can query our database. Instead of a keyword search, we perform a semantic search. If you ask about “muscle recovery,” the system will find papers on “Creatine monohydrate” even if the word recovery isn’t in the title.
from langchain.chains import RetrievalQA
from langchain_community.llms import OpenAI  # or swap in a local model like Llama 3

# Set up the retriever (top 3 most similar abstracts)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Example query
query = "What are the benefits of NMN for mitochondrial health?"
found_docs = retriever.get_relevant_documents(query)

for i, doc in enumerate(found_docs, start=1):
    print(f"Source {i}: {doc.metadata['source']}")
    print(f"Snippet: {doc.page_content[:200]}...\n")

# Wire the retriever into a QA chain so the LLM answers with citations
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    retriever=retriever,
    return_source_documents=True,
)
result = qa.invoke({"query": query})
print(result["result"])
Going Beyond the Basics 🚀
While this script is a great start, production‑ready biohacking tools require more advanced patterns—like hybrid search (combining keyword and vector search) and reranking to ensure the most clinically relevant papers appear first.
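To make the reranking idea concrete, here is a toy sketch that blends a vector-similarity score with simple keyword overlap. The blend weight and scoring are illustrative only; production systems typically use BM25 for the keyword side and a cross-encoder for reranking:

```python
def hybrid_rerank(query_terms, candidates, alpha=0.5):
    """Blend vector similarity with keyword overlap.
    candidates: list of (text, vector_score) with vector_score in [0, 1]."""
    qs = {t.lower() for t in query_terms}

    def keyword_score(text: str) -> float:
        # Fraction of query terms that literally appear in the text
        words = {w.lower() for w in text.split()}
        return len(qs & words) / max(len(qs), 1)

    scored = [
        (alpha * vec + (1 - alpha) * keyword_score(text), text)
        for text, vec in candidates
    ]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [
    ("Creatine monohydrate improves muscle recovery", 0.70),
    ("NMN raises NAD+ levels", 0.75),
]
# Keyword overlap promotes the creatine paper despite its lower vector score
print(hybrid_rerank(["muscle", "recovery"], docs)[0])
```

This is the core trade-off of hybrid search: pure vector similarity can drift toward topically adjacent papers, while the keyword term anchors results to the user's literal question.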
💡 Developer Pro‑Tip: For more production‑ready examples and advanced patterns in AI‑driven healthcare data engineering, check out the engineering resources linked in the original article.
Deep Dives
Check out the WellAlly Blog for detailed articles on scaling these architectures for real‑world medical applications.
Conclusion
By moving away from static bookmarks and toward a Qdrant‑powered RAG system, you’ve turned a chaotic library of PDFs and URLs into a queryable, intelligent research assistant. Biohacking is fundamentally a data‑engineering challenge—the more clean, evidence‑based data you can retrieve, the better your decisions will be.
What’s Next?
- Add a confidence score based on vector distance.
- Integrate a cron job to auto‑update your PubMed database every week.
- Deploy as a FastAPI endpoint to your mobile health dashboard.
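For the first of these follow-ups, one simple mapping from Qdrant's cosine score (higher means closer, in [-1, 1]) onto a 0-1 confidence value might look like this; the threshold is illustrative, not a clinical standard:

```python
def confidence(score: float) -> float:
    """Map a cosine similarity score in [-1, 1] onto a 0-1 confidence value."""
    return max(0.0, min(1.0, (score + 1.0) / 2.0))

# Keep only hits we'd actually show to a user
hits = [("paper A", 0.92), ("paper B", 0.31)]
confident = [(title, confidence(s)) for title, s in hits if confidence(s) >= 0.9]
print(confident)  # only "paper A" clears the bar
```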
Happy hacking! Stay scientific. 🧬💻
Did you find this tutorial helpful? Drop a comment below with the supplement you’re researching next! 👇