Unstructured Text is the Final Boss: Parsing Doctor's Notes with LLMs 🔥
Source: Dev.to
Hey devs! 👋
Let's be honest. We all live in a bubble where we think data looks like this:
{
  "patient_id": 1024,
  "symptoms": ["headache", "nausea"],
  "severity": "moderate",
  "is_critical": false
}
It's beautiful. It's parsable. It's type-safe. 😍
But if you've ever worked in HealthTech (or scraped any legacy enterprise system), you know the reality is usually a terrifying block of free text written by a tired human at 3 AM.
I've been deep in the trenches lately trying to standardize clinical notes, and dealing with doctors' notes makes parsing HTML with regex look like a vacation.
The Reality Check: "Pt c/o…"
Doctors don't write JSON. They write in a secret code of abbreviations, typos, and shorthand.
The "data" actually looks like this:
"Pt 45yo m, c/o SOB x 2d. Denies CP. Hx of HTN, on lisinopril. Exam: wheezing b/l. Plan: nebs + steroids."
- If you run a standard keyword search for "High Blood Pressure," you might miss this record entirely because the doctor wrote "HTN" (Hypertension).
- If you search for "Pain," you might get a false positive because the note says "Denies CP" (Chest Pain).
Traditional NLP struggles here because context is everything. "SOB" means "Shortness of Breath" in a hospital, but something very different in a Reddit comment section. 😅
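To see both failure modes concretely, here's a toy sketch of the pre-LLM approach: an abbreviation map plus a naive negation check. The mapping and the "denies" heuristic are my own illustrative assumptions, not a validated clinical vocabulary.

```python
# Toy sketch: expand common abbreviations, then flag negated terms.
# The abbreviation map and the naive "denies" rule are illustrative
# assumptions, not a clinical standard.
ABBREVIATIONS = {
    "sob": "shortness of breath",
    "cp": "chest pain",
    "htn": "hypertension",
    "hx": "history",
    "c/o": "complains of",
}

def expand(note: str) -> str:
    # Strip basic punctuation so tokens match the map keys.
    words = note.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def mentions(note: str, term: str) -> bool:
    expanded = expand(note)
    if term not in expanded:
        return False
    # Naive negation check: was the term preceded by "denies" nearby?
    idx = expanded.find(term)
    return "denies" not in expanded[max(0, idx - 20):idx]

note = "Pt 45yo m, c/o SOB x 2d. Denies CP. Hx of HTN, on lisinopril."
print(mentions(note, "hypertension"))  # True: "HTN" now matches
print(mentions(note, "chest pain"))    # False: negated by "Denies"
```

This catches the two examples above, but it's brittle: every new abbreviation, typo, or negation phrasing needs another hand-written rule, which is exactly why context-aware models are tempting.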
The Hallucination Trap 👻
The modern solution is often phrased as: "Just throw it into ChatGPT/LLM, right?"
Well… yes and no.
If you ask a generic LLM to "Summarize this patient's status," it can do a great job, right up until it doesn't. The biggest risk in medical AI is hallucination.
Example: A model read a note mentioning a "family history of diabetes" and output a structured JSON stating the patient currently has diabetes.
Big yikes. In healthcare, that kind of error is unacceptable.
The Fix: The RAG + Fine-Tuning Sandwich 🥪
To make the data queryable (e.g., "Show me all patients with respiratory issues") without the AI lying, we need a strict pipeline.
1. Fine-Tuning (Teaching the Language)
Out-of-the-box models like gpt-3.5-turbo often lack the nuance of niche specialties. Fine-tuning a smaller model (e.g., Llama 3 or Mistral) on medical texts teaches it that "bid" means "twice a day" (bis in die), not an auction offer.
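What does that training data actually look like? Most fine-tuning stacks accept chat-style JSONL, one example per line. The pair below is invented for illustration; a real dataset needs thousands of vetted examples reviewed by clinicians.

```python
import json

# One line of a hypothetical fine-tuning dataset in the common
# chat-style JSONL format. The example pair is invented for
# illustration; a real dataset needs many vetted pairs.
example = {
    "messages": [
        {"role": "system", "content": "Expand medical abbreviations."},
        {"role": "user", "content": "Take amoxicillin 500mg bid."},
        {"role": "assistant", "content": "Take amoxicillin 500 mg twice a day."},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```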
2. Structured Extraction (The Translator)
Instead of asking the LLM to "chat," we force it to extract data into a predefined schema using tools like Pydantic or Instructor.
import instructor
from pydantic import BaseModel, Field
from openai import OpenAI

# Define the structure we WANT (The Dream)
class ClinicalNote(BaseModel):
    patient_age: int
    symptoms: list[str] = Field(description="List of physical complaints")
    medications: list[str]
    diagnosis_confirmed: bool = Field(description="Is the diagnosis final or just suspected?")

client = instructor.patch(OpenAI())

text_blob = "Pt 45yo m, c/o SOB x 2d. Denies CP. Hx of HTN, on lisinopril."

resp = client.chat.completions.create(
    model="gpt-4",
    response_model=ClinicalNote,
    messages=[
        {"role": "system", "content": "You are a medical scribe. Extract data accurately."},
        {"role": "user", "content": text_blob},
    ],
)

print(resp.model_dump_json(indent=2))
Output
{
  "patient_age": 45,
  "symptoms": ["Shortness of Breath"],
  "medications": ["lisinopril"],
  "diagnosis_confirmed": false
}
Now we have SQL-queryable data! 🎉
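Once the notes come back as validated objects, loading them into a relational store is the easy part. A minimal sketch with SQLite (the table layout is my own choice for illustration):

```python
import json
import sqlite3

# Assume `extracted` is the validated output from the extraction step.
extracted = {
    "patient_age": 45,
    "symptoms": ["Shortness of Breath"],
    "medications": ["lisinopril"],
    "diagnosis_confirmed": False,
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (patient_age INTEGER, symptoms TEXT, "
    "medications TEXT, diagnosis_confirmed INTEGER)"
)
conn.execute(
    "INSERT INTO notes VALUES (?, ?, ?, ?)",
    (
        extracted["patient_age"],
        json.dumps(extracted["symptoms"]),
        json.dumps(extracted["medications"]),
        int(extracted["diagnosis_confirmed"]),
    ),
)

# "Show me all patients with respiratory issues" becomes a plain query.
rows = conn.execute(
    "SELECT patient_age FROM notes WHERE symptoms LIKE '%Breath%'"
).fetchall()
print(rows)  # [(45,)]
```

Storing list fields as JSON strings keeps the sketch short; a production schema would normalize symptoms and medications into their own tables.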
3. RAG for Verification (The Guardrail)
Even with extraction, we need to trust the result. We embed the original notes into a vector database (e.g., Pinecone or Weaviate). When a user asks, "Does this patient have heart issues?", the system:
- Retrieves the specific chunk mentioning "Denies CP" and "Hx of HTN".
- Feeds only that chunk to the LLM.
- Cites the source.
If the AI can't find a relevant chunk, it is programmed to say "I don't know" rather than guessing.
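The retrieve-then-answer loop above can be sketched without any vector database at all. Here simple word-overlap scoring stands in for real embedding similarity, and the chunks and threshold are toy values I made up for illustration:

```python
# Minimal retrieval-guardrail sketch. Word-overlap scoring stands in
# for real embedding similarity; the chunks and threshold are toy values.
CHUNKS = [
    "Denies CP. Hx of HTN, on lisinopril.",
    "Exam: wheezing b/l. Plan: nebs + steroids.",
]

def score(query: str, chunk: str) -> float:
    # Fraction of query words that appear in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

def answer(query: str, threshold: float = 0.2) -> str:
    best = max(CHUNKS, key=lambda ch: score(query, ch))
    if score(query, best) < threshold:
        # No grounding found: refuse instead of guessing.
        return "I don't know"
    # In a real pipeline, `best` would be passed to the LLM as context,
    # and the answer would cite it as the source.
    return f"Based on the note: '{best}'"

print(answer("hx of htn?"))           # cites the chunk mentioning HTN
print(answer("any drug allergies?"))  # "I don't know"
```

The key design point is the explicit threshold: the model only ever sees text that actually exists in the record, and an empty retrieval becomes a refusal instead of a hallucination.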
Conclusion
Standardizing free-text clinical notes is painful, but it's the only way to unlock the value in medical records. We must move away from "magic black-box" AI toward structured AI pipelines: validating inputs, enforcing JSON schemas, and grounding everything in retrieved context.
It's messy work, but someone's gotta do it! 💻✨
Want to go deeper?
Check out my personal blog for the deep dives: wellally.tech/blog