Python Regex Explained Simply — Extract Anything From Text
Source: Dev.to
Regex sounds intimidating. It is not. Once you understand the 5 core concepts, you can extract almost any pattern from text in seconds. Here is everything you need to know.

Regex is a pattern language. You describe what you are looking for using special characters, and Python finds it for you in any block of text, any size.

Real example: your client sends you a document with 500 customer records mixed with random text. They need all email addresses extracted into Excel. Without regex this takes hours. With regex it takes 3 lines.

    import re

    text = "Contact john@gmail.com or sales@company.com for details"
    emails = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
    print(emails)
    # ['john@gmail.com', 'sales@company.com']
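To get from a list of matches to something a client can open in Excel, one option is to write the addresses to a CSV file. The sketch below is only an illustration: the function names `extract_emails` and `save_emails` are made up for this example, and it assumes the document has already been read into a string.

```python
import csv
import re

# Email pattern from the example above; the dot before the domain
# suffix is escaped so it only matches a literal '.'.
EMAIL_PATTERN = re.compile(r'[\w.-]+@[\w.-]+\.\w+')

def extract_emails(text):
    """Return unique email addresses in the order they first appear."""
    seen = []
    for email in EMAIL_PATTERN.findall(text):
        if email not in seen:
            seen.append(email)
    return seen

def save_emails(emails, path):
    """Write one address per row to a CSV file that Excel can open."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Email'])               # header row
        writer.writerows([e] for e in emails)    # one address per row

text = "Contact john@gmail.com or sales@company.com. Again: john@gmail.com"
emails = extract_emails(text)
print(emails)  # ['john@gmail.com', 'sales@company.com']
```

Calling `save_emails(emails, 'emails.csv')` would then produce the file; the file name is a placeholder.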
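The 5 core patterns, each shown with re.findall:

    # \d matches a single digit
    re.findall(r'\d', 'abc123def456')
    # ['1', '2', '3', '4', '5', '6']

    # \w+ matches a run of letters, digits, or underscores
    re.findall(r'\w+', 'hello world_123')
    # ['hello', 'world_123']

    # \d+ matches a run of one or more digits
    re.findall(r'\d+', 'price is 45000 and tax is 8100')
    # ['45000', '8100']

    # [aeiou] matches any one character from the set
    re.findall(r'[aeiou]', 'hello world')
    # ['e', 'o', 'o']

    # . matches any single character except a newline
    re.findall(r'c.t', 'cat cut cot bat')
    # ['cat', 'cut', 'cot']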
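These building blocks can be combined in a single pattern. As a small sketch (the sample sentence and the invoice-code format are invented for illustration), a character set followed by `\d+` picks out codes made of one letter and some digits:

```python
import re

text = "Invoices A102, B7 and C4500 are overdue; ticket 99 is closed"

# [ABC] matches one letter from the set, \d+ matches the digits after it.
# The bare number 99 is skipped because it has no leading letter.
codes = re.findall(r'[ABC]\d+', text)
print(codes)        # ['A102', 'B7', 'C4500']

# A dot stands for any single character: 'A', then anything, then '0'.
dot_matches = re.findall(r'A.0', text)
print(dot_matches)  # ['A10']
```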
re.findall returns a list of everything that matches the pattern.

    text = "Prices: ₹45,000 and ₹12,500 and ₹8,750"
    prices = re.findall(r'[\d,]+', text)
    print(prices)
    # ['45,000', '12,500', '8,750']
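One behaviour of re.findall is worth knowing early: when the pattern contains a capturing group in parentheses, the list holds the group's contents rather than the whole match. A minimal sketch using the same price string:

```python
import re

text = "Prices: ₹45,000 and ₹12,500 and ₹8,750"

# No group: each item is the full match, rupee sign included.
with_symbol = re.findall(r'₹[\d,]+', text)
print(with_symbol)  # ['₹45,000', '₹12,500', '₹8,750']

# One group: each item is only the part inside the parentheses.
amounts = re.findall(r'₹([\d,]+)', text)
print(amounts)      # ['45,000', '12,500', '8,750']

# Strip the commas to get integers you can do arithmetic on.
values = [int(a.replace(',', '')) for a in amounts]
print(values)       # [45000, 12500, 8750]
```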
re.sub replaces every match with something else.

    messy = "phone: 98-765-43210"
    clean = re.sub(r'\D', '', messy)  # remove all non-digits
    print(clean)
    # 9876543210
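The replacement does not have to be an empty string. As a short sketch of two other uses of re.sub, here is masking digits and limiting the number of replacements with the `count` argument (the masking scenario is just an illustration):

```python
import re

messy = "phone: 98-765-43210"

# Replace each digit with X to hide the number but keep its layout.
masked = re.sub(r'\d', 'X', messy)
print(masked)   # phone: XX-XXX-XXXXX

# count limits how many replacements are made (here, only the first two).
partial = re.sub(r'\d', 'X', messy, count=2)
print(partial)  # phone: XX-765-43210
```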
re.search returns just the first match, along with its position.

    text = "Order #A12345 placed successfully"
    match = re.search(r'#(\w+)', text)
    if match:
        print(match.group(1))  # A12345
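To make the "position" part concrete, the match object also exposes where the match sits in the string, and re.search returns None when nothing matches. A minimal sketch with the same order text:

```python
import re

text = "Order #A12345 placed successfully"

match = re.search(r'#(\w+)', text)
if match:
    print(match.group(0))  # prints #A12345 (the whole match)
    print(match.group(1))  # prints A12345 (the first group)
    print(match.span())    # prints (6, 13): start and end index in text

# No match returns None, so check before calling .group().
missing = re.search(r'#(\w+)', "No order number here")
print(missing)             # prints None
```

This is why the `if match:` check matters: calling `.group()` on None raises an AttributeError.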
Client problem: they have a spreadsheet with phone numbers in 6 different formats. They need them all standardised to 10 digits.

    import re
    import pandas as pd

    df = pd.DataFrame({
        'Phone': ['9876543210', '+91-9876543210', '(080) 4567-8901', '91 98765 43210']
    })

    def clean_phone(phone):
        digits = re.sub(r'\D', '', phone)  # keep digits only
        if len(digits) == 10:
            return digits
        elif len(digits) == 12 and digits.startswith('91'):
            return digits[2:]  # drop the 91 country code
        return None  # no rule matched

    df['Clean'] = df['Phone'].apply(clean_phone)
    print(df)
    # The three mobile numbers all become 9876543210. The landline
    # (080) 4567-8901 has 11 digits, matches neither rule, and gets no clean value.
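Before running a cleaner like this over a whole spreadsheet, it can help to try it on a handful of known inputs without pandas. The sketch below repeats clean_phone and adds one thing that is not in the original: an `isinstance` check, on the assumption that empty cells arrive as None or NaN rather than strings, which would otherwise make re.sub raise a TypeError.

```python
import re

def clean_phone(phone):
    """Return a 10-digit number as a string, or None if no rule applies."""
    # Assumption for this sketch: empty cells are None/NaN, not strings.
    if not isinstance(phone, str):
        return None
    digits = re.sub(r'\D', '', phone)  # drop everything except digits
    if len(digits) == 10:
        return digits
    if len(digits) == 12 and digits.startswith('91'):
        return digits[2:]              # strip the 91 country code
    return None

samples = ['9876543210', '+91-9876543210', '(080) 4567-8901', '91 98765 43210', None]
results = [clean_phone(s) for s in samples]
print(results)
# ['9876543210', '9876543210', None, '9876543210', None]
```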
Regex is a pattern language: you describe what you are looking for and Python finds every instance of it in any text, any size. Learn these 5 patterns and 3 functions and you can handle 90% of real data extraction gigs immediately.

Written by Raaga Priya Madhan, CSE student, Bangalore. I build Python automation and data extraction scripts. See my work on GitHub and connect on LinkedIn.