Feature Engineering
Source: Dev.to
What is Feature Engineering?
- A feature is just a column of data (e.g., age, salary, number of purchases).
- Feature engineering means creating, modifying, or selecting the right features so that your model learns better.
- Think of it as preparing ingredients before cooking—you want them clean, chopped, and ready to make the dish tasty.
Why Do We Need It?
- Raw data is often messy, incomplete, or not in the right format.
- Good features help algorithms see patterns more clearly.
- Better features → better predictions, faster training, and more accurate results.
Common Techniques in Feature Engineering
| Technique | What It Means | Simple Example |
|---|---|---|
| Handling Missing Values | Fill in blanks or remove incomplete data | Replace missing ages with the average age |
| Encoding Categorical Data | Convert text labels into numbers | “Red, Blue, Green” → 0, 1, 2 |
| Scaling / Normalization | Put numbers on similar ranges | Salary (₹10,000–₹1,00,000) scaled to 0–1 |
| Feature Creation | Combine or transform existing data into new features | From “Date of Birth” → create “Age” |
| Feature Selection | Keep only the most useful features | Drop irrelevant columns like “User ID” |
| Binning | Group continuous values into categories | Age 0–12 = Child, 13–19 = Teen, 20+ = Adult |
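The binning row above maps directly to `pd.cut` in pandas. Here is a minimal sketch using hypothetical ages and the same bin edges as the table (the upper bound of 120 is an assumption for illustration):

```python
import pandas as pd

# Hypothetical ages; bin edges follow the table above (0-12, 13-19, 20+)
ages = pd.Series([5, 16, 34, 70])
groups = pd.cut(ages, bins=[0, 12, 19, 120],
                labels=['Child', 'Teen', 'Adult'])
print(list(groups))  # ['Child', 'Teen', 'Adult', 'Adult']
```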
Simple Example
Imagine you have this dataset:
| Name | Date of Birth | Salary | City |
|---|---|---|---|
| Alice | 1995-06-12 | 50,000 | Delhi |
| Bob | 1988-03-05 | 80,000 | Mumbai |
After feature engineering:
- Age is calculated from Date of Birth.
- City is encoded as numbers (Delhi = 0, Mumbai = 1).
- Salary is scaled between 0 and 1.
The data is now cleaner and easier for the model to understand.
Key Takeaways
- Feature engineering = preparing and improving data features.
- It makes models smarter and predictions more accurate.
- Core techniques include handling missing values, encoding, scaling, creating new features, and selecting the best ones.
Feature Engineering in Python
Make sure you have pandas and scikit-learn installed:
pip install pandas scikit-learn
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
# Example dataset
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Date_of_Birth': ['1995-06-12', '1988-03-05', '2000-12-20'],
'Salary': [50000, 80000, None], # Missing value
'City': ['Delhi', 'Mumbai', 'Delhi']
}
df = pd.DataFrame(data)
print("Original Data:\n", df)
# 🔹 Handling Missing Values
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
# 🔹 Feature Creation (Age from Date of Birth)
df['Date_of_Birth'] = pd.to_datetime(df['Date_of_Birth'])
df['Age'] = pd.Timestamp.now().year - df['Date_of_Birth'].dt.year
# (Year difference is an approximation: it ignores whether the birthday has passed this year.)
# 🔹 Encoding Categorical Data (City)
# Note: LabelEncoder assigns arbitrary integers; for nominal features with
# no natural order, one-hot encoding (pd.get_dummies) is often safer.
label_encoder = LabelEncoder()
df['City_encoded'] = label_encoder.fit_transform(df['City'])
# 🔹 Scaling Numerical Data (Salary)
scaler = MinMaxScaler()
df['Salary_scaled'] = scaler.fit_transform(df[['Salary']])
print("\nAfter Feature Engineering:\n", df)
What This Code Does
- Handles missing values by filling in the average salary.
- Creates a new feature (Age) from Date_of_Birth.
- Encodes categorical data (City) into numbers.
- Scales numerical data (Salary) between 0 and 1.
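The one technique from the table that the script above skips is feature selection. In its simplest form, that just means dropping columns the model should not learn from, like identifiers. A minimal sketch (the DataFrame here is a hypothetical stand-in, not the output of the script above):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],      # identifier: not predictive, drop it
    'Age': [30, 37],
    'Salary_scaled': [0.0, 1.0],
})

# Feature selection: keep only the columns the model should see
features = df.drop(columns=['Name'])
print(list(features.columns))  # ['Age', 'Salary_scaled']
```

For larger datasets, scikit-learn also offers automated selectors (e.g., `sklearn.feature_selection.SelectKBest`), but dropping obviously irrelevant columns by hand is the usual first step.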
Final Note
Think of feature engineering like polishing a diamond. The raw stone (data) is valuable, but shaping and refining it (features) unlocks its true brilliance.