Exploratory Data Analysis (EDA)
Source: Dev.to
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is an approach to analysing data sets that summarises their main characteristics, uncovers patterns, detects anomalies, checks assumptions, and assesses data quality before formal statistical models or machine‑learning algorithms are applied. The approach was popularised by John W. Tukey, who emphasised exploration before confirmation.
Key Ideas
- Flexible and investigative
- Uses both numerical and graphical methods
- Helps guide further analysis and modelling
Objectives of EDA
- Understand data structure
- Summarise key characteristics
- Detect outliers and anomalies
- Identify patterns and trends
- Check assumptions (normality, linearity, etc.)
- Assess data quality
- Guide feature selection and transformation
- Support decision‑making
Types of Exploratory Data Analysis
Based on Number of Variables
EDA can be classified by the number of variables examined at once: univariate (one variable), bivariate (two variables), and multivariate (three or more variables).
Steps in Exploratory Data Analysis
Step 1: Understand the Data
- Variable types (categorical, numerical)
- Units and scale
- Data source
- Size of dataset
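A first pass over a dataset can be sketched with the standard library alone. The sample below is hypothetical (the column names and values are invented for illustration), and `infer_type` is a deliberately crude helper that calls a column numerical when every non-empty value parses as a number.

```python
import csv
import io

# Hypothetical sample standing in for a real file on disk.
RAW = """customer_id,region,amount,quantity
1,North,120.50,2
2,South,80.00,1
3,North,,4
4,East,42.25,1
"""

def infer_type(values):
    """Label a column 'numerical' if every non-empty value parses as a float."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "categorical"
    return "numerical"

rows = list(csv.DictReader(io.StringIO(RAW)))
columns = list(rows[0].keys())

# Size of the dataset
n_rows, n_cols = len(rows), len(columns)

# Variable type and missing-value count per column
col_types = {c: infer_type([r[c] for r in rows]) for c in columns}
missing = {c: sum(1 for r in rows if r[c] == "") for c in columns}

print(f"{n_rows} rows x {n_cols} columns")
for c in columns:
    print(f"{c:12s} {col_types[c]:12s} missing={missing[c]}")
```

Note that `customer_id` is reported as numerical even though it is an identifier. Automatic type inference is only a starting point, so units, scale, and the meaning of each column still need to be checked against the data source.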
Step 2: Data Cleaning
- Remove duplicates
- Correct inconsistent data
- Detect invalid entries
Note: EDA often reveals that real‑world data is messy.
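The three cleaning tasks above can be sketched in a few lines. The records, the `clean` helper, and the valid age range of 0 to 120 are all assumptions made for illustration; real validity rules depend on the dataset.

```python
# Hypothetical records with typical problems
records = [
    {"id": 1, "city": "Nairobi", "age": 34},
    {"id": 2, "city": " nairobi ", "age": 29},
    {"id": 2, "city": " nairobi ", "age": 29},   # exact duplicate
    {"id": 3, "city": "Mombasa", "age": -5},     # invalid age
    {"id": 4, "city": "MOMBASA", "age": 41},
]

def clean(rows, min_age=0, max_age=120):
    """Return (valid, invalid) rows after normalising text and dropping duplicates."""
    seen = set()
    valid, invalid = [], []
    for row in rows:
        # Correct inconsistent data: trim whitespace and normalise case
        fixed = {**row, "city": row["city"].strip().title()}
        # Remove duplicates: skip rows already seen after normalisation
        key = tuple(sorted(fixed.items()))
        if key in seen:
            continue
        seen.add(key)
        # Detect invalid entries: keep them aside rather than silently dropping
        if min_age <= fixed["age"] <= max_age:
            valid.append(fixed)
        else:
            invalid.append(fixed)
    return valid, invalid

valid, invalid = clean(records)
print("valid:", valid)
print("needs review:", invalid)
```

Keeping invalid rows in a separate list, rather than deleting them, leaves room to decide later whether they are data entry errors or genuine values.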
Step 3: Univariate Analysis
Numerical Methods
- Variance, Standard Deviation
- Range, IQR
- Skewness, Kurtosis
- Percentiles, Z‑scores
Graphical Methods
- Box plots
- Bar charts
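As a minimal sketch of univariate analysis, the snippet below computes dispersion measures for one numeric column and draws a text bar chart for one categorical column. Both columns are hypothetical, and the `inclusive` quantile method is one of several common conventions, so quartiles may differ slightly from other tools.

```python
import statistics
from collections import Counter

values = [2, 4, 4, 5, 7, 9, 10, 12, 15, 40]   # hypothetical numeric column

variance = statistics.variance(values)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)       # sample standard deviation
value_range = max(values) - min(values)
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"variance={variance:.2f} std={std_dev:.2f} range={value_range} IQR={iqr}")

# Text bar chart for a hypothetical categorical column
payment = ["card", "cash", "card", "mobile", "card", "cash"]
counts = Counter(payment)
for label, count in counts.most_common():
    print(f"{label:8s} {'#' * count}")
```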
Step 4: Bivariate Analysis
Numerical Methods
- Covariance
- Cross‑tabulation
Graphical Methods
- Line plots
- Grouped bar charts
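The two numerical methods listed for bivariate analysis can be sketched as follows. `covariance` and `crosstab` are illustrative helpers, and the data is invented: covariance applies to a pair of numerical variables, cross-tabulation to a pair of categorical ones.

```python
from collections import Counter

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def crosstab(a, b):
    """Count how often each (a, b) pair of categories occurs."""
    return Counter(zip(a, b))

# Two hypothetical numerical variables
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]
cov = covariance(ad_spend, sales)
print("covariance:", cov)

# Two hypothetical categorical variables
region = ["N", "N", "S", "S", "N"]
channel = ["web", "store", "web", "web", "web"]
table = crosstab(region, channel)
for (r, c), count in sorted(table.items()):
    print(r, c, count)
```

A positive covariance means the two variables tend to rise together, but its size depends on the units, which is why correlation is usually reported alongside it.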
Step 5: Multivariate Analysis
- Pair plots
- Principal Component Analysis (PCA)
- Heatmaps
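Pair plots and heatmaps are both views of how every variable relates to every other. A minimal sketch of the numbers behind a correlation heatmap is shown below; the column names and values are hypothetical, and a plotting library would normally render the resulting matrix as colours.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict holding the correlation of every column with every other."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical dataset with three numerical columns
data = {
    "price":    [10, 12, 14, 16, 18],
    "discount": [5, 4, 3, 2, 1],
    "units":    [3, 1, 4, 1, 5],
}
matrix = correlation_matrix(data)

print(" " * 10 + "".join(f"{name:>10s}" for name in data))
for row_name, row in matrix.items():
    print(f"{row_name:10s}" + "".join(f"{value:10.2f}" for value in row.values()))
```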
Key Components of EDA
Measures of Central Tendency
- Mean
- Median
- Mode
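A short sketch, using a hypothetical list of incomes, shows why all three measures are worth computing: a single extreme value pulls the mean upwards while the median and mode stay put.

```python
import statistics

incomes = [30, 32, 32, 35, 38, 40, 250]   # hypothetical values, one extreme

mean = statistics.mean(incomes)
median = statistics.median(incomes)
mode = statistics.mode(incomes)

print(f"mean={mean:.1f} median={median} mode={mode}")
```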
Measures of Dispersion
- Range
- Variance
- Standard deviation
- IQR
Measures of Position
- Percentiles
- Quartiles
- Deciles
- Z‑scores
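The measures of position above can be sketched with `statistics.quantiles`. The scores are hypothetical, and the z-scores here use the population standard deviation, which is an assumption; some tools use the sample version instead.

```python
import statistics

scores = [55, 60, 62, 65, 70, 72, 75, 80, 85, 96]   # hypothetical exam scores

quartiles = statistics.quantiles(scores, n=4, method="inclusive")      # 3 cut points
deciles = statistics.quantiles(scores, n=10, method="inclusive")       # 9 cut points
percentiles = statistics.quantiles(scores, n=100, method="inclusive")  # 99 cut points
p90 = percentiles[89]   # index k - 1 holds the k-th percentile

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)   # population standard deviation
z_scores = [(x - mean) / sd for x in scores]

print("quartiles:", quartiles)
print("90th percentile:", p90)
print("z-score of the top score:", round(z_scores[-1], 2))
```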
Distribution Shape
- Skewness (degree of asymmetry)
- Kurtosis (heaviness of the tails, often loosely described as peakedness)
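Both shape measures can be computed from central moments. The sketch below uses the population (moment) definitions, which is an assumption: many libraries apply small-sample corrections and will return slightly different values.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment skewness m3 / m2**1.5 (0 for perfectly symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Excess kurtosis m4 / m2**2 - 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]   # long tail to the right

print("symmetric skew:", skewness(symmetric))
print("right-skewed skew:", round(skewness(right_skewed), 2))
print("symmetric excess kurtosis:", round(excess_kurtosis(symmetric), 2))
```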
Outlier Detection in EDA
Common Methods
- IQR method
- Z‑score method
- Visual inspection (box plot)
Outliers may indicate:
- Data entry errors
- Rare events
- Important insights
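The IQR and z-score methods can be sketched as below. The multiplier `k=1.5` and the threshold of 3 are common conventions rather than fixed rules, and the data is hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [x for x in values if abs(x - mean) / sd > threshold]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]   # 102 looks suspicious

print("IQR method:", iqr_outliers(data))
print("z-score, threshold 3:", zscore_outliers(data))
print("z-score, threshold 2.5:", zscore_outliers(data, threshold=2.5))
```

On this small sample the z-score method with a threshold of 3 flags nothing, because the extreme value inflates the standard deviation it is measured against. The IQR method is less sensitive to that effect, which is one reason to combine both methods with visual inspection.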
Graphical Tools Used in EDA
| Tool | Purpose |
|---|---|
| Histogram | Distribution |
| Box plot | Spread & outliers |
| Scatter plot | Relationships |
| Bar chart | Categorical data |
| Line plot | Trends over time |
| Heatmap | Correlation strength |
Importance of EDA
- Prevents incorrect modelling
- Improves data quality
- Reveals hidden insights
- Guides feature engineering
- Saves time and resources
Without EDA, conclusions may be misleading.
EDA in Data Science & Machine Learning
EDA helps in:
- Feature selection
- Data transformation
- Handling skewness
- Detecting multicollinearity
- Understanding target‑variable behaviour
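As one example from the list above, skewness is often handled with a log transform. The sketch below compares skewness before and after applying `math.log1p` to a hypothetical right-skewed sales column; whether a log transform is appropriate depends on the data and the model, so treat it as one option among several.

```python
import math

def skewness(values):
    """Moment skewness m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

sales = [1, 2, 2, 3, 5, 8, 13, 40, 120, 400]   # hypothetical, right-skewed

# log1p(x) = log(1 + x), which stays defined when x is 0
log_sales = [math.log1p(x) for x in sales]

skew_before = skewness(sales)
skew_after = skewness(log_sales)
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```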
Advantages of EDA
- Flexible and intuitive
- Minimal assumptions
- Works with small and large datasets
- Helps explain data to stakeholders
Limitations of EDA
- Subjective interpretation
- Cannot prove causation
- Time‑consuming for large datasets
- Results depend on analyst experience
Real‑World Example
Dataset: Customer purchase data
EDA might reveal:
- Most customers buy on weekends
- Sales are right‑skewed
- A few customers contribute most revenue
- Strong correlation between discounts and sales volume
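A finding such as "a few customers contribute most revenue" can be checked with a short calculation. The purchase records below are invented for illustration, and `top_share` is a hypothetical helper that reports the share of revenue coming from the top fraction of customers.

```python
import math
from collections import defaultdict

def top_share(purchases, fraction=0.2):
    """Share of total revenue contributed by the top `fraction` of customers."""
    revenue = defaultdict(float)
    for customer, amount in purchases:
        revenue[customer] += amount
    ranked = sorted(revenue.values(), reverse=True)
    top_n = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical (customer, amount) purchase records
purchases = [
    ("alice", 500.0), ("bob", 20.0), ("carol", 15.0),
    ("alice", 300.0), ("dave", 10.0), ("erin", 5.0), ("bob", 25.0),
]

share = top_share(purchases)
print(f"Top 20% of customers contribute {share:.0%} of revenue")
```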
EDA vs. Confirmatory Data Analysis
| Aspect | EDA (Exploratory) | Confirmatory Analysis |
|---|---|---|
| Goal | Exploration | Hypothesis testing |
| Approach | Flexible | Structured |
| Focus | Pattern discovery | Model validation |
| Assumptions | Few or none | Explicit, stated in advance |
Summary
Exploratory Data Analysis is the foundation of sound data analysis. It helps analysts understand, clean, summarise, and interpret data, which leads to better modelling and better-informed decisions.
“EDA lets the data speak before we impose our theories.”