Descriptive Analysis
Source: Dev.to
Basic Concepts
| Concept | Definition |
|---|---|
| Population | The set of all elements under study. |
| Sample | A subset of elements of the population (it should be representative of the population). |
| Individuals | Every individual element in the population. |
| Variables | Characteristics of the individuals. |
Example: Titanic Dataset
- Population: 2 224 individuals (all passengers and crew).
- Samples:
  - train.csv: 891 representative individuals of the population, used for training machine‑learning models.
  - test.csv: 418 representative individuals of the population, used for testing machine‑learning models.
- Individuals: Each passenger or crew member (every row of the sample or population data).
- Variables: The characteristics collected for each individual (every column of the data), e.g., Survived, Sex, and Age.
Types of Variables
1. Numerical
The data are numbers that measure or count a quantity.
| Subtype | Description | Example |
|---|---|---|
| Continuous | Can take any value within an interval, so the number of possible values is infinite. | Fare – can have values with up to 4 decimal places. |
| Discrete | Can take only a countable set of values, typically whole numbers. | SibSp (number of siblings/spouses aboard). |
2. Categorical
The data are represented by texts or numbers whose meaning is not metric but rather denotes a category.
| Subtype | Description | Example |
|---|---|---|
| Nominal | No intrinsic order among categories. | Embarked – port of embarkation (C, Q, S). |
| Ordinal | Categories have a natural order. | Pclass – passenger class (1 = first, 2 = second, 3 = third). |
Note of interest: The typology of a variable is not always clear and may depend on the analyst’s objectives. For instance, Age can be treated as a continuous numerical variable, a discrete numerical variable (if rounded), or an ordinal variable (if grouped into age ranges).
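The three readings of Age can be sketched in a few lines. The bin edges, group labels, and sample ages below are hypothetical choices made only for illustration; they do not come from the dataset documentation.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels, chosen only for illustration.
EDGES = [13, 18, 65]  # exclusive upper bounds of the first three groups
LABELS = ["child", "teen", "adult", "senior"]


def age_group(age: float) -> str:
    """Map a continuous age to an ordered category."""
    return LABELS[bisect_right(EDGES, age)]


ages = [0.42, 4.0, 17.5, 28.0, 64.9, 80.0]  # made-up values, continuous view
rounded = [round(a) for a in ages]          # discrete view (whole years)
groups = [age_group(a) for a in ages]       # ordinal view (ordered age ranges)

print(rounded)
print(groups)
```

Which view is appropriate depends on the question being asked: the grouped version loses detail but is easier to compare across categories.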
Data Visualization (Brief Overview)
Two of the most useful visual representations in data analysis are histograms and bar charts.
- Histogram – used for continuous variables.
- Bar chart – used for categorical variables and for discrete variables with few distinct values.
[Image: Bar chart]
[Image: Histogram]
These plots help us understand the distribution of a variable (e.g., whether it is symmetric or asymmetric).
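The difference between the two plots comes down to how the bar heights are counted. The sketch below uses made-up values (not actual rows from train.csv) and an assumed bin width of 25; it prints a rough text histogram instead of drawing a chart.

```python
from collections import Counter

# Made-up values for illustration; not actual rows from train.csv.
embarked = ["S", "C", "S", "Q", "S", "C"]
fares = [7.25, 71.28, 8.05, 13.0, 26.55, 52.0]

# Bar chart: one bar per category, height = frequency of that category.
bar_heights = Counter(embarked)
print(dict(bar_heights))

# Histogram: split the numeric range into equal-width bins, then count per bin.
bin_width = 25.0  # assumed bin width
hist_counts = Counter(int(f // bin_width) for f in fares)
for b in sorted(hist_counts):
    left, right = b * bin_width, (b + 1) * bin_width
    print(f"[{left:g}, {right:g}): {'#' * hist_counts[b]}")
```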
Descriptive Statistics
Descriptive statistics summarize a dataset through measures that characterize its central tendency and dispersion.
Measures of Central Tendency
| Measure | Formula | Description |
|---|---|---|
| Population mean | $$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$ | Average of all values in the population ($N$ = total number of values). |
| Sample mean | $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$ | Average of the sample ($n$ = sample size). |
Example (Titanic – Age in train.csv)
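import pandas as pd
train = pd.read_csv("train.csv")  # Kaggle Titanic training set
mean_age = train["Age"].mean()  # missing ages (NaN) are skipped by default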
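As a minimal sketch of the sample-mean formula itself, the snippet below computes the average by hand. The ages are made up (they are not rows from train.csv), and missing values are dropped first, which mirrors what the pandas `mean()` method does by default.

```python
import math

ages = [22.0, 38.0, float("nan"), 35.0, 54.0]  # made-up sample; nan marks a missing age

observed = [a for a in ages if not math.isnan(a)]  # drop missing values first
n = len(observed)
mean_age = sum(observed) / n  # (1/n) * sum of x_i

print(n, mean_age)
```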
Median
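The median is the middle value when the data are sorted. For an odd number of observations $n$, it is the value at position $(n+1)/2$:
$$ x_{(m)} = x_{\left(\frac{n+1}{2}\right)} $$
For an even $n$, it is the average of the two middle values, $x_{(n/2)}$ and $x_{(n/2+1)}$.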
It is less affected by extreme values (outliers) than the mean.
Example (Titanic – Age):
median_age = train['Age'].median()
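To see why the median is more robust, the following sketch adds a single extreme value to a small made-up sample and compares how far each measure moves. It uses Python's standard `statistics` module rather than pandas so that it runs without the dataset.

```python
from statistics import mean, median

ages = [22, 24, 26, 28, 30]  # made-up sample
with_outlier = ages + [80]   # same sample plus one extreme value

print(mean(ages), median(ages))                  # both sit at the centre
print(mean(with_outlier), median(with_outlier))  # the mean shifts far more than the median
```

With six values the median is the average of the two middle ones, which is why it moves only slightly.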
Mode
The mode is the most frequently occurring value in the dataset.
Example (Titanic – Age):
mode_age = train['Age'].mode()[0]
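A dataset can have more than one mode. In pandas, `mode()` returns all tied values in sorted order, so taking `[0]` keeps only the smallest of them. The sketch below shows the same situation with the standard library on made-up ages.

```python
from statistics import mode, multimode

ages = [24, 24, 30, 30, 41]  # made-up values with two equally frequent ages

all_modes = multimode(ages)  # every value tied for the highest count
first_mode = mode(ages)      # only the first mode encountered (Python 3.8+)

print(all_modes, first_mode)
```

When several modes exist, reporting only one of them can hide the fact that the distribution has more than one peak.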
Interpreting the Three Measures
- Symmetric distribution: mean ≈ median ≈ mode.
- Positive (right‑skewed) asymmetry: mean > median > mode. Age in train.csv follows this pattern, so more passengers are younger than the mean age.
- Negative (left‑skewed) asymmetry: mean < median < mode.
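One way to turn the comparison of mean and median into a single number is Pearson's second skewness coefficient, 3 × (mean − median) / s. This is a swapped-in rule-of-thumb measure, not something the text above prescribes, and the data below are made up.

```python
from statistics import mean, median, stdev


def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / s."""
    return 3 * (mean(values) - median(values)) / stdev(values)


right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]  # made-up data with a long right tail
left_skewed = [-v for v in right_skewed]  # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(pearson_median_skewness(right_skewed))  # positive value
print(pearson_median_skewness(left_skewed))   # negative value
print(pearson_median_skewness(symmetric))     # zero
```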
Measures of Variability
Range
The range gives an idea of how far apart the data values are. It is calculated by subtracting the minimum value in a set from the maximum one:
$$ \text{Range} = \max_i x_i - \min_i x_i $$
Example (Titanic dataset) – Age variable:
$$ 80 - 0.42 = 79.58 \text{ years} $$
Variance
The variance measures the dispersion of the values with respect to their mean. It is obtained by averaging the squared residuals (differences between each value and the mean).
Population variance
$$ \sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i-\mu\bigr)^{2} $$
Unbiased sample variance
$$ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i-\bar{x}\bigr)^{2} $$
We square the differences because the sum of the raw differences would be zero.
The denominator $n-1$ (instead of $n$) yields an unbiased estimate of the population variance; using $n$ would tend to underestimate the true variance.
Example (Titanic dataset) – Age variable in train.csv:
$$ s^{2} = 211.01 $$
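The two formulas and the remark about raw differences can be checked on a small made-up dataset. The sketch below computes everything by hand and then compares the results with `pvariance` and `variance` from the standard library.

```python
from math import isclose
from statistics import fmean, pvariance, variance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5
m = fmean(values)
deviations = [x - m for x in values]

raw_sum = sum(deviations)                     # positive and negative deviations cancel
squared_sum = sum(d * d for d in deviations)  # squaring keeps every deviation positive

pop_var = squared_sum / len(values)           # divide by N
sample_var = squared_sum / (len(values) - 1)  # divide by n - 1

# Cross-check against the standard-library implementations.
assert isclose(pop_var, pvariance(values))
assert isclose(sample_var, variance(values))
print(raw_sum, pop_var, sample_var)
```

The sample variance is always the larger of the two, and the gap shrinks as the number of observations grows.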
Standard Deviation
The standard deviation expresses dispersion in the same units as the original data. It is simply the square root of the variance:
$$ \sigma = \sqrt{\sigma^{2}} \quad\text{(population)} \qquad s = \sqrt{s^{2}} \quad\text{(sample)} $$
Example (Titanic dataset) – Age variable in train.csv:
$$ s = 14.52 \text{ years} $$
Standard Error
The standard error of the mean estimates how much the sample mean would vary from one sample to another, and therefore how precisely it estimates the population mean. It is calculated by dividing the sample standard deviation by the square root of the sample size:
$$ SE_{\bar{x}} = \frac{s}{\sqrt{n}} $$
Example (Titanic dataset) – Age variable in train.csv:
$$ SE_{\bar{x}} = 0.54 \text{ years} $$
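The formula implies that, for a similar spread, a larger sample gives a smaller standard error. The sketch below illustrates this on made-up data. Repeating the same values four times is not a realistic sample; it is done here only to keep the spread comparable while changing the sample size.

```python
from math import sqrt
from statistics import stdev


def standard_error(sample):
    """Standard error of the mean: s / sqrt(n), with s the sample standard deviation."""
    return stdev(sample) / sqrt(len(sample))


small = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, n = 8
large = small * 4                 # same values repeated, n = 32

se_small = standard_error(small)
se_large = standard_error(large)
print(se_small, se_large)  # the second value is roughly half the first
```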
Interpreting Dispersion
These measures (range, variance, standard deviation, standard error) provide the first clues about the variability of a distribution. They are especially useful when:
- Comparing datasets on the same variable.
- Combining them with measures of central tendency (mean, median, mode) to characterize a variable’s distribution.
Visualising Distributions
Numbers alone can be complemented by visualisations, which often reveal patterns that are not obvious from summary statistics. Typical distribution shapes you may encounter are:
1. Symmetric (Mean ≈ Median ≈ Mode)
Values are evenly spread around the centre.
2. Skewed Right (Positive Skew)
Mean > Median > Mode – more values lie below the mean.
3. Skewed Left (Negative Skew)
Mean < Median < Mode – more values lie above the mean.
4. Uniform (Flat)
All values occur with roughly the same frequency across the range. A flat shape can also be an artifact of bin widths that are too large, or appear when the variable aggregates several underlying variables. Adjusting the bin size or plotting a different type of chart can reveal hidden structure.
5. Multimodal
Two or more distinct peaks appear, suggesting the presence of multiple sub‑populations or sources of variation. Examining each mode separately can be informative.
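The effect of bin width on the apparent shape can be seen on a small made-up dataset with two clusters. The `histogram` helper below is a hypothetical function written for this illustration, not a library call.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per equal-width bin; keys are the left edges of the bins."""
    return Counter((v // bin_width) * bin_width for v in values)


# Made-up data with two clusters, around 10 and around 30.
values = [8, 9, 10, 10, 11, 12, 28, 29, 30, 30, 31, 32]

wide = histogram(values, 20)   # two bins with equal counts: looks flat
narrow = histogram(values, 5)  # two separated peaks with an empty gap between them

print(dict(wide))
print(dict(narrow))
```

With the wide bins the two clusters are indistinguishable from a uniform spread; the narrower bins expose the gap between them.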
6. Normal (Gaussian)
A symmetric, bell‑shaped curve. Many natural phenomena approximately follow this pattern, and it is especially convenient because known proportions of the data fall within fixed multiples of the standard deviation from the mean:
- ~68 % within ±1 σ
- ~95 % within ±2 σ
- ~99.7 % within ±3 σ
When data approximate a normal distribution, many statistical tests and confidence‑interval calculations become straightforward.
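The percentages above can be reproduced from the cumulative distribution function of a normal distribution. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k):.4f}")
```

Because the result does not depend on the particular mean or standard deviation, the same shares apply to any normally distributed variable.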
Reference
Kaggle – Titanic: Machine Learning from Disaster
Feel free to adapt the visualisations (histograms, density plots, box‑plots, etc.) to the specific characteristics of your data.