Descriptive Analysis

Published: 1 week ago (December 30, 2025 at 10:10 AM EST)

5 min read

Source: Dev.to

Basic Concepts

Concept	Definition
Population	The set of all elements under study.
Sample	A subset of elements of the population (it should be representative of the population).
Individuals	Every individual element in the population.
Variables	Characteristics of the individuals.

Example: Titanic Dataset

Population: 2 224 individuals (all passengers and crew).
Samples:
- train.csv: 891 representative individuals of the population, used for training machine‑learning models.
- test.csv: 418 representative individuals of the population, used for testing machine‑learning models.
Individuals: Each passenger or crew member (every row of the sample or population data).
Variables: The characteristics collected for each individual (every column of the data), e.g., Survived, Sex, Age, etc.

Types of Variables

1. Numerical

The data are represented by numbers that are metric or measure a quantity.

Subtype	Description	Example
Continuous	Can take any value within an interval, so the number of possible values is infinite.	`Fare` – recorded with up to 4 decimal places.
Discrete	Can take only separate, countable values (typically whole numbers).	`SibSp` (number of siblings/spouses aboard).

2. Categorical

The data are represented by texts or numbers whose meaning is not metric but rather denotes a category.

Subtype	Description	Example
Nominal	No intrinsic order among categories.	`Embarked` – port of embarkation (C, Q, S).
Ordinal	Categories have a natural order.	`Pclass` – passenger class (1 = first, 2 = second, 3 = third).

Note of interest: The typology of a variable is not always clear and may depend on the analyst’s objectives. For instance, Age can be treated as a continuous numerical variable, a discrete numerical variable (if rounded), or an ordinal variable (if grouped into age ranges).
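The three readings of Age mentioned in the note can be sketched in a few lines. This is only an illustration: the sample ages and the cut-offs used for the groups (18 and 65) are assumptions made for the example, not values taken from the dataset documentation.

```python
# Hypothetical ages; 0.42 mirrors the fractional ages that appear in the Titanic data.
ages = [0.42, 4.0, 22.0, 28.7, 35.0, 54.0, 80.0]

# Continuous: keep the measured value as it is.
continuous = ages

# Discrete: round to whole years.
discrete = [round(a) for a in ages]

# Ordinal: group into ordered ranges (cut-offs are illustrative assumptions).
def age_group(age):
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"

ordinal = [age_group(a) for a in ages]

print(discrete)
print(ordinal)
```

Which representation is appropriate depends on the question being asked: the grouped version loses detail but is easier to tabulate and compare.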

Data Visualization (Brief Overview)

Two of the most useful visual representations in data analysis are histograms and bar charts.

Histogram – used for continuous variables.
Bar chart – used for categorical variables (and for discrete variables with few distinct values).

[Image: Bar chart]
[Image: Histogram]

These plots help us understand the distribution of a variable (e.g., whether it is symmetric or asymmetric).
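The difference between the two plots comes down to what is being counted. The sketch below computes the bar heights for each and prints them as text; the values and the bin width of 20 are assumptions chosen for the example, and a plotting library would normally draw the result.

```python
from collections import Counter

# Hypothetical values standing in for two Titanic columns.
embarked = ["S", "C", "S", "Q", "S", "C", "S"]          # categorical
fares = [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86]   # continuous

# Bar chart: one bar per category, height = frequency of that category.
bar_heights = Counter(embarked)

# Histogram: split the numeric range into equal-width bins and count per bin.
bin_width = 20  # assumed width; a different width changes the apparent shape
hist_heights = Counter(int(f // bin_width) * bin_width for f in fares)

for category, count in sorted(bar_heights.items()):
    print(f"{category:>5} | {'#' * count}")
for left_edge, count in sorted(hist_heights.items()):
    print(f"{left_edge:>3}-{left_edge + bin_width:<3} | {'#' * count}")
```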

Descriptive Statistics

Descriptive statistics summarize a dataset through measures that characterize its central tendency and dispersion.

Measures of Central Tendency

Measure	Formula	Description
Population mean	$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$	Average of all values in the population ($N$ = total number of values).
Sample mean	$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$	Average of the sample ($n$ = sample size).

Example (Titanic – `Age` in `train.csv`)

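import pandas as pd
train = pd.read_csv('train.csv')  # Kaggle Titanic training sample
mean_age = train['Age'].mean()    # missing ages (NaN) are skipped by default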

Median

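The median is the middle value when the data are ordered. With the sorted values written as $x_{(1)} \le \dots \le x_{(n)}$:

$$ \text{Median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even} \end{cases} $$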

It is less affected by extreme values (outliers) than the mean.

Example (Titanic – Age):

median_age = train['Age'].median()
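To make the definition concrete, here is a minimal sketch of the median written by hand, using the usual convention of averaging the two middle values when the number of observations is even. The sample ages and the outlier of 500 are hypothetical.

```python
import statistics

def median(values):
    """Middle value of the sorted data; mean of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # 1-indexed position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [22, 38, 26, 35, 54]      # hypothetical ages
even_sample = [22, 38, 26, 35, 54, 2]
print(median(odd_sample), median(even_sample))

# An extreme value moves the mean far more than the median.
with_outlier = odd_sample + [500]
print(statistics.mean(odd_sample), statistics.mean(with_outlier))
print(median(odd_sample), median(with_outlier))
```

The standard library's `statistics.median` uses the same convention, so the hand-written function is only there to show the steps.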

Mode

The mode is the most frequently occurring value in the dataset.

Example (Titanic – Age):

mode_age = train['Age'].mode()[0]

Interpreting the Three Measures

Symmetric distribution: mean ≈ median ≈ mode.
Positive (right‑skewed) asymmetry: mean > median > mode.
Negative (left‑skewed) asymmetry: mean < median < mode; more values lie above the mean.
In the Titanic Age example, more passengers are younger than the mean age, which corresponds to positive (right) skew.
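The comparison above can be turned into a rough check. The sketch below is a heuristic only: the ordering of mean, median and mode is a rule of thumb rather than a guaranteed property of skewed data, and both the helper name `describe_skew` and the sample are made up for the example.

```python
import statistics

def describe_skew(values):
    """Return (mean, median, mode, label), labelling skew from mean vs. median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)  # most frequent value
    if mean > median:
        label = "right-skewed (positive)"
    elif mean < median:
        label = "left-skewed (negative)"
    else:
        label = "roughly symmetric"
    return mean, median, mode, label

# Hypothetical sample with a long right tail.
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]
print(describe_skew(sample))
```

Here the few large values pull the mean above the median, which in turn sits above the mode.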

Measures of Variability

Range

The range gives an idea of how far apart the data values are. It is calculated by subtracting the minimum value in a set from the maximum one:

$$ \text{Range} = \max_i x_i - \min_i x_i $$

Example (Titanic dataset) – Age variable:

$$ 80 - 0.42 = 79.58 \text{ years} $$
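The same subtraction in code, as a small sketch. The list is hypothetical except for its two extremes, which match the minimum and maximum ages quoted above.

```python
# Hypothetical ages; only the extremes (0.42 and 80) come from the example above.
ages = [0.42, 22.0, 38.0, 26.0, 80.0]

age_range = max(ages) - min(ages)
print(round(age_range, 2))
```

Because the range depends only on the two extreme values, a single unusual observation can change it considerably.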

Variance

The variance measures the dispersion of the values with respect to their mean. It is obtained by averaging the squared residuals (differences between each value and the mean).

Population variance

$$ \sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i-\mu\bigr)^{2} $$

Unbiased sample variance

$$ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i-\bar{x}\bigr)^{2} $$

We square the differences because the sum of the raw differences would be zero.
The denominator $n-1$ (instead of $n$) yields an unbiased estimate of the population variance; using $n$ would tend to underestimate the true variance.

Example (Titanic dataset) – Age variable in train.csv:

$$ s^{2} = 211.01 $$
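Both variance formulas, and the reason for squaring, can be checked on a small sample. The ages below are hypothetical; the standard library's `statistics.pvariance` and `statistics.variance` are used only to confirm the hand computation.

```python
import statistics

ages = [22, 38, 26, 35, 54]  # hypothetical sample
n = len(ages)
mean = sum(ages) / n

# Raw differences from the mean cancel out, which is why they are squared.
residuals = [x - mean for x in ages]
print(sum(residuals))

sum_of_squares = sum(r ** 2 for r in residuals)
population_variance = sum_of_squares / n       # divide by N
sample_variance = sum_of_squares / (n - 1)     # divide by n - 1 (unbiased)
print(population_variance, sample_variance)

# Cross-check against the standard library.
print(statistics.pvariance(ages), statistics.variance(ages))
```

When reproducing the figure above with pandas, note that `Series.var()` divides by $n-1$ by default, so it corresponds to the sample variance.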

Standard Deviation

The standard deviation expresses dispersion in the same units as the original data. It is simply the square root of the variance:

$$ \sigma = \sqrt{\sigma^{2}} \quad\text{(population)} \qquad s = \sqrt{s^{2}} \quad\text{(sample)} $$

Example (Titanic dataset) – Age variable in train.csv:

$$ s = 14.52 \text{ years} $$

Standard Error

The standard error of the mean estimates how much the sample mean would vary from one sample to another, and therefore how precisely it estimates the population mean. It is calculated by dividing the sample standard deviation by the square root of the sample size:

$$ SE_{\bar{x}} = \frac{s}{\sqrt{n}} $$

Example (Titanic dataset) – Age variable in train.csv:

$$ SE_{\bar{x}} = 0.54 $$
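A minimal sketch of the calculation, chaining the sample standard deviation into the standard error. The helper name `standard_error` and the ages are hypothetical; the second sample simply repeats the first four times to show the effect of a larger $n$.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

small = [22, 38, 26, 35, 54]  # hypothetical ages
large = small * 4             # same values repeated: four times as many observations

print(standard_error(small))
print(standard_error(large))
```

The larger sample gives a smaller standard error even though the spread of the values is similar. If a column contains missing values, $n$ should count only the observations actually used.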

Interpreting Dispersion

These measures (range, variance, standard deviation, standard error) provide the first clues about the variability of a distribution. They are especially useful when:

Comparing datasets on the same variable.
Combining them with measures of central tendency (mean, median, mode) to characterize a variable’s distribution.

Visualising Distributions

Numbers alone can be complemented by visualisations, which often reveal patterns that are not obvious from summary statistics. Typical distribution shapes you may encounter are:

1. Symmetric (Mean ≈ Median ≈ Mode)

Values are evenly spread around the centre.

2. Skewed Right (Positive Skew)

Mean > Median > Mode – more values lie below the mean.

3. Skewed Left (Negative Skew)

Mean < Median < Mode – more values lie above the mean.

4. Uniform (Flat)

All values occur with roughly the same frequency across the range, so the histogram looks flat. A flat shape can also be an artifact: it can appear when bin widths are too large or when the variable actually aggregates several underlying variables. Adjusting the bin size or plotting a different type of chart can reveal hidden structure.

5. Multimodal

Two or more distinct peaks appear, suggesting the presence of multiple sub‑populations or sources of variation. Examining each mode separately can be informative.
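The effect of bin width on the visible shape can be seen without plotting. In this sketch the ages are hypothetical and come from two invented groups; the helper `histogram` and both bin widths are assumptions chosen to make the contrast obvious.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per equal-width bin, keyed by the bin's left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages drawn from two groups (children and adults).
ages = [2, 3, 4, 5, 6, 31, 33, 34, 36, 38, 39]

print(histogram(ages, 10))  # narrow bins: two separate peaks
print(histogram(ages, 40))  # one wide bin: the two groups merge
```

With the narrow bins the two groups show up as separate peaks with an empty gap between them; with the wide bin that structure disappears.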

6. Normal (Gaussian)

A symmetric, bell‑shaped curve. Many natural phenomena follow this pattern, and it is especially convenient because a large proportion of data falls within known multiples of the standard deviation:

~68 % within ±1 σ
~95 % within ±2 σ
~99.7 % within ±3 σ

When data approximate a normal distribution, many statistical tests and confidence‑interval calculations become straightforward.
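The three percentages can be recovered from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the helper name `share_within` is made up for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"±{k}σ: {share_within(k):.1%}")
```

These proportions hold for an exactly normal distribution; real data will only match them approximately.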

Reference
Kaggle – Titanic: Machine Learning from Disaster

Feel free to adapt the visualisations (histograms, density plots, box‑plots, etc.) to the specific characteristics of your data.
