Scientific Experiment: Can Market Data Identify Wine Type?
Source: Dev.to
Introduction
To address the wine classification challenge, we shift our objective from predicting a continuous score (rating) to identifying the categorical identity of a wine—Red, Rosé, or White—based on its market and temporal characteristics.
Traditional wine classification relies on chemical analysis or label reading. In this experiment we test the hypothesis that three market proxies (price, rating, and vintage year) contain enough latent information to classify a wine into its category.
Hypothesis
- H1: Different wine categories form distinct clusters within the three-dimensional Price‑Rating‑Year space.
- Red wines are expected to be the most distinct due to higher average price points and aging potential compared with Rosé.
Data Preparation
- Consolidated three distinct datasets (Red, Rosé, White) into a master frame of 12,827 observations.
- Preserved a WineType label as the ground truth for supervised learning.
- Standardized the Year column to remove “N.V.” (Non‑Vintage) entries, ensuring the temporal feature is strictly numeric for the classifier.
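The preparation steps above can be sketched with pandas. This is a minimal illustration, not the original notebook: the column names `Price`, `Rating`, and `Year`, the helper `build_master_frame`, and the toy rows are all assumptions, and the sketch assumes that rows with an "N.V." year are dropped outright.

```python
import pandas as pd

def build_master_frame(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-type frames, tag each row with WineType, make Year numeric."""
    tagged = [df.assign(WineType=wine_type) for wine_type, df in frames.items()]
    master = pd.concat(tagged, ignore_index=True)
    # "N.V." (and any other non-numeric value) becomes NaN, then the row is dropped.
    master["Year"] = pd.to_numeric(master["Year"], errors="coerce")
    master = master.dropna(subset=["Year"]).reset_index(drop=True)
    master["Year"] = master["Year"].astype(int)
    return master

# Made-up rows standing in for the three real datasets (hypothetical values).
toy = {
    "Red": pd.DataFrame({"Price": [25.0, 60.0], "Rating": [3.9, 4.3], "Year": ["2015", "N.V."]}),
    "Rosé": pd.DataFrame({"Price": [12.0], "Rating": [3.6], "Year": ["2019"]}),
    "White": pd.DataFrame({"Price": [18.0, 9.5], "Rating": [3.8, 3.5], "Year": ["2018", "2020"]}),
}

master = build_master_frame(toy)
print(master)
```

Coercing with `errors="coerce"` also catches any other malformed year strings, which may or may not match how the original pipeline treated them.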
Exploratory Analysis
Overlap Between Categories
Box‑plot analysis showed that while Red and White wines have overlapping rating distributions, the spread of their prices differs markedly.
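The quantity a box plot visualizes, the interquartile range (IQR), can also be compared numerically per wine type. The sketch below uses made-up rows and assumed column names (`WineType`, `Price`, `Rating`); it only illustrates the comparison and does not reproduce the original figures.

```python
import pandas as pd

# Hypothetical toy data: Red prices are widely spread, White prices are tight,
# while the rating ranges overlap.
toy = pd.DataFrame({
    "WineType": ["Red"] * 4 + ["White"] * 4,
    "Price": [15.0, 30.0, 60.0, 120.0, 10.0, 14.0, 18.0, 22.0],
    "Rating": [3.7, 3.9, 4.0, 4.2, 3.6, 3.8, 4.0, 4.1],
})

def iqr(series: pd.Series) -> float:
    """Interquartile range: the height of the box in a box plot."""
    return series.quantile(0.75) - series.quantile(0.25)

# One IQR per wine type and per feature.
spread = toy.groupby("WineType")[["Price", "Rating"]].agg(iqr)
print(spread)
```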
Correlation
The correlation matrix showed a correlation of -0.33 between Year and Rating: older vintages tend to be rated somewhat higher. The relationship is moderate, so age plays a noticeable, though not dominant, role in how these wines are rated.
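A correlation matrix of this kind can be computed with `DataFrame.corr`, which uses Pearson correlation by default. The values below are invented so that Year and Rating move in opposite directions; they are not the experiment's data, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical rows: newer vintages get slightly lower ratings.
toy = pd.DataFrame({
    "Price": [80.0, 45.0, 30.0, 25.0, 12.0, 15.0],
    "Rating": [4.4, 4.1, 4.0, 3.9, 3.6, 3.7],
    "Year": [2010, 2012, 2015, 2017, 2019, 2020],
})

corr = toy[["Price", "Rating", "Year"]].corr()  # Pearson by default
year_rating = corr.loc["Year", "Rating"]

print(corr.round(2))
print(f"Year vs. Rating: {year_rating:.2f}")
```

A negative coefficient here means that a larger Year (a younger wine) goes with a lower Rating.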
Model
- Algorithm: Random Forest Classifier with 100 decision trees.
- Rationale: Handles non‑linear boundaries in market data (e.g., a $50 White wine may have very different rating characteristics than a $50 Red wine).
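A setup like the one described can be sketched with scikit-learn. Only the algorithm and the 100 trees come from the text. The 80/20 stratified split is an assumption (the test set of 2,566 is roughly 20 % of 12,827), as are the random seeds, the column names, and the synthetic data generated by the hypothetical helper `synthetic_wines`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def synthetic_wines(n: int, wine_type: str, log_price_mean: float) -> pd.DataFrame:
    """Generate made-up rows; the distributions are illustrative only."""
    return pd.DataFrame({
        "Price": np.exp(rng.normal(log_price_mean, 0.6, n)),
        "Rating": rng.normal(3.8, 0.3, n).clip(1, 5),
        "Year": rng.integers(2000, 2021, n),
        "WineType": wine_type,
    })

# Imbalanced classes, loosely mimicking the Red > White > Rosé ordering.
wines = pd.concat(
    [
        synthetic_wines(600, "Red", 3.4),
        synthetic_wines(60, "Rosé", 2.6),
        synthetic_wines(300, "White", 2.9),
    ],
    ignore_index=True,
)

X = wines[["Price", "Rating", "Year"]]
y = wines["WineType"]

# Stratify so the rare Rosé class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Toy accuracy: {model.score(X_test, y_test):.2f}")
```

With a class as small as Rosé, passing `class_weight="balanced"` to `RandomForestClassifier` is one option worth trying; the text does not say whether the original experiment did so.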
Results
Classification Report
| WineType | Precision | Recall | F1‑Score | Support |
|---|---|---|---|---|
| Red | 0.77 | 0.80 | 0.79 | 1,734 |
| Rosé | 0.14 | 0.11 | 0.12 | 79 |
| White | 0.47 | 0.44 | 0.45 | 753 |
| Accuracy | — | — | 0.67 | 2,566 |
| Macro avg | 0.46 | 0.45 | 0.45 | 2,566 |
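The summary rows can be re-derived from the per-class rows as a sanity check. The sketch below copies the numbers from the table; because the table values are rounded to two decimals, the reconstructed accuracy is approximate. The final line adds one figure not in the table: the share of the test set belonging to the largest class, which is the accuracy a classifier would get by always predicting that class.

```python
# Per-class rows copied from the classification report above.
report = {
    "Red": {"precision": 0.77, "recall": 0.80, "f1": 0.79, "support": 1734},
    "Rosé": {"precision": 0.14, "recall": 0.11, "f1": 0.12, "support": 79},
    "White": {"precision": 0.47, "recall": 0.44, "f1": 0.45, "support": 753},
}

total = sum(row["support"] for row in report.values())

def macro(metric: str) -> float:
    """Unweighted mean of a metric across classes."""
    return sum(row[metric] for row in report.values()) / len(report)

macro_precision = macro("precision")
macro_recall = macro("recall")
macro_f1 = macro("f1")

# recall * support approximates the number of correctly classified wines per class.
approx_correct = sum(row["recall"] * row["support"] for row in report.values())
approx_accuracy = approx_correct / total

# Accuracy of a baseline that always predicts the largest class.
majority_share = max(row["support"] for row in report.values()) / total

print(f"test set size:       {total}")
print(f"macro P / R / F1:    {macro_precision:.2f} / {macro_recall:.2f} / {macro_f1:.2f}")
print(f"approx. accuracy:    {approx_accuracy:.2f}")
print(f"majority-class rate: {majority_share:.3f}")
```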
Key Metrics
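- Overall Accuracy: 67 %. This is close to the roughly 68 % (1,734 / 2,566) that a baseline always predicting Red would reach, so accuracy alone overstates performance; the macro‑averaged F1 of 0.45 is the more informative figure.
- Precision: Highest for Red wines (0.77), which likely reflects both their higher price tier and their dominance in the data.
- Recall: Rosé recall was only 0.11. Most Rosé wines were misclassified as Red or White, consistent with their “middle‑ground” market profile.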
Discussion
The model identified Red wines fairly reliably (F1 0.79) but separated White wines poorly (F1 0.45). Rosé proved hardest (F1 0.12), due to its smaller sample size (397 observations) and overlapping price‑rating characteristics.
These findings suggest that market signals alone (price, vintage, and consumer rating) carry some information about a wine’s type: enough to recognize most Reds without chemical analysis, but not enough to separate all three categories reliably.
Implications
This experiment paves the way for a Wine Suggestion Engine that does not merely search for “similar wines,” but understands which category a user is likely seeking based on budget and quality expectations.