Not All RecSys Problems Are Created Equal
Source: Towards Data Science
Candidate Generation
Most recommendation systems start with a candidate‑generation phase that reduces millions of possible items to a manageable set for later re‑ranking.
Key insight: Candidate generation isn’t always the uphill battle it’s made out to be, and it doesn’t necessarily require machine learning.
- Hard‑filter‑driven contexts – When the scope is well defined, simple filters can prune the catalog dramatically.
  - Example: Booking.com – a query like “4‑star hotels in Barcelona, September 12‑15” already narrows millions of properties down to a few hundred based on geography and availability. The real ML challenge is then ranking those hotels with precision.
- Soft‑filter or open‑ended contexts – No hard constraints exist, so the system must rely on semantic intent or past behavior to surface relevant candidates from a massive catalog before any re‑ranking can occur.
  - Examples: Amazon product search, YouTube homepage.
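In the hard‑filter case, a few deterministic predicates can replace any ML‑driven retrieval step. A toy sketch (the `Hotel` record and its fields are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Hotel:
    id: int
    city: str
    stars: int
    available: bool  # availability for the requested dates, assumed precomputed

def hard_filter(catalog, city, min_stars):
    """Prune the catalog with hard constraints before any ML re-ranking."""
    return [h for h in catalog
            if h.city == city and h.stars >= min_stars and h.available]

catalog = [
    Hotel(1, "Barcelona", 4, True),
    Hotel(2, "Barcelona", 3, True),
    Hotel(3, "Madrid", 5, True),
    Hotel(4, "Barcelona", 5, False),
]
candidates = hard_filter(catalog, city="Barcelona", min_stars=4)
```

Only the listings matching every constraint survive; the remaining ML budget can then go entirely into ranking them.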
Re‑ranking Complexity
Re‑ranking can be understood along two orthogonal dimensions (illustrated in the image below):
- Observable outcomes & catalog stability – Determines how strong a baseline you can establish.
- Subjectivity of preferences & learnability – Determines how complex your personalization solution must be.
| Dimension | What it means for your model |
|---|---|
| Observable outcomes | If you have clear, frequent signals (e.g., clicks, purchases), you can build a robust baseline with simpler models. |
| Catalog stability | A stable catalog (e.g., hotels, movies) lets you pre‑compute many features; a rapidly changing catalog (e.g., news articles) often requires dynamic methods. |
| Subjectivity | Highly subjective domains (e.g., music taste) demand richer user representations and possibly deep‑learning architectures. |
| Learnability | When preferences are easy to infer from past behavior, shallow models may suffice; otherwise, you may need more expressive models. |
Visual Summary
(Image by the author: the two‑dimensional framework described above.)
Takeaways
- Most RecSys jobs involve tabular data, gradient‑boosted trees, and a clear separation between candidate generation (often rule‑based) and re‑ranking.
- Industry giants push the envelope with hybrid deep‑learning pipelines, but they operate in domains where hard filters are scarce and the catalog is massive and fluid.
- Use the two‑dimensional framework to assess where your problem sits on the spectrum and choose the appropriate level of model complexity.
Happy modeling!
Observable Outcomes and Catalog Stability
Directly Observable Outcomes
Businesses that can directly observe their most important outcomes have a strong, reliable baseline.
- Example: IKEA knows exactly which sofa sells better because each purchase is a clear signal (e.g., an ESKILSTUNA versus a KIVIK).
- When users “vote with their wallets,” the company can aggregate these signals and rank products with confidence.
“When you can directly observe users voting with their wallets, you have a strong baseline that’s hard to beat.”
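A minimal sketch of such a baseline, using a made‑up purchase log: when every sale is directly observed, the strong baseline is just a popularity leaderboard.

```python
from collections import Counter

# Hypothetical purchase log: one row per observed sale.
purchases = ["KIVIK", "ESKILSTUNA", "KIVIK", "KIVIK", "ESKILSTUNA"]

# Each purchase is a direct "vote with the wallet", so ranking by
# aggregated sales yields a reliable baseline.
leaderboard = [item for item, _ in Counter(purchases).most_common()]
```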
Indirect or Upper‑Funnel Signals
Platforms that cannot see the final conversion rely on weaker, upper‑funnel signals, which introduces position bias:
| Platform | Observable Signal | Limitation |
|---|---|---|
| Tinder / Bumble | Matches | No insight into whether the pair actually “hits it off.” |
| Yelp / Google Maps | Click‑throughs | No guarantee the user visited the restaurant; clicks are placement‑driven. |
| Other engines | Impressions / clicks | High‑visibility items get more interactions regardless of true quality |
- Users may click the first restaurant on Yelp simply because it appears at the top, not because it’s the best choice.
- Without a hard conversion event, you lose a reliable leaderboard and must extract signal from noisy, weak data.
Typical workarounds (e.g., reviews) are often too sparse to serve as primary signals, forcing teams to run endless experiments on ranking heuristics and constantly tune proxies for quality.
High‑Churn Catalogs
Even when outcomes are observable, a high‑churn catalog can prevent the accumulation of enough data to build a robust leaderboard.
- Zillow (real‑estate) and Vinted (second‑hand) listings often have an inventory of 1 and disappear the moment they sell.
- The rapid turnover pushes these platforms toward simplistic sorts such as “newest first” or “lowest price per square meter,” which are far weaker than conversion‑based rankings.
What’s needed?
- Predictive ML models that estimate conversion probability immediately.
- Combine intrinsic item attributes (size, location, price, etc.) with debiased short‑term performance metrics.
- Surface the highest‑potential inventory before it disappears, turning a volatile catalog into a predictable revenue driver.
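A minimal sketch of this idea, with hypothetical listings and a hand‑set scorer standing in for a trained conversion model: fresh inventory is ranked by predicted conversion probability instead of recency.

```python
# Hypothetical listings with intrinsic attributes; in production the score
# would come from a model trained on historical conversions.
listings = [
    {"id": "a", "price_per_m2": 3000, "days_live": 0},
    {"id": "b", "price_per_m2": 2200, "days_live": 5},
    {"id": "c", "price_per_m2": 2600, "days_live": 1},
]

def predicted_conversion(listing):
    # Stand-in for model.predict_proba: cheaper per square meter scores
    # higher in this toy world.
    return 1.0 / listing["price_per_m2"]

# Rank by estimated conversion, not "newest first".
ranked = sorted(listings, key=predicted_conversion, reverse=True)
```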
The Ubiquity of Feature‑Based Models
Regardless of catalog stability or signal strength, the core challenge remains the same: improve upon whatever baseline is available. This is typically achieved by training a machine‑learning (ML) model to predict the probability of engagement or conversion given a specific context.
Gradient‑boosted decision trees (GBDTs) are the pragmatic choice: they are much faster to train and tune than deep‑learning alternatives.
How GBDTs Work
GBDTs predict outcomes from engineered item features (categorical and numerical attributes that describe a product). Even before individual preferences are known, GBDTs can adapt recommendations using basic user features such as:
- Country
- Device type
With just these item and user features, an ML model can already improve upon the baseline—whether that means debiasing a popularity leaderboard or ranking a high‑churn feed.
Example: In fashion e‑commerce, models commonly use location and time of year to surface season‑appropriate items, while simultaneously using country and device to calibrate price points.
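A sketch of this setup using scikit‑learn's `GradientBoostingClassifier`; the features and labels below are synthetic and purely for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic training set: item features (price, review score) plus basic
# user features (country, device) encoded as integers. Labels are
# hypothetical conversion events.
n = 500
X = np.column_stack([
    rng.uniform(10, 200, n),   # item price
    rng.uniform(1, 5, n),      # item review score
    rng.integers(0, 3, n),     # user country (label-encoded)
    rng.integers(0, 2, n),     # user device (0=desktop, 1=mobile)
])
# In this toy world, cheaper and better-reviewed items convert more often.
y = (rng.random(n) < (X[:, 1] / 5) * (1 - X[:, 0] / 400)).astype(int)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

# Score one candidate item for a mobile user in country 1.
p = model.predict_proba([[49.0, 4.5, 1, 1]])[0, 1]
```

The predicted probability `p` is what the re‑ranker sorts by; even with only coarse user features, it already adapts the ordering per context.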
Combating Position Bias
These features let the model separate true quality from mere visibility. By learning which intrinsic attributes drive conversion, the model can correct for the position bias inherent in a popularity baseline. It learns to promote items that perform on merit rather than simply because they were ranked at the top.
Caution: Over‑correcting can demote proven winners too aggressively, potentially degrading the user experience.
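One widely used correction for position bias (not spelled out in the article, but standard practice) is inverse‑propensity weighting: each click is reweighted by the probability that its position was examined at all. A pure‑Python sketch with invented numbers:

```python
# Hypothetical click log: (position, clicked). Items shown higher get
# examined more often, so raw click counts over-credit top positions.
log = [(1, 1), (1, 1), (1, 0), (2, 1), (2, 0), (3, 1)]

# Assumed examination propensities per position (in practice estimated,
# e.g. via a result-swap experiment).
propensity = {1: 1.0, 2: 0.5, 3: 0.25}

# Inverse-propensity-weighted clicks: each click counts 1 / P(examined),
# so clicks earned at rarely-seen positions weigh more.
debiased_clicks = sum(c / propensity[pos] for pos, c in log)
raw_clicks = sum(c for _, c in log)
```

Here the single click at position 3 contributes as much as four clicks at position 1, which is exactly the over‑correction risk the caution above warns about.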
Personalization with Feature‑Based Models
Contrary to popular belief, feature‑based models can also drive personalization—provided the items contain enough semantic information. Platforms such as Booking.com and Yelp accumulate rich descriptions, multiple photos, and user reviews. These can be encoded into semantic embeddings and used as features:
- Compute embeddings for each item.
- Derive similarity scores between a user’s recent interactions and candidate items.
- Feed those similarity scores into the GBDT as additional features.
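The three steps above can be sketched as follows (the embeddings and item names are made up; real embeddings would come from text or image encoders):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 1: hypothetical precomputed semantic embeddings per item.
item_emb = {
    "beach_hotel": np.array([0.9, 0.1]),
    "city_hostel": np.array([0.1, 0.9]),
    "resort":      np.array([0.8, 0.2]),
}

# Step 2: similarity between the user's recent interactions and candidates.
recent = [item_emb["beach_hotel"]]
user_profile = np.mean(recent, axis=0)
similarity_feature = {item: cosine(emb, user_profile)
                      for item, emb in item_emb.items()}

# Step 3: these scores would be appended to each candidate's GBDT
# feature vector alongside the item and user features.
```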
Limitations
- Feature‑based models can recommend based on similarity to recent interactions, but they do not directly learn which items tend to be liked by similar users (as collaborative filtering does).
- To capture that collaborative signal, you must provide item‑similarity scores as input features.
Whether this limitation matters depends on a more fundamental question: how much do users actually disagree? If preferences are highly divergent, the lack of an explicit collaborative signal may become a bottleneck; otherwise, a well‑engineered feature‑based GBDT can be both fast and effective.
Subjectivity
Not all domains are equally personal or controversial. In some, users largely agree on what makes a good product once basic constraints are satisfied. We call these convergent preferences, and they occupy the bottom half of the chart.
Convergent Preferences
- Booking.com – Travelers may have different budgets and location preferences, but once those are revealed through filters and map interactions, the ranking criteria converge:
  - higher prices → bad
  - more amenities → good
  - better reviews → better
- Staples – When a user needs printer paper or AA batteries, brand and price dominate, making preferences remarkably consistent.
Fragmented (Subjective) Preferences
At the opposite extreme – the top half of the chart – are domains defined by highly fragmented taste.
- Spotify – One user’s favorite track is another’s immediate skip.
  - Somewhere in the data is a user on your exact wavelength.
  - Machine learning bridges the gap, turning their discoveries from yesterday into your recommendations for today.
In these cases the value of personalization is enormous, and so is the technical investment required.
The Right Data
Subjective taste is only actionable if you have enough data to observe it.
Many domains involve distinct preferences but lack a feedback loop to capture them. Examples:
| Domain | Challenge | Typical proxy metric |
|---|---|---|
| Niche content platform / new marketplace / B2B | Divergent tastes but sparse interaction data | Limited or noisy signals |
| Yelp (restaurant recommendations) | Preferences are subjective, but only clicks are seen | Click‑through rate (CTR) – can be misleading |
| YouTube (dense behavioral data) | Billions of daily interactions provide rich signal | Watch time, likes, shares – enable deep‑learning‑driven personalization |
When dense behavioral data exist, failing to personalize leaves money on the table. You’ll see large teams coordinating over Jira, cloud bills that require VP approval, and deep‑learning pipelines that become unavoidable. Whether that complexity is justified comes down entirely to the quality and quantity of the data you have.
Know Where You Stand
Understanding where your problem sits on this spectrum is far more valuable than blindly chasing the latest architecture. The industry’s “state‑of‑the‑art” is largely defined by outliers—the tech giants dealing with massive, subjective inventories and dense user data. Their solutions are famous because their problems are extreme, not because they are universally correct.
However, you’ll likely face different constraints in your own work. If your domain is defined by a stable catalog and observable outcomes, you land in the bottom‑left quadrant alongside companies like IKEA and Booking.com. Here, popularity baselines are so strong that the challenge is simply building upon them with machine‑learning models that can drive measurable A/B‑test wins.
If, instead, you face high churn (e.g., Vinted) or weak signals (e.g., Yelp), machine learning becomes a necessity just to keep up.
But that doesn’t mean you need deep learning. That added complexity only truly pays off in territories where preferences are deeply subjective and there’s enough data to model them. We often treat systems like Netflix or Spotify as the gold standard, but they are specialized solutions to rare conditions.
For the rest of us, excellence isn’t about deploying the most complex architecture available; it’s about recognizing the constraints of the terrain and having the confidence to choose the solution that solves your problems.
Images by the author.