Making Ads Count: Using MMoE and Auxiliary Tasks to Better Connect Buyers & Sellers
Source: Etsy Engineering
When buyers search on Etsy, they need to quickly and easily find the perfect item. At the same time, sellers need to be confident their unique products are being seen by the right customers. Our Ads Search ranking model, which is built on a multitask learning foundation, is the critical link in this connection. Recently, we identified an opportunity to drive more meaningful buyer engagement by enhancing our model’s ability to predict purchase intent. We achieved this via a dual-pronged improvement: introducing Multigate Mixture of Experts (MMoE) to our model architecture and leveraging add-to-cart as an auxiliary signal. By providing our downstream systems with more accurate predictions, we improved matching in our marketplace, surfacing more relevant listings for buyers while helping sellers reach customers who are genuinely interested in their products.

Background

When a buyer searches for an item on Etsy, we want them to find exactly what they’re looking for from our inventory of tens of millions of listings. To help them do this, we surface high-quality listings that are relevant to a user’s search query by ranking a small subset of items from a much larger group. This includes advertisements purchased by sellers, which enable them to promote their listings across Etsy placements, including search. While these results are sponsored, the items go through their own ranking process to surface the listings most likely to meet a buyer’s needs. The final result on the search page utilizes our auto-bidding system, which helps decide which listings get shown and at what cost-per-click.

After a user views an ad (known as an “impression”), clicking on the ad is often the first engagement in their purchase journey. However, each subsequent step – from click to cart addition to purchase – represents a progressively smaller subset of users.
Data becomes increasingly sparse further along the purchase journey, which can make it difficult for our model to find a strong signal to learn from. When ranking ads, our machine learning models optimize for click-through rate (CTR) and post-click conversion rate (PCCVR). Clicks and purchases are the primary behaviors we use to predict and drive user engagement, but other actions in the buyer’s purchase journey, such as adding an item to a cart, are important and often predictive of a purchase.
Figure 1. The Ads Search user journey.
Some post-impression actions, such as favoriting an item, are not directly related to a buyer’s purchase journey but can provide valuable signals to enhance our model’s predictive capability. A click can be a strong indicator of a future purchase, but it can also be noisy – meaning it doesn’t always reliably predict purchase intent. For example, a user may click on an ad purely out of curiosity with no intention to buy. These are just a few reasons why user behavior is complex, and we are constantly trying to improve our prediction models to better capture these patterns and recommend the most relevant ads.

Multitask Model Architecture

The Ads Search ranking model is a multitask learning framework containing four major components: feature representation, explicit feature interaction, implicit feature interaction, and task prediction. Figure 2 depicts our model architecture prior to the enhancements this post describes. We start with raw numerical, categorical, and high-cardinality ID features for query, user, and listing entities, which are converted through the feature representation layer – including text embeddings and sequence encodings – to generate dense feature representations. These are concatenated and fed to a Deep and Cross Network (DCN) that learns explicit feature interactions. The explicitly crossed features then pass through a shallow feed-forward network, where the model learns additional implicit feature interactions. Finally, the latent feature representations are fed into task-specific towers to output CTR and PCCVR predictions.
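The flow from dense features through explicit and implicit interactions to task towers can be sketched in a few lines of NumPy. Everything below – layer sizes, number of cross layers, weight initialization – is an illustrative stand-in rather than Etsy’s production model; the cross layer follows the standard DCN form x_{l+1} = x_0 ⊙ (wᵀx_l) + b + x_l.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_layer(x0, xl, w, b):
    # Standard DCN cross layer: x_{l+1} = x0 * (w . xl) + b + xl
    return x0 * (xl @ w)[:, None] + b + xl

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: a batch of 8 candidate listings, 16-dim dense features.
batch, dim, hidden = 8, 16, 32
x0 = rng.normal(size=(batch, dim))  # concatenated dense feature representations

# Explicit feature interactions: two stacked cross layers.
x = x0
for _ in range(2):
    w = rng.normal(size=dim) / np.sqrt(dim)
    x = cross_layer(x0, x, w, np.zeros(dim))

# Implicit feature interactions: a shallow feed-forward layer.
W1 = rng.normal(size=(dim, hidden)) / np.sqrt(dim)
shared = np.maximum(x @ W1, 0.0)  # ReLU

# Task-specific towers emit the two predictions.
W_ctr = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)
W_cvr = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)
ctr = sigmoid(shared @ W_ctr)    # predicted click-through rate
pccvr = sigmoid(shared @ W_cvr)  # predicted post-click conversion rate
```

In this shared bottom form, both towers read from the same `shared` representation, which is the property the MMoE work described later relaxes.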
Figure 2. The initial multitask architecture used for the Ads Search ranking model which has since been upgraded with Multigate Mixture of Experts (MMoE).
Since the CTR and PCCVR predictions are used in downstream ads ranking and auto-bidding systems, we need the predictions to be well-calibrated. After the underlying model is trained, we individually calibrate the CTR and PCCVR towers to probability distributions using Platt scaling layers. As user behaviors vary significantly across ad placements, the model learns distinct parameters for different placements.

Optimizing for Purchase Intent

The multitask architecture we use has several advantages: it helps the model learn shared patterns across tasks, reduces overfitting by allowing the model to learn more generalizable features, and decreases training and serving infrastructure costs by consolidating two separate models into a single model. We originally deployed this multitask ranking model online in July 2023 and had not made major changes to its architecture since then.

In the second half of 2025, the team identified an opportunity to better optimize for meaningful buyer engagement beyond click and purchase signals alone. Our goal was to not only surface listings that resonate with buyers and drive conversions but also encourage them to return to Etsy – creating a positive feedback loop that benefits both buyers and sellers. We hypothesized that optimizing our models for engagement actions that signaled both purchase intent and buyer satisfaction would surface more relevant ads to buyers. Engagement actions include behavior that goes beyond a simple click, such as an add-to-cart or a favorited listing. To more effectively prioritize listings that led to this meaningful engagement for our buyers, we experimented with (1) a model architecture that would better predict purchase intent and (2) additional signals to boost high-quality listings that buyers are more likely to purchase. These two enhancements, in the form of Multigate Mixture of Experts (MMoE) and add-to-cart as an auxiliary task, worked well together in our model to drive a sizable product improvement in Q4 2025.

Enabling Task-Specific Learning with Multigate Mixture of Experts (MMoE)

While the introduction of our initial multitask model was a large success overall, it also had a limitation: since the model learns and shares the same feature representations across tasks, it is not always able to learn task-specific nuances, and this is more pronounced the less related the tasks are. When one task sees improved performance, other tasks can see performance degradation. This behavior is known as the “seesaw phenomenon,” and we encountered it when we first brought the multitask model online.

One solution to this limitation is to add a Multigate Mixture of Experts (MMoE) layer. In this architecture, the model still employs a shared bottom, where the feature representation and interaction layers remain unchanged. However, the MMoE layer introduces two key additional components in place of the shared feed-forward network: experts and gates. Experts are parallel subnetworks that, unlike the shared representations before them in the network, are able to specialize and learn different patterns in the data. Experts are not specifically assigned to tasks; rather, this specialization happens organically during training – some experts learn more about click-specific behavior, others learn more about purchase-specific behavior, and others learn patterns that are important for both actions. Each task has one softmax gating network, which controls how that task combines expert outputs. This allows the CTR and PCCVR tasks to activate different subsets of experts, and the weighted expert information is then sent to the task-specific towers.
Figure 3. A comparison of our shared bottom multitask architecture for the Ads Search ranking model (left) and our MMoE architecture (right).
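A minimal NumPy sketch of the MMoE layer described above, assuming three MLP experts and one softmax gate per task; all sizes, weights, and names are illustrative rather than our production configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

batch, dim, n_experts, expert_dim = 4, 16, 3, 8
shared = rng.normal(size=(batch, dim))  # output of the shared interaction layers

# Experts: parallel subnetworks that all read the same shared input.
expert_W = rng.normal(size=(n_experts, dim, expert_dim)) / np.sqrt(dim)
expert_out = np.stack(
    [np.maximum(shared @ expert_W[e], 0.0) for e in range(n_experts)], axis=1
)  # shape: (batch, n_experts, expert_dim)

def gated_mix(gate_W):
    # One softmax gate per task: weights experts, then mixes their outputs.
    weights = softmax(shared @ gate_W)                     # (batch, n_experts)
    return np.einsum("be,bed->bd", weights, expert_out), weights

# Each task gets its own gate, so it can activate experts differently.
ctr_in, ctr_gate = gated_mix(rng.normal(size=(dim, n_experts)))
pccvr_in, pccvr_gate = gated_mix(rng.normal(size=(dim, n_experts)))
# ctr_in and pccvr_in now feed the respective task-specific towers.
```

Because the two gates are learned separately, the CTR and PCCVR towers receive different mixtures of the same expert pool, which is what lets tasks specialize without abandoning shared learning.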
Tuning the Experts

The main hyperparameters to tune in the MMoE layer are the number of experts, the size of the experts, and the expert type. The number of experts used depends on several factors. One is the tradeoff between having too few experts – which can underfit and fail to capture the distinct patterns needed for each task – and too many – which can overfit to the training data and fail to generalize well. Another is that adding experts increases model capacity, which in turn increases latency and infrastructure costs. Our initial configuration included only multilayer perceptron (MLP)-based experts (i.e., feed-forward neural networks), but we experimented offline with heterogeneous experts and saw an offline lift in purchase and click metrics by introducing a mixture of DCN- and MLP-based experts.

There are other pitfalls when employing an MMoE architecture, which often require additional hyperparameter tuning to resolve. Two common issues are expert utilization (the ability of each task to use multiple experts) and expert specialization (the ability of each expert to learn differently). If experts are not well-utilized, the model has wasted capacity and fails to leverage the full representational power of the architecture. If experts are not specialized, the model effectively reduces to a shared bottom architecture with redundant experts. On the other hand, if experts specialize too strictly and are not shared between tasks, the model loses some of the benefits of multitask transfer learning.

We ran into these issues when training our new model. To build a successful model with MMoE, we needed each task to utilize more experts and to share some experts with the other tasks, so that the tasks could benefit from both specialized and shared learning. We experimented offline with two regularization techniques to solve this: expert dropout and temperature scaling.
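As a toy illustration of the second technique: temperature scaling divides a gate’s logits by a temperature T before the softmax, and with T > 1 the resulting distribution flattens, so gate weight spreads across more experts instead of collapsing onto one. The logits and temperature below are made-up numbers for illustration:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature scaling: divide logits by T before the softmax.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.5])  # one gate's raw scores over 3 experts

sharp = softmax(logits)         # T = 1: almost all weight on expert 0
soft = softmax(logits, T=4.0)   # T > 1: weight spreads over all experts
```

With `T = 1` the first expert dominates the mixture, while at `T = 4` the other experts receive meaningfully more weight, which is exactly the utilization behavior described below.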
In expert dropout, we randomly disable some experts during training to force the model to learn more diverse representations. Expert dropout differs from typical “dropout” in neural networks (where we randomly remove a percentage of connections in a given layer during training) in that we fully remove a number of experts during the forward pass. Using expert dropout, utilization did improve somewhat: each gate was selecting a primary and a secondary expert for each task. Still, we did not see any sharing of experts between the tasks.

We then tried temperature scaling, which modifies the raw logits of the expert gates by dividing them by a temperature (T) to control the smoothness of the resulting probability distribution. By applying this before the softmax function (which converts logits to probabilities) in the gates with T > 1, we softened the distribution, making it more likely that multiple experts are selected. Expert dropout is random and applied only at training time, while temperature scaling is deterministic and applied at both training and inference. Temperature scaling achieved better utilization and specialization than expert dropout, leading us to deploy that approach.

Auxiliary Tasks

Our multitask model already leveraged user click and purchase engagements to train the CTR and PCCVR towers. However, we also have access to other rich user interactions, namely add-to-cart and favorites, that reflect the meaningful buyer engagement described above. Purchases are quite sparse compared to clicks, and one of the major benefits of our original multitask model was that the sparse purchase task could draw additional signal from the more common click action. Our goal with adding auxiliary tasks was to help the model learn more generalizable representations of user engagement, again leaning on actions that are more plentiful than purchases.
We hypothesized that add-to-cart and favorite actions were indications of high purchase intent that would help the model better learn the purchase task without hurting the click task. Since we do not use add-to-cart and favorite predictions for downstream use cases like ranking or bidding, we did not need to calibrate these predictions or serve them online, which made them relatively straightforward to add to our existing model architecture as uncalibrated heads. In the shared bottom version of our model, we simply add one tower for each additional task. In the MMoE version, we add one gate and one tower for each additional task. We experimented with both versions offline and found that MMoE in combination with auxiliary tasks performed better than the shared bottom model with auxiliary tasks. It makes sense that MMoE would outperform the shared bottom when we added more tasks, given the “seesaw phenomenon” described earlier.

Through experimentation, we learned that while add-to-cart as an auxiliary task boosted purchase metrics by bridging the gap between purchases and clicks in terms of relatedness, favorites actually had a negative impact (known as negative transfer) on the model. With further analysis, we found that favorites can be quite noisy and not indicative of high purchase intent. As a result, the version of the model that was ultimately ramped up in production included only add-to-cart as an auxiliary task.
Figure 4. A simplified version of the MMoE piece of our model architecture with add-to-cart (ATC) as an auxiliary task.
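Since the ATC head is used only at training time, adding the auxiliary task amounts to one extra gate and tower plus one more term in the training objective. Below is a minimal sketch of such a combined loss; the per-task weights and all label/prediction values are hypothetical, as the post does not describe our production loss weighting:

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    # Binary cross-entropy for one task head.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy labels and predictions for one batch: click, purchase, add-to-cart.
y_click, p_click = np.array([1, 0, 1, 0]), np.array([0.8, 0.3, 0.6, 0.2])
y_purch, p_purch = np.array([0, 0, 1, 0]), np.array([0.1, 0.1, 0.7, 0.1])
y_atc,   p_atc   = np.array([1, 0, 1, 0]), np.array([0.6, 0.2, 0.8, 0.1])

# The auxiliary ATC term shapes the shared representation during training,
# but the ATC head itself is never calibrated or served online.
task_weights = {"ctr": 1.0, "pccvr": 1.0, "atc": 0.5}  # hypothetical weights
loss = (task_weights["ctr"]   * bce(y_click, p_click)
      + task_weights["pccvr"] * bce(y_purch, p_purch)
      + task_weights["atc"]   * bce(y_atc, p_atc))
```

At serving time, only the CTR and PCCVR heads are evaluated; the ATC term exists solely to inject the denser purchase-intent signal into the shared experts during training.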
Results and Impact

Offline, our new model showed promising improvements in Purchase and Click Area Under the Precision-Recall Curve (PR AUC) and Purchase Area Under the Receiver Operating Characteristic Curve (ROC AUC) metrics. Together, these metrics measure how well our model predicts buyer behavior – PR AUC evaluates its ability to rank relevant listings at the top of the search results, and Purchase ROC AUC evaluates its ability to distinguish between listings buyers will and will not purchase. We saw average increases of 3.5% and 1% to Purchase and Click PR AUCs, respectively, and a 0.5% increase to Purchase ROC AUC – meaningful lifts for an industry-scale ranking system.

When we deployed the model online, we saw three meaningful improvements across the marketplace. First, the model drove purchases, improving the buyer experience by more accurately predicting which listings from our inventory would resonate with buyers. Second, the ads marketplace became more efficient due to an improvement in purchase calibration metrics: more accurate PCCVR predictions served as better inputs to our auto-bidding system, which helped sellers reach buyers who are genuinely interested in their listings. Finally, because the MMoE architecture is more flexible than the shared bottom architecture, we were able to keep the overall model size flat by pruning other parts of the model when adding MMoE. At serving time, inference became less costly, likely due to differences in the distribution of compute across model components.

What’s Next

The MMoE architecture provides the flexibility to add a variety of tasks to our ranking model: it reduces the risk of negative transfer by encouraging some experts to learn task-specific patterns and others to learn shared representations.
After seeing success with the add-to-cart task in our new modeling framework, we plan on experimenting with several additional auxiliary tasks, such as dwell time, to further improve our model’s ability to connect buyers with listings they’ll love. Ads give sellers an additional opportunity to stand out to buyers seeking their unique creations on Etsy. With each improvement to our ranking model, we continue to strengthen the marketplace connection between buyers and sellers – facilitating matches that help our sellers’ businesses grow and buyers discover products that feel made for them.