Migrating Etsy’s database sharding to Vitess
Source: Etsy Engineering
Etsy’s Sharded MySQL Architecture
Etsy has maintained a sharded MySQL architecture since around 2010. This database cluster contains most of Etsy’s online data and is made up of ~1,000 tables distributed across ~1,000 shards. Over the last 16 years it has grown significantly:
- Combined size: > 425 TB of data
- Throughput: ≈ 1.7 million requests per second
How the ORM Works
- Etsy engineers access MySQL data through a proprietary object‑relational mapping (ORM).
- The ORM has a corresponding model for each MySQL table.
- When a table is sharded, its rows are partitioned among different databases (shards).
- Each shard has identical table schemas and contains a distinct subset of all rows.
Shardifier IDs
- Sharded tables require a unique id field on the model, called the “shardifier id”, which determines the shard for each record.
- Shardifier IDs are designed to co‑locate related data to minimize the number of databases needed for a query (e.g., all records for a single shop or user live on the same shard).
- The majority of models use `shop_id` or `user_id` as the shardifier ID, but more than 30 different options are in use.
Legacy Index Database
- Prior to Vitess, the ORM stored mappings of records → shards in a single, unsharded “index” database.
- When a record was created, the ORM randomly chose a shard, stored the mapping in the index DB, and later retrieved it for routing.
Query flow:
- ORM queries the index DB to get the shard mapping.
- ORM sends the actual query to the appropriate shard.
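The legacy query flow above can be sketched as follows. This is a simplified, hypothetical stand-in for Etsy's actual infrastructure: the dict plays the role of the unsharded index database, and the shard count is the article's approximate figure.

```python
import random

NUM_SHARDS = 1000
index_db = {}  # shardifier_id -> shard number (stands in for the unsharded "index" DB)

def create_record(shardifier_id):
    """On creation, the legacy ORM picked a shard at random and
    remembered the mapping in the central index database."""
    if shardifier_id not in index_db:
        index_db[shardifier_id] = random.randrange(NUM_SHARDS)
    return index_db[shardifier_id]

def route_query(shardifier_id):
    """Every later query first consults the index DB for the mapping (step 1),
    then sends the actual query to that shard (step 2)."""
    return index_db[shardifier_id]

# All records sharing a shardifier ID co-locate on one shard:
shard = create_record(shardifier_id=42)
```

Note that the mapping is random, not algorithmic; this detail matters later when choosing a vindex.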
Benefits of the Shard Architecture
- Scalability – horizontal growth across many shards.
- Resilience – a single shard outage affects only ~1/1000 of traffic.
Drawbacks
- Manual, complex scaling – adding new shards took months.
- Single point of failure – the index database could cause a full site outage if it went down.
- Developer friction – product developers had to understand and manage sharding, which was often confusing and toilsome.
As Etsy and the index database grew, incidents related to index‑DB unavailability increased, making these vulnerabilities a high priority to address.
Introducing Vitess
In 2018 we added Vitess (an open‑source abstraction layer for scaling, deploying, and managing large MySQL clusters) to our architecture.
Migration Steps
- Initial integration – the ORM still told Vitess which shard to query, allowing us to validate the new components before moving shard logic into Vitess.
- Exploring vindexes – Vitess “vindexes” define sharding strategies inside Vitess, analogous to our previous shardifier‑id → shard mapping.
We first scaled several unsharded payments tables by creating a new sharded cluster that could use Vitess vindexes out‑of‑the‑box. The success of that pilot paved the way for migrating our existing, in‑house shard infrastructure to Vitess vindexes.
Choosing a Vindex
Vitess ships with many vindexes. We focused on those that calculate the shard algorithmically (e.g., the hash vindex) to eliminate reliance on an external lookup store.
- The ORM’s shard mappings are random, not algorithmic.
- Using a built‑in algorithmic vindex would have required re‑sharding all data, a manual effort that could take years.
Our Approach
- Write custom vindexes that ported our existing shard logic into Vitess, allowing us to test vindexes without moving data.
- Modify the ORM’s shard‑assignment algorithm to match the Vitess hash vindex algorithm.
- After this change, new shard mappings no longer needed an index‑DB lookup.
- Store existing mappings in a read‑only SQLite database:
- Low‑latency reads, tiny footprint, can be copied onto each Vitess server, avoiding external‑DB latency.
- Build a custom SQLite lookup vindex that reads shard information from SQLite.
- Create a hybrid vindex that selects between two vindexes based on a threshold:
- SQLite vindex for IDs ≤ threshold (old records).
- Hash vindex for IDs > threshold (new records).
The threshold was set to the first ID created after we changed the shard‑assignment algorithm.
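The hybrid routing described above can be sketched as below. The threshold value and the in-memory dict standing in for the read-only SQLite database are hypothetical, and `hash_vindex` is a simplification: Vitess's real hash vindex derives a keyspace ID from the uint64 id using a block-cipher-based hash, not SHA-256.

```python
import hashlib

NUM_SHARDS = 1000
THRESHOLD = 5_000_000_000  # hypothetical: first ID created after the algorithm change

# Stand-in for the read-only SQLite database of legacy mappings.
legacy_mappings = {123: 7, 456: 901}

def sqlite_lookup_vindex(record_id):
    """Old records: shard comes from the legacy mapping copied onto each server."""
    return legacy_mappings[record_id]

def hash_vindex(record_id):
    """New records: shard is computed algorithmically (simplified stand-in
    for Vitess's block-cipher-based hash vindex)."""
    digest = hashlib.sha256(record_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def hybrid_vindex(record_id):
    """Select between the two vindexes based on the threshold."""
    if record_id <= THRESHOLD:
        return sqlite_lookup_vindex(record_id)
    return hash_vindex(record_id)
```

The key property is that no data moves: old IDs keep their random placements via the lookup, while new IDs need no lookup at all.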
Introducing Vindexes into Our Environment
With the custom vindexes in place we could add Vitess vindexes without moving any data. The next step was to verify that all existing queries (written for the legacy ORM) would still succeed and return identical results under Vitess‑managed sharding.
Compatibility Concerns
- Vitess requires the shardifier ID to appear in the `WHERE` clause of a SQL query so it can route the query to the correct shard.
- The legacy ORM did not require this; developers often supplied the shardifier ID separately from the SQL statement.
- Consequently, many existing queries did not include the shardifier ID in the `WHERE` clause, leading to potential routing failures.
We addressed this by:
- Auditing all production queries to identify missing shardifier IDs.
- Updating ORM helpers to automatically inject the required `WHERE` clause when possible.
- Providing migration guides for developers to adjust custom queries.
After these steps, the system could safely transition to Vitess‑managed sharding while preserving functional parity with the legacy architecture.
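The kind of helper that injects the shardifier ID can be sketched as below. The function name and the naive string handling are illustrative, not Etsy's actual ORM API; a real implementation would operate on a parsed query, not raw SQL text.

```python
def with_shardifier(sql, shardifier_column, shardifier_id):
    """Ensure the shardifier ID appears in the WHERE clause so Vitess can
    route the query to a single shard (illustrative, naive SQL handling)."""
    predicate = f"{shardifier_column} = {int(shardifier_id)}"
    if shardifier_column in sql:
        return sql  # developer already included the shardifier ID
    if "WHERE" in sql:
        return sql.replace("WHERE", f"WHERE {predicate} AND", 1)
    return f"{sql} WHERE {predicate}"

# A legacy query that supplied shop_id out of band now routes correctly:
with_shardifier("SELECT * FROM listings WHERE state = 'active'", "shop_id", 77)
```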
Scaling Etsy Payments with Vitess – Part 3
Incremental Migration with Vindexes
The ORM contained over a decade of queries with many different ways of accessing MySQL, which made auditing each and every query time‑consuming. Building enough context to test changes to a model was also challenging since each table varied greatly in design and purpose.
Given this, we decided to take an incremental approach and introduce vindexes into our codebase one table at a time. This allowed us to:
- Test on a smaller scale, ensuring each data‑access pattern succeeded with Vitess.
- Monitor how vindexes performed with a specific set of queries.
- Surface incompatibilities gradually, reducing the risk of large‑scale outages.
Because we needed to repeat the migration for hundreds of tables, we prioritized creating a clear and repeatable process. Etsy’s experimentation framework let us gradually ramp up the change by incrementally increasing the percentage of traffic that used Vitess vindexes for each table. This enabled us to:
- Compare query performance when using vindexes vs. ORM shard routing.
- Quickly ramp down to 0 % if any issues appeared.
Minimizing risk by starting small and having a fast rollback path helped mitigate the danger of not testing all queries upfront.
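The per-table ramp can be sketched as a simple percentage gate. The table names, ramp values, and function are hypothetical; Etsy's actual experimentation framework is not public.

```python
import random

# Hypothetical per-table ramp: percentage (0-100) of traffic using vindexes.
vindex_ramp = {"listings": 10, "receipts": 0, "favorites": 100}

def use_vindex(table, rng=random.random):
    """Bucket each query: True routes via Vitess vindexes, False via legacy
    ORM shard routing. Setting a table's ramp to 0 is an instant rollback."""
    return rng() * 100 < vindex_ramp.get(table, 0)
```

Because the gate is evaluated per query, ramping a table down to 0 % takes effect immediately, which is what made the fast rollback path possible.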
Working Through Challenges: Database Transactions
Early on we ran into a challenge with database transactions.
- In‑house shard routing: The ORM treated each shard as a separate database and queried specific shards directly.
- Vitess‑managed sharding: The ORM queried the cluster as if it were a single database; only behind the scenes did Vitess route to the separate shards.
This difference meant the two approaches created different connections to the database, which broke atomicity guarantees (atomicity is per‑connection). To avoid data‑integrity issues, we required that tables written to in a single transaction be ramped onto vindexes simultaneously, ensuring they used the same connection.
In theory this was simple, but in practice it was consequential. A small number of tables—representing the most complex and critical data models such as receipts, listings, and transactions—accounted for a large portion of traffic. While preparing one of them for vindexes, we discovered that 27 models (3 % of tables) generated one‑third of our database traffic, and all were linked via transactions.
Despite our table‑by‑table strategy, these high‑risk changes were tightly coupled, so we collaborated across the company to ramp the 27 models in unison.
Reaping the Benefits: Cross‑Shard Queries
One major advantage of Vitess vindexes is the ability to query across shards. By default, any query that does not contain the shardifier ID will “scatter”: Vitess sends it to all shards in parallel, sorts the results, and returns a single result set.
- Example: a model’s query time dropped from ~2 seconds to ~20 ms using cross‑shard queries.
However, at Etsy’s scale, unintentionally scattering an expensive query to all 1,000 shards could become a problem. To protect against this, we:
- Disabled scatter queries by default in our environment.
- Provided a way for developers to explicitly allow scattering via the ORM when needed.
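The default-off scatter behavior can be sketched as follows; the function and exception names are hypothetical, and Vitess itself exposes equivalent controls at the query-planning level.

```python
NUM_SHARDS = 1000

class ScatterNotAllowed(Exception):
    pass

def target_shards(shard=None, allow_scatter=False):
    """Return the shards a query will hit. A query with no shardifier ID
    would scatter to all shards, which we block unless the developer
    opts in explicitly."""
    if shard is not None:
        return [shard]  # single-shard route via the shardifier ID
    if not allow_scatter:
        raise ScatterNotAllowed("missing shardifier ID; opt in to scatter explicitly")
    return list(range(NUM_SHARDS))  # deliberate fan-out to every shard
```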
Bulk Primary‑Key Lookups
Utilizing scatter queries brought notable improvements to bulk primary‑key lookups for the 27‑table ramp mentioned above.
- Old approach: The ORM took an array of primary keys, batched them by shard, issued a query to each shard, and merged the results.
- New approach: Vitess lets us query multiple shards in a single statement, so we can include all primary keys in one query, eliminating per‑shard batching.
This dramatically reduced the number of queries issued for bulk lookups on several models.
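The two approaches can be contrasted as below. `shard_for` is an illustrative stand-in for the vindex computation, and the SQL strings are schematic; the point is only the difference in query count.

```python
from collections import defaultdict

NUM_SHARDS = 1000

def shard_for(pk):
    return pk % NUM_SHARDS  # illustrative stand-in for the vindex

def bulk_lookup_legacy(primary_keys):
    """Old approach: batch keys by shard, then issue one query per shard."""
    batches = defaultdict(list)
    for pk in primary_keys:
        batches[shard_for(pk)].append(pk)
    return [
        f"/* shard {shard} */ SELECT * FROM t WHERE id IN ({','.join(map(str, pks))})"
        for shard, pks in batches.items()
    ]

def bulk_lookup_vitess(primary_keys):
    """New approach: one statement; Vitess fans it out to the right shards."""
    return [f"SELECT * FROM t WHERE id IN ({','.join(map(str, primary_keys))})"]
```

For keys spanning many shards, the legacy path issues one query per shard touched, while the Vitess path always issues exactly one.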
Figure: Bulk lookup queries for one model during its ramp onto vindexes.
Purple lines denote 1 %, 10 %, 50 %, and 100 % of traffic to the model using vindexes.
Conclusion
Five years, ~2,500 pull requests, and ~6,000 queries later, we have successfully migrated Etsy’s shard management to Vitess vindexes!
Even with a streamlined migration process, replacing the database infrastructure for a codebase of Etsy’s scale and age remained a challenge. As infrastructure engineers, we often had minimal context about the code we were changing and what might break. Nevertheless, through coordination and careful testing across Etsy engineering, we met our goals:
- Scaling operations are no longer manual and can be performed in days rather than months.
- The index database is no longer a single point of failure for our sharded cluster.
- Shard infrastructure is abstracted away from developers, simplifying data modeling and query writing.
- Database performance stayed largely the same, making the change invisible to end users.
- We progressed incrementally, rolled back quickly when needed, and integrated Vitess without a massive data migration.
We are now excited to leverage the new Vitess features unlocked by migrating to vindexes, such as:
- Re‑sharding data.
- Rebalancing data across shards.
- Using Vitess’s MoveTable operations to shard previously unsharded tables.
Swapping out such critical and complex infrastructure with almost no downtime or impact on users was an extremely challenging task, but it was also an immensely gratifying one.
Acknowledgements
This project was a shared accomplishment of the Data Access Platform team: Jessica Chen, Samantha Drago‑Kramer, Hermes Garcia, Sam Kenny, David Leibovic, Kyle Leiby, Benjamin Mariscal, Juan Ortega, Adam Saponara, Wendy Sung, and Stephanie Wu. Thank you to everyone who contributed to the vindex project across engineering.