Inside the feature store powering real-time AI in Dropbox Dash
Source: Dropbox Tech Blog
Dropbox Dash uses AI to understand questions about your files, work chats, and company content, bringing everything together in one place for deeper, more focused work. With tens of thousands of potential work documents to consider, both search and agents rely on a ranking system powered by real‑time machine learning to find the right files fast. At the core of that ranking is our feature store, a system that manages and delivers the data signals (“features”) our models use to predict relevance.
Why a custom feature store?
To help users find exactly what they need, Dash must read between the lines of user behavior across file types, company content, and the messy, fragmented realities of collaboration. It then surfaces the most relevant documents, images, and conversations when—and how—they’re needed.
The feature store is a critical part of how we rank and retrieve the right context across your work. It’s built to:
- Serve features quickly
- Keep pace as user behavior changes
- Let engineers move fast from idea to production
(For more on how feature stores connect to context engineering in Dash, check out our deep dive on context engineering.)
What’s in this post?
We’ll walk through:
- How we built the feature store behind Dash’s ranking system
- Why off‑the‑shelf solutions didn’t fit
- How we designed for speed and scale
- What it takes to keep features fresh as user behavior evolves
Along the way, we’ll share the trade‑offs we made and the lessons that shaped our approach.
Our Goals and Requirements
Building a feature store for Dash required a custom solution rather than an off‑the‑shelf product. The main constraints were:
| Area | Challenge | Why It Matters |
|---|---|---|
| Hybrid Infrastructure | On‑premises low‑latency service mesh ↔ Spark‑native cloud environment | Standard cloud‑native stores couldn’t span both worlds, so we needed a bridge that kept development velocity high. |
| Search Ranking Scale | One query → thousands of feature lookups (behavioral, contextual, real‑time signals) | The store must sustain massive parallel reads while staying within a sub‑100 ms latency budget. |
| Real‑Time Relevance | Signals (e.g., document open, Slack join) must be reflected in the next search within seconds | Requires an ingestion pipeline that can keep up with user‑behavior velocity at scale. |
| Mixed Computation Patterns | Some features are streaming‑first; others need batch processing of historical data | A unified framework must support both efficiently, reducing cognitive load for engineers and shortening the path from idea to production (see the sketch after this table). |
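To make that last row concrete, here's a minimal sketch of the write‑once idea: a single feature definition whose transform is shared by a batch driver and a streaming driver. The types and names below (Event, Feature, the two drivers) are our own illustration, not Dash's actual framework.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a simplified user-activity record; real signals
// (document opens, chat joins) carry far more context.
type Event struct {
	UserID string
	DocID  string
	At     time.Time
}

// Feature pairs a name with a transform from raw events to a
// value. Defining it once lets the batch and streaming drivers
// below share the exact same logic.
type Feature struct {
	Name      string
	Transform func(events []Event) float64
}

// opensLast7d counts how often a document was opened recently.
var opensLast7d = Feature{
	Name: "doc_opens_7d",
	Transform: func(events []Event) float64 {
		cutoff := time.Now().AddDate(0, 0, -7)
		n := 0.0
		for _, e := range events {
			if e.At.After(cutoff) {
				n++
			}
		}
		return n
	},
}

// runBatch recomputes the feature over a full historical slice.
func runBatch(f Feature, history []Event) float64 {
	return f.Transform(history)
}

// runStreaming folds each new event into a rolling window and
// re-applies the same transform.
func runStreaming(f Feature, window *[]Event, e Event) float64 {
	*window = append(*window, e)
	return f.Transform(*window)
}

func main() {
	history := []Event{{UserID: "u1", DocID: "d1", At: time.Now().Add(-time.Hour)}}
	fmt.Println("batch:", runBatch(opensLast7d, history))

	window := []Event{}
	fmt.Println("stream:", runStreaming(opensLast7d, &window, Event{UserID: "u1", DocID: "d1", At: time.Now()}))
}
```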
Summary
- Bridge on‑prem & cloud without sacrificing speed.
- Support massive parallel reads while guaranteeing low latency.
- Reflect new signals in the next search within seconds.
- Support both streaming and batch feature computation in one framework.
Serving Features Fast: From Python to Go
Our original serving layer was written in Python. Even with careful parallelism, the Global Interpreter Lock and multi‑process coordination capped throughput, so we rewrote the service in Go:
| Metric | Python (original) | Go (new) |
|---|---|---|
| Latency | ≤ 100 ms | 25–35 ms |
| Throughput | Thousands of req/s with high CPU | Thousands of req/s with lower CPU |
| Scalability | Limited by GIL & process coordination | Linear scaling with goroutine count |
The Go service now handles thousands of requests per second, adding only ~5–10 ms of processing overhead on top of Dynovault’s client latency and consistently achieving p95 latencies of ~25–35 ms.
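The pattern behind those numbers is Go's cheap concurrency: fan each query's lookups out across goroutines, bounded by a semaphore. The sketch below is illustrative only; fetchFeature stands in for the real Dynovault client, and the simulated latency and concurrency bound are arbitrary examples.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchFeature stands in for a single online-store lookup; the
// real client call, latency, and feature names differ.
func fetchFeature(name string) float64 {
	time.Sleep(2 * time.Millisecond) // simulated store round trip
	return 1.0
}

// fetchAll fans thousands of lookups out across goroutines,
// bounding in-flight requests with a semaphore channel. Because
// goroutines are cheap, concurrency scales with the bound rather
// than with OS threads or worker processes.
func fetchAll(names []string, maxConcurrent int) map[string]float64 {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		sem     = make(chan struct{}, maxConcurrent)
		results = make(map[string]float64, len(names))
	)
	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(n string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			v := fetchFeature(n)
			mu.Lock()
			results[n] = v
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}

func main() {
	names := make([]string, 2000)
	for i := range names {
		names[i] = fmt.Sprintf("feature_%d", i)
	}
	start := time.Now()
	res := fetchAll(names, 256)
	fmt.Printf("fetched %d features in %v\n", len(res), time.Since(start))
}
```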
Impact
- Met Dash’s latency targets reliably.
- Prevented feature serving from becoming a bottleneck as search traffic and feature complexity grew.
Keeping Features Fresh
Speed matters only when the data itself is fresh. Stale features can lower ranking quality and hurt user experience, so our feature store must reflect new signals as soon as possible—often within minutes of user actions.
The Challenge
- Scale – Many of Dash’s most important features depend on large joins, aggregations, and historical context, making fully real‑time computation impractical.
- Balance – We needed an ingestion strategy that kept data fresh and reliable without overwhelming our infrastructure or slowing development.
Our Solution: A Three‑Part Ingestion System
| Ingestion Type | What It Handles | Key Benefits |
|---|---|---|
| Batch ingestion | Complex, high‑volume transformations built on the medallion architecture (raw → refined stages). | • Intelligent change detection → only modified records are written. • Hourly runs that once wrote hundreds of millions of records now complete in < 5 minutes. |
| Streaming ingestion | Fast‑moving signals (e.g., collaboration activity, content interactions). | • Near‑real‑time processing of unbounded datasets (sketched after this table). • Features stay aligned with users’ current actions. |
| Direct writes | Lightweight or pre‑computed features (e.g., relevance scores from an LLM evaluation pipeline). | • Bypass batch pipelines entirely. • Data appears in the online store within seconds. |
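As a rough illustration of the streaming path, a consumer can fold each signal into the online store the moment it arrives. The Signal type, the in‑memory store, and the channel standing in for the event stream are all simplifications; the real pipeline processes unbounded event streams and writes to Dynovault.

```go
package main

import (
	"fmt"
	"sync"
)

// Signal is one fast-moving user action, e.g., a document open
// or a chat join, reduced to the minimum needed here.
type Signal struct {
	UserID  string
	Feature string
	Delta   float64
}

// OnlineStore stands in for the low-latency serving store; the
// real online store (Dynovault) is a remote service, not a map.
type OnlineStore struct {
	mu   sync.RWMutex
	vals map[string]float64
}

func (s *OnlineStore) Apply(sig Signal) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.vals[sig.UserID+"/"+sig.Feature] += sig.Delta
}

func (s *OnlineStore) Get(key string) float64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.vals[key]
}

func main() {
	store := &OnlineStore{vals: map[string]float64{}}
	stream := make(chan Signal)
	done := make(chan struct{})

	// Consumer: every signal is folded into the online store as
	// it arrives, so the next search already reflects it.
	go func() {
		for sig := range stream {
			store.Apply(sig)
		}
		close(done)
	}()

	stream <- Signal{UserID: "u1", Feature: "doc_opens", Delta: 1}
	close(stream)
	<-done
	fmt.Println(store.Get("u1/doc_opens")) // 1
}
```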
Outcome
By combining these three ingestion paths, Dash can keep feature values fresh without forcing all computation onto a single pipeline. This preserves ranking quality while scaling to real‑world usage.
What We Learned
Building a feature store at Dropbox scale reinforced several hard‑earned lessons about systems design.
Serving‑side insights
- Python’s concurrency model became a limiting factor for high‑throughput, mixed CPU‑I/O workloads.
- Even with careful parallelism, the Global Interpreter Lock (GIL) capped performance for CPU‑bound work such as JSON parsing.
- Switching to multiple processes introduced new coordination bottlenecks.
- Rewriting the serving layer in Go removed those trade‑offs and let us scale concurrency predictably.
Data‑side insights
- Infrastructure changes mattered, but understanding access patterns mattered just as much.
- Only 1–5% of feature values change in a typical 15‑minute window.
- Exploiting this fact dramatically reduced write volume and ingestion time, turning hour‑long batch cycles into five‑minute updates and improving freshness without increasing system load (see the sketch below).
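That change‑detection step is easy to sketch. Assuming a hash‑comparison scheme (the row shape and hashing choice below are ours, not necessarily what our pipelines use), each batch is diffed against the previous snapshot and only rows whose hash moved are written:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Row is one entity's feature values, serialized for hashing.
type Row struct {
	Key   string
	Value string // serialized feature payload
}

func hash(r Row) [32]byte {
	return sha256.Sum256([]byte(r.Value))
}

// changedRows compares the new batch against the previous
// snapshot's hashes and returns only the rows that differ, so
// the online store sees ~1-5% of rows instead of all of them.
func changedRows(batch []Row, prev map[string][32]byte) []Row {
	var out []Row
	for _, r := range batch {
		h := hash(r)
		if old, ok := prev[r.Key]; !ok || old != h {
			out = append(out, r)
			prev[r.Key] = h
		}
	}
	return out
}

func main() {
	prev := map[string][32]byte{}
	first := []Row{{"u1", "opens=3"}, {"u2", "opens=7"}}
	fmt.Println(len(changedRows(first, prev))) // 2: everything is new

	second := []Row{{"u1", "opens=3"}, {"u2", "opens=8"}}
	fmt.Println(len(changedRows(second, prev))) // 1: only u2 changed
}
```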
Hybrid architecture
- Feast – orchestration & consistency
- Spark – large‑scale computation
- Dynovault – low‑latency online serving
Rather than rely on a single vendor solution, we tune each layer to its strengths while keeping training and serving aligned.
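One way to picture the layering is as a set of narrow contracts, so each layer can be tuned or swapped independently. The interfaces below are purely illustrative, not Dash's real APIs: the offline store stands in for the Spark layer, the online store for Dynovault, and the feature list for Feast's registry.

```go
package main

import "fmt"

// FeatureVector maps feature names to values for one entity.
type FeatureVector map[string]float64

// OfflineStore computes features at scale (the Spark layer).
type OfflineStore interface {
	Materialize(feature string) (map[string]FeatureVector, error) // entity -> values
}

// OnlineStore serves features at low latency (the Dynovault layer).
type OnlineStore interface {
	Write(entity string, v FeatureVector) error
}

// Sync pushes freshly materialized offline values into the
// online store, keeping training and serving data aligned.
func Sync(features []string, off OfflineStore, on OnlineStore) error {
	for _, f := range features {
		byEntity, err := off.Materialize(f)
		if err != nil {
			return err
		}
		for entity, v := range byEntity {
			if err := on.Write(entity, v); err != nil {
				return err
			}
		}
	}
	return nil
}

// In-memory stubs so the sketch runs end to end.
type memOffline struct{}

func (memOffline) Materialize(f string) (map[string]FeatureVector, error) {
	return map[string]FeatureVector{"u1": {f: 1.0}}, nil
}

type memOnline struct{ data map[string]FeatureVector }

func (m memOnline) Write(entity string, v FeatureVector) error {
	m.data[entity] = v
	return nil
}

func main() {
	on := memOnline{data: map[string]FeatureVector{}}
	if err := Sync([]string{"doc_opens_7d"}, memOffline{}, on); err != nil {
		panic(err)
	}
	fmt.Println(on.data["u1"])
}
```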
Takeaway
The work underscored the value of a middle path between building everything from scratch and adopting off‑the‑shelf systems wholesale. By combining open‑source foundations with internal infrastructure and tailoring them to real constraints, we built a feature store that meets today’s needs and can evolve with us in the future.
Acknowledgments
Special thanks to all current and past members of the AI/ML Platform and Data Platform teams for their contributions, as well as our machine‑learning engineers who spin up the magic with the tooling we build.
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.