Inside the feature store powering real-time AI in Dropbox Dash

Published: December 18, 2025 at 01:00 PM EST
5 min read

Source: Dropbox Tech Blog

Dropbox Dash uses AI to understand questions about your files, work chats, and company content, bringing everything together in one place for deeper, more focused work. With tens of thousands of potential work documents to consider, both search and agents rely on a ranking system powered by real‑time machine learning to find the right files fast. At the core of that ranking is our feature store, a system that manages and delivers the data signals (“features”) our models use to predict relevance.

Why a custom feature store?

To help users find exactly what they need, Dash must read between the lines of user behavior across file types, company content, and the messy, fragmented realities of collaboration. It then surfaces the most relevant documents, images, and conversations when—and how—they’re needed.

The feature store is a critical part of how we rank and retrieve the right context across your work. It’s built to:

  • Serve features quickly
  • Keep pace as user behavior changes
  • Let engineers move fast from idea to production

(For more on how feature stores connect to context engineering in Dash, check out our deep dive on context engineering.)

What’s in this post?

We’ll walk through:

  1. How we built the feature store behind Dash’s ranking system
  2. Why off‑the‑shelf solutions didn’t fit
  3. How we designed for speed and scale
  4. What it takes to keep features fresh as user behavior evolves

Along the way, we’ll share the trade‑offs we made and the lessons that shaped our approach.

Our Goals and Requirements

Building a feature store for Dash required a custom solution rather than an off‑the‑shelf product. The main constraints were:

| Area | Challenge | Why It Matters |
|---|---|---|
| Hybrid infrastructure | An on-premises, low-latency service mesh alongside a Spark-native cloud environment | Standard cloud-native stores couldn't span both worlds, so we needed a bridge that kept development velocity high. |
| Search ranking scale | One query triggers thousands of feature lookups (behavioral, contextual, and real-time signals) | The store must sustain massive parallel reads while staying within sub-100 ms latency budgets. |
| Real-time relevance | Signals (e.g., a document open or a Slack join) must be reflected in the next search within seconds | Requires an ingestion pipeline that can keep up with user-behavior velocity at scale. |
| Mixed computation patterns | Some features are streaming-first; others need batch processing of historical data | A unified framework must support both efficiently, reducing cognitive load for engineers and shortening the path from idea to production. |

Summary

  • Bridge on‑prem & cloud without sacrificing speed.
  • Support massive parallel reads while guaranteeing low latency.

Serving Features Fast

Our first serving layer was written in Python. To hit Dash's latency budgets as search traffic and feature complexity grew, we rewrote it in Go on top of Dynovault, our low-latency online store:

| Metric | Python (original) | Go (new) |
|---|---|---|
| Latency | ≤ 100 ms | 25–35 ms |
| Throughput | Thousands of req/s with high CPU | Thousands of req/s with lower CPU |
| Scalability | Limited by GIL and process coordination | Linear scaling with goroutine count |

The Go service now handles thousands of requests per second, adding only ~5–10 ms of processing overhead on top of Dynovault’s client latency and consistently achieving p95 latencies of ~25–35 ms.
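
Much of that headroom comes from fanning each request's feature lookups out across goroutines. The sketch below shows the pattern only; featureValue, fetchFeature, and the feature names are illustrative stand-ins, not our actual service code.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// featureValue is an illustrative stand-in for a decoded feature record.
type featureValue struct {
	Name  string
	Value float64
}

// fetchFeature simulates one online-store lookup. In a real service this
// would be an RPC to the online store; here it just sleeps briefly.
func fetchFeature(ctx context.Context, name string) (featureValue, error) {
	select {
	case <-time.After(5 * time.Millisecond):
		return featureValue{Name: name, Value: 0.42}, nil
	case <-ctx.Done():
		return featureValue{}, ctx.Err()
	}
}

// fetchAll fans out one goroutine per feature and gathers the results,
// so the lookups for a single ranking request overlap instead of queuing.
func fetchAll(ctx context.Context, names []string) ([]featureValue, error) {
	results := make([]featureValue, len(names))
	errs := make([]error, len(names))
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			results[i], errs[i] = fetchFeature(ctx, name)
		}(i, name)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	feats, err := fetchAll(ctx, []string{"doc_open_count_7d", "last_shared_at", "query_click_rate"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(feats), "features fetched")
}
```

Because goroutines are multiplexed across OS threads without a global interpreter lock, the same fan-out also lets CPU-bound steps such as JSON decoding run in parallel.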

Impact

  • Met Dash’s latency targets reliably.
  • Prevented feature serving from becoming a bottleneck as search traffic and feature complexity grew.

Keeping Features Fresh

Speed matters only when the data itself is fresh. Stale features can lower ranking quality and hurt user experience, so our feature store must reflect new signals as soon as possible—often within minutes of user actions.

The Challenge

  • Scale – Many of Dash’s most important features depend on large joins, aggregations, and historical context, making fully real‑time computation impractical.
  • Balance – We needed an ingestion strategy that kept data fresh and reliable without overwhelming our infrastructure or slowing development.

Our Solution: A Three‑Part Ingestion System

| Ingestion Type | What It Handles | Key Benefits |
|---|---|---|
| Batch ingestion | Complex, high-volume transformations built on the medallion architecture (raw → refined stages) | Intelligent change detection means only modified records are written, cutting write volume from hundreds of millions of records per hour and turning hour-long batch cycles into updates that finish in under 5 minutes. |
| Streaming ingestion | Fast-moving signals (e.g., collaboration activity, content interactions) | Near-real-time processing of unbounded datasets; features stay aligned with users' current actions. |
| Direct writes | Lightweight or pre-computed features (e.g., relevance scores from an LLM evaluation pipeline) | Bypass batch pipelines entirely; data appears in the online store within seconds. |

Outcome

By combining these three ingestion paths, Dash can keep feature values fresh without forcing all computation onto a single pipeline. This preserves ranking quality while scaling to real‑world usage.
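
To make the batch path's "only write what changed" idea concrete, here is a minimal sketch of snapshot diffing, assuming hypothetical featureRow records keyed by entity and feature name. The production batch path runs on Spark, so this is purely illustrative.

```go
package main

import "fmt"

// featureRow is an illustrative (entity, feature) → value record.
type featureRow struct {
	EntityID string
	Feature  string
	Value    float64
}

type key struct{ entityID, feature string }

// diffSnapshot returns only the rows whose values are new or changed
// relative to the previous snapshot. Since only a few percent of values
// change per cycle, the resulting write set is far smaller than the
// full snapshot.
func diffSnapshot(prev map[key]float64, current []featureRow) []featureRow {
	var changed []featureRow
	for _, row := range current {
		k := key{row.EntityID, row.Feature}
		old, seen := prev[k]
		if !seen || old != row.Value {
			changed = append(changed, row)
		}
	}
	return changed
}

func main() {
	prev := map[key]float64{
		{"doc:1", "open_count_7d"}: 3,
		{"doc:2", "open_count_7d"}: 7,
	}
	current := []featureRow{
		{"doc:1", "open_count_7d", 3}, // unchanged, skipped
		{"doc:2", "open_count_7d", 8}, // changed, written
		{"doc:3", "open_count_7d", 1}, // new, written
	}
	fmt.Println("rows to write:", diffSnapshot(prev, current))
}
```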

What We Learned

Building a feature store at Dropbox scale reinforced several hard‑earned lessons about systems design.

Serving‑side insights

  • Python’s concurrency model became a limiting factor for high‑throughput, mixed CPU‑I/O workloads.
  • Even with careful parallelism, the Global Interpreter Lock (GIL) capped performance for CPU‑bound work such as JSON parsing.
  • Switching to multiple processes introduced new coordination bottlenecks.
  • Rewriting the serving layer in Go removed those trade‑offs and let us scale concurrency predictably.

Data‑side insights

  • Infrastructure changes mattered, but understanding access patterns mattered just as much.
  • Only 1–5 % of feature values change in a typical 15‑minute window.
  • Exploiting this fact dramatically reduced write volume and ingestion time, turning hour‑long batch cycles into five‑minute updates—improving freshness without increasing system load.

Hybrid architecture

  • Feast – orchestration & consistency
  • Spark – large‑scale computation
  • Dynovault – low‑latency online serving

Rather than relying on a single vendor solution, this approach lets us tune each layer to its strengths while keeping training and serving aligned.
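
As a rough illustration of that layering on the serving side, the sketch below hides the online store behind a small interface so ranking code never depends on a specific storage engine. OnlineStore, inMemoryStore, and the feature names are assumptions for the example, not Dropbox's real API.

```go
package main

import (
	"context"
	"fmt"
)

// OnlineStore is a hypothetical abstraction over the low-latency store
// that backs feature serving. The serving layer depends only on this
// interface, so the engine underneath can be tuned or swapped without
// touching ranking code.
type OnlineStore interface {
	Get(ctx context.Context, entityID string, features []string) (map[string]float64, error)
}

// inMemoryStore is a stand-in implementation for this sketch; in
// production the implementation would wrap the real online-store client.
type inMemoryStore struct {
	data map[string]map[string]float64
}

func (s *inMemoryStore) Get(ctx context.Context, entityID string, features []string) (map[string]float64, error) {
	out := make(map[string]float64, len(features))
	for _, f := range features {
		if v, ok := s.data[entityID][f]; ok {
			out[f] = v
		}
	}
	return out, nil
}

func main() {
	var store OnlineStore = &inMemoryStore{
		data: map[string]map[string]float64{
			"doc:42": {"open_count_7d": 5, "click_rate": 0.31},
		},
	}
	vals, _ := store.Get(context.Background(), "doc:42", []string{"open_count_7d", "click_rate"})
	fmt.Println(vals)
}
```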

Takeaway

The work underscored the value of a middle path between building everything from scratch and adopting off‑the‑shelf systems wholesale. By combining open‑source foundations with internal infrastructure and tailoring them to real constraints, we built a feature store that meets today’s needs and can evolve with us in the future.

Acknowledgments

Special thanks to all current and past members of the AI/ML Platform and Data Platform teams for their contributions, as well as our machine‑learning engineers who spin up the magic with the tooling we build.

If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.
