Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket

Published: May 1, 2026 at 10:06 PM EDT

Source: Google Developers Blog

The challenge: Keeping GPUs fed

As model sizes grow, data loading and checkpointing often become the primary bottlenecks in training.

Data‑preparation activities for training models involve fetching and processing terabytes to petabytes of data from remote storage systems such as object stores.

Standard REST‑based storage access can struggle to meet the extreme throughput and low‑latency requirements of modern distributed training, leading to wasted GPU resources.

Rapid Bucket: Rapid Storage via Bi‑Di gRPC

Our new Rapid Bucket solution provides high‑performance object storage in dedicated zonal buckets. By bypassing legacy REST APIs in favor of persistent bidirectional gRPC streams, we bring the power of Colossus, the distributed file system that powers YouTube and Google Search, and its stateful protocols directly to the PyTorch ecosystem.

Key performance metrics

  • Extreme throughput: 15+ TiB/s aggregate throughput
  • Ultra‑low latency: < 1 ms for random reads and append writes
  • High QPS: 20M+ queries per second

fsspec: PyTorch’s Pythonic File Interface

fsspec is the pervasive Pythonic interface for file systems in the PyTorch ecosystem. It is already used for:

  • Data preparation: Dask, Pandas, Hugging Face Datasets, Ray Data
  • Checkpoints: PyTorch Lightning, torch.distributed, Weights & Biases
  • Inference: vLLM
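
Because these libraries resolve remote URLs through fsspec, a gs:// path is often all that is needed. A minimal sketch using pandas (the object path is a placeholder):

import pandas as pd

# pandas dispatches "gs://" URLs to gcsfs via fsspec, so no
# GCS-specific code is required. The path below is a placeholder.
df = pd.read_parquet("gs://my-zonal-rapid-bucket/data/train.parquet")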


fsspec has backend implementations for many different storage systems, all exposed through a single interface, which eliminates the need to write backend-specific code. By integrating Rapid Storage with gcsfs (the Google Cloud Storage implementation of fsspec), developers get the speed gains of Rapid with a simple fsspec.open() call, with no code rewrites required.
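
In practice the call is identical regardless of backend. A minimal sketch (the object path is a placeholder):

import fsspec

# fsspec routes the "gs://" protocol to gcsfs; when the target is a
# Rapid Bucket, gcsfs can upgrade the transport to bidirectional gRPC
# without any change to this code. The path below is a placeholder.
with fsspec.open("gs://my-zonal-rapid-bucket/data/shard-0000.bin", "rb") as f:
    payload = f.read()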

Under the Hood: Leveraging Colossus

To achieve a performance boost with Rapid Buckets, we optimized the entire data path:

  • Stateful gRPC‑based streaming – Bi‑directional gRPC keeps the connection alive, eliminating per‑operation overhead such as connection setup, authentication, and metadata exchange. This enables efficient, stateful data exchange for multiple reads or appends within a single object.
  • Direct path – Google Cloud Storage (GCS) Rapid Buckets use direct connectivity for the gRPC bi‑directional streaming APIs (BidiReadObject, BidiWriteObject). By connecting clients directly to the underlying Colossus files, we achieve maximum performance. Non‑Rapid traffic typically traverses additional network hops, resulting in higher read/write latency. See the blog post How the Colossus stateful protocol benefits Rapid storage for more details.
  • Zonal co‑location – Storing data in the same zone as your compute resources (e.g., us-central1-a) eliminates cross‑zone latency. Prior to Rapid Buckets, a regional bucket and the compute (or accelerators) could reside in different zones, incurring extra latency.
  • No‑op user migration – The existing fsspec API is preserved while all internal traffic for Rapid Buckets is upgraded from HTTP to bi‑directional gRPC. By adding bucket‑type auto‑detection to gcsfs, PyTorch and other fsspec clients automatically use Rapid Buckets with zero manual configuration (a conceptual sketch of this auto‑detection follows this list).
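
The sketch below is purely conceptual and is not the actual gcsfs implementation; it only illustrates the auto-detection idea of checking a bucket's metadata once, caching the answer, and picking a transport. The metadata field name is an assumption.

import gcsfs

# Conceptual sketch only -- NOT the real gcsfs internals.
_bucket_type_cache = {}

def pick_transport(fs: gcsfs.GCSFileSystem, bucket: str) -> str:
    """Return 'grpc-bidi' for zonal Rapid buckets, 'http' otherwise."""
    if bucket not in _bucket_type_cache:
        info = fs.info(bucket)  # one metadata lookup, then cached
        # 'locationType' == 'zone' is an assumed marker for Rapid buckets.
        _bucket_type_cache[bucket] = (
            "grpc-bidi" if info.get("locationType") == "zone" else "http"
        )
    return _bucket_type_cache[bucket]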

Results

A dataset of 134 M rows (≈ 451 GB) was loaded onto 16 GKE nodes, each equipped with eight A4 GPUs. Training ran for 100 steps, with a checkpoint every 25 steps using PyTorch Lightning.
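
A sketch of the checkpointing cadence described above, using standard PyTorch Lightning options (the model, dataloader, and bucket path are placeholders):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Checkpoint every 25 steps of a 100-step run, writing through
# fsspec/gcsfs to a GCS path. All names are placeholders.
checkpoint_cb = ModelCheckpoint(
    dirpath="gs://my-zonal-rapid-bucket/checkpoints",
    every_n_train_steps=25,
)
trainer = pl.Trainer(max_steps=100, callbacks=[checkpoint_cb])
# trainer.fit(model, train_dataloader)  # supplied by the user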

We benchmarked the total training time, including data‑load latency, and observed a 23% performance gain when using a Rapid Bucket versus a standard regional bucket.

Microbenchmarking

Microbenchmarks—measuring the performance of individual building blocks such as I/O or resource usage—confirm these gains:

  • Reads (sequential & random): 4.8× throughput improvement
  • Writes: 2.8× throughput improvement

Tests were run with 16 MB I/O sizes across 48 processes.
More details are available in the GCSFS performance benchmarks.
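
A minimal single-process sketch of this kind of read microbenchmark (the published numbers used 48 processes; the object path is a placeholder):

import time
import gcsfs

fs = gcsfs.GCSFileSystem()
path = "my-zonal-rapid-bucket/data/shard-0000.bin"  # placeholder object

CHUNK = 16 * 1024 * 1024  # 16 MB I/O size, matching the benchmark setup
total = 0
start = time.perf_counter()
with fs.open(path, "rb", block_size=CHUNK) as f:
    while chunk := f.read(CHUNK):
        total += len(chunk)
elapsed = time.perf_counter() - start
print(f"Read {total} bytes at {total / elapsed / 2**30:.2f} GiB/s")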

[Figure: Rapid Bucket performance comparison]

Get Started

Getting started with GCSFS on Rapid Bucket is easy. Your existing code and scripts remain the same—you only need to point to a Rapid Bucket to take advantage of the performance boost.

Installation

Rapid Bucket integration is available from version 2026.3.0.

pip install "gcsfs>=2026.3.0"

Code Sample: Read/Write with GCS Rapid

import gcsfs

# Initialize the filesystem
fs = gcsfs.GCSFileSystem()

# Writing to a Rapid bucket
with fs.open('my-zonal-rapid-bucket/data/checkpoint.pt', 'wb') as f:
    f.write(b"model data...")

# Appending to an existing object (Native Rapid feature)
with fs.open('my-zonal-rapid-bucket/data/checkpoint.pt', 'ab') as f:
    f.write(b"appended data...")
