Apache Gravitino Introduction

Published: (January 16, 2026 at 06:05 PM EST)
Source: Dev.to

Author: shaofeng shi
Last Updated: 2025‑12‑29

Overview

In the era of big data, enterprises often need to manage metadata from multi‑cloud, multi‑domain, heterogeneous data sources (e.g., Apache Hive, MySQL, PostgreSQL, Iceberg, Lance, S3, and GCS). With the rapid adoption of AI model training and inference, massive multimodal datasets and model metadata also need unified management.

Traditional approaches manage metadata separately for each source, which increases operational complexity and creates data silos. Apache Gravitino—a high‑performance, geographically distributed federated metadata lake—offers a unified solution for managing multi‑source metadata.

Project History

Milestone | Date
Initiated & founded by Datastrato Inc. | (not given)
Open‑sourced | 2023
Donated to Apache Incubator | 2024
Graduated to Apache Top‑Level Project | May 2025

Gravitino is deployed in production at companies such as Xiaomi, Tencent, Zhihu, Uber, and Pinterest.

What is Apache Gravitino?

Apache Gravitino is a high‑performance, geographically distributed, federated metadata‑lake management system that provides a unified data & AI asset management platform. Its key capabilities:

  • Unified Metadata Management – Unified models & APIs for diverse data sources.
  • Direct Metadata Management – Changes are reflected in real time in the underlying systems.
  • Multi‑Engine Support – Works with Trino, Spark, Flink, etc.
  • Geographically Distributed Deployment – Supports cross‑region, cross‑cloud architectures.
  • AI Asset Management – Manages both data assets and AI/ML model metadata.

Core Concepts

Concept | Description
Metalake | Container/tenant for metadata; typically one organization ↔ one metalake.
Catalog | Collection of metadata from a specific source.
Schema | Second‑level namespace (equivalent to a database schema).
Table | Bottom‑level object representing a concrete data table.
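
To make the hierarchy concrete, here is a minimal Python sketch of the four levels as plain dataclasses. It is only an illustration of the nesting described in the table, not Gravitino's own object model; the class layout, the `provider` label, and names such as `demo_lake` and `hive_prod` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str


@dataclass
class Schema:
    name: str
    tables: dict[str, Table] = field(default_factory=dict)


@dataclass
class Catalog:
    name: str
    provider: str  # illustrative label for the underlying source, e.g. "hive"
    schemas: dict[str, Schema] = field(default_factory=dict)


@dataclass
class Metalake:
    name: str
    catalogs: dict[str, Catalog] = field(default_factory=dict)

    def qualified_names(self) -> list[str]:
        """Return every table as 'metalake.catalog.schema.table'."""
        return [
            f"{self.name}.{c.name}.{s.name}.{t.name}"
            for c in self.catalogs.values()
            for s in c.schemas.values()
            for t in s.tables.values()
        ]


# Hypothetical example data: one metalake, one Hive catalog, one table.
lake = Metalake("demo_lake")
hive = Catalog("hive_prod", provider="hive")
sales = Schema("sales")
sales.tables["orders"] = Table("orders")
hive.schemas["sales"] = sales
lake.catalogs["hive_prod"] = hive

print(lake.qualified_names())  # ['demo_lake.hive_prod.sales.orders']
```

The point of the sketch is that every table can be addressed by a single four-part name, regardless of which source system actually stores it.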

Supported Data Sources

Category | Types
Relational Databases | MySQL, PostgreSQL, OceanBase, Apache Doris, StarRocks, …
Big‑Data Storage | Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon, Delta Lake (in development)
Message Queues | Apache Kafka
File Systems | HDFS, S3, GCS, Azure Blob Storage, Alibaba Cloud OSS
AI/ML Data Formats | Lance (columnar format optimized for AI/ML workloads)

REST API Services

Gravitino Core REST API

  • Full CRUD for all metadata objects (Metalake, Catalog, Schema, Table, …)
  • User, group, role, and permission management
  • Advanced features: tags, policies, models, etc.
  • Authentication: Simple, OAuth2, Kerberos
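
As a rough illustration of what a "create" call against the core REST API might look like, the sketch below builds (but does not send) an HTTP request using only the Python standard library. The server address `http://localhost:8090`, the `/api/metalakes` path, the versioned `Accept` header, and the request body shape are assumptions based on a default local setup; check them against the API reference for your version before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8090"            # assumed local server address
ACCEPT = "application/vnd.gravitino.v1+json"  # assumed versioned media type


def build_create_metalake_request(name: str, comment: str = "") -> urllib.request.Request:
    """Build a POST request that would create a metalake; nothing is sent here."""
    body = json.dumps({"name": name, "comment": comment, "properties": {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/metalakes",
        data=body,
        method="POST",
        headers={"Accept": ACCEPT, "Content-Type": "application/json"},
    )


req = build_create_metalake_request("demo_lake", "example metalake")
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before touching a real server.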

Iceberg REST Service

  • Implements Apache Iceberg REST API spec
  • Supports Hive, JDBC, and custom back‑ends as storage
  • Table management & query capabilities across S3, HDFS, GCS, Azure, …

Lance REST Service

  • Implements Lance REST API spec
  • Optimized for AI/ML workloads (vector data storage & retrieval)
  • Namespace & table management

Direct Metadata Management

  • Real‑time Synchronization – Immediate propagation of metadata changes to underlying sources.
  • Bidirectional Synchronization – Sync both from Gravitino → source and source → Gravitino.
  • Transaction Support – Guarantees atomicity & consistency of metadata ops.
  • Version Management – Metadata version control & historical tracking.

Unified Permission Management

Feature | Description
RBAC | Flexible permission handling for users, groups, and roles.
Ownership Model | Every metadata object has a clear owner.
Permission Inheritance | Hierarchical inheritance from Metalake down to tables.
Fine‑grained Control | Multi‑level permissions (Metalake → Catalog → Schema → Table).
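
Hierarchical inheritance can be pictured as "a privilege granted on a parent also applies to everything beneath it". The sketch below is a deliberately simplified model of that idea, not Gravitino's actual authorization logic (it ignores roles, groups, ownership, and any deny rules). The `Grants` structure, the function name, and the privilege strings are illustrative assumptions.

```python
# Grants are keyed by (principal, object path); a path is a tuple such as
# ("demo_lake", "hive_prod", "sales", "orders").
Grants = dict[tuple[str, tuple[str, ...]], set[str]]


def effective_privileges(grants: Grants, principal: str, path: tuple[str, ...]) -> set[str]:
    """Union of the privileges granted on the object and on every ancestor."""
    result: set[str] = set()
    for depth in range(1, len(path) + 1):
        result |= grants.get((principal, path[:depth]), set())
    return result


# Hypothetical grants: one at catalog level, one at schema level.
grants: Grants = {
    ("analyst", ("demo_lake", "hive_prod")): {"USE_CATALOG"},
    ("analyst", ("demo_lake", "hive_prod", "sales")): {"SELECT_TABLE"},
}

table = ("demo_lake", "hive_prod", "sales", "orders")
print(sorted(effective_privileges(grants, "analyst", table)))
# ['SELECT_TABLE', 'USE_CATALOG']
```

A table in a different schema of the same catalog would inherit only the catalog-level grant, which is what makes the multi-level model fine-grained.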

Supported Permission Types

  • User & group management
  • Catalog & schema creation
  • Read/write on tables, topics, filesets
  • Model registration & version control
  • Tag & policy application

Data Lineage (OpenLineage)

  • Automatic Lineage Collection – Via Spark plugins.
  • Unified Identifiers – Normalizes identifiers across sources to Gravitino IDs.
  • Multi‑Source Support – Hive, Iceberg, JDBC, file systems, etc.

Deployment Modes

  • Single‑node – Development & testing.
  • Cluster – High availability & load balancing.
  • Kubernetes – Containerized deployment with auto‑scaling.
  • Docker – Official Docker images available.

Storage Backends

  • Relational DBs: MySQL, PostgreSQL, …
  • Distributed storage systems (pluggable).

Authentication Methods

  • Simple (username/password)
  • OAuth2
  • Kerberos (for Hive backends)
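
For the first two methods, a client typically ends up attaching an `Authorization` header to each request. The sketch below shows the standard HTTP Basic and Bearer header formats in Python; how a given Gravitino deployment validates them is not covered here, and the helper names and sample credentials are made up for illustration. Kerberos uses a negotiation handshake and is left out.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """HTTP Basic scheme: base64 of 'username:password'."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def bearer_auth_header(access_token: str) -> dict[str, str]:
    """Bearer scheme, as commonly used with OAuth2 access tokens."""
    return {"Authorization": f"Bearer {access_token}"}


# Hypothetical credentials, for illustration only.
simple_headers = basic_auth_header("alice", "secret")
oauth_headers = bearer_auth_header("example-access-token")
print(simple_headers["Authorization"].split()[0], oauth_headers["Authorization"].split()[0])
```

Basic credentials are only encoded, not encrypted, so they should travel over TLS in any real deployment.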

Credential Management

  • Cloud storage credential vending (S3, GCS, Azure, …)
  • Dynamic credential refresh
  • Secure credential passing mechanisms
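
"Dynamic credential refresh" usually means a client caches a short-lived credential and fetches a new one shortly before the old one expires. The sketch below illustrates that generic pattern in Python. It is not Gravitino's implementation: the `Credential` and `RefreshingCredential` classes, the 60-second margin, and the fake fetch function are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Credential:
    token: str
    expires_at: float  # seconds since the epoch


class RefreshingCredential:
    """Cache a short-lived credential and re-fetch it shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Credential],
        margin_s: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin_s = margin_s
        self._clock = clock
        self._current: Optional[Credential] = None

    def get(self) -> Credential:
        cur = self._current
        # Refresh if nothing is cached yet or the cached credential is about to expire.
        if cur is None or self._clock() >= cur.expires_at - self._margin_s:
            self._current = self._fetch()
        return self._current


# Demo with a fake clock and a fake vending function (no real cloud calls).
now = [0.0]
calls: list[float] = []


def fake_fetch() -> Credential:
    calls.append(now[0])
    return Credential(token=f"token-{len(calls)}", expires_at=now[0] + 300.0)


cred = RefreshingCredential(fake_fetch, margin_s=60.0, clock=lambda: now[0])
first = cred.get().token    # nothing cached yet, so it is fetched
now[0] = 100.0
second = cred.get().token   # still well before expiry, served from cache
now[0] = 250.0
third = cred.get().token    # within 60 s of expiry (t=300), so it is refreshed
print(first, second, third)  # token-1 token-1 token-2
```

Injecting the clock keeps the refresh logic testable without waiting for real tokens to expire.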

Integration with Compute Engines

Gravitino deeply integrates with mainstream compute engines and data‑processing frameworks, delivering a unified data‑access experience.

  • Apache Spark – Seamless metadata synchronization and lineage tracking.
  • (Other engines such as Trino, Flink, etc., are also supported.)

Integration Capabilities of Apache Gravitino

Gravitino provides a rich set of connectors and SDKs that let you plug into existing data infrastructures with minimal effort. The following sections outline each integration point and its key features.

🌟 Gravitino Spark Connector

Supports Spark SQL and DataFrame API

  • Automatic data‑lineage collection and tracking
  • Unified access to multiple data sources
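
As a hedged sketch of how the connector is typically wired in, the Python snippet below assembles Spark configuration entries and renders them as `--conf` arguments. The property names and plugin class are written as I recall them from the connector documentation and should be verified against your Gravitino version; the server URI and metalake name are placeholders, and the connector JAR is assumed to be on Spark's classpath already.

```python
# Assumed property names; verify against the Spark connector docs for your version.
GRAVITINO_SPARK_CONF = {
    "spark.plugins": "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin",
    "spark.sql.gravitino.uri": "http://localhost:8090",  # placeholder server address
    "spark.sql.gravitino.metalake": "demo_lake",         # placeholder metalake name
}


def to_cli_args(conf: dict[str, str]) -> list[str]:
    """Render a config mapping as repeated '--conf key=value' arguments."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(["spark-sql", *to_cli_args(GRAVITINO_SPARK_CONF)]))
```

The same mapping could be passed to a `SparkSession` builder in PySpark instead of the command line.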

🔎 Trino Connector

Integration through the Gravitino Trino Connector service

  • Federated queries across heterogeneous data sources
  • High‑performance analytical query capabilities

Flink Connector

Integration through the Gravitino Flink Connector service

  • Stream‑batch unified data processing
  • Real‑time data processing and analysis

🐍 PyIceberg

Iceberg table access for Python environments

  • Connects to the Gravitino Iceberg REST service
  • Enables data‑science and machine‑learning workflows
  • Provides Pandas‑compatible data interfaces
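
A minimal sketch of that workflow in Python, assuming the `pyiceberg` package is installed and a Gravitino Iceberg REST service is reachable. The URI `http://localhost:9001/iceberg` is an assumed local default, and the catalog name and table identifier are placeholders. The functions are only defined here, not called, so the snippet runs without a server.

```python
# Connection properties for a REST catalog; the URI is an assumed local default.
ICEBERG_REST_PROPS = {
    "type": "rest",
    "uri": "http://localhost:9001/iceberg",
}


def open_catalog(name: str = "gravitino"):
    """Open the REST catalog; requires the pyiceberg package."""
    from pyiceberg.catalog import load_catalog  # imported lazily

    return load_catalog(name, **ICEBERG_REST_PROPS)


def table_to_pandas(catalog, identifier: str):
    """Load an Iceberg table (e.g. 'sales.orders') and read it into a pandas DataFrame."""
    return catalog.load_table(identifier).scan().to_pandas()


# Example usage against a running service (placeholder identifier):
# df = table_to_pandas(open_catalog(), "sales.orders")
# print(df.head())
```

Reading through a table scan means filters and column selection can be added later without changing how the catalog is opened.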

🚀 Daft

Modern distributed data‑processing framework

  • Optimized for AI/ML workloads
  • Supports multimodal data processing
  • Integrated with Gravitino metadata management

☸️ Kubernetes

Native deployment on Kubernetes clusters

  • Helm charts and Operators for easy installation
  • Auto‑scaling and fault‑recovery capabilities
  • Integration with cloud‑native monitoring and logging systems

🌐 REST API

Complete RESTful interface for metadata management

  • Supports all CRUD operations on catalogs, schemas, tables, and more
  • Standardized HTTP endpoints
  • Multiple authentication methods (Simple, OAuth2, Kerberos)

☕ Java SDK

Native Java client library

  • Type‑safe API surface
  • Built‑in connection pooling and retry mechanisms
  • Comprehensive exception handling

🐍 Python SDK

Python client library

  • Asynchronous operation support
  • Seamless integration with Jupyter notebooks
  • Tailored for data‑science workflows

Why These Integrations Matter

These capabilities enable Gravitino to seamlessly integrate into existing data ecosystems, giving users a unified and efficient data‑management experience. Upcoming articles will dive deeper into each component’s configuration and usage patterns—stay tuned!

👉 Continue reading: Setup Guide

⭐️ Follow and star the project: Apache Gravitino Repository

Note: This article reflects the features of Apache Gravitino v1.1.0. For the latest updates, consult the official documentation or open an issue on GitHub.
