Apache Gravitino Introduction

Published: January 16, 2026 at 06:05 PM EST

Source: Dev.to

Author: shaofeng shi
Last Updated: 2025‑12‑29

Overview

In the era of big data, enterprises often need to manage metadata from multi‑cloud, multi‑domain, and heterogeneous data sources such as Apache Hive, MySQL, PostgreSQL, Iceberg, Lance, S3, and GCS. With the rapid adoption of AI model training and inference, large volumes of multimodal data and model metadata also need to be managed in one place.

Traditional approaches manage metadata separately for each source, which increases operational complexity and creates data silos. Apache Gravitino, a high‑performance, geographically distributed, federated metadata lake, addresses this by managing metadata from all of these sources through a single system.

Project History

Milestone	Date
Initiated & founded by Datastrato Inc.	–
Open‑sourced	2023
Donated to Apache Incubator	2024
Graduated to Apache Top‑Level Project	May 2025

Gravitino is deployed in production at companies such as Xiaomi, Tencent, Zhihu, Uber, and Pinterest.

What is Apache Gravitino?

Apache Gravitino is a high‑performance, geographically distributed, federated metadata‑lake management system that provides a unified data & AI asset management platform. Its key capabilities are:

Unified Metadata Management – Unified models & APIs for diverse data sources.
Direct Metadata Management – Changes made through Gravitino are applied to the underlying systems in real time.
Multi‑Engine Support – Works with Trino, Spark, Flink, etc.
Geographically Distributed Deployment – Supports cross‑region, cross‑cloud architectures.
AI Asset Management – Manages both data assets and AI/ML model metadata.

Core Concepts

Concept	Description
Metalake	Container/tenant for metadata; typically one organization ↔ one metalake.
Catalog	Collection of metadata from a specific source.
Schema	Second‑level namespace (equivalent to a database schema).
Table	Bottom‑level object representing a concrete data table.
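
To make the hierarchy concrete, here is a minimal Python sketch of how one table is addressed through the four levels. The `TableIdentifier` class and all names in it (`demo_metalake`, `hive_prod`, `sales`, `orders`) are hypothetical and exist only for illustration; they are not part of any Gravitino SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableIdentifier:
    """Hypothetical helper: one table addressed through the four levels."""
    metalake: str
    catalog: str
    schema: str
    table: str

    def qualified_name(self) -> str:
        # Inside a metalake, a table is addressed as catalog.schema.table.
        return f"{self.catalog}.{self.schema}.{self.table}"

    def levels(self) -> list:
        # Walk the hierarchy from the outermost container down to the table.
        return [
            ("metalake", self.metalake),
            ("catalog", self.catalog),
            ("schema", self.schema),
            ("table", self.table),
        ]


# All names below are made up for the example.
ident = TableIdentifier("demo_metalake", "hive_prod", "sales", "orders")
print(ident.qualified_name())
for level, name in ident.levels():
    print(f"{level}: {name}")
```

The point of the sketch is only that every table sits at a fixed depth: a metalake contains catalogs, a catalog contains schemas, and a schema contains tables.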

Supported Data Sources

Category	Types
Relational Databases	MySQL, PostgreSQL, OceanBase, Apache Doris, StarRocks, …
Big‑Data Storage	Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon, Delta Lake (in development)
Message Queues	Apache Kafka
File Systems	HDFS, S3, GCS, Azure Blob Storage, Alibaba Cloud OSS
AI/ML Data Formats	Lance (columnar format optimized for AI/ML workloads)

REST API Services

Gravitino Core REST API

Full CRUD for all metadata objects (Metalake, Catalog, Schema, Table, …)
User, group, role, and permission management
Advanced features: tags, policies, models, etc.
Authentication: Simple, OAuth2, Kerberos
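
As a rough illustration of what calling the core REST API might look like, the sketch below builds (but does not send) two HTTP requests with Python's standard library. The host and port (`localhost:8090`), the metalake and catalog names, and the request body are assumptions for the example; the paths and the `Accept` header follow the pattern used in the Gravitino documentation and should be checked against the API reference for your version.

```python
import json
import urllib.request

# Assumption: a Gravitino server reachable locally on port 8090.
BASE_URL = "http://localhost:8090"


def build_request(path, method="GET", body=None):
    """Build an HTTP request for the metadata API without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Accept": "application/vnd.gravitino.v1+json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


# Read: list the catalogs in a (hypothetical) metalake.
list_catalogs = build_request("/api/metalakes/demo_metalake/catalogs")

# Create: add a schema under a (hypothetical) catalog.
create_schema = build_request(
    "/api/metalakes/demo_metalake/catalogs/hive_prod/schemas",
    method="POST",
    body={"name": "sales", "comment": "sales data", "properties": {}},
)

for req in (list_catalogs, create_schema):
    print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it. That step is left out so the sketch runs without a server.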

Iceberg REST Service

Implements Apache Iceberg REST API spec
Supports Hive, JDBC, and custom back‑ends as storage
Table management & query capabilities across S3, HDFS, GCS, Azure, …

Lance REST Service

Implements Lance REST API spec
Optimized for AI/ML workloads (vector data storage & retrieval)
Namespace & table management

Direct Metadata Management

Real‑time Synchronization – Immediate propagation of metadata changes to underlying sources.
Bidirectional Synchronization – Sync both from Gravitino → source and source → Gravitino.
Transaction Support – Guarantees atomicity & consistency of metadata ops.
Version Management – Metadata version control & historical tracking.

Unified Permission Management

Feature	Description
RBAC	Flexible permission handling for users, groups, and roles.
Ownership Model	Every metadata object has a clear owner.
Permission Inheritance	Hierarchical inheritance from Metalake down to tables.
Fine‑grained Control	Multi‑level permissions (Metalake → Catalog → Schema → Table).
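
Hierarchical inheritance can be pictured with a toy model: a privilege granted on a parent object also applies to everything beneath it. The sketch below is not Gravitino's authorization code. The `GRANTS` table, the role names, and the privilege names are illustrative assumptions.

```python
# Toy model: grants are stored per object path. A privilege held on a parent
# (metalake, catalog, or schema) also applies to everything beneath it.
GRANTS = {
    ("demo_metalake",): {"analyst": {"USE_CATALOG"}},
    ("demo_metalake", "hive_prod", "sales"): {"analyst": {"SELECT_TABLE"}},
    ("demo_metalake", "hive_prod", "sales", "orders"): {"etl": {"MODIFY_TABLE"}},
}


def effective_privileges(role, path):
    """Union the role's grants on `path` and on every ancestor of `path`."""
    privileges = set()
    for depth in range(1, len(path) + 1):
        privileges |= GRANTS.get(path[:depth], {}).get(role, set())
    return privileges


orders = ("demo_metalake", "hive_prod", "sales", "orders")
print(sorted(effective_privileges("analyst", orders)))
print(sorted(effective_privileges("etl", orders)))
```

In this model the `analyst` role can read `orders` without any grant on the table itself, because its grants on the metalake and the schema flow down. The `etl` grant on the table does not flow up to the schema.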

Supported Permission Types

User & group management
Catalog & schema creation
Read/write on tables, topics, filesets
Model registration & version control
Tag & policy application

Data Lineage (OpenLineage)

Automatic Lineage Collection – Via Spark plugins.
Unified Identifiers – Normalizes identifiers across sources to Gravitino IDs.
Multi‑Source Support – Hive, Iceberg, JDBC, file systems, etc.
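
The idea behind unified identifiers can be sketched as a lookup from a source-specific namespace to a catalog name. Everything here is hypothetical: the `CATALOG_BY_NAMESPACE` table, the namespace strings, and the output format are assumptions made for illustration, and the actual lineage plugin may normalize identifiers differently.

```python
# Hypothetical mapping from a source-specific namespace to a catalog name.
CATALOG_BY_NAMESPACE = {
    "hive://metastore:9083": "hive_prod",
    "mysql://db-host:3306": "mysql_orders",
}


def to_unified_name(namespace, dataset):
    """Rewrite a (namespace, dataset) pair as catalog.schema.table when known."""
    catalog = CATALOG_BY_NAMESPACE.get(namespace)
    if catalog is None:
        # Unknown sources are passed through unchanged.
        return f"{namespace}/{dataset}"
    return f"{catalog}.{dataset}"


print(to_unified_name("hive://metastore:9083", "sales.orders"))
print(to_unified_name("mysql://db-host:3306", "shop.customers"))
print(to_unified_name("s3://raw-bucket", "events/2025"))
```

With one naming scheme, lineage edges from different engines that touch the same table resolve to the same node.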

Deployment Modes

Single‑node – Development & testing.
Cluster – High availability & load balancing.
Kubernetes – Containerized deployment with auto‑scaling.
Docker – Official Docker images available.

Storage Backends

Relational DBs: MySQL, PostgreSQL, …
Distributed storage systems (pluggable).

Authentication Methods

Simple (username/password)
OAuth2
Kerberos (for Hive backends)

Credential Management

Cloud storage credential vending (S3, GCS, Azure, …)
Dynamic credential refresh
Secure credential passing mechanisms
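
Dynamic credential refresh is commonly implemented on the client side as a small cache that fetches a new short-lived credential shortly before the current one expires. The sketch below shows that general pattern, not Gravitino's implementation. `Credential`, `RefreshingCredential`, and `fake_vend` are hypothetical names, and `fake_vend` stands in for a real call to a credential-vending endpoint.

```python
import time
from dataclasses import dataclass


@dataclass
class Credential:
    token: str
    expires_at: float  # seconds since the epoch


class RefreshingCredential:
    """Caches a short-lived credential and re-fetches it shortly before expiry."""

    def __init__(self, fetch, margin=60.0, clock=time.time):
        self._fetch = fetch      # callable returning a fresh Credential
        self._margin = margin    # refresh this many seconds before expiry
        self._clock = clock      # injectable so the example is deterministic
        self._current = None

    def get(self):
        now = self._clock()
        if self._current is None or self._current.expires_at - now <= self._margin:
            self._current = self._fetch()
        return self._current


# Demo with a fake clock and a fake vending function (both hypothetical).
fake_now = [1000.0]
issued = []


def fake_vend():
    cred = Credential(token=f"token-{len(issued) + 1}", expires_at=fake_now[0] + 900)
    issued.append(cred)
    return cred


creds = RefreshingCredential(fake_vend, margin=60.0, clock=lambda: fake_now[0])
first = creds.get().token    # nothing cached yet, so a credential is fetched
fake_now[0] += 300
second = creds.get().token   # 600 s of validity left, served from the cache
fake_now[0] += 550
third = creds.get().token    # 50 s left, inside the margin, so it is refreshed
print(first, second, third)
```

Refreshing inside a safety margin, instead of at the exact expiry time, avoids handing out a token that expires in the middle of a long read.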

Integration with Compute Engines

Gravitino deeply integrates with mainstream compute engines and data‑processing frameworks, delivering a unified data‑access experience.

Apache Spark – Seamless metadata synchronization and lineage tracking.
(Other engines such as Trino, Flink, etc., are also supported.)

Integration Capabilities of Apache Gravitino

Gravitino provides a rich set of connectors and SDKs that let you plug into existing data infrastructures with minimal effort. The following sections outline each integration point and its key features.

🌟 Gravitino Spark Connector

Supports Spark SQL and DataFrame API

Automatic data‑lineage collection and tracking
Unified access to multiple data sources
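
For orientation, enabling the connector usually comes down to a few Spark properties. The sketch below collects them in a dictionary and renders them as command-line arguments. The property names follow the connector documentation as best understood here and should be verified against your Gravitino version. The server URI and metalake name are assumptions, and `to_submit_args` is a hypothetical helper.

```python
# Property names as described in the connector documentation (verify for your
# version); the URI and metalake name are assumptions for this example.
spark_conf = {
    "spark.plugins": "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin",
    "spark.sql.gravitino.uri": "http://localhost:8090",
    "spark.sql.gravitino.metalake": "demo_metalake",
}


def to_submit_args(conf):
    """Render a config dict as repeated `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(["spark-sql"] + to_submit_args(spark_conf)))
```

The connector JAR also has to be on the Spark classpath. How it gets there depends on the deployment and is not shown here.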

🔎 Trino Connector

Integration through the Gravitino Trino Connector service

Federated queries across heterogeneous data sources
High‑performance analytical query capabilities

⚡ Apache Flink Connector

Integration through the Gravitino Flink Connector service

Stream‑batch unified data processing
Real‑time data processing and analysis

🐍 PyIceberg

Iceberg table access for Python environments

Connects to the Gravitino Iceberg REST service
Enables data‑science and machine‑learning workflows
Provides Pandas‑compatible data interfaces
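
A PyIceberg client reaches an Iceberg REST service through a handful of catalog properties. The sketch below only builds that property dictionary. The endpoint URL is an assumption that depends on how the service is deployed, and `iceberg_rest_properties` is a hypothetical helper.

```python
def iceberg_rest_properties(uri, warehouse=""):
    """Build the properties a PyIceberg REST catalog client expects."""
    props = {"type": "rest", "uri": uri}
    if warehouse:
        props["warehouse"] = warehouse
    return props


# Assumed endpoint: adjust the host, port, and path to your deployment.
props = iceberg_rest_properties("http://localhost:9001/iceberg/")
print(props)
```

With PyIceberg installed and the service running, these properties could be passed to `pyiceberg.catalog.load_catalog("gravitino", **props)`, where the catalog name `"gravitino"` is arbitrary. A table loaded with `catalog.load_table(...)` can then be read into Pandas via `table.scan().to_pandas()`.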

🚀 Daft

Modern distributed data‑processing framework

Optimized for AI/ML workloads
Supports multimodal data processing
Integrated with Gravitino metadata management

☸️ Kubernetes

Native deployment on Kubernetes clusters

Helm charts and Operators for easy installation
Auto‑scaling and fault‑recovery capabilities
Integration with cloud‑native monitoring and logging systems

🌐 REST API

Complete RESTful interface for metadata management

Supports all CRUD operations on catalogs, schemas, tables, and more
Standardized HTTP endpoints
Multiple authentication methods (Simple, OAuth2, Kerberos)

☕ Java SDK

Native Java client library

Type‑safe API surface
Built‑in connection pooling and retry mechanisms
Comprehensive exception handling

🐍 Python SDK

Python client library

Asynchronous operation support
Seamless integration with Jupyter notebooks
Tailored for data‑science workflows

Why These Integrations Matter

These capabilities enable Gravitino to seamlessly integrate into existing data ecosystems, giving users a unified and efficient data‑management experience. Upcoming articles will dive deeper into each component’s configuration and usage patterns—stay tuned!

👉 Continue reading: Setup Guide

⭐️ Follow and star the project: Apache Gravitino Repository

Note: This article reflects the features of Apache Gravitino v1.1.0. For the latest updates, consult the official documentation or open an issue on GitHub.