Data Cataloguing in AWS

Published: December 3, 2025 at 08:44 AM EST
5 min read
Source: Dev.to

Introduction

In modern data engineering, one of the most overlooked but powerful capabilities is data cataloguing. Without a clear understanding of what data exists, where it lives, its schema, and how it changes over time, no ETL architecture can scale. This guide walks through how to catalogue data using AWS Glue Crawlers and how to structure your metadata layer when working with raw and cleaned datasets stored in Amazon S3.

The tutorial uses a simple CSV file in an S3 raw bucket and shows how AWS Glue automatically discovers its structure and builds a searchable, query‑ready data catalog. All steps can be replicated through the AWS Console.

What is Data Cataloguing?

Data cataloguing is the process of creating a structured inventory of all your data assets.

A good data catalog contains:

  • Dataset name
  • Schema (columns, data types, partitions)
  • Location (e.g., S3 path)
  • Metadata (size, owner, last updated)
  • Tags, classifications, lineage

Think of it as the “index” of your data ecosystem—similar to how a library catalog helps readers find books quickly.

Why it matters

  • Makes data discoverable across teams
  • Reduces manual documentation
  • Ensures schema consistency across pipelines
  • Enables data validation and quality checks
  • Fuels self‑service analytics
  • Supports governance and compliance

Data Cataloguing in ETL Pipelines

ETL pipelines depend heavily on metadata. Before transforming any dataset, the pipeline must understand:

  • What columns exist
  • Which data types to enforce
  • What partitions to use
  • What schema evolution has happened
  • How to map raw → cleaned → curated layers

A strong data catalog ensures that:

  • ETL jobs run reliably
  • Glue/Spark scripts do not break due to schema drift
  • Downstream BI tools (Athena, QuickSight, Superset, Power BI) can read data instantly
  • Data lineage and documentation stay updated

AWS Glue Data Catalog acts as the central metadata store for all your structured and semi‑structured data.
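
For example, once a table has been catalogued, a pipeline (or an engineer debugging one) can read the current schema straight from the catalog before running a transform. A minimal AWS CLI sketch, assuming the database and table names (orders_db, orders) created later in this guide:

# Fetch the column names and types the crawler recorded
aws glue get-table --database-name orders_db --name orders --query 'Table.StorageDescriptor.Columns'
# Check which partition keys, if any, are defined
aws glue get-table --database-name orders_db --name orders --query 'Table.PartitionKeys'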

Architecture Overview

Below is a high‑level diagram of the workflow:

Architecture diagram

The project walkthrough shows how Glue Crawlers:

  • Scan an S3 bucket
  • Detect the schema (headers, types, formatting)
  • Generate metadata
  • Store the metadata as a table in the Data Catalog

The resulting metadata is queryable through Amazon Athena, usable by Glue ETL jobs, and consumable by analytics tools.

Understanding Amazon S3, AWS Glue Crawler, and the Glue Data Catalog

Amazon S3 (Simple Storage Service)

Amazon S3 is a fully managed object storage service that lets you store any type of data at scale—CSV files, logs, JSON, Parquet, images, and more. It is highly durable, cost‑effective, and integrates seamlessly with AWS analytics services. In most modern data engineering architectures (including the Medallion architecture), S3 serves as the landing, raw, and processed layers.

AWS Glue Crawler

An AWS Glue Crawler is an automated metadata discovery tool that scans data stored in Amazon S3 (and other sources). When the crawler runs, it:

  1. Reads the file structure and content
  2. Detects the data format (CSV, JSON, Parquet, etc.)
  3. Infers column names and data types
  4. Identifies partitions
  5. Classifies datasets using built‑in or custom classifiers

The crawler then automatically creates or updates table metadata without manual schema definition.
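
If the built-in CSV classifier does not interpret a file the way you expect (delimiter, header detection), you can register a custom classifier and attach it to the crawler. A minimal sketch; the classifier name is illustrative:

# Register a custom CSV classifier (attach it to a crawler later via --classifiers)
aws glue create-classifier --csv-classifier '{"Name": "orders_csv_classifier", "Delimiter": ",", "ContainsHeader": "PRESENT"}'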

AWS Glue Data Catalog

The Glue Data Catalog is a centralized metadata repository for all your datasets within AWS. It stores:

  • Table definitions
  • Schema information
  • Partition details
  • Additional metadata used by analytics services

When a Glue Crawler finishes scanning an S3 bucket, it writes the discovered schema and table information into the Glue Data Catalog. This metadata can be queried by services such as Athena, EMR, Redshift Spectrum, and AWS Glue ETL jobs.

Workflow summary:
S3 → Glue Crawler scans files → Schema is inferred → Metadata stored in Glue Data Catalog → Data becomes queryable.

Step‑by‑Step Workflow

1. Upload Your CSV File to Amazon S3

Create an S3 bucket (replace the name with your own):

aws s3api create-bucket --bucket medallion-orders-2025-12-17 --region us-east-1

Upload the sample CSV file:

aws s3 cp orders.csv s3://medallion-orders-2025-12-17/

Optionally place the file in a folder (prefix) to represent a raw layer:

aws s3 cp orders.csv s3://medallion-orders-2025-12-17/raw/orders.csv
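
You can confirm the object landed in the raw prefix:

# List the contents of the raw layer
aws s3 ls s3://medallion-orders-2025-12-17/raw/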

Upload screenshot

2. Create a Glue Database

  1. In the Glue Console, navigate to Data Catalog → Databases.
  2. Click Add database.
  3. Name the database orders_db and click Create database.

Create database screenshot
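
If you prefer the CLI, the same database can be created with a single call; a minimal sketch:

# Create the Glue database (equivalent to the console steps above)
aws glue create-database --database-input '{"Name": "orders_db"}'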

3. Create an AWS Glue Crawler

  1. Navigate to Glue → Crawlers and click Create crawler.
  2. Provide a name (e.g., orders_crawler) and click Next.
  3. Click Add a data source, select S3, and point to the bucket/prefix containing your CSV file.
  4. Choose the database orders_db created earlier.
  5. Configure a schedule (or run on demand) and finish the wizard.

Create crawler screenshot
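
The same crawler can be defined from the CLI. A minimal sketch; the IAM role ARN is a placeholder and must point to an existing role that Glue can assume with read access to the bucket:

# Create the crawler (replace the role ARN with a real Glue service role in your account)
aws glue create-crawler --name orders_crawler --role arn:aws:iam::123456789012:role/GlueCrawlerRole --database-name orders_db --targets '{"S3Targets": [{"Path": "s3://medallion-orders-2025-12-17/raw/"}]}'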

4. Run the Crawler and Verify the Table

After creating the crawler, select it and choose Run crawler. Once completed, go to Data Catalog → Tables in the orders_db database. You should see a new table (e.g., orders) with inferred columns, data types, and partition information.
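
From the CLI, the run-and-verify loop looks roughly like this:

# Start the crawler
aws glue start-crawler --name orders_crawler
# Poll the crawler until its state returns to READY
aws glue get-crawler --name orders_crawler --query 'Crawler.State'
# List the tables the crawler registered in orders_db
aws glue get-tables --database-name orders_db --query 'TableList[].Name'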

5. Query the Catalogued Data with Athena

  1. Open the Athena console.
  2. Set the query result location to an S3 bucket (e.g., s3://my-athena-results/).
  3. Run a simple query:
SELECT * FROM orders_db.orders LIMIT 10;

If the crawler succeeded, Athena will return the rows from orders.csv.
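
The same query can be submitted without opening the console; a sketch using the example result bucket above:

# Submit the query (the call returns a QueryExecutionId)
aws athena start-query-execution --query-string "SELECT * FROM orders LIMIT 10;" --query-execution-context Database=orders_db --result-configuration OutputLocation=s3://my-athena-results/
# Fetch the rows once the query has finished (substitute the QueryExecutionId returned above)
aws athena get-query-results --query-execution-id <query-execution-id>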

Next Steps

  • Add partitions (e.g., by date) and re‑run the crawler to keep the catalog up‑to‑date (see the sketch after this list).
  • Integrate with Glue ETL jobs to transform raw data into cleaned/curated tables.
  • Set up Lake Formation or IAM policies for fine‑grained access control.
  • Enable schema versioning to track evolution over time.
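
As a sketch of the first point above, a common convention is Hive-style key=value prefixes, which Glue crawlers recognize as partition columns. The partition column name (order_date) and the layout below are illustrative only; if orders.csv currently sits directly under raw/, consider moving it under a partition prefix as well so the crawler sees a consistent layout:

# Land files under Hive-style prefixes so the crawler registers order_date as a partition column
aws s3 cp orders.csv s3://medallion-orders-2025-12-17/raw/order_date=2025-12-01/orders.csv
# Re-run the crawler to pick up the new partition
aws glue start-crawler --name orders_crawler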

By establishing a reliable data catalog early, you lay the foundation for scalable, maintainable, and governed data pipelines on AWS.
