Data Cataloguing in AWS
Introduction
In modern data engineering, one of the most overlooked but powerful capabilities is data cataloguing. Without a clear understanding of what data exists, where it lives, its schema, and how it changes over time, no ETL architecture can scale. This guide walks through how to catalogue data using AWS Glue Crawlers and how to structure your metadata layer when working with raw and cleaned datasets stored in Amazon S3.
The tutorial uses a simple CSV file in an S3 raw bucket and shows how AWS Glue automatically discovers its structure and builds a searchable, query‑ready data catalog. All steps can be replicated through the AWS Console.
What is Data Cataloguing?
Data cataloguing is the process of creating a structured inventory of all your data assets.
A good data catalog contains:
- Dataset name
- Schema (columns, data types, partitions)
- Location (e.g., S3 path)
- Metadata (size, owner, last updated)
- Tags, classifications, lineage
Think of it as the “index” of your data ecosystem—similar to how a library catalog helps readers find books quickly.
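In the Glue Data Catalog, most of these fields live on the table entry itself. As a rough sketch (assuming a catalog table such as orders_db.orders, which this guide creates later), you can pull them straight from the CLI:
# Inspect a catalog table entry: name, S3 location, schema, and last-update time
# (assumes the orders_db.orders table created later in this guide)
aws glue get-table --database-name orders_db --name orders \
  --query 'Table.{Name:Name,Location:StorageDescriptor.Location,Columns:StorageDescriptor.Columns,Updated:UpdateTime}'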
Why it matters
- Makes data discoverable across teams
- Reduces manual documentation
- Ensures schema consistency across pipelines
- Enables data validation and quality checks
- Fuels self‑service analytics
- Supports governance and compliance
Data Cataloguing in ETL Pipelines
ETL pipelines depend heavily on metadata. Before transforming any dataset, the pipeline must understand:
- What columns exist
- Which data types to enforce
- What partitions to use
- What schema evolution has happened
- How to map raw → cleaned → curated layers
A strong data catalog ensures that:
- ETL jobs run reliably
- Glue/Spark scripts do not break due to schema drift
- Downstream BI tools (Athena, QuickSight, Superset, Power BI) can read data instantly
- Data lineage and documentation stay updated
AWS Glue Data Catalog acts as the central metadata store for all your structured and semi‑structured data.
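One practical guard against schema drift is to compare the table versions the Data Catalog keeps automatically. A minimal CLI sketch, reusing the example table names from this guide:
# List the schema versions Glue stores for a table so a pipeline (or an operator)
# can spot schema drift before an ETL job runs
aws glue get-table-versions --database-name orders_db --table-name orders \
  --query 'TableVersions[].{Version:VersionId,Columns:Table.StorageDescriptor.Columns}'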
Architecture Overview
At a high level, the workflow is: raw files land in S3, a Glue Crawler scans them, and the resulting metadata is stored in the Glue Data Catalog for querying.
The project walkthrough shows how Glue Crawlers:
- Scan an S3 bucket
- Detect the schema (headers, types, formatting)
- Generate metadata
- Store the metadata as a table in the Data Catalog
The resulting metadata is queryable through Amazon Athena, usable by Glue ETL jobs, and consumable by analytics tools.
Understanding Amazon S3, AWS Glue Crawler, and the Glue Data Catalog
Amazon S3 (Simple Storage Service)
Amazon S3 is a fully managed object storage service that lets you store any type of data at scale—CSV files, logs, JSON, Parquet, images, and more. It is highly durable, cost‑effective, and integrates seamlessly with AWS analytics services. In most modern data engineering architectures (including the Medallion architecture), S3 serves as the landing, raw, and processed layers.
AWS Glue Crawler
An AWS Glue Crawler is an automated metadata discovery tool that scans data stored in Amazon S3 (and other sources). When the crawler runs, it:
- Reads the file structure and content
- Detects the data format (CSV, JSON, Parquet, etc.)
- Infers column names and data types
- Identifies partitions
- Classifies datasets using built‑in or custom classifiers
The crawler then automatically creates or updates table metadata without manual schema definition.
AWS Glue Data Catalog
The Glue Data Catalog is a centralized metadata repository for all your datasets within AWS. It stores:
- Table definitions
- Schema information
- Partition details
- Additional metadata used by analytics services
When a Glue Crawler finishes scanning an S3 bucket, it writes the discovered schema and table information into the Glue Data Catalog. This metadata can be queried by services such as Athena, EMR, Redshift Spectrum, and AWS Glue ETL jobs.
Workflow summary:
S3 → Glue Crawler scans files → Schema is inferred → Metadata stored in Glue Data Catalog → Data becomes queryable.
Step‑by‑Step Workflow
1. Upload Your CSV File to Amazon S3
Create an S3 bucket (replace the name with your own):
aws s3api create-bucket --bucket medallion-orders-2025-12-17 --region us-east-1
Upload the sample CSV file:
aws s3 cp orders.csv s3://medallion-orders-2025-12-17/
Optionally place the file in a folder (prefix) to represent a raw layer:
aws s3 cp orders.csv s3://medallion-orders-2025-12-17/raw/orders.csv
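To confirm the object landed under the prefix the crawler will scan, a quick listing is enough (bucket name as in the example above):
aws s3 ls s3://medallion-orders-2025-12-17/raw/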

2. Create a Glue Database
- In the Glue Console, navigate to Data Catalog → Databases.
- Click Add database.
- Name the database orders_db and click Create database.
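If you prefer the CLI over the console, a minimal equivalent is:
# Create the Glue database that will hold the crawled table metadata
aws glue create-database --database-input '{"Name": "orders_db"}'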

3. Create an AWS Glue Crawler
- Navigate to Glue → Crawlers and click Create crawler.
- Provide a name (e.g., orders_crawler) and click Next.
- Click Add a data source, select S3, and point to the bucket/prefix containing your CSV file.
- Choose the database orders_db created earlier.
- Configure a schedule (or run on demand) and finish the wizard.
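The same crawler can also be created from the CLI. A minimal sketch, assuming an IAM role (the name below is a placeholder) that allows Glue to read the bucket:
# Create a crawler that scans the raw prefix and writes tables into orders_db
# AWSGlueServiceRole-orders is a placeholder; use a role with Glue and S3 read permissions
aws glue create-crawler \
  --name orders_crawler \
  --role AWSGlueServiceRole-orders \
  --database-name orders_db \
  --targets '{"S3Targets": [{"Path": "s3://medallion-orders-2025-12-17/raw/"}]}'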

4. Run the Crawler and Verify the Table
After creating the crawler, select it and choose Run crawler. Once completed, go to Data Catalog → Tables in the orders_db database. You should see a new table (e.g., orders) with inferred columns, data types, and partition information.
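The same run-and-verify loop can be scripted with the CLI (names follow the earlier examples):
# Start the crawler, then poll until its state returns to READY
aws glue start-crawler --name orders_crawler
aws glue get-crawler --name orders_crawler --query 'Crawler.State'

# Once finished, list the tables the crawler registered in orders_db
aws glue get-tables --database-name orders_db --query 'TableList[].Name'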
5. Query the Cataloged Data with Athena
- Open the Athena console.
- Set the query result location to an S3 bucket (e.g., s3://my-athena-results/).
- Run a simple query:
SELECT * FROM orders_db.orders LIMIT 10;
If the crawler succeeded, Athena will return the rows from orders.csv.
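The same check can be scripted with the Athena CLI. A sketch, assuming the s3://my-athena-results/ bucket from above already exists:
# Submit the query and note the QueryExecutionId it returns
aws athena start-query-execution \
  --query-string "SELECT * FROM orders_db.orders LIMIT 10;" \
  --result-configuration OutputLocation=s3://my-athena-results/

# Fetch the rows once the query has succeeded
# (<query-execution-id> is the id returned by the command above)
aws athena get-query-results --query-execution-id <query-execution-id>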
Next Steps
- Add partitions (e.g., by date) and re-run the crawler to keep the catalog up-to-date (a partition layout sketch follows this list).
- Integrate with Glue ETL jobs to transform raw data into cleaned/curated tables.
- Set up Lake Formation or IAM policies for fine‑grained access control.
- Enable schema versioning to track evolution over time.
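For the partitioning mentioned in the first item, one common convention is Hive-style key=value prefixes in S3, which the crawler picks up on its next run. A sketch with an illustrative order_date partition key:
# Hive-style partition layout: the crawler registers order_date as a partition column
# (paths and dates below are illustrative)
aws s3 cp orders.csv s3://medallion-orders-2025-12-17/raw/orders/order_date=2025-12-01/orders.csv
aws s3 cp orders.csv s3://medallion-orders-2025-12-17/raw/orders/order_date=2025-12-02/orders.csv

# Re-run the crawler so the new partitions land in the Data Catalog
aws glue start-crawler --name orders_crawler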
By establishing a reliable data catalog early, you lay the foundation for scalable, maintainable, and governed data pipelines on AWS.
