Organizing How to Use AWS Lake Formation
Source: Dev.to
Original Japanese article: AWS Lake Formationの使い方について整理してみる
I’m Aki, an AWS Community Builder (@jitepengin).
Previously, I wrote an article titled Is AWS Glue Data Catalog Sufficient as a Data Catalog? Organizing Its Design, Limitations, and Complementary Strategies. In that article, I noted that “AWS Lake Formation is necessary to complement data governance” but did not go into detail because it was outside the scope of the article.
This time, I’d like to organize my thoughts on Lake Formation, covering everything from the fundamentals to practical usage patterns. Lake Formation is often perceived as a service that is “somewhat difficult” or “unnecessary because IAM is enough.” I hope this article helps you evaluate whether Lake Formation is worth adopting in your environment.
AWS Lake Formation is a service that provides access management and governance for data lakes. It allows you to centrally manage who can access which data and at what level.
Although they are often confused, Lake Formation and Glue Data Catalog serve different purposes.
Service | Role
Glue Data Catalog | A technical catalog that manages metadata such as schemas and partitions
Lake Formation | A governance layer that manages access permissions for data registered in the Glue Data Catalog
Amazon S3 (Actual Data)
↓
Glue Data Catalog (Metadata Management)
↓
Lake Formation (Access Control)
↓
Athena / Glue Job / Redshift Spectrum
In other words, data resides in S3, Glue Data Catalog manages metadata, and Lake Formation provides access control on top of that metadata layer.
When managing a data lake on S3 using IAM alone, several challenges emerge:
Granularity limitations: IAM primarily operates at the bucket or prefix level, making table-, column-, and row-level access control difficult.
Operational complexity: As users and roles increase, S3 bucket policies and IAM policies become increasingly difficult to manage.
Cross-account sharing: Implementing data sharing across AWS accounts using only IAM can lead to complicated designs.
Limited visibility for auditing: It is difficult to easily understand who can access which tables.
Typical examples include:
More than ten Athena users need different levels of access, making permission management increasingly complicated.
Different departments should see different subsets of data. For example, the sales department should only see Eastern Japan sales, while executives can see all data.
Personally identifiable information (PII) such as email addresses and credit card numbers should be hidden from analysts.
Data needs to be shared with another AWS account.
Lake Formation addresses these challenges. With Lake Formation, you can implement:
Fine-grained table-, column-, and row-level access control
Permission management at the Glue Data Catalog database and table level
Tag-based access control (LF-TBAC) for large-scale environments
Cross-account data sharing through AWS RAM
Centralized auditing through CloudTrail integration
Lake Formation does not replace IAM; it works as an additional layer on top of IAM. When a query is executed (for example, through Athena), access is granted only if both conditions are satisfied:
IAM Permission AND Lake Formation Permission
↓
Access Allowed
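The two-layer check can be modeled in a few lines of Python. This is only an illustrative sketch: the function name `is_access_allowed` and its boolean inputs are assumptions made for this example, not part of any AWS API, and the real evaluation happens inside AWS.

```python
def is_access_allowed(iam_allows: bool, lake_formation_allows: bool) -> bool:
    """Model the rule that a request succeeds only when BOTH layers allow it."""
    return iam_allows and lake_formation_allows


# Enumerate all four combinations to make the AND relationship explicit.
for iam in (True, False):
    for lf in (True, False):
        outcome = "allowed" if is_access_allowed(iam, lf) else "denied"
        print(f"IAM={iam!s:<5} LakeFormation={lf!s:<5} -> {outcome}")
```

Only the first combination, where both layers allow the request, ends in "allowed"; a denial in either layer is enough to block access.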
Even if permissions are granted in Lake Formation, access is denied if IAM blocks it. Likewise, even if IAM allows access, the request is denied if the corresponding Lake Formation permissions are missing. Understanding this “AND” relationship is the foundation of permission design.
Lake Formation permissions are managed across multiple levels.
Level | Target | Example Permissions
Data Lake Administrator | Entire Lake Formation environment | Full permissions
Database Level | Glue Data Catalog database | CREATE TABLE, DROP
Table Level | Individual table | SELECT, INSERT, ALTER
Column Level | Specific columns within a table | SELECT on selected columns
Row Level | Rows matching specific conditions | SELECT on filtered rows
Permissions can be granted or revoked through the console, CLI, or SDK.
Example: Grant SELECT permission on a table
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/analyst-role \
  --permissions SELECT \
  --resource '{
    "Table": {
      "DatabaseName": "mydb",
      "Name": "sales_table"
    }
  }'
One of Lake Formation’s strongest capabilities is fine-grained access control beyond the table level. Both column-level and row-level security are implemented using a mechanism called Data Filters.
Access can be restricted to specific columns. Suppose the customer table contains the following columns:
customer_id | name | email | credit_card | purchase_amount
You could allow analysts to access only customer_id, name, and purchase_amount, while hiding email and credit_card.
This can be achieved simply by specifying included or excluded columns in a Data Filter.
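As a rough sketch of what such a column-only filter might look like, the helper below builds a request body in the shape I understand `create-data-cells-filter` to accept (`ColumnNames` for an include list, `ColumnWildcard.ExcludedColumnNames` for an exclude list, and `AllRowsWildcard` for "no row restriction"). The helper name `build_column_filter`, the filter name `hide-pii`, and the table name `customer` are hypothetical; check the field names against the current API reference before relying on them.

```python
import json


def build_column_filter(catalog_id, database, table, name,
                        include=None, exclude=None):
    """Build a TableData payload for a column-only Data Filter.

    Exactly one of `include` / `exclude` must be given.
    """
    if (include is None) == (exclude is None):
        raise ValueError("specify exactly one of include or exclude")
    table_data = {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        # No row restriction: every row stays visible.
        "RowFilter": {"AllRowsWildcard": {}},
    }
    if include is not None:
        table_data["ColumnNames"] = list(include)
    else:
        table_data["ColumnWildcard"] = {"ExcludedColumnNames": list(exclude)}
    return table_data


# Hypothetical filter that hides the two PII columns from analysts.
analyst_filter = build_column_filter(
    "123456789012", "mydb", "customer", "hide-pii",
    exclude=["email", "credit_card"],
)
print(json.dumps(analyst_filter, indent=2))
```

An exclude list is used here so that columns added to the table later remain visible by default; an include list is the stricter choice when new columns should stay hidden until explicitly allowed.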
Row-level filters allow access only to rows matching specific conditions.
Filter expressions are written using PartiQL WHERE-clause syntax.
For example, if the sales table contains a region column and the Eastern Japan team should only see rows where region = ‘east’, you can create the following Data Filter:
aws lakeformation create-data-cells-filter \
  --table-data '{
    "TableCatalogId": "123456789012",
    "DatabaseName": "mydb",
    "TableName": "sales",
    "Name": "east-region-filter",
    "RowFilter": {
      "FilterExpression": "region = '\''east'\''"
    },
    "ColumnWildcard": {}
  }'
Combining column filters and row filters enables cell-level security, where users can access only specific columns within specific rows.
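To make the idea of cell-level security concrete, here is a small local simulation in plain Python. It does not call Lake Formation and is not how the service is implemented; the function `apply_data_filter` and the sample rows are invented for illustration. It only shows the effect of applying a row condition and a column list together.

```python
def apply_data_filter(rows, row_predicate, visible_columns):
    """Return only the permitted rows, projected onto the permitted columns."""
    return [
        {col: row[col] for col in visible_columns}
        for row in rows
        if row_predicate(row)
    ]


# Hypothetical sample data standing in for the sales table.
sales = [
    {"order_id": 1, "region": "east", "customer_email": "a@example.com", "amount": 120},
    {"order_id": 2, "region": "west", "customer_email": "b@example.com", "amount": 80},
    {"order_id": 3, "region": "east", "customer_email": "c@example.com", "amount": 45},
]

east_team_view = apply_data_filter(
    sales,
    row_predicate=lambda row: row["region"] == "east",  # row filter
    visible_columns=["order_id", "region", "amount"],   # column filter
)
print(east_team_view)
```

The Eastern Japan team ends up seeing two rows, and neither row exposes `customer_email`.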
According to the official documentation:
Up to 100 filters per principal
array and map types are not supported in filter expressions (struct types can be used in row filters)
Cell-level security does not support nested columns, views, or resource links
Cell-level security is available in all regions when using Athena Engine Version 3 or Redshift Spectrum
Typical use cases for Data Filters include:
Protecting PII such as email addresses and credit card numbers
Restricting business data by department or geographic region
Compliance requirements for regulated data
As the number of databases and tables grows, managing permissions table by table becomes increasingly difficult.
LF-TBAC (Lake Formation Tag-Based Access Control) addresses this problem.
LF-Tags are key-value tags unique to Lake Formation.
They are separate from both S3 resource tags and IAM tags and are managed independently within Lake Formation.
aws lakeformation create-lf-tag \
  --tag-key "sensitivity" \
  --tag-values '["public", "internal", "confidential"]'
LF-Tags can be assigned to databases, tables, and columns.
aws lakeformation add-lf-tags-to-resource \
  --resource '{"Table": {"DatabaseName": "mydb", "Name": "sales"}}' \
  --lf-tags '[{"TagKey": "sensitivity", "TagValues": ["internal"]}]'
Permissions are then granted based on tags rather than table names.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/analyst-role \
  --permissions SELECT \
  --resource '{
    "LFTagPolicy": {
      "ResourceType": "TABLE",
      "Expression": [{"TagKey": "sensitivity", "TagValues": ["public", "internal"]}]
    }
  }'
This grants SELECT access to all tables tagged with either sensitivity=public or sensitivity=internal.
When new tables are created, simply assigning the appropriate LF-Tag automatically applies the correct permissions.
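The matching behavior can be sketched locally as follows. This is a simplified model, not the service's implementation: it assumes each resource carries one value per tag key, that values within a condition are alternatives, and that every condition in the expression must hold. The function `matches_expression` and the table names other than `sales` are hypothetical.

```python
def matches_expression(resource_tags, expression):
    """True if the resource satisfies every condition in the expression.

    resource_tags: dict mapping tag key -> the single value assigned
    expression: list of {"TagKey": ..., "TagValues": [...]} conditions
    """
    return all(
        resource_tags.get(cond["TagKey"]) in cond["TagValues"]
        for cond in expression
    )


# Hypothetical catalog: table name -> LF-Tags assigned to it.
tables = {
    "sales": {"sensitivity": "internal"},
    "product_catalog": {"sensitivity": "public"},
    "customer_pii": {"sensitivity": "confidential"},
}

analyst_expression = [
    {"TagKey": "sensitivity", "TagValues": ["public", "internal"]},
]


def readable_tables():
    """List the tables the analyst expression currently matches."""
    return sorted(
        name for name, tags in tables.items()
        if matches_expression(tags, analyst_expression)
    )


before = readable_tables()
# A newly created table only needs the right tag; the grant itself is untouched.
tables["orders_2025"] = {"sensitivity": "internal"}
after = readable_tables()
print(before)
print(after)
```

The point of the sketch is that the grant is written once against the tag expression, and the set of matching tables changes as tags are assigned.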
In environments with dozens or hundreds of tables, table-by-table permission management becomes unrealistic.
LF-TBAC enables a simpler model:
Roles can access data with specific tags.
However, tag design should be carefully planned from the beginning.
Defining tag keys such as sensitivity, domain, and owner early on can save significant effort later.
Lake Formation works closely with Glue Data Catalog.
Glue Data Catalog manages metadata, while Lake Formation governs access to that metadata.
When Lake Formation is enabled, access to Glue Data Catalog is routed through Lake Formation authorization checks.
This means that access to metadata itself—such as table definitions—can also be controlled.
When a Glue Job accesses data governed by Lake Formation, permissions must be granted not only through IAM but also through Lake Formation.
This is a common pitfall.
A typical issue is:
IAM permissions look correct, but the Glue Job still cannot read data.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/glue-job-role \
  --permissions SELECT \
  --resource '{
    "Table": {"DatabaseName": "mydb", "Name": "source_table"}
  }'
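Because an ETL job usually touches more than one table, it can help to list every grant the job role needs in one place. The sketch below only builds the request dictionaries and prints a summary; it makes no AWS calls. The helper `build_table_grant`, the table `target_table`, and the choice of `INSERT` and `ALTER` for the write side are assumptions for illustration, so adjust them to what your job actually does.

```python
def build_table_grant(role_arn, database, table, permissions):
    """Build keyword arguments shaped like a Lake Formation grant request."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


GLUE_ROLE = "arn:aws:iam::123456789012:role/glue-job-role"

# One entry per table the job reads from or writes to (hypothetical tables).
grants = [
    build_table_grant(GLUE_ROLE, "mydb", "source_table", ["SELECT"]),
    build_table_grant(GLUE_ROLE, "mydb", "target_table", ["INSERT", "ALTER"]),
]

for grant in grants:
    table = grant["Resource"]["Table"]["Name"]
    print(f"{table}: {', '.join(grant['Permissions'])}")
```

If you use boto3, each dictionary could be passed to the Lake Formation client as `grant_permissions(**grant)`; reviewing the printed summary first is a cheap way to catch a missing table before the job fails at runtime.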
Lake Formation supports cross-account data sharing through AWS RAM (Resource Access Manager). Users in the target account can query shared tables directly from their own Athena environment. Because Lake Formation permissions, including column and row filters, remain enforced, scenarios such as sharing data while excluding sensitive columns are supported.
To use cross-account sharing, the Data Catalog Cross Account Version setting must be configured to Version 3 or later. Version 3 enables direct sharing with IAM principals in other accounts. Version 4 adds support for hybrid access mode in cross-account scenarios.
When Athena accesses a Lake Formation-managed table:
A user executes a query in Athena.
Athena requests table metadata from Glue Data Catalog.
Lake Formation validates permissions.
If authorized, access to data in S3 is allowed.
Column and row filters are applied before results are returned.
This enables fine-grained access control without modifying S3 bucket policies.
Since Redshift Spectrum also relies on Glue Data Catalog, Lake Formation permissions are enforced there as well. This makes it easier to maintain consistent access control across Athena and Redshift Spectrum.
To preserve backward compatibility, Lake Formation grants the IAMAllowedPrincipals group Super permissions on existing Data Catalog resources by default. In this state, access is effectively controlled by IAM alone, and Lake Formation’s fine-grained controls are not enforced. To fully leverage Lake Formation, these permissions must eventually be removed and replaced with explicit Lake Formation permissions. However, switching everything at once can break existing workloads.
This is where Hybrid Access Mode becomes useful. When registering S3 locations, Hybrid Access Mode allows selected principals to opt into Lake Formation authorization while other principals continue using IAM-only access. This approach minimizes risk and enables gradual migration.
Personally, I believe this is the most practical approach for existing environments.
As mentioned earlier, forgetting to grant Lake Formation permissions to Glue Job roles prevents ETL jobs from reading or writing data. Many “it should work but doesn’t” permission issues ultimately trace back to this. I’ve forgotten it myself a few times and ended up scrambling to find the root cause.
Lake Formation does not override S3 bucket policies. Even if access is granted in Lake Formation, requests are denied if the bucket policy blocks them. When adopting Lake Formation, bucket policies must be designed to allow access through Lake Formation-authorized service roles. Maintaining consistency among IAM, Lake Formation, and S3 bucket policies is critical. Changing the design later can become painful, so it’s worth thinking through carefully from the beginning.
When enabling Lake Formation for the first time, at least one Data Lake Administrator must be configured. Relying on a single administrator can become an operational bottleneck, so I recommend assigning multiple administrators.
When Athena Workgroups are used together with Lake Formation, behavior may vary depending on Workgroup configuration. In particular, don’t forget to grant permissions to the S3 bucket used for query results. This is another thing I occasionally forget myself.
For new environments, enabling Lake Formation from the start is usually the best option. For existing environments, a phased approach tends to work better. I’ve done this before, and while it’s certainly possible, it’s somewhat tedious.
Register S3 locations using Hybrid Access Mode
Opt in selected principals
Keep IAM-only access for others
Monitor access through CloudTrail
Manage permissions for newly created tables through Lake Formation
Leave existing tables under IAMAllowedPrincipals
Gradually revoke IAMAllowedPrincipals permissions
Replace them with Lake Formation permissions
Validate behavior after each migration step
Lake Formation is particularly valuable for:
Fine-grained table-, column-, and row-level access control
Consistent authorization across Athena, Glue, and Redshift Spectrum
Scalable permission management using LF-TBAC
Cross-account data sharing
However, some areas remain outside its scope:
Direct access control to raw files in S3
Business metadata management
Data quality management
As discussed in my previous article, Lake Formation and DataZone have complementary responsibilities.
Service | Role
Lake Formation | Technical governance (who can access what)
Amazon DataZone | Business governance (discovering, understanding, and requesting data)
A useful way to think about them is:
Lake Formation = Technical foundation for governance
DataZone = Business foundation for governance
Combined with Glue Data Catalog, these services form a comprehensive data catalog and governance solution on AWS.
In this article, I reviewed AWS Lake Formation from its fundamentals through practical implementation patterns. While there is a learning curve, it is an extremely important service for implementing proper data governance.
Key takeaways:
Lake Formation complements IAM rather than replacing it, adding fine-grained table-, column-, and row-level controls.
Column, row, and cell-level security are implemented through Data Filters.
LF-TBAC reduces operational overhead as the number of tables grows.
Lake Formation integrates tightly with Glue Data Catalog by adding a governance layer on top of metadata management.
Understanding IAMAllowedPrincipals and using Hybrid Access Mode for gradual adoption is essential in existing environments.
Lake Formation certainly introduces some complexity, but when implementing proper access control in a data lake, the limitations of IAM alone eventually become apparent. In environments where data is consumed by multiple teams and a wide variety of users, Lake Formation is well worth considering.
That said, successful adoption depends on maintaining consistency across IAM, Lake Formation, and S3 bucket policies, so careful planning is essential.
I hope this article helps anyone considering the adoption of Lake Formation.