Query Glue tables across accounts and regions

Posted: | Tags: aws cloud athena

Recently, I updated a Stack Overflow answer to include some information on Lake Formation that was new to me: resource links. The answer also covered a few other methods to query data cataloged in Glue across accounts and regions. Here’s the setup: you have a Glue Data Catalog and an S3 bucket in an AWS account called Owner Account, and there are users in different AWS accounts and regions that would like access to this data. For the sake of explanation, we’ll focus on one account that needs access from a different region; this account will be called Query Account.

Avid users of Lake Formation will already know where this is going.

Setting up cross-account access with AWS Glue

Let’s assume Lake Formation didn’t exist for a moment, and neither did resource links. What exactly could we achieve? In short, we can only go so far as cross-account access.

Diagram: querying the Glue Data Catalog in the Owner Account from the Query Account in the same region.

We’d need to create an IAM role and policy in the Query Account that allows the user running Athena to reach the Glue resources and query the data in S3 in the Owner Account. That alone isn’t enough. We’d also need to set resource policies on the S3 bucket and the Glue resources that permit this role to access their content. With all this in place, we can register the data catalog in Athena in the Query Account and run our queries.
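To make that a bit more concrete, here is a rough Python sketch of the two resource policies set in the Owner Account, plus the Athena catalog registration in the Query Account. The function names, the action lists, and the catalog name `owner_catalog` are my own assumptions for illustration, not a complete or authoritative policy, so trim or extend the actions to match what your queries need. The role in the Query Account also needs an identity policy that allows the same Glue and S3 actions.

```python
import json


def glue_resource_policy(owner_account, region, database, query_role_arn):
    """Read-only Glue resource policy for the Owner Account (assumed action list)."""
    prefix = f"arn:aws:glue:{region}:{owner_account}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": query_role_arn},
            "Action": [
                "glue:GetDatabase", "glue:GetDatabases",
                "glue:GetTable", "glue:GetTables",
                "glue:GetPartition", "glue:GetPartitions",
            ],
            "Resource": [
                f"{prefix}:catalog",
                f"{prefix}:database/{database}",
                f"{prefix}:table/{database}/*",
            ],
        }],
    }


def bucket_policy(bucket, query_role_arn):
    """Read-only S3 bucket policy for the Owner Account's data bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": query_role_arn},
            "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }


def apply_owner_policies(region, glue_policy, bucket, s3_policy):
    """Run with Owner Account credentials. Both calls REPLACE any existing policy."""
    import boto3  # imported lazily so the builders above work without boto3
    glue = boto3.client("glue", region_name=region)
    glue.put_resource_policy(PolicyInJson=json.dumps(glue_policy))
    boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(s3_policy))


def register_owner_catalog(region, owner_account, name="owner_catalog"):
    """Run with Query Account credentials: register the Owner catalog in Athena."""
    import boto3
    athena = boto3.client("athena", region_name=region)
    athena.create_data_catalog(
        Name=name, Type="GLUE", Parameters={"catalog-id": owner_account}
    )
```

Once the catalog is registered, tables can be referenced in Athena as "owner_catalog"."<database>"."<table>".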

This approach is limited to resources in the same region when using Athena.

A resource link is a pointer to a Glue database or table that is either owned by or shared with an account, and it can point across regions. That solves our problem. This feature, a relatively new one at that (circa 2023), is usually paired with AWS Lake Formation, a service that provides a number of security and governance features for data on AWS. Getting started with Lake Formation takes some time, but it eventually lets us access our databases and tables across accounts and regions by using a “resource link”.

Diagram: using Lake Formation and Resource Access Manager to share a Glue Data Catalog across accounts within the same region. A resource link is then created from the shared data catalog across regions within the Query Account.

The S3 bucket can also be registered with Lake Formation so you don’t need to worry about resource policies here; Lake Formation will take care of that for you once resources are shared. You can then share the database and tables from Lake Formation to “External accounts”. When doing so, you must choose “Select” and “Describe” permissions and make them grantable to be able to share them with your resource link later on.
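If you would rather script the sharing step than click through the console, it might look roughly like the sketch below. It uses the Lake Formation `register_resource` and `grant_permissions` API calls; the function names are made up for this example, and I am assuming the service-linked role is acceptable for the bucket registration. Also note an assumption on my side: I grant only “Describe” on the database itself and “Select” plus “Describe” on its tables, since “Select” is a table-level permission.

```python
def grant_requests(query_account, database):
    """Build the two grant_permissions requests for sharing with an external account."""
    principal = {"DataLakePrincipalIdentifier": query_account}
    return [
        {   # the database itself: DESCRIBE only, grantable
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["DESCRIBE"],
            "PermissionsWithGrantOption": ["DESCRIBE"],
        },
        {   # every table in the database: SELECT and DESCRIBE, grantable
            "Principal": principal,
            "Resource": {"Table": {"DatabaseName": database, "TableWildcard": {}}},
            "Permissions": ["SELECT", "DESCRIBE"],
            "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
        },
    ]


def share_with_query_account(owner_region, bucket, query_account, database):
    """Run with Owner Account credentials (needs boto3 and Lake Formation admin rights)."""
    import boto3  # imported lazily so grant_requests() works without boto3
    lf = boto3.client("lakeformation", region_name=owner_region)
    # Register the data location so Lake Formation manages access to it.
    lf.register_resource(ResourceArn=f"arn:aws:s3:::{bucket}", UseServiceLinkedRole=True)
    for request in grant_requests(query_account, database):
        lf.grant_permissions(**request)
```

Granting to an external account ID is what triggers the RAM share behind the scenes.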

Now, from the Query Account, navigate to the Resource Access Manager (RAM) invitations generated by sharing the Glue resources from the Owner Account. Once you accept them, you will be able to see the resources within the Lake Formation console in the same region as the source. Finally, go to your desired region in the Query Account and, from Lake Formation, create a resource link under Databases. You should be able to see the shared database that you just accepted from RAM. Once complete, you can navigate to Athena and query the database and tables as if they’re owned by the account.
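Accepting the invitations can also be scripted. Here is a minimal sketch using the RAM `get_resource_share_invitations` and `accept_resource_share_invitation` calls; the helper names are hypothetical, and for brevity it ignores pagination, so it only looks at the first page of invitations.

```python
def pending_invitation_arns(invitations, owner_account):
    """Pick the ARNs of pending invitations sent by the Owner Account."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv["status"] == "PENDING" and inv["senderAccountId"] == owner_account
    ]


def accept_owner_shares(owner_account, owner_region):
    """Run with Query Account credentials, in the region the share was created in."""
    import boto3  # imported lazily so the filter above works without boto3
    ram = boto3.client("ram", region_name=owner_region)
    response = ram.get_resource_share_invitations()  # first page only
    arns = pending_invitation_arns(response["resourceShareInvitations"], owner_account)
    for arn in arns:
        ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
    return arns
```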

Caveman mode: Creating target databases in Glue

Lake Formation is great and all, but what if we don’t want to use it? Can we still create a resource link with Glue? I couldn’t find any references to creating resource links outside of Lake Formation. Through my search, I did, however, find an issue in the Terraform AWS provider repository that references the inclusion of target databases to create resource links with the API. Calling get-database on a resource link created by Lake Formation showed that a target database was in fact configured. Bingo!
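The same check can be done from Python. This is a small sketch around the Glue `get_database` call; `link_target` and `describe_link` are names I made up, and the sample response in the usage note is illustrative, not real output.

```python
def link_target(get_database_response):
    """Return the TargetDatabase block if the database is a resource link, else None."""
    return get_database_response["Database"].get("TargetDatabase")


def describe_link(name, region):
    """Run with Query Account credentials in the region that holds the resource link."""
    import boto3  # imported lazily so link_target() works without boto3
    glue = boto3.client("glue", region_name=region)
    return link_target(glue.get_database(Name=name))
```

For a resource link, the result should be a dictionary with the CatalogId, DatabaseName, and Region of the source database; for a regular database it is None.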

Diagram showing a resource link created across regions and accounts to a Glue data catalog in the Owner account.

So, from the Owner Account, ensure that you have the S3 and Glue resource policy applied to your resources, like we did in the beginning, so your IAM Role in the Query Account can access the resources. Next, from the Query Account, go to the CloudShell console to create a target database in your desired region. First, create a JSON file with the following fields:

{
    "Name": "<database-name>",
    "TargetDatabase": {
        "CatalogId": "<catalog-id>",
        "DatabaseName": "<owner-database-name>",
        "Region": "<owner-region>"
    }
}

The Name can be whatever you’d like your resource link to be called; the CatalogId should be whatever the source catalog ID is. If you’re using the default catalog it’s probably the AWS account ID of the Owner Account. The DatabaseName and Region should match those of your source Glue resources. I’ve saved the file as input.json to use in the following command:

aws glue create-database --catalog-id <destination-catalog-id> --database-input file://input.json

The --catalog-id here is where you’d like the resource link to be created, in my case, the default catalog, so it’s the AWS account ID of the Query Account. From there, you can query the data from Athena; the database name will be the name of the resource link you specified within the JSON file.
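If you would rather do this from Python than from CloudShell, the equivalent might look like the sketch below. It uses the Glue `create_database` and Athena `start_query_execution` calls; the function names and the results location are placeholders of mine, and the `Region` field needs a reasonably recent boto3 release.

```python
def resource_link_input(link_name, owner_catalog_id, owner_database, owner_region):
    """Build the same DatabaseInput as the JSON file used with the CLI."""
    return {
        "Name": link_name,
        "TargetDatabase": {
            "CatalogId": owner_catalog_id,
            "DatabaseName": owner_database,
            "Region": owner_region,
        },
    }


def create_resource_link(query_account, query_region, database_input):
    """Run with Query Account credentials; creates the link in query_region."""
    import boto3  # imported lazily so resource_link_input() works without boto3
    glue = boto3.client("glue", region_name=query_region)
    glue.create_database(CatalogId=query_account, DatabaseInput=database_input)


def query_through_link(query_region, link_name, sql, output_location):
    """Start an Athena query against the resource link; returns the execution ID."""
    import boto3
    athena = boto3.client("athena", region_name=query_region)
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": link_name},
        # output_location is an S3 URI for query results in the Query Account
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```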

That’s it, success! If you have to set this up via infrastructure as code or want to avoid Lake Formation, you now know how to do it.

