Using S3 Metadata to list recently uploaded objects

Posted: | Tags: cloud aws til storage

Recently, Simon Willison shared how he uses S3 event notifications with Lambda and DynamoDB to list recently uploaded files from an S3 bucket. The first thought that occurred to me was to use S3 Inventory, which provides a daily catalog of objects within a bucket, queryable through Athena. The second idea involved doing the same with the recently announced S3 Metadata feature. Both methods, I discovered, had already been raised by others. In this post, I want to explore the S3 Metadata method to get my feet wet with the service.

Disclaimer: Do not take the information here as a good or best practice. The purpose of this site is to post what I have learned in somewhat real-time.

S3 Metadata uses S3 Tables (built on Apache Iceberg tables) to inventory uploads and updates made to objects in a bucket, including user-defined metadata.

Create an S3 Table bucket and enable S3 Metadata

To begin, an S3 Table bucket needs to be created, which can be done through the console or CLI. While creating a bucket, I also enabled Integration with AWS analytics services - Preview. This feature will be needed when using Athena later. We do not need to create a namespace or table, as these will be created for us by S3 Metadata next.
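For the CLI route, a minimal sketch might look like the following. The bucket name and region are placeholders I made up, and the command is wrapped in a function so that nothing runs until it is called. As far as I can tell, this only creates the table bucket; the analytics integration mentioned above is a separate setting.

```shell
# Placeholder values (assumptions, not from the steps above); replace with your own.
TABLE_BUCKET_NAME="my-metadata-table-bucket"
AWS_REGION_NAME="us-east-1"

# Create the S3 Table bucket. The response contains the table bucket ARN,
# which is the TABLE_BUCKET_ARN used in later commands.
create_table_bucket() {
  aws s3tables create-table-bucket \
    --name "$TABLE_BUCKET_NAME" \
    --region "$AWS_REGION_NAME"
}
```

Calling create_table_bucket from a shell with credentials configured should print the ARN of the new table bucket.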

From the S3 console, select the general-purpose bucket you’d like to monitor, and under the Metadata tab, select Create metadata configuration. From there, you can select the previously created S3 Table bucket and enter a metadata table name. This step can also be done through the CLI, SDK, or API. The table will be created under the reserved aws_s3_metadata namespace in the selected S3 Table bucket.
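A rough CLI equivalent of the console step is sketched below. The bucket name, ARN, and table name are hypothetical placeholders, and the subcommand shown is the one I understand to have shipped with the initial S3 Metadata release, so check the current CLI reference before relying on it.

```shell
# Placeholder values (assumptions); replace with your own.
SOURCE_BUCKET="my-general-purpose-bucket"
TABLE_BUCKET_ARN="arn:aws:s3tables:us-east-1:111122223333:bucket/my-metadata-table-bucket"
TABLE_NAME="my_metadata_table"

# Build the JSON document that points S3 Metadata at the table bucket.
metadata_table_config() {
  printf '{"S3TablesDestination":{"TableBucketArn":"%s","TableName":"%s"}}' \
    "$TABLE_BUCKET_ARN" "$TABLE_NAME"
}

# Attach the metadata table configuration to the general-purpose bucket.
create_metadata_configuration() {
  aws s3api create-bucket-metadata-table-configuration \
    --bucket "$SOURCE_BUCKET" \
    --metadata-table-configuration "$(metadata_table_config)"
}
```

Keeping the JSON in its own function makes it easy to print and inspect before anything is sent to AWS.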

Grant permissions to query the table

Lake Formation permissions need to be granted before the created metadata table can be queried in Athena. Enabling the integration in the previous step and granting Lake Formation permissions can be skipped if you are querying the table directly through Spark, although I haven’t personally tested this.

From the Lake Formation console, under Data Permissions select Grant then choose the IAM users, roles, or other principals that will be running the queries on the table. Under LF-Tags or catalog resources select Named Data Catalog resources and for Catalog choose the data catalog created for your S3 Table. This will be in the format <account_id>:s3tablescatalog/<table_bucket_name>. For Databases select the metadata namespace, aws_s3_metadata, and for Tables select the previously created S3 Table. Under Table permissions, we only need the Select Table permissions. Finally, click Grant.

These instructions in a more generic form can also be found in the AWS documentation.
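The same grant can probably be scripted. The sketch below is untested and assumes that the Lake Formation CLI accepts the catalog ID in the <account_id>:s3tablescatalog/<table_bucket_name> format described above; the account ID, role ARN, and names are made-up placeholders.

```shell
# Placeholder values (assumptions); replace with your own.
ACCOUNT_ID="111122223333"
TABLE_BUCKET_NAME="my-metadata-table-bucket"
TABLE_NAME="my_metadata_table"
PRINCIPAL_ARN="arn:aws:iam::111122223333:role/my-athena-role"

# Build the grant request: SELECT on the metadata table for one principal.
grant_request_json() {
  printf '{"Principal":{"DataLakePrincipalIdentifier":"%s"},' "$PRINCIPAL_ARN"
  printf '"Resource":{"Table":{"CatalogId":"%s:s3tablescatalog/%s",' "$ACCOUNT_ID" "$TABLE_BUCKET_NAME"
  printf '"DatabaseName":"aws_s3_metadata","Name":"%s"}},' "$TABLE_NAME"
  printf '"Permissions":["SELECT"]}'
}

# Send the grant request to Lake Formation.
grant_select_on_metadata_table() {
  aws lakeformation grant-permissions --cli-input-json "$(grant_request_json)"
}
```

Only SELECT is requested here, mirroring the single table permission chosen in the console.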

Query with Athena

Provided Athena is already set up on the account, the S3 metadata table can be queried. The table can be reached from "s3tablescatalog/TABLE_BUCKET_NAME"."aws_s3_metadata"."TABLE_NAME" by replacing the TABLE_BUCKET_NAME and TABLE_NAME placeholders.

The following query will get the 5 most recent records from the table.

SELECT *
FROM "s3tablescatalog/TABLE_BUCKET_NAME"."aws_s3_metadata"."TABLE_NAME"
ORDER BY record_timestamp DESC
LIMIT 5

The table schema is described in the documentation. The record_timestamp is used to sort the records, and the record_type can be used to determine the action, such as CREATE or DELETE. The last_modified_date column is populated for records with the CREATE record type and indicates the time the object was created or modified.
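To list only uploads rather than every record, the query can be narrowed to CREATE records. The sketch below builds such a query and submits it through the Athena CLI. The results location and the names are hypothetical placeholders, and I am assuming the fully qualified table name is enough for Athena to resolve the catalog.

```shell
# Placeholder values (assumptions); replace with your own.
TABLE_BUCKET_NAME="my-metadata-table-bucket"
TABLE_NAME="my_metadata_table"
RESULTS_LOCATION="s3://my-athena-results-bucket/metadata-queries/"

# Build a query returning the most recent CREATE records.
# $1 is the number of rows to return (defaults to 5).
recent_uploads_query() {
  printf 'SELECT key, size, last_modified_date, record_timestamp '
  printf 'FROM "s3tablescatalog/%s"."aws_s3_metadata"."%s" ' "$TABLE_BUCKET_NAME" "$TABLE_NAME"
  printf "WHERE record_type = 'CREATE' "
  printf 'ORDER BY record_timestamp DESC LIMIT %d' "${1:-5}"
}

# Submit the query to Athena; results are written to RESULTS_LOCATION.
run_recent_uploads_query() {
  aws athena start-query-execution \
    --query-string "$(recent_uploads_query "$@")" \
    --result-configuration "OutputLocation=$RESULTS_LOCATION"
}
```

Running recent_uploads_query 10 prints the SQL for the ten most recent uploads, which can also be pasted into the Athena console instead of going through the CLI.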

Cleaning up

Delete S3 Table resources

Unlike creation, deletion can only be done from the CLI. I used CloudShell to run the commands from within the console.

Delete the S3 table.

aws s3tables delete-table --table-bucket-arn TABLE_BUCKET_ARN --namespace aws_s3_metadata --name TABLE_NAME 

Delete the S3 Table namespace.

aws s3tables delete-namespace --table-bucket-arn TABLE_BUCKET_ARN --namespace aws_s3_metadata

Delete the S3 Table bucket.

aws s3tables delete-table-bucket --table-bucket-arn TABLE_BUCKET_ARN

Disable S3 Metadata

From the general-purpose bucket's Metadata tab, select Delete and confirm the removal of the metadata configuration.
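If you would rather stay in the CLI for this step too, something along these lines should work. The bucket name is a placeholder, and the subcommands are the ones I understand to pair with the original metadata table configuration, so treat this as a sketch.

```shell
# Placeholder value (assumption); replace with your own.
SOURCE_BUCKET="my-general-purpose-bucket"

# Show the current metadata table configuration, if one exists.
show_metadata_configuration() {
  aws s3api get-bucket-metadata-table-configuration --bucket "$SOURCE_BUCKET"
}

# Remove the metadata table configuration from the general-purpose bucket.
delete_metadata_configuration() {
  aws s3api delete-bucket-metadata-table-configuration --bucket "$SOURCE_BUCKET"
}
```

Calling show_metadata_configuration before and after the delete is a quick way to confirm the configuration is gone.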

Delete the data catalog

I have found that manually deleting the data catalog created in Lake Formation disables the integration we enabled earlier. From the Lake Formation console, under Catalog, select s3tablescatalog and then Actions and Delete.

Costs

Charges associated with the solution by service are:

  1. S3 Metadata
    • $0.45 per million updates
  2. S3 Tables [1]
    • First 50TB is charged at $0.0265 per GB per month
    • PUT, POST, LIST requests at $0.005 per 1,000 requests
    • GET and all other requests at $0.0004 per 1,000 requests
    • Object monitoring at $0.45 per month
  3. Glue Data Catalog (only with analytics integration) [2]
    • Metadata storage, first million free
    • Metadata requests, first million free
  4. Athena
    • SQL queries, $5.00 per TB of data scanned

I have purposely omitted any pricing examples as this is highly dependent on the data request and query patterns.

Closing thoughts

S3 Metadata provides an easy-to-set-up, low-code solution to get the latest object updates for an S3 bucket. S3 Event Notifications would deliver those updates sooner than S3 Metadata does, although a delay of a few minutes might be acceptable for some use cases. The S3 Inventory option has the slowest update time at 24 hours.

In terms of cost, this solution may end up being the most expensive by a few cents compared to the S3 inventory option, with the S3 Event Notifications and Lambda being the cheapest. However, depending on the usage, these numbers will change, so I recommend doing your own calculations. Since this solution involves a number of services, it’s easy to see why it may be the most expensive.

A final thought: since this is a new feature, it is only available in us-east-1, us-east-2, and us-west-2 at the moment. More limitations are listed in the AWS documentation.

I’d be happy to hear suggestions, comments, and questions. Reach out to me on Mastodon or email if you have anything to ask or add.


  1. Costs for table maintenance have been excluded for brevity but can be found on the pricing page, charged per object and GB processed.

  2. Additional costs may apply; more details are available on the pricing page.

