Recently, I updated a Stack Overflow answer to include some information on Lake Formation that was new to me: resource links. The answer also covered a few other methods to query data cataloged in Glue across accounts and regions. Here’s the setup: you have a Glue Data Catalog and an S3 bucket in an AWS account called Owner Account, and there are users in different AWS accounts and regions that would like access to this data. For the sake of explanation, we’ll focus on one account that needs access from a different region; this account will be called Query Account.
Read more...
I’ve been spending more time browsing Stack Overflow recently and came across a question asking if it was possible to find duplicate objects within an S3 bucket. One way would be to hash each object prior to upload and store the value in a local or remote data store. If that’s not possible or too much overhead, I figured I could use S3 Metadata and Athena, two services I’ve covered on this blog not too long ago, to solve this. Athena alone has come up a few times this year, just because I’ve been finding interesting use cases for it. While I am an AWS employee, everything I’ve written, and will write, on this blog has always been out of personal interest. There are no sponsored posts here and all opinions are my own.
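As a sketch of the idea, assuming a hypothetical S3 Metadata journal table named my_bucket_metadata in a database named s3_metadata_db, with e_tag and record_type columns, a query along these lines would surface keys that share the same ETag:

```sql
-- Hypothetical database/table/column names; match these to your
-- S3 Metadata table schema before running.
SELECT e_tag,
       COUNT(*)       AS copies,
       ARRAY_AGG(key) AS keys
FROM "s3_metadata_db"."my_bucket_metadata"
WHERE record_type = 'CREATE'   -- only consider object-creation records
GROUP BY e_tag
HAVING COUNT(*) > 1;
```

One caveat worth noting: an ETag is only an MD5 digest of the content for single-part, non-KMS-encrypted uploads, so identical content uploaded via multipart can still produce differing ETags.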
Read more...
Amazon S3 Intelligent-Tiering moves your data to the most cost-effective S3 storage tier based on the object’s access pattern, for the price of $0.0025 per 1,000 objects it monitors. Since the movement is done by the service, you don’t know, or need to know, the access tier an object is currently in, as all objects can be retrieved synchronously. If you opt in to the asynchronous archive tiers, you can find out whether an object is in one of those tiers by requesting the HEAD of the object. This only works for the opt-in tiers; if you’d like to find out whether an object is in the Frequent Access, Infrequent Access, or Archive Instant Access tier, you will need to refer to Amazon S3 Inventory. S3 Inventory provides a snapshot of your objects’ metadata at a daily or weekly frequency, and this snapshot includes the S3 Intelligent-Tiering access tier, the key we are interested in.
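For illustration, assuming a hypothetical Athena table my_bucket_inventory built over the inventory files (the database name, partition value, and tier string below are all assumptions to adapt to your setup), a query like this would list the objects sitting in a given access tier:

```sql
-- Hypothetical database/table names, partition value, and tier string.
SELECT key,
       intelligent_tiering_access_tier
FROM "inventory_db"."my_bucket_inventory"
WHERE dt = '2024-01-07-01-00'  -- pick a single inventory snapshot
  AND intelligent_tiering_access_tier = 'ARCHIVE_INSTANT_ACCESS';
```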
Read more...
Ben Welsh and Katlyn Alo have created a course that walks through running your first Athena query, complete with sample data. I’ve written about Athena a few times on this blog and the course works as a great primer. This post will act as an addendum to the guide, specifically step 4, where you create an Athena database and table.
A database, in Athena, holds one or more tables, and a table points to the data and defines its schema. The creation of these resources in the guide follows a pattern I’ve seen a lot, where the database and table are created using SQL through Athena. It’s simple and does the job, but I’d like to document how to create a database and table using AWS Glue, an extract-transform-load data service, and highlight some comparisons between the two methods.
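For reference, the SQL path looks roughly like this, with hypothetical database, table, column, and bucket names standing in for the guide’s own:

```sql
-- Hypothetical names throughout; a CSV-backed table is assumed here.
CREATE DATABASE IF NOT EXISTS example_db;

CREATE EXTERNAL TABLE IF NOT EXISTS example_db.example_table (
  request_time string,
  url          string,
  status_code  int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://example-bucket/data/';
```

Either way the database and table end up in the Glue Data Catalog, since Athena stores its metadata there; the Glue route just talks to the catalog directly.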
Read more...
Recently, Simon Willison shared how he uses S3 event notifications with Lambda and DynamoDB to list recently uploaded files from an S3 bucket. The first thought that occurred to me was to use S3 Inventory, which provides a daily catalog of objects within a bucket, queryable through Athena. The second idea involved doing the same with the recently announced S3 Metadata feature. Both methods, I discovered, had already been suggested by others in the comments. In this post, I want to explore the S3 Metadata method to get my feet wet with the service.
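As a first sketch, assuming the same kind of hypothetical S3 Metadata journal table as above, listing the most recent uploads could be as simple as:

```sql
-- Hypothetical database/table names; adjust to your S3 Metadata table.
SELECT key,
       size,
       last_modified_date
FROM "s3_metadata_db"."my_bucket_metadata"
WHERE record_type = 'CREATE'       -- uploads only, not deletes
ORDER BY record_timestamp DESC     -- newest journal records first
LIMIT 25;
```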
Read more...
Grafana can use Amazon Athena as a data source, allowing you to run SQL queries to visualize data. The Athena table data types are conveniently inherited in Grafana for use in dashboard panels. If the data types in Athena are not exactly how you’d like them in Grafana, you can still apply conversion functions.
In this case the timestamp column in Athena is formatted as a string, and I do not have the ability to adjust the table in Athena (which is normally what you’d want to do). If your Athena table doesn’t have a column in a DATE, TIMESTAMP, or TIME format, you will not be able to natively use a panel that relies on a timestamp. You can parse the timestamp string using the parse_datetime() function, which expects the string and the format the string is in. The column will then appear as a timestamp to Grafana.
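As a minimal sketch, assuming a string column named event_time in a 'yyyy-MM-dd HH:mm:ss' layout (both the column name and the pattern are assumptions), the panel query might look like:

```sql
-- parse_datetime() takes the string and a Joda-Time format pattern;
-- adjust the column name and pattern to your data.
SELECT parse_datetime(event_time, 'yyyy-MM-dd HH:mm:ss') AS time,
       value
FROM "my_db"."my_table"
ORDER BY 1;
```

parse_datetime() returns a timestamp with time zone, which the Athena data source then hands to Grafana as a usable time field.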
Read more...