Finding duplicate objects in an S3 bucket
Posted: | Tags: aws cloud athena storage til

I’ve been spending more time browsing StackOverflow recently and came across a question asking whether it was possible to find duplicate objects within an S3 bucket. One way would be to hash each object prior to upload and store the value in a local or remote data store. If that’s not possible, or too much overhead, I figured I could use S3 Metadata and Athena to solve this, services I’ve covered on this blog not too long ago. Athena alone has come up a few times this year, just because I’ve been finding interesting use cases for it. While I am an AWS employee, everything I’ve written, and will write, on this blog has always been out of personal interest. There are no sponsored posts here, and all opinions are my own.
With that out of the way, I’ll go over the solution I shared on StackOverflow with some additional commentary. Each object within S3 has an entity tag (ETag), which is a hash of its contents, not its metadata. The ETag can be an MD5 hash, but this depends on how the object was created and encrypted, so it may not always work for finding duplicates. However, assuming each object and its duplicate are created and uploaded in the same way, the following method should work.1 With S3 Metadata enabled on a bucket, changes to objects are logged along with their metadata, which includes the ETag.
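To make that assumption concrete: for a single-part, unencrypted upload, the ETag is the hex-encoded MD5 of the object’s bytes, so identical contents yield identical ETags. A minimal local sketch (it deliberately ignores multipart uploads and SSE-KMS/SSE-C encryption, where the ETag is not a plain content MD5):

```python
import hashlib

def single_part_etag(data: bytes) -> str:
    # For a single-part, unencrypted upload, S3 assigns an ETag equal to
    # the hex-encoded MD5 digest of the object's contents.
    return hashlib.md5(data).hexdigest()

# Two objects with identical contents share an ETag, regardless of key name.
print(single_part_etag(b"same bytes") == single_part_etag(b"same bytes"))   # True
print(single_part_etag(b"same bytes") == single_part_etag(b"other bytes"))  # False
```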
Prerequisites
The steps to create an S3 Table and enable S3 Metadata, grant permissions to query the table, and use the Athena editor haven’t changed, so I’ve linked to the relevant sections from my other post.
Building the query
The part I’m excited to show off is the query. It’s large and may not be optimal, but it does work.
Find all current objects
First, we need a list of all objects currently in the bucket2, so we filter out the deleted ones.
-- Get all created objects and remove deleted ones through a left exclusive join
SELECT created.e_tag, created.key
FROM "aws_s3_metadata"."<table-bucket-name>" created
LEFT OUTER JOIN (
    -- Get all deleted objects
    SELECT e_tag, key
    FROM "aws_s3_metadata"."<table-bucket-name>"
    WHERE record_type = 'DELETE'
) deleted ON deleted.key = created.key
WHERE created.record_type = 'CREATE' AND deleted.key IS NULL
GROUP BY created.e_tag, created.key
The inner query gets all the deleted objects, and the outer query gets all created objects within the bucket. Using a left exclusive join, we’re left with a list of all objects that currently remain in the bucket.2
Another way to accomplish this is using the NOT IN operator.
SELECT key, e_tag
FROM "aws_s3_metadata"."<table-bucket-name>"
WHERE record_type = 'CREATE'
AND key NOT IN (
SELECT key
FROM "aws_s3_metadata"."<table-bucket-name>"
WHERE record_type = 'DELETE'
)
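To sanity-check the join logic without touching AWS, here’s a toy version against an in-memory SQLite table standing in for the metadata table (the column names mirror the schema; the rows and aliases are made up for illustration):

```python
import sqlite3

# Toy stand-in for the S3 Metadata table: one row per change event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (record_type TEXT, key TEXT, e_tag TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("CREATE", "a.txt", "etag1"),
        ("CREATE", "b.txt", "etag1"),  # same contents as a.txt
        ("CREATE", "c.txt", "etag2"),
        ("DELETE", "c.txt", "etag2"),  # c.txt is no longer in the bucket
    ],
)

# Same left exclusive join shape as the Athena query.
rows = conn.execute("""
    SELECT cur.e_tag, cur.key
    FROM events cur
    LEFT OUTER JOIN (
        SELECT key FROM events WHERE record_type = 'DELETE'
    ) deleted ON deleted.key = cur.key
    WHERE cur.record_type = 'CREATE' AND deleted.key IS NULL
    GROUP BY cur.e_tag, cur.key
""").fetchall()
print(sorted(rows))  # [('etag1', 'a.txt'), ('etag1', 'b.txt')]
```

Only the two objects still in the bucket come back; the created-then-deleted c.txt is excluded.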
Find all duplicates
With a list of all current objects and their entity tags, we can GROUP BY and COUNT any e_tag that appears more than once in our list. The results are the entity tags of our duplicate objects.
-- List all duplicates
SELECT e_tag
FROM (
-- Current object query
)
GROUP BY e_tag HAVING COUNT(e_tag) > 1
From here we use the IN operator to get the keys for each listed e_tag.
SELECT key, e_tag
FROM "aws_s3_metadata"."<table-bucket-name>"
WHERE e_tag IN (
    -- List all duplicate e_tag query
)
The final query
Joining each part together we get the final query.
-- List duplicate objects
SELECT key, e_tag
FROM "aws_s3_metadata"."<table-bucket-name>"
WHERE e_tag IN (
    -- List all duplicates
    SELECT e_tag
    FROM (
        -- Get all created objects and remove deleted ones through a left exclusive join
        SELECT created.e_tag, created.key
        FROM "aws_s3_metadata"."<table-bucket-name>" created
        LEFT OUTER JOIN (
            -- Get all deleted objects
            SELECT e_tag, key
            FROM "aws_s3_metadata"."<table-bucket-name>"
            WHERE record_type = 'DELETE'
            GROUP BY e_tag, key
        ) deleted ON deleted.key = created.key
        WHERE created.record_type = 'CREATE' AND deleted.key IS NULL
    )
    GROUP BY e_tag
    HAVING COUNT(e_tag) > 1
);
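As a quick sanity check of the assembled query’s shape, here is the same logic run against a toy in-memory SQLite table (column names mirror the metadata schema; the rows are invented):

```python
import sqlite3

# Toy stand-in for the S3 Metadata table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (record_type TEXT, key TEXT, e_tag TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("CREATE", "a.txt", "etag1"),
        ("CREATE", "copy-of-a.txt", "etag1"),  # duplicate contents
        ("CREATE", "b.txt", "etag2"),          # unique contents
        ("CREATE", "c.txt", "etag3"),
        ("DELETE", "c.txt", "etag3"),          # deleted, so ignored
    ],
)

# Mirrors the final Athena query: current objects, grouped by e_tag,
# filtered to e_tags that appear more than once.
dupes = conn.execute("""
    SELECT key, e_tag
    FROM events
    WHERE e_tag IN (
        SELECT e_tag
        FROM (
            SELECT cur.e_tag, cur.key
            FROM events cur
            LEFT OUTER JOIN (
                SELECT key FROM events WHERE record_type = 'DELETE'
            ) deleted ON deleted.key = cur.key
            WHERE cur.record_type = 'CREATE' AND deleted.key IS NULL
            GROUP BY cur.e_tag, cur.key
        )
        GROUP BY e_tag
        HAVING COUNT(e_tag) > 1
    )
""").fetchall()
print(sorted(dupes))  # [('a.txt', 'etag1'), ('copy-of-a.txt', 'etag1')]
```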
Finally, when you are done and wish to remove the resources you configured, the clean-up step in my previous post walks through deleting the table and its resources and disabling the S3 Metadata configuration.
While writing this I found a post that describes how to calculate an object’s entity tag prior to upload. I have not had the time to test it thoroughly, but from quick tests it does seem to work. This might warrant a follow-up post from me. ↩︎
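For reference, the commonly described recipe (unofficial, so treat it as an assumption) for a multipart upload’s ETag is: MD5 each part, concatenate the raw digests, MD5 that, and append -&lt;part count&gt;. A sketch, where the part size must match whatever the uploader actually used:

```python
import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    # Unofficial multipart ETag recipe: MD5 of the concatenated
    # per-part MD5 digests, suffixed with "-<number of parts>".
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"

# 12 bytes with a 5-byte part size splits into 3 parts, so the ETag ends in "-3".
print(multipart_etag(b"x" * 12, 5))
```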
Currently, S3 Metadata only records changes to object metadata after it has been enabled. Objects that existed beforehand don’t show up in the table unless they’ve been modified, so this only works for objects added or deleted after enabling it. ↩︎ ↩︎