Querying S3 Tables with DuckDB
Posted: | Tags: s3, duckdb
DuckDB can query S3 Tables. When this was first announced in beta I was unable to get it to work and subsequently gave up. A little over a year later I decided to give it another go, and now that it's generally available, to my surprise, it just works. There are still guides on how to set this up with nightly releases and other feature flags, but nothing that shows the generally available path, so here are my steps to get started.
Support for S3 Tables is included in the core extensions through Iceberg. To authenticate you will need the aws extension, which is also a core extension. I'm using DuckDB version 1.5.0 for this.
I also assume you have an S3 Table bucket, namespace and table with adequate permissions. If not, you can walk through the official AWS tutorial.
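Both extensions ship as core extensions, so recent DuckDB builds will usually autoload them. If yours doesn't, you can install and load them explicitly before continuing (a minimal sketch):

```sql
-- Only needed if the extensions are not autoloaded in your build
INSTALL iceberg;
INSTALL aws;
LOAD iceberg;
LOAD aws;
```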
Authenticate with AWS
First, I authenticate with AWS using the CLI before launching DuckDB; you can do something similar. Then open the DuckDB CLI and create the AWS secret.
CREATE SECRET (
TYPE S3,
PROVIDER credential_chain
);
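If you want to confirm the secret was registered before attaching anything, DuckDB exposes a duckdb_secrets() table function; a quick check looks like:

```sql
-- Lists registered secrets; the unnamed one above shows up with a default name
SELECT name, type, provider FROM duckdb_secrets();
```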
Attach the S3 Table bucket
Next, attach the S3 Table bucket. There are two ways to go about this: the first uses the Glue endpoint, demonstrated in the AWS blog, and the second uses the s3_tables endpoint, demonstrated in the DuckDB docs. From my brief tests both work identically, so pick whichever approach suits you.
glue endpoint
ATTACH '<account-id>:s3tablescatalog/<bucket-name>' AS duck_db
(TYPE ICEBERG, ENDPOINT_TYPE 'GLUE');
s3_tables endpoint
ATTACH '<s3-table-bucket-arn>' AS duck_db
(TYPE iceberg, ENDPOINT_TYPE s3_tables);
Query the S3 Table
With the endpoint created you should now be able to see all the tables in your S3 Table bucket.
SHOW ALL TABLES;
┌──────────┬──────────────┬─────────────────┬──────────────┬──────────────┬───────────┐
│ database │ schema │ name │ column_names │ column_types │ temporary │
│ varchar │ varchar │ varchar │ varchar[] │ varchar[] │ boolean │
├──────────┼──────────────┼─────────────────┼──────────────┼──────────────┼───────────┤
│ duck_db │ my_namespace │ a_generic_table │ [__] │ [UNKNOWN] │ false │
│ duck_db │ my_namespace │ my_table │ [__] │ [UNKNOWN] │ false │
└──────────┴──────────────┴─────────────────┴──────────────┴──────────────┴───────────┘
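If you'll mostly be working inside one namespace, you can set it as the default schema so queries don't need the full three-part name; a sketch, assuming the my_namespace namespace from above:

```sql
-- Make duck_db.my_namespace the default, then table names can be used bare
USE duck_db.my_namespace;
SHOW TABLES;
SELECT * FROM a_generic_table LIMIT 5;
```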
Now we can try querying a specific table.
SELECT * FROM duck_db.my_namespace.a_generic_table LIMIT 5;
┌──────────────┬───────────────┬────────────────────────────────┬───┬──────────────┬─────────────┐
│ an_id │ an_type │ description │ … │ aisle │ owner │
│ varchar │ varchar │ varchar │ … │ varchar │ varchar │
├──────────────┼───────────────┼────────────────────────────────┼───┼──────────────┼─────────────┤
│ e43a3f2a-cc8 │ thermostat │ Regulate motor temperature │ … │ WLD │ wekjsnij │
│ 8e7cb55b-ec7 │ motor │ Electric motor │ … │ WDS │ wekjsnij │
│ 7bdaf908-091 │ gearbox │ Modular gearbox │ … │ WSP │ wekjsnij │
│ 8e7cb55b-ec7 │ motor │ Electric motor │ … │ WLC │ wekjsnij │
│ 2ef23471-1d5 │ thermostat │ Amazon Simple Queue Service │ … │ WCV │ wekjsnij │
└──────────────┴───────────────┴────────────────────────────────┴───┴──────────────┴─────────────┘
5 rows  23 columns (5 shown)
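Since SHOW ALL TABLES elides the column names and types, DESCRIBE is handy for inspecting an attached Iceberg table's schema before querying it:

```sql
-- Prints one row per column with its name and type
DESCRIBE duck_db.my_namespace.a_generic_table;
```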
Python approach
The same steps from the CLI carry over to Python. I have the s3_tables endpoint variant commented out in favour of the Glue one, but you can pick whichever works best for you.
import duckdb
account_id = "0123456789"
bucket_name = "my_new_bucket"
s3_table_arn = "<s3_table_arn>"
con = duckdb.connect()
con.sql("""
CREATE SECRET (
TYPE S3,
PROVIDER credential_chain
)
""")
con.sql(f"""
ATTACH '{account_id}:s3tablescatalog/{bucket_name}' AS duck_db
(TYPE ICEBERG, ENDPOINT_TYPE 'GLUE')
""")
# con.sql(f"""
# ATTACH '{s3_table_arn}' AS duck_db (
# TYPE iceberg,
# ENDPOINT_TYPE s3_tables
# );
# """)
print(con.sql("SHOW ALL TABLES").fetchdf())
df = con.sql("""
    SELECT * FROM duck_db.my_namespace.a_generic_table LIMIT 10
""").fetchdf()
print(df)
That’s it!