IAM eventual consistency and Terraform

Tags: cloud aws terraform til

Update 2023-12-19: I heard back on the issue I raised. The AWS Backup access policy and IAM role problem has been resolved in Terraform AWS Provider v5.30.0 via this Pull Request, thanks to @nam054 and @johnsonaj. The delay has now been added to the provider itself and I’ve confirmed it works! You can disregard the rest of this post, or continue reading if you’re interested.
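If you would rather rely on the provider fix than on the sleep timer described later in this post, one option is to pin a minimum provider version. This is a minimal sketch: the only detail taken from the update above is the v5.30.0 version number, and the rest is a standard `required_providers` block that you would merge into your own configuration.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # v5.30.0 is the release mentioned in the update above.
      # The lower bound is the only part that matters here.
      version = ">= 5.30.0"
    }
  }
}
```

As far as I know this only covers the AWS Backup access policy case from the issue. Other resources that reference freshly created IAM roles may still need their own handling.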

I recently came across an InvalidParameterValueException when trying to add a newly created AWS IAM role as a principal within an AWS Backup access policy in Terraform. Applying the Terraform module a second time worked. After repeated trials I found the module always failed on the first attempt but succeeded on the second. It seemed odd, and after an embarrassingly long time searching online I noticed a pattern in the errors reported in issues on the AWS Terraform Provider repository. These included MalformedPolicyDocument, InvalidPolicy and InvalidParameterValue, among others, all related to referencing recently created IAM resources.

This led me directly to a sub-heading of the troubleshooting section of the AWS IAM User Guide, titled “Changes that I make are not always immediately visible”. In short, IAM is eventually consistent by nature: a request to change a resource is acknowledged before the change has propagated to every endpoint. The documentation offers guidance on how best to work with eventual consistency. It reads:

You must design your global applications to account for these potential delays… make IAM changes in a separate initialization or setup routine that you run less frequently. Also, be sure to verify that the changes have been propagated before production workflows depend on them.

My application isn’t global, but it is still affected by this issue. Around the AWS Terraform provider there’s been a lot of discussion on how to work around it, including pull requests that add extra checks or timeouts to individual services affected by IAM’s eventual consistency. However, I haven’t seen a single solution that covers all services that depend on IAM, or a clean way to verify that IAM resources have been created and propagated before continuing with their dependencies.

So, I wish I had a more elegant solution to provide here, but all I can do is leave you with a sleep timer. The following uses time_sleep from HashiCorp’s time provider. I found 20 seconds to work consistently in my tests. Here the timer depends on the IAM resource, and the downstream resource can then depend on the timer.

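resource "time_sleep" "iam_delay" {
  # Start the timer only once the IAM role has been created.
  depends_on = [aws_iam_role.the_role]

  # Wait 20 seconds on create to give IAM time to propagate the role.
  create_duration = "20s"
}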
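To show the other half of the chain, here is a rough sketch of a downstream resource that waits on the timer. Apart from `aws_iam_role.the_role` and `time_sleep.iam_delay`, every name here is hypothetical, as are the trust policy and the `backup:CopyIntoBackupVault` action. They stand in for whatever your own module defines and are not the exact configuration I was running.

```hcl
# Hypothetical role; "the_role" matches the reference in the timer above.
resource "aws_iam_role" "the_role" {
  name = "example-backup-copy-role"

  # Illustrative trust policy, not taken from the original module.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "backup.amazonaws.com" }
    }]
  })
}

# Hypothetical vault that the access policy is attached to.
resource "aws_backup_vault" "example" {
  name = "example-vault"
}

resource "aws_backup_vault_policy" "example" {
  backup_vault_name = aws_backup_vault.example.name

  # Referencing the role ARN as a principal is the step that fails
  # if IAM has not finished propagating the new role.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = aws_iam_role.the_role.arn }
      Action    = "backup:CopyIntoBackupVault"
      Resource  = "*"
    }]
  })

  # Wait for the timer, which in turn waits for the role.
  depends_on = [time_sleep.iam_delay]
}
```

The reference to `aws_iam_role.the_role.arn` already gives Terraform an implicit dependency on the role. The explicit `depends_on` is what inserts the delay between the role being created and the policy being applied.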

The time_sleep documentation also notes:

In many cases, this resource should be considered a workaround for issues that should be reported and handled in downstream Terraform Provider logic.

A workaround it is. I did raise an issue, but with neither the programming experience nor the time to tackle the problem myself, I will leave this workaround up for others to find.

