Refactoring Terraform

Posted on 15 September 2022, updated on 17 February 2023.

At Padok, we use Terraform most days. We manage large codebases that provision long-lived cloud infrastructure. Such codebases often need to evolve, and regular refactoring is key to success. Terraform's nature can make it tedious to refactor an existing codebase, so teams are often reluctant to refactor. To make refactoring painless, we have developed a tool to help us. This tool is called tfautomv.

Refactoring keeps technical debt in check

On large, long-lived projects, infrastructure usually plays a key role. At Padok, we write our infrastructure as code, allowing us to manage multiple environments with minimal drift and make iterative changes to our infrastructure daily.

Teams usually have two models of their infrastructure. The first is their code, which matches the infrastructure as it actually exists in the cloud. The second is their mental model.

This mental model is what SREs use when they think about and discuss infrastructure. It is often modular: SREs have a high-level view of their infrastructure and smaller, detailed views of various resources and how they interact.

When these two models drift — when a team's mental model differs from their actual code — we call this technical debt. Managing this debt is a core part of any software engineer's job. So when the mental model evolves, which is only natural, maintainers should refactor the codebase so that both models stay in sync.

This doesn't mean the codebase needs major refactoring often; a team's mental model rarely changes dramatically. Making small, regular changes to the codebase's structure is an excellent way to keep technical debt under control.

Codebases that have drifted far from the team's mental model, as tends to happen over time, can profit from a large refactor. Such an endeavor can be a good investment for teams that want to be more efficient when working with Terraform.

Pains of refactoring with Terraform

Terraform is inherently stateful. It keeps track of all resources it manages in its state. Let's say you have a single VM already provisioned with Terraform:

resource "aws_instance" "single" {
  ami           = "ami-09e513e9eacab10c1" # Ubuntu 22.04
  instance_type = "t3.large"
}

And you want to add a second instance where you will deploy a replica:

resource "aws_instance" "single" {
  ami           = "ami-09e513e9eacab10c1" # Ubuntu 22.04
  instance_type = "t3.large"
}

resource "aws_instance" "replica" {
  ami           = "ami-09e513e9eacab10c1" # Ubuntu 22.04
  instance_type = "t3.large"
}

You run Terraform and now have a second instance. All is good. However, you no longer think of the first instance as the "single" instance, as your code says, but as the "primary" instance. Your mental model has changed, and your code should change with it:

resource "aws_instance" "primary" {
  ami           = "ami-09e513e9eacab10c1" # Ubuntu 22.04
  instance_type = "t3.large"
}

resource "aws_instance" "replica" {
  ami           = "ami-09e513e9eacab10c1" # Ubuntu 22.04
  instance_type = "t3.large"
}

This seems like an insignificant change, but when you run Terraform, you get a scary message:

Plan: 1 to add, 0 to change, 1 to destroy.

Terraform wants to destroy your primary instance! That is not what you want. You don't want to change your infrastructure at all. Why does Terraform believe differently?

Terraform kept track of your primary instance in its state under this ID:

aws_instance.single

Since you renamed the resource, Terraform no longer sees it in your code and wants to destroy it. Based on your code, it also wants to create a new resource with this ID:

aws_instance.primary
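
Both resources show up side by side in the plan output, abridged here:

# aws_instance.single will be destroyed
- resource "aws_instance" "single" { ... }

# aws_instance.primary will be created
+ resource "aws_instance" "primary" { ... }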

Terraform has no way of knowing these two resources are the same. You must inform Terraform by changing its state. Before Terraform 1.1, this command did that:

terraform state mv aws_instance.single aws_instance.primary

With Terraform 1.1 and above, you can perform this migration declaratively by adding a moved block to your code:

moved {
  from = aws_instance.single
  to   = aws_instance.primary
}

When you rerun Terraform, it now knows that no changes are required:

Plan: 0 to add, 0 to change, 0 to destroy.

Running the commands or writing the required blocks is painless on small codebases. On larger codebases, refactoring your code can quickly lead to regret (this really happened):

Plan: 316 to add, 0 to change, 316 to destroy.

Writing all the necessary moved blocks can be very time-consuming and error-prone. Teams are reluctant to refactor their Terraform codebase because they have better things to do with their time.

That is why we decided to make this process much, much faster.

Introducing tfautomv

Imagine a single command that writes all the moved blocks required for Terraform to take your refactoring into account. That is what tfautomv does.

We are proud to make this tool fully open-source. At Padok, we are all about sharing expertise, and we feel this is in line with the open-source community's values.

To use tfautomv, all you need to do is run it where you would run Terraform:

tfautomv

It will list the resources Terraform wishes to destroy and create, find resource pairs with the same attribute values, and write the moved blocks necessary to migrate each resource's state to where it needs to be.
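
To give a feel for the matching step, here is a minimal sketch in Go. The resource type and matchMoves function are hypothetical illustrations of the idea, not tfautomv's actual code or data model; the real tool handles many more subtleties:

package main

import (
	"fmt"
	"reflect"
)

// resource is a simplified stand-in for what a tool could derive from a
// Terraform plan. Hypothetical type for illustration only.
type resource struct {
	address    string            // e.g. "aws_instance.single"
	rtype      string            // e.g. "aws_instance"
	attributes map[string]string // attribute values from the plan
}

// matchMoves pairs resources Terraform plans to destroy with resources it
// plans to create when they share a type and identical attribute values.
// A resource with several candidate matches is skipped: no risks taken.
func matchMoves(destroyed, created []resource) map[string]string {
	moves := make(map[string]string)
	for _, d := range destroyed {
		var matches []string
		for _, c := range created {
			if c.rtype == d.rtype && reflect.DeepEqual(c.attributes, d.attributes) {
				matches = append(matches, c.address)
			}
		}
		if len(matches) == 1 {
			moves[d.address] = matches[0]
		}
	}
	return moves
}

func main() {
	attrs := map[string]string{"ami": "ami-09e513e9eacab10c1", "instance_type": "t3.large"}
	destroyed := []resource{{address: "aws_instance.single", rtype: "aws_instance", attributes: attrs}}
	created := []resource{{address: "aws_instance.primary", rtype: "aws_instance", attributes: attrs}}

	// Print a moved block for each unambiguous match.
	for from, to := range matchMoves(destroyed, created) {
		fmt.Printf("moved {\n  from = %s\n  to   = %s\n}\n", from, to)
	}
}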

If tfautomv finds multiple matches for the same resource, it doesn't take any risks and does nothing. It is then your responsibility to write the corresponding block. In our experience, these cases are rare.

Users often want to know why a resource did not match any others, so we offer a flag that prints tfautomv's analysis:

tfautomv -show-analysis

The detailed output of tfautomv's analysis contains all the information you need:

[Screenshot: detailed output of tfautomv's analysis]

Older codebases may use a version of Terraform that does not support moved blocks. For those use cases, tfautomv can also produce copy-pastable commands:

tfautomv -output=commands
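
For the rename from earlier, the generated output would be the same command we ran manually:

terraform state mv aws_instance.single aws_instance.primary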

You can also perform a dry run if you want to see which moves tfautomv finds without it writing anything:

tfautomv -dry-run

Padok works with multiple cloud providers, so tfautomv is completely provider-agnostic. It works on any Terraform codebase, regardless of where you provision your resources.

Padok already uses tfautomv to refactor production codebases. Feedback has been very positive. The tool is heavily tested after every change to be as reliable as possible. We never want to break production infrastructure.

Origin story

The idea for tfautomv first emerged while we were working for a client. We managed a large Terraform codebase that was becoming painful for our SREs to work with. We decided to refactor it and ran straight into the pain point described above.

The first quick-and-dirty version of tfautomv was tailor-made for the client's infrastructure and our specific refactoring needs. The tool was a great success: it saved us a lot of time by automatically performing most of the necessary state migrations.

We then decided to rewrite the tool to make it more rigorous in its logic and to make it project-agnostic. We started with a significant design phase, walking through different cases that may emerge during refactoring. We support the most common cases so that our SREs can get the most done in the least amount of time.

Implementing the new design did not take very long; tfautomv is not a giant tool. Its codebase currently contains about 1400 lines of Go, roughly 600 of which are automated tests.

What next?

We have plans for additional features that our SREs have requested. In the next release, we will add an option to ignore specific differences between resources. Users will be able to make tfautomv more flexible, allowing matches between resources that are not strictly identical. This is particularly useful when two different values of an attribute are functionally equivalent at the provider level.

Feel free to request additional features. All feedback is welcome!

Reach out to Padok and tfautomv's maintainers on Twitter: @padok_m33 and @ArthurBusser. We would love to hear about your experience with tfautomv.