Terraform State Management at Scale
Terraform state is straightforward when you have a single team managing a handful of resources. It becomes a minefield when you scale to dozens of environments, multiple teams, and thousands of resources. This post covers the patterns and practices that keep state manageable at scale.
The Problem with State at Scale
Every Terraform deployment maintains a state file that maps your configuration to real-world resources. When organizations grow, common state-related problems emerge:
- State collisions: Two engineers running `terraform apply` simultaneously, corrupting state
- Monolithic state: A single state file tracking thousands of resources, making plans slow and risky
- Environment drift: No clear separation between dev, staging, and production state
- Access control gaps: Everyone having write access to production state
- Migration pain: Moving resources between state files without destroying and recreating them
Remote Backend Setup
The first step is moving from local state to a remote backend. For AWS environments, S3 with DynamoDB locking is the standard pattern.
```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "networking/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"

    # Use a separate AWS profile for state access
    profile = "terraform-state"
  }
}
```

Bootstrapping the Backend
There's a chicken-and-egg problem: you need infrastructure to store state, but you use Terraform to create infrastructure. The solution is a small bootstrap module that creates the state resources themselves.
```hcl
# This module uses local state intentionally.
# Run once, then store its state in version control.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "acme-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Key details: S3 versioning lets you recover from corrupted state by rolling back to a previous version. KMS encryption protects secrets stored in state. The DynamoDB table provides distributed locking to prevent concurrent modifications.
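Because the bucket is versioned, recovering from a corrupted state file means promoting an older object version. A hedged sketch of that recovery, using the bucket and key names from the examples above (the version ID shown is a placeholder for whatever `list-object-versions` reports as last known good):

```bash
# Hypothetical recovery: find and restore a last-known-good state version.
aws s3api list-object-versions \
  --bucket acme-terraform-state \
  --prefix networking/vpc/terraform.tfstate

# Download the chosen version (VersionId taken from the listing above)...
aws s3api get-object \
  --bucket acme-terraform-state \
  --key networking/vpc/terraform.tfstate \
  --version-id "EXAMPLE_VERSION_ID" recovered.tfstate

# ...then, from the affected root module, push it back as the current state.
terraform state push recovered.tfstate
```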
State Decomposition Strategies
A monolithic state file is the single biggest scaling bottleneck. When one state file tracks your VPC, EKS cluster, databases, and application configs, a `terraform plan` takes minutes and a failed apply can block every team.
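A quick way to gauge whether a state file has grown into this bottleneck (a sketch, run from the root module in question):

```bash
# Count tracked resources; hundreds or more is a sign it is time to split.
terraform state list | wc -l

# Raw state size in bytes; large states also slow every refresh on plan/apply.
terraform state pull | wc -c
```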
Split by Lifecycle
Group resources by how often they change and who changes them:
```text
infrastructure/
├── bootstrap/         # State backend (local state, committed)
├── networking/
│   ├── vpc/           # VPC, subnets, route tables (rarely changes)
│   └── dns/           # Route53 zones, records
├── platform/
│   ├── eks/           # EKS cluster, node groups
│   ├── rds/           # Database instances
│   └── elasticache/   # Redis/Memcached clusters
├── security/
│   ├── iam/           # IAM roles, policies
│   └── kms/           # Encryption keys
└── services/
    ├── api-gateway/   # API Gateway configs (changes frequently)
    └── lambda/        # Lambda functions
```

Each directory is a separate Terraform root module with its own state file. The `networking/vpc` state might change quarterly, while `services/api-gateway` changes weekly. This isolation means a team updating Lambda functions never risks blocking the networking team.
Cross-State References
When modules are split, they still need to reference each other. Use `terraform_remote_state` data sources or, better yet, SSM Parameter Store for loose coupling.

```hcl
# In platform/eks/main.tf - read VPC outputs
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "acme-terraform-state"
    key    = "networking/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  name     = "acme-production"
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = data.terraform_remote_state.vpc.outputs.private_subnet_ids
  }
}
```

```hcl
# In networking/vpc/outputs.tf - publish to SSM
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infrastructure/networking/vpc_id"
  type  = "String"
  value = aws_vpc.main.id
}

resource "aws_ssm_parameter" "private_subnets" {
  name  = "/infrastructure/networking/private_subnet_ids"
  type  = "StringList"
  value = join(",", aws_subnet.private[*].id)
}

# In platform/eks/main.tf - read from SSM
data "aws_ssm_parameter" "vpc_id" {
  name = "/infrastructure/networking/vpc_id"
}

data "aws_ssm_parameter" "private_subnets" {
  name = "/infrastructure/networking/private_subnet_ids"
}
```

SSM Parameter Store is preferred because it decouples the consumer from the producer's state backend configuration. If the VPC team migrates their state to a different bucket, consumers don't need to update anything.
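To close the loop, here is a hedged sketch of the EKS cluster consuming those SSM values instead of remote state. One wrinkle worth noting: a `StringList` parameter comes back as a single comma-separated string, so it must be split before use.

```hcl
# Hypothetical consumer wiring, built on the data sources above.
resource "aws_eks_cluster" "main" {
  name     = "acme-production"
  role_arn = var.cluster_role_arn

  vpc_config {
    # StringList parameters return "subnet-a,subnet-b,...", so split them.
    subnet_ids = split(",", data.aws_ssm_parameter.private_subnets.value)
  }
}
```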
Workspace Patterns for Environments
Terraform workspaces provide isolated state within the same configuration. They're useful for managing multiple environments (dev, staging, prod) from a single codebase, but they come with trade-offs.
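As a minimal sketch of the workspace approach (the sizing values here are hypothetical), the active workspace name can drive per-environment settings from a single configuration:

```hcl
# Created/selected with: terraform workspace new dev; terraform workspace select prod
locals {
  env = terraform.workspace # "default", "dev", "staging", or "prod"

  # Hypothetical per-environment sizing keyed on the workspace name,
  # falling back to the dev size for unknown workspaces.
  instance_type = lookup({
    dev     = "t3.medium"
    staging = "t3.large"
    prod    = "m5.xlarge"
  }, local.env, "t3.medium")
}
```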
Workspaces vs. Directory Separation
| Approach | Pros | Cons |
|---|---|---|
| Workspaces | DRY code, single source of truth | All envs share same config, risky blast radius |
| Directories | Full isolation, independent deploys | Code duplication, drift between envs |
| Hybrid | Shared modules, env-specific roots | More repo structure to maintain |
The hybrid approach works best at scale: shared modules define the “what” while environment-specific root modules define the “how much” and “where.”
```text
infrastructure/
├── modules/                  # Shared, versioned modules
│   ├── vpc/
│   ├── eks-cluster/
│   └── rds-instance/
└── environments/
    ├── dev/
    │   ├── main.tf           # module "vpc" { source = "../../modules/vpc" }
    │   ├── terraform.tfvars  # instance_type = "t3.medium"
    │   └── backend.tf        # key = "dev/platform/terraform.tfstate"
    ├── staging/
    │   ├── main.tf
    │   ├── terraform.tfvars  # instance_type = "t3.large"
    │   └── backend.tf        # key = "staging/platform/terraform.tfstate"
    └── prod/
        ├── main.tf
        ├── terraform.tfvars  # instance_type = "m5.xlarge"
        └── backend.tf        # key = "prod/platform/terraform.tfstate"
```

State Locking Deep Dive
State locking prevents concurrent operations on the same state file. Without it, two simultaneous `terraform apply` runs can corrupt state.
When Locks Go Wrong
Sometimes a Terraform run crashes or is interrupted, leaving a stale lock. Here's how to handle it safely:
```bash
# Step 1: Verify the lock is actually stale.
# Check who holds it and when it was acquired.
aws dynamodb get-item \
  --table-name terraform-state-locks \
  --key '{"LockID": {"S": "acme-terraform-state/networking/vpc/terraform.tfstate"}}' \
  --output json | jq '.Item.Info.S | fromjson'
# (The companion item whose LockID ends in "-md5" holds the state checksum,
#  not the lock info.)

# Step 2: Confirm no active Terraform process is running.
# Check CI/CD pipelines and ask the team.

# Step 3: Force unlock only when certain.
terraform force-unlock LOCK_ID

# NEVER force-unlock without verifying steps 1 and 2.
```

Safe State Migration
Moving resources between state files is one of the riskiest Terraform operations. A mistake can cause Terraform to destroy and recreate production resources. Here's a safe procedure.
Migration Procedure
1. Take a snapshot of both source and destination state files
2. Run `terraform plan` on both to confirm clean state
3. Move resources using `terraform state mv`
4. Run `terraform plan` on both again and expect no changes
5. If a plan shows changes, restore from snapshot and investigate
```bash
#!/bin/bash
set -euo pipefail

# 1. Snapshot current state
echo "Taking state snapshots..."
cd infrastructure/monolith
terraform state pull > /tmp/monolith-state-backup.json
cd ../networking/vpc
terraform state pull > /tmp/vpc-state-backup.json

# 2. Verify clean plans
echo "Verifying clean state..."
cd ../../monolith
terraform plan -detailed-exitcode  # Exit code 0 = no changes
cd ../networking/vpc
terraform plan -detailed-exitcode

# 3. Move resources
echo "Moving VPC resources..."
cd ../../monolith
terraform state mv \
  -state-out=../networking/vpc/terraform.tfstate \
  'aws_vpc.main' 'aws_vpc.main'
terraform state mv \
  -state-out=../networking/vpc/terraform.tfstate \
  'aws_subnet.private' 'aws_subnet.private'

# 4. Verify both states show no changes
echo "Verifying migration..."
terraform plan -detailed-exitcode
cd ../networking/vpc
terraform plan -detailed-exitcode

echo "Migration complete. Both states are clean."
```
Migration Safety Checklist
- Always back up state files before any migration
- Run migrations during a maintenance window—no concurrent applies
- Use `-lock=true` (the default) to hold locks during migration
- Test the migration in a non-production environment first
- Have a rollback plan: restoring the state snapshot
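The backup and rollback bullets above reduce to two commands; a sketch, run from the root module being migrated:

```bash
# Before migrating: capture the current remote state locally.
terraform state pull > /tmp/state-backup.json

# Rollback: overwrite the remote state with the backup.
# (Terraform may require -force if the state serial has advanced since the pull.)
terraform state push /tmp/state-backup.json
```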
Access Control for State
Not everyone should have write access to production state. Use S3 bucket policies and IAM to enforce least-privilege access.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowStateRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::acme-terraform-state",
        "arn:aws:s3:::acme-terraform-state/*"
      ]
    },
    {
      "Sid": "AllowProdStateWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::acme-terraform-state/prod/*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/team": "platform"
        }
      }
    },
    {
      "Sid": "AllowDevStateWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::acme-terraform-state/dev/*"
    }
  ]
}
```

This policy allows all engineers to read any state (useful for `terraform_remote_state` lookups) but restricts production state writes to the platform team. Dev state is open for all engineers to modify.
CI/CD Integration
Running Terraform from CI/CD pipelines enforces consistency and provides audit trails. Here's a GitHub Actions workflow pattern that separates plan from apply.
```yaml
name: Terraform
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    strategy:
      matrix:
        # Environments referenced by matrix.environment below
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Plan
        run: |
          cd infrastructure/environments/${{ matrix.environment }}
          terraform init
          terraform plan -out=tfplan -no-color
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Post plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            // Post terraform plan output as PR comment

  apply:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production  # Requires manual approval
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Apply
        run: |
          cd infrastructure/environments/${{ matrix.environment }}
          terraform init
          terraform apply -auto-approve
```

Key Takeaways
- Never use local state in production: Remote backends with locking are non-negotiable for team environments
- Decompose state by lifecycle: Resources that change at different rates should live in separate state files
- Prefer SSM over `terraform_remote_state`: Loose coupling between state files reduces blast radius and simplifies refactoring
- Use the hybrid approach: Shared modules with environment-specific roots give you DRY code and full isolation
- Treat state migration as a production change: Back up, verify, migrate, verify again
- Enforce access control on state: Production state writes should be restricted to authorized teams and CI/CD pipelines
Conclusion
Terraform state management is not glamorous, but getting it wrong can bring production down. The patterns covered here—remote backends, state decomposition, safe migrations, and access control—form the foundation for scaling infrastructure as code across teams and environments. Invest in this foundation early, and your future self will thank you when the organization grows from 5 services to 50.
Amar Sattaur
Staff DevOps Engineer specializing in infrastructure as code, Kubernetes, and platform engineering.