Ayhan Sipahi

Should Stateful Resources Live in a Separate CDK Stack?

A lifecycle test for CDK stack layout: give a resource its own long-lived stack when it outlives any single deployer, then reach it by a well-known name.

A CDK app answers one layout question over and over: does this resource share a stack with the Lambda functions that use it, or get a stack of its own? The usual advice, “keep the database separate,” reaches the right answer through the wrong test. State is not the criterion. A shared EventBridge bus holds no data, yet deleting it breaks every domain at once, so asking “does it hold state” misclassifies it. The criterion that predicts placement is lifecycle: a resource earns its own long-lived stack when its lifecycle outlives any single deployer, and other stacks then reach it by a well-known name. That name is a contract, and the contract sits on a coupling spectrum you can tune.

The lifecycle test

State is the obvious half of the test, and it is real. Replacing a DynamoDB table, an RDS or Aurora cluster, an S3 bucket with objects, an OpenSearch domain, or a Cognito user pool loses data that a redeploy cannot rebuild. But state is only half. The other half is the rendezvous point: a resource every domain depends on reaching by a stable identity, even though it stores nothing durable. A central EventBridge bus and a shared SNS topic are the clear cases. They hold no rows, yet a casual delete or rename breaks every stateless stack at once. Yan Cui reaches the same boundary from the cohesion side and puts shared infrastructure outside the service entirely: he treats VPCs and subnets as “part of the ‘platform’, not the service” and says “they should have their own stack and repo and pipeline.”

So the test has two branches, and a Yes on either one sends the resource to a long-lived stack.

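Decision flow for a resource:

  • Lose data if replaced? Yes: own long-lived stack. No: ask the next question.
  • Every domain must reach it? Yes: own long-lived stack. No: stateless app stack.
  • Once it has its own long-lived stack, is it cheap and fast to create? Yes: per-environment instance, physical name. No: one shared instance, SSM path.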
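
The test and its follow-up question about creation cost can be sketched as one small function. This is a minimal illustration in plain TypeScript, not a CDK API: the ResourceTraits fields and the placeResource name are hypothetical, and the three return values simply mirror the placements named in this post.

```typescript
// Hypothetical names throughout: this models the decision, it is not a CDK API.
interface ResourceTraits {
  losesDataIfReplaced: boolean;    // e.g. a DynamoDB table holding real data
  everyDomainMustReachIt: boolean; // e.g. a shared EventBridge bus
  cheapAndFastToCreate: boolean;   // e.g. DynamoDB on-demand, unlike an Aurora cluster
}

type Placement =
  | 'stateless app stack'
  | 'long-lived stack: per-environment instance, physical name'
  | 'long-lived stack: one shared instance, SSM path';

function placeResource(traits: ResourceTraits): Placement {
  // A Yes on either branch sends the resource to a long-lived stack.
  const outlivesDeployer = traits.losesDataIfReplaced || traits.everyDomainMustReachIt;
  if (!outlivesDeployer) {
    return 'stateless app stack';
  }
  // Creation cost then decides between a per-environment and a shared instance.
  return traits.cheapAndFastToCreate
    ? 'long-lived stack: per-environment instance, physical name'
    : 'long-lived stack: one shared instance, SSM path';
}

const sharedBus = placeResource({
  losesDataIfReplaced: false,
  everyDomainMustReachIt: true,
  cheapAndFastToCreate: true,
});
const auroraCluster = placeResource({
  losesDataIfReplaced: true,
  everyDomainMustReachIt: false,
  cheapAndFastToCreate: false,
});
const apiHandler = placeResource({
  losesDataIfReplaced: false,
  everyDomainMustReachIt: false,
  cheapAndFastToCreate: true,
});
```

The shared bus lands in a long-lived stack even though it holds no data, which is the case a state-only test gets wrong.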

Because the boundary is lifecycle and not resource type, it also tells you what stays with the application: an API Gateway, a Lambda, an IAM role, or a pay-per-request table a preview environment can throw away. AWS’s own guidance starts from the same baseline, not from “always split.” It says to “keep them together unless you know you want them separated,” then carves out the exception: “Consider keeping stateful resources (like databases) in a separate stack from stateless resources. You can then turn on termination protection on the stateful stack.” The lifecycle test is that carve-out made precise. The same instinct shows up outside CDK; Terraform teams split state files by blast radius and rate of change for the same reason.

The stateful stack

The long-lived stack opts out of accidental destruction in two places. At the stack level, terminationProtection: true blocks a delete of the whole stack. At the resource level, a removal policy decides what happens when a resource leaves the stack.

CDK already leans safe here. As AWS puts it, the toolkit defaults to “policies that retain everything you create,” and a data resource removed from a stack is “orphaned from the stack” rather than deleted. Set the policy explicitly anyway, and pick the right flavor. RemovalPolicy.RETAIN keeps the resource on any removal but orphans it, including when a fresh resource’s creation rolls back, which can leave empty junk behind. RemovalPolicy.RETAIN_ON_UPDATE_OR_DELETE retains on delete and on replacement but still cleans up a resource whose creation was rolled back. For a genuinely stateful resource, prefer the latter; the code below does. Resources that ship only as L1 constructs in aws-cdk-lib (an ElastiCache replication group, an OpenSearch Serverless collection) take the policy through applyRemovalPolicy(...) instead of a removalPolicy prop.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as events from 'aws-cdk-lib/aws-events';

interface StatefulStackProps extends cdk.StackProps {
  readonly stage: string;
}

export class StatefulStack extends cdk.Stack {
  public readonly usersTableName: string;
  public readonly eventBusName: string;

  constructor(scope: Construct, id: string, props: StatefulStackProps) {
    super(scope, id, {
      ...props,
      terminationProtection: true, // block an accidental delete of the whole stack
    });

    // Loses data if replaced: retain on delete and on replacement.
    const table = new dynamodb.Table(this, 'Users', {
      tableName: `myapp-${props.stage}-users`, // stage-scoped physical name: the contract
      partitionKey: { name: 'userId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.RETAIN_ON_UPDATE_OR_DELETE,
    });

    // A rendezvous point: no stored data, but every domain must reach it by name.
    const bus = new events.EventBus(this, 'DomainEvents', {
      eventBusName: `myapp-${props.stage}-events`,
    });

    this.usersTableName = table.tableName;
    this.eventBusName = bus.eventBusName;
  }
}

The reason to separate is exactly the one AWS names: with the database in its own protected stack, you can “freely destroy or create multiple copies of the stateless stack without risk of data loss.” The trade-off is honest. You now reason about two stacks instead of one, and if you wire them together with CloudFormation exports you inherit a deploy-time lock, discussed next. Note also that the bus carries no removal policy: it stores nothing, so its protection is not a retain policy but the fact that it lives in a stack you never casually destroy.
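
For reference, the two retain flavors differ in how they render on the CloudFormation resource. The mapping below is a sketch based on the documented behavior of RemovalPolicy in aws-cdk-lib; the type and helper names are made up for illustration, and it is worth confirming against a synthesized template for your CDK version.

```typescript
// The four RemovalPolicy values and the CloudFormation attributes each one
// is documented to render. Type and helper names here are illustrative only.
type RemovalPolicyName = 'DESTROY' | 'RETAIN' | 'RETAIN_ON_UPDATE_OR_DELETE' | 'SNAPSHOT';

interface RenderedPolicy {
  DeletionPolicy: string;
  UpdateReplacePolicy: string;
}

const renderedPolicy: Record<RemovalPolicyName, RenderedPolicy> = {
  DESTROY: { DeletionPolicy: 'Delete', UpdateReplacePolicy: 'Delete' },
  RETAIN: { DeletionPolicy: 'Retain', UpdateReplacePolicy: 'Retain' },
  RETAIN_ON_UPDATE_OR_DELETE: {
    DeletionPolicy: 'RetainExceptOnCreate',
    UpdateReplacePolicy: 'Retain',
  },
  SNAPSHOT: { DeletionPolicy: 'Snapshot', UpdateReplacePolicy: 'Snapshot' },
};

// True when deleting the stack does not simply delete the resource's data.
function keepsDataOnStackDelete(policy: RemovalPolicyName): boolean {
  return renderedPolicy[policy].DeletionPolicy !== 'Delete';
}

// True when a rolled-back creation leaves the new, empty resource behind.
function leavesOrphanOnFailedCreate(policy: RemovalPolicyName): boolean {
  return renderedPolicy[policy].DeletionPolicy === 'Retain';
}
```

Read this way, RETAIN_ON_UPDATE_OR_DELETE gives the same protection as RETAIN on delete and on replacement, and differs only in what happens to a resource whose creation failed.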

The named contract

Once a resource lives in its own stack, every other stack reaches it through a contract. There are three, and they form a spectrum from the tightest deploy-time coupling to none at all, with the name growing more rigid as the coupling falls.

CloudFormation export (deploy-time lock) → SSM path (resolved at deploy) → Physical name (rename blocked)

CloudFormation export. When you pass a construct from one stack to another in the same app, CDK synthesizes an Fn::ImportValue for you. It is convenient, and CloudFormation guarantees the value exists while it is imported. The cost is the export update lock. CloudFormation refuses to change or delete an exported value while a consumer still imports it, reporting that the export cannot be modified while another stack depends on it. Breaking that reference cleanly takes a two-step exportValue procedure, which the CloudFormation 500-resource-limit post covers in depth.
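
The lock, and why releasing it takes two deployments, can be seen in a toy model. The sketch below is an in-memory simulation only: the ExportTable class and its methods are hypothetical and are not CloudFormation's or CDK's API. It assumes nothing beyond the rule stated above, that an export cannot be changed or removed while something imports it.

```typescript
// Toy in-memory model of the export lock. Class and method names are
// hypothetical; this is not CloudFormation's or CDK's API.
class ExportTable {
  private readonly entries = new Map<string, { value: string; importers: Set<string> }>();

  publish(name: string, value: string): void {
    const entry = this.entries.get(name);
    if (entry && entry.importers.size > 0 && entry.value !== value) {
      throw new Error(`cannot update export ${name}: still imported`);
    }
    this.entries.set(name, { value, importers: entry ? entry.importers : new Set<string>() });
  }

  importValue(name: string, consumer: string): string {
    const entry = this.entries.get(name);
    if (!entry) throw new Error(`no export named ${name}`);
    entry.importers.add(consumer);
    return entry.value;
  }

  stopImporting(name: string, consumer: string): void {
    const entry = this.entries.get(name);
    if (entry) entry.importers.delete(consumer);
  }

  remove(name: string): void {
    const entry = this.entries.get(name);
    if (entry && entry.importers.size > 0) {
      throw new Error(`cannot delete export ${name}: still imported`);
    }
    this.entries.delete(name);
  }

  has(name: string): boolean {
    return this.entries.has(name);
  }
}

const cfn = new ExportTable();
cfn.publish('UsersTableName', 'myapp-prod-users');
cfn.importValue('UsersTableName', 'AppStack');

// Removing the export while AppStack still imports it is refused.
let refused = false;
try {
  cfn.remove('UsersTableName');
} catch {
  refused = true;
}

// Deploy 1: the consumer stops importing while the producer keeps the export alive.
cfn.stopImporting('UsersTableName', 'AppStack');
// Deploy 2: with no importers left, the producer can drop the export.
cfn.remove('UsersTableName');
const exportGone = !cfn.has('UsersTableName');
```

In a real app, keeping the export alive during the first deploy is the job exportValue does on the producer stack.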

SSM parameter path. The producer writes a StringParameter; the consumer reads it with ssm.StringParameter.valueForStringParameter(...), which returns a token CloudFormation resolves at deploy time. There is no export lock, and the value is not baked into the synthesized template. By contrast, valueFromLookup(...) resolves at synth time, caches into cdk.context.json, and embeds the value in the template. Do not use it for anything you would not commit to source control. Note, though, that the path string is itself a fixed, well-known name. The coupling did not vanish; it moved up one level, from the resource to the parameter path.

Physical resource name. The consumer imports by a name it already knows: Table.fromTableName, Bucket.fromBucketName, EventBus.fromEventBusName. Nothing resolves at deploy time, so there is no ordering coupling and no export lock. Two constraints come with it. First, the name can never change, because changing it forces a replacement. Second, the import is a proxy: it does not become part of your app. A grant* call adds IAM to the consumer’s own role (that works), but you cannot mutate the imported resource’s policy from the consuming stack. And fromTableName and its siblings are same-account only; crossing an account boundary needs the ARN form plus a resource policy, covered later.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as events from 'aws-cdk-lib/aws-events';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

interface AppStackProps extends cdk.StackProps {
  readonly stage: string;
}

export class AppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: AppStackProps) {
    super(scope, id, props);

    // Import by a name this stack already knows: no cross-stack export, no deploy-time lock.
    const table = dynamodb.Table.fromTableName(this, 'Users', `myapp-${props.stage}-users`);
    const bus = events.EventBus.fromEventBusName(this, 'DomainEvents', `myapp-${props.stage}-events`);

    const handler = new NodejsFunction(this, 'ApiHandler', {
      entry: 'src/handlers/api.ts',
      environment: {
        TABLE_NAME: table.tableName,
        BUS_NAME: bus.eventBusName,
      },
    });

    // grant* adds IAM to the consumer's own role, which works on an imported proxy.
    table.grantReadWriteData(handler);
    bus.grantPutEventsTo(handler);
  }
}

Which rung to pick follows the deploy cadences of the two stacks:

Contract | Coupling | Reach for it when
CloudFormation export | Deploy-time lock; consumer pins the producer | The stacks always deploy together and you want CloudFormation to guarantee the value stays
SSM parameter path | Resolved at deploy, no lock | Producer and consumer deploy on independent cadences
Physical name | Never resolved, rename banned | You want zero deploy-time ordering and accept an immutable name (the stateful default)

A note on physical names

AWS’s guidance runs the other way: “Use generated resource names, not physical names,” because “Names are a precious resource. Each name can only be used once.” That rule is correct for the resources it was written for. A stateless stack you replace freely should not pin names, or you could not stand up a second copy in the same account. But the stateful case inverts the premise. You do not replace a database on a whim. The property “this resource cannot be renamed without a replacement” is exactly the guarantee you want on production data. A stage-scoped physical name (myapp-${stage}-users) is legitimate here precisely because the resource is one you intend never to replace.

The rule still has teeth, and they are worth stating plainly. Because a RETAIN-orphaned resource keeps its physical name after its stack is gone, re-creating that stack fails with a name conflict until you delete, rename, or re-adopt the orphan. If you would rather not argue with AWS guidance at all, the softer reading costs nothing: a well-known name can be an SSM path instead of a physical name. The SSM path and the physical name are the same idea at two different points on the coupling spectrum.
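
One way to keep such names from drifting between producer and consumer is to derive them in a single shared module that both stacks import. A minimal sketch, assuming the myapp prefix used in this post's examples and a hypothetical convention that stage names contain only lowercase letters, digits, and hyphens; the function names are illustrative, not part of any library:

```typescript
// Hypothetical shared naming module: the contract lives in one place.
const APP = 'myapp'; // assumed application prefix, matching this post's examples

// Assumed convention: stages are plain strings known at synth time,
// made of lowercase letters, digits, and hyphens.
function assertStage(stage: string): void {
  if (!/^[a-z0-9-]+$/.test(stage)) {
    throw new Error(`invalid stage name: ${stage}`);
  }
}

// Stage-scoped physical name, e.g. myapp-prod-users.
function physicalName(stage: string, resource: string): string {
  assertStage(stage);
  return `${APP}-${stage}-${resource}`;
}

// Stage-scoped SSM parameter path, e.g. /myapp/dev/db-arn.
function ssmPath(stage: string, key: string): string {
  assertStage(stage);
  return `/${APP}/${stage}/${key}`;
}

const usersTable = physicalName('prod', 'users');
const dbArnPath = ssmPath('dev', 'db-arn');
```

Whichever rung a resource sits on, producer and consumer then call the same function, so a typo in one stack cannot silently point at a different name.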

Ephemeral environments

Per-PR environments are where the lifecycle test stops being theory. A preview environment instantiates only the stateless stack and imports the stateful layer by name. What “the stateful layer” means splits on the second question in the decision tree: is the resource cheap and fast to create?

Cheap and fast, such as DynamoDB on-demand, is created in seconds and costs almost nothing when idle, so each environment can have its own table, named myapp-${env}-users. This is the one case where the resource genuinely does not outlive its deployer, so co-locating it with the stateless layer per environment is legitimate. It is also the narrow case where the “keep them together” camp wins outright, discussed in the next section.

Expensive and slow, such as RDS, Aurora, OpenSearch, or MSK, is different. These “take longer to spin up, which also doesn’t play well with using ephemeral environments,” in Yan Cui’s words, and a cluster per preview environment multiplies idle cost. Therefore share one instance by name across every ephemeral environment and push isolation down to the data level: per-environment table names, schemas, database names, or key prefixes inside the one cluster. This is Yan Cui’s pattern directly: “one RDS cluster in the dev account” that every ephemeral environment uses but where each has “their own tables/databases.”
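
Data-level isolation needs a deterministic name per environment. The sketch below derives one from an environment label. It assumes a PostgreSQL-compatible cluster, where identifiers are limited to 63 bytes, and an app_ prefix; both are assumptions for illustration, as is the isolationNames helper itself.

```typescript
// Hypothetical helper: derive per-environment names inside one shared cluster.
// Assumes environment labels look like "pr-482" or "feature/Login-Page".
function isolationNames(env: string): { databaseName: string; keyPrefix: string } {
  // Lowercase, collapse anything that is not a letter or digit into "_",
  // then trim leading and trailing underscores.
  const slug = env
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '_')
    .replace(/^_+|_+$/g, '');
  if (slug === '') {
    throw new Error(`environment label has no usable characters: ${env}`);
  }
  return {
    // PostgreSQL truncates identifiers beyond 63 bytes; the slug is ASCII,
    // so slicing by characters is the same as slicing by bytes.
    databaseName: `app_${slug}`.slice(0, 63),
    // Prefix for key-based isolation, e.g. object keys or cache keys.
    keyPrefix: `${slug}/`,
  };
}

const preview = isolationNames('pr-482');
const branch = isolationNames('feature/Login-Page');
```

Truncation means two very long labels could collide, so a real setup would likely want short, unique labels such as PR numbers.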

The wiring uses the SSM rung with one indirection. The shared-cluster stack publishes its ARN to a fixed path; the app stack reads that path through a stage that falls back to its own when no shared stage is passed:

import * as ssm from 'aws-cdk-lib/aws-ssm';

// Producer: inside the shared-cluster stack, publish the ARN to a fixed, well-known path.
new ssm.StringParameter(this, 'DbArnParam', {
  parameterName: `/myapp/${props.stage}/db-arn`,
  stringValue: cluster.clusterArn,
});

// Consumer: inside each per-environment app stack, resolve the shared cluster.
// ssmStage falls back to this environment's own stage when no shared stage is passed in.
const ssmStage = this.node.tryGetContext('ssmStage') ?? props.stage;
const dbArn = ssm.StringParameter.valueForStringParameter(
  this,
  `/myapp/${ssmStage}/db-arn`,
);
// valueForStringParameter returns a deploy-time token: no export lock,
// and the value is not baked into the synthesized template.

The ephemeral preview environments post covers the operational mechanics. Here the point is the seam: import by a well-known name, and the stateless layer no longer cares whether the database is per-environment or shared.

The case for keeping them together

The strongest counter-argument comes from Yan Cui, who is “very much in the monolith stack camp” and prefers to keep stateful and stateless resources together for cohesion. Three of his points survive scrutiny and should shape when you do not split.

First, separation is not protection. Moving a database into its own stack “doesn’t eliminate the risk of accidental deletion. It just moves the target.” The protection comes from terminationProtection on the stack and a retain policy (DeletionPolicy plus UpdateReplacePolicy) on the resource, neither of which requires a separate stack. Split without setting those, and you have added a stack while protecting nothing.

Second, the deploy-time cost of co-locating unchanged stateful resources is close to zero. CloudFormation skips resources that have not changed, so deploy time tracks the stateless count. Yan Cui measured three stacks: 5 Lambda functions took 46.4 seconds; the same 5 Lambdas plus 5 DynamoDB tables also took 46.4 seconds; 20 Lambdas took 55 seconds. The tables were free; the extra 8.6 seconds came from the additional functions. (These are 2023 figures shown for direction, not a benchmark; the absolute numbers drift with the service, the shape does not.)

Third, in CDK the unit of deployment is the app, not the stack. As Yan Cui’s later note acknowledges, “the unit of deployment of CDK is the CDK app,” so splitting a monolith into two stacks in the same app does not by itself buy independent deploy cadence. They still deploy together. If you need truly independent deploys, that is a separate app or pipeline decision, which the CDK code-organization post covers.

So where does that leave the one case both camps actually disagree on, an ordinary per-service DynamoDB table? The lifecycle test answers it. If the table outlives its deployers, holding a real environment’s data, give it the long-lived stack and the named contract. If it is a per-PR table that dies with its environment, keep it with the stateless layer. The disagreement dissolves once you ask about lifecycle instead of state.

Crossing account boundaries

Everything so far assumes one account. Crossing into another account is a different rung, and it removes options rather than adding them. CloudFormation exports do not cross accounts at all; they are scoped to one account and region. The name-import helpers (fromTableName, fromBucketName, fromEventBusName) are same-account only too. That leaves exactly one mechanism: import by ARN (fromTableArn, fromEventBusArn) and attach a resource policy on the producer so the other account is allowed in.

For a rendezvous-point bus, that policy is the contract:

import * as events from 'aws-cdk-lib/aws-events';
import * as iam from 'aws-cdk-lib/aws-iam';

const bus = new events.EventBus(this, 'DomainEvents', {
  eventBusName: `myapp-${props.stage}-events`,
});

// Producer account: let a specific consumer account put events on the bus.
bus.addToResourcePolicy(new iam.PolicyStatement({
  sid: 'AllowConsumerAccountPutEvents',
  effect: iam.Effect.ALLOW,
  principals: [new iam.AccountPrincipal('222222222222')],
  actions: ['events:PutEvents'],
  resources: [bus.eventBusArn],
}));

Consumers in the other account then import with EventBus.fromEventBusArn(...) and put events through that policy. The cross-account fan-out post and the isolated consumer accounts post cover the eventing side. Putting a resource in its own account is the far end of this spectrum, justified by compliance or a hard blast-radius limit rather than by lifecycle. When you go there, an AWS Control Tower landing zone and backup vaults are the tools, not cross-stack references.
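
Why a name is not enough across accounts is visible in the ARN itself: the account ID is part of it, and a name-only import has no way to say "the other account". A small sketch of the two ARN shapes involved; the helper names and the producer account 111111111111 are hypothetical, and the result is the string a consumer would hand to fromEventBusArn or fromTableArn.

```typescript
// Hypothetical helpers that spell out the producer's account in the ARN.
interface ArnLocation {
  partition?: string; // defaults to the standard "aws" partition
  region: string;
  account: string;    // the PRODUCER account, not the consumer's
}

function assertAccount(account: string): void {
  if (!/^\d{12}$/.test(account)) {
    throw new Error(`account ID must be 12 digits: ${account}`);
  }
}

function eventBusArn(loc: ArnLocation, busName: string): string {
  assertAccount(loc.account);
  return `arn:${loc.partition ?? 'aws'}:events:${loc.region}:${loc.account}:event-bus/${busName}`;
}

function tableArn(loc: ArnLocation, tableName: string): string {
  assertAccount(loc.account);
  return `arn:${loc.partition ?? 'aws'}:dynamodb:${loc.region}:${loc.account}:table/${tableName}`;
}

// Assumed producer account and region, for illustration only.
const producer: ArnLocation = { region: 'eu-west-1', account: '111111111111' };
const busArn = eventBusArn(producer, 'myapp-prod-events');
const usersArn = tableArn(producer, 'myapp-prod-users');
```

The ARN only identifies the resource; access still depends on the producer's resource policy naming the consumer account.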

Common pitfalls

  • Changing the logical ID of a stateful resource. The logical ID derives from the construct id and its position in the tree. Renaming or moving a construct changes it, so CloudFormation replaces the resource, which for a table is data loss. AWS is blunt: “Don’t change the logical ID of stateful resources,” and notes that “Stateful resources are more sensitive to construct renaming.” Freeze construct IDs on stateful resources, and if a rename is truly unavoidable, pin the old ID with overrideLogicalId.
  • The RETAIN orphan trap. A RETAIN-orphaned resource keeps its physical name after its stack is deleted, so redeploying that stack fails with a name conflict. Do not destroy stateful stacks casually; if you must, delete or rename the orphan first, or adopt it back with cdk import.
  • Reaching across accounts with exports. Exports are account and region scoped, and fromXxxName is same-account only. Use the ARN import plus a resource policy instead.
  • Separation without a retain policy. A separate stack whose resources carry no retain policy can still delete its data on stack delete. The stack boundary is not the protection; the policy is.
  • A cluster per preview environment. RDS and OpenSearch bill by the hour and provision slowly, so a per-PR instance is expensive and slow. Share one by name and isolate at the data level.
  • Unscoped physical names. A bare users collides across environments. Scope every physical name with the stage: myapp-${stage}-users.
  • Assuming a split buys independent deploys. Stacks in one CDK app deploy together; independent cadence is an app or pipeline decision, not a stack split.
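
The retain-policy pitfall in particular can be caught mechanically by inspecting a synthesized template before deploy. A minimal sketch, operating on the parsed JSON of a template from cdk.out: the function name is hypothetical, the list of stateful resource types is illustrative rather than exhaustive, and treating Snapshot as protective and a missing policy as Delete are simplifying assumptions.

```typescript
// Minimal shape of the parts of a CloudFormation template this check reads.
interface TemplateResource {
  Type: string;
  DeletionPolicy?: string;
  UpdateReplacePolicy?: string;
}

interface Template {
  Resources: Record<string, TemplateResource>;
}

// Illustrative, not exhaustive: extend with the stateful types your app uses.
const STATEFUL_TYPES = [
  'AWS::DynamoDB::Table',
  'AWS::RDS::DBCluster',
  'AWS::S3::Bucket',
  'AWS::OpenSearchService::Domain',
  'AWS::Cognito::UserPool',
];

// Returns the logical IDs of stateful resources that are not kept on both
// stack delete and replacement. A missing policy is treated as Delete here.
function unprotectedStatefulResources(template: Template): string[] {
  return Object.keys(template.Resources)
    .filter((logicalId) => {
      const resource = template.Resources[logicalId];
      if (STATEFUL_TYPES.indexOf(resource.Type) === -1) return false;
      const keptOnDelete =
        ['Retain', 'RetainExceptOnCreate', 'Snapshot'].indexOf(resource.DeletionPolicy ?? 'Delete') !== -1;
      const keptOnReplace =
        ['Retain', 'Snapshot'].indexOf(resource.UpdateReplacePolicy ?? 'Delete') !== -1;
      return !(keptOnDelete && keptOnReplace);
    })
    .sort();
}

const sample: Template = {
  Resources: {
    UsersTable: {
      Type: 'AWS::DynamoDB::Table',
      DeletionPolicy: 'RetainExceptOnCreate',
      UpdateReplacePolicy: 'Retain',
    },
    SessionsTable: {
      Type: 'AWS::DynamoDB::Table',
      DeletionPolicy: 'Delete',
      UpdateReplacePolicy: 'Delete',
    },
    ApiHandler: { Type: 'AWS::Lambda::Function' },
  },
};

const findings = unprotectedStatefulResources(sample);
```

Run in CI against each long-lived stack's template, a non-empty result is a reasonable reason to fail the build.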

Key takeaways

The default holds for most serverless-leaning CDK apps: put a resource in its own long-lived stack when its lifecycle outlives any single deployer, and reach it by a well-known name. A resource earns that stack whether it would lose data on replacement or is a rendezvous point every domain must reach. For cheap, fast resources that name is a stage-scoped physical name; for expensive shared ones it is an SSM parameter path pointing at a single instance. Both are the same idea at different points on the coupling spectrum. Keep resources together when the lifecycle test says No, when protection already comes from termination protection and retain policies, or when a per-PR resource dies with its environment. Escalate to a separate account only for compliance or a hard blast-radius limit, not as a default. The one action that pays off immediately: set terminationProtection on every stack that holds stateful resources and an explicit RemovalPolicy on each of those resources today, before deciding anything about stack boundaries. The boundary is a design choice; the retain policy is the safety net.
