You Don't Need a Maintenance Window to Upgrade Flink

Flink's upgrade story sounds simple until a real stateful job starts changing underneath you.

Take a savepoint, deploy the new version, restore from the savepoint and keep going.

This works well enough when the job shape is stable. It gets less fun when protos change, schemas move, SQL gets rewritten, joins shift, windows change, or connector versions alter the operator graph in ways that are technically explainable and operationally exhausting.

The failure mode that changed my mind was not dramatic at first. Event protos changed. A few stateful jobs did not absorb the change gracefully. Some derived state took days to backfill.

Three days is a long time to spend being reminded that your deployment strategy is also your recovery strategy...

Savepoints are doing too much

Savepoints are useful (so are checkpoints); I'm not arguing against them.

The problem is treating savepoints as the primary upgrade primitive for every production change.

A stateful Flink job is not just code. It is code plus keyed state, timers, windows, offsets, joins, and whatever shape the planner decided your query should have. When that shape changes, restoring the new job from the old job's state becomes fragile.

The scary part is when you only discover that fragility after stopping the old job.

That turns a deployment into a negotiation with production.

Run the new job next to the old one

The approach I prefer is boring:

Leave the existing job running.
Deploy the new version separately.
Have it consume the same live traffic.
Write its output somewhere non-authoritative, usually a canary topic.
Compare outputs until you trust it.
Move consumers when you are ready.

The old job remains authoritative the entire time.

The candidate job has to prove itself against real traffic before anything depends on it. If it fails, you stop it. No emergency restore. No maintenance window. No explaining why a "safe" deploy created a multi-day backfill.

The canary topic is where the truth shows up

With concurrent jobs, you can compare behavior before promotion:

counts
schemas
null rates
key coverage
aggregation outputs
lateness and watermark behavior
downstream business metrics

This catches a different class of problem than a unit test or a dry-run restore. It tells you whether the new job behaves like the old one under the mess of production traffic.

Rollback becomes less dramatic: keep consumers pointed at the old topic, fix the candidate, try again.

This is not free

Concurrent jobs cost more. They double-read traffic for a while. They require duplicate sinks or canary topics. You need comparison tooling. You need discipline around consumer cutover.

Still, I will take that complexity over a deployment path where the first real test happens after the old job is down.

Savepoints are excellent recovery tools, but as anyone that's worked with Flink knows, they are not always the safest promotion mechanism.

For stateful streaming systems, the safest deployment is the one where production users never find out you were deploying.