Flink's upgrade story sounds simple until a real stateful job starts changing underneath you.
Take a savepoint. Deploy the new version. Restore from the savepoint. Keep going.
This works well enough when the job shape is stable. It gets less fun when protobufs change, schemas move, SQL gets rewritten, joins shift, windows change, or connector versions alter the operator graph in ways that are technically explainable and operationally exhausting.
The failure mode that changed my mind was not dramatic at first. Event protobufs changed. A few stateful jobs did not absorb the change gracefully. In some cases, recovering derived state meant backfills that took days.
Three days is a long time to spend being reminded that your deployment strategy is also your recovery strategy.
Savepoints are doing too much
Savepoints are extremely useful. So are checkpoints. I am not arguing against them.
The problem is treating savepoints as the primary upgrade primitive for every production change.
A stateful Flink job is not just code. It is code plus keyed state, timers, windows, offsets, joins, and whatever shape the planner decided your query should have this week. When that shape changes, restoring the new job from the old job's state can become fragile.
The scary part is when you only discover that fragility after stopping the old job.
That turns a deployment into a negotiation with production.
Run the new job next to the old one
The approach I prefer is boring:
- Leave the existing job running.
- Deploy the new version separately.
- Have it consume the same live traffic.
- Write its output somewhere non-authoritative, usually a canary topic.
- Compare outputs until you trust it.
- Move consumers when you are ready.
The old job remains authoritative the entire time.
That is the important property. The candidate job has to prove itself against real traffic before anything depends on it. If it fails, you stop it. No emergency restore. No maintenance window. No explaining why a "safe" deploy created a multi-day backfill.
The canary topic is where the truth shows up
With concurrent jobs, you can compare behavior before promotion:
- counts
- schemas
- null rates
- key coverage
- aggregation outputs
- lateness and watermark behavior
- downstream business metrics
This catches a different class of problem than a unit test or a dry-run restore. It tells you whether the new job behaves like the old one under the mess of production traffic.
It also makes rollback almost boring. Keep consumers pointed at the old topic, fix the candidate, try again.
This is not free
Concurrent jobs cost more. They double-read traffic for a while. They require duplicate sinks or canary topics. You need comparison tooling. You need discipline around consumer cutover.
Still, I will take that complexity over a deployment path where the first real test happens after the old job is down.
Savepoints are excellent recovery tools. They are not always the safest promotion mechanism.
For stateful streaming systems, the safest deployment is often the one where production never has to find out you were deploying.