Domain-Driven Pipelines: The Case for Decomposing Data Infrastructure

April 8, 2024

Every data org eventually has the same conversation: "We need one pipeline to rule them all."

It never works. The single, perfect, all-encompassing data pipeline is a myth. Teams burn quarters trying to build one, and the result is always the same — a brittle monolith that bottlenecks every team it touches.

Perfect Pipelines

Why do we build data infrastructure? To make data accessible and actionable across the organization. That's it.

Marketing needs real-time customer interaction data. Finance needs batch-processed transactions. ML needs feature stores with sub-second lookups. These are fundamentally different access patterns. Forcing them through the same pipeline is not engineering — it's wishful thinking.

The real benefit of modern data engineering isn't technical. It's organizational. Teams can process data in ways that fit their needs. They can ship changes without filing tickets with a central data team. They can add new sources without a cross-org coordination meeting.

A single pipeline destroys all of that.

Unwanted Inflexibility

Monolithic pipelines kill velocity in three ways:

They create cross-team coupling. Need a schema change? Get in line. The central data team is busy with three other requests. Your sprint is blocked. Meanwhile, you're getting paged at 2 AM because some other team's data flow is choking the shared pipeline.

They're a single point of failure. One bad deployment, one malformed record, one upstream schema change — and every team's data goes dark. The blast radius is the entire organization.

They strangle experimentation. You know this is happening when your team spends more time in meetings debating shared data model design than actually building. When every team's requirements contradict each other, and every change is a negotiation. That's not engineering. That's bureaucracy.

Domains and Streams

The alternative is simple: stop thinking in pipelines. Think in domains and streams.

Every data domain has a canonical flow — the processing that makes sense for that domain. Payments data has a different shape, velocity, and set of consumers than clickstream data. Treat them differently.

This is the same insight that drove microservices in application architecture. Each service owns its domain, its data store, its processing logic. Data infrastructure should work the same way.

Events

"But how do other teams get the data they need?"

Each domain publishes its processed data as events — typically through something like Kafka, but the platform doesn't matter. What matters is the contract: a domain owns its data, publishes well-defined events, and consumers subscribe to what they need.

This is the key shift. Teams don't share a pipeline. They share a protocol. Each consumer processes events at their own pace, in their own way, without stepping on anyone else. No coordination. No shared failure modes. No 2 AM pages because marketing's data broke finance's pipeline.
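
To make the idea of "sharing a protocol" concrete, here is a minimal sketch of a domain-owned event contract and a publish step. The topic name, event fields, broker address, and the use of the kafka-python client are illustrative assumptions, not a prescribed design — any platform that carries well-defined events works.

```python
# Sketch: a payments-domain event contract and publish step.
# Field names, topic, and broker address are illustrative assumptions.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from uuid import uuid4

from kafka import KafkaProducer  # kafka-python; any broker client works


@dataclass
class PaymentCaptured:
    """Event published by the payments domain. Consumers depend on this
    contract, not on the payments team's internal pipeline."""
    event_id: str
    occurred_at: str
    payment_id: str
    amount_cents: int
    currency: str
    schema_version: int = 1


def publish(producer: KafkaProducer, event: PaymentCaptured) -> None:
    # Each domain writes to its own topic; consumers subscribe as needed.
    producer.send("payments.captured.v1", value=asdict(event))


if __name__ == "__main__":
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed local broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    publish(producer, PaymentCaptured(
        event_id=str(uuid4()),
        occurred_at=datetime.now(timezone.utc).isoformat(),
        payment_id="pay_123",
        amount_cents=4999,
        currency="USD",
    ))
    producer.flush()
```

The contract is the dataclass plus the topic name. Consumers never see the producer's internals, only the events.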

Data Quality and Governance

Here's the real reason people want a single pipeline: they want data quality and governance. That's a legitimate goal. But the approach is wrong.

Data quality doesn't come from routing everything through one place. It comes from ownership. Strong quality checks at the source. Clear data stewardship. Robust metadata management. These practices work better in a distributed model because accountability is clear — if the payments domain's data is bad, the payments team owns it.
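
As an illustration of what "quality at the source" can look like, here is a hypothetical pre-publish check the payments team might run before anything reaches the event stream. The rules and field names are assumptions for the sake of the example.

```python
# Hypothetical source-side quality gate: the owning team validates
# records before they ever reach the event stream.
from typing import Any

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed policy


def validate_payment(record: dict[str, Any]) -> list[str]:
    """Return a list of quality violations; an empty list means publishable."""
    errors = []
    if not record.get("payment_id"):
        errors.append("missing payment_id")
    amount = record.get("amount_cents")
    if not isinstance(amount, int) or amount <= 0:
        errors.append("amount_cents must be a positive integer")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append(f"unsupported currency: {record.get('currency')!r}")
    return errors


if __name__ == "__main__":
    bad = {"payment_id": "", "amount_cents": -5, "currency": "XYZ"}
    print(validate_payment(bad))
```

The point isn't the specific rules. It's that the checks live with the team that understands the data, and a violation is that team's problem to fix.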

FAQs

What about data consistency?

This concern assumes that a single pipeline gives you strong consistency. It doesn't — not at scale. You're already eventually consistent; you just don't know it yet.

In a well-designed event-driven architecture, eventual consistency is explicit. All parts of the system converge on the same state. The temporary inconsistencies are usually imperceptible in practice, and you gain scalability and resilience in return.
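
A toy illustration of that convergence, with two in-memory consumers standing in for real services (the event shapes and names are assumptions, not a framework):

```python
# Toy model of eventual consistency: two consumers read the same event
# log at different speeds but converge on the same state.
events = [
    {"account": "a1", "delta_cents": 1000},
    {"account": "a1", "delta_cents": -250},
    {"account": "a2", "delta_cents": 500},
]


def apply(state: dict[str, int], event: dict) -> None:
    state[event["account"]] = state.get(event["account"], 0) + event["delta_cents"]


finance_view: dict[str, int] = {}
marketing_view: dict[str, int] = {}

# Finance has processed everything; marketing lags behind.
for e in events:
    apply(finance_view, e)
for e in events[:2]:
    apply(marketing_view, e)
print(finance_view == marketing_view)  # False: temporarily inconsistent

# Once marketing catches up, both views converge.
for e in events[2:]:
    apply(marketing_view, e)
print(finance_view == marketing_view)  # True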

What about compliance and auditing?

Distributed architecture actually makes compliance easier, not harder. Clear domain ownership means granular auditing. Event-driven architectures create natural audit trails — every event is a record. With proper metadata management, data lineage tracking across domains is straightforward.
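
For instance, because every event is already a record, an audit query can be a simple scan of the log. The envelope fields below (producer, schema_version, occurred_at) are the kind of metadata that makes lineage tracking possible; they're an assumption for this sketch, not a standard.

```python
# Sketch: reconstructing an audit trail for one entity from an event log.
# The envelope fields are illustrative; any consistent metadata works.
event_log = [
    {"event_id": "e1", "producer": "payments", "schema_version": 1,
     "occurred_at": "2024-04-01T10:00:00Z", "payment_id": "pay_123",
     "type": "payments.captured"},
    {"event_id": "e2", "producer": "payments", "schema_version": 1,
     "occurred_at": "2024-04-02T09:30:00Z", "payment_id": "pay_123",
     "type": "payments.refunded"},
]


def audit_trail(log: list[dict], payment_id: str) -> list[dict]:
    """Every state change is an event, so the audit trail is a filter."""
    return sorted(
        (e for e in log if e["payment_id"] == payment_id),
        key=lambda e: e["occurred_at"],
    )


for entry in audit_trail(event_log, "pay_123"):
    print(entry["occurred_at"], entry["producer"], entry["type"])
```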

Is centralized data management always a bad idea?

No, not always. Small orgs with simple data needs don't need the overhead of domain decomposition. Specific compliance requirements might demand centralized control over certain data flows. The point isn't "centralization is always wrong." The point is that it should be a deliberate choice for specific domains, not the default architecture for everything.

Conclusion

Single pipelines don't scale. They bottleneck teams and kill iteration speed. Design your data infrastructure around domains and event streams, and your teams will spend their time extracting value from data instead of fighting over shared infrastructure.