Too often when organizations design data pipelines, they find themselves obsessed with creating the perfect, all-encompassing solution. Common wisdom has it that every data point in an enterprise should flow through a single, meticulously-crafted pipeline.
As with much common wisdom, this isn't necessarily true. It's a good idea to have a robust data infrastructure, sure. But bending over backward to force everything through a single pipeline is often the wrong approach. At best, it's often simply not necessary; at worst, it can spawn even more problems than it purports to solve. The very notion runs counter to the distributed, event-driven systems that power many data-driven enterprises today, and a perfect, one-size-fits-all pipeline is, by and large, mythical.
Perfect Pipelines
Before we get into the problems of monolithic data pipelines, let's discuss why we build data infrastructure in the first place. We'll start with one of the most fundamental goals: make data accessible and actionable across the organization.
Different teams have different data needs. Marketing needs real-time access to customer interaction data, while finance requires batch processing of transaction data. Forcing both through the same pipeline is like trying to fit a square peg in a round hole.
And this leads us to perhaps the most fundamental benefit of modern data engineering. While we can rattle off the technical benefits of distributed systems, the true benefit is organizational. Teams have the flexibility to process and analyze data in ways that suit their specific needs. They can change their data flows without breaking other teams' pipelines. They can add new data sources or sinks as needed, without requiring coordination across the entire organization.
Enforcing a single, perfect data pipeline sabotages those benefits.
Unwanted Inflexibility
Monolithic pipelines re-establish rigidity across teams and across data domains. They force teams to conform to a one-size-fits-all approach to data processing. If a team needs a change to the pipeline, it is now at the mercy of some central data engineering team's schedule to get that change made. And consider production issues: engineers will find themselves awoken at night, paged because some other team's data flow is clogging up the entire pipeline.
Single pipelines also mean single points of failure. When all data must flow through one pipeline, an issue in any part of it can grind the entire data infrastructure to a halt. They introduce performance and scalability bottlenecks, too: as the volume and variety of data grow, every new source and consumer piles load onto the same shared path.
Perhaps most importantly, adherence to a single pipeline severely hampers our ability to innovate and experiment with data. This becomes apparent as teams wrap themselves further and further around the axle, arguing about how to squeeze their unique data needs into the existing pipeline. Have you ever found yourself in endless meetings debating the design of your data models? Trying to meet every team's requirements? Negotiating compromises when those requirements contradict each other? That's a clear smell that something is wrong.
Domains and Streams
Instead of a single pipeline for all our data, think in terms of domains and streams. Within any system that manages data, there will be specific flows and processes that make sense for that domain. That is the domain's canonical data flow. There may be additional processes that transform or enhance that data. But those additional processes are always designed to serve the specific needs of that domain.
As an analogy, think of microservices in software architecture. Each service handles a specific domain, with its own data store and processing logic. The same principle can apply to our data engineering.
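To make that concrete, here's a minimal sketch of what a domain's canonical event might look like, assuming a hypothetical orders domain with illustrative field names (none of this comes from a particular standard). The point is simply that the domain owns its own schema and serialization, not that this is the one right shape.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json
import uuid


@dataclass
class OrderPlaced:
    """Canonical event for a hypothetical 'orders' domain."""
    order_id: str
    customer_id: str
    total_cents: int
    placed_at: str

    def to_json(self) -> str:
        # Serialize with an explicit schema version so downstream consumers
        # can evolve independently of the orders domain's internals.
        return json.dumps({"schema_version": 1, **asdict(self)})


event = OrderPlaced(
    order_id=str(uuid.uuid4()),
    customer_id="cust-42",
    total_cents=12999,
    placed_at=datetime.now(timezone.utc).isoformat(),
)
print(event.to_json())
```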
Events
If you're not familiar with event-driven architectures, you might be wondering how different teams are supposed to access data if it's not all flowing through a central pipeline. Wouldn't they still need to tap into some central data lake or warehouse?
As it turns out, they don't. Instead, each data domain publishes its processed data as events. Generally, we use an event streaming platform like Kafka for this purpose, but the details don't matter here. What does matter is that different teams are able to subscribe to those events, ingest them, and process them in ways that make sense for their specific needs.
This approach allows for a more flexible, scalable, and resilient data infrastructure. Each team can process data at their own pace, in their own way, without affecting others. It naturally leads to a more decentralized, distributed system that can handle the complexity and volume of modern data.
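To give a rough feel for how this works in practice, here's a hedged sketch using the confluent-kafka Python client. The broker address, topic name, and consumer group are assumptions made up for the example; the takeaway is that the producing domain and the consuming team share only a topic and an event format, nothing else.

```python
import json

from confluent_kafka import Consumer, Producer

# Producer side: the 'orders' domain publishes its canonical events.
producer = Producer({"bootstrap.servers": "localhost:9092"})
payload = json.dumps({"schema_version": 1, "order_id": "ord-123", "total_cents": 12999})
producer.produce("orders.order-placed", key="ord-123", value=payload)
producer.flush()

# Consumer side: another team (say, marketing) subscribes with its own
# consumer group and processes events at its own pace, independently.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "marketing-analytics",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.order-placed"])
msg = consumer.poll(5.0)  # wait up to five seconds for a message
if msg is not None and msg.error() is None:
    print("marketing received:", msg.value().decode("utf-8"))
consumer.close()
```

If finance wants the same events in batch, it simply attaches its own consumer group and reads them on its own schedule; nothing about the orders domain has to change.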
Data Quality and Governance
Often when looking for a perfect pipeline, what we're really seeking is a way to ensure data quality and governance. We want to make sure that our data is accurate, consistent, and used appropriately.
Forcing all data through a single pipeline isn't the answer. Instead, implement strong data quality checks at the source, clear data ownership and stewardship policies, and robust metadata management.
These practices can be implemented across multiple pipelines and data flows. They ensure that no matter how data is processed or where it flows, it maintains its integrity and usefulness.
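As an example of what a check at the source might look like, here's a small sketch in plain Python. The required fields and rules are assumptions for illustration; in practice you might lean on a schema registry or a validation library, but the principle is the same: bad data gets rejected before it ever leaves its domain.

```python
def validate_order_event(event: dict) -> None:
    """Quality checks applied at the source, before an event is published.

    The required fields and rules here are illustrative assumptions; each
    domain defines the checks that make sense for its own data.
    """
    required = {"order_id", "customer_id", "total_cents", "placed_at"}
    missing = required - event.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(event["total_cents"], int) or event["total_cents"] < 0:
        raise ValueError("total_cents must be a non-negative integer")


order = {
    "order_id": "ord-123",
    "customer_id": "cust-42",
    "total_cents": 12999,
    "placed_at": "2024-01-01T00:00:00+00:00",
}
validate_order_event(order)  # raises before anything is published if the data is bad
```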
FAQs
What about data consistency?
It's a valid concern, but it's based on a misunderstanding of what data consistency really means in a distributed system.
In a well-designed, event-driven data architecture, the aim is eventual consistency: given enough time, all parts of the system will reflect the same state of the data. We accept temporary inconsistencies (often imperceptible in practice) in exchange for greater scalability and resilience.
What about compliance and auditing?
Compliance and auditing are indeed critical, especially in regulated industries; however, distributed data architecture doesn't preclude effective compliance and auditing practices. In fact, it can enhance them.
Clearly defining data domains and ownership enables more granular and effective auditing. Event-driven architectures naturally create audit trails as data flows through the system, and with proper metadata management, data lineage can be tracked across multiple pipelines and processes.
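One hedged sketch of how that lineage can travel with the data: wrap each event's payload in an envelope that records which domain produced it and which upstream event it was derived from. The field names here are assumptions, not a standard.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def wrap_with_lineage(payload: dict, source: str,
                      parent_event_id: Optional[str] = None) -> str:
    """Wrap a payload in an envelope carrying lineage metadata."""
    envelope = {
        "event_id": str(uuid.uuid4()),
        "source": source,                    # which domain or pipeline produced this
        "parent_event_id": parent_event_id,  # upstream event it was derived from, if any
        "produced_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    return json.dumps(envelope)


# The orders domain emits a raw event; a downstream enrichment step emits a
# derived event that points back at it, forming a walkable audit trail.
raw = wrap_with_lineage({"order_id": "ord-123"}, source="orders-domain")
enriched = wrap_with_lineage(
    {"order_id": "ord-123", "segment": "premium"},
    source="marketing-enrichment",
    parent_event_id=json.loads(raw)["event_id"],
)
print(enriched)
```

Auditors can walk the parent chain from any derived record back to its origin, no central pipeline required.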
Is centralized data management always a bad idea?
We've seen that enforcing a single, perfect data pipeline is often unnecessary, frequently detrimental, and in many cases a myth we keep trying to make real. Still, the core point is that prescribing a one-size-fits-all solution often causes more problems than it solves. The same applies to data infrastructure design, and some scenarios do, in fact, call for centralized management. As discussed above, compliance or business requirements might necessitate a more centralized approach for specific data domains.
But this is not a universal, golden rule to which we must adhere. Contrary to common wisdom, a single, all-encompassing data pipeline is not an automatic requirement. More often than not, it is a hindrance to flexibility, scalability, and innovation.
Conclusion
Designing data infrastructure around domains and event streams, rather than a single perfect pipeline, helps avoid bottlenecks, enables more flexible and scalable event-driven systems, and shifts the focus to extracting value from data instead of forcing it through artificial constraints.