Data Silos: Understanding, Addressing, Integrating

Exploring the roots and pathways to overcoming data fragmentation

Chunting Wu
5 min read · May 13, 2024

Data engineering today faces many challenges, and one of the toughest is the increasingly complex technical stack and the data silos that emerge from it.

The term “data silos” refers to data scattered across several places that are difficult to connect and integrate with one another, as if they were isolated (and indeed they often are physically isolated).

There are many reasons for data silos, for example:

  • Separate data pipelines between different departments.
  • Data centers in different regions for compliance purposes.
  • Technical stacks built on different vendors for cost reasons.
  • And so on.

The most complicated scenario in these cases would be data silos across several public clouds.

For instance, the original technical stack was built on GCP, so cloud services such as GCS and BigQuery were widely used. Then, for various reasons, we started to build an AWS data stack as well, which meant adopting S3, Redshift, and so on.

In fact, the three major public clouds all offer similar data engineering services for the corresponding functions.

If we are looking for relational databases, AWS has RDS, Azure has SQL Database, and GCP has Cloud SQL. If we are looking for object storage, AWS has the famous S3, Azure has Blob Storage, and GCP has GCS.

If we are looking for a general-purpose data warehouse, AWS has Redshift, Azure has Synapse, and GCP has the famous BigQuery. Even for the specialized catalog services used in data engineering, AWS has Glue, while Azure and GCP each have their corresponding Data Catalog.

Arguably, the three public clouds are direct competitors. Some services share common protocols and can be migrated almost seamlessly, such as RDS and SQL Database, which are built on the same underlying database engines.

However, other services, such as object storage, are not as simple to migrate. As a result, the data services built on the three public clouds have become data silos for each other.
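
As a rough illustration of that friction, the sketch below copies a single object from GCS to S3 by going through two entirely different SDKs. It assumes google-cloud-storage and boto3 are installed and that credentials for both clouds are already configured; the bucket and object names are made up.

```python
# Minimal sketch: moving one object from GCS to S3 means juggling two SDKs.
# Assumes `google-cloud-storage` and `boto3` are installed and credentials
# for both clouds are configured in the environment. All names are made up.
import boto3
from google.cloud import storage


def copy_gcs_object_to_s3(gcs_bucket: str, s3_bucket: str, key: str) -> None:
    # Download the object from GCS into memory (fine for small objects;
    # large objects would need streaming or multipart uploads instead).
    gcs_client = storage.Client()
    data = gcs_client.bucket(gcs_bucket).blob(key).download_as_bytes()

    # Re-upload the same bytes to S3 through a completely different client.
    boto3.client("s3").put_object(Bucket=s3_bucket, Key=key, Body=data)


if __name__ == "__main__":
    copy_gcs_object_to_s3("analytics-raw-gcp", "analytics-raw-aws",
                          "events/2024/05/13.parquet")
```

And moving the bytes is only the easy part: permissions, metadata, and table formats do not travel with them, which is why the two stacks drift into separate silos so quickly.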

Furthermore, with the rise of AI in recent years, many organizations have started to build their own AI/ML technical stacks. Unlike the stacks used in data engineering, AI/ML stacks are largely independent, built on infrastructure such as feature stores and vector databases. These components hold their own data as well, and it is difficult to integrate them with the existing stacks.

The following diagrams show a complete data infrastructure and a complete AI/ML infrastructure.

Source: https://a16z.com/emerging-architectures-for-modern-data-infrastructure/

Comparing the two architecture diagrams at the link above, we can see that these two groups of technical stacks are independent of each other, i.e., two silos.

Data Lakehouse

After understanding the root cause of data silos, the next question is how to solve it.

The most straightforward idea is to manage all data centrally, which is exactly what a data lakehouse does; Databricks was the first to propose the idea.

In analytics scenarios, structured data is often used, while AI/ML scenarios rely heavily on unstructured data. For data lakehouses, both structured and unstructured data can be centrally managed.

In addition, the data lakehouse provides a common interface for various processing engines, such as Spark and Flink for batch and stream processing and Trino for interactive queries. Thanks to this multi-engine support, the data lakehouse can be applied to almost every data scenario.

Source: https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
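
To make the multi-engine idea concrete, here is a minimal sketch in PySpark that writes a Delta table which any Delta-aware engine can then read. It assumes the delta-spark package is available; the paths, table names, and sample data are all made up for illustration.

```python
# A minimal sketch of "one copy of data, many engines", assuming the
# delta-spark package is installed (pip install pyspark delta-spark).
# The local path stands in for an object-store location such as
# s3a://... or gs://...; all names here are illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write events as a Delta table: open file format (Parquet) plus a transaction log.
events = spark.createDataFrame(
    [("u1", "click"), ("u2", "purchase")], ["user_id", "action"]
)
events.write.format("delta").mode("overwrite").save("/tmp/lake/bronze/events")

# Spark can read the table back ...
spark.read.format("delta").load("/tmp/lake/bronze/events").show()

# ... and because the format is open, other engines can read the same files.
# Once the location is registered in a shared catalog, Trino's Delta Lake
# connector could run something like:
#   SELECT action, count(*) FROM delta.bronze.events GROUP BY action;
```

The design choice that matters here is the open table format: because the table is just Parquet files plus a transaction log on shared storage, Spark, Flink, and Trino all read one copy of the data instead of each keeping its own.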

Moreover, data lakehouses strike a balance between cost and performance. They can support a variety of usage scenarios while saving costs, although part of this cost advantage is traded off against peak performance.

Nevertheless, data lakehouses still don’t solve everything; for example:

  • Because some performance is sacrificed, user-facing features still require dedicated database support.
  • Unified data storage is a challenge for privacy and compliance.
  • Putting all the data in one place still can’t avoid the data swamp dilemma that plagued data lakes in the past. Although the data lakehouse has a catalog, it is still not transparent enough.

Most importantly, although the data lakehouse effectively narrows the gap between data engineering and AI/ML, it is still helpless in cross-cloud scenarios.

Metadata Lake

The problem we face now is that storage is everywhere, each kind of storage has its own catalog, and these catalogs are heterogeneous.

What if we treated catalogs the way we treat data? In other words, just as we gather all the data in one place, we could gather all the catalogs in one place. Could we then achieve complete governance? And if we can fully govern all the data through its metadata, won’t we break the data silos?

Yes, exactly. This is what the concept of a metadata lake does.

Modern data processing engines each have their corresponding catalogs, and the metadata lake is designed to centralize the governance of these catalogs. In this way, there is a unified place to view all data sources and understand their internal data structures.
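
To make the idea concrete, here is a deliberately simplified and hypothetical sketch of “treating catalogs like data”: a thin facade that federates listing across heterogeneous catalogs behind one interface. The class and method names are invented for illustration; real metadata lakes such as Gravitino, OneLake, or BigLake cover far more ground (schemas, lineage, access control, and so on).

```python
# Hypothetical sketch of a "metadata lake" facade: one interface that
# federates several heterogeneous catalogs. The adapter classes and their
# return values are made up for illustration only.
from abc import ABC, abstractmethod


class CatalogAdapter(ABC):
    """Adapter over one concrete catalog (Hive Metastore, Glue, BigQuery, ...)."""

    @abstractmethod
    def list_tables(self) -> list[str]:
        ...


class HiveMetastoreAdapter(CatalogAdapter):
    def list_tables(self) -> list[str]:
        # In reality this would call the Hive Metastore Thrift API.
        return ["hive.sales.orders", "hive.sales.customers"]


class GlueAdapter(CatalogAdapter):
    def list_tables(self) -> list[str]:
        # In reality this would call AWS Glue through boto3.
        return ["glue.clickstream.events"]


class MetadataLake:
    """Single entry point over many catalogs: one place to see every dataset."""

    def __init__(self, catalogs: dict[str, CatalogAdapter]):
        self.catalogs = catalogs

    def list_all_tables(self) -> dict[str, list[str]]:
        # Federate the listing across every registered catalog.
        return {name: cat.list_tables() for name, cat in self.catalogs.items()}


if __name__ == "__main__":
    lake = MetadataLake({
        "on-prem-hive": HiveMetastoreAdapter(),
        "aws-glue": GlueAdapter(),
    })
    print(lake.list_all_tables())
```

The point is the direction rather than the code: instead of moving the data, we federate the metadata, so the silos stay where they physically are but become visible and governable from a single place.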

This concept is not new; in fact, the public clouds have already launched corresponding products, such as Microsoft’s OneLake and GCP’s BigLake, which are built on this idea.

However, these public cloud services only solve the data silos within their own cloud; there is still no solution for the cross-cloud case.

Fortunately, more and more vendors have noticed this business opportunity and started building their own products; for example, Datastrato’s Gravitino focuses on solving data silos across clouds.

Conclusion

We are living in a chaotic and complex era.

The amount of data is exploding, and the AI boom is accelerating the process. In the past, we tried to store data, clean it, transform it, and then use it. With the data we already have, we are swamped, and as volumes grow, it only gets busier.

However, nowadays, this is not enough. We want to know what is stored and we want to be able to integrate and interact with it easily. Therefore, data governance has become an obvious subject.

That’s also why there’s an increasing number of tools geared towards metadata.

I also care about this topic because our organization recently made the transition from GCP to AWS, and governing the warehouse schemas across both clouds is becoming an issue. I will leave that story for later, so let’s call it a day.


Chunting Wu

Architect at SHOPLINE. Experienced in system design, backend development, and embedded systems. Sponsor me if you like: https://www.buymeacoffee.com/MfGjSk6