Data Silos: Understanding, Addressing, Integrating
Exploring the roots and pathways to overcoming data fragmentation
Data engineering faces many challenges today. Among the toughest are increasingly complex technical stacks and the growing problem of data silos.
The term “data silos” refers to data scattered across several places that are difficult to connect and integrate, as if they were isolated from each other (and they are indeed physically isolated).
There are many reasons for data silos, for example:
- Separate data pipelines between different departments.
- Data centers in different regions for compliance purposes.
- For cost reasons, technical stacks are built on different vendors.
- And so on.
The most complicated of these scenarios is data silos spread across several public clouds.
For instance, suppose the original technical stack was built on GCP, so cloud services such as GCS and BigQuery were widely used. Then, for various reasons, we started to build data stacks on AWS as well, so we also needed to use S3, Redshift, and so on. In fact, the three major public clouds (GCP, AWS, and Azure) offer services that play similar roles in data engineering.
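To make the fragmentation concrete, here is a minimal sketch that groups dataset locations by the cloud they live in, based only on the URI scheme (`gs://` for GCS, `s3://` for S3). The bucket names, paths, and the `group_by_cloud` helper are hypothetical and used purely for illustration; a real inventory would more likely come from a data catalog.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumed mapping from URI scheme to the cloud hosting the data.
SCHEME_TO_CLOUD = {"gs": "GCP", "s3": "AWS"}


def group_by_cloud(uris):
    """Group dataset URIs by the cloud they live in, using the URI scheme."""
    silos = defaultdict(list)
    for uri in uris:
        scheme = urlparse(uri).scheme
        silos[SCHEME_TO_CLOUD.get(scheme, "unknown")].append(uri)
    return dict(silos)


# Hypothetical dataset locations, for illustration only.
datasets = [
    "gs://example-analytics/events/2024/",
    "s3://example-analytics/orders/2024/",
    "gs://example-analytics/users/",
]

silos = group_by_cloud(datasets)
for cloud, uris in silos.items():
    print(cloud, uris)
```

Each key in the result is effectively one silo: the datasets under it can be queried together with that cloud's own services, while joining across keys requires moving or federating data.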