A 5-pillar approach to modern data management
Manish Limaye
Pillar #1: Data platform
The data platform pillar comprises the tools, frameworks, and hosting and processing technologies that enable an organization to process large volumes of data in both batch and streaming modes. Organizations must decide on their hosting provider, whether that is an on-premises setup, a public cloud such as AWS, GCP, or Azure, or a specialized data platform provider such as Snowflake or Databricks. They must also select their data processing frameworks (such as Spark, Beam, or SQL-based processing) and choose tooling for machine learning.
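To make the batch/streaming distinction concrete, here is a minimal sketch using PySpark, one of the frameworks named above. The bucket paths, Kafka broker, and event schema are hypothetical placeholders rather than a prescription; the point is that a well-chosen platform lets the same engine, with largely the same API, serve both modes.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("platform-sketch").getOrCreate()

# Batch mode: process a bounded set of historical event files.
# (Bucket paths and the event_type field are hypothetical.)
batch_df = spark.read.json("s3://my-datalake/events/2024/")
batch_df.groupBy("event_type").count() \
    .write.mode("overwrite").parquet("s3://my-datalake/reports/event_counts/")

# Streaming mode: largely the same API over an unbounded Kafka topic.
# (Assumes the spark-sql-kafka connector is on the classpath; broker is hypothetical.)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)
rolling = events.groupBy(window(col("timestamp"), "5 minutes")).count()

query = (
    rolling.writeStream
    .outputMode("complete")   # re-emit the full aggregate each trigger
    .format("console")
    .start()
)
query.awaitTermination()      # block while the stream runs
```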
Based on business needs and the nature of the data (raw vs. structured), organizations should determine whether to set up a data warehouse or a lakehouse, or to consider a data fabric technology. The choice of vendors should align with the broader cloud or on-premises strategy. For example, if a company has chosen AWS as its preferred cloud provider and is committed to operating primarily within AWS, it makes sense to use the AWS data platform. Similarly, there is a case for Snowflake, Cloudera, or other platforms, depending on the company's overarching technology strategy.
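As an illustration of what the lakehouse option means in practice, the sketch below uses Delta Lake on Spark (the open table format behind Databricks, installable as the delta-spark package). The bucket paths and the orders schema are hypothetical; the idea is warehouse-like guarantees, ACID writes and SQL access, layered over raw files in object storage.

```python
from pyspark.sql import SparkSession

# A minimal lakehouse sketch: Delta Lake adds transactional, queryable tables
# on top of plain object storage. Assumes the delta-spark package is installed.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Raw JSON lands in a landing zone (hypothetical path); rewriting it as a
# Delta table adds ACID transactions and schema enforcement.
raw = spark.read.json("s3://landing/orders/")
raw.write.format("delta").mode("append").save("s3://lakehouse/orders/")

# Downstream consumers then query it like a warehouse table.
spark.read.format("delta").load("s3://lakehouse/orders/") \
    .createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) AS n FROM orders GROUP BY status").show()
```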
However, I am not in favor of assembling numerous tools in pursuit of the elusive “best of breed” dream: integrating them is excessively time-consuming, and technology evolves too rapidly for DIY integration to keep up. Furthermore, generally speaking, data should not be split across multiple databases on different cloud providers in order to achieve cloud neutrality. To borrow a line that is not originally mine, a cardinal sin of cloud-native data architecture is copying data from one location to another. Every copy is free money handed to cloud providers, and it creates significant issues in end-to-end value generation.