The rise of the data lakehouse: A new era of data value
With 65 million vaccine doses to administer at the height of the COVID-19 pandemic, Luigi Guadagno, vice president of pharmacy renewal and healthcare platform technology at Walgreens, needed to know where to send them. To find out, he queried Walgreens’ data lakehouse, implemented with Databricks technology on Microsoft Azure.
“We leveraged the lakehouse to understand the moment,” he says. For Guadagno, the need to match vaccine availability with patient demand came at the right moment, technologically speaking. The giant pharmacy chain had put its lakehouse in place to address just such challenges in its quest to, as Guadagno puts it, “get the right product in the right place for the right patient.”
Previously, Walgreens had attempted to perform that task with its data lake but faced two significant obstacles: cost and time. Those challenges are familiar to many organizations that have sought to extract analytical insight from their vast stores of data. The result is an emerging paradigm shift in how enterprises surface insights, one that sees them leaning on a new category of technology architected to help organizations maximize the value of their data.
Enter the data lakehouse
Traditionally, organizations have maintained two systems as part of their data strategies: a system of record on which to run their business and a system of insight such as a data warehouse from which to gather business intelligence (BI). With the advent of big data, a second system of insight, the data lake, appeared to serve up artificial intelligence and machine learning (AI/ML) insights. Many organizations, however, are finding this paradigm of relying on two separate systems of insight untenable.
The data warehouse requires a time-consuming extract, transform, and load (ETL) process to move data from the system of record into the warehouse, where it is normalized and queried to produce answers. Meanwhile, unstructured data is dumped into a data lake, where skilled data scientists analyze it using tools such as Python, Apache Spark, and TensorFlow.
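For readers less familiar with that second path, a minimal sketch of the kind of ad hoc analysis a data scientist might run directly against raw files in a data lake looks like this (PySpark; the bucket path and field names are hypothetical):

```python
# Illustrative only: ad hoc analysis over raw, semi-structured files in a data lake,
# separate from the warehouse. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Raw events land in the lake without an upfront ETL step
events = spark.read.json("s3a://example-lake/raw/events/")

# Ad hoc aggregation over the raw data
daily_counts = (
    events
    .withColumn("day", F.to_date("event_ts"))
    .groupBy("day", "event_type")
    .count()
    .orderBy("day")
)
daily_counts.show()
```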
Under Guadagno, the Deerfield, Ill.-based Walgreens consolidated its systems of insight into a single data lakehouse. And he’s not alone. An increasing number of companies are finding that lakehouses — which fall into a product category generally known as query accelerators — are meeting a critical need.
“Lakehouses redeem the failures of some data lakes. That’s how we got here. People couldn’t get value from the lake,” says Adam Ronthal, vice president and analyst at Gartner. In the case of the Databricks Delta Lake lakehouse, structured data from a data warehouse is typically added to a data lake. To that, the lakehouse adds layers of optimization to make the data more broadly consumable for gathering insights.
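A minimal sketch of that idea, assuming a Spark environment already configured with the open-source Delta Lake libraries (the paths and table names here are hypothetical, not Walgreens’ actual schema): the same files-in-object-storage model, but written as a Delta table so it can be queried reliably with plain SQL.

```python
# Sketch of the "optimization layer" idea: raw lake data rewritten as a Delta table,
# which adds a transaction log that makes the data consumable for BI-style SQL.
# Assumes Spark is configured with Delta Lake; paths and names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

raw = spark.read.parquet("s3a://example-lake/raw/prescriptions/")

# Write the raw files out as a Delta table in the same object store
raw.write.format("delta").mode("overwrite").save("s3a://example-lake/delta/prescriptions")

# Downstream BI-style query against the Delta table
spark.read.format("delta").load("s3a://example-lake/delta/prescriptions") \
    .createOrReplaceTempView("prescriptions")

spark.sql("""
    SELECT store_id, COUNT(*) AS fills
    FROM prescriptions
    GROUP BY store_id
    ORDER BY fills DESC
""").show()
```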
The Databricks Delta Lake lakehouse is but one entry in an increasingly crowded marketplace that includes such vendors as Snowflake, Starburst, Dremio, GridGain, DataRobot, and perhaps a dozen others, according to Gartner’s Market Guide for Analytics Query Accelerators.
Moonfare, a private equity firm, is transitioning from a PostgreSQL-based data warehouse on AWS to a Dremio data lakehouse on AWS for business intelligence and predictive analytics. When the implementation goes live in the fall of 2022, business users will be able to perform self-service analytics on top of data in AWS S3. Queries will include which marketing campaigns are working best with which customers and which fund managers are performing best. The lakehouse will also help with fraud prevention.
“You can intuitively query the data from the data lake. Users coming from a data warehouse environment shouldn’t care where the data resides,” says Angelo Slawik, data engineer at Moonfare. “What’s super important is that it takes away ETL jobs,” he says, adding, “With Dremio, if the data is in S3, you can query what you want.”
Moonfare selected Dremio in a proof-of-concept runoff with AWS Athena, an interactive query service that enables SQL queries on S3 data. According to Slawik, Dremio proved more capable, thanks to fast performance and a highly functional user interface that lets users track data lineage visually. Also important were Dremio’s role-based views and access controls for security and governance, which help the Berlin-based company comply with GDPR.
At Paris-based BNP Paribas, scattered data silos were being used for BI by different teams at the giant bank. Emmanuel Wiesenfeld, an independent contractor, re-architected the silos to create a centralized system so business users such as traders could run their own analytics queries across “a single source of truth.”
“Trading teams wanted to collaborate, but data was scattered. Tools for analyzing the data also were scattered, making them costly and difficult to maintain,” says Wiesenfeld. “We wanted to centralize data from lots of data sources to enable real-time situational awareness. Now users can write their own scripts and run them over the data,” he explains.
Using Apache Ignite technology from GridGain, Wiesenfeld created an in-memory computing architecture. Key to the new approach is a move from ETL to ELT, in which transformation is carried out as the computations are performed, streamlining the entire process, according to Wiesenfeld. He says the result was a reduction in latency from hours to seconds. Wiesenfeld has since launched a startup called Kawa to bring similar solutions to other customers, particularly hedge funds.
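The ETL-to-ELT shift Wiesenfeld describes can be sketched generically. The example below uses PySpark rather than GridGain’s Ignite APIs, and the source paths and fields are hypothetical: raw records are loaded as-is, and the transformation is expressed as a view that is computed only when someone queries it.

```python
# Generic ELT sketch (not GridGain-specific): load raw data untouched,
# and perform the "T" at query time instead of in an upfront batch job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: raw trade records land as-is
spark.read.json("s3a://example-lake/raw/trades/").createOrReplaceTempView("raw_trades")

# Transform: expressed as a view and computed when users query it
spark.sql("""
    CREATE OR REPLACE TEMP VIEW trades AS
    SELECT trade_id,
           CAST(executed_at AS TIMESTAMP) AS executed_at,
           UPPER(instrument)              AS instrument,
           quantity * price               AS notional
    FROM raw_trades
""")

spark.sql("SELECT instrument, SUM(notional) FROM trades GROUP BY instrument").show()
```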
Starburst takes a mesh approach, leveraging open-source Trino technology in Starburst Enterprise to improve access to distributed data. Rather than moving data into a central warehouse, the mesh enables access while allowing data to stay where it is. Sophia Genetics is using Starburst Enterprise in its cloud-based bioinformatics SaaS analytics platform. One reason: Keeping sensitive healthcare data within specific countries is important for regulatory reasons. “Due to compliance constraints, we simply can not deploy any system that accesses all data from one central point,” said Alexander Seeholzer, director of data services at Switzerland-based Sophia Genetics, in a Starburst case study.
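A rough sketch of that federated, leave-the-data-where-it-is pattern, using the open-source Trino Python client (the host, catalogs, and table names are hypothetical, not Sophia Genetics’ actual environment):

```python
# Illustrative federated query: one SQL statement joins data across sources
# (catalogs) without moving it. Connection details and tables are hypothetical.
from trino.dbapi import connect

conn = connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",      # default catalog for this session
    schema="analytics",
)
cur = conn.cursor()

# Join files in object storage (hive catalog) with an operational database
# (postgresql catalog) in a single query.
cur.execute("""
    SELECT c.region, COUNT(*) AS samples
    FROM hive.lab.sample_results AS s
    JOIN postgresql.crm.customers AS c
      ON s.customer_id = c.id
    GROUP BY c.region
""")
for row in cur.fetchall():
    print(row)
```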
The new query acceleration platforms aren’t standing still. Databricks and Snowflake have introduced data clouds and data lakehouses with features designed for the needs of companies in specific industries such as retail and healthcare. These moves echo the introduction of industry-specific clouds by hyperscalers Microsoft Azure, Google Cloud Platform, and Amazon Web Services.
The lakehouse as best practice
Gartner’s Ronthal sees the evolution of the data lake to the data lakehouse as an inexorable trend. “We are moving in the direction where the data lakehouse becomes a best practice, but everyone is moving at a different speed,” Ronthal says. “In most cases, the lake was not capable of delivering production needs.”
Despite the eagerness of data lakehouse vendors to subsume the data warehouse into their offerings, Gartner predicts the warehouse will endure. “Analytics query accelerators are unlikely to replace the data warehouse, but they can make the data lake significantly more valuable by enabling performance that meets requirements for both business and technical staff,” concludes its report on the query accelerator market.
Noel Yuhanna, vice president and principal analyst at Forrester Research, disagrees, asserting the lakehouse will indeed take the place of separate warehouses and lakes.
“We do see the future of warehouses and lakes coming into a lakehouse, where one system is good enough,” Yuhanna says. For organizations with distributed warehouses and lakes, a mesh architecture such as Starburst’s will fill a need, according to Yuhanna, because it enables organizations to implement federated governance across various data locations.
Whatever the approach, Yuhanna says companies are seeking to gain faster time to value from their data. “They don’t want ‘customer 360’ six months from now; they want it next week. We call this ‘fast’ data. As soon as the data is created, you’re running analytics and insights on it,” he says.
From a system of insight to a system of action
For Guadagno, vaccine distribution was a high-profile, lifesaving initiative, but the Walgreens lakehouse does yeoman work in more mundane but essential retail tasks as well, such as sending out prescription reminders and product coupons. These processes combine an understanding of customer behavior with the availability of pharmaceutical and retail inventory. “It can get very sophisticated, with very personalized insights,” he says. “It allows us to become customer-centric.”
To others embarking on a similar journey, Guadagno advises, “Put all your data in the lakehouse as fast as possible. Don’t embark on any lengthy data modeling or rationalization. It’s better to think about creating value. Put it all in there and give everybody access through governance and collaboration. Don’t waste money on integration and ETL.”
At Walgreens, the Databricks lakehouse is about more than simply making technology more efficient. It’s key to its overall business strategy. “We’re on a mission to create a very personalized experience. It starts at the point of retail — what you need and when you need it. That’s ultimately what the data is for,” Guadagno says. “There is no more system of record and system of insight. It’s a system of action.”