Looking at both sides of the data lake argument


For decades, the data warehouse was the go-to technology for storing large amounts of data for querying and data mining. It should not be confused with the venerable database, which has a different mode of operation and use.

The data lake arrived with the advent of Big Data. The concept was coined in 2010 by James Dixon, founder of Pentaho (now a part of Hitachi Vantara), in a blog post announcing his company’s first Hadoop-based release. He argued that data marts, the smaller, single-purpose subsets of data warehouses, had several problems, such as size restrictions that narrow research parameters.

“If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples,” he wrote.

Data lakes are often compared to data warehouses, but the two share only one element: both store massive amounts of data for later analysis. Beyond that, they are nothing alike.

“Is the data lake the new data warehouse? Yes and no,” says Steve Tcherchian, CISO for XYPRO Technology, a cybersecurity vendor for mission-critical apps. “They can be used as a data warehouse, but if they are not used correctly, they become data graveyards.”

The differences

Here is the fundamental difference between a data lake and a data warehouse: the data lake uses schema on read, while the data warehouse uses schema on write. In schema on write, the first step in creating the data warehouse is to create the data tables and configure the schemas that format the data for those tables. So the data is heavily processed before it is stored.

Only after you have created the tables and configured the schema can you begin to input the data. This means a lot of data prep work in advance but once it’s all in, the data can be quickly read, processed, and analyzed.
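
To make the contrast concrete, here is a minimal schema-on-write sketch in PySpark; the table, paths, and column names are hypothetical, but the pattern is the same in any warehouse pipeline: define and enforce the structure before anything is stored.

```python
# Minimal schema-on-write sketch (PySpark). Table, paths, and column
# names are hypothetical. The schema is defined and enforced BEFORE
# any data lands in the warehouse table.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, DateType)

spark = SparkSession.builder.appName("schema-on-write").getOrCreate()

# 1. Define the table structure up front.
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_date", DateType(), nullable=True),
])

# 2. Conform the raw input to that schema as it is loaded;
#    malformed rows are rejected here, at write time, not at query time.
raw = spark.read.csv("/landing/orders.csv", header=True,
                     schema=orders_schema, mode="DROPMALFORMED")

# 3. Only conformed data lands in the warehouse table.
raw.write.mode("append").saveAsTable("warehouse.orders")
```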

In the schema on read of the data lake, the order is flipped. You dump all your data into the data lake and then create tables and schemas to read in the data for analysis. This means slower reads because the data is being processed as it is read. On the other hand, this also provides flexibility and scalability because you can change the schema on the fly, whereas with the data warehouse the schema is set and has been applied to the data.
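
And here is the same hypothetical data handled schema-on-read: the raw files sit in the lake untouched, and a schema is applied only at query time, so a different analysis can apply a completely different schema tomorrow.

```python
# Minimal schema-on-read sketch (PySpark; same hypothetical data).
# Raw files were dumped into the lake as-is; a schema is applied only
# when the data is read for a particular analysis.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Nothing was conformed on ingest; the schema is inferred at read time.
raw = spark.read.json("s3://example-lake/raw/orders/")

# Shape the data to fit TODAY'S question; tomorrow's analysis can
# apply a different schema to the very same raw files.
orders = (raw
          .select(
              F.col("order_id").cast("string"),
              F.col("amount").cast("double"),
              F.to_date("order_date").alias("order_date"))
          .where(F.col("amount").isNotNull()))

orders.groupBy("order_date").sum("amount").show()
```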

Because of this, a data lake is considered a vast pool of raw data where the purpose of the data is not defined. A data warehouse is a repository for structured and defined data that has already been processed for a particular purpose, usually some kind of business intelligence.

That is the upside to the data lake: flexibility. “Large enterprises tend to run into a problem with a data warehouse, where they’re unable to integrate lots of their datasets into the data warehouse, and they’re slowed down by those technical limitations of needing to apply the schema on write. If a company is frozen in that problem, then a data lake could be the solution for them,” said Michael Knopf, senior software engineer at TruSTAR, a cybersecurity platform vendor.

The downside to data lakes is not one of technology. One thing everyone we spoke to agreed on is that data lakes are neither inherently good nor bad; it all comes down to how you use them, and many enterprises don’t use them properly. A big factor in that is using the data lake as a dump for everything with the intent of examining it later.

“Organizations tend to dump all their data into the data lake with the understanding or assumption they will use that data at a later time. The challenge is that without a proper inventory of that data, it becomes more and more difficult to use as time goes on and more data is added,” said Tcherchian.

“They fall short primarily because all too often they are created before their real purpose is understood,” said Joshua Greenbaum, principal analyst with Enterprise Applications Consulting. “It’s putting the cart before the horse and saying, ‘let’s just assemble all this data in one spot and figure out what to do with it later.’ That’s out of proportion to the potential value in most cases.”

“If you don’t put a context around the information in the data lake, then it is like dumping in garbage,” said Satish Abburi, founder and CTO at Elysium Analytics. “What we do is, as the data gets into the lake, we enrich the information and provide the data with context. So it’s easy to extract the insights with the analytics or with the queries.”

Companies need to focus much more on quality, argues Greenbaum. “The assembly of petabytes of data and no attention to the quality of that data is a waste. I’ve been in situations with clients where they find the data they were hoping would be the foundation of an AI or machine learning project was poor quality and they effectively have to start over,” he said.

“I think it’s definitely a more flexible way to do analytics, but what you’re doing is basically relaxing the rules for getting new data in that might have a different schema,” said Tomer Shiran, co-founder and CPO of Dremio. “It’s more of a human problem than a technical problem: when you relax the rules so much, people are going to use that new flexibility. People don’t have to think as carefully about putting new data in there. It can turn into a mess where it’s basically just disorganized.”

Knopf notes that the schema on write of a data warehouse means you have to decide up front what all your schemas are going to be, and if you try to add more later, it can get very messy.

“When you start off with these sorts of data warehousing projects, you might have a vision of how things are going to be. And it doesn’t take too long before it turns out that that vision wasn’t exactly on point. But you’ve already locked yourself into certain schemas and pat assumptions that make it difficult to evolve to meet the evolving business needs,” he said.

That’s a caveat to data warehousing and a plus for data lakes. Also, data lakes let you draw from many sources: different SQL and NoSQL databases, CSV files, JSON, XML, and so on. The classical way to handle this in a data warehouse is to “build a bunch of glue code,” as Knopf put it, to take those different formats and make them conform to the one data warehouse schema you’ve made. With the data lake, you read the data from each source and leave it there, rather than trying to stitch disparate sources together into one repository.
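
A hedged sketch of what that looks like in practice, again in PySpark, with hypothetical paths, table names, and connection details: each source is read in its native format and joined at query time, with no up-front glue code.

```python
# Hedged sketch: querying heterogeneous sources side by side with
# PySpark, leaving each in its native format. All paths, table names,
# and connection details here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("many-sources").getOrCreate()

csv_events = spark.read.csv("s3://example-lake/exports/events.csv",
                            header=True, inferSchema=True)
json_logs = spark.read.json("s3://example-lake/logs/")
sql_orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db.example.com/shop")
              .option("dbtable", "orders")
              .option("user", "reader")
              .option("password", "example")  # hypothetical credentials
              .load())

# Register temporary views and join at query time -- no glue code
# converting everything to one warehouse schema in advance.
csv_events.createOrReplaceTempView("events")
json_logs.createOrReplaceTempView("logs")
sql_orders.createOrReplaceTempView("orders")

spark.sql("""
    SELECT o.order_id, e.event_type, l.level
    FROM orders o
    JOIN events e ON e.order_id = o.order_id
    LEFT JOIN logs l ON l.order_id = o.order_id
""").show()
```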

Things have changed

The data lake concept, and data lakes themselves, have been around for a decade, and if you were one of the early adopters you likely got burned. Those we spoke to said the tools have advanced considerably, and the move from on-premises to the cloud has also aided the advancement of data lakes.

Knopf said those who had bad early experiences should give data lakes another try. “I know, at my previous company, if you said the word Spark or Hadoop in the CEO’s presence, he would almost want to kick you out of the meeting, because they had a bad taste in their mouth. And I think the reason was, at the time, the technology wasn’t mature enough yet and they didn’t have the expertise at the company to really make it successful. But I think that by now, if that company were to try it, they wouldn’t run into those same problems. It works a little better out of the box,” he said.

“The analytical tools for data extraction are hugely superior to what we had in the on-prem world,” said Greenbaum. “There are a hundred reasons why, once you’ve got a good data set, you are better able to use it than you were on-prem.”

“More than 75% of our customers are running on Amazon or Azure,” said Shiran. “So it’s largely a cloud world these days. With these cloud data lakes you can actually do inserts, updates, deletes, and transactions directly on the data lake in these open formats, as a database user would. So you have new technologies now that enable you to do things you couldn’t do with a data lake in the past.”
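
Shiran doesn’t name a specific format, but Delta Lake and Apache Iceberg are two open table formats that support exactly this kind of database-style DML on lake files. A minimal sketch using Delta Lake’s Python API, with a hypothetical path and table contents:

```python
# Hedged sketch of database-style DML directly on lake files, using the
# open Delta Lake format as one example (the article doesn't name a
# format; Apache Iceberg offers similar capabilities). The S3 path and
# table contents are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("lake-dml")
         # Delta requires these extensions on a stock Spark session.
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

orders = DeltaTable.forPath(spark, "s3://example-lake/delta/orders")

# UPDATE: fix a mispriced order in place, transactionally.
orders.update(condition="order_id = '42'", set={"amount": "19.99"})

# DELETE: purge cancelled orders without rewriting the whole dataset.
orders.delete("status = 'cancelled'")
```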

Early data lakes were little more than flat files stored on the Hadoop Distributed File System (HDFS) and were thus clunky and tough to use. Over time, cloud service providers like AWS, Google, and Microsoft have built out more comprehensive solutions that make working with a data lake easier.

With AWS, it starts with CloudFormation to configure the core AWS services, such as AWS Lambda, Amazon Elasticsearch, Amazon Cognito, AWS Glue, and Amazon Athena. It utilizes the features of Amazon S3 to manage a persistent catalog of organizational datasets, and Amazon DynamoDB to manage corresponding metadata.
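
Once that catalog is in place, an analyst can run SQL directly against the files in S3 through Athena. A hedged sketch with boto3, using hypothetical database, table, and bucket names:

```python
# Hedged sketch: querying data-lake files in S3 with Amazon Athena via
# boto3. The database, table, region, and bucket names are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "lake_catalog"},  # e.g. a Glue database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("query id:", resp["QueryExecutionId"])
```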

On Azure Data Lake, services include HDInsight, an enterprise cloud service for Apache Spark and Hadoop with a variety of Apache tools such as Hive, MapReduce, HBase, Storm, Kafka, and R Server; Data Lake Store for massive data storage; integration with the Visual Studio, Eclipse, and IntelliJ developer tools; and integration with other Microsoft services.
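
A Spark job on an HDInsight cluster can then read raw files straight out of Azure Data Lake Storage. A minimal sketch, assuming an ADLS Gen2 account with hypothetical names:

```python
# Hedged sketch: a Spark job as it might run on an HDInsight cluster,
# reading raw files from Azure Data Lake Storage. The account,
# container, and path are hypothetical; the abfss:// URI assumes
# ADLS Gen2.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdinsight-lake-read").getOrCreate()

logs = spark.read.json(
    "abfss://raw@examplelake.dfs.core.windows.net/logs/2020/")
logs.groupBy("level").count().show()
```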

Google in particular has emphasized migration of on-premises data lakes to its cloud services, offering hosted Hadoop and Spark data lakes and tools to build, train, and deploy analytics faster on a Google data lake with Spark, BigQuery, AI Platform Notebooks, GPUs, and other analytics accelerators.
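
On Google Cloud, the same query-in-place pattern might look like the following sketch with the BigQuery Python client; the project, dataset, and table names are hypothetical, and the table could be an external table defined over files in Cloud Storage:

```python
# Hedged sketch: analyzing lake data on Google Cloud with the BigQuery
# Python client. The project, dataset, and table names are
# hypothetical; the table could be a BigQuery external table over
# files in Cloud Storage.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

sql = """
    SELECT event_type, COUNT(*) AS n
    FROM `example-project.lake.events`
    GROUP BY event_type
    ORDER BY n DESC
"""
for row in client.query(sql).result():
    print(row.event_type, row.n)
```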

“Over time, the cloud providers came in and made this more accessible to the general community,” said Knopf. “They also did it in a way that is probably more usable for an actual business. So it made it less esoteric. For instance, HDFS isn’t something that often gets used anymore; it is being replaced by something like Amazon S3.”


Data lakes done right

Shiran believes people are starting to think about data in a more sophisticated way than they ever have, similar to how developers think about code these days. “You have CI, CD, and you have version control.” He went on to say data lakes will have the full capabilities to do everything a data warehouse does: “So you’re no longer going to need a data warehouse. There is not going to be any justification for spending all that money and getting locked into a data warehouse, with all this open source and startup innovation.”

Abburi also feels the data lake works best when it is at the center of the enterprise. “[The enterprise] shouldn’t have any more duplicate data silos in the organization. When it comes to analytics and insights, everybody should be pointing to the centralized data lake. If organizations are able to achieve that, that will be the first success,” he said.


