Data analytics in the cloud: understand the hidden costs
Luke Roquet recently spoke to a customer who recounted the shock of getting a $700,000 bill for a single data science workload running in the cloud. When Roquet, who is senior vice president of product marketing at Cloudera, related the story to another customer, he learned that the second company had received a $400,000 tab for a similar job just the week before.
Such stories should dispel the common myth that cloud computing always saves money. In fact, “most executives I’ve talked to say that moving an equivalent workload from on-premises to the cloud results in about a 30% cost increase,” said Roquet.
This doesn’t mean the cloud is a poor option for data analytics projects. In many scenarios, the scalability and variety of tooling options make the cloud an ideal target environment. But the choice of where to locate data-related workloads should take multiple factors into account, of which only one is cost.
Data analytics workloads can be especially unpredictable because of the large data volumes involved and the extensive time required to train machine learning (ML) models. These models often “have unique characteristics that can cause their costs to explode,” Roquet said.
What’s more, local applications often need to be refactored or rebuilt for a specific cloud platform, said David Dichmann, senior director of product management at Cloudera. “There’s no guarantee that the workload is going to be improved and you can end up being locked into one cloud or another,” he said.
Cloud march is on
That doesn’t seem to be slowing the ongoing cloudward migration of workloads. Foundry’s 2022 Data & Analytics study found that 62% of IT leaders expect the share of analytics workloads they run in the cloud to increase.
Although cloud platforms offer many advantages, cost- and performance-sensitive workloads “are often better run on-prem,” Roquet said.
Choosing the right environment is about achieving balance. The cloud excels for applications that are ephemeral, need to be shared with others, or use cloud-native constructs like software containers and infrastructure-as-code, he said. Conversely, applications that are performance- or latency-sensitive are more appropriate for local infrastructure, where data can be co-located and long processing times don’t incur additional costs.
The goal should be to optimize workloads to interact with each other regardless of location and to move as needed between local and cloud environments.
The case for portability
Dichmann said three core components are needed to achieve this interoperability and portability:
- Use common data formats, ideally conforming to open standards such as Apache Iceberg tables on Parquet files. This makes the data easily accessible to multiple engines for a range of business uses (a minimal sketch follows this list).
- Ensure data services are portable, so that business applications developed in one environment can be redeployed in another without a rewrite.
- Employ a common set of data management, observability, and governance practices.
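For illustration, here is a minimal PySpark sketch of the first point: writing an Apache Iceberg table over Parquet files so that any Iceberg-aware engine can later read the same data in place. The catalog name, warehouse path, and package version are assumptions made for the example, not details from the article.

```python
# A minimal sketch, not Cloudera-specific: the catalog name ("demo"), the
# warehouse path, and the Iceberg package version are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("open-format-sketch")
    # Pull in the Iceberg runtime; pin whichever version matches your Spark.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    # Register an Iceberg catalog backed by a plain filesystem/object-store path.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# The table's metadata follows the open Iceberg spec; the data files
# underneath are ordinary Parquet.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, action STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'login'), (2, 'purchase')")

# Any Iceberg-aware engine (Spark, Trino, Flink, Impala, Hive, ...) can read
# these same files in place; no export or format conversion is required.
spark.sql("SELECT * FROM demo.db.events").show()
```

Because both the table metadata and the data files follow open specifications, the table is not tied to the engine that wrote it, which is what makes the data "easily accessible by several technologies" in the first place.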
“Once you have one view of all your data and one way to govern and secure it, then you can move workloads around without worrying about breaking any governance and security requirements,” he said. “People know where the data is, how to find it, and we’re all assured it will be used correctly per business policy or regulation.”
Portability may be at odds with customers’ desire to deploy best-of-breed cloud services, but Dichmann said “fit-for-purpose” is a better goal than best-of-breed. Putting flexibility ahead of bells and whistles gives the organization maximum latitude in deciding where to deploy workloads.
A healthy ecosystem is also just as important as robust point solutions, because a common platform lets customers take advantage of other services without extensive integration work.
The best option for achieving workload portability is to use an abstraction layer that runs across all major cloud and on-premises platforms. The Cloudera Data Platform, for example, “is a true hybrid solution that provides the same services both in the cloud and on-prem,” Dichmann said. “It uses open standards that give you the ability to have data share a common format everywhere it needs to be, and accessed by a broader ecosystem of data services that makes things even more flexible, more accessible and more portable.”
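The article doesn’t detail how CDP implements this, but the underlying pattern can be sketched generically: keep the job’s code identical and move only configuration between environments. In the hypothetical sketch below, the WAREHOUSE_URI variable and the example paths are illustrative assumptions, not CDP’s actual interface.

```python
# A sketch of the portability pattern only; this is not CDP's API. The
# WAREHOUSE_URI variable and the paths below are hypothetical examples.
import os
from pyspark.sql import SparkSession

# e.g. "hdfs:///warehouse" on-prem, "s3a://analytics-bucket/warehouse" in the
# cloud; the job's code stays identical, and only this setting moves with it.
warehouse = os.environ.get("WAREHOUSE_URI", "/tmp/iceberg-warehouse")

spark = (
    SparkSession.builder
    .appName("portable-analytics-job")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", warehouse)
    .getOrCreate()
)

# The same query runs unchanged wherever the data happens to live.
spark.sql("SELECT action, count(*) FROM demo.db.events GROUP BY action").show()
```

The design point is that environment-specific details live outside the job, so moving a workload from a data center to a cloud region becomes a configuration change rather than a rewrite.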
Visit Cloudera to learn more.