AI/ML for Enterprise: Part 1 (Performance, Scale and Multi-Tasking) – VMware Cloud Community


In a recent post I covered the exciting new topics that came up at the recent NVIDIA GTC, including NVIDIA and VMware’s partnership and, specifically, some bits about the AI Ready Enterprise platform. The partnership has grown rapidly, which is a great reflection of how quickly machine learning is becoming not just mainstream, but mainstream for Enterprise.

Digital companies like Amazon, Facebook, Netflix etc. have been using Machine Learning for a long time now. As modern digital companies, with large budgets and most importantly, vast amounts of data and compute power, they have been able to adopt ML in a big way, for things like Chat Bots, Facial Recognition, Supply Chain Forecasting, etc.

For Enterprises, we are still in pretty new territory.

The kinds of insights that Artificial Intelligence can provide are absolutely useful to Enterprises; this is a given. Supermarkets will use supply chain forecasting to predict stock levels more accurately, eliminating waste and saving money. Banks will want to use ML to improve the accuracy of credit scores. The list goes on and on.

The challenges for adopting AI/ML as an enterprise are very real though.

Whereas digital companies and startups can adopt public cloud services for their ML workloads very quickly and easily, for some enterprises, it’s not as easy. They might have data sovereignty, security or other compliance requirements. Cost might be the prohibitive factor, especially with ingress/egress costs if you have a requirement to move data in and out regularly. There are many reasons to bring it in-house, and building a platform on premises could be the best option for these cases.

Requirements

In my post “AI/ML Brain Food – Part 2” I talked about some of the things you can do to get the best performance for running ML workloads, such as using GPUs. I also talked about things to take into consideration from a quality and cost perspective. But what are the main building blocks we need to run AI/ML workloads?

Machine Learning typically needs three things:

  1. Data
    1. All of the information you will use to train the AI (eg. a set of photos)
    2. The data will (and should) come from various sources, but needs to be stored somewhere for access, eg a Big Data DB
  2. A Model
    1. All the algorithms, the structure of the neural networks (the intelligent bit).
    2. The model is then trained to understand patterns by looking at the data.
    3. Inference can then be applied, where the trained model is applied to new data, providing the result (eg matching a new face)
  3. Consumption
    1. This could be a visualisation, for example in supply chain forecasting it might be a graph showing the predicted demand of a particular product.
    2. It could also be an app, for example a facial recognition login using the learning to identify the correct face.

So the minimum we need are:

  • Data Storage (Could be the hard disk on your laptop, right up to a big data lake)
  • Compute (Ideally GPUs to process the vast amount of data through the neural networks)
  • A Consumption Method (A visualisation tool or custom application to consume the intelligence gained)
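
To make the three pieces (data, model, consumption) a bit more concrete, here is a minimal sketch in plain Python, loosely based on the supply chain forecasting example. Everything in it is an assumption for illustration: the sales figures are made up, the "model" is a simple straight-line fit rather than a neural network, and the function names `train`, `predict` and `render` are hypothetical.

```python
# 1. Data: hypothetical weekly unit sales for one product (made-up numbers).
weekly_sales = [100, 110, 120, 130, 140, 150]


def train(ys):
    """2. Model (training): fit y = slope * week + intercept by least squares."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, week):
    """2. Model (inference): apply the trained model to a week it has not seen."""
    slope, intercept = model
    return slope * week + intercept


def render(history, forecast):
    """3. Consumption: a crude text chart standing in for a real dashboard."""
    lines = [
        f"week {i}: {'#' * (units // 10)} {units}"
        for i, units in enumerate(history)
    ]
    next_week = len(history)
    bar = "#" * (round(forecast) // 10)
    lines.append(f"week {next_week}: {bar} {forecast:.0f} (predicted)")
    return "\n".join(lines)


model = train(weekly_sales)
forecast = predict(model, len(weekly_sales))
print(render(weekly_sales, forecast))
```

A real workload swaps each piece for something much heavier (a data lake, a neural network trained on GPUs, a proper visualisation tool), but the shape stays the same.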

Doing this on your laptop with a small dataset is absolutely possible, and it is actually a great way to learn and test ideas. But whilst spinning up an ML workload on your laptop might be fairly straightforward, getting it into production and building apps based on the data inevitably requires a lot more work!

VMware and NVIDIA have worked in partnership for a long time now, predominantly focused on GPU virtualisation for VDI workloads. This dramatically increased the performance of graphics-rich applications running over VDI. Improvements keep coming here as new applications place new demands on the platform.

What was announced at NVIDIA GTC takes us further into the world of AI/ML though. Specifically, RDMA and MIG are two features in the latest versions of vSphere which enable this.

But in simple terms, what are they?

RDMA (Remote Direct Memory Access)

This gives us increased performance at scale. Simply put, RDMA is a feature of new SmartNICs that lets an application read or write the memory of another server directly over the network, without the data having to be processed by that server's CPU first. It might sound like a small feature, but the implications for running ML workloads at scale are massive.

Graphic courtesy of NVIDIA developer community from the blog post: NVIDIA AI Enterprise – Optimized, Certified and Supported on VMware vSphere

MIG (Multi-Instance GPU)

Multi-Instance GPU gives us the ability to multi-task with ML workloads. Multitasking is a big part of the VMware value proposition: running different VMs with different operating systems and applications on the same server is just what we do. With MIG, we can now do this with machine learning workloads. We can run facial recognition in one VM or container, visualise or develop algorithms in another, and run a supply chain prediction workload in a third, all on the same server, sharing resources.

 

Graphic courtesy of NVIDIA on the blog post: Ride the Fast Lane to AI Productivity with Multi-Instance GPUs
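
As a rough mental model of what MIG does, here is a toy Python sketch of carving one GPU into isolated slices and handing them to different workloads. It is not an NVIDIA or vSphere API: the `ToyGPU` class, its methods and the workload names are all hypothetical, and the seven-slice total is an assumption based on the maximum number of instances NVIDIA documents for an A100. Real MIG uses fixed instance profiles and placement rules that this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGPU:
    """Toy stand-in for a partitionable GPU; not a real NVIDIA or vSphere API."""

    total_slices: int = 7  # assumption: seven-way split, as on an NVIDIA A100
    instances: dict = field(default_factory=dict)

    def free_slices(self):
        """Slices not yet reserved by any workload."""
        return self.total_slices - sum(self.instances.values())

    def carve(self, workload, slices):
        """Reserve slices for one workload; fail rather than oversubscribe."""
        if slices < 1:
            raise ValueError("a workload needs at least one slice")
        if slices > self.free_slices():
            raise ValueError(f"only {self.free_slices()} slice(s) left for {workload}")
        self.instances[workload] = slices


gpu = ToyGPU()
gpu.carve("facial-recognition-inference", 2)
gpu.carve("algorithm-development", 1)
gpu.carve("supply-chain-forecast", 4)
print(gpu.instances, "free slices:", gpu.free_slices())
```

The point of the sketch is the isolation: each workload gets its own reserved share, and a request that does not fit is refused rather than slowing everyone else down.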

 

That’s it for today, thanks for reading. In part two, we will explore the concepts of AI/ML Training and Inference, and why they need all this performance I keep talking about!

Check back soon.


