
Optimizing AI/ML Workloads with NVIDIA GPUs and VMware Cloud Foundation

  • November 19, 2024
  • 14 min read
Virtualization Architect. Alex is a certified VMware vExpert and the Founder of VMC, a company focused on virtualization, and the CEO of Nova Games, a mobile game publisher.

Modern artificial intelligence (AI) and machine learning (ML) workloads demand high-performance infrastructure, yet the hardware behind such workloads is expensive, so keeping infrastructure costs under control is essential.

Using NVIDIA GPUs together with NVIDIA AI Enterprise software on the VMware Cloud Foundation (VCF) platform lets companies achieve excellent performance while taking advantage of virtualization: AI/ML workloads become easier to manage, and hardware costs go down.

Benefits of GPU Virtualization: Reduced Costs and Increased Performance

One of the key features of VMware Cloud Foundation is the ability to use virtualized graphics processing units (vGPUs). This technology lets you divide a physical GPU into multiple virtual segments with strong isolation, so several tasks can run in parallel on a single physical device without interfering with each other. NVIDIA Multi-Instance GPU (MIG) technology, for example, splits a GPU into multiple independent instances, each of which can serve a different workload or user.
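
To make the idea of fractional GPUs more tangible, here is a minimal sketch (not taken from the article) that uses the pynvml bindings for the NVIDIA Management Library to list the GPUs visible inside a VM and report whether MIG mode is enabled on each; the exact output format is an illustrative assumption.

```python
# Minimal sketch: list GPUs visible to this VM and report their MIG mode.
# Assumes the nvidia-ml-py package (pynvml) and an NVIDIA driver are installed.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        try:
            current, _pending = pynvml.nvmlDeviceGetMigMode(handle)
            mig = "enabled" if current == pynvml.NVML_DEVICE_MIG_ENABLE else "disabled"
        except pynvml.NVMLError:
            mig = "not supported"  # e.g. a GPU or vGPU profile without MIG capability
        print(f"GPU {i}: {name}, {mem.total // 2**20} MiB, MIG {mig}")
finally:
    pynvml.nvmlShutdown()
```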

This virtualization model significantly improves resource efficiency and reduces total cost of ownership (TCO) by consolidating more virtual machines and other workloads on a single host. At the same time, the virtual infrastructure delivers performance close to "bare metal" levels, which is an important indicator for business. We will look at the numbers below.

NVIDIA GPUs Review: H100, A100, and L4

The solution in question is based on powerful NVIDIA GPUs such as the H100, A100, and L4. These GPUs are specifically designed to handle large amounts of data and perform complex calculations related to machine learning and artificial intelligence.

NVIDIA H100 (the architecture is named after the American computer scientist Grace Hopper)

This is NVIDIA's most advanced production GPU (arguably the most advanced in the world), with 80 billion transistors and a dedicated hardware Transformer Engine that accelerates transformer models such as GPT (Generative Pre-trained Transformer). The chip significantly speeds up both training and inference, and the H100 also supports confidential computing, which makes it a strong choice for scenarios with heightened security requirements, such as federated learning.

Let’s look at the capabilities of this most advanced GPU chip:

[Figure: overview of NVIDIA H100 capabilities]

The specification for this platform looks like this:

[Figure: NVIDIA H100 platform specifications]

This chip is unrivaled for heavy ML workloads, and today we will look at its use in a virtual environment of VMware Cloud Foundation based on the ESXi hypervisor in comparison with the bare metal scenario (i.e. a server without a virtualization system).

NVIDIA A100 (the architecture is named after the French physicist André-Marie Ampère)

This device is aimed at deep learning and is used for big data and complex neural networks. Thanks to NVLink support and GPU-sharing technology, the A100 delivers nearly uninterrupted operation with minimal latency. Like the H100, this chip relies heavily on the NVLink interconnect, which is especially useful for large-scale AI tasks.

The specification for the NVIDIA A100 chip looks like this:

[Figure: NVIDIA A100 specifications]

If we compare this chip with the H100, we can see that the A100 is somewhat simpler (and, of course, cheaper) than its higher-end sibling:

[Figure: H100 vs. A100 specification comparison]

NVIDIA L4 (the architecture is named after the English mathematician Ada Lovelace)

This chip combines capabilities for both graphics (which also matters on desktop platforms) and machine learning. Within VMware Cloud Foundation, however, the L4 is used primarily for ML tasks. The GPU delivers high performance for image and text processing, making it a good fit for applications that work with multimedia data and AI inference.

Below are the main characteristics of the L4 and A100 devices, where you can see that the L4 is the entry-level model in the line:

[Figure: NVIDIA L4 vs. A100 specification comparison]

Performance Testing: Close to Bare Metal

One of the key aspects is comparing the performance of virtualized configurations against physical servers. Performance tests, including tasks such as RetinaNet (object detection) and BERT (natural language processing), have shown that virtualized VCF environments achieve near-bare-metal performance. In some cases, virtualized solutions even outperform physical servers while using fewer dedicated resources, demonstrating the low overhead of virtualization.

In many tests, the performance drop is only 2-8%, and in some cases virtualized systems even outperform bare metal by up to 4%.

For example, in inference tests based on the MLPerf Inference 4.0 suite (using RetinaNet for object recognition, GPT for text generation, and other benchmarks), virtualized systems showed 95-104% of bare metal performance, confirming the possibility of using virtualization for the most demanding AI tasks.
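
The percentages above are simply the virtualized throughput normalized to the bare-metal throughput for the same benchmark. Here is a minimal sketch of that calculation, using made-up placeholder numbers rather than the actual MLPerf results:

```python
# Sketch of how "percent of bare metal" is derived; the throughput figures
# below are placeholders for illustration, not the actual MLPerf results.
results = {
    # benchmark: (virtualized queries/sec, bare-metal queries/sec)
    "retinanet": (1180.0, 1200.0),
    "bert-99": (5900.0, 6000.0),
    "gptj-99": (104.0, 100.0),
}

for benchmark, (virtual_qps, bare_metal_qps) in results.items():
    relative = virtual_qps / bare_metal_qps * 100
    print(f"{benchmark}: {relative:.1f}% of bare-metal performance")
```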

H100 Chip Performance Testing

Let’s look at the details. For testing with the MLPerf Inference 4.0 benchmark, VMware used the following test configuration for a virtual environment and bare metal:

[Figure: H100 test configuration for the virtual and bare-metal environments]

As we can see, the inference virtual machine was allocated only a fraction of the host's physical resources: 14% of the CPU and just 12.8% of the memory.

Two scenarios were used for testing (a minimal conceptual sketch of the difference follows the list):

  • Server scenario – queries (photos, images, etc.) are sent to the host one by one, as they arrive, following a specified arrival distribution.
  • Offline scenario – all the input data is already on the server and available up front.
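
To illustrate the difference, here is a conceptual sketch (not the actual MLPerf LoadGen harness) that contrasts the two query patterns; the run_inference function, sample list, and arrival rate are illustrative assumptions.

```python
# Conceptual sketch of the two MLPerf query patterns (not the real LoadGen code).
# run_inference(), the samples, and the arrival rate are illustrative assumptions.
import random
import time

def run_inference(sample):
    return f"result-for-{sample}"          # stand-in for the actual model call

samples = [f"image-{i}.jpg" for i in range(10)]

# Server scenario: queries arrive one by one, following a random arrival process,
# and each must be answered as it comes in.
def server_scenario(samples, mean_interval_s=0.05):
    for sample in samples:
        time.sleep(random.expovariate(1.0 / mean_interval_s))  # wait for next query
        run_inference(sample)

# Offline scenario: the whole dataset is available up front and processed in bulk.
def offline_scenario(samples):
    return [run_inference(sample) for sample in samples]

server_scenario(samples)
offline_scenario(samples)
```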

Below are the test results obtained for five different models in the server scenario (these are the same 95-104% of bare-metal performance mentioned above):

[Figure: server-scenario results for the H100, normalized to bare metal]

Let’s describe these benchmarks:

  • Retinanet – object detection model
  • Bert-99 – natural language processing (NLP) model
  • Gptj-99 / Gptj-99.9 – GPT-J model with 6 billion parameters (at 99% and 99.9% accuracy targets)
  • Stable-diffusion-xl – text-to-image model with 2.6 billion parameters

For the offline scenario, the results were even better – and all this by using only part of the server resources!

[Figure: offline-scenario results for the H100, normalized to bare metal]

Another benchmark, 3d-unet, was also used here; it simulates medical imaging (3D image segmentation) workloads and likewise showed excellent results.

L40S Chip Performance Testing

The following hardware test configuration was used here:

[Figure: L40S test configuration for the virtual and bare-metal environments]

For these tests, a third of the server's CPU capacity was allocated, while only 8.5% of its memory was needed.

For the benchmarks described above, the results were somewhat more modest; nevertheless, the maximum performance loss was only 8%, and again this was achieved using only part of the hardware resources:

[Figure: L40S results, normalized to bare metal]

Here rnnt (RNN-T) is a speech-to-text model that was also tested in these scenarios.

A100 Chip Performance Testing

In this case, the following hardware configuration was used for the physical and virtual environments (here two thirds of the CPUs and almost all of the memory were needed to get good results):

[Figure: A100 test configuration for the virtual and bare-metal environments]

The results for model training (not inference), normalized to the reference bare-metal performance, were as follows (note that higher bars are worse here):

[Figure: A100 training results, normalized to bare metal (higher is worse)]

Performance degradation in the VCF virtual environment for two benchmarks was in the range of 6-8%.

Virtualization Benefits for AI/ML Workloads

Let’s summarize the main benefits of virtualization when running ML workloads with NVIDIA accelerators:

  • Cost savings: using a portion of the physical server resources allows you to run more VMs and workloads, reducing overall hardware costs.
  • Isolation and security: fractional virtualized GPUs with isolation ensure data security, which is especially important in multi-tenant cloud environments.
  • Flexibility: VCF allows you to scale resources depending on the needs of the workload. VCF also allows you to dynamically allocate resources between VMs using DRS technology, providing flexibility in managing CPUs, memory, and GPUs.
  • Near-bare-metal performance: even with the overhead of virtualization, VCF demonstrates performance close to physical infrastructure.

Conclusion

Integrating NVIDIA GPUs with VMware Cloud Foundation offers a powerful way to optimize AI/ML workloads by allocating only the portion of compute resources each task actually needs, something that is practical only in a virtualized environment. This lets companies achieve near-maximum performance while reducing infrastructure costs across the server fleet. With GPUs such as the H100, A100, and L4, you can confidently run demanding machine learning workloads while enjoying all the benefits of virtualization.

It can be said that VMware Cloud Foundation truly is the “sweet spot” for AI/ML workloads, offering a balance between performance, cost efficiency, and flexibility.

Hey! Found Alex’s article helpful? Looking to deploy a new, easy-to-manage, and cost-effective hyperconverged infrastructure?
Alex Bykovskyi, StarWind Virtual HCI Appliance Product Manager
Well, we can help you with this one! Building a new hyperconverged environment is a breeze with StarWind Virtual HCI Appliance (VHCA). It’s a complete hyperconverged infrastructure solution that combines hypervisor (vSphere, Hyper-V, Proxmox, or our custom version of KVM), software-defined storage (StarWind VSAN), and streamlined management tools. Interested in diving deeper into VHCA’s capabilities and features? Book your StarWind Virtual HCI Appliance demo today!