Optimizing AI Workloads with NVIDIA GPUs, Time Slicing, and Karpenter
Maximizing GPU efficiency in your Kubernetes environment
In this article, we will explore how to deploy GPU-based workloads in an EKS cluster using the Nvidia Device Plugin and how to ensure efficient GPU utilization through features like Time Slicing. We will also discuss setting up node-level autoscaling to optimize GPU resources with solutions like Karpenter. By implementing these strategies, you can maximize GPU efficiency and scalability in your Kubernetes environment.
Additionally, we will delve into practical configurations for integrating Karpenter with an EKS cluster, and discuss best practices for balancing GPU workloads. This approach will help in dynamically adjusting resources based on demand, leading to cost-effective and high-performance GPU management. The diagram below illustrates an EKS cluster with CPU and GPU-based node groups, along with the implementation of Time Slicing and Karpenter functionalities. Let’s discuss each item in detail.
Basics of GPU and LLM
GPU: A Graphics Processing Unit (GPU) was originally designed to accelerate image processing tasks. However, due to its parallel processing capabilities, it can handle numerous tasks concurrently. This versatility has expanded its use beyond graphics, making it highly effective for applications in Machine Learning and Artificial Intelligence.
When a process is launched on a GPU-based instance, the following steps take place at the OS and hardware level:
- The shell interprets the command and creates a new process using the fork (create a new process) and exec (replace the process’s memory space with a new program) system calls.
- The process interacts with the GPU driver to initialize a GPU context. The driver manages resources including memory, compute units, and scheduling.
- Memory for the input data and the results is allocated using cudaMalloc (the memory is allocated in the GPU’s VRAM).
- Data is transferred from CPU memory to GPU memory.
- The process instructs the GPU to start computations using CUDA kernels, and the GPU scheduler manages the execution of the tasks.
- The CPU waits for the GPU to finish its task, and the results are transferred back to the CPU for further processing or output.
- GPU memory is freed, the GPU context is destroyed, and all resources are released. The process then exits, and the OS reclaims its resources.
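To make the sequence above easier to follow, here is a small CPU-only Python sketch that walks through the same steps. It does not touch a real GPU: FakeDevice and all of its methods are hypothetical stand-ins invented for illustration, and the comments only point to the CUDA calls they loosely correspond to.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU; everything runs on the CPU."""

    def __init__(self):
        # Stands in for the driver creating a GPU context.
        self.vram = {}            # handle -> buffer, plays the role of VRAM
        self.next_handle = 0
        self.context_open = True

    def malloc(self, n):
        """Reserve n elements of device memory (compare: cudaMalloc)."""
        handle = self.next_handle
        self.next_handle += 1
        self.vram[handle] = [0.0] * n
        return handle

    def copy_to_device(self, handle, host_data):
        """Host-to-device transfer (compare: cudaMemcpy)."""
        if len(host_data) != len(self.vram[handle]):
            raise ValueError("buffer size mismatch")
        self.vram[handle] = list(host_data)

    def launch(self, kernel, out_handle, in_handle):
        """Apply kernel to every element. A real GPU would run these
        element-wise calls in parallel; this loop runs them one by one."""
        src = self.vram[in_handle]
        dst = self.vram[out_handle]
        for i in range(len(src)):
            dst[i] = kernel(src[i])

    def copy_to_host(self, handle):
        """Device-to-host transfer of the results."""
        return list(self.vram[handle])

    def free(self, handle):
        """Release one device buffer (compare: cudaFree)."""
        del self.vram[handle]

    def destroy(self):
        """Tear down the context and release everything that is left."""
        self.vram.clear()
        self.context_open = False


def run_job(host_input):
    dev = FakeDevice()                         # initialize the context
    d_in = dev.malloc(len(host_input))         # allocate input buffer
    d_out = dev.malloc(len(host_input))        # allocate output buffer
    dev.copy_to_device(d_in, host_input)       # CPU memory -> GPU memory
    dev.launch(lambda x: x * x, d_out, d_in)   # start the computation
    result = dev.copy_to_host(d_out)           # GPU memory -> CPU memory
    dev.free(d_in)                             # free device memory
    dev.free(d_out)
    dev.destroy()                              # destroy the context
    return result


print(run_job([1.0, 2.0, 3.0]))  # [1.0, 4.0, 9.0]
```

In real CUDA code the host would also synchronize with the device before copying the results back. The toy version skips that step because its launch call only returns once the loop has finished.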
Compared to a CPU, which is optimized to execute a small number of instruction streams in sequence with low latency, a GPU applies the same operation to thousands of data elements simultaneously. GPUs are also better suited to high-performance computing because they carry less of the general-purpose overhead a CPU has, such as the interrupt handling and virtual memory management needed to run an operating system. GPUs were never designed to run an OS, so their processing is more specialized and, for parallel workloads, faster.
Large Language Models
The term Large Language Model breaks down as follows:
- “Large”: refers to the model’s large number of parameters and the large volume of data on which it is trained
- “Language”: the model can understand and generate human language
- “Model”: refers to the underlying neural network
Run an LLM
Ollama is a tool for running open-source Large Language Models and can be downloaded from https://ollama.com/download
Pull the example model llama3:8b using the Ollama CLI. The available commands are listed by ollama -h:
ollama -h
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.
ollama pull llama3:8b: Pull the model
ollama pull llama3:8b
pulling manifest
pulling 6a0746a1ec1a... 100% ▕█████████████████████████████████████████████████████████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕█████████████████████████████████████████████████████████████████████▏  12 KB
pulling 8ab4849b038c... 100% ▕█████████████████████████████████████████████████████████████████████▏ 254 B
pulling 577073ffcc6c... 100% ▕█████████████████████████████████████████████████████████████████████▏ 110 B
pulling 3f8eb4da87fa... 100% ▕█████████████████████████████████████████████████████████████████████▏ 485 B
verifying sha256 digest
writing manifest
removing any unused layers
success
ollama show llama3:8b: Show information for the model
developer:src > ollama show llama3:8b
  Model
    arch                llama
    parameters          8.0B
    quantization        Q4_0
    context length      8192
    embedding length    4096

  Parameters
    num_keep    24
    stop        "<|start_header_id|>"
    stop        "<|end_header_id|>"
    stop        "<|eot_id|>"

  License
    META LLAMA 3 COMMUNITY LICENSE AGREEMENT
    Meta Llama 3 Version Release Date: April 18, 2024
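The parameter count and quantization reported above explain most of the 4.7 GB download. Here is a back-of-the-envelope sketch of that arithmetic. It assumes Q4_0 stores weights in blocks of 32 four-bit values plus one 16-bit scale per block, which works out to 4.5 bits per weight; the function name is ours, not part of Ollama.

```python
def estimate_weights_gb(num_params, bits_per_weight):
    """Approximate size of the model weights in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: Q4_0 packs 32 weights per block as 4-bit values and adds
# one 16-bit scale per block, so (32 * 4 + 16) / 32 = 4.5 bits per weight.
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32

# 8.0B parameters, as reported by ollama show.
print(round(estimate_weights_gb(8.0e9, Q4_0_BITS_PER_WEIGHT), 2))  # 4.5
# The same model with 16-bit weights, for comparison.
print(round(estimate_weights_gb(8.0e9, 16), 2))                    # 16.0
```

The estimate of about 4.5 GB is a little below the 4.7 GB that ollama pull reports. The difference is plausibly due to a rounded parameter count, a few tensors kept at higher precision, and file metadata, but we have not verified that breakdown.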
ollama run llama3:8b: Run the model
developer:src > ollama run llama3:8b
>>> print all primes between 1 and n
Here is a Python solution that prints all prime numbers between 1 and `n`:

```Python
def print_primes(n):
    for possiblePrime in range(2, n + 1):
        # Assume number is prime until shown it is not.
        isPrime = True
        for num in range(2, int(possiblePrime ** 0.5) + 1):
            if possiblePrime % num == 0:
                isPrime = False
                break
        if isPrime:
            print(possiblePrime)

n = int(input("Enter the number: "))
print_primes(n)
```

In this code, we loop through all numbers from `2` to `n`. For each number, we assume it's prime and then check if it has any divisors other than `1` and itself. If it does, then it's not a prime number. If it doesn't have any divisors, then it is a prime number.

The reason why we only need to check up to the square root of the number is because a larger factor of the number would be a multiple of smaller factor that has already been checked.

Please note that this code might take some time for large values of `n` because it's not very efficient. There are more efficient algorithms to find prime numbers, but they are also more complex.
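Besides the interactive prompt, a running Ollama server can also be queried over HTTP. The sketch below sends the same kind of prompt from Python using only the standard library. It assumes the server is listening on its default local address (localhost, port 11434) and uses the /api/generate endpoint with streaming turned off; the helper names are our own.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default address and port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build the POST request for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, url=OLLAMA_URL, timeout=300):
    """Send the prompt and return the generated text.

    Requires the Ollama server to be running and the model to be pulled.
    """
    req = build_generate_request(model, prompt, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server running, calling generate("llama3:8b", "print all primes between 1 and n") should return text similar to the interactive session above. The generous timeout is deliberate, since inference on a CPU can take a while.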
In the next post…
Hosting LLMs on a CPU is slow: some large language models are very big, which slows down inference. In the next post, we will look at how to host these LLMs on an EKS cluster using the Nvidia Device Plugin and Time Slicing.
Questions or comments? Please leave me a comment below.