
Even ๐—ก๐—ฉ๐—œ๐——๐—œ๐—” ๐—ก๐—œ๐—  ๐—ถ๐˜€ ๐—ฟ๐˜‚๐—ป๐—ป๐—ถ๐—ป๐—ด ๐—ผ๐—ป ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€, what is ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€?

 Kubernetes for Machine Learning

Even ๐—ก๐—ฉ๐—œ๐——๐—œ๐—” ๐—ก๐—œ๐—  ๐—ถ๐˜€ ๐—ฟ๐˜‚๐—ป๐—ป๐—ถ๐—ป๐—ด ๐—ผ๐—ป ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€, what is ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€ and why is it worth learning ๐—ฎ๐˜€ ๐— ๐—Ÿ๐—ข๐—ฝ๐˜€/๐— ๐—Ÿ/๐——๐—ฎ๐˜๐—ฎ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ?

👉 Today we look into the Kubernetes system from a bird's eye view.

๐—ฆ๐—ผ, ๐˜„๐—ต๐—ฎ๐˜ ๐—ถ๐˜€ ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€ (๐—ž๐Ÿด๐˜€)?

  1. It is a container orchestrator that schedules, runs, and recovers your containerised applications in a horizontally scalable and self-healing way.

Kubernetes architecture consists of two main logical groups:

  1. Control plane - hosts the K8s system processes responsible for scheduling the workloads you define and keeping the system healthy.
  2. Worker nodes - this is where containers are scheduled and run.

๐—›๐—ผ๐˜„ ๐—ฑ๐—ผ๐—ฒ๐˜€ ๐—ž๐˜‚๐—ฏ๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜๐—ฒ๐˜€ ๐—ต๐—ฒ๐—น๐—ฝ ๐˜†๐—ผ๐˜‚?

  1. You can have thousands of Nodes (usually you only need tens of them) in your K8s cluster, each of them can host multiple containers. Nodes can be added or removed from the cluster as needed. This enables unrivaled horizontal scalability.
  2. Kubernetes provides an easy-to-use and easy-to-understand declarative interface for deploying applications. Your application deployment can be described in YAML and submitted to the cluster, and the system will keep the actual state in line with the desired state you declared (see the example manifest after this list).
  3. Users are empowered to create and own their application architecture within boundaries pre-defined by Cluster Administrators.
  • ✅ In most cases you can deploy multiple types of ML Applications into a single cluster; you don't need to care about which server to deploy to - K8s will take care of it.
  • ✅ You can request different amounts of dedicated machine resources per application.
  • ✅ If your application goes down - K8s will make sure that the desired number of replicas is always alive.
  • ✅ You can roll out new versions of the running application using multiple strategies - K8s will safely do it for you.
  • ✅ You can expose your ML Services for other Product Apps to use with a few intuitive resource definitions.
  • ✅ …
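
To make the declarative workflow from point 2 concrete, here is a minimal sketch of a Deployment plus a Service for an ML inference container. The names, image reference and resource numbers are hypothetical placeholders, not values from this post:

```yaml
# Deployment: declares what to run, how many replicas, and which resources each replica gets.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 3                       # K8s keeps 3 pods alive and replaces failed ones
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
      - name: model-server
        image: registry.example.com/ml-inference:1.0.0   # hypothetical image
        ports:
        - containerPort: 8080
        resources:
          requests:                 # dedicated resources reserved per replica
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: "1"
            memory: 2Gi
---
# Service: exposes the pods to other apps in the cluster under a stable name.
apiVersion: v1
kind: Service
metadata:
  name: ml-inference
spec:
  selector:
    app: ml-inference
  ports:
  - port: 80
    targetPort: 8080
```

Submitting this with `kubectl apply -f <file>.yaml` is enough: the scheduler picks the nodes, keeps three replicas alive and makes them reachable by other applications in the cluster.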

โ—๏ธHaving said this, while it is a bliss to use, usually the operation of Kubernetes clusters is what is feared. It is a complex system.

โ—๏ธMaster Plane is an overhead, you need it even if you want to deploy a single small application.


 When to use Kubernetes for Machine Learning

Should you use Kubernetes to deploy your Machine Learning models?

Most likely not! When a technology is hot, there is a tendency to disregard why the tool is useful in the first place, and we see massive adoption for no good reason.

If you need to deploy machine learning models, there are typically 2 axes to look at: how many users and how many ML teams you have. The number of users will give you a sense of how much workload you are likely to have for your ML applications, and the number of ML teams is a good proxy for the complexity of the applications.

If you have low user traffic, you are better off deploying to a barebones EC2 instance. You could Dockerize your application, but it might not even provide a huge advantage. If fault tolerance is required, you can get 2 servers and a load balancer for redundancy.

A typical server can handle ~1,000 requests per second, so if your peak traffic stays below ~100 requests per second, you have low user traffic. If traffic increases beyond that point, elastic load balancing is better suited to adapt to the workload.

If the number of people working on the ML code base is low, it might be better to avoid Kubernetes. The complexity of a code base is proportional to the number of people working on it. For example, if you have teams for ML engineering, MLOps, and data engineering, they each develop separate applications that need to be orchestrated together.

Containerizing becomes critical because each team has its own software practices, and applications communicate through APIs in a microservice infrastructure. ML applications become complex pipelines where data engineers might be in charge of data processing applications, ML engineers in charge of ML model inference applications, and MLOps engineers in charge of model monitoring applications, all of which have to work together seamlessly.

Teams are likely to work independently of each other and need to focus on optimizing their own piece without constantly checking on others. Kubernetes can be a good solution when that level of complexity occurs.

It abstracts the different applications into computational blocks, and they are orchestrated by the Kube cluster itself, which allows for a high level of automation. It provides a scaling mechanism similar to load balancing to adapt to high workloads.

Very few companies can claim to have that level of complexity, and even if people belong to different teams, if fewer than a dozen people are involved in deploying models, it is unlikely that the complexity calls for Kubernetes.

Even if the code seems complex, it might be simpler for those people to work on the same code base in a monolithic application.


 Kubernetes Scaling Strategies


Horizontal Pod Autoscaling (HPA):

  • Function: Adjusts the number of pod replicas based on CPU/memory usage or other select metrics.
  • Workflow: The Metrics Server collects data → API Server communicates with the HPA controller → The HPA controller scales the number of pods up or down based on the metrics.
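
A minimal sketch of what such an HPA looks like, targeting the CPU usage of a Deployment; the Deployment name and the thresholds below are assumptions for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:                  # which workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # add pods when average CPU exceeds 70%
```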

Vertical Pod Autoscaling (VPA):

  • Function: Adjusts the resource limits and requests (CPU/memory) for containers within pods.
  • Workflow: The Metrics Server collects data → API Server communicates with the VPA controller → The VPA controller scales the resource requests and limits for pods.
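
A sketch of a corresponding VPA object, assuming the Vertical Pod Autoscaler components from the kubernetes/autoscaler project are installed in the cluster; names and bounds are placeholders:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ml-inference-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  updatePolicy:
    updateMode: "Auto"             # VPA may evict pods to apply new requests/limits
  resourcePolicy:
    containerPolicies:
    - containerName: "*"           # apply to all containers in the pod
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: "2"
        memory: 4Gi
```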

Cluster Autoscaling:

  • Function: Adjusts the number of nodes in the cluster to ensure pods can be scheduled.
  • Workflow: Scheduler identifies pending pods → Cluster Autoscaler determines the need for more nodes → New nodes are added to the cluster to accommodate the pending pods.

Manual Scaling:

  • Function: Manually adjusts the number of pod replicas.
  • Workflow: A user runs a kubectl command to scale pods → API Server processes the command → The number of pod replicas in the cluster is adjusted accordingly.
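
As a concrete example, a command like `kubectl scale deployment/ml-inference --replicas=5` (the deployment name here is a placeholder) sets the replica count directly, and the control plane reconciles the cluster to match.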

Predictive Scaling:

  • Function: Uses machine learning models to predict future workloads and scales resources proactively.
  • Workflow: ML Forecast generates predictions → KEDA (Kubernetes-based Event Driven Autoscaling) acts on these predictions → Cluster Controller ensures resource balance by scaling resources.
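
One simplified way to act on a forecast with KEDA is a ScaledObject whose schedule mirrors the predicted traffic pattern; a real predictive setup would usually feed forecasts in through an external scaler or a metrics source instead. KEDA must be installed, and the names, times and replica counts below are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-inference-predictive
spec:
  scaleTargetRef:
    name: ml-inference             # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: cron                     # scale up ahead of the predicted daily peak
    metadata:
      timezone: Europe/London
      start: "0 8 * * *"           # predicted peak window begins at 08:00
      end: "0 20 * * *"            # and ends at 20:00
      desiredReplicas: "10"
```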

Custom Metrics Based Scaling:

  • Function: Scales pods based on custom application-specific metrics.
  • Workflow: Custom Metrics Server collects and provides metrics → HPA controller retrieves these metrics → The HPA controller scales the deployment based on custom metrics.
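
A sketch of an HPA driven by a custom per-pod metric; this assumes a custom metrics adapter (for example prometheus-adapter) is installed and exposes a metric such as the hypothetical inference_requests_per_second:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_requests_per_second   # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "100"        # target ~100 requests/second per pod
```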

These strategies ensure that Kubernetes environments can efficiently manage varying loads, maintain performance, and optimize resource usage. Each method offers different benefits depending on the specific needs of the application and infrastructure.
