CNCF - Blog

Kubernetes is turning 10! Join the party on June 6th

Mon, 06 May 2024 16:00:00 GMT

Over the last 10 years, Kubernetes has risen to become the backbone of modern application deployment and has completely changed the course of innovation.

Kubernetes was the first project accepted into the CNCF Incubator in March 2016 and remains a top open source project. Kubernetes not only shattered expectations for itself but also set a North Star for the entire CNCF ecosystem. CNCF has since grown to more than 184 cloud native projects, and the success of Kubernetes in the CNCF established a baseline of what was possible for others.

Kubernetes is today the de facto standard to deploy and operate containerized applications. Originally developed at Google and released as open source in 2014, Kubernetes builds on 15 years of running Google’s containerized workloads and the valuable contributions from the open source community. Kubernetes makes everything associated with deploying and managing applications easier.

On June 6, the Kubernetes and cloud native community will come together across the globe to celebrate a tremendous decade and look forward to the next #KuberTENes. Everyone in the community will have a chance to #celebr8k8s in the following ways.

KuberTENes Birthday Bash

Join us for a fun and memorable in-person event at Google’s Bay View Campus in Mountain View, California to celebrate on June 6th.

The birthday bash will blend elements of a reunion show, a celebration of the many hidden figures who made this project possible, and a preview of what awaits us in the next 10 years. We’ll celebrate the people who have been integral in shaping Kubernetes into what it is today, and discuss why its significance will only grow in the years ahead.

Interested in attending? We would love to have you. Tickets are free but space is limited so RSVP soon!

Birthday Bash Livestream/Replays

Can’t join the party in Mountain View? No problem! We will have a livestream during the event and the recording will be available as well on YouTube for all to see.

KuberTENes Birthday Party in a Box & Local Meet-Ups

If you’re unable to attend in the Bay Area or simply can’t make it, don’t fret! You can still join the celebration in your local community. Request a KuberTENes party in a box here; please note that all requests need to be made by May 10. Alternatively, you can connect with local Cloud Native Meet-Up Groups, who are already working hard to organize events in your area. Locations on the list include, but are not limited to:

Cloud Native Wellington
Cloud Native Quebec
Cloud Native Luxembourg
Cloud Native Helsinki
Cloud Native Dhaka
Cloud Native Chattogram
Cloud Native Paris
Cloud Native Tel Aviv
Cloud Native Singapore
Cloud Native Taiwan User Group
Cloud Native Copenhagen
Cloud Native Aarhus
Cloud Native Santa Catarina
Cloud Native New York Kubernetes Meetup
CloudNative Lorient
Cloud Native Saint Louis
Cloud Native Guatemala
Cloud Native Vilnius
Cloud Native Istanbul
Cloud Native Kerala
Cloud Native New Delhi
Cloud Native Stockholm
Cloud Native Aalborg
Cloud Native Goteborg
Cloud Native Amsterdam
Cloud Native and Kubernetes Oslo
Cloud Native Dallas
Cloud Native Sao Paulo
Cloud Native Munich
Cloud Native Bangalore
Cloud Native Salt Lake City

And whether or not you join one of these events, be sure to join the biggest gathering of Kubernetes enthusiasts in Salt Lake City for KubeCon + CloudNativeCon North America, where we plan to continue the 10-year celebrations.

Adding color-blind themes to Kubecolor to make Kubernetes more inclusive

Sun, 05 May 2024 16:00:00 GMT

Ambassador post originally published on Sebastian “Prune” Thomas’s blog

Kubecolor is a thin wrapper over the kubectl command that adds coloring to the output.

I cloned the project and started maintaining it back in 2022 when the original author wasn’t active anymore.

KubeColor reformats the output of most kubectl commands to add colors and clarity. It makes the output so much easier to read that I still don’t understand why it’s not more widely used. I actually gave a lightning talk about it at KubeCon’s Cloud Native Rejekts Europe 2024 in Paris, if you want a video pitch.

One of the longest-requested features, discussed at length in the previous project, was the ability to customize the color theme used by KubeColor.

Actually, when I first cloned the original project, I applied a patch that globally changed the colors to make things less colorful and more standard by limiting them to a smaller set of colors. Some people started complaining right away, but that’s what people do anyway.

As of version 0.3.0, kubecolor supports custom color schemes and themes, thanks to the work of the other main contributor, AppleJag, who is a talented (Go) developer. Jag, I can’t thank you enough for all the help on this project.

What’s the problem?

By default, kubecolor uses the set of colors from your terminal’s config, so it has always been possible to configure it. Just change the theme of your terminal and you can adapt the colors to your needs!

But beyond the default colors, some users want to colorize specific fields differently, or use more colors to further differentiate things.

But there’s more…

According to this article, one man out of 12 has some sort of color blindness (or color disability). Women are less affected, with one out of 200, but still, it’s a lot! (Numbers may vary depending on the website too…)

Check the Wikipedia page to learn more, and there are tons of other sites about this matter. And still, it’s usually not something we think of right away.

For example, just look at my blog, with its low-contrast grey colors, and you’ll understand that color blindness was not my main concern at the time.

And, well, when we think about inclusion, it’s generally about gender and skin color, and when we think about accessibility, it’s about mobility impairment, deafness and blindness.

Color-blindness is usually not mentioned or taken care of. The CNCF Website itself does not mention it directly. The only TAG (Technical Advisory Group) in the Accessibility section is focused on hearing issues.

Maybe it’s because it’s easier to live with color-blindness, or because people don’t talk about it out of shame, but it’s still a real problem and the numbers are huge, far more than what I had always believed.

Note that I’m not trying to rate any of those against the others, or trying to speak in place of the impacted persons. I’m not impaired and I just want to shed some light on inclusivity.

So, what to do with this?

As soon as kubecolor got the color theme functionality, I started thinking of adding one or more color themes for the various kinds of color-blindness.

The first question that came to my mind was:

It’s quite usual to use green for good things (success) and red for bad things (errors). But is there a common pattern for color-blind persons?

Well, so far I don’t have the answer. But while searching I learnt a few things:

Color is important, but so is contrast
There are also modifiers like bold and italic that can help better differentiate things
It is usually better to add some text explaining the status and not rely only on the color. Here, no problem for us, as we are adding colors to an already expressive text.
Maybe I should not add a theme at all and each person will build their own

Understanding KubeColor Themes

Thanks to Jag, kubecolor can process many kinds of definitions to configure the colors.

In short:

using regular color names (like red, blue) will use whatever is defined in your terminal application’s theme. white may be white, or not, but if you already have a theme made for color-blindness, you may not have to change anything.
using other ways to define colors, like HEX and RGB values, lets you use custom colors that are not part of your terminal’s theme.
using bg= or fg= changes the background or the foreground (text) color.
it is possible to use any of the modifiers like bold, italic and so on to further tune the visibility of each field.
thanks to all the KUBECOLOR_THEME_* ENV variables, it is possible to fully customize the output of “each” field, depending on the original command used against kubectl (like get or describe).
it is possible to create the theme as a file, which also enables sharing it, by creating a ~/.kube/color.yaml file (on macOS and Linux; the location may be different on Windows). We’ll dive into the format later, keep reading.
kubecolor embeds default themes, both in dark and light mode:
- dark
- light
- pre-0.0.21-dark: the previous color scheme from the original project
- pre-0.0.21-light

You can check the content of each basic theme in the code in the config/theme.go file.
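
To make the definition format more concrete, here is a minimal Python sketch of how a string such as fg=red:bold:bg=blue breaks down into a foreground, a background and modifiers. It only illustrates the syntax described above and is not kubecolor's actual parser (kubecolor is written in Go). The function name parse_color_spec, the "underline" modifier and the rule that a bare color name means the foreground are assumptions made for this example.

```python
# Illustrative only: kubecolor has its own parser, written in Go.
KNOWN_MODIFIERS = {"bold", "italic", "underline"}  # "underline" is an assumption


def parse_color_spec(spec: str) -> dict:
    """Split a definition such as 'fg=red:bold:bg=blue' into its parts."""
    parsed = {"fg": None, "bg": None, "modifiers": []}
    for part in spec.split(":"):
        part = part.strip()
        if not part:
            continue
        if part.startswith("fg="):
            parsed["fg"] = part[len("fg="):]
        elif part.startswith("bg="):
            parsed["bg"] = part[len("bg="):]
        elif part in KNOWN_MODIFIERS:
            parsed["modifiers"].append(part)
        elif parsed["fg"] is None:
            # Assumption: a bare color such as 'pink' sets the foreground.
            parsed["fg"] = part
        else:
            raise ValueError(f"unrecognized part {part!r} in {spec!r}")
    return parsed


print(parse_color_spec("fg=red:bold:bg=blue"))
print(parse_color_spec("fg=white:bg=#c2270a"))
```

Reading a definition this way makes it easier to see which part of a theme entry controls the text color, the background and the emphasis.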

How to build a theme

As said earlier, you can either use the KUBECOLOR_THEME_* env variables or create your theme in the ~/.kube/color.yaml file.

Using ENV Variables

The easiest is to check the docs at https://github.com/kubecolor/kubecolor/blob/main/README.md#color-theme and experiment.

In any case, you have to pick a base theme, by setting KUBECOLOR_PRESET, then update some of the colors. For example you can change all the running pods to blue with:

KUBECOLOR_THEME_BASE_SUCCESS=blue KUBECOLOR_PRESET=dark kubecolor get pods -o wide

Bash

Using the config file

Create the file ~/.kube/color.yaml and add some content like:

preset: dark
theme:
  table:
    header: fg=red:bold:bg=blue

YAML

So basically, you take the name of the ENV variable and turn each part of it into a nested key.

With KUBECOLOR_THEME_STATUS_ERROR, you remove the KUBECOLOR part, so the final path is theme.status.error. To show pods in error in pink:

preset: dark
theme:
  table:
    header: fg=red:bold:bg=blue
  status:
    error: pink

YAML
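
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the mapping described in the text, assuming every underscore-separated part becomes one nested key; the helper names env_to_config_path and nest are hypothetical, and keys that themselves contain several words may not follow this simple rule.

```python
PREFIX = "KUBECOLOR_"


def env_to_config_path(env_name: str) -> str:
    """Map KUBECOLOR_THEME_STATUS_ERROR to the dotted path theme.status.error."""
    if not env_name.startswith(PREFIX):
        raise ValueError(f"{env_name!r} does not start with {PREFIX!r}")
    # Assumption: each remaining underscore-separated part is one YAML key.
    return ".".join(env_name[len(PREFIX):].lower().split("_"))


def nest(path: str, value: str) -> dict:
    """Build the nested mapping that the YAML file would contain."""
    nested = value
    for key in reversed(path.split(".")):
        nested = {key: nested}
    return nested


path = env_to_config_path("KUBECOLOR_THEME_STATUS_ERROR")
print(path)                # theme.status.error
print(nest(path, "pink"))  # {'theme': {'status': {'error': 'pink'}}}
```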

First, I want to clearly state that I do not have any color impairment, and the work I’m trying to achieve here is based on articles I read and some conversations with color-blind persons. There’s no scientific work on my side.

The idea is to provide an out-of-the-box solution to help people with color-blindness. The outcome may not be perfect or even useful and I take no responsibility. It’s best effort. It’s open source. Bear with me.

After some research, I found the Chromatic Vision Simulator website, which lets you load an image and, using a quad view, see what color-blind persons may see depending on the kind of disability.

In short, if I upload one of the previous images that captured the k get pods -o wide command, we can check how it looks using the dark theme:

regular view
Protanopia view
deuteranopia view
tritanopia view

Now I guess we all understand the issue with the current color scheme of the dark theme: any impaired person will lose most of the color information. At this point, they might as well use plain kubectl commands…

So I tested some chromatic progressions to try to identify a palette that could work fine at least most of the time:

Being color-blind is not just about not seeing green or red. There’s also quite a limitation in the color hues that are perceived, so everything from green to red, where the color changes slowly for a regular eye, will be almost the same brown/yellowish for a person with protanopia.

My final conclusion is that it seems possible to achieve a theme that will help better differentiate the content. What we need here is different color hues to show the difference between, mostly, good and bad situations, and color cycles when there’s a table.

Using the Observable HQ website, I used the discrete 10 scheme to cut the rainbow into 10 usable colors:

#23171b
#4860e6
#2aabee
#2ee5ae
#6afd6a
#c0ee3d
#feb927
#fe6e1a
#c2270a
#900c00

Once rendered, we have:

The dark theme only uses 6 colors (well, 5, as one is the default white for the dark theme, or black for the light theme). So here’s my selection:

Terminal Color	Matching Color	protanopia	deuteranopia	tritanopia
`yellow`	`#feb927`	`#f9bb27`	`#fbbc23`	`#ffacb6`
`magenta`	`#4860e6`	`#a77fe5`	`#888ee4`	`#257e7d`
`green`	`#6afd6a`	`#fee16c`	`#fee16c`	`#fee16c`
`red`	`#c2270a`	`#bb8b16`	`#936a15`	`#ff6579`
`cyan`	`#2aabee`	`#34adee`	`#22afef`	`#34b4b5`
Null color (white-ish)	`#2ee5ae`	`#e8d0b0`	`#c6beb3`	`#4ddfe0`

I also used bold for success, and I inverted the error so that the background is red and the text is white. High contrast is usually a good helper when we’re limited in the possible colors.
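
One way to put a number on "high contrast" is the contrast ratio defined in the WCAG accessibility guidelines, which compares the relative luminance of two colors. The sketch below is not part of kubecolor; it simply applies that published formula to the white-on-red error style, and the function names are made up for this example.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color, from 0 (black) to 1 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(color_a: str, color_b: str) -> float:
    """WCAG contrast ratio, from 1:1 (identical) to 21:1 (black on white)."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


# White text on the red background used for errors in the selection above.
print(round(contrast_ratio("#ffffff", "#c2270a"), 2))
```

WCAG recommends at least 4.5:1 for normal text, and white on #c2270a lands a little under 6:1, so the inverted error style clears that bar. The same check can be run on any pair of colors before adding it to a theme.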

The result seems to be pretty much working in all situations:

Using the themes

Finally, alongside the other 4 themes listed earlier, you can now use any of the new themes if you’re affected by color blindness. They are:

protanopia-dark
protanopia-light
deuteranopia-dark
deuteranopia-light
tritanopia-dark
tritanopia-light

Just set your env variables like:

KUBECOLOR_PRESET=protanopia-dark kubecolor get pods

Bash

export KUBECOLOR_PRESET=protanopia-dark
kubecolor get pods

Bash

Or set it in your config file ~/.kube/color.yaml like:

preset: protanopia-dark

YAML

Updating the theme

As the themes are pretty much a first iteration and a work in progress, please feel free to comment and open an issue if you feel the current themes can be enhanced.

Also, you can start creating your own theme by modifying an existing one, then share it either in an issue or a Pull Request.

Simply start from the original theme file and add more customization to the ~/.kube/color.yaml file:

preset: protanopia-dark
theme:
  base:
    key:
       - fg=#feb927
       - fg=white
    info:
    primary: fg=#4860e6
    secondary: fg=#2aabee
    success: fg=#6afd6a:bold
    warning: fg=#feb927
    danger: fg=white:bg=#c2270a
    muted: fg=#feb927
  options:
    flag: fg=#feb927
  table:
    header: fg=white:bold:bg=#2aabee
  status:
    error: fg=white:bg=#c2270a

YAML

Note that, at the moment, all protanopia, deuteranopia and tritanopia themes are the same. Please, when you leave feedback, mention your condition, so we can update the themes differently to better suit each of the different situations.

I would encourage you to set your default theme according to your type of disability to benefit from future changes.

Wrapping it up

Next time you see the screen of a co-worker using strange colors, don’t smile or make fun: this person may have some sort of color blindness. Instead, just tell them that KubeColor is now their friend.

Even worse, the next time you see someone using kubectl in monochrome, insist that they go check out Kubecolor!

We put a lot of effort into this feature. We truly hope that it will help some people out there and make Kubernetes more inclusive. If not, it was a good adventure.

This feature is available in Kubecolor v0.3.0, out now!

Top 5 cloud computing trends of 2024

Thu, 02 May 2024 16:00:00 GMT

Member post by Sameer Danave, Senior Director of Marketing, MSys Technologies

Every time I think I have this whole technology game down to a science, something changes in the blink of an eye. And if you’re as enthusiastic about the cloud as I am, you’ve likely experienced the same feeling of whiplash as cloud trends continue to shift.

Keeping up with the latest technology trends isn’t always easy. However, to stay ahead of the competition, it’s pivotal to stay ahead of them.

Fortunately, I’ve gathered all the information you need on the latest cloud computing trends straight from industry experts and MSys’ survey of 400+ technology professionals and crafted a Tech Lens 2024 guide just for you. Let’s delve into the top five trends from this guide for 2024.

Top 5 Key Cloud Computing Trends to Watch

Here are the top five trends that are expected to gain significant traction in the coming years.

AI As A Service (AIaaS)

In the forthcoming years, significant growth is foreseen in the integration of AI services into cloud solutions. Cloud infrastructure plays a crucial role in opening up AI’s economic and social benefits to enterprises. Training AI models, such as the robust large language model (LLM) powering ChatGPT, demands extensive data and substantial computing resources.

Enterprises are shifting away from constructing their own AI infrastructure and opting for AI-as-a-service provided by cloud platforms. This transition allows them to harness the transformative power of AI without the constraints of managing resources. AI as a Service offers pre-built AI models, tools, and APIs hosted on cloud platforms, enabling enterprises to seamlessly implement AI functionalities, even without specialized AI expertise and infrastructure.

Hybrid & Multi-Cloud Strategies

Multi-cloud and hybrid solutions have become incredibly popular among enterprises across the globe. The hybrid multi-cloud approach incorporates public cloud services from multiple providers, enabling portability across diverse cloud infrastructures. This enhances flexibility and reduces dependency on a single vendor, thus mitigating the risk of vendor lock-in.

Besides, hybrid cloud solutions offer a flexible approach to managing data storage complexities. By integrating public and private cloud environments, organizations can leverage existing infrastructure while gaining scalability, security, and redundancy. This approach optimizes storage resource allocation, strengthens disaster recovery capabilities, and fosters agility in response to evolving business requirements.

Moreover, hybrid cloud solutions provide enhanced control over IT infrastructure and bolstered security compared to alternative cloud options. Cloud vendors employ expert security professionals to ensure data protection, adhering to stringent protocols and compliance measures.

Edge AI Computing

The edge computing landscape is expected to witness significant traction in the forthcoming years. In the traditional cloud model, data transfers to a remote server for processing. In contrast, edge computing establishes a compact computing environment near the data source.

This reduces latency and enables real-time analysis and decision-making. The deployment of advanced networks like 5G, along with energy-efficient processors and algorithms, is expected to further bolster edge computing’s viability for evolving application needs by 2024.

Sustainable Cloud Computing

Sustainable computing is expected to experience significant growth in the years ahead. This trend is fueled by the understanding that approximately 1.8% to 3.9% of global greenhouse gas emissions stem from the information and communication technology (ICT) sector.

Green computing encompasses environmentally conscious practices across the lifecycle of computers, chips, and other technology components, spanning from design and manufacturing to usage and disposal. Its objective is to mitigate environmental impact by decreasing carbon emissions and energy consumption across all stages, including production, data centers, and end-user operations.

Additionally, green computing involves the selection of sustainably sourced materials, minimizing electronic waste, and promoting sustainability through the use of renewable resources.

Serverless Computing

Expected to see significant expansion with a Compound Annual Growth Rate (CAGR) of 23.17% between 2023 and 2028, serverless computing brings forth novel methods for creating and operating software applications and services. This emerging paradigm eradicates the necessity of infrastructure management, empowering users to write and deploy code devoid of the complexities of underlying systems.

This transition offers numerous benefits for developers, including quicker time-to-market, improved scalability, and decreased deployment costs for new services. As a result, developers can focus on innovation rather than the intricacies of managing infrastructure.
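
To put the growth rate quoted above in perspective, the short Python sketch below compounds a 23.17% annual rate over the five years from 2023 to 2028. It only works through the arithmetic; the function name is made up for this example and no market-size figure is assumed.

```python
def compound_growth_factor(rate_percent: float, years: int) -> float:
    """Overall growth multiple after compounding an annual rate for `years` years."""
    return (1 + rate_percent / 100) ** years


# CAGR of 23.17% between 2023 and 2028, i.e. five compounding periods.
factor = compound_growth_factor(23.17, 2028 - 2023)
print(f"growth multiple over the period: {factor:.2f}x")
```

In other words, a market growing at that rate would end the period at roughly 2.8 times its starting size.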

Conclusion

In the whirlwind of technological advancement, keeping pace with cloud computing trends is both exhilarating and essential. Drawing from insights of industry experts and a survey of over 400 technology professionals by MSys, we’ve distilled the top five trends for 2024. From AI integration to sustainable practices and serverless architectures, these trends promise to reshape our approach to technology. By embracing them, we can propel our organizations forward and stay ahead of the competition. This guide offers actionable insights to navigate these trends effectively. Let’s embark on this journey together, pushing the boundaries of cloud computing’s possibilities.

About Sameer

Sameer is a seasoned technology marketing professional with 16 years of full-stack marketing experience. He believes in 2 Cs – ‘Customer Value’ and Communications. All his Marketing campaigns and projects are packaged with it.

He drives phygital (physical + digital) campaigns that attract and pull customers towards the brand’s value. His marketing strategies apply omnichannel, conversational marketing tactics (Storytelling, Social, and Chatbot), AI-Enabled Inbound Marketing, backed by solid analytics and insights with ‘content’ as a core part of the strategy.

Sameer is a team player with meticulous planning, attention to detail, and the ability to perform effortlessly under pressure.

Is your supply chain secure? Double check with our framework

Thu, 02 May 2024 16:00:00 GMT

A secure supply chain is a critical piece of cloud native security, and it can be tricky to get right because it covers such a broad expanse of factors from code to pipelines and beyond.

Join us on June 26 & 27 for CloudNativeSecurityCon North America 2024 in Seattle

The breadth of the supply chain also makes it vulnerable, and according to a survey from Security Magazine, 91% of organizations experienced attacks in 2023. The top three types of attacks were exploited vulnerabilities or misconfigurations, stolen secrets, and data breaches. The reverberations of a supply chain attack go far beyond the organization and include reputational damage, loss of revenue, and even legal liability. In fact, IBM’s 2023 “Cost of a Data Breach” survey found attackers cost organizations worldwide an average of $4.45 million, which is a 15% increase over the last three years.

Not surprisingly, 51% of survey respondents told IBM their organizations were planning to increase spending on security.

So, no matter where your organization is on the journey to a more secure supply chain, taking extra steps is never a bad idea. Our Security Technical Advisory Group has created a series of questions teams can ask to dig deeper. The framework is divided into four areas: source code, materials, build pipelines, and artefacts and deployments.

Start by verifying the source code, asking questions including:

Do you require signed commits?
Do you use git hooks or automated scans to prevent committing secrets to source control?
Have you defined an unacceptable risk level for vulnerabilities? For example: no code may be committed that includes Critical or High vulnerabilities
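
As a rough illustration of the automated scans mentioned in these questions, here is a minimal Python sketch that looks for two common secret shapes in a piece of text. It is not a replacement for a dedicated secret scanner and is not part of the framework itself; the pattern list is deliberately tiny and the function name is hypothetical. A git hook could run a check like this on staged changes and refuse the commit when anything is found.

```python
import re

# Two illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list:
    """Return (line number, pattern name) pairs for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = "region = us-east-1\naws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
for lineno, name in find_secrets(sample):
    print(f"line {lineno}: possible {name}")
```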

Next, verify materials:

Do you verify that dependencies meet your minimum thresholds for quality and reliability?
Do you automatically scan dependencies for security issues and license compliance?
Do you automatically perform Software Composition Analysis on dependencies when they are downloaded/installed?

Make certain the build pipelines are protected:

Do you use hardened, minimal containers as the foundation for your build workers?
Do you maintain your build and test pipelines as Infrastructure-as-Code?
Do you automate every step in your build pipeline outside of code reviews and final sign-offs?

And finally, protect artefacts and deployments:

Is every artefact you produce (including metadata and intermediate artefacts) signed?
Do you distribute metadata in a way that can be verified by downstream consumers of your products?

Dive into the entire framework, but don’t stop there!

Join us in Seattle for CloudNativeSecurityCon North America 2024 on June 26 and 27 to learn from and network with experts in every facet of cloud native security.

Register! Learn more!

Early explorations and practices of Xline, a stateful application managed by Karmada

Wed, 01 May 2024 16:00:00 GMT

Member post by DatenLord

Background and Motivation

More and more IT vendors are now embracing cross-cloud multi-clustering as cloud-native technologies and cloud markets continue to mature. Here’s Flexera’s mid-2023 survey on the cloud-native market’s acceptance of multi-cloud, multi-cluster management. (info.flexera.com)

As you can see from Flexera’s report, 87 percent of organizations in the overall cloud-native market are already using services from multiple cloud vendors at the same time, while only 13 percent are using a single public cloud or a single private cloud. The multi-cloud share breaks down into 15% choosing multi-public or multi-private cloud deployments and 72% adopting hybrid cloud deployments. These statistics reflect the maturity of cloud-native technologies and the cloud marketplace, and the future will be the era of programmatic multi-cloud managed services.

In addition to external trends, the limitations of single-cluster deployments have become an intrinsic motivation for users to embrace multi-cloud, multi-cluster management. Limitations of single cluster deployments include, but are not limited to:

A single point of failure, where cluster-level failures are difficult to tolerate, and a small cluster federation outperforms a large K8s cluster.
Boundary constraints of a single cluster, e.g., a Node can run at most 110 Pods by default, and a cluster can hold up to 5000 Nodes.
Business-level development needs, e.g., Xline itself is a cross-cluster cluster.
….

Karmada, as an open source multi-cluster management tool, has been used by Shopee, DaoCloud and other companies in the production environment. However, since Karmada currently lacks support for stateful application management, it is still mainly used for stateless application management in practice.

To better cope with the future trend of multi-cloud and multi-cluster management, and to better manage stateful applications in multi-cloud and multi-cluster scenarios, Xline and the Karmada community set up a working group to jointly promote Karmada’s support for stateful application management.

What are the challenges of managing stateful applications with Karmada?

Before understanding how Karmada manages stateful applications across multiple clusters, we need to look back at the K8s implementation of managing stateful applications in a single cluster.

Back in 2012, Randy Bias gave an influential talk on “Open and Scalable Cloud Architecture”. In that talk, he proposed the “Pets” versus “Cattle” analogy.

These two concepts correspond to stateful and stateless applications, respectively. Cattle don’t need names, and they are not unique. This means we can easily replace one with another when one of them has a problem. Pets are different. Each pet is unique, with its own name, and should be looked after carefully when it has a problem.

StatefulSet was introduced in Kubernetes 1.5 and stabilized in version 1.9. It provides a fixed Pod identity for managing Pods, persistent storage for each Pod, and a strict start/stop order among Pods.

The problems are: what exactly constitutes a state, and how Kubernetes addresses it.

In the Karmada multi-cluster scenario, stateful applications pose the following problems:

How to ensure that multiple application instances across clusters can have a globally uniform start/stop order, which affects the scale in/out and rolling updates of some application instances. For a distributed KV storage based on consensus protocol, the process of scale in/out needs to go through membership change, which involves the determination of majority change in the cluster. If multiple member clusters scale out at the same time without a globally standardized ordering, it will affect the correctness of the consensus reached by the consensus protocol.
How to ensure that all applications across clusters have globally unique instance identifiers. A natural solution is to incorporate member cluster ids into instance identifiers.
How to solve the problem of cross-cluster application communication and provide a globally uniform network identity. Currently, in our attempts and practices, we use Submariner to bridge the network communication between multiple member clusters. The current implementation relies on a specific network plugin.
How to solve the common functions such as cross-cluster stateful application update and capacity expansion and contraction, and provide more fine-grained update policies, such as realizing the function of Partition Update in member clusters.
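
The first problem in this list can be illustrated with a little quorum arithmetic. The Python sketch below uses generic majority-quorum reasoning (as found in Raft-style protocols) and is an assumption-laden illustration, not a description of Xline's actual protocol: it checks whether a majority of the old membership and a majority of the new membership are guaranteed to share at least one node when members are only added.

```python
def quorum(size: int) -> int:
    """Smallest number of members that forms a majority."""
    return size // 2 + 1


def majorities_must_overlap(old_size: int, new_size: int) -> bool:
    """True if any old majority and any new majority share a member.

    Assumes the new membership is the old one plus added members, so both
    majorities are drawn from the same `new_size` nodes.
    """
    return quorum(old_size) + quorum(new_size) > new_size


# Adding one member to a 3-node cluster keeps the majorities overlapping.
print(majorities_must_overlap(3, 4))  # True
# Two member clusters each adding a node at the same time (3 -> 5) does not.
print(majorities_must_overlap(3, 5))  # False
```

When the check fails, two disjoint groups could each believe they hold a majority, which is why uncoordinated scale-outs from several member clusters need a global ordering.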

In order to better solve the above-mentioned problems, we need to introduce a new workload on Karmada to implement a cross-cluster version of “StatefulSet”.

Some early attempts at Xline

Since the Karmada community has not yet discussed the implementation details of the new API, we have made some simple attempts to deploy, scale up and down, and update Xline under Karmada. The overall architecture of the program is as follows:

In the overall architecture, we adopt a two-tier Operator approach. In the control plane of Karmada, we deploy a Karmada Xline Operator, which is responsible for interpreting and splitting the Xline resources defined in Karmada and sending them to member clusters. The Xline Operator on each member cluster monitors the creation of the corresponding resource and then enters the Reconcile process to complete the operation.

Deployment

Let’s take a look at a common deployment method for distributed application clusters under a single cluster (using etcd operator to deploy an etcd cluster as an example). etcd-operator can deploy an etcd cluster in two phases:

Bootstrap: Create a seed node of etcd with an initial-cluster-state of new and a unique initial-cluster-token.
Scale out: perform a member add on the seed cluster to update the cluster network topology, and then start a new etcd node with an updated network topology and an initial-cluster-state of existing.
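
The two phases differ mainly in the startup flags passed to etcd. The Python sketch below builds those flags for a seed node and for a node joining later. It is only an illustration, not etcd-operator code: the helper name, the node names and the peer URLs are made up, and other flags a real deployment needs (listen and advertise URLs, data directory, TLS) are left out.

```python
def etcd_args(name: str, members: dict, state: str, token: str) -> list:
    """Build the cluster-related etcd flags for one node.

    `members` maps node names to peer URLs and must describe the topology
    the node should see when it starts.
    """
    if state not in ("new", "existing"):
        raise ValueError("state must be 'new' or 'existing'")
    initial_cluster = ",".join(f"{n}={url}" for n, url in members.items())
    return [
        f"--name={name}",
        f"--initial-cluster={initial_cluster}",
        f"--initial-cluster-state={state}",
        f"--initial-cluster-token={token}",
    ]


# Phase 1, bootstrap: a single seed node forms a brand-new cluster.
seed = {"etcd-0": "http://etcd-0:2380"}
print(etcd_args("etcd-0", seed, "new", "demo-token"))

# Phase 2, scale out: after `member add`, the new node starts with the
# updated topology and joins the existing cluster.
grown = {**seed, "etcd-1": "http://etcd-1:2380"}
print(etcd_args("etcd-1", grown, "existing", "demo-token"))
```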

However, in the cross-cluster scenario, due to the lack of a globally standardized startup order for pods in different member clusters, Xline Operators in different member clusters will concurrently perform cluster expansion operations, which will adversely affect the membership change process of the consensus protocol. In order to bypass this problem, Xline adopts a static deployment method, as shown in the following figure:

First of all, users need to define the corresponding resources on Karmada to describe the cluster topology of a cross-cluster Xline cluster. After detecting that the resources have been applied, the Karmada Xline Operator interprets and splits them into XlineCluster CRs and distributes each CR to its member cluster. The XlineCluster CR contains the number of replicas that should be created for the current member cluster, as well as the member cluster ids of the other clusters and their corresponding numbers of replicas. After detecting the creation of the CR, the Xline Operator on the member cluster enters the Reconcile process, generates the DNS names of the other nodes in the Xline cluster from the distributed cluster topology, and starts the Xline Pods.
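
Because the topology is fully known up front, every operator can derive the complete peer list on its own. The Python sketch below shows the idea under stated assumptions: the topology is a mapping from member cluster id to replica count, and the DNS pattern (pod name, cluster id, service, namespace, then a clusterset.local suffix) is a hypothetical layout modelled on multi-cluster service DNS, not necessarily the exact names the Xline Operator generates.

```python
def peer_dns_names(topology: dict, name: str = "xline", namespace: str = "default") -> list:
    """Derive one DNS name per Xline replica from the static cluster topology.

    `topology` maps a member cluster id to its replica count.
    The name pattern below is an assumption made for illustration.
    """
    names = []
    for cluster_id, replicas in topology.items():
        for ordinal in range(replicas):
            names.append(
                f"{name}-{ordinal}.{cluster_id}.{name}.{namespace}.svc.clusterset.local"
            )
    return names


# Hypothetical topology: 3 replicas on member1 and 2 on member2.
for dns_name in peer_dns_names({"member1": 3, "member2": 2}):
    print(dns_name)
```

Including the member cluster id in each name also gives every instance a globally unique identifier, and no membership change is needed at deployment time because all peers are listed from the start.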

In the early days of exploration, the static deployment approach bypassed the lack of a globally uniform startup order for application instances under Karmada’s multiple clusters because it did not involve a membership change in the deployment process. However, there is no silver bullet in the software industry, and the same is true for static deployments, which have some trade-offs as follows. The following table compares the characteristics of dynamic and static deployments in a single cluster vs. multi-cluster scenario:

Scaling Up and Down

There are two specific types of scale in/out for stateful applications under Karmada:

Horizontal scale in/out — remove/add a member cluster and scale in/out nodes on it.
Vertical scale in/out — scale in/out on existing member clusters.

Horizontal scale out

As shown above, the overall process is as follows:

Create the corresponding member cluster, configure the submariner network, and add it to Karmada for management.
Modify the Xline resources on Karmada, and add a new record member4: 4 in the member cluster field to indicate that you want to run 4 Xline replicas on member4.
Karmada Xline Operator will split the resources and distribute them to member4.
Xline Operator on member4 receives the corresponding resources, enters the corresponding Reconcile process, calls the Xline client to execute member add, reaches a consensus, starts the new Xline Pod, and repeats the above process until the number of Xline replicas on member4 reaches the specified number.

Vertical scale out

For vertical scale out, the general process is also shown above:

Modify the Xline resources on Karmada, e.g., specify that the Xline Pods in member1 should be expanded from 3 to 4.
Karmada Xline Operator will split the resource and distribute it to member1.
When Xline Operator on member1 receives the notification of resource modification, it enters the corresponding Reconcile process, calls the Xline client to perform member add, starts the new Xline Pod after consensus is reached, and repeats the above process until the number of Xline replicas on member1 reaches the specified number.

Currently, because scale in/out inevitably involves a membership change process, and there is a lack of synchronization between member clusters under Karmada, the scale process is limited: a horizontal scale out can only add one member cluster at a time, and a vertical scale out can only scale one specified member cluster at a time.
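The scale-out Reconcile loop described in both procedures can be sketched as follows. This is an illustrative Python sketch under stated assumptions: FakeXlineClient is a stand-in that only records membership (it is not the real Xline client API), and the node naming scheme and the start_pod callback are hypothetical.

```python
class FakeXlineClient:
    """Stand-in for an Xline client; it only records cluster membership."""

    def __init__(self, members):
        self.members = list(members)

    def member_add(self, name):
        # A real client would return only after consensus is reached.
        if name in self.members:
            raise ValueError(f"{name} is already a member")
        self.members.append(name)


def reconcile_scale_out(client, cluster_id, current, desired, start_pod):
    """Add nodes one at a time until `desired` replicas exist."""
    started = []
    while current < desired:
        node = f"xline-{current}.{cluster_id}"  # hypothetical naming scheme
        client.member_add(node)  # membership change comes first
        start_pod(node)          # the new pod starts only afterwards
        started.append(node)
        current += 1
    return started
```

Adding one member per iteration keeps each membership change separate, which matches the one-cluster-at-a-time limitation noted above.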

Rolling updates

For a rolling update, the general process is shown above:

The user modifies the Xline resource on Karmada to change the Xline image version.
The Karmada Xline Operator splits the resource and distributes it to the member clusters.
The Xline Operator on the member cluster monitors the resource changes and enters the corresponding Reconcile process to perform a rolling update. The update process on the member cluster is no different from the update on a single cluster.

Currently, the main supported update method is the default rolling update, but from the perspective of practical application scenarios, at least the following two issues need to be considered:

The update process involves stopping old Xline nodes and starting new ones, which requires additional mechanisms to ensure that the cluster remains available throughout the update.
More fine-grained update policies, such as Partition Update, should be supported; among multiple member clusters, priority should be given to updating clusters where only the follower exists, and when updating the member cluster where the leader resides, the leader should be transferred to the updated member cluster to avoid extreme situations where the leader frequently steps down due to rolling updates.
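The ordering policy proposed in the second point can be sketched in Python. This is only an illustration of the idea, not an existing Xline Operator feature: the function names and the step tuples are hypothetical.

```python
def update_order(roles_by_cluster):
    """Follower-only member clusters first, the leader's cluster last."""
    followers_only = sorted(
        c for c, roles in roles_by_cluster.items() if "leader" not in roles
    )
    with_leader = sorted(
        c for c, roles in roles_by_cluster.items() if "leader" in roles
    )
    return followers_only + with_leader


def plan_rolling_update(roles_by_cluster):
    """Build a step list that moves the leader before its cluster updates."""
    steps = []
    updated = []
    for cluster in update_order(roles_by_cluster):
        if "leader" in roles_by_cluster[cluster] and updated:
            # Move leadership to a cluster that has already been updated.
            steps.append(("transfer_leader", updated[0]))
        steps.append(("update", cluster))
        updated.append(cluster)
    return steps
```

With the leader on member1 and only followers on member2 and member3, the plan updates member2 and member3, transfers the leader to member2, and then updates member1, so the leader moves once instead of stepping down repeatedly.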

Conclusion

Given the trend of multi-cloud and multi-cluster management and the nature of Xline’s business, the Karmada and Xline communities have formed a working group to promote stateful application management in Karmada multi-clusters. To manage stateful applications in Karmada multi-clusters more elegantly, we need to introduce a new Karmada workload. Since the Karmada community has not yet reached a consensus on the implementation details of the new workload, Xline adopts a two-tier Operator approach in this early stage of experimentation, implemented through the Karmada Xline Operator and the Xline Operator on the member clusters. The Karmada Xline Operator interprets and splits the top-level resources and sends them to the member clusters, and then the Xline Operator on each member cluster reconciles the resources.

In this way, we have made some early attempts to deploy Xline on Karmada and explore rolling updates, and made some preliminary preparations for the development and design of the new Karmada StatefulSet workload in the future.

Accelerating Machine Learning with GPUs in Kubernetes using the NVIDIA Device Plugin

Mon, 29 Apr 2024 16:00:00 GMT

Member post originally published on the SuperOrbital blog by Keegan McCallum

NVIDIA Device Plugin for Kubernetes plays a crucial role in enabling organizations to harness the power of GPUs for accelerating machine learning workloads.

Introduction

Generative AI is having a moment right now, in no small part due to the immense scale of computing resources being leveraged to train and serve these models. Kubernetes has revolutionized the way we deploy and manage applications at scale, making it a natural choice for building large-scale computing platforms.

GPUs, with their parallel processing capabilities and high memory bandwidth, have become the go-to hardware for accelerating machine learning tasks. NVIDIA’s CUDA platform has emerged as the dominant framework for GPU computing, enabling developers to harness the power of GPUs for a wide range of applications. By combining the capabilities of Kubernetes with the extreme parallel computing power of modern GPUs like the NVIDIA H100, organizations are pushing the boundaries of what is possible with computers, from realistic video generation to analyzing entire novels worth of text and accurately answering questions about the contents.

However, orchestrating GPU-accelerated workloads in Kubernetes environments presents its own set of challenges. This is where the NVIDIA Device Plugin comes into play. It seamlessly integrates with Kubernetes, allowing you to expose GPUs on each node, monitor their health, and enable containers to leverage these powerful accelerators. By combining these two best of breed solutions, organizations are building robust, performant computing platforms to power the next generation of intelligent software.

Understanding the NVIDIA Device Plugin for Kubernetes

The NVIDIA Device Plugin is a Kubernetes DaemonSet that simplifies the management of GPU resources across a cluster. Its primary function is to automatically expose the number of GPUs on each node, making them discoverable and allocatable by the Kubernetes scheduler. This allows pods to request and consume GPU resources in a similar way to CPU and memory. Under the hood, the device plugin communicates with the kubelet on each node, providing information about the available GPUs and their capacities. It also monitors the health of the GPUs and reports any issues to Kubernetes.

Some of the benefits of the NVIDIA Device Plugin include:

Automatic GPU discovery and allocation, eliminating the need to manually configure GPU resources on each node.
Seamless integration with Kubernetes, allowing you to manage GPUs with familiar tools and workflows.
GPU health monitoring, allowing Kubernetes to maintain stability and reliability for GPU-accelerated workloads.
Resource sharing, which allows multiple pods to utilize the same GPU. This is crucial in an environment like today’s, where GPUs are scarce and expensive.

Installing and Configuring the NVIDIA Device Plugin

Prerequisites

Ensure that your GPU nodes have the necessary NVIDIA drivers (version ~= 384.81) installed.
Install the nvidia-container-toolkit (version >= 1.7.0) on each GPU node.
Configure the nvidia-container-runtime as the default runtime for Docker or containerd.
Kubernetes version >= 1.10
If you are using AWS EKS, for example, these prerequisites are handled for you by default on GPU nodes.

Deploying the Device Plugin

First, we’ll install the daemonset using helm. To install the latest version (v0.14.5 at the time of writing) into a cluster with default settings, the most basic command is:

helm upgrade -i nvdp nvidia-device-plugin \
  --repo https://nvidia.github.io/k8s-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version v0.14.5

This will install OR upgrade a helm release named nvdp into the nvidia-device-plugin namespace, with default settings.

This will give you a basic setup, but there are many reasons you may want to customize the chart via values.yaml. We’ll dive into some of the most useful options as well as some best practices, but you can see the full set of values here. You’ll likely want to add taints to your GPU nodes (the method used will depend on your Kubernetes setup and how you are provisioning nodes) and then configure tolerations to ensure that the device plugin is only scheduled on GPU-enabled nodes. We’ll dive deeper into these types of configurations in part 2 of this series.

The nvidia-device-plugin supports 3 strategies for GPU sharing and oversubscription, allowing you to optimize GPU utilization based on your specific workload’s requirements. A quick overview of each, with examples of how to configure via values.yaml:

Time-slicing: This strategy allows multiple workloads to share a GPU by interleaving their execution. Each workload is allocated a specific time slice during which it has exclusive access to the GPU. Time-slicing is useful when you have many small workloads that don’t require the full power of a GPU simultaneously. One important point to note is that nothing special is done to isolate workloads that are granted replicas from the same underlying GPU: each workload has access to the full GPU memory and runs in the same fault domain as all of the others (meaning that if one pod’s workload crashes, they all do). In my experience, time-slicing usually isn’t what you’re looking for when it comes to GPU resource sharing. It basically just lets all the pods access the single GPU in a free-for-all manner, executing things concurrently without any regard for each other. If you have workloads that don’t mind this, such as Jupyter notebooks for research that aren’t utilizing the GPU at the same time, this setting COULD be useful, but I’d recommend looking at the other options first unless you know what you’re doing.

Example values.yaml for time-slicing:

config:
  map:
    default: |-
      version: v1
      sharing:
        timeSlicing:
          resources:
          - name: nvidia.com/gpu
            replicas: 10

Multi-Instance GPUs (MIG): To mitigate the potential downsides of time-slicing, NVIDIA supports MIG. MIG is a feature supported on certain NVIDIA GPUs (e.g., A100) that enables partitioning a single GPU into multiple smaller, isolated instances. Each instance behaves like a separate GPU with its own memory and compute resources. MIG is beneficial when you have workloads with varying resource requirements and want to ensure strict isolation between them. This is in contrast to MPS, which gives you more fine-grained control over memory and compute resource allocation but doesn’t provide full memory protection and error isolation between workloads. MIG supports both mixed and single strategies for exposing GPUs to Kubernetes; if interested, you can read more about how they work here. Mixed is more flexible, and I’d recommend using mixed unless you have a cluster large enough that exposing only a single MIG type per node is feasible. MIG is only supported on NVIDIA Ampere and newer data center GPUs and, while less flexible than MPS, it is the most complete solution for workload isolation if your workloads require that.

Example values.yaml for MIG:

config:
  map:
    default: |
      version: v1
      flags:
        migStrategy: "mixed"

CUDA Multi-Process Service (MPS): MPS is a runtime service that enables multiple CUDA processes to share a single GPU context. It allows fine-grained sharing of GPU resources among multiple pods by running CUDA kernels concurrently. This mode feels the most similar to the way Kubernetes can allocate CPU and memory resources in a fine-grained way, and is supported on almost every CUDA-compatible GPU. MPS will split up a GPU into equal slices of compute and memory, and the MPS control daemon will enforce these limits. Sharing with MPS is currently not supported on devices with MIG enabled. MPS is suitable when you have workloads that can efficiently share GPU resources without strict isolation requirements. If you don’t have strict isolation requirements, MPS is probably the right choice for you.

Example values.yaml for MPS:

config:
  map:
    default: |-
      version: v1
      sharing:
        mps:
          resources:
          - name: nvidia.com/gpu
            replicas: 10
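To make the replicas setting concrete, the following Python sketch compares what a node would advertise and what each replica gets under time-slicing and MPS, based only on the descriptions above. The function name sharing_view and its return keys are hypothetical, and the arithmetic is a simplification for illustration, not the plugin’s implementation.

```python
def sharing_view(strategy, physical_gpus, gpu_memory_gib, replicas):
    """Simplified view of GPU sharing for a given replicas value."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    # Assumption: each physical GPU is advertised `replicas` times.
    advertised = physical_gpus * replicas
    if strategy == "timeSlicing":
        # No isolation: every replica can touch the full GPU memory.
        memory_per_replica = gpu_memory_gib
        compute_fraction = None  # no compute share is enforced
    elif strategy == "mps":
        # Equal slices of memory and compute, enforced by the MPS daemon.
        memory_per_replica = gpu_memory_gib / replicas
        compute_fraction = 1 / replicas
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return {
        "advertised": advertised,
        "memory_gib_per_replica": memory_per_replica,
        "compute_fraction": compute_fraction,
    }
```

For a single 80 GiB GPU with replicas: 10, this model gives each MPS replica 8 GiB and a tenth of the compute, while each time-sliced replica still sees all 80 GiB with no enforced share.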

This should be a good introduction to GPU sharing to get you started. We will go into more detail about advanced configuration and best practices in part 2 of this series.

Allocating GPUs to Pods Using the NVIDIA Device Plugin

Allocating GPUs to pods when using the nvidia-device-plugin is straightforward and should feel familiar to anyone comfortable with Kubernetes. It is highly recommended to use NVIDIA base images for your containers so that all the necessary dependencies are installed and configured properly for your underlying workload. Setting a limit for nvidia.com/gpu is crucial; otherwise all GPUs will be exposed inside the container. Finally, make sure to include tolerations for any taints set on your nodes so that the pod can be scheduled appropriately. Here’s a barebones example of a GPU-enabled pod:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
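If you generate pod specs programmatically, a small Python sketch like the one below can build the same manifest and flag containers that lack a GPU limit. The helper names gpu_pod_manifest and containers_missing_gpu_limit are hypothetical. Only the standard library is used, and the resulting dictionary can be serialized with json.dumps for tools that accept JSON manifests.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest equivalent to the YAML example above."""
    if gpus < 1:
        raise ValueError("gpus must be >= 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "cuda-container",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
            "tolerations": [
                {
                    "key": "nvidia.com/gpu",
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
        },
    }


def containers_missing_gpu_limit(manifest):
    """Return names of containers that set no nvidia.com/gpu limit."""
    missing = []
    for container in manifest["spec"]["containers"]:
        limits = container.get("resources", {}).get("limits", {})
        if "nvidia.com/gpu" not in limits:
            missing.append(container["name"])
    return missing


if __name__ == "__main__":
    pod = gpu_pod_manifest(
        "gpu-pod", "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
    )
    print(json.dumps(pod, indent=2))
```

A check like this could run in CI so that GPU containers without an explicit limit are caught before they reach the cluster.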

Conclusion

The NVIDIA Device Plugin for Kubernetes plays a crucial role in enabling organizations to harness the power of GPUs for accelerating machine learning workloads. By abstracting the complexities of GPU management and providing seamless integration with Kubernetes, it empowers developers and data scientists to focus on building and deploying their models without worrying about the underlying infrastructure.

We’re just scratching the surface here, so if you’re interested to learn more please check out part 2 of this series where we’ll go into detail on advanced configuration, troubleshooting common issues, and some of the limitations of using the nvidia-device-plugin alone to manage GPUs. Also, check out the additional resources at the end of this article!

FinOps for Kubernetes: engineering cost optimization

Sun, 28 Apr 2024 16:00:00 GMT

Community post by Saqib Jan

The cloud provides on-demand access to compute resources, but that ready availability also makes cost a much more dynamic problem to forecast. The effect compounds as companies expand their cloud footprints and adopt more cloud technologies: the potential for waste grows with them.

As the 2024 State of FinOps report underscores, organizations are focused on reducing waste. With efficiency top of mind, savings and cost optimization should be the top priority for engineering leaders considering a FinOps model for Kubernetes. Because estimating and optimizing cost is a black hole for platform engineering and finance teams, the biggest challenge for stakeholders is figuring out where costs originate. Addressing this requires some understanding of the costs of ownership.

Big tech engineering teams with mature finance and product management practices are solving these challenges with cost models that help measure the total cost of ownership of their services and applications. Laurent Gil, Cloud Neutrality Advocate and Co-Founder of Cast AI, explains that the cost model often isn’t sufficient for anything but an informed starting point; a good-enough point must be reached to avoid over-investment.

The first thing to consider is the cost drivers: the elements that contribute to the overall cost, and how to incorporate them into calculations. CPU, memory, and storage are allocated to each service and execution in Kubernetes. Workloads also grow larger over time, and there are a variety of costs involved in hosting, integrating, running, managing, and securing cloud workloads. While some charges directly relate to compute, data transfer, and storage consumption, other factors add complexity. Tooling and integrations with other cloud services must also be factored into the total cost of ownership (TCO) calculations.
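As a rough illustration of folding these cost drivers into one calculation, here is a minimal Python sketch. The Usage fields, the PRICES table, and the overhead parameter are all hypothetical; the unit prices are placeholders, not real provider rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Monthly resource consumption of one service (hypothetical fields)."""
    cpu_core_hours: float
    memory_gib_hours: float
    storage_gib_months: float
    egress_gib: float


# Placeholder unit prices for illustration only, not real provider rates.
PRICES = {
    "cpu": 0.03,      # per core-hour
    "memory": 0.004,  # per GiB-hour
    "storage": 0.10,  # per GiB-month
    "egress": 0.09,   # per GiB transferred out
}


def monthly_cost(usage, prices=PRICES, overhead=0.0):
    """Return (cost per driver, total) for a month.

    `overhead` stands in for tooling and integration costs that are not
    metered per resource.
    """
    direct = {
        "cpu": usage.cpu_core_hours * prices["cpu"],
        "memory": usage.memory_gib_hours * prices["memory"],
        "storage": usage.storage_gib_months * prices["storage"],
        "egress": usage.egress_gib * prices["egress"],
    }
    return direct, sum(direct.values()) + overhead
```

Keeping the per-driver breakdown alongside the total makes it easier to see which driver dominates before deciding where to optimize.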

Architect for Efficient Cloud Usage

If you run Kubernetes yourself, you need a strong engineering team. Unless your business is close to containerization and microservices technology, running it yourself is essentially a cost center and an inefficient use of resources.

Building out Kubernetes yourself is very challenging: you must understand the nuances of all the components that need to be set up and configured properly. Anyone who is not in the business of running infrastructure will generally benefit from managed Kubernetes, because you don’t have to hire a team for everything you want to run at a reasonable level of reliability.

Richard Hartmann, Director of Community for Grafana Labs, shares two fundamental approaches to efficient cloud usage. One is to “go all-in on bespoke services, leveraging whatever you can to reduce undifferentiated heavy lifting and focus on solving problems that drive your business forward.” Alternatively, you can “use as few bespoke services as possible, relying solely on the baseline across all providers.” This approach allows you to maintain control over how your platform operates and facilitates easy migration between clouds. “Both approaches have merit,” but he cautions that being in between is usually not ideal, as it exposes you to the drawbacks of both.

Both approaches share a problem: cloud is expensive, and cloud providers have zero incentive to offer great cost controls. Hartmann points out the inherent conflict of interest: “that would literally enable users to pay less,” something no cloud provider would want, particularly amid current macroeconomic uncertainty. This adds pressure on engineering leaders already contending with shrinking budgets and heightened expectations for cost efficiency.

Companies are therefore keenly interested in finding the right model: a balance between knowing what is running in production and knowing how to set up a dev-test environment effectively.

“A lot of our customers today are looking at different models internally for their projects that are running in the cloud through managed Kubernetes offerings, whether it be a showback or chargeback type model,” commented GitLab Field CTO Lee Faus. And he mentioned, “We’ve had a few customers who tried to implement quotas around what they’re allowed to spend using high water, low water marks. But in doing so, they’ve realized that because of the way most managed Kubernetes clusters work, they incentivize you to build things like auto-scaling.”

There are more reasons why organizations end up in situations with over-provisioned clusters, which not only lead to poor cycle times from a CPU and memory perspective but also ultimately result in a negative experience for end-user interactions with the applications.

To counteract the risk of uncontrolled spending, Hartmann says “we have implemented deep control over what specifically we do and built our cost controls for self-managed clusters and as well managed platforms.” This approach helps scrutinize operations, ultimately enforcing a chargeback program that encourages a shared sense of accountability across stakeholders.

Both Hartmann and Faus highlight challenges in managing costs and finding the right balance between control and cost efficiency. FinOps practices, they affirm, help organizations to anticipate, control, check, and optimize their cloud investments on a proactive and reactive basis.

FinOps for Kubernetes Cost Control Strategies

Cost management on the cloud side can get out of control, but much of that stems from a lack of rigor in the software development lifecycle, where things are pushed into production before they are ready or before they have been adequately tested from a performance perspective. Considering the business value, there has never been a more important time to adopt the FinOps principles of ‘inform, operate, and optimize’, because existing solutions do not capture the nuances needed to balance cost and performance economically.

FinOps is a discipline that promotes shared responsibility and brings together all stakeholders (tech, business, and finance people) to establish policies and best practices for usage that are programmatically enforced. Adopting a FinOps approach can give platform engineering teams the visibility they need to find ways to reduce costs without affecting performance.

When DevOps gives developers tools and guardrails to build, deploy, and fully own an application, for instance, it’s important to also educate them about overall cost management, because empowering teams to take action is the top challenge. Often it is not until the bill comes due at the end of the month that finance teams realize there is an issue with sudden spikes in costs.

I can tell you from supporting clients that most organizations leveraging Kubernetes struggle to manage their cloud expenses because they have no proper cycle for reviewing and refining their processes, and because the pool of skilled workers in this segment is very shallow.

Faus, in our conversation, noted, “There’s a term that we’re starting to see a lot of companies use, which revolves around value streams.” Value streams allow us to map back to key performance indicators (KPIs). These KPIs are defined at the CEO, CIO, and CFO level, where budgets are drawn, resource hiring is planned, and new product lines are decided. “This provides a high-level mapping back to those elements and around those value streams. When we drive throughout the given year, we need to have a way to ensure that we are actively tracking these aspects throughout the SDLC and in our cost management.”

Whatever you may call it, empowering development teams becomes imperative when using Kubernetes. Taking responsibility to make informed decisions will, in turn, make Kubernetes cost management timely, proactive, and cost-effective. As budgets get tighter, there is a great need for cost control strategies: build your own cost controls from a knowledgeable foundation, and implement a third-party solution, whether commercial or open source, to avoid linear cost increases. These strategies are effective for everyone, on any cloud provider and even for Kubernetes on bare metal infrastructure.

Case Study: LambdaTest

LambdaTest is a good example of the lasting impact of FinOps practices. This young company, which provides a cloud platform for online browser and operating system testing, quickly scaled up its services after securing initial funding but then encountered sudden spikes in cloud costs during subsequent rounds of funding.

As the senior DevOps engineering leader, Shahid Ali Khan was responsible for developing LambdaTest’s Kubernetes infrastructure and overall infrastructure system. He shared invaluable insights on navigating the platform engineering challenges and adopting the FinOps principles that proved imperative for optimizing cloud resources and reducing cloud costs.

This case study highlights LambdaTest’s journey to FinOps maturity, emphasizing cost optimization, and outlines a systematic approach, with insights from notable leaders, to proactively navigate these hurdles. It discusses technology solutions, strategies, outcomes, and lessons from my one-on-one interviews with these leaders.

— The Challenges of Managing Infrastructure

As LambdaTest expanded its offerings, the complexity of its infrastructure also increased, relying on AWS and self-managed Kubernetes to support its data-heavy customers. This architecture allowed the company to scale rapidly, and Mudit Singh, Head of Growth & Marketing, reflects on the initial decision: “When we started off with Kubernetes, no cloud provider was offering a static, stable solution for it. And at that time, in around 2017, AWS released Managed Kubernetes, which remained in testing for an extended period. As a startup with a talent shortage, we were unsure about managing our own cluster.”

As their usage increased, each month ended with sudden cost spikes that raised more questions about spend. Questions like ‘How much are we spending?’ ‘Is this normal?’ ‘How should cost be divided between teams, applications, and business units?’ and ‘What is the problem, over-provisioning, or using too much compute or memory?’ remained unanswered. This situation, Singh shares, “drained our DevOps leaders, platform engineering and Finance teams (including the Founders) to invest a significant amount of time and attention in understanding the hefty incoming invoices.”

These issues escalated over time: as usage grew, the risk of losing cost control also increased. The company sourced tools to report cloud consumption but struggled to identify and address cost drivers. Like many, LambdaTest also faced the balancing act of building a team with a FinOps culture while striving for better cost visibility.

Singh stressed how identifying and addressing the underlying problems driving up costs proved to be a struggle for them while managing data centers across continents. And Khan expounds on the cross-functional initiatives they took to gain clarity on cost drivers, achieving visibility and transparency into spending and cloud usage.

— Create a Tagging Framework

It is difficult to align reports to business context without insight into workload allocations, and the industry has seen the adoption of more structured approaches to resource management.

Khan detailed, “We have multiple products that are running, and there are services shared among those products. It was getting hard for us to identify which service is contributing—or which particular service is contributing—to the cost of each product.” Tagging with labels also helps identify over-provisioned resources. And, “we began by implementing tagging and labeling, along with utilizing different node pools. This enabled us to precisely determine the cost allocation for each product in terms of specific resources, understand how their requirements have scaled over time, and effectively address those needs.”

It has now become a common practice to leverage namespaces for each product or service in Kubernetes, with a clear bifurcation of services. This approach not only lays the groundwork for resource management but also supports isolation, resource quotas, and simplified access control, enhancing operational efficiency. Most importantly, it makes cost analysis, reporting, and optimization easier for individual teams, services, or business lines.
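One way to picture label-based cost allocation with shared services is the Python sketch below. It is an illustration under assumptions, not LambdaTest’s method: the product label key, the "shared" label value, and the proportional split of shared costs are all hypothetical choices.

```python
from collections import defaultdict


def allocate_costs(line_items, shared_label="shared"):
    """Attribute costs to products by label.

    Items labeled as shared are split across products in proportion to
    each product's direct cost. Items with no product label are grouped
    under "untagged" so that gaps in tagging stay visible.
    """
    direct = defaultdict(float)
    shared = 0.0
    for item in line_items:
        product = item["labels"].get("product", "untagged")
        if product == shared_label:
            shared += item["cost"]
        else:
            direct[product] += item["cost"]

    total_direct = sum(direct.values())
    allocated = {}
    for product, cost in direct.items():
        share = shared * cost / total_direct if total_direct else 0.0
        allocated[product] = cost + share
    return allocated
```

With direct costs of 60 and 40 for two products and 50 of shared services, the products end up with 90 and 60 respectively. Other split rules (even split, usage-based) are equally plausible and depend on what the teams agree is fair.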

— From Data to Action

The volume of data subject to analysis for cost optimization is always considerable. Being able to vectorize that data and understand where there are errors, memory spikes, or CPU spikes lets you not only optimize the cost structure of managing applications but also provide feedback to the engineering teams. Faus underlines, “This involves actions like automatically promoting an issue or a ticket on the product side to ensure that something is going to be done about that cost as part of a current sprint.”

This process should also involve analyzing time-series data, which greatly helps to identify inefficiencies, make informed decisions about resource allocation, and find potential automation opportunities. There are other strategies and optimization tactics you could adopt, but at a basic level, it’s good to consider first what you are optimizing for.

So what is optimizing for cost? It’s really just another metric to consider. Say, for example, I already manage CPUs, pods, memory, storage, and compute capabilities. Each one, then, is a piece of the larger puzzle I’m piecing together. So, adding cost into the mix doesn’t change the fundamental approach; it’s just integrating another element into the array of resources we’re already balancing.

Khan emphasizes the key focus is on enabling “our tech personnel to efficiently extract and manipulate data to align with our business objectives, rather than the other way around. This strategy, which we also implement internally at LambdaTest, underscores the critical importance of fostering internal collaboration and knowledge exchange to effectively bridge the technological and business divides within our organization.”

— Cost Visibility

Monitoring is the most important aspect of optimizing cost and running the cluster without impacting performance and usage. It is the core pillar for building awareness and informing FinOps objectives and optimization strategies.

“We tried a lot of solutions and multiple plugins, but we could not get a clear understanding of the volume of requests, the performance of the cluster, or the overall system status,” Khan specified. “We implemented distributive tracking (and distributed tracing) inside the cluster to monitor each and every request, which helped us to identify how services are being used and pinpoint optimization opportunities within the system. This tremendously helped us to identify inefficiencies, which increased accountability – informing service owners to take action, while also enabling things like internal chargeback and showback models.”

Visibility underpins FinOps metrics (idle resources, under-optimized infrastructure) for tracking progress. But the key metric to consider is normalized cost, which, when adjusted for your operating business metrics, provides a more holistic view of your cloud spending relative to your business activities.
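Normalized cost is simply cloud spend divided by a business metric. The Python sketch below shows the arithmetic; the function name and the choice of test sessions as the business metric are hypothetical, and the figures in the usage note are made up for illustration.

```python
def normalized_cost(cloud_cost, business_units):
    """Cloud spend per unit of business activity (e.g. per test session)."""
    if business_units <= 0:
        raise ValueError("business_units must be positive")
    return cloud_cost / business_units
```

For example, if spend rises from 10,000 to 12,000 while sessions grow from 2,000,000 to 3,000,000, the cost per session falls from 0.005 to 0.004. Total cost went up by 20 percent, yet the business became more efficient per unit of activity, which the raw bill alone would not show.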

— Drill-Down Granularity

What you do next will depend on where your baseline is. But allowing issues to persist makes controlling costs at a later stage difficult. Even if you attempt to control them later, the level of effort required from your team would be very high, diverting focus from implementing features.

It is here that a tagging framework with a Kubernetes cost management tool becomes helpful. This helps you drill down into the layers of your environment so you can see exactly how each application is impacting your costs, enabling proactive recommendations for cost savings.

Khan shared their approach, “We began observing all attributes that influence pricing, and based on that, we looked deeper into why there has been an increase, what could have caused it, and then took rather difficult — educational path to show individual teams how their environment impacts their department’s resources.”

The ideal solution, according to Khan’s recommendation, “should provide time-saving features.” An example would be a prioritized list of your environment’s most expensive components, ranked by cost. This allows you to focus on the areas that will yield the most significant cost savings first. “We realized this and implemented a proactive approach across teams, ensuring work could proceed in a manner that does not affect production.”
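A prioritized list of this kind is straightforward to derive once costs are attributed. The Python sketch below ranks components by cost and adds a cumulative share; the function name and the input shape (a mapping of component name to cost) are assumptions for illustration.

```python
def top_spenders(costs, limit=5):
    """Rank components by cost, with the cumulative share of total spend."""
    total = sum(costs.values())
    ranked = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    result = []
    running = 0.0
    for name, cost in ranked:
        running += cost
        share = running / total if total else 0.0
        result.append((name, cost, share))
    return result
```

The cumulative share shows how few components account for most of the bill, which helps decide where to stop drilling down.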

— Real-Time Alerting

Now, in cloud-native environments with auto-scaling enabled, your cluster or nodes can scale up or down. Implementing budgets and alerts within the cloud system is therefore imperative, as untracked scaling can lead to significant expenses that the solution you are providing won’t justify. Applying custom rules programmatically lets you receive notifications when costs increase, so you can take corrective action for specific requests.

Khan remarked, “FinOps practices have significantly changed the way we work.” Through extensive data analysis, “we measure costs to a significant extent and set budgets for each product. For example, each product has clusters, and some of them share services. With simple tagging, we can set specific budgets for each product. And when we allocate a certain amount to a product, we get alerts if spending goes over a predetermined threshold.” A small increase triggers an amber alert, and a big jump triggers a red alert.
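The budget-and-alert flow Khan describes can be sketched in a few lines of Python. This is a minimal illustration, not the tooling Khan's team uses: the `ProductSpend` type, the `alert_level` function, and both threshold ratios are hypothetical, since the article does not say what counts as a small increase or a big jump.

```python
from dataclasses import dataclass

# Assumed thresholds: the article gives no numbers, so these ratios
# are placeholders to be tuned per product.
AMBER_RATIO = 1.00  # spend has reached the allocated budget
RED_RATIO = 1.20    # spend is well over the allocated budget


@dataclass
class ProductSpend:
    product: str   # value of the cost-allocation tag
    budget: float  # budget allocated to the product for the period
    spend: float   # tagged spend accumulated so far


def alert_level(item: ProductSpend) -> str:
    """Return "ok", "amber", or "red" for one tagged product."""
    if item.budget <= 0:
        raise ValueError("budget must be positive")
    ratio = item.spend / item.budget
    if ratio >= RED_RATIO:
        return "red"
    if ratio >= AMBER_RATIO:
        return "amber"
    return "ok"


if __name__ == "__main__":
    for item in (
        ProductSpend("checkout", budget=1000.0, spend=900.0),
        ProductSpend("search", budget=1000.0, spend=1050.0),
        ProductSpend("reports", budget=1000.0, spend=1500.0),
    ):
        print(item.product, alert_level(item))
```

Keeping the thresholds as ratios of the budget, rather than absolute amounts, lets the same rule apply to products of very different sizes.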

The optimal solution should also alert you to abnormal cost spikes in real-time so you can examine the issue right away and remediate it, rather than waiting for weekly or monthly reports. And sometimes, these spikes may serve as an early warning sign of a cyberattack, which requires an immediate and proactive response to safeguard your infrastructure and data integrity.

Encourage FinOps Practices

The more intentional your approach to planning for change, the sooner you’ll target the places where change will be most effective. But the worst thing you could do as an organization is to say, ‘we’re going to inform’ without understanding the extent to which overspending is ingrained in your Ops and, importantly, where some of the key drivers are coming from.

The best way to manage Kubernetes at scale is to take a holistic and intentional approach, which also helps in calculating the total cost of ownership and allocating budgets, but it is not something most companies are doing. Nor are most companies bifurcating their resources. Many manage huge infrastructures but lack a dedicated FinOps team for them. And the reactive approach they take to cost incidents leads to significant financial burdens.

Cloud lets you accelerate, but it can also be a double-edged sword without a proactive approach. According to the CNCF microsurvey report, over-provisioning, or having more resources than necessary, is one of the most common factors leading to overspending.

“We’ve analyzed usage data on thousands of applications, and there are three primary reasons companies overspend: over-provisioning, pod requests being set too high, and low usage of Spot instances,” Gil enumerated. The biggest source of overspend, however, is an overestimation of the real CPU/memory usage. “For more than 97% of the applications we analyzed, the pre-optimized utilization of CPU is only 12%. That means that, on average, nearly 90% of compute is paid for, but goes unused.” And these percentages, he laid out, “are consistent across application sizes and cloud providers.”

The underlying reason most companies overspend is a lack of education and empowerment. Tech, DevOps, and infrastructure teams often lack cost awareness. And change is not easy, because building a culture of transparency and openness requires sharing pricing information with engineers and creating a safe space for open communication.

This is difficult for nearly all companies because people are first concerned about not stepping on anyone else’s toes. And it can be very unhelpful if fewer people are bold enough to be involved. The key is to get everyone on the same page regarding the business objectives. This means sharing the plan, how things are looking in the near future, what kind of services are planned for wider rollout, and even the company’s gross margin. “It is transparency that spurs on shared understanding,” Hartmann remarks. When everyone sees the bigger picture, they can feel the “real pain” of overspending and how their work directly impacts it. This shared understanding empowers team members to contribute to cost control strategies.

Building on the strategies discussed, a Kubernetes governance platform can serve as an initial step to gain clarity into resource utilization and enable you to drill down into the various layers of your environment. It can also provide policy-based control for cloud-native environments, empowering teams to make informed financial decisions regarding Kubernetes by allowing them to grasp and adopt cost-control strategies.

Author: Saqib Jan

Email: sakimjan8@gmail.com

LinkedIn: https://linkedin.com/in/s-jan

BIO: Saqib Jan is a freelance analyst with experience in application development, cloud technologies, and consulting.

The hidden economy of open source software

Thu, 25 Apr 2024 16:00:00 GMT

Member post originally published on Sysdig’s blog by Nigel Douglas

The recent discovery of a backdoor in XZ Utils (CVE-2024-3094), a data compression utility used by a wide array of open-source, Linux-based computer applications, underscores the importance of open-source software security. While it is often not consumer-facing, open-source software is a critical component of computing and internet functions, such as secure communications between machines.

Open source software (abbreviated as OSS) has become a cornerstone of the tech industry, influencing everything from small startups to global corporations. Despite its ubiquitous presence and foundational role in driving innovation, the true economic value of OSS has remained largely uncharted territory—until now. A groundbreaking study entitled “The Value of Open Source Software” by researchers Manuel Hoffmann, Frank Nagle, and Yanuo Zhou at Harvard Business School delves into this unexplored domain, revealing the astonishing economic impact of OSS throughout industry.

A Priceless Foundation with a Trillion-Dollar Impact

The study begins by addressing a fundamental paradox: How do you measure the value of something that is freely available? Traditionally, economic value is calculated by multiplying the price of a product by the quantity sold. However, this formula hits a snag when it comes to OSS—there’s no price tag on something that’s free, and tracking its usage is a Herculean task due to the decentralised nature of OSS distribution.

Leveraging unique global data sources and a novel approach, the research estimates the “supply-side” value (the cost to recreate the most widely used OSS) at $4.15 billion. But the true eye-opener is the “demand-side” value, pegged at a staggering $8.8 trillion. This figure represents the hypothetical cost that companies would face if they had to develop equivalent software internally, highlighting the immense savings and efficiency gains OSS provides to the global economy.

For instance, Falco, an open-source, cloud-native security tool, boasts contributions from 190 individuals dedicated to enhancing the software and ensuring it meets the evolving threats in cloud computing. If an organisation attempted to develop a custom threat detection engine in Go from scratch, it would be financially impractical to employ 190 staff members to continuously develop and maintain the tool. Although most of the 190 contributors likely engage with Falco as a side project rather than their primary employment, acknowledging the number of people actively committing to the project offers valuable insight into its collective human investment.

The Unsung Heroes of OSS

One of the most intriguing findings of the study is the concentration of value creation within the OSS community. A mere 5% of OSS developers are responsible for 96% of its demand-side value. This elite group of contributors has a disproportionate impact on the software landscape, emphasising the need for support and recognition from both the tech industry and policymakers.

Sticking to the topic of the recent XZ Utils backdoor, to prevent incidents like that from recurring, policymakers and software vendors must take proactive steps to enhance the security and integrity of existing OSS projects. Many OSS maintainers work on these projects voluntarily, without compensation, and often in addition to their regular employment. This can lead to overwork and burnout, creating vulnerabilities that adversaries can exploit to compromise software.

Without adequate safeguards and support systems, these maintainers operate in an environment that undervalues their crucial contributions and exposes them to significant risks. To address these challenges, there is a pressing need for policy interventions that recognise and financially support OSS development, along with industry-wide adoption of rigorous security practices. By implementing measures such as funding OSS projects, offering security training for maintainers, and developing comprehensive review processes, policymakers and vendors can protect maintainers from undue pressures and enhance the security of OSS.

The Programming Languages That Power the Economy

Digging deeper, the study finds that the lion’s share of OSS value is actually generated by a few key programming languages, with Go, JavaScript, and Java leading the pack. These languages are not just popular among developers; they are instrumental in creating billions of dollars in value, further emphasizing the strategic importance of investing in and nurturing the OSS ecosystem.

The notion of organisations opting to create proprietary programming languages rather than leveraging existing open-source options like JavaScript or Python libraries does not hold practical merit, considering the extensive resources and expertise required for such an endeavor.

Constructing a new programming language from scratch involves not just the immense initial development effort but also the continuous maintenance, development of libraries, tools, and community support to make it viable for production use. Moreover, the existing ecosystems around popular languages such as JavaScript and Python are the result of years of collective effort and contributions from a global community, encompassing vast libraries and frameworks that facilitate rapid development and deployment of applications.

These widely-used languages, however, are not without their vulnerabilities, including known Common Vulnerabilities and Exposures (CVEs) that pose significant security risks if left unpatched. Addressing these vulnerabilities often falls beyond the capacity of individual organisations, especially considering the breadth of open-source dependencies modern applications rely on. This scenario underscores the crucial role of large software vendors in enhancing the security infrastructure of the open-source ecosystem.

By contributing to the security of these languages and libraries, either through direct code contributions, funding, or the provision of advanced security tools and services, these vendors can significantly reduce the potential attack surface for organisations worldwide. Such collaborative efforts between individual maintainers, organisations, and large vendors are essential in bolstering the overall security posture of the open-source software that underpins much of today’s digital infrastructure.

How is the Falco project staying secure?

The Falco project emphasizes its commitment to maintaining vendor independence and the collective effort to bolster its security posture. A foundational pillar of Falco’s philosophy is its vendor-neutral stance, ensuring that the project benefits from a wide array of contributions without being tethered to any single company’s interests. This approach has fostered a diverse and robust community, with significant engineering resources dedicated by several leading companies.

To prove the project’s maturity and reliability, Falco successfully graduated from the Cloud Native Computing Foundation (CNCF) incubating status. This achievement was marked by a fairly rigorous Due Diligence process conducted by the CNCF Technical Oversight Committee (TOC), including a comprehensive third-party security audit. This graduation not only proved Falco’s growth and sustainability, but also solidified Falco’s position as a leader in the open-source runtime security ecosystem.

Reflecting its commitment to an inclusive development environment, Falco boasts contributions from 17 organizations actively committing to the project. Notably, approximately 38% of contributions originated from diverse committers affiliated with renowned organizations such as Amazon, Cisco, Chainguard, Clastix, IBM, Microsoft, Red Hat, and SecureWorks, among others, alongside many individual contributors. This collective effort also demonstrates how Falco’s mission to foster a broad-based and resilient security tool is being realized.

Governance practices further cement Falco’s dedication to vendor neutrality, with specific measures to prevent any single entity from dominating the project’s direction. A key governance rule caps any organization’s eligible votes at 40%, ensuring balanced representation and decision-making within the project community.

Towards a Sustainable Future for OSS

The Harvard study’s revelations are a clear call to action for organisations to reflect on the value of OSS in their business, and they also highlight how many OSS projects are taking appropriate steps to audit themselves. The paper further highlights the vital role of OSS in driving technological innovation and economic efficiency.

However, this digital commons, much like its physical counterparts, is vulnerable to overuse and underinvestment – as seen with the XZ Utils backdoor. The findings advocate for a concerted effort to support OSS development, ensuring its sustainability and continued contribution to the global economy.

“The Value of Open Source Software” study shines a spotlight on the hidden economic powerhouse that is OSS. By quantifying its value, the research not only celebrates the contributions of the OSS community but also highlights the critical need for strategic investment and support to secure its future. As we move forward in the digital era, the true value of OSS cannot be overstated—it is an indispensable resource that fuels innovation, drives efficiency, and shapes the technology landscape.

Open source software in AI and cloud trends to watch in 2024: thoughts from the Netris community

Thu, 25 Apr 2024 16:00:00 GMT

Member post originally published on Netris’s blog

Let’s face it: The world of open source software can feel boring – in a good way. Open source has become so pervasive, and so deeply entrenched within modern software stacks and ecosystems, that it’s easy not to think much about it. The era of AI, cloud and big data is here – and now, more than ever, open-source is playing a critical role.

Yet the recent roundtable discussion that Netris hosted with Kelsey Hightower was a reminder that there is still plenty of change afoot for open source software and everyone who uses it. The event didn’t aim to focus on open source specifically – and participants did discuss other important topics, like cloud computing trends and the relationships between cloud competitors – but open source was a key part of the conversation.

Specifically, Hightower and other participants discussed three themes that are poised to have major consequences for open source software in 2024 and beyond.

1. Open source licensing changes

Looking back at the past year, Hightower observed that the open source ecosystem had been rocked by some messy debates surrounding licensing – namely, HashiCorp’s decision to change the licensing terms for future releases of some of its products (including Terraform, a popular Infrastructure-as-Code tool) and Red Hat’s new policy of placing source code for its Linux-based operating system behind what critics deem a “paywall.”

These developments affected only a handful of open source products, and neither turned previously open source solutions into closed source software. Nonetheless, they sparked a fair amount of controversy in the open source ecosystem about the long-term viability of open source licenses.

Hightower’s take was that the license changes probably don’t signal a wholesale crisis for open source, but they do reflect a new reality that more and more companies will need to embrace if they want to continue to benefit from open source: The need for a greater investment in open source projects by companies that can make meaningful contributions.

“It’s not sustainable to work for free,” Hightower said. “Open source sustainability is coming to a head.”

He added that “most people don’t know this but even Kubernetes struggled to get contributors,” referring to the open source container orchestration platform that he helped develop as a Distinguished Engineer at Google.

The solution, in Hightower’s view, is simple: Companies that want to use open source software need to pay more developers to contribute to it. “If you want to avoid a Red Hat ‘paywall,’ go help write the code,” he said.

2. Open source, AI and network infrastructure

Roundtable participants also tackled what has become a buzzworthy topic over the past year: The role of open source software in the generative AI space.

Alex Saroyan, Netris co-founder and CEO, noted that much of the discussion to date about open source and generative AI has centered on companies like Mistral, which aim to build open source generative AI models that can perform at least as well as those from vendors like OpenAI (which, despite its name, does not produce open source products).

That’s one important facet of open source in the realm of AI, Saroyan said. But another critical consideration – and one that hasn’t received nearly as much attention as it deserves – is the importance of providing open source AI projects with access to the cloud and networking infrastructure resources they need to train AI models.

The reason why is simple: Few, if any, open source projects own the massive compute infrastructure they need to train models. Instead, they rely on cloud infrastructure for training. For AI model training, as well as inference, leveraging the Big 3 public cloud providers – meaning Amazon, Azure and Google Cloud Platform – becomes prohibitively expensive, especially at scale. To “do” AI, businesses need AWS alternatives, GCP alternatives and so on.

“This is why we are seeing many new organizations launching AI cloud services for model training, as well as deploying private edge clouds for AI inference,” Saroyan said.

Indeed, making AI infrastructure more accessible through alternative public and private cloud providers is “one of the reasons why we’re seeing new generations of ethernet technology, like NVIDIA Spectrum-X,” Saroyan noted.

He added that “AI needs significantly more network and cloud infrastructure built in a highly-scalable but also highly cost-efficient manner. The new generation of networking solutions that Netris is helping customers to deliver depends on open source software and commodity hardware. DPUs are a big part of this picture,” he said, referring to special acceleration hardware known as Data Processing Units (DPUs) that are vital for scalable and efficient networks.

In short: Open source has a critical role to play in the future of generative AI, and it’s not limited to open source AI models. Expect to see open source crop up in other corners of the AI ecosystem – including the networks that serve as the vital link between AI workloads and the infrastructure they depend on.

3. Shareable open source AI models

Hightower offered another prediction about how generative AI and open source will converge: “We’ll treat models like shareable libraries.”

He meant that AI developers will use the open source example to build AI models that anyone can use and improve on. He envisions a world where borrowing someone else’s model is as simple as importing a module into your codebase or deploying a container from a public repository.

Hightower added that shareable open source AI models will require an “ecosystem of companies” to build, share and maintain AI software. “No one private entity can run away with these things,” he said.

Of course, given Hightower’s other observations during the roundtable about the importance for companies of backing open source products, any ecosystem that grows up surrounding open source models will need more than just volunteer labor. It will require committed investment from organizations with the means to support high-quality model development and training.

Conclusion: The future of open source

There’s plenty more to say about where open source is headed. But if Hightower and the rest of the Netris community are any guide – and we think they are! – expect new strategies for funding open source, as well as novel approaches to leveraging open source in the realm of AI, to become key open source trends during 2024.

Expect, too, to stop thinking of open source as a “boring” type of resource that developers can take for granted. The open source world is changing, and while we don’t know exactly what’s coming next, we are confident that developments like open source licensing changes and the advent of generative AI will force open source projects and communities to adopt new strategies.

How Katalyst guarantees memory QoS for colocated applications

Wed, 24 Apr 2024 16:00:00 GMT

Member post originally published on Katalyst’s blog

In the previous post[1], we introduced Katalyst – a QoS-based resource management system that helps ByteDance improve resource efficiency through colocation of online and offline workloads. In the colocation scenario, memory management is a crucial topic. On the one hand, when memory is tight on nodes or containers, the performance of the application may be affected, leading to issues like latency jitter or OOM (Out of Memory) errors. In colocation scenarios, where memory is overcommitted, this problem can become more severe. On the other hand, there might be some memory on nodes that is less frequently used but not released, resulting in less available memory that can be allocated to offline jobs, thus hindering effective overcommitment. To address these issues, ByteDance has summarized its refined memory management strategies practiced during large-scale colocation into a user-space Kubernetes memory management solution called Memory Advisor, which has been open-sourced in the resource management system Katalyst. This article will focus on introducing the native memory management mechanisms of Kubernetes and the Linux kernel, their limitations, and how Katalyst, through Memory Advisor, improves memory utilization while ensuring the memory QoS for business applications.

Limitations of native memory management

Memory allocation and reclamation of Linux kernel

Because memory access is much faster than disk access, Linux tends to adopt a greedy memory allocation strategy, aiming for maximum allocation. It only triggers reclamation when the memory watermark is relatively high.

Memory allocation

The Linux kernel has a fast path and a slow path for memory allocation:

Fast path: It first attempts to do a fast path memory allocation and then assesses whether the overall free memory level will fall below the Low Watermark after allocation. If it does, a quick memory reclaim is performed before re-evaluating the possibility of allocation. If the condition is still not met, it enters the slow path.
Slow path: In the slow path, it wakes up Kswapd to perform asynchronous memory reclaim and then attempts another round of fast memory allocation. If allocation fails, it tries memory compaction. If allocation is still unsuccessful, it attempts global direct memory reclaim, which involves scanning all zones and is time-consuming. If this also fails, it triggers a system-wide OOM event to release some memory and then retries fast memory allocation.

Memory reclamation

Memory reclamation can be categorized into two types based on the target: Memcg-based and Zone-based. The kernel’s native memory reclamation methods include the following:

Memcg-level direct memory reclaim: If the Memory Usage of a cgroup reaches a threshold, it triggers synchronous memory reclamation at the memcg level to release some memory. If this is unsuccessful, it triggers a cgroup-level OOM event.
Fast path memory reclaim: As mentioned earlier in the discussion of fast path memory allocation, fast memory reclamation is quick because it only requires reclaiming the number of pages needed for the current allocation.

Asynchronous memory reclaim: As shown in the diagram above, when the overall free memory of the system drops to the Low Watermark, Kswapd is awakened to asynchronously reclaim memory in the background until the High Watermark is reached.
Direct memory reclaim: As depicted in the diagram above, if the overall free memory of the system drops to the Min Watermark, it triggers global direct memory reclaim. Since this process is synchronous and occurs in the context of process memory allocation, it has a significant impact on the performance of the system.
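The watermark behaviour described in the last two items can be summarised as a small decision function. The Python below is a toy model for illustration only, not kernel code: the function names are hypothetical, and the comparisons follow the wording above ("drops to" a watermark) rather than the kernel's exact checks.

```python
def reclaim_action(free_pages: int, wmark_min: int, wmark_low: int) -> str:
    """Classify which global reclaim mechanism the text says would apply.

    Returns "direct_reclaim" when free memory has dropped to the Min
    Watermark, "wake_kswapd" when it has dropped to the Low Watermark,
    and "none" otherwise.
    """
    if not 0 <= wmark_min <= wmark_low:
        raise ValueError("expected 0 <= min watermark <= low watermark")
    if free_pages <= wmark_min:
        # Synchronous, in the allocating process's context: expensive.
        return "direct_reclaim"
    if free_pages <= wmark_low:
        # Asynchronous background reclaim.
        return "wake_kswapd"
    return "none"


def kswapd_should_stop(free_pages: int, wmark_high: int) -> bool:
    """Kswapd keeps reclaiming until free memory reaches the High Watermark."""
    return free_pages >= wmark_high


if __name__ == "__main__":
    for free in (5000, 1500, 800):
        print(free, reclaim_action(free, wmark_min=1000, wmark_low=2000))
```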

Kubernetes Memory Management

Memory limit

Kubelet sets the cgroup interface memory.limit_in_bytes based on the memory limits declared by each container within the pod, constraining the maximum memory usage for both the pod and its containers. When the memory usage of the pod or container reaches this limit, it triggers direct memory reclaim or even an OOM event.

Eviction

When the memory on a node becomes insufficient, K8s selects certain pods for eviction and marks the node with the taint node.kubernetes.io/memory-pressure, preventing additional pods from being scheduled on that node. The trigger condition for memory eviction is when the node’s working set reaches a threshold:

memory.available := node.status.capacity[memory] - node.stats.memory.workingSet

Here, memory.available is the eviction signal; eviction is triggered when it falls below the threshold configured by the user. When sorting pods for eviction, the following criteria are considered:

First, it checks if a pod’s memory usage exceeds its request; if so, it’s prioritized for eviction.
Next, it compares the pods based on their priority, with lower-priority pods evicted first.
Finally, it compares the difference between a pod’s memory usage and its request; pods with higher differences are evicted first.
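The eviction signal and the three ranking criteria above can be sketched as follows. This is a simplified Python illustration, not the kubelet's implementation: the `PodMemory` type and function names are hypothetical, and all quantities are plain byte counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodMemory:
    name: str
    priority: int  # higher value means more important
    usage: int     # memory usage in bytes
    request: int   # memory request in bytes


def memory_available(capacity: int, working_set: int) -> int:
    """memory.available := capacity - workingSet, as in the formula above."""
    return capacity - working_set


def under_memory_pressure(capacity: int, working_set: int, threshold: int) -> bool:
    """True when the eviction signal has fallen below the configured threshold."""
    return memory_available(capacity, working_set) < threshold


def eviction_order(pods: List[PodMemory]) -> List[PodMemory]:
    """Sort pods so that the first element is evicted first."""
    return sorted(
        pods,
        key=lambda p: (
            not p.usage > p.request,  # pods exceeding their request come first
            p.priority,               # then lower priority first
            -(p.usage - p.request),   # then larger overage first
        ),
    )


if __name__ == "__main__":
    pods = [
        PodMemory("a", priority=10, usage=300, request=100),
        PodMemory("b", priority=0, usage=50, request=100),
        PodMemory("c", priority=10, usage=500, request=100),
        PodMemory("d", priority=5, usage=150, request=100),
    ]
    print([p.name for p in eviction_order(pods)])
```

Note that a low-priority pod that stays within its request (pod `b` in the usage example) is ranked after every pod that exceeds its request, because the first criterion dominates.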

OOM

If direct memory reclaim still cannot meet the memory demands of processes on the node, it will trigger a system-wide OOM event. When the Kubelet starts a container, it configures /proc/<pid>/oom_score_adj based on the QoS level of the container’s associated pod and its memory request. This affects the order in which the container is selected for OOM Kill:

For containers in critical pods or Guaranteed pods, their oom_score_adj is set to -997.
For containers in BestEffort pods, their oom_score_adj is set to 1000.
For containers in Burstable pods, their oom_score_adj is calculated using the following formula: min{max[1000 - (1000 * memoryRequest) / memoryCapacity, 1000 + guaranteedOOMScoreAdj], 999}
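The three rules above can be written as one small function. The Python below is an illustrative sketch rather than the kubelet's code: the function name and string QoS labels are hypothetical, and the upper clamp of 999 (one below the BestEffort value, so that Burstable pods are always preferred for survival over BestEffort pods) is an assumption that is understood to match kubelet behaviour.

```python
GUARANTEED_OOM_SCORE_ADJ = -997
BEST_EFFORT_OOM_SCORE_ADJ = 1000


def oom_score_adj(
    qos_class: str,
    memory_request: int,
    memory_capacity: int,
    critical: bool = False,
) -> int:
    """Return the oom_score_adj for a container, following the rules above.

    qos_class is one of "Guaranteed", "Burstable", or "BestEffort";
    memory_request and memory_capacity are in bytes.
    """
    if critical or qos_class == "Guaranteed":
        return GUARANTEED_OOM_SCORE_ADJ
    if qos_class == "BestEffort":
        return BEST_EFFORT_OOM_SCORE_ADJ
    if qos_class != "Burstable":
        raise ValueError(f"unknown QoS class: {qos_class}")
    if memory_capacity <= 0:
        raise ValueError("memory_capacity must be positive")

    # The larger the share of node memory requested, the lower the score.
    raw = 1000 - (1000 * memory_request) // memory_capacity
    lower = 1000 + GUARANTEED_OOM_SCORE_ADJ  # 3: stay above Guaranteed pods
    upper = BEST_EFFORT_OOM_SCORE_ADJ - 1    # 999: assumed upper clamp
    return min(max(raw, lower), upper)


if __name__ == "__main__":
    gib = 1024 ** 3
    print(oom_score_adj("Burstable", 4 * gib, 16 * gib))
```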

Memory QoS

Starting from version 1.22, K8s introduced the Memory QoS feature based on Cgroups v2 [2]. This feature ensures memory request guarantees for containers, thereby ensuring fairness in global memory reclaim among pods. The specific Cgroups configuration is as follows:

memory.min: Based on requests.memory configuration.
memory.high: Based on limits.memory * throttlingfactor (or nodeallocatablememory * throttlingfactor) configuration.
memory.max: Based on limits.memory (or nodeallocatablememory) configuration.

In version 1.27 of K8s, enhancements were made to the Memory QoS feature to address the following issues:

When container requests and limits are close, the throttle threshold configured in memory.high may not be effective due to memory.high > memory.min limitation.
The calculated memory.high may be too low, resulting in frequent throttling and affecting application performance.
The default value of throttlingfactor is too aggressive (0.8), causing frequent throttling for some Java applications that typically use more than 85% of memory.

To address these issues, the following optimizations were made:

Improvement in the calculation method of memory.high:

memory.high = floor{[requests.memory + memoryThrottlingFactor * (limits.memory or node allocatable memory - requests.memory)] / pageSize} * pageSize

Adjustment of the default value of throttlingfactor to 0.9.
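As a worked example of the revised calculation, the following Python sketch evaluates the formula for one container. The function name is hypothetical and the 4096-byte page size is an assumption; the actual page size depends on the node.

```python
import math


def memory_high(
    requests_memory: int,
    limit_or_allocatable: int,
    throttling_factor: float = 0.9,
    page_size: int = 4096,
) -> int:
    """Compute memory.high in bytes, rounded down to a page boundary.

    limit_or_allocatable is limits.memory when the container sets a
    limit, otherwise the node allocatable memory.
    """
    raw = requests_memory + throttling_factor * (
        limit_or_allocatable - requests_memory
    )
    return math.floor(raw / page_size) * page_size


if __name__ == "__main__":
    gib = 1024 ** 3
    # A container requesting 1 GiB with a 2 GiB limit.
    print(memory_high(1 * gib, 2 * gib))
```

Because the throttling factor now scales only the gap between the request and the limit, memory.high always lands between the two values, which avoids the case where it falls at or below memory.min.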

Limitations

From the introductions in the previous two sections, we can identify the following limitations in both K8s and the Linux kernel memory management mechanisms:

Lack of fairness mechanism in global memory reclamation: In scenarios where memory overcommitment occurs, even if the memory usage of all containers is significantly lower than the limit, the entire node’s memory may still reach the threshold for global memory reclaim. In the widely used Cgroups v1 environment, the memory request declared by containers is not reflected in Cgroups configuration by default, but serves only as a basis for scheduling. Therefore, there is a lack of fairness guarantee in global memory reclamation among pods, and available memory for containers is not divided proportionally based on requests, unlike CPU resources.
Lack of priority mechanism in global memory reclamation: In colocation scenarios, low-priority offline containers often run resource-intensive tasks and may request a large amount of memory. However, memory reclamation does not consider the priority of the applications, leading to high-priority online containers on nodes entering the slow path of direct memory reclaim, thereby disturbing the memory QoS of online applications.
Delayed triggering of native eviction mechanisms: K8s mainly ensures the priority and fairness of memory usage through kubelet-driven eviction. However, the triggering timing of native eviction mechanisms may occur after global memory reclamation, thus not taking effect promptly.
Impact on application performance by memcg-level direct memory reclaim: When the memory usage of a container reaches a threshold, memcg-level direct memory reclaim is triggered, causing latency in memory allocation, which may lead to business jitter.

Katalyst Memory Advisor

Overall architecture

The architecture of Katalyst Memory Advisor has undergone multiple discussions and iterations. It adopts a pluggable design built on a framework-plus-plugins model, which enables developers to flexibly extend functionality and policies. The scope of each component or module is as follows:

Katalyst Agent: Resource management agent running on each node. The following modules are involved for memory QoS management:
- Eviction Manager: A framework that extends the native eviction policies of the kubelet. It periodically invokes interfaces of eviction plugins, retrieves the results of eviction policy calculations, and executes eviction actions.
- Memory Eviction Plugins: Plugins for the Eviction Manager. The following plugins are involved for memory QoS management:
  - System Memory Pressure Plugin: Eviction strategy based on overall system-level memory pressure.
  - NUMA Memory Pressure Plugin: Eviction strategy based on NUMA Node-level memory pressure.
  - RSS Overuse Plugin: Eviction strategy based on Pod-level RSS overuse.
  - Reclaimed Resource Pressure Plugin: Eviction strategy based on memory resource fulfillment of offline pods.
- Memory QRM Plugin: Memory resource management plugin. For memory QoS management, it handles Memcg configuration for offline pods and implements the Drop Cache action.
- SysAdvisor: Algorithm module running on each node, supporting algorithm strategy extension through plugins. The following plugins are involved for memory QoS management:
  - Cache Reaper Plugin: Calculates the trigger timing for the Drop Cache action and identifies which pods need to have their cache dropped.
  - Memory Guard Plugin: Calculates the real-time Memory Limit for offline pods.
  - Memset Binder Plugin: Dynamically calculates which NUMA Node offline pods should be bound to.
- Reporter: Out-of-band information reporting framework. For memory QoS management, it reports memory pressure-related Taints to Nodes or CustomNodeResource CRDs.
- MetaServer: Metadata management component of Katalyst Agent. For memory QoS management, it provides metadata for Pods and Containers, caches metrics, and offers dynamic configuration capabilities.
Malachite: Metrics data collection component running on each node. For memory QoS management, it provides memory-related metrics at the Node, NUMA, and Container levels.
Katalyst Scheduler: The following plugins are involved for memory QoS management:
- Native TaintToleration Plugin: Filters based on Node Taints.
- Extended QoSAwareTaintToleration Plugin: Implements scheduling prohibitions based on Taints defined in CustomNodeResource CRDs for QoS awareness.

Detailed design

Multi-dimensional interference detection

Memory Advisor performs periodic interference detection to proactively sense memory pressure and trigger corresponding mitigation measures. Currently, the following dimensions of interference detection are supported:

System and NUMA-level memory watermark: Comparing the free memory watermark at the system and NUMA levels with the threshold watermark of global asynchronous memory reclamation (Low Watermark), to avoid triggering global direct memory reclaim as much as possible.
Kswapd memory reclamation rate at the system level: If the rate of global asynchronous memory reclamation is high and continues for an extended period, it indicates significant memory pressure on the system, which may likely trigger global direct memory reclaim in the future.
Pod-level RSS overuse: Overcommitment can fully utilize a node’s memory, but it cannot control whether the overcommitted memory is used for page cache or RSS. If the RSS usage of certain pods far exceeds their request, node memory usage may stay at a high level that reclamation cannot bring down. This can leave other pods without sufficient page cache, leading to performance degradation, or it may result in an OOM event.
QoS-level memory resource fulfillment: Memory Advisor compares the supply of reclaimed memory on the node with the total memory request of reclaimed_cores pods on that node to calculate the memory resource fulfillment of offline jobs, so that the service quality of offline jobs is not severely impacted.
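
As a rough illustration of how the four dimensions above could be combined in one periodic check, here is a minimal Python sketch. It is not Katalyst's implementation: the NodeSample and PodMemory types, the signal names, and all threshold defaults are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PodMemory:
    name: str
    rss_bytes: int
    request_bytes: int


@dataclass
class NodeSample:
    free_bytes: int               # free memory at system (or NUMA) level
    low_watermark_bytes: int      # kernel Low Watermark for the same scope
    kswapd_rates: list            # recent kswapd reclaim rates, oldest first
    reclaimed_supply_bytes: int   # reclaimed memory the node can supply
    reclaimed_request_bytes: int  # total request of reclaimed_cores pods
    pods: list = field(default_factory=list)


def detect_interference(sample, kswapd_rate_threshold=1000.0,
                        rss_overuse_ratio=2.0, min_fulfillment=0.8):
    """Return the names of the dimensions that currently signal pressure.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    signals = []
    # 1. Free memory has fallen below the Low Watermark.
    if sample.free_bytes < sample.low_watermark_bytes:
        signals.append("watermark")
    # 2. The kswapd reclaim rate stayed high for every recent sample.
    if sample.kswapd_rates and all(
            rate > kswapd_rate_threshold for rate in sample.kswapd_rates):
        signals.append("kswapd_rate")
    # 3. At least one pod uses far more RSS than it requested.
    if any(p.request_bytes > 0
           and p.rss_bytes > rss_overuse_ratio * p.request_bytes
           for p in sample.pods):
        signals.append("rss_overuse")
    # 4. Reclaimed memory supply covers too little of the offline request.
    if (sample.reclaimed_request_bytes > 0
            and sample.reclaimed_supply_bytes / sample.reclaimed_request_bytes
            < min_fulfillment):
        signals.append("low_fulfillment")
    return signals
```

Each signal is kept separate so that a caller can decide how strongly to react to each one.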

Multi-tiered mitigation measures

Based on the different levels of abnormality feedback from interference detection, Memory Advisor supports multi-tiered mitigation measures. While avoiding interference with high-priority pods, it aims to minimize the impact on victim pods.
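
The tiers described in the following subsections can be pictured as a mapping from the detected degree of abnormality to a set of actions. The sketch below is only an illustration: the Severity enum and action names are hypothetical, and pairing Forbid Scheduling with each tier-specific action is an assumption of this example.

```python
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3


def mitigation_actions(severity, tune_memcg_enabled=False):
    """Map a degree of abnormality to mitigation actions (illustrative only)."""
    if severity == Severity.NONE:
        return []
    # Any abnormality forbids further scheduling onto the node.
    actions = ["forbid_scheduling"]
    if severity == Severity.LOW:
        # Tune Memcg is off by default because it needs kernel support.
        if tune_memcg_enabled:
            actions.append("tune_memcg")
    elif severity == Severity.MODERATE:
        actions.append("drop_cache")
    else:
        actions.append("evict")
    return actions
```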

Forbid Scheduling

Forbidding scheduling is the least impactful mitigation measure. When interference detection reports any level of system abnormality, scheduling is forbidden on the node so that no further pods land on it and the situation does not worsen. Currently, Memory Advisor supports this feature for all pods through Node Taint. In the future, we will enable the scheduler to be aware of taints extended in CustomNodeResource CRDs to achieve fine-grained scheduling prohibition for reclaimed_cores pods.

Tune Memcg

Tune Memcg is a mitigation measure with a relatively minor impact on victim pods. When the degree of abnormality detected by interference detection is low, Tune Memcg operations are triggered. This selects some reclaimed_cores pods and configures them with higher memory reclamation trigger thresholds so that memory reclamation starts earlier, thereby avoiding global direct memory reclaim as much as possible. Tune Memcg is not enabled by default because it depends on the memcg asynchronous memory reclamation feature open-sourced in the veLinux kernel [4]; leaving it disabled does not affect the use of the other mitigation measures.

Drop Cache

Drop Cache is a mitigation measure with a moderate impact on victim pods. When the degree of abnormality detected by interference detection is moderate, drop cache operations are triggered. This selects some reclaimed_cores pods with high cache usage and forcefully releases their cache to avoid triggering global direct memory reclaim as much as possible. In Cgroups v1 environments, cache release is triggered through the memory.force_empty interface:

echo 0 > memory.force_empty

In Cgroups v2 environments, cache release is triggered by writing a large value to the memory.reclaim interface, such as:

echo 100G > memory.reclaim

As drop cache is a time-consuming operation, we have implemented an asynchronous task execution framework to avoid blocking the main process. Technical details of this part will be discussed in future articles.
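
To make the two interfaces concrete, the following Python sketch picks the right file and value for each cgroup version and runs the write on a worker thread so the caller is not blocked. It is an assumption-laden illustration, not Katalyst's asynchronous task framework: the function names are hypothetical, and the cgroup directory is passed in by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def drop_cache(cgroup_dir, cgroup_v2, reclaim_amount="100G"):
    """Trigger cache release for one cgroup and return the file written."""
    cgroup_dir = Path(cgroup_dir)
    if cgroup_v2:
        # cgroup v2: ask the kernel to reclaim a large amount.
        target, value = cgroup_dir / "memory.reclaim", reclaim_amount
    else:
        # cgroup v1: any write to memory.force_empty triggers reclaim.
        target, value = cgroup_dir / "memory.force_empty", "0"
    # Note: on a real kernel, the memory.reclaim write can fail with an
    # error when less than the requested amount was reclaimed; this
    # sketch does not handle that case.
    target.write_text(value)
    return target


def drop_cache_async(executor, cgroup_dir, cgroup_v2):
    """Submit the slow write to a worker thread; returns a Future."""
    return executor.submit(drop_cache, cgroup_dir, cgroup_v2)
```

A caller would create one ThreadPoolExecutor, submit a task per selected pod cgroup, and collect the returned futures later instead of waiting inline.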

Eviction

Eviction is a measure with a significant impact on victim pods, but it is the fastest and most effective fallback measure. When a high degree of abnormality is detected by interference detection, eviction at the system or NUMA level (or only for reclaimed_cores pods) is triggered to effectively avoid global direct memory reclaim. Memory Advisor lets users configure custom sorting logic for the pods to be evicted. If no custom logic is configured, the default sorting logic is as follows:

Sort pods based on their QoS level, with reclaimed_cores > shared_cores / dedicated_cores.
Sort pods based on their priority, with lower priority pods evicted first.
Sort pods based on their memory usage, with higher usage pods evicted first.

We have abstracted an eviction manager framework in Katalyst agent. This framework delegates eviction policies to plugins and consolidates eviction actions in the manager, offering the following advantages:

Plugins and the manager can communicate through local function calls or gRPC, allowing plugins to be started and stopped flexibly.
The manager can easily support governance operations such as filtering, rate limiting, sorting, and auditing for eviction.
The manager supports dry runs of plugins, allowing thorough validation of strategies before they take effect.
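
The default sorting logic described earlier in this section can be sketched as a single multi-key sort. This is a minimal illustration rather than Katalyst's code: the Pod type is hypothetical, and treating the three rules as successive tie-breakers (and ranking unknown QoS levels with shared_cores and dedicated_cores) is an assumption.

```python
from dataclasses import dataclass

# Lower rank means evicted earlier. reclaimed_cores goes first;
# shared_cores and dedicated_cores are treated alike.
QOS_EVICTION_RANK = {"reclaimed_cores": 0, "shared_cores": 1, "dedicated_cores": 1}


@dataclass
class Pod:
    name: str
    qos_level: str
    priority: int
    memory_usage_bytes: int


def sort_for_eviction(pods):
    """Return pods ordered from first-to-evict to last-to-evict."""
    return sorted(
        pods,
        key=lambda p: (
            QOS_EVICTION_RANK.get(p.qos_level, 1),  # QoS level first
            p.priority,                             # lower priority first
            -p.memory_usage_bytes,                  # higher usage first
        ),
    )
```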

Resource cap for reclaimed_cores

To prevent offline containers from excessively using memory and affecting the service quality of online containers, we limit the total memory usage of reclaimed_cores pods through a resource cap. Specifically, we have added a memory guard plugin to SysAdvisor. This plugin periodically calculates the total amount of memory that reclaimed_cores pods can use as a whole, and the memory QRM plugin then writes that value to the memory.limit_in_bytes file of the BestEffort cgroup.
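
The article does not give the formula the memory guard plugin uses, so the sketch below only shows the general shape of such a cap under stated assumptions: the inputs (system reservation, online usage, safety buffer, floor) and the subtraction itself are hypothetical, and the cgroup file path is supplied by the caller.

```python
from pathlib import Path


def reclaimed_memory_limit(node_capacity, system_reserved, online_usage,
                           safety_buffer, floor=0):
    """Hypothetical cap: what is left after online usage and reserves."""
    available = node_capacity - system_reserved - online_usage - safety_buffer
    # Never go below the configured floor (and never negative).
    return max(floor, available, 0)


def apply_limit(limit_file, limit_bytes):
    """Write the cap to a memory.limit_in_bytes file chosen by the caller."""
    Path(limit_file).write_text(str(limit_bytes))
```

Running the calculation on every period lets the cap shrink as online usage grows and relax again when it falls.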

Memory migration

For applications like Flink, service performance is strongly correlated with memory bandwidth and memory latency, and these applications also consume a significant amount of memory. The default memory allocation strategy prioritizes allocation from the local NUMA node to achieve lower memory access latency. However, it may also lead to uneven memory usage across NUMA nodes, causing certain NUMA nodes to become hotspots under excessive pressure, which severely degrades service performance and increases latency. Therefore, we use Memory Advisor to monitor the memory watermark of each NUMA node and dynamically adjust the NUMA node bindings of containers, migrating memory so that no NUMA node becomes a hotspot. While rolling out dynamic memory migration in production environments, we encountered exceptional situations that could cause the system to hang. As a result, we optimized the method of memory migration. This practical experience will be elaborated on in subsequent blogs.
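
One simple way to picture the binding decision is to keep only NUMA nodes that still have enough free memory. The sketch below is an assumption for illustration, not the algorithm of the Memset Binder plugin: the function name, the single free-memory threshold, and the fallback rule are all hypothetical.

```python
def pick_numa_nodes(free_bytes_by_node, min_free_bytes):
    """Return sorted NUMA node IDs that containers may be bound to.

    free_bytes_by_node maps NUMA node ID -> free memory in bytes.
    """
    if not free_bytes_by_node:
        return []
    healthy = [node for node, free in free_bytes_by_node.items()
               if free >= min_free_bytes]
    if healthy:
        return sorted(healthy)
    # Every node is under pressure: fall back to the one with the most
    # free memory (lowest node ID wins ties).
    best = max(free_bytes_by_node,
               key=lambda node: (free_bytes_by_node[node], -node))
    return [best]
```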

Differentiated memcg-level reclamation strategy

Given that memcg-level direct memory reclaim can significantly impact application performance, the kernel team at ByteDance has enhanced the Linux kernel (i.e. veLinux) with memcg-level asynchronous memory reclamation features, which have been open-sourced [4]. In colocation scenarios, the typical I/O activities of online applications involve reading and writing logs, whereas those of offline tasks involve more frequent file I/O operations, with page cache having a significant impact on the performance of offline jobs. Therefore, through Memory Advisor, we support differentiated memory reclamation strategies at the memcg level:

For applications requiring a large amount of page cache (such as offline jobs), users can specify a relatively lower memcg-level asynchronous memory reclamation threshold through pod annotations. This conservative memory reclamation approach allows for more page cache usage.
Conversely, for applications requiring minimized performance degradation due to direct memory reclaim, users can configure a relatively aggressive memcg-level asynchronous reclamation strategy through pod annotations.

This feature is not enabled by default as it requires patches from the veLinux kernel.

Future plans

In subsequent versions of Katalyst, we will continue to iterate on Memory Advisor to enhance its support for a wider range of user scenarios.

Decoupling some capabilities from QoS

Memory Advisor adds enhanced memory management capabilities for colocation scenarios, but some of these capabilities are orthogonal to QoS and remain applicable even in non-colocation scenarios. Therefore, in subsequent iterations, we will decouple features such as memcg-level differentiated reclamation strategy, interference detection, and mitigation from QoS enhancement. This will turn them into fine-grained memory management capabilities applicable to general scenarios, enabling users in non-colocation scenarios to utilize them as well.

OOM priority

As mentioned earlier, Kubernetes configures different oom_score_adj values for containers based on the pod’s QoS level. However, the final OOM score can still be influenced by other factors such as memory usage. In tidal colocation [5] scenarios, where offline pods belong to the same QoS level as online pods, there may be no guarantee that offline pods will be OOM-killed before online pods. Therefore, there is a need to introduce a Katalyst QoS enhancement: QoS priority. Memory Advisor should be able to configure, in user space, corresponding oom_score_adj values for containers belonging to different QoS priority levels, ensuring a strict OOM-kill order for offline pods. Additionally, the ByteDance kernel team recently submitted a patch to the Linux kernel [6] that aims to make the kernel’s OOM behavior programmatically customizable through BPF hooks. This initiative seeks to enhance flexibility in defining OOM strategies.
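
As a sketch of what such a user-space mapping might look like, the function below spreads QoS priority ranks evenly across a band of oom_score_adj values. The band limits, the rank convention, and the function name are assumptions for illustration only; the planned Katalyst feature may work differently.

```python
# Valid oom_score_adj range accepted by the Linux kernel.
OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000


def oom_score_adj_for_rank(rank, num_levels, lowest=900, highest=1000):
    """Map a QoS priority rank to an oom_score_adj value.

    rank 0 is the most important level; higher ranks get a higher
    oom_score_adj and are therefore preferred as OOM victims.
    The [lowest, highest] band is a hypothetical choice.
    """
    if not (OOM_SCORE_ADJ_MIN <= lowest <= highest <= OOM_SCORE_ADJ_MAX):
        raise ValueError("band must lie within the kernel's valid range")
    if num_levels < 1 or not (0 <= rank < num_levels):
        raise ValueError("rank must be in [0, num_levels)")
    if num_levels == 1:
        return highest
    # Linear interpolation with integer arithmetic.
    return lowest + rank * (highest - lowest) // (num_levels - 1)
```

Because the kernel's final OOM score also depends on memory usage, distinct oom_score_adj values bias the order rather than guarantee it on their own.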

Cold memory offloading

There may be some less frequently used memory (referred to as cold memory) on the node that has not been released, leading to a limited amount of memory available for offline job usage. This situation prevents effective memory overcommitment, as the memory that could be allocated to offline jobs remains underutilized.

To increase the amount of memory available for allocation, we have drawn on Meta’s Transparent Memory Offloading (TMO) paper [7]. In the future, Memory Advisor will use the kernel’s procfs-based pressure monitoring framework (PSI) from user space to detect memory pressure. When memory pressure is low, memory reclamation will be triggered proactively. Additionally, we will leverage the DAMON sub-module for memory hotness detection to gather information on memory usage patterns. This information will be used to offload cold memory to relatively inexpensive storage devices or compress it using zRAM, thereby saving memory space and improving memory resource utilization. The technical details of this feature will be elaborated on in subsequent blogs.
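
To show what reading PSI from user space involves, here is a small Python sketch that parses text in the format of /proc/pressure/memory and applies a simple "pressure is low" test. The parser follows the kernel's documented line format; the decision function and its 0.1 threshold are hypothetical and not taken from Katalyst or the TMO paper.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}}.

    Each line looks like:
    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (item.split("=", 1) for item in fields)
        }
    return result


def pressure_is_low(psi, max_some_avg10=0.1):
    """Hypothetical rule: low pressure if the 10s 'some' average is small."""
    return psi["some"]["avg10"] < max_some_avg10
```

On a node with PSI enabled, the input would come from reading /proc/pressure/memory; here the parser takes a string so it can be exercised without a live kernel interface.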

Summary

At ByteDance, Katalyst is deployed across over 900,000 nodes, managing tens of millions of cores and unifying the management of various workload types, including microservices, search, advertising, storage, big data, and AI jobs. Katalyst has improved daily resource utilization at ByteDance from 20% to 60%, while ensuring that the QoS requirements of the various workload types are satisfied. In the future, Katalyst Memory Advisor will continue to iterate and optimize. Further technical insights into features such as cold memory offloading and memory migration optimizations will be explained in subsequent blogs. Stay tuned!

References

[1] A brief introduction to Katalyst: https://www.cncf.io/blog/2023/12/26/katalyst-a-qos-based-resource-management-system-for-workload-colocation-on-kubernetes/
[2] Kubernetes eviction strategy: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
[3] Memory QoS KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos
[4] Memcg-level async reclaim: https://github.com/bytedance/kernel/commit/7d7386ec89caf078f21836c5cae33ffa886125c4
[5] Tidal colocation: https://gokatalyst.io/docs/user-guide/tidal-colocation/
[6] BPF hook for selecting victim task during OOM events: https://lore.kernel.org/lkml/20230804093804.47039-1-zhouchuyi@bytedance.com/
[7] TMO paper: https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf
