Cloud Native Weekly: GPU Sharing in Kubernetes

5 min read · Apr 15, 2025

Open Source project recommendations

A2A

Google’s Agent2Agent (A2A) protocol is an open-source standard designed to facilitate interoperability between AI agents built on different frameworks and by different vendors. It allows agents to securely exchange information, collaborate on tasks, and work across enterprise platforms and cloud environments using a unified protocol.

A2A is designed around five core principles: supporting natural agent collaboration, building on existing standards, being secure by default, supporting long-running tasks, and supporting multiple interaction modes (e.g., text, audio, video). It uses an “Agent Card” mechanism for agent discovery, allowing client agents to identify and interact with other agents. A2A also supports task lifecycle management, user experience negotiation, and cross-agent function invocation.
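
As a rough illustration of Agent Card based discovery, the sketch below models two cards as plain Python dictionaries and selects the agents that advertise a requested skill tag and accept a given input mode. The field names loosely follow the published Agent Card shape (`name`, `url`, `skills`, `defaultInputModes`), but the example agents, the URLs, and the `find_agents` helper are hypothetical, and no network calls are made.

```python
from typing import Any

# Hypothetical Agent Cards; the agents and URLs are made up for illustration.
AGENT_CARDS: list[dict[str, Any]] = [
    {
        "name": "invoice-agent",
        "url": "https://agents.example.com/invoice",
        "defaultInputModes": ["text"],
        "skills": [{"id": "extract-invoice", "tags": ["finance", "ocr"]}],
    },
    {
        "name": "meeting-agent",
        "url": "https://agents.example.com/meeting",
        "defaultInputModes": ["text", "audio"],
        "skills": [{"id": "summarize-call", "tags": ["summarization"]}],
    },
]


def find_agents(cards, tag, input_mode):
    """Return names of agents that list a skill with `tag` and accept `input_mode`."""
    matches = []
    for card in cards:
        has_skill = any(tag in skill.get("tags", []) for skill in card.get("skills", []))
        accepts_mode = input_mode in card.get("defaultInputModes", [])
        if has_skill and accepts_mode:
            matches.append(card["name"])
    return matches


print(find_agents(AGENT_CARDS, "summarization", "audio"))
```

In a real deployment the client agent would fetch each card from the remote agent rather than from an in-memory list; that step is left out here to keep the sketch self-contained.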

ThreatMapper

ThreatMapper, developed by Deepfence, is an open-source Cloud Native Application Protection Platform (CNAPP) that provides runtime security observability for development and operations teams. It uses lightweight sensors and agentless cloud scanning tasks to automatically detect and map vulnerabilities, sensitive data, misconfigurations, and compliance issues across containers, Kubernetes, virtual machines, and serverless environments (such as AWS Fargate). With the ThreatGraph visualization feature, users can prioritize critical security issues based on exploitability, attack paths, and risk levels.

Plandex

Plandex is an open-source terminal-based AI coding engine built for handling large projects and complex tasks. It interacts with developers via a command-line interface and supports intelligent context management, multi-model selection, and a version-controlled sandbox mechanism to ensure safe automation with human-in-the-loop control. Capable of processing millions of lines of code, Plandex integrates multiple models from providers like OpenAI, Anthropic, and Google to boost development productivity. Licensed under MIT, it supports cross-platform usage and is ideal for feature development, refactoring, and test generation in large-scale projects.

Direktiv

Direktiv is an open-source, event-driven serverless workflow engine designed for automation, integration, and orchestration tasks in containerized environments. At its core is a state machine that uses containers as functional units in workflows, passing information between states via JSON. Direktiv supports retries, error handling, and conditional logic, and allows dynamic transformation of state data during execution using JQ or JavaScript.

Technical recommendations

QCon London: Three-Step Approach to Managing Open Source Risk

At the 2025 QCon London Conference, Celine Pypaert, Head of Vulnerability Management at Johnson Matthey, shared a three-step method for managing the risks of open-source dependencies while balancing innovation with security. She emphasized that open-source components are widely used in commercial codebases, but over-reliance on commonly used software can introduce security risks, as seen in incidents like the XZ Utils backdoor and the removal of left-pad from npm. To address these challenges, Pypaert proposed the following strategies:

1. Identification and Prioritization: Organizations should implement Software Composition Analysis (SCA) tools to audit open-source dependencies, especially in testing environments, to detect vulnerabilities early. When addressing vulnerabilities, teams should consider both severity and fixability, and develop a phased remediation plan.

2. Accountability and Responsibility: Developers should collaborate with security teams and use a risk register to raise awareness among leadership. By building risk portfolios that link technical risks to business risks, organizations can better understand the potential business continuity impacts of software supply chain issues.

3. Proactive Remediation: Automate security tasks wherever possible, for example by integrating vulnerability scanning tools (e.g., GitHub Dependabot) with project management systems (e.g., Jira) to auto-assign security tasks and reduce team workload.
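
The first step, weighing severity against fixability to build a phased plan, might look something like the sketch below. The `Finding` fields, the phase names, the 7.0 severity threshold, and the package names are all assumptions made for illustration; they are not taken from the talk or from any particular SCA tool.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    package: str
    cvss: float          # severity score, 0.0 to 10.0
    fix_available: bool  # whether an upgrade or patch exists


def plan_phases(findings, high=7.0):
    """Group findings into remediation phases by severity and fixability."""
    phases = {"phase_1_fix_now": [], "phase_2_mitigate": [], "phase_3_scheduled": []}
    # Walk findings from most to least severe so each phase stays ordered.
    for item in sorted(findings, key=lambda x: x.cvss, reverse=True):
        if item.cvss >= high and item.fix_available:
            phases["phase_1_fix_now"].append(item.package)
        elif item.cvss >= high:
            # Severe but no fix yet: mitigate (e.g. isolate or replace) and track.
            phases["phase_2_mitigate"].append(item.package)
        else:
            phases["phase_3_scheduled"].append(item.package)
    return phases


# Hypothetical scan results.
findings = [
    Finding("lib-a", 9.8, True),
    Finding("lib-b", 8.1, False),
    Finding("lib-c", 4.3, True),
    Finding("lib-d", 7.5, True),
]
plan = plan_phases(findings)
print(plan)
```

Each phase could then be turned into tickets automatically, which is the kind of integration the third step describes.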

GPU Sharing in Kubernetes: NVIDIA KAI vs. Exostellar SDG

KAI-Scheduler is an open-source Kubernetes-native GPU scheduler by NVIDIA, aimed at optimizing resource allocation for AI and machine learning workloads. Initially developed by Run:ai, it is now open-sourced under the Apache 2.0 license and supports efficient management of large-scale GPU clusters.

Zhiming Shen, co-founder and CTO of Exostellar, conducted an in-depth comparison between NVIDIA’s KAI Scheduler and Exostellar’s Software-Defined GPU (SDG), focusing on different approaches to GPU sharing in Kubernetes environments.

The KAI scheduler supports GPU sharing via time-slicing, enabling multiple workloads to share a single physical GPU. However, it does not enforce GPU memory isolation, which can lead to memory contention and performance degradation when running multiple workloads.

In contrast, Exostellar’s SDG offers more fine-grained GPU virtualization capabilities, allowing users to dynamically allocate GPU memory and compute resources based on actual demand. This ensures better resource utilization and performance isolation.
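
To make the contrast concrete, here is a toy model of the two sharing styles. It does not reflect how KAI Scheduler or SDG are implemented: the 40 GiB device size, the workload numbers, and both helper functions are hypothetical. It only shows that when nothing caps memory, peak usage across co-located workloads can exceed the device, whereas per-workload memory limits let a scheduler refuse placements that would not fit.

```python
GPU_MEMORY_GIB = 40  # hypothetical device size

# Hypothetical workloads sharing one GPU.
workloads = [
    {"name": "train", "peak_gib": 30, "limit_gib": 30},
    {"name": "infer-a", "peak_gib": 8, "limit_gib": 8},
    {"name": "infer-b", "peak_gib": 6, "limit_gib": 6},
]


def peak_usage_without_limits(items):
    """Time-slicing only: nothing caps memory, so peaks can add up past the device."""
    return sum(w["peak_gib"] for w in items)


def admit_with_limits(items, capacity=GPU_MEMORY_GIB):
    """Memory-isolated sharing: admit a workload only if its limit fits in what is left."""
    admitted, free = [], capacity
    for w in items:
        if w["limit_gib"] <= free:
            admitted.append(w["name"])
            free -= w["limit_gib"]
    return admitted, free


total_peak = peak_usage_without_limits(workloads)
admitted, free_gib = admit_with_limits(workloads)
print(total_peak > GPU_MEMORY_GIB)  # oversubscribed when memory is not isolated
print(admitted, free_gib)
```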

GPU sharing goes beyond scheduling: it also involves resource isolation and reliability, especially in multi-tenant and high-load AI/ML environments.

What’s new in cloud native

Kafka 4.0: KRaft Simplifies Architecture

Apache Kafka 4.0 has officially been released, marking a major architectural shift. The new version enables KRaft mode by default, completely removing the dependency on Apache ZooKeeper, thereby simplifying deployment and management. KRaft mode uses the Raft protocol, enhancing scalability and system recovery capabilities.

Kafka 4.0 also introduces KIP-848, the next-generation consumer group protocol, which significantly improves rebalance performance and reduces downtime and latency. In addition, the release offers early access to Queues for Kafka (KIP-932), enabling point-to-point messaging via “share groups” and expanding Kafka’s use cases. This version also removes APIs that have been deprecated for at least 12 months and raises the minimum Java requirement: Kafka clients and Kafka Streams now require Java 11, while brokers, Connect, and tools require Java 17. Kafka 4.0 represents a major step toward platform modernization and reflects the community’s continued innovation on its 15th anniversary.

OpenStack 2025.1 Epoxy Released

The OpenStack community released its 31st version, OpenStack 2025.1 “Epoxy,” in April 2025, marking a significant milestone in cloud computing. This release aims to strengthen OpenStack’s position as a viable alternative to VMware, especially since Broadcom’s acquisition of VMware and the subsequent licensing changes have prompted enterprises to reassess their virtualization strategies.

Epoxy introduces key features, such as integrating Prometheus as a data source in the Watcher project to optimize resource allocation and monitor VMware infrastructure, ensuring smooth migration.

The Cinder project has enhanced support for storage solutions like NetApp, Pure Storage, and Hitachi, simplifying workload migration. On the security front, the Manila project now allows admins to dynamically adjust shared file system permissions, switching from “read-only” to “read-write” for finer-grained access control. Epoxy’s release is the result of contributions from over 450 developers across organizations like BBC R&D, Blizzard Entertainment, Canonical, Ericsson, Mirantis, and NVIDIA, with more than 7,600 changes, showcasing OpenStack’s sustained vitality and innovation on its 15th anniversary.

About KubeSphere

KubeSphere is an open source container platform built on top of Kubernetes with applications at its core. It provides full-stack automated IT operations and streamlined DevOps workflows.

KubeSphere has been adopted by thousands of enterprises across the globe, such as Aqara, Sina, Benlai, China Taiping, Huaxia Bank, Sinopharm, WeBank, Geko Cloud, VNG Corporation and Radore. KubeSphere offers wizard interfaces and various enterprise-grade features for operation and maintenance, including Kubernetes resource management, DevOps (CI/CD), application lifecycle management, service mesh, multi-tenant management, monitoring, logging, alerting, notification, storage and network management, and GPU support. With KubeSphere, enterprises are able to quickly establish a strong and feature-rich container platform.

To stay updated, visit our official website or follow us on Twitter.

Written by KubeSphere

KubeSphere (https://kubesphere.io) is an open source distributed operating system providing cloud native stack with Kubernetes as its kernel.
