GPU Virtualization Support in k8s

Official reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Ubuntu

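Configure the apt source

# Import NVIDIA's repository signing key so that apt can verify the packages
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Quote EOF so the shell does not try to run $(ARCH) as a command; apt expands it itself
cat <<'EOF' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
EOF
sudo apt-get update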

Install the package

sudo apt-get install -y nvidia-container-toolkit

RedHat

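Configure the yum source

# Save the repo file where yum looks for it, not in the current directory
sudo wget -O /etc/yum.repos.d/nvidia-container-toolkit.repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo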

Install the package

yum install -y nvidia-container-toolkit

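GPU Support in Docker

Register nvidia as a Docker runtime and make it the default runtime. The command rewrites daemon.json (by default /etc/docker/daemon.json) automatically.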

nvidia-ctk runtime configure --runtime docker --nvidia-set-as-default

Restart Docker

systemctl restart docker

Check the available runtimes

docker info|grep Runtime
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
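
The Runtimes line only shows that nvidia is registered, not that it is the default. As a minimal sketch, the DefaultRuntime field of docker info can be checked directly; the function name check_default_runtime is hypothetical and not part of the toolkit.

```shell
# Succeeds only if Docker's default runtime equals the expected name.
# check_default_runtime is a hypothetical helper, not part of any NVIDIA tool.
check_default_runtime() {
  local expected="${1:-nvidia}"
  local actual
  # DefaultRuntime is a field exposed by `docker info` through --format.
  actual="$(docker info --format '{{.DefaultRuntime}}')" || return 1
  if [ "$actual" = "$expected" ]; then
    echo "default runtime: $actual"
  else
    echo "default runtime is '$actual', expected '$expected'" >&2
    return 1
  fi
}
```

Calling check_default_runtime nvidia after restarting Docker should succeed if the configure step above took effect.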

Test GPU support in Docker

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          Off | 00000000:23:00.0 Off |                    0 |
| N/A   29C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          Off | 00000000:29:00.0 Off |                    0 |
| N/A   31C    P0              53W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM4-80GB          Off | 00000000:52:00.0 Off |                    0 |
| N/A   30C    P0              58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM4-80GB          Off | 00000000:57:00.0 Off |                    0 |
| N/A   28C    P0              58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A800-SXM4-80GB          Off | 00000000:8D:00.0 Off |                    0 |
| N/A   29C    P0              59W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A800-SXM4-80GB          Off | 00000000:92:00.0 Off |                    0 |
| N/A   30C    P0              53W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A800-SXM4-80GB          Off | 00000000:BF:00.0 Off |                    0 |
| N/A   31C    P0              55W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A800-SXM4-80GB          Off | 00000000:C5:00.0 Off |                    0 |
| N/A   29C    P0              54W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
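
For scripted checks the full table is awkward to parse. A small sketch that counts the GPUs visible inside the container instead; container_gpu_count is a hypothetical helper name, and the default image is simply the one used in the test above.

```shell
# Counts the GPUs visible inside a container started with --gpus all.
# container_gpu_count is a hypothetical helper name.
container_gpu_count() {
  local image="${1:-nvidia/cuda:12.2.0-base-ubuntu22.04}"
  # `nvidia-smi -L` prints one line per GPU, each starting with "GPU <index>:".
  docker run --rm --gpus all "$image" nvidia-smi -L | grep -c '^GPU '
}
```

On the host shown above, the count would be expected to match the eight GPUs in the table.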

GPU Support in k8s

Download the device plugin manifest

wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml

Edit nvidia-device-plugin.yml: change the image address and add the PASS_DEVICE_SPECS setting

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.232.16.103:5001/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
          - name: PASS_DEVICE_SPECS
            value: "true"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
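
When several clusters need the same change, the image swap can be scripted rather than edited by hand. The sketch below assumes GNU sed and that the downloaded manifest has a single image line mentioning k8s-device-plugin; set_plugin_image is a hypothetical helper name. The PASS_DEVICE_SPECS entry still has to be added manually.

```shell
# Rewrites the device plugin image so it is pulled from a private registry.
# set_plugin_image is a hypothetical helper; it edits the file in place (GNU sed).
set_plugin_image() {
  local file="$1" image="$2"
  # Keep "image:" and its indentation, replace the image reference after it.
  sed -i "s#\(image:[[:space:]]*\).*k8s-device-plugin.*#\1${image}#" "$file"
}
```

For example: set_plugin_image nvidia-device-plugin.yml 10.232.16.103:5001/nvidia/k8s-device-plugin:v0.15.0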

Apply the manifest and confirm that the pod is Running

kubectl apply -f nvidia-device-plugin.yml
kubectl get pod -n kube-system|grep nvidia-device-plugin-daemonset

Use kubectl to check whether the node advertises GPUs

kubectl describe node hgx-a800-131
Capacity:
  cpu:                128
  ephemeral-storage:  459329648Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056450800Ki
  nvidia.com/gpu:     8
  pods:               110
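
To read just the GPU count without scanning the whole describe output, a JSONPath query can be used. A minimal sketch; node_gpu_count is a hypothetical helper name, and it reads the allocatable value rather than the capacity shown above.

```shell
# Prints the number of allocatable nvidia.com/gpu resources on a node.
# node_gpu_count is a hypothetical helper name.
node_gpu_count() {
  local node="$1"
  # The dot inside the resource name must be escaped in the JSONPath expression.
  kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

For the node above, node_gpu_count hgx-a800-131 would be expected to print the same number as the Capacity section. Empty output usually means the device plugin has not registered the GPUs yet.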

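If the GPUs are NVIDIA A800-SXM4-80GB, the nvidia-fabricmanager service must also be installed. Its version has to match the installed driver (here the 535 branch, matching driver 535.161.08).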

apt install nvidia-fabricmanager-535

Edit the configuration file and confirm the setting

cat /usr/share/nvidia/nvswitch/fabricmanager.cfg
FABRIC_MODE=1

Start the service

systemctl start nvidia-fabricmanager.service
systemctl enable nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service
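
A version mismatch with the driver is a common reason for the service failing to start. The following sketch compares the two versions; both function names are hypothetical, and it assumes a Debian/Ubuntu system where the package is named nvidia-fabricmanager-535 and its version begins with the driver version (for example 535.161.08-1).

```shell
# Returns success when the Fabric Manager package version belongs to the
# same driver release, e.g. driver 535.161.08 and package 535.161.08-1.
# fm_matches_driver is a hypothetical helper name.
fm_matches_driver() {
  local driver="$1" pkg="$2"
  case "$pkg" in
    "$driver"|"$driver"-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads both versions from the system and compares them (hypothetical helper).
check_fabricmanager_version() {
  local driver pkg
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" || return 1
  pkg="$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-535)" || return 1
  if fm_matches_driver "$driver" "$pkg"; then
    echo "fabric manager $pkg matches driver $driver"
  else
    echo "fabric manager $pkg does not match driver $driver" >&2
    return 1
  fi
}
```

Running check_fabricmanager_version on the GPU host before starting the service gives a quick sanity check.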

Create a test pod to verify that k8s can use the GPUs

apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  containers:
  - name: test
    image: 10.232.16.103:5001/jt-llm/cm57b-vllm:v1.0.0
    resources:
      limits:
        nvidia.com/gpu: 2 # request two GPUs
    command: ["/bin/bash", "-c", "python -c 'import torch; print(torch.cuda.is_available())'; tail -f /dev/null"]
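
The manifest still has to be applied and its output checked. A minimal sketch, assuming the manifest was saved as test-gpu.yaml (a hypothetical file name); run_gpu_test_pod is likewise a hypothetical helper.

```shell
# Applies a pod manifest, waits until the pod is Ready, then prints its logs.
# run_gpu_test_pod and the default file name test-gpu.yaml are hypothetical.
run_gpu_test_pod() {
  local manifest="${1:-test-gpu.yaml}" pod="${2:-test-gpu}"
  kubectl apply -f "$manifest" || return 1
  # Large images can take a while to pull, hence the generous timeout.
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=600s || return 1
  kubectl logs "$pod"
}
```

If PyTorch inside the container can see the GPUs, the log should contain True. Because the pod keeps running (tail -f /dev/null), remove it afterwards with kubectl delete pod test-gpu.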

Author: 慕容峻才
Article link: https://www.acaiblog.top/k8s支持GPU虚拟化/
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit 阿才的博客 when reposting.