Official reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Ubuntu
Configure the apt source
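# Run from /etc/apt/sources.list.d/ so that apt picks up the file.
# The quoted 'EOF' keeps $(ARCH) literal; apt expands it itself.
cat >nvidia-container-toolkit.list <<'EOF'
deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
EOF
apt update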
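The official install guide additionally imports the repository's GPG key and pins the source entry to it with signed-by. The sketch below shows only that rewrite step. The keyring path is the one used in the guide; the helper name add_signed_by is hypothetical and not part of any NVIDIA tooling.

```shell
# Keyring location used in the official guide; adjust if the key lives elsewhere.
KEYRING=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

# add_signed_by (hypothetical helper): read apt source lines on stdin and
# pin every "deb https://" entry to the keyring above.
add_signed_by() {
    sed "s#deb https://#deb [signed-by=${KEYRING}] https://#g"
}

# Typical use (needs network access and root):
#   curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o "$KEYRING"
#   add_signed_by < nvidia-container-toolkit.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

Commented-out entries such as the experimental line are rewritten too, but stay commented out.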
Install the package
sudo apt-get install -y nvidia-container-toolkit
RedHat
Configure the yum repository
wget -P /etc/yum.repos.d/ https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
Install the package
yum install -y nvidia-container-toolkit
GPU support in Docker
Configure nvidia as a Docker runtime and set it as the default runtime. The command rewrites daemon.json automatically.
nvidia-ctk runtime configure --runtime docker --nvidia-set-as-default
Restart Docker
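Before restarting, it can help to confirm that daemon.json really was rewritten. A minimal sketch, assuming the default config path /etc/docker/daemon.json and a systemd-managed Docker service; the helper name default_runtime_is_nvidia is hypothetical.

```shell
# default_runtime_is_nvidia (hypothetical helper): succeed if the given
# daemon.json sets "default-runtime" to "nvidia".
default_runtime_is_nvidia() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Typical use on the Docker host (needs root and systemd):
#   default_runtime_is_nvidia /etc/docker/daemon.json && sudo systemctl restart docker
```

The check is a plain text match, not a JSON parse, so it only suits the simple file that nvidia-ctk writes.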
Check the runtimes (the Runtimes line in the docker info output)
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
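The same check can be scripted instead of read by eye. This is a sketch under the assumption that docker info prints a single "Runtimes:" line like the one above; the helper name has_nvidia_runtime is hypothetical.

```shell
# has_nvidia_runtime (hypothetical helper): read `docker info` output on
# stdin and succeed if the Runtimes line lists "nvidia" as a whole word.
has_nvidia_runtime() {
    grep -E '^[[:space:]]*Runtimes:' \
        | grep -Eq '(^|[[:space:]])nvidia([[:space:]]|$)'
}

# Typical use on the Docker host:
#   docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```

Only the Runtimes line is inspected, so a "Default Runtime: nvidia" line alone does not satisfy the check.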
Test that Docker can use the GPUs
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          Off | 00000000:23:00.0 Off |                    0 |
| N/A   29C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          Off | 00000000:29:00.0 Off |                    0 |
| N/A   31C    P0              53W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM4-80GB          Off | 00000000:52:00.0 Off |                    0 |
| N/A   30C    P0              58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM4-80GB          Off | 00000000:57:00.0 Off |                    0 |
| N/A   28C    P0              58W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A800-SXM4-80GB          Off | 00000000:8D:00.0 Off |                    0 |
| N/A   29C    P0              59W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A800-SXM4-80GB          Off | 00000000:92:00.0 Off |                    0 |
| N/A   30C    P0              53W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A800-SXM4-80GB          Off | 00000000:BF:00.0 Off |                    0 |
| N/A   31C    P0              55W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A800-SXM4-80GB          Off | 00000000:C5:00.0 Off |                    0 |
| N/A   29C    P0              54W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
GPU support in Kubernetes
Download the device plugin deployment manifest
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
Edit nvidia-device-plugin.yml: change the image address and add the PASS_DEVICE_SPECS setting
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.232.16.103:5001/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
          - name: PASS_DEVICE_SPECS
            value: "true"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
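The image change can also be scripted rather than edited by hand. The sketch below assumes the downloaded v0.15.0 manifest references nvcr.io/nvidia/k8s-device-plugin:v0.15.0 (check your copy before relying on it); the helper name use_mirror_image is hypothetical. PASS_DEVICE_SPECS still has to be added manually, and the edited manifest still has to be applied to the cluster.

```shell
# Upstream image reference (assumption; verify against the downloaded file).
UPSTREAM_IMAGE='nvcr.io/nvidia/k8s-device-plugin:v0.15.0'
# Private registry copy used in this document.
MIRROR_IMAGE='10.232.16.103:5001/nvidia/k8s-device-plugin:v0.15.0'

# use_mirror_image (hypothetical helper): read the manifest on stdin and
# point its image line at the private registry.
use_mirror_image() {
    sed "s#image: ${UPSTREAM_IMAGE}#image: ${MIRROR_IMAGE}#"
}

# Typical use (needs cluster access):
#   use_mirror_image < nvidia-device-plugin.yml > nvidia-device-plugin.edited.yml
#   kubectl apply -f nvidia-device-plugin.edited.yml
```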
Confirm that the pod is Running
kubectl get pod -n kube-system|grep nvidia-device-plugin-daemonset
Use kubectl to check whether the node exposes its GPUs
kubectl describe node hgx-a800-131
Capacity:
  cpu:                128
  ephemeral-storage:  459329648Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056450800Ki
  nvidia.com/gpu:     8
  pods:               110
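To pull just the GPU count out of that output, a small filter is enough. A minimal sketch, assuming the describe output lists nvidia.com/gpu under Capacity in the "key: value" form shown above; the helper name gpu_count is hypothetical.

```shell
# gpu_count (hypothetical helper): read `kubectl describe node` output on
# stdin and print the first nvidia.com/gpu value (the Capacity entry).
gpu_count() {
    awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Typical use (needs cluster access):
#   kubectl describe node hgx-a800-131 | gpu_count
```

If the device plugin has not registered any GPUs, the filter prints nothing.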
If the GPU model is NVIDIA A800-SXM4-80GB, the nvidia-fabricmanager service must be installed
apt install nvidia-fabricmanager-535
Edit the configuration file
cat /usr/share/nvidia/nvswitch/fabricmanager.cfg
FABRIC_MODE=1
Start the service
systemctl start nvidia-fabricmanager.service
systemctl enable nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service
Create a test pod to verify that Kubernetes can use the GPUs
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  containers:
  - name: test
    image: 10.232.16.103:5001/jt-llm/cm57b-vllm:v1.0.0
    resources:
      limits:
        nvidia.com/gpu: 2 # request two GPUs
    command: ["/bin/bash", "-c", "python -c 'import torch; print(torch.cuda.is_available())'; tail -f /dev/null"]
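Once the pod is running, its log should contain the value printed by torch.cuda.is_available(). A sketch of checking that, assuming the manifest was saved as test-gpu.yaml (a hypothetical file name); the helper name cuda_available is hypothetical as well.

```shell
# cuda_available (hypothetical helper): read the pod log on stdin and
# succeed only if some line is exactly "True".
cuda_available() {
    grep -qx 'True'
}

# Typical use (needs cluster access):
#   kubectl apply -f test-gpu.yaml
#   kubectl logs test-gpu | cuda_available && echo "GPU visible inside the pod"
```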