NVIDIA A800 Initialization Failure

Solution

Initializing the NVIDIA A800 GPUs fails with the following error:

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

Check the GPU status with nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Graphics... Off | 00000000:04:00.0 Off | 0 |
| N/A 32C P0 52W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Graphics... Off | 00000000:14:00.0 Off | 0 |
| N/A 28C P0 52W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Enable persistence mode on the GPUs (the Persistence-M column changes from Off to On)

nvidia-smi -pm 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Graphics... On | 00000000:04:00.0 Off | 0 |
| N/A 32C P0 52W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Graphics... On | 00000000:14:00.0 Off | 0 |
| N/A 28C P0 52W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
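
The Persistence-M column can also be checked from a script. The following is a minimal sketch, assuming that `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader` prints one `index, Enabled` or `index, Disabled` line per GPU. The helper names are hypothetical and the sample text is illustrative, not captured from the machine above.

```python
import subprocess


def parse_persistence(csv_text):
    """Map GPU index -> True/False from 'index, Enabled|Disabled' lines."""
    states = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = [field.strip() for field in line.split(",", 1)]
        states[int(index)] = (mode == "Enabled")
    return states


def query_persistence():
    """Run nvidia-smi and parse its answer (needs the NVIDIA driver loaded)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,persistence_mode",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_persistence(out)


# Illustrative sample of what the query is assumed to print for two GPUs.
SAMPLE = "0, Enabled\n1, Enabled\n"
print(parse_persistence(SAMPLE))  # {0: True, 1: True}
```

On a GPU host, `query_persistence()` would replace the sample text; it is left uncalled here so the sketch also runs on a machine without a driver.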

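Download nvidia-fabric-manager: https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/
Install nvidia-fabric-manager. Its version must match the GPU driver version, so use the driver version reported by nvidia-smi in place of the version in the command below if they differ.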

yum install nvidia-fabric-manager-535.54.03 nvidia-fabric-manager-devel-535.54.03
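
Because the package and driver versions have to agree, it may help to compare them before installing. This is a sketch under stated assumptions, not the author's procedure: the function names are hypothetical, and it assumes `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints the driver version once per GPU. The two version strings used at the bottom are the ones that appear in this article (the nvidia-smi header and the yum command).

```python
import subprocess


def installed_driver_version():
    """Ask nvidia-smi for the driver version (needs the NVIDIA driver loaded)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    # One line per GPU; all GPUs share one driver, so the first line is enough.
    return out.splitlines()[0].strip()


def matches_driver(driver_version, package_version):
    """Compare versions, ignoring an RPM release suffix such as '-1'."""
    return package_version.split("-", 1)[0].strip() == driver_version.strip()


def install_command(driver_version):
    """Build the yum command used in this article for a given driver version."""
    return [
        "yum", "install",
        f"nvidia-fabric-manager-{driver_version}",
        f"nvidia-fabric-manager-devel-{driver_version}",
    ]


# Version strings as they appear in the article.
print(matches_driver("510.47.03", "535.54.03"))   # False
print(" ".join(install_command("510.47.03")))
```

On a real host, `install_command(installed_driver_version())` would give the package names to pass to yum.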

Edit the configuration file /usr/share/nvidia/nvswitch/fabricmanager.cfg

FABRIC_MODE=1
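
If the same change has to be made on several servers, the edit can be scripted. The helper below is a hypothetical sketch that treats the file as plain KEY=VALUE lines with # comments, which is an assumption about the file layout. It only transforms text and does not touch the file itself. The sample configuration is invented for illustration; only FABRIC_MODE comes from this article.

```python
def set_cfg_value(cfg_text, key, value):
    """Return cfg_text with key=value set: replace the existing line or append one."""
    lines = cfg_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave comments untouched
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Invented sample content; a real fabricmanager.cfg has more settings.
sample = "# sample settings\nFABRIC_MODE=0\n"
print(set_cfg_value(sample, "FABRIC_MODE", 1), end="")
```

To apply it, read the configuration file into a string, pass it through `set_cfg_value`, and write the result back before starting the service.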

Start the service

systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager

Test with Python. If it prints True, the GPUs are usable.

python -c "import torch; print(torch.cuda.is_available())"
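
The one-liner prints only True or False. A slightly longer sketch can also report the device count and keep any error text instead of a traceback. The function name `cuda_status` is hypothetical; it relies only on `torch.cuda.is_available()` and `torch.cuda.device_count()`, and it falls back gracefully when PyTorch is not installed.

```python
import importlib.util


def cuda_status():
    """Describe whether PyTorch is installed and can see CUDA devices."""
    status = {"torch_installed": False, "available": False,
              "devices": 0, "error": None}
    if importlib.util.find_spec("torch") is None:
        return status  # PyTorch is not installed in this environment
    try:
        import torch
        status["torch_installed"] = True
        status["available"] = bool(torch.cuda.is_available())
        if status["available"]:
            status["devices"] = torch.cuda.device_count()
    except Exception as exc:
        # For example the RuntimeError quoted at the top of this article.
        status["error"] = str(exc)
    return status


print(cuda_status())
```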

Known issues

Partition creation fails

Symptoms

  1. After the fabricmanager service starts, partition creation fails with the following errors:
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 1
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 3
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 6
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 7
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 9
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 11
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 24.
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 25.
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 1
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 3
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 6
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 7
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 9
    [Mar 04 2025 11:19:14] [ERROR] [tid 71824] request to send socket message to local fabric manager for fid 0 failed with error -3
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 11
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 26.
    [Mar 04 2025 11:19:14] [ERROR] [tid 71824] request to send config deinit done message to fid 0 failed with error -72
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 27.
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 28.
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 29.
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 30.
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 31.
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 32.
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 33.
    [Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 34.
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] start NVSwitch 0/10 routing configuration
    [Mar 04 2025 11:19:14] [ERROR] [tid 69977] request to send socket message to local fabric manager for fid 0 failed with error -3
    [Mar 04 2025 11:19:14] [ERROR] [tid 69977] request to send switch port config to fid 0 for NVSwitch physical id 10 failed with error -72
    [Mar 04 2025 11:19:14] [ERROR] [tid 69977] failed to configure NVSwitch for fid 0 NVSwitch physical id 10 with error -72
    [Mar 04 2025 11:19:14] [INFO] [tid 69977] training all NVLink connections to off
    [Mar 04 2025 11:19:14] [ERROR] [tid 69977] request to send sync message to local fabric manager for fid 0 failed with error -3
    [Mar 04 2025 11:19:14] [ERROR] [tid 69977] request to send GPU detach message to fid 0 failed with error -72
    [Mar 04 2025 11:19:14] [ERROR] [tid 69977] failed to configure local fabric manager fid 0
    [Mar 04 2025 11:19:14] [ERROR] [tid 69977] failed to configure all the available GPUs or NVSwitches
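
A log like this is easier to read once it is summarized. The sketch below counts lines per level and collects the partition ids that failed. It assumes every line follows the `[timestamp] [LEVEL] [tid N] message` layout seen above; the function name is hypothetical, and the three sample lines are copied from the log in this article.

```python
import re
from collections import Counter

LOG_LINE = re.compile(
    r"\[(?P<stamp>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[tid (?P<tid>\d+)\] (?P<msg>.*)"
)
PARTITION_FAIL = re.compile(
    r"failed to create fabric partition mapping for fid \d+ partition id (\d+)"
)


def summarize_fm_log(lines):
    """Return (count per log level, list of failed partition ids)."""
    levels = Counter()
    failed_partitions = []
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not follow the expected layout
        levels[match.group("level")] += 1
        failed = PARTITION_FAIL.search(match.group("msg"))
        if failed:
            failed_partitions.append(int(failed.group(1)))
    return dict(levels), failed_partitions


sample = [
    "[Mar 04 2025 11:19:14] [INFO] [tid 69977] start NVSwitch 0/10 routing configuration",
    "[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 24.",
    "[Mar 04 2025 11:19:14] [ERROR] [tid 69977] failed to configure local fabric manager fid 0",
]
print(summarize_fm_log(sample))
```

Passing an open log file object instead of `sample` works the same way, since the function only iterates over lines.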

    Solution

Check the default kernel

grubby --default-kernel
/boot/vmlinuz-4.18.0-553.el8_10.x86_64

List all kernel entries

grubby --info=ALL
index=0
kernel="/boot/vmlinuz-5.10.134-12.2.el8.bclinux.x86_64"
args="ro crashkernel=auto resume=/dev/mapper/bel-swap rd.lvm.lv=bel/root rd.lvm.lv=bel/swap rhgb quiet $tuned_params mitigations=off"
root="/dev/mapper/bel-root"
initrd="/boot/initramfs-5.10.134-12.2.el8.bclinux.x86_64.img $tuned_initrd"
title="BigCloud Enterprise Linux (5.10.134-12.2.el8.bclinux.x86_64) 8.2 (Core)"
id="571369c2a99c4933990e14dd9c532a46-5.10.134-12.2.el8.bclinux.x86_64"
index=1
kernel="/boot/vmlinuz-4.19.0-372.26.3.el8_2.bclinux.x86_64"
args="ro crashkernel=auto resume=/dev/mapper/bel-swap rd.lvm.lv=bel/root rd.lvm.lv=bel/swap rhgb quiet $tuned_params"
root="/dev/mapper/bel-root"
initrd="/boot/initramfs-4.19.0-372.26.3.el8_2.bclinux.x86_64.img $tuned_initrd"
title="BigCloud Enterprise Linux (4.19.0-372.26.3.el8_2.bclinux.x86_64) 8.2 (Core)"
id="571369c2a99c4933990e14dd9c532a46-4.19.0-372.26.3.el8_2.bclinux.x86_64"
index=2
kernel="/boot/vmlinuz-0-rescue-571369c2a99c4933990e14dd9c532a46"
args="ro crashkernel=auto resume=/dev/mapper/bel-swap rd.lvm.lv=bel/root rd.lvm.lv=bel/swap rhgb quiet"
root="/dev/mapper/bel-root"
initrd="/boot/initramfs-0-rescue-571369c2a99c4933990e14dd9c532a46.img"
title="BigCloud Enterprise Linux (0-rescue-571369c2a99c4933990e14dd9c532a46) 8.2 (Core)"
id="571369c2a99c4933990e14dd9c532a46-0-rescue"
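
The index needed for `grubby --set-default-index` can be read off the listing by eye, or looked up from the text. The sketch below is hypothetical helper code that assumes the `key=value` layout shown above, with each entry starting at an `index=` line. The sample is shortened from the listing in this article.

```python
def parse_grubby_info(text):
    """Split `grubby --info=ALL` output into one dict per boot entry."""
    entries = []
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition("=")
        if not sep:
            continue  # not a key=value line
        if key == "index":
            entries.append({"index": int(value)})
        elif entries:
            entries[-1][key] = value.strip('"')
    return entries


def index_for_kernel(entries, version):
    """Return the entry index whose kernel path ends with vmlinuz-<version>."""
    for entry in entries:
        if entry.get("kernel", "").endswith("vmlinuz-" + version):
            return entry["index"]
    return None


SAMPLE = '''index=0
kernel="/boot/vmlinuz-5.10.134-12.2.el8.bclinux.x86_64"
index=1
kernel="/boot/vmlinuz-4.19.0-372.26.3.el8_2.bclinux.x86_64"
'''
entries = parse_grubby_info(SAMPLE)
print(index_for_kernel(entries, "4.19.0-372.26.3.el8_2.bclinux.x86_64"))  # 1
```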

Confirm that the corresponding kernel files exist. If they do not, reinstall the kernel.

/boot/vmlinuz-4.19.0-372.26.3.el8_2.bclinux.x86_64
/boot/initramfs-4.19.0-372.26.3.el8_2.bclinux.x86_64.img
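
The existence check can be sketched in a few lines. The helper names are hypothetical, and the naming pattern `vmlinuz-<version>` and `initramfs-<version>.img` under /boot is taken from the two paths above.

```python
import os


def kernel_files(version, boot_dir="/boot"):
    """Paths of the kernel image and initramfs for one kernel version."""
    return [
        os.path.join(boot_dir, f"vmlinuz-{version}"),
        os.path.join(boot_dir, f"initramfs-{version}.img"),
    ]


def missing_kernel_files(version, boot_dir="/boot"):
    """Return the expected files that are not present on disk."""
    return [path for path in kernel_files(version, boot_dir)
            if not os.path.isfile(path)]


missing = missing_kernel_files("4.19.0-372.26.3.el8_2.bclinux.x86_64")
if missing:
    print("missing, reinstall the kernel:", ", ".join(missing))
else:
    print("kernel files present")
```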

Set the default boot kernel, then confirm it

grubby --set-default-index=1
The default is /boot/loader/entries/571369c2a99c4933990e14dd9c532a46-4.19.0-372.26.3.el8_2.bclinux.x86_64.conf with index 1 and kernel /boot/vmlinuz-4.19.0-372.26.3.el8_2.bclinux.x86_64
grubby --default-kernel
/boot/vmlinuz-4.19.0-372.26.3.el8_2.bclinux.x86_64

Reboot the server

reboot

Reset the GPUs

nvidia-smi --gpu-reset
The following GPUs could not be reset:
GPU 00000000:04:00.0: In use by another client
GPU 00000000:0D:00.0: In use by another client

2 devices are currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using these devices and all compute applications running in the system.

If the error above appears, the GPUs are in use by other processes. Stop every process that is using the GPUs first, then retry.

systemctl stop docker
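
When the reset is retried from a script, the output can be parsed to see which GPUs are still busy. This is a hypothetical sketch that matches only the two message forms shown in this article ("In use by another client" and "was successfully reset."); other messages are ignored.

```python
import re

BUSY = re.compile(r"^GPU (\S+): In use by another client$")
RESET_OK = re.compile(r"^GPU (\S+) was successfully reset\.$")


def parse_gpu_reset(output):
    """Sort PCI bus ids from `nvidia-smi --gpu-reset` output into busy and reset."""
    result = {"busy": [], "reset": []}
    for line in output.splitlines():
        line = line.strip()
        match = BUSY.match(line)
        if match:
            result["busy"].append(match.group(1))
            continue
        match = RESET_OK.match(line)
        if match:
            result["reset"].append(match.group(1))
    return result


# Lines copied from the outputs quoted in this article.
sample = (
    "The following GPUs could not be reset:\n"
    "GPU 00000000:04:00.0: In use by another client\n"
    "GPU 00000000:1D:00.0 was successfully reset.\n"
)
print(parse_gpu_reset(sample))
```

An empty `busy` list after the retry indicates that no GPU was reported as still in use.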

Reset the GPUs again

nvidia-smi --gpu-reset
GPU 00000000:04:00.0 was successfully reset.
GPU 00000000:0D:00.0 was successfully reset.
GPU 00000000:1D:00.0 was successfully reset.
All done.

Start the service


Author: 慕容峻才
Article link: https://www.acaiblog.top/NVIDIA-A800初始化失败/
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit 阿才的博客 when reposting.