Solution
An NVIDIA A800 GPU reports the following error during initialization:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
Check the GPU status
+-----------------------------------------------------------------------------+
Enable GPU persistence mode
nvidia-smi -pm 1
+-----------------------------------------------------------------------------+
Download nvidia-fabric-manager: https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/
Install nvidia-fabric-manager; its version must match the GPU driver version
yum install nvidia-fabric-manager-535.54.03 nvidia-fabric-manager-devel-535.54.03
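The version requirement above can be checked before starting the service. The following is a minimal sketch, assuming the driver version is read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and the package name with `rpm -q nvidia-fabric-manager`; the helper names are hypothetical and not part of any NVIDIA tool.

```python
import re
import subprocess
from typing import List, Optional


def package_version(rpm_name: str) -> Optional[str]:
    """Extract a dotted version such as 535.54.03 from an RPM name string."""
    match = re.search(r"(\d+\.\d+(?:\.\d+)?)", rpm_name)
    return match.group(1) if match else None


def versions_match(driver_version: str, rpm_name: str) -> bool:
    """True when the Fabric Manager package version equals the driver version."""
    return package_version(rpm_name) == driver_version.strip()


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout (empty string if it printed nothing)."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()


def check_installed_versions() -> bool:
    """Compare the live driver version with the installed package (needs a GPU host)."""
    out = run(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"])
    driver = out.splitlines()[0] if out else ""
    package = run(["rpm", "-q", "nvidia-fabric-manager"])
    return versions_match(driver, package)
```

Calling `check_installed_versions()` on the GPU host returns False when the two versions differ, in which case the package should be reinstalled at the driver's version.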
Edit the configuration file /usr/share/nvidia/nvswitch/fabricmanager.cfg and set the fabric mode (1 selects shared NVSwitch multitenancy mode)
FABRIC_MODE=1
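For repeatable deployments, the configuration edit can be scripted instead of done by hand. This is a sketch only: it assumes the file uses plain `KEY=VALUE` lines, and `set_cfg_value` and `update_cfg_file` are hypothetical helper names.

```python
import re
from pathlib import Path


def set_cfg_value(text: str, key: str, value: str) -> str:
    """Return cfg text with key=value set: replace the first active line or append one."""
    # Lines starting with '#' are comments and are deliberately not matched.
    pattern = re.compile(rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(text):
        return pattern.sub(lambda _m: line, text, count=1)
    separator = "" if (not text or text.endswith("\n")) else "\n"
    return f"{text}{separator}{line}\n"


def update_cfg_file(path: str, key: str, value: str) -> None:
    """Rewrite the cfg file in place with the updated setting."""
    cfg = Path(path)
    cfg.write_text(set_cfg_value(cfg.read_text(), key, value))
```

For example, `update_cfg_file("/usr/share/nvidia/nvswitch/fabricmanager.cfg", "FABRIC_MODE", "1")` applies the setting shown above. The service presumably needs a restart before the change takes effect.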
Enable and start the service
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
Test with Python; if it prints True, the GPU is usable
python -c "import torch; print(torch.cuda.is_available())"
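The one-line check only prints True or False. A slightly longer sketch can also report why initialization failed by allocating a tensor on the GPU and inspecting the exception text. The hint labels and function names below are illustrative assumptions, and treating Error 802 as a Fabric Manager problem follows this page rather than any PyTorch API.

```python
def diagnose_cuda_error(message: str) -> str:
    """Map a CUDA initialization error message to a coarse hint (labels are illustrative)."""
    if "Error 802" in message or "system not yet initialized" in message:
        return "check-fabric-manager"
    return "unknown"


def check_cuda() -> dict:
    """Try to allocate a tensor on the GPU and report what happened."""
    try:
        import torch
    except ImportError:
        return {"ok": False, "hint": "torch-not-installed", "detail": ""}
    try:
        torch.zeros(1, device="cuda")  # forces CUDA initialization
    except (RuntimeError, AssertionError) as exc:
        return {"ok": False, "hint": diagnose_cuda_error(str(exc)), "detail": str(exc)}
    return {"ok": True, "hint": "", "detail": f"{torch.cuda.device_count()} device(s)"}
```

On a host showing the error at the top of this page, `check_cuda()` should return the `check-fabric-manager` hint together with the full error text.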
Known issues
Partition creation fails
Symptoms
- After the fabricmanager service starts, partition creation fails with the following errors:
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 1
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 3
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 6
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 7
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 9
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:1C:00.0 link index 11
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 24.
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 25.
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 1
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 3
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 6
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 7
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 9
[Mar 04 2025 11:19:14] [ERROR] [tid 71824] request to send socket message to local fabric manager for fid 0 failed with error -3
[Mar 04 2025 11:19:14] [INFO] [tid 69977] GPU link state information is missing for fid 0 GPU pci bus id 00000000:04:00.0 link index 11
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 26.
[Mar 04 2025 11:19:14] [ERROR] [tid 71824] request to send config deinit done message to fid 0 failed with error -72
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 27.
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 28.
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 29.
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 30.
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 31.
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 32.
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 33.
[Mar 04 2025 11:19:14] [WARNING] [tid 69977] failed to create fabric partition mapping for fid 0 partition id 34.
[Mar 04 2025 11:19:14] [INFO] [tid 69977] start NVSwitch 0/10 routing configuration
[Mar 04 2025 11:19:14] [ERROR] [tid 69977] request to send socket message to local fabric manager for fid 0 failed with error -3
[Mar 04 2025 11:19:14] [ERROR] [tid 69977] request to send switch port config to fid 0 for NVSwitch physical id 10 failed with error -72
[Mar 04 2025 11:19:14] [ERROR] [tid 69977] failed to configure NVSwitch for fid 0 NVSwitch physical id 10 with error -72
[Mar 04 2025 11:19:14] [INFO] [tid 69977] training all NVLink connections to off
[Mar 04 2025 11:19:14] [ERROR] [tid 69977] request to send sync message to local fabric manager for fid 0 failed with error -3
[Mar 04 2025 11:19:14] [ERROR] [tid 69977] request to send GPU detach message to fid 0 failed with error -72
[Mar 04 2025 11:19:14] [ERROR] [tid 69977] failed to configure local fabric manager fid 0
[Mar 04 2025 11:19:14] [ERROR] [tid 69977] failed to configure all the available GPUs or NVSwitches
Resolution
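A long log like the one above is easier to triage once it is summarized. The sketch below assumes every line follows the `[timestamp] [LEVEL] [tid N] message` layout seen in this excerpt. It only counts what the log says and does not interpret the numeric error codes. `summarize_fm_log` is a hypothetical helper name.

```python
import re
from collections import Counter

LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[tid (?P<tid>\d+)\] (?P<msg>.*)$"
)
PARTITION_RE = re.compile(r"failed to create fabric partition mapping for fid \d+ partition id (\d+)")
ERROR_CODE_RE = re.compile(r"with error (-?\d+)")
LINK_RE = re.compile(r"GPU pci bus id (\S+) link index (\d+)")


def summarize_fm_log(log_text: str) -> dict:
    """Count log levels, failed partition ids, error codes and missing links per GPU."""
    levels = Counter()
    error_codes = Counter()
    failed_partitions = []
    missing_links = {}  # PCI bus id -> list of link indexes
    for raw in log_text.splitlines():
        parsed = LINE_RE.match(raw.strip())
        if not parsed:
            continue  # skip lines that do not follow the expected layout
        levels[parsed["level"]] += 1
        msg = parsed["msg"]
        partition = PARTITION_RE.search(msg)
        if partition:
            failed_partitions.append(int(partition.group(1)))
        code = ERROR_CODE_RE.search(msg)
        if code:
            error_codes[int(code.group(1))] += 1
        link = LINK_RE.search(msg)
        if link:
            missing_links.setdefault(link.group(1), []).append(int(link.group(2)))
    return {
        "levels": dict(levels),
        "failed_partitions": failed_partitions,
        "error_codes": dict(error_codes),
        "missing_links": missing_links,
    }
```

Feeding the service log to `summarize_fm_log` shows which GPUs report missing link state and which partition ids failed.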
Check the default kernel
grubby --default-kernel
List all kernel entries
grubby --info=ALL
Confirm that the corresponding kernel file exists; if it does not, reinstall the kernel
/boot/vmlinuz-4.19.0-372.26.3.el8_2.bclinux.x86_64
Set the default boot kernel, then confirm the default again
grubby --set-default-index=1
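The kernel checks above can be combined into one pass over the `grubby --info=ALL` output. This sketch assumes the output consists of `key=value` lines grouped under `index=N`, with values optionally wrapped in double quotes. `parse_grubby_info` and `missing_kernels` are hypothetical helper names.

```python
import os
from typing import Callable, Dict, List


def parse_grubby_info(text: str) -> List[Dict]:
    """Split `grubby --info=ALL` output into one dict per boot entry."""
    entries: List[Dict] = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # e.g. blank lines or "non linux entry"
        key, _, value = line.partition("=")
        value = value.strip().strip('"')
        if key == "index":
            current = {"index": int(value)}
            entries.append(current)
        elif current is not None:
            current[key] = value
    return entries


def missing_kernels(entries: List[Dict], exists: Callable[[str], bool] = os.path.exists) -> List[Dict]:
    """Return the entries whose kernel file is not present on disk."""
    return [e for e in entries if "kernel" in e and not exists(e["kernel"])]
```

An entry returned by `missing_kernels` points at a kernel image that needs reinstalling. The `index` field of a healthy entry is the value to pass to `grubby --set-default-index`.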
Reboot the server
reboot
Reset the GPU
nvidia-smi --gpu-reset
If the reset reports an error, the GPU is being used by other processes. Kill every process that is using the GPU first, then retry.
systemctl stop docker
Reset the GPU again
nvidia-smi --gpu-reset
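Stopping Docker may not release every GPU user, so it can help to list the remaining processes before retrying the reset. The sketch below assumes the list comes from `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader`. That query reports compute processes only, so other holders of the device files may not appear. The function names are hypothetical.

```python
import subprocess
from typing import List, Tuple


def parse_compute_apps(csv_text: str) -> List[Tuple[int, str]]:
    """Parse 'pid, process_name' CSV rows into (pid, name) tuples."""
    processes = []
    for line in csv_text.splitlines():
        pid, _, name = line.strip().partition(",")
        pid = pid.strip()
        if pid.isdigit():  # ignore blank or malformed rows
            processes.append((int(pid), name.strip()))
    return processes


def list_gpu_processes() -> List[Tuple[int, str]]:
    """Query nvidia-smi for running compute processes (needs a GPU host)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    return parse_compute_apps(result.stdout)
```

If `list_gpu_processes()` returns an empty list and the reset still fails, a non-compute process is probably still holding the device.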
Start the service