I. Fault description
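A front-line team reported that a virtual desktop could not be powered on after it was published. The error was "Passthrough device 'pciPassthru0' vGPU 'grid_t4-1q' disallowed by vmkernel: Failure", and the full message read: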
Module 'DevicePowerOn' power on failed.
Could not initialize plugin '/usr/lib64/vmware/plugin/libnvidia-vgx.so' for vGPU 'grid_t4-1q'.
Passthrough device 'pciPassthru0' vGPU 'grid_t4-1q' disallowed by vmkernel: Failure
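To confirm that a VM is hitting this same failure, its log can be searched for the three messages quoted above. The following is only a minimal sketch: the function name find_vgpu_poweron_errors is hypothetical, and it assumes the messages are written to the VM's vmware.log in its datastore directory.

```shell
#!/bin/sh
# Sketch: print the lines of a VM log that match the vGPU power-on
# failure messages quoted in this post.
# find_vgpu_poweron_errors is a hypothetical helper name.
find_vgpu_poweron_errors() {
    log_file="$1"
    # -E enables extended regex so the three messages can be alternatives.
    # The patterns avoid the quote characters, so they match whether the
    # log uses straight or typographic quotes.
    grep -E "DevicePowerOn.*power on failed|Could not initialize plugin .*libnvidia-vgx\.so|disallowed by vmkernel" "$log_file"
}

# Usage (the path is an assumption; substitute the real datastore and VM folder):
# find_vgpu_poweron_errors /vmfs/volumes/DATASTORE/VM_NAME/vmware.log
```

If the function prints nothing, the log does not contain any of the three messages and the power-on failure probably has a different cause.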
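II. Analysis and handling
1) The event log also contained "Virtual machine zz3-dl0000 powered On with vNICs connected to dvPorts that have a port level configuration", meaning the port-level policy differs from the policy configured on the distributed port group. On inspection, this was unrelated to the fault.
2) Relevant notes in the official NVIDIA documentation:
Known issues.
Release notes.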
For NVIDIA P4/P6/P40 GPU graphics cards based on the Pascal architecture, you need to disable "ECC Memory"; otherwise your VMs will not power on and you will be left with the error:
Could not initialize plugin '/usr/lib64/vmware/plugin/libnvidia-vgx.so' for vGPU "profile_name".
GPUs based on the Pascal GPU architecture are supplied with ECC memory enabled by default.
# Disable ECC on the ESXi host
/etc/init.d/xorg stop
nv-hostengine -t
nv-hostengine -d
/etc/init.d/xorg start
nvidia-smi -e 0
reboot
nvidia-smi -q    # verify after the reboot
esxcli software vib list | grep NVIDIA    # check the vgx VIB
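Since the ECC change only takes effect after a reboot, it helps to see the current and pending ECC state side by side. The sketch below assumes that the host's nvidia-smi supports the --query-gpu fields ecc.mode.current and ecc.mode.pending with csv,noheader output. The function name ecc_report is hypothetical.

```shell
#!/bin/sh
# Sketch: summarize ECC state per GPU.
# Expects lines such as "0, Enabled, Disabled" on stdin
# (GPU index, current ECC mode, pending ECC mode).
# ecc_report is a hypothetical helper name.
ecc_report() {
    while IFS=, read -r idx current pending; do
        # The csv format puts one space after each comma; strip it.
        current="${current# }"
        pending="${pending# }"
        if [ "$current" = "Disabled" ]; then
            echo "GPU $idx: ECC disabled"
        elif [ "$current" = "Enabled" ] && [ "$pending" = "Disabled" ]; then
            echo "GPU $idx: ECC still enabled, reboot pending"
        elif [ "$current" = "Enabled" ]; then
            echo "GPU $idx: ECC enabled"
        else
            # Anything else (for example a GPU that reports no ECC support).
            echo "GPU $idx: ECC state unknown ($current/$pending)"
        fi
    done
}

# Usage on the host (assumes these query fields are available):
# nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv,noheader | ecc_report
```

A result of "reboot pending" means nvidia-smi -e 0 has been accepted but the host has not yet been restarted.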
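3) VMware's published experience indicates that this error can occur when the ESXi host graphics mode for the GPU is set to Shared (vSGA) rather than Shared Direct (vGPU). The fix is to switch the mode to Shared Direct and then restart the xorg service (/etc/init.d/xorg restart).
On inspection, this problem did not exist in this environment. The site uses NVIDIA Tesla T4 graphics cards.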
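The graphics mode check in step 3 can also be done from the host shell instead of the vSphere Client. This is a sketch under assumptions: it assumes that esxcli graphics host get prints a line of the form "Default Graphics Type: SharedPassthru" (Shared Direct) or "Default Graphics Type: Shared" (vSGA). The function name graphics_type_check is hypothetical.

```shell
#!/bin/sh
# Sketch: classify the host default graphics type.
# Reads the output of "esxcli graphics host get" on stdin.
# graphics_type_check is a hypothetical helper name.
graphics_type_check() {
    # Keep only the value that follows "Default Graphics Type:".
    gtype=$(sed -n 's/^ *Default Graphics Type: *//p')
    case "$gtype" in
        SharedPassthru) echo "SharedPassthru: Shared Direct (vGPU)" ;;
        Shared)         echo "Shared: vSGA, vGPU VMs may fail to power on" ;;
        *)              echo "unknown graphics type: $gtype" ;;
    esac
}

# Usage on the host:
# esxcli graphics host get | graphics_type_check
#
# If the type is Shared, the change described in step 3 would, to my
# understanding, be (verify against the esxcli reference for your build):
# esxcli graphics host set --default-type SharedPassthru
# /etc/init.d/xorg restart
```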
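4) On site, the faulty VM was migrated to a working host, where it powered on and service was restored.
5) The GPU configuration on the original host was then modified and reset.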
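Migrating the powered-on VM back to the original host failed with an error.
In summary, the fault was caused by a conflict between the ESXi host's own GPU configuration and the VM's vGPU profile definition. The host therefore needs to be put into maintenance mode for troubleshooting, up to and including reinstalling the vgx VIB.
6) Check the VM's advanced configuration: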
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"
By default only 3 entries are present; try adding the 2 entries above.
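Before editing the VM, it can be useful to see which of the two MMIO entries are actually missing from its .vmx file. The following is a minimal sketch. The function name missing_mmio_keys is hypothetical, and the path in the usage line is an assumption.

```shell
#!/bin/sh
# Sketch: list which of the two 64-bit MMIO settings from step 6
# are absent from a .vmx file.
# missing_mmio_keys is a hypothetical helper name.
missing_mmio_keys() {
    vmx_file="$1"
    for key in pciPassthru.use64bitMMIO pciPassthru.64bitMMIOSizeGB; do
        # -q: no output, exit status only. -i: ignore case as a precaution.
        # The dots in the key act as regex wildcards, which is loose but
        # good enough for a presence check.
        if ! grep -qi "^ *${key} *=" "$vmx_file"; then
            echo "$key"
        fi
    done
}

# Usage (the path is an assumption; substitute the real datastore and VM folder):
# missing_mmio_keys /vmfs/volumes/DATASTORE/VM_NAME/VM_NAME.vmx
```

Empty output means both entries are already present. Any key that is printed still has to be added, with the VM powered off, through the VM's advanced configuration parameters.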