Debugging Issues
Hosts may report XID 119 or XID 120 errors. To troubleshoot them efficiently, capture debug logs with the provided script, started after a reboot and before any workload starts.
XID 119: The GPU System Processor (GSP) has entered a hang state.
XID 120: The GSP has crashed.
The script monitors the specified XIDs (e.g., 119, 120) and captures additional debug logs when one of them occurs.
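To check by hand whether either XID has already been logged, the kernel log can be filtered for NVIDIA Xid messages. The sketch below is not part of the provided script. It assumes the usual driver log format `NVRM: Xid (PCI:...): <number>, ...`, and the helper names `xid_numbers` and `has_gsp_xid` are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the NVIDIA script.

# Read kernel log lines on stdin and print the XID number of every
# NVIDIA Xid message, one per line.
# Assumes the usual format: "NVRM: Xid (PCI:0000:3b:00): 119, ..."
xid_numbers() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Succeed if the log lines on stdin contain XID 119 or XID 120.
has_gsp_xid() {
    xid_numbers | grep -q -E '^(119|120)$'
}
```

On a Linux host this could be used as `dmesg | has_gsp_xid && echo "XID 119/120 found"`. On ESXi the log source differs, so reading from `dmesg` is a Linux-side assumption.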
Next Steps
- Driver Setup: Install vGPU driver version 16.6 or 17.2, or NVIDIA AI Enterprise (NV-AIE) 5.0 or 4.2, on hosts with Hopper or Ada architecture GPUs.
- Run Script:
  - Execute the script after rebooting and before starting workloads.
  - Specify the XIDs to monitor (e.g., 119, 120):
    - Linux:
      ./nvidia-xid-monitor-linux.sh 119,120
    - ESXi:
      ./nvidia-xid-monitor-vmware.sh 119,120
  - The script generates a bug report and saves it with a timestamp.
- Wait for Logs:
  - After XID 119 or 120 occurs, do not reboot, shut down, or migrate VMs until the script has finished generating logs.
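The run step above can be wrapped in a small launcher that picks the right script for the host and rejects a malformed XID list before starting. This is a minimal sketch under stated assumptions: the function names are hypothetical, the monitor scripts are assumed to sit in the current directory, and ESXi is assumed to report `VMkernel` from `uname -s`.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Map a kernel name (as printed by `uname -s`) to the monitor script.
# Assumes ESXi reports "VMkernel"; verify this on your host.
monitor_script_for() {
    case "$1" in
        Linux)    echo "./nvidia-xid-monitor-linux.sh" ;;
        VMkernel) echo "./nvidia-xid-monitor-vmware.sh" ;;
        *)        return 1 ;;
    esac
}

# Accept only a comma-separated list of numbers, e.g. "119,120".
valid_xid_list() {
    case "$1" in
        ""|*[!0-9,]*|,*|*,|*,,*) return 1 ;;
        *) return 0 ;;
    esac
}

# Launch the monitor. Call this right after a reboot, before workloads.
start_monitor() {
    xids="${1:-119,120}"
    valid_xid_list "$xids" || { echo "bad XID list: $xids" >&2; return 1; }
    script="$(monitor_script_for "$(uname -s)")" || {
        echo "unsupported host OS" >&2; return 1; }
    "$script" "$xids"
}
```

Calling `start_monitor 119,120` then behaves like the manual commands above. Whether the monitor script stays in the foreground is not stated here, so the sketch simply runs it directly.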
For more detailed instructions and additional information, visit the full article here.
When troubleshooting vGPU-related issues, providing the correct logs and diagnostic information is essential for faster resolution.
Next Steps
- Generate an NVIDIA bug report on the host and attach the generated log file when contacting support.
- Collect system information from the Windows VM, save the report as a .nfo file, and attach it when requesting support.
- Use nvidia-smi commands for debugging:
  - Monitor GPU usage by application:
    nvidia-smi vgpu -p
  - List frame buffer capture (FBC) sessions:
    nvidia-smi vgpu -fs
  - Check encoder session usage (should be minimal for vGPU workloads):
    nvidia-smi vgpu -es
Attach the nvidia-bug-report from the host and the msinfo32 report from the Windows VM when reporting an issue. This ensures a faster and more accurate diagnosis.
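The host-side collection can be gathered into one directory before opening a support case. The following is a minimal sketch, not an official tool: the function names, file names, and the default `./vgpu-support` directory are hypothetical. `nvidia-smi vgpu -p` is left out on the assumption that it keeps printing samples until interrupted, which would block the script.

```shell
#!/bin/sh
# Sketch only: function, file, and directory names are hypothetical.

# snapshot <output-dir> <label> <command...>
# Writes the command's stdout and stderr to
# <output-dir>/<label>-<timestamp>.txt and prints the file path.
snapshot() {
    snap_dir="$1"; snap_label="$2"; shift 2
    mkdir -p "$snap_dir" || return 1
    snap_out="$snap_dir/$snap_label-$(date +%Y%m%d-%H%M%S).txt"
    "$@" > "$snap_out" 2>&1
    echo "$snap_out"
}

# Collect the host-side data listed above into one directory.
collect_host_diagnostics() {
    out_dir="${1:-./vgpu-support}"
    snapshot "$out_dir" encoder-sessions nvidia-smi vgpu -es
    snapshot "$out_dir" fbc-sessions     nvidia-smi vgpu -fs
    # nvidia-bug-report.sh ships with the driver and, by default, writes
    # nvidia-bug-report.log.gz into the current directory.
    (cd "$out_dir" && nvidia-bug-report.sh)
}
```

Running `collect_host_diagnostics /tmp/vgpu-case` as root would leave the session snapshots and the bug report side by side, ready to attach together with the .nfo file from the Windows VM.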