Until recently I was running Ubuntu 19.10 with a Win 10 VM that had a GPU passed through, and everything worked fine. A few days ago I did a clean installation of 20.04 and have not been able to make the same setup work. I was hoping I could just import the old config and everything would work, but that was not the case. At this point I have reached a stage where the VM boots and can use the GPU (GTX 1080) as long as no driver is installed. However, once I try to install the driver, the screen eventually goes black and stays that way until the VM reboots into recovery.
I’m not sure which details would be most helpful for debugging, as obviously there is a lot of config involved. Initially I followed that guide (for 18.04 in the past, and now for 20.04), then picked up different pieces here and there.
Since `vfio` was moved into the kernel, my grub boot line now looks like this:

```
intel_iommu=on iommu=pt apparmor=0 security='' vfio-pci.ids=10de:1b80,10de:10f0 pcie_aspm=off vfio_iommu_type1.allow_unsafe_interrupts=1 vfio_pci.disable_vga=1 vfio_pci.disable_idle_d3=1 acpi=off intremap=no_x2apic_optout nox2apic
```
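On Ubuntu these parameters typically live in `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and are applied with `sudo update-grub` plus a reboot. A small sketch for confirming they actually reached the running kernel (the helper name is mine):

```shell
# has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears (word-delimited) on the kernel command line.
# The optional second argument lets the check run against any file;
# the real command line is /proc/cmdline.
has_kernel_param() {
    grep -qw -- "$1" "${2:-/proc/cmdline}"
}

# Example: after rebooting with the new grub config.
if has_kernel_param "intel_iommu=on"; then
    echo "intel_iommu=on is active"
else
    echo "intel_iommu=on is missing from the command line"
fi
```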
I can see that the GPU is controlled by `vfio-pci`, and the following is passed to `kvm`:

```
options kvm ignore_msrs=1
```

(verified by checking the module's parameters).
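The usual place for that option is a modprobe drop-in; a sketch (the file name here is my own choice):

```
# /etc/modprobe.d/kvm.conf (hypothetical file name)
options kvm ignore_msrs=1
```

After a reboot (or a module reload), `/sys/module/kvm/parameters/ignore_msrs` should read `Y`.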
To fight the NVIDIA driver error 43 I added the changes suggested here.
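I can't reproduce the linked changes here, but the commonly used libvirt workaround for error 43 is to hide the hypervisor from the NVIDIA driver; a sketch of the relevant domain-XML fragment (the `vendor_id` value is arbitrary, up to 12 characters):

```xml
<features>
  <hyperv>
    <vendor_id state='on' value='1234567890ab'/>
  </hyperv>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>
```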
Comparing the new setup on 20.04 to the old one on 19.10 I’d say there are two major changes:
- `vfio` was moved into the kernel, so instead of passing the parameters to the module I now pass them to the kernel directly. That seems to work, as I wrote above.
- I had to switch from `i440fx` to `q35`. With `i440fx` the VM refused to acknowledge the presence of the monitor connected to the passed-through GPU; it acknowledged the GPU itself, although with an error status, but could not use it, and any attempt to install the NVIDIA driver led to a BSOD. With `q35` it is able to use the screen as long as no driver is installed.
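For reference, that switch amounts to changing the machine type in the libvirt domain XML; roughly (the exact machine version depends on the installed QEMU, here assuming QEMU 4.2):

```xml
<os>
  <type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
</os>
```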
Other parts of the configuration did not change and that’s the main reason why I did not provide additional details above since for now I still assume that they are “correct” and “should work” as is.
I tried creating a fresh VM but it has the same issue: once I try to install the nvidia driver in the VM the screen goes black at some point and the VM crashes after a little while. If I stop passing through the GPU then I can remove the driver and then use the GPU again.
I’m not able to find anything useful in the logs. I do see some things that may be related but I’m not sure:
```
Jul 12 00:05:56 homenest libvirtd: host doesn't support hyperv 'relaxed' feature
Jul 12 00:05:56 homenest libvirtd: host doesn't support hyperv 'vapic' feature
```
Disabling those features makes the messages disappear, but makes no difference for the NVIDIA driver in the VM.
```
Jul 12 00:06:16 homenest kernel: [ 5476.135668] kvm : vcpu0, guest rIP: 0xfffff803343862f7 ignored rdmsr: 0x611
Jul 12 00:06:16 homenest kernel: [ 5476.135672] kvm : vcpu0, guest rIP: 0xfffff8033438630d ignored rdmsr: 0x641
Jul 12 00:06:16 homenest kernel: [ 5476.135674] kvm : vcpu0, guest rIP: 0xfffff80334386323 ignored rdmsr: 0x606
Jul 12 00:06:16 homenest kernel: [ 5476.135676] kvm : vcpu0, guest rIP: 0xfffff80334386134 ignored rdmsr: 0x606
Jul 12 00:06:16 homenest kernel: [ 5476.135679] kvm : vcpu0, guest rIP: 0xfffff803343811bc ignored rdmsr: 0x641
Jul 12 00:06:16 homenest kernel: [ 5476.135681] kvm : vcpu0, guest rIP: 0xfffff80334381207 ignored rdmsr: 0x611
```
That is some spam that can be safely ignored, as far as I can tell.
```
Jul 12 18:42:48 homenest kernel: [  212.184333] vfio-pci 0000:05:00.0: enabling device (0100 -> 0103)
Jul 12 18:42:48 homenest kernel: [  212.184520] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Jul 12 18:42:48 homenest kernel: [  212.186952] vfio-pci 0000:05:00.1: enabling device (0100 -> 0102)
```
I don’t know, and could not find out, what the `vfio_ecap_init` message means, but the other two lines suggest that `vfio` successfully passes the devices through.
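One more thing worth checking when passthrough misbehaves is the IOMMU grouping of the GPU and its audio function. A small sketch (the function name and the overridable root argument are mine; on a real system the default sysfs path applies):

```shell
# list_iommu_groups [GROUPS_ROOT]
# Prints every PCI device per IOMMU group; the GPU (0000:05:00.0) and its
# HDMI audio function (0000:05:00.1) should show up in a sane grouping.
# GROUPS_ROOT is overridable so the function can be exercised on a fake tree.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue
        group=$(basename "$(dirname "$(dirname "$dev")")")
        echo "IOMMU group $group: $(basename "$dev")"
    done
}

list_iommu_groups
```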
Sadly I don’t have the info about the versions of various components on 19.10. Now I have the following:
```
Compiled against library: libvirt 6.0.0
Using library: libvirt 6.0.0
Using API: QEMU 6.0.0
Running hypervisor: QEMU 4.2.0
5.4.0-40-generic
```
I wanted to test older kernels, but with `5.3.18-050318-generic` the system failed to boot (it could not boot from ZFS), and with `5.4.0-26-generic` the system booted but later appeared to be missing some kernel modules, so the VM did not start at all. Since I was not sure these kernels would actually make a difference, I did not feel like investing time into fixing those issues.
So that is it in terms of an overview of the problem. I would appreciate any pointers as to what to try and where to look to fix the issue. If any additional info is needed I would be happy to provide it.
Additional experiments revealed that the problem gets triggered on its own: if the VM with a passed-through GPU is left idle for just a few minutes, it will crash, the screen will go black, etc. After that it can be “recovered” by booting it without the GPU passed through. I also tried passing through another GPU, and the behavior is exactly the same.
Also I found another message that now appears in the logs:
```
kernel: [  608.945246] vfio-pci 0000:03:00.0: vfio_bar_restore: reset recovery - restoring BARs
```
I still need to try a bare-metal Windows installation and an Ubuntu 19.10 installation, where the setup used to work fine. I just need to find time to run all these experiments…