NVidia drivers not running on AWS after restarting the AMI


I have the following problem:

I started a P2 instance with this AMI and installed some tools like screen, torch, etc. I then successfully ran some experiments on the GPU and created an image of the instance, so that I could terminate it and launch it again later.

Later I started a new instance from the AMI I had created. Everything looked fine (screen, torch, and my experiments were all present on the system), but I couldn't run the same experiments as before:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

To me it looks like the drivers might be installed (because all the other tools I installed are still there), but they are not running. Is that a correct assumption? How can I start them?

Zrike answered 23/10, 2016 at 8:52 Comment(1)
I noticed that the kernel changed, from 4.4.0-1049-aws to 4.4.0-1061-aws. - Underlie

We had this problem recently. In our case, it seems that the default kernel on the AWS instance was upgraded (from 4.4.0-1049-aws to 4.4.0-1061-aws), but the new kernel did not have the nvidia modules installed:

ubuntu@ip-XXX-XXX-XXX-XXX:~$ ls -laR /lib/modules/4.4.0-1061-aws | grep -i nvidia
ubuntu@ip-XXX-XXX-XXX-XXX:~$ ls -laR /lib/modules/4.4.0-1049-aws | grep -i nvidia
-rw-r--r--  1 root root    87368 Jun 27 10:21 nvidia-drm.ko
-rw-r--r--  1 root root  1155304 Jun 27 10:21 nvidia-modeset.ko
-rw-r--r--  1 root root  1163016 Jun 27 10:21 nvidia-uvm.ko
-rw-r--r--  1 root root 18014088 Jun 27 10:21 nvidia.ko

Check your kernel version (uname -a) to see if this is the case for you. The GRUB configuration still allowed booting the old kernel image (1049), but by default it loaded the new one (1061). The relevant portion of /boot/grub/grub.cfg:

ubuntu@ip-XXX-XXX-XXX-XXX:~$ grep -i -e "ubuntu, with linux" /boot/grub/grub.cfg
    menuentry 'Ubuntu, with Linux 4.4.0-1061-aws' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1061-aws-advanced-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1061-aws (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1061-aws-recovery-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1049-aws' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1049-aws-advanced-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1049-aws (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1049-aws-recovery-XXXX' {
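
The check described above (does the running kernel have nvidia modules at all?) can be sketched as a small script. This is only an illustration: has_nvidia_modules and report_kernels are hypothetical helper names, and the script assumes the modules live somewhere under /lib/modules/<kernel release> with file names starting with "nvidia", as in the listing above.

```shell
#!/bin/sh
# Succeeds if the module tree of the given kernel release contains any
# nvidia*.ko file (also matches compressed variants such as .ko.xz).
# usage: has_nvidia_modules KERNEL_RELEASE [MODULES_ROOT]
has_nvidia_modules() {
    hnm_kernel="$1"
    hnm_root="${2:-/lib/modules}"
    [ -n "$(find "$hnm_root/$hnm_kernel" -name 'nvidia*.ko*' 2>/dev/null | head -n 1)" ]
}

# Lists every kernel release found under MODULES_ROOT and says whether
# nvidia modules were found for it.
# usage: report_kernels [MODULES_ROOT]
report_kernels() {
    rk_root="${1:-/lib/modules}"
    for rk_dir in "$rk_root"/*/; do
        [ -d "$rk_dir" ] || continue
        rk_kernel=$(basename "$rk_dir")
        if has_nvidia_modules "$rk_kernel" "$rk_root"; then
            echo "$rk_kernel: nvidia modules present"
        else
            echo "$rk_kernel: nvidia modules MISSING"
        fi
    done
}

echo "running kernel: $(uname -r)"
report_kernels
```

If the running kernel is reported as MISSING while an older one is reported as present, you are most likely in the situation described in this answer.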

You can force the next reboot to load the old kernel by using grub-reboot:

sudo /usr/sbin/grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-1049-aws"
sudo reboot

This will boot the instance with the old kernel, for which you have nvidia modules.
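
The string passed to grub-reboot has to match the menu entry title in grub.cfg, so a typo in the kernel version makes it miss its target. A minimal sketch that builds the string only if a matching entry exists is shown here. grub_entry_for is a hypothetical helper, and the submenu name "Advanced options for Ubuntu" is copied from the command above rather than read from the file, so treat it as an assumption about your image.

```shell
#!/bin/sh
# Prints the grub-reboot argument for a kernel release if grub.cfg contains
# a (non-recovery) menu entry for it; returns non-zero otherwise.
# usage: grub_entry_for KERNEL_RELEASE [GRUB_CFG]
grub_entry_for() {
    gef_kernel="$1"
    gef_cfg="${2:-/boot/grub/grub.cfg}"
    gef_title="Ubuntu, with Linux $gef_kernel"
    # -F: fixed-string match, so the dots in the version are not regex wildcards.
    # The closing quote keeps the "(recovery mode)" entry from matching.
    if grep -qF "menuentry '$gef_title'" "$gef_cfg" 2>/dev/null; then
        echo "Advanced options for Ubuntu>$gef_title"
    else
        echo "no menu entry for $gef_kernel in $gef_cfg" >&2
        return 1
    fi
}

# Example (not executed here):
#   entry=$(grub_entry_for 4.4.0-1049-aws) && sudo grub-reboot "$entry" && sudo reboot
```

After the reboot, uname -r should report the old kernel and nvidia-smi should be able to talk to the driver again.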

Prissy answered 2/7, 2018 at 15:25 Comment(5)
Still relevant for kernel version 4.4.0-1077-aws. I followed the instructions and reverted the kernel to 4.4.0-1075-aws. - Armalda
Although what @Armalda says makes the GPU available, it's not a perfect solution, since the result might not be compatible with what you had installed before. For example, I get this error when I run my PyTorch code: "The NVIDIA driver on your system is too old (found version 9000). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver". So caution is needed. - Conclusion
Just to add, while 4.4.0-1075-aws results in nvidia driver not being available. - Conclusion
Great answer, saved my day! - Ramin
Actually, it seems that this solution only works for the next reboot and doesn't change the kernel permanently. To change it permanently, I deleted the latest kernel (which I didn't want to use) following this advice: askubuntu.com/a/329943 - Ramin

Reinstalling the nvidia driver solved the problem.
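
The answer does not say how the driver was reinstalled. As one possible route, if the driver on the image was installed as a DKMS package on an apt-based system (an assumption, check with dkms status), rebuilding its modules for the running kernel might look roughly like the sketch below. run and rebuild_dkms_modules are hypothetical helper names, and DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Runs a command, or just prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Rebuilds DKMS-managed modules (assumed to include nvidia) for the
# kernel that is currently running.
rebuild_dkms_modules() {
    rdm_release=$(uname -r)
    # Kernel headers for the running kernel are needed to build modules.
    run sudo apt-get install -y "linux-headers-$rdm_release"
    # Build and install all DKMS-registered modules for the running kernel.
    run sudo dkms autoinstall
    # Check whether the driver is reachable (a reboot may still be needed).
    run nvidia-smi
}

# Example (not executed here):
#   DRY_RUN=1 rebuild_dkms_modules   # preview the commands
#   rebuild_dkms_modules             # actually run them
```

If the driver was installed from NVIDIA's .run installer without DKMS, this will not help and the installer has to be run again for the new kernel.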

Zrike answered 23/10, 2016 at 15:29 Comment(3)
Did you figure out how to be able to reboot without having to reinstall the drivers? - Boozer
What kind of answer is this? If you answer your own question, then at least provide more details, so other people can learn from you. - Bufford
@MichaelIV It's short, but that doesn't mean it's bad. I don't think there is much more to say: reinstalling the NVIDIA drivers solved the OP's problem, that's it. Maybe he could add steps like "1. Uninstall NVIDIA drivers 2. Reinstall NVIDIA drivers 3. Reboot", but I'm unsure whether it's useful to tell people on SO how to reinstall drivers. - Guidebook
