geoffwilliams@home:~$

Nvidia drivers on Debian and Ubuntu

Official Documentation:

Don't forget to enable the GPU in UEFI/BIOS!

Debian

  1. Enable non-free and contrib for all repos. E.g., update every entry in /etc/apt/sources.list:
# old
deb http://deb.debian.org/debian bookworm main non-free-firmware

# change to
deb http://deb.debian.org/debian bookworm main non-free-firmware non-free contrib
  2. apt update && apt install nvidia-driver firmware-misc-nonfree
  3. For Secure Boot: mokutil --import /var/lib/dkms/mok.pub
  4. Xorg -configure
  5. cp /root/xorg.conf.new /etc/X11/xorg.conf
  6. echo -e 'Section "Device"\n\tIdentifier "My GPU"\n\tDriver "nvidia"\nEndSection' > /etc/X11/xorg.conf.d/20-nvidia.conf
  7. Settings take effect after reboot
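
Editing every entry by hand is error-prone, so here is a minimal sketch of the sources.list change as a script. The helper name add_apt_components is made up for illustration (it is not a Debian tool), and it assumes the classic one-line "deb ..." format rather than deb822 .sources files. It only prints the result, so you can review it before replacing anything:

```sh
# Append any missing components to every "deb"/"deb-src" line of a
# one-line-style sources.list file. The result goes to stdout; the
# input file is not modified.
# add_apt_components is a hypothetical helper name, not a Debian tool.
add_apt_components() (
    file=$1
    shift
    awk -v want="$*" '
        BEGIN { n = split(want, comps, " ") }
        /^deb(-src)?[ \t]/ {
            for (i = 1; i <= n; i++) {
                found = 0
                # type, URI and suite come first; scanning from field 4
                # covers the components (exact match, so
                # "non-free-firmware" does not count as "non-free")
                for (f = 4; f <= NF; f++)
                    if ($f == comps[i]) found = 1
                if (!found) $0 = $0 " " comps[i]
            }
        }
        { print }
    ' "$file"
)
```

Something like add_apt_components /etc/apt/sources.list non-free contrib > /tmp/sources.list.new lets you diff the two files first, then copy the new one into place as root.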

Ubuntu

Ubuntu does most of the setup for you via the Ubuntu Drivers click-through GUI install tool, and most of the time this is fine. It's still possible to mess things up so badly that your laptop can't even get to a terminal, though! This is a collection of fixes I've had to do in the past, recovered from root's ~/.bash_history. You can see the driver version change as I'm forced to do an emergency upgrade.

My latest Nvidia disaster seemed to have been caused by most of the Nvidia packages being randomly removed from the system. Not sure if this was an errant corporate compliance script, bad packaging or sheer bad luck. It was so bad I even had to reset my XFCE session as well for some reason.

Usually some combination of rebooting, updating the UEFI/BIOS, updating device firmware with fwupd and these tricks will get you up and running again:

Laptop won't boot!

Eliminate hardware problem

EFI/BIOS vendor splash screen/messages appear and screen looks OK? - hardware probably fine. If not (black screen), there are very limited options these days:

  1. Remove the battery (if not glued… sigh…) wait… replace… reboot
  2. Reset via the tiny hole if you have one - look carefully, I found one on my laptop in a different location
  3. Perhaps the laptop screen died? Boot with an external monitor
  4. Wiggle the screen. Flashing or distorted graphics mean screen is not completely broken but has physical problem (bad/loose/kinked cable, cracked, etc)
  5. Take Laptop apart and remove any disk drive. Not joking this has happened to me before, a 2.5” SSD failed completely at random after a reboot and whatever happened was so bad that not only was the drive instantly toast, it also prevented the Laptop from booting at all. If the laptop now magically boots, replace drive and reinstall OS.
  6. No idea. Anything spilt inside the computer? At this point you've confirmed bad hardware

Linux problem

Black screen/crashed AFTER the UEFI/BIOS vendor splash screen/messages AND the grub menu (assuming your system is configured to show one at all…). Could it be the Nvidia driver? Maybe.

If you already configured grub to boot Linux into text mode, you should see messages indicating where boot is failing, which should be helpful. If not, you may be stuck with a completely black screen. Pressing Esc should show the boot messages, but it's possible your laptop has completely frozen. Trying to toggle Caps Lock will often prove this, as will the magic SysRq keys:

  1. alt + sysrq/print screen + s - emergency sync
  2. alt + sysrq/print screen + u - emergency remount read-only
  3. alt + sysrq/print screen + b - emergency reboot
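
Whether these keys do anything depends on the kernel.sysrq setting, which is worth checking in advance while the system still works. Below is a small sketch that decodes the value. The function name sysrq_allows is made up for illustration; the bit values are the ones listed in the kernel's SysRq documentation:

```sh
# Report whether a SysRq function is permitted by a kernel.sysrq value.
# $1 = value read from /proc/sys/kernel/sysrq
# $2 = bit for the function: 16 = sync, 32 = remount read-only,
#      128 = reboot/poweroff
# A value of 1 enables every function; 0 disables SysRq entirely.
sysrq_allows() {
    [ "$1" -eq 1 ] || [ $(( $1 & $2 )) -ne 0 ]
}
```

For example, sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 128 && echo "SysRq reboot permitted". A value such as 176 (128 + 32 + 16) permits exactly the three functions used above.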

If nothing happens, it's safe to say your laptop OS is toast for the moment, so hold the power button for 10 seconds and go find an Ubuntu Live bootable USB matching the laptop OS.

Disconnect all external monitors, boot the USB and mount the host filesystems - see notes in fix/setup grub for how to do this including if your system is using LUKS.

With a chroot into your Laptop, you can try the steps below:

ubuntu-drivers CLI

If you're lucky, you may still have access to a terminal somewhere. In this case you can try to use ubuntu-drivers to install the Nvidia drivers:

ubuntu-drivers list
# eg:
ubuntu-drivers install nvidia-driver-525-server

If this looks like it did something, then reboot and hope for the best. Sometimes, though, this command will tell you the drivers are already installed when they aren't, or at least they aren't working. Use nvidia-smi to test (not in Live USB).

Try to fix nvidia packages by remove/reinstall

Remove the nvidia drivers and then try to reinstall them yourself using apt, like this:

# use dpkg and grep to find nvidia related packages, eg:
apt remove xserver-xorg-video-nvidia-525 nvidia-prime nvidia-settings screen-resolution-extra

# remove random crap/free up space
apt autoremove

Make sure all available firmwares are installed:

apt install linux-firmware linux-firmware-nonfree firmware-linux-misc

Then try to reinstall the drivers:

apt install nvidia-driver-530 nvidia-dkms-530

DKMS should rebuild the nvidia modules for you and update initramfs. Verify the module files exist for your kernel (pick the right version yourself):

# eg:
find /lib/modules/5.19.0-45-generic -iname "*nvidia*"
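
If you need to check several kernels (or a chroot), the same check can be wrapped up as a function. This is only a sketch: nvidia_modules_for is a made-up name, and the second argument exists purely so the modules directory can be pointed somewhere other than /lib/modules:

```sh
# List nvidia kernel module files for one kernel version.
# $1 = kernel version (e.g. the output of `uname -r`)
# $2 = modules root, defaults to /lib/modules
# `grep .` passes the paths through and makes the exit status non-zero
# when nothing was found, so the function can be used in `if`.
nvidia_modules_for() {
    find "${2:-/lib/modules}/$1" -type f -iname 'nvidia*.ko*' 2>/dev/null | grep .
}
```

Usage would be something like nvidia_modules_for "$(uname -r)" || echo "no nvidia modules for this kernel". In a Live USB chroot uname -r reports the Live kernel, so pass the installed kernel's version explicitly.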

Missing files? Try to force a recompile (DKMS is supposed to do this though):

dpkg-reconfigure nvidia-kernel-source-525
dpkg-reconfigure nvidia-dkms-530

Force rebuilding initramfs:

update-initramfs -u

Nvidia driver fails to build - not enough free space in /boot

Oldschool install guides say to use just a few hundred M for /boot which is barely enough for one jumbo Ubuntu kernel let alone a handful. A typical 700M /boot can only support about 2 kernels so upgrades are risky.

Long term, plan to increase /boot to about 2G by shrinking the main (LUKS?) partition, moving it left/right and growing the /boot partition. That leaves room to keep a few known good kernels on hand and lets routine upgrades succeed.

The quick fix here is to free up space in /boot by removing old/unused kernels.

Find installed kernels:

dpkg -l |grep linux-image

Kernel running now (irrelevant for Live USB) - don't remove this kernel unless you know what you're doing:

uname -a
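
To avoid picking the wrong package by eye, the two listings can be combined. The sketch below uses a made-up helper, removable_kernels, that filters dpkg -l output down to installed, versioned linux-image packages other than the one you want to keep. It only prints names and removes nothing:

```sh
# Read `dpkg -l` output on stdin and print installed ("ii") versioned
# linux-image packages that do not belong to the kernel being kept.
# $1 = kernel version to keep (e.g. the output of `uname -r`)
# Metapackages such as linux-image-generic are skipped because the
# pattern requires a digit after "linux-image-".
removable_kernels() {
    awk -v keep="$1" '
        $1 == "ii" && $2 ~ /^linux-image-[0-9]/ && index($2, keep) == 0 { print $2 }
    '
}
```

Run it as dpkg -l | removable_kernels "$(uname -r)", and check df -h /boot before and after removing anything. From a Live USB chroot, pass the version you intend to boot instead of uname -r.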

Remove an old kernel and its modules. Removal may fail due to lack of space; in that case, keep removing old kernels until there is enough space for the scripts to run:

apt remove --purge linux-image-5.19.0-46-generic linux-modules-5.19.0-46-generic

Reinstall the running kernel. This is for experts only, but as you might have gathered from this post, the whole procedure is. There is a good chance of breaking your system badly enough to need a rescue USB here, but this is a good way to re-run post-install scripts for the current kernel that previously failed due to lack of free space on /boot, if enough is now available:

dpkg --force-all --purge linux-image-5.19.0-46-generic linux-modules-5.19.0-46-generic
apt install linux-image-5.19.0-46-generic linux-modules-5.19.0-46-generic

# You can also reinstall using apt, from my history I did this too. Pretty sure
# this was deadlocked due to lack of space in `/boot` which meant only `dpkg`
# worked
apt install --reinstall linux-image-5.19.0-46-generic linux-modules-5.19.0-46-generic

Nvidia driver built but refuses to modprobe/errors in dmesg

Kernel module files were generated, included in initramfs, definitely using Nvidia GPU in UEFI/BIOS and double checked Nvidia GPU hardware is really present in laptop SKU?

Could be something to do with Secure Boot… For you to find out. The phrase to google is “enrolling a Machine Owner Key”, or MOK.

Disabling Secure Boot is a quick workaround but it's a pain to turn on and off. Business users should probably leave it on too…
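
Before digging into MOK enrolment it helps to confirm whether Secure Boot is actually on. A minimal sketch, assuming mokutil is installed: mokutil --sb-state prints a line such as "SecureBoot enabled", and the made-up helper below just turns that into an exit status:

```sh
# Interpret the output of `mokutil --sb-state` supplied on stdin.
# Succeeds when Secure Boot is reported as enabled.
secure_boot_enabled() {
    grep -qi '^SecureBoot enabled'
}
```

For example: mokutil --sb-state | secure_boot_enabled && echo "Secure Boot is on, so the nvidia modules must be signed with an enrolled key". mokutil --list-enrolled then shows which keys the firmware currently trusts.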

Reboot

That's about all of my ideas for fixing Nvidia drivers - all that's left to do is reboot and hope for the best.

Give up/use Nouveau

Sadly, nouveau does not allow my external display to work, but if it's stable and works for you, more power to you.

Something like this should switch drivers or at least get you back to a command prompt:

# remove all nvidia packages
apt install xserver-xorg-video-nouveau
rm /etc/modprobe.d/blacklist-nvidia-nouveau.conf
update-initramfs -u
reboot
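
After the reboot you can confirm which kernel driver actually loaded. This is a small sketch with a made-up function name, loaded_gpu_driver, that reads text in /proc/modules format (module name in the first column):

```sh
# Read /proc/modules-style text on stdin and print which GPU kernel
# driver is loaded: "nvidia", "nouveau" or "none".
# If both somehow appear, "nvidia" wins.
loaded_gpu_driver() {
    awk '
        $1 == "nvidia"  { d = "nvidia" }
        $1 == "nouveau" { if (d == "") d = "nouveau" }
        END { print (d == "" ? "none" : d) }
    '
}
```

Usage: loaded_gpu_driver < /proc/modules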

Laptop suspend/resume crashes

Before suspecting Nvidia driver:

  1. Ensure you are running the latest UEFI/BIOS
  2. Ensure “Windows/Linux” sleep mode is set in UEFI/BIOS if available. If this doesn't work, try Linux/S3

Black/frozen screen on resume? Not sleeping (also check you configured power management to suspend in the first place…)? Completely dead system? Kernel command line options seem to be able to fix this for some models of ThinkPad at least:

Edit /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="init_on_alloc=0 intel_iommu=off enable_mtrr_cleanup mtrr_spare_reg_nr=4 text"

This updates the default Linux menu item. If you still have problems, you can easily avoid these options by selecting the rescue entry instead.

Don't forget to run update-grub after changing /etc/default/grub.
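
It is easy to edit the file and forget update-grub, so after rebooting it is worth checking that the options really reached the kernel. A minimal sketch, with a made-up helper name, that looks for one parameter as a whole word in a command line string:

```sh
# Succeed when a kernel command line contains the given parameter as a
# whole word.
# $1 = command line text (e.g. the contents of /proc/cmdline)
# $2 = parameter to look for, e.g. intel_iommu=off
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: has_kernel_param "$(cat /proc/cmdline)" intel_iommu=off || echo "option not active - did update-grub run?"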

Testing the Nvidia driver

  1. External display working? (some laptops only support external display with the proprietary Nvidia driver)
  2. nvidia-smi should output something like:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| N/A   53C    P8    N/A /  N/A |      0MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  3. glxgears loads and displays some gears
  4. glxinfo | grep render lists vendor as Nvidia. In my case it's listed as Mesa, so the system is not fully accelerated, but I have very limited requirements since I'm not doing 3D work on this machine. This may be something to do with Nvidia Optimus (prime)
  5. Test if suspend/resume works
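
For the glxinfo check, the one line that matters is the "OpenGL renderer string". A tiny sketch (gl_renderer is a made-up name) that pulls it out of glxinfo output:

```sh
# Read `glxinfo` output on stdin and print the OpenGL renderer string.
gl_renderer() {
    sed -n 's/^OpenGL renderer string: *//p'
}
```

Usage: glxinfo | gl_renderer. On Optimus laptops the default renderer is often the integrated GPU. As far as I know, NVIDIA's PRIME render offload documentation describes running a single program on the Nvidia GPU with __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia set in its environment. Treat that as an assumption to verify against your driver version.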
