Ubuntu 18.04.1: CUDA 10.1 installation updates the NVIDIA driver to 455, which is not compatible with TensorFlow

My TensorFlow 2.3.1 setup with CUDA 10.1 was working fine until I mistakenly updated the NVIDIA drivers and CUDA.

These are the steps I am using to install CUDA 10.1:

  1. Purge all cuda and nvidia drivers

sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"

sudo apt-get --purge remove "nvidia*"

sudo apt-get autoremove

sudo apt-get autoclean

sudo rm -rf /usr/local/cuda*

Reboot

  1. After this I follow instructions from tensorflow page

wget

sudo apt-key adv --fetch-keys

sudo dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb

sudo apt-get update

wget

sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

sudo apt-get update

  1. sudo apt-get install --no-install-recommends nvidia-driver-450

  2. sudo apt-get install --no-install-recommends cuda-10-1

This creates two folders in /usr/local: cuda-10.1 and cuda-10.2.

At this step, it also removes the 450 driver and installs 455. This is part of the output I get:

The following packages will be REMOVED: libnvidia-cfg1-450 libnvidia-compute-450 libnvidia-decode-450 libnvidia-encode-450 libnvidia-extra-450 libnvidia-fbc1-450 libnvidia-gl-450 libnvidia-ifr1-450 nvidia-compute-utils-450 nvidia-dkms-450 nvidia-driver-450 nvidia-kernel-common-450 nvidia-kernel-source-450 nvidia-utils-450 xserver-xorg-video-nvidia-450
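One way to see this driver swap before it happens is to let apt simulate the install. The sketch below is only an illustration: the function names are made up for this example, and it relies on `apt-get -s`, which prints `Inst`/`Remv` lines without changing anything on the system.

```shell
#!/bin/sh
# Keep only the NVIDIA driver packages that a simulated apt run would
# install (Inst) or remove (Remv). Reads apt-get -s output on stdin.
nvidia_changes() {
    grep -E '^(Inst|Remv) (lib|xserver-xorg-video-)?nvidia' || true
}

# Hypothetical helper: dry-run an install and show only the driver changes.
# "apt-get -s" simulates the transaction; nothing is installed or removed.
preview_install() {
    apt-get -s install --no-install-recommends "$@" | nvidia_changes
}
```

Running `preview_install cuda-10-1` should list the same 450 packages shown above as `Remv` lines. As an assumption to check against your own repository, NVIDIA's repo also ships a `cuda-toolkit-10-1` meta-package that is documented as not including the driver; previewing it the same way would show whether it leaves nvidia-driver-450 alone.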

If I go ahead and install libcudnn7 and TensorFlow:

sudo apt-get install --no-install-recommends \
    libcudnn7=7.6.5.32-1+cuda10.1 \
    libcudnn7-dev=7.6.5.32-1+cuda10.1

I get this in Python:

tf.config.list_physical_devices("GPU")

2020-10-07 13:10:02.262260: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 450.80.2 does not match DSO version 455.23.5 -- cannot find working devices in this configuration
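The message reads as though the 450 kernel module was still loaded while the 455 user-space libraries were already on disk. A rough way to compare the two is sketched below. The library directory is an assumption (Ubuntu's usual x86_64 multiarch path), the function names are invented for this example, and `/sys/module/nvidia/version` is only present while the module is loaded.

```shell
#!/bin/sh
# Print the driver branch of a version string, e.g. 450.80.02 gives 450.
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Version of the currently loaded kernel module (prints nothing if not loaded).
loaded_module_version() {
    if [ -r /sys/module/nvidia/version ]; then
        cat /sys/module/nvidia/version
    fi
}

# Version taken from the file name of the user-space libcuda,
# e.g. libcuda.so.455.23.05. The directory below is an assumption.
userspace_lib_version() {
    for f in /usr/lib/x86_64-linux-gnu/libcuda.so.*.*; do
        [ -e "$f" ] && printf '%s\n' "${f##*libcuda.so.}"
    done | head -n 1
}

kernel=$(loaded_module_version)
lib=$(userspace_lib_version)
if [ -z "$kernel" ] || [ -z "$lib" ]; then
    echo "module not loaded or libcuda not found (kernel='$kernel', library='$lib')"
elif [ "$kernel" = "$lib" ]; then
    echo "kernel module and libcuda agree: $kernel"
else
    echo "mismatch: branch $(driver_major "$kernel") module vs branch $(driver_major "$lib") library"
fi
```

If the two branches differ, a reboot (so the newly installed module gets loaded) or reinstalling a single driver branch is the usual direction to look in.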

To fix this I tried

  1. uninstalling 455

sudo apt purge nvidia-455*

and reinstalling TensorFlow. Now I get this error in Python:

tf.config.list_physical_devices("GPU")

2020-10-07 13:20:46.923513: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-07 13:20:46.959289: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-07 13:20:46.959608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5 coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-07 13:20:46.959626: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-07 13:20:46.959769: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
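To see which of these libraries the dynamic linker can resolve, the linker cache can be queried directly. This is a minimal sketch: `missing_libs` is a made-up helper name, and the library names are the ones from the log above plus libcudnn.so.7.

```shell
#!/bin/sh
# Read "ldconfig -p" style output on stdin and print every library name
# given as an argument that does not appear in it.
missing_libs() {
    cache=$(cat)
    for lib in "$@"; do
        # The trailing space stops libcublas.so.10 from matching libcublas.so.10.2.1
        printf '%s\n' "$cache" | grep -qF "$lib " || printf '%s\n' "$lib"
    done
}

if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | missing_libs libcudart.so.10.1 libcublas.so.10 libcudnn.so.7
else
    echo "ldconfig not on PATH; try /sbin/ldconfig"
fi
```

Any name it prints is either not installed or installed in a directory the linker does not search. With the apt packages for CUDA 10.1, libcublas is often reported to land under /usr/local/cuda-10.2/lib64, which would fit the second folder mentioned earlier; treat that as an assumption and confirm it with `find /usr/local -name 'libcublas.so*'`.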

How can I fix this? Thanks.


1 Answer

Terrance's reply helped fix the driver-upgrade issue, but I also had to install additional packages and set up the config files.

This helped with the additional steps.

These are the steps I used for CUDA 10.1 with the NVIDIA 450 driver on Ubuntu 18.04.

Steps:

Before installing CUDA from the runfile, we need to install the driver.

##Driver: 450, as per the TensorFlow requirement; 455 doesn't work with the current TensorFlow version

  1. sudo apt-get install --no-install-recommends nvidia-driver-450

##get runfile for cuda 10.1

  1. wget

##install dependencies

  1. sudo apt-get install freeglut3 freeglut3-dev libxi-dev libxmu-dev

##Follow installation steps by running following

  1. sudo sh cuda_10.1.243_418.87.00_linux.run

#The installer warns about a preexisting driver; continue.

#Select everything except the driver in the menu. CUDA will then be installed; check with ls /usr/local, which should now show the folder cuda-10.1.

  1. Create bash file for cuda profile

#you can use any text editor, but writing under /etc requires root

sudo vim /etc/profile.d/cuda.sh

##add the following lines to this file to add path

export PATH=$PATH:/usr/local/cuda-10.1/bin

export CUDADIR=/usr/local/cuda-10.1

##Create another file so the dynamic linker can find the CUDA libraries (this takes the place of setting LD_LIBRARY_PATH)

sudo vim /etc/ld.so.conf.d/cuda.conf

#add this line

/usr/local/cuda-10.1/lib64

#run

sudo ldconfig
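If you would rather not edit the two files by hand, the same content can be generated by a script. The sketch below writes them under a staging directory so the result can be inspected first. `write_cuda_config` and its arguments are invented for this example; only the file paths and file contents come from the steps above.

```shell
#!/bin/sh
# Write cuda.sh and cuda.conf under a root directory.
# $1: root to write under (a scratch directory, or empty for the real /etc)
# $2: CUDA install directory (defaults to /usr/local/cuda-10.1)
write_cuda_config() {
    root=${1:-}
    cuda_dir=${2:-/usr/local/cuda-10.1}
    mkdir -p "$root/etc/profile.d" "$root/etc/ld.so.conf.d"
    # Single quotes keep $PATH literal, so it is expanded at login time.
    printf 'export PATH=$PATH:%s/bin\nexport CUDADIR=%s\n' "$cuda_dir" "$cuda_dir" \
        > "$root/etc/profile.d/cuda.sh"
    printf '%s/lib64\n' "$cuda_dir" > "$root/etc/ld.so.conf.d/cuda.conf"
}

# Stage the files in a temporary directory and show them.
stage=$(mktemp -d)
write_cuda_config "$stage"
cat "$stage/etc/profile.d/cuda.sh" "$stage/etc/ld.so.conf.d/cuda.conf"
```

After checking the staged files, copy them into /etc with sudo and run sudo ldconfig as above. A new login shell is needed before the PATH change takes effect.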

  1. For Cudnn, use these steps for tar file installation

These are 4 commands

tar -xzvf cudnn-10.1-linux-x64-v7.6.5.32.tgz

sudo cp cuda/include/cudnn*.h /usr/local/cuda/include

sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64

sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
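A quick check that the header copied above is the version you expect is to read the version macros out of it. This is a sketch that assumes cuDNN 7.x, where `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` are defined in cudnn.h itself; the function name is made up for this example.

```shell
#!/bin/sh
# Print MAJOR.MINOR.PATCHLEVEL from a cuDNN header.
# $1: header path (defaults to the location used in the cp commands above)
cudnn_version() {
    header=${1:-/usr/local/cuda/include/cudnn.h}
    [ -r "$header" ] || { echo "not found: $header" >&2; return 1; }
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$header"
}

# Report the installed version; do not abort if the header is missing.
cudnn_version || true
```

For the tar file used here the result should correspond to the 7.6.5 in its name. Note that /usr/local/cuda is expected to be a symlink to cuda-10.1; if it is not, pass the cuda-10.1 include path explicitly.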

  1. If you get this error while using tf

failed call to cuInit: CUDA_ERROR_UNKNOWN

#use this

sudo apt install nvidia-modprobe

  1. If somebody wants to install tensorRT, these links are helpful

Why do I get "/sbin/ldconfig.real: /usr/local/cuda/lib64/libcudnn.so.7 is not a symbolic link"?
