My TensorFlow 2.3.1 setup with CUDA 10.1 was working fine until I mistakenly updated the NVIDIA drivers and CUDA.
These are the steps I am using to install CUDA 10.1:
- Purge all CUDA packages and NVIDIA drivers
sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"
sudo apt-get --purge remove "nvidia*"
sudo apt-get autoremove
sudo apt-get autoclean
sudo rm -rf /usr/local/cuda*
Reboot
- After this I follow the instructions from the TensorFlow page
wget
sudo apt-key adv --fetch-keys
sudo dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-get update
wget
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-driver-450
sudo apt-get install --no-install-recommends cuda-10-1
This creates two folders in /usr/local: cuda-10.1 and cuda-10.2.
At this step it also removes the 450 driver and installs 455. This is part of the output I get:
The following packages will be REMOVED: libnvidia-cfg1-450 libnvidia-compute-450 libnvidia-decode-450 libnvidia-encode-450 libnvidia-extra-450 libnvidia-fbc1-450 libnvidia-gl-450 libnvidia-ifr1-450 nvidia-compute-utils-450 nvidia-dkms-450 nvidia-driver-450 nvidia-kernel-common-450 nvidia-kernel-source-450 nvidia-utils-450 xserver-xorg-video-nvidia-450
If I go ahead and install libcudnn7 and TensorFlow:
sudo apt-get install --no-install-recommends \
    libcudnn7=7.6.5.32-1+cuda10.1 \
    libcudnn7-dev=7.6.5.32-1+cuda10.1
I get this in Python:
tf.config.list_physical_devices("GPU")
2020-10-07 13:10:02.262260: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 450.80.2 does not match DSO version 455.23.5 -- cannot find working devices in this configuration
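The message above compares the version of the loaded kernel module with the version of the user-space driver library (the DSO). As a minimal sketch, the two versions can be pulled out of such a log line and compared. The function name `parse_version_mismatch` is hypothetical, and the pattern only assumes the wording shown in the log line above.

```python
import re

# Pattern assumes the exact wording of the TensorFlow log line quoted in this post.
_MISMATCH = re.compile(
    r"kernel version (?P<kernel>[\d.]+) does not match DSO version (?P<dso>[\d.]+)"
)


def parse_version_mismatch(line):
    """Return (kernel_version, dso_version) if the line reports a mismatch, else None."""
    match = _MISMATCH.search(line)
    if match is None:
        return None
    return match.group("kernel"), match.group("dso")


log_line = (
    "2020-10-07 13:10:02.262260: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] "
    "kernel version 450.80.2 does not match DSO version 455.23.5 -- "
    "cannot find working devices in this configuration"
)

versions = parse_version_mismatch(log_line)
if versions is not None:
    kernel, dso = versions
    print(f"kernel module: {kernel}, user-space libraries: {dso}")
```

Here the kernel module is still 450 while the libraries are 455, which usually means the driver packages changed without the matching module being loaded.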
To fix this I tried:
- Uninstalling 455
sudo apt purge nvidia-455*
- Reinstalling TensorFlow. Now I get this error in Python:
tf.config.list_physical_devices("GPU")
2020-10-07 13:20:46.923513: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-07 13:20:46.959289: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-07 13:20:46.959608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-07 13:20:46.959626: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-07 13:20:46.959769: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
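The warning in this log means the dynamic loader could not find `libcublas.so.10`. A rough way to check this outside TensorFlow is to try opening each library with `ctypes`. This is only a sketch: the list `REQUIRED_LIBS` and the function `check_libraries` are hypothetical names, and the list contains only library names that appear in this post, not everything TensorFlow loads.

```python
import ctypes

# Assumption: only the library names mentioned in this post; TensorFlow loads more.
REQUIRED_LIBS = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcudnn.so.7",
]


def check_libraries(names):
    """Try to open each shared library; return {name: True if it loaded, else False}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the loader cannot find or open the library.
            results[name] = False
    return results


for name, ok in check_libraries(REQUIRED_LIBS).items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any library reported as missing is either not installed or installed in a directory the loader does not search.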
How can I fix this? Thanks.
1 Answer
Terrance's reply helped fix the driver upgrade issue, but I had to install additional packages and set up the config files.
This helped with the additional steps.
These are the steps I used for CUDA 10.1 with the NVIDIA 450 driver on Ubuntu 18.04.
Steps:
Before installing CUDA from the runfile, we need to install the driver.
##Driver, as per the TensorFlow requirement; 455 doesn't work with the current TensorFlow version
- sudo apt-get install --no-install-recommends nvidia-driver-450
##get runfile for cuda 10.1
- wget
##install dependencies
- sudo apt-get install freeglut3 freeglut3-dev libxi-dev libxmu-dev
##Run the installer and follow its steps
- sudo sh cuda_10.1.243_418.87.00_linux.run
#The installer warns about the preexisting driver; continue
#Select everything except the driver in the menu and CUDA will be installed
#Use ls /usr/local to confirm the folder cuda-10.1 exists
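The same check can be scripted. The sketch below lists every `cuda*` entry under `/usr/local`, which also shows leftovers such as an unwanted `cuda-10.2` folder. The helper name `list_cuda_dirs` is hypothetical.

```python
from pathlib import Path


def list_cuda_dirs(root="/usr/local"):
    """Return the sorted names of cuda* entries under root (empty list if root is missing)."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(entry.name for entry in base.glob("cuda*"))


print(list_cuda_dirs())
```

After a runfile install of CUDA 10.1, the list would be expected to include `cuda-10.1`.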
- Create a shell script for the CUDA profile
#you can use any text editor
sudo vim /etc/profile.d/cuda.sh
##add the following lines to this file to set the path
export PATH=$PATH:/usr/local/cuda-10.1/bin
export CUDADIR=/usr/local/cuda-10.1
##Create another file for the library path (this takes the place of setting LD_LIBRARY_PATH)
sudo vim /etc/ld.so.conf.d/cuda.conf
#add this line
/usr/local/cuda-10.1/lib64
#run
sudo ldconfig
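To confirm that `ldconfig` picked up the new directory, the cache can be listed with `ldconfig -p` and filtered for the CUDA path. The sketch below parses that listing. The function names are hypothetical, and the `sample` text is an illustrative stand-in, not real output from this machine.

```python
import subprocess


def parse_ldconfig(output):
    """Map library name -> resolved path from the text printed by `ldconfig -p`."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # skips the "N libs found in cache" header line
        left, path = line.split("=>", 1)
        parts = left.split()
        if not parts:
            continue
        # Keep the first entry for each name, which is the one listed first in the cache.
        libs.setdefault(parts[0], path.strip())
    return libs


def cuda_entries(libs, prefix="/usr/local/cuda-10.1/lib64"):
    """Keep only the libraries that resolve to a path under the CUDA lib directory."""
    return {name: path for name, path in libs.items() if path.startswith(prefix)}


def read_ldconfig_cache():
    """Run `ldconfig -p` (Linux only) and parse its output."""
    result = subprocess.run(
        ["ldconfig", "-p"], capture_output=True, text=True, check=True
    )
    return parse_ldconfig(result.stdout)


# Illustrative sample only; on a real system call read_ldconfig_cache() instead.
sample = (
    "2 libs found in cache `/etc/ld.so.cache'\n"
    "\tlibcublas.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/lib64/libcublas.so.10\n"
    "\tlibc.so.6 (libc6,x86-64) => /lib/x86_64-linux-gnu/libc.so.6\n"
)
print(cuda_entries(parse_ldconfig(sample)))
```

If `cuda_entries(read_ldconfig_cache())` comes back empty on the real system, the `cuda.conf` file was probably not read or `sudo ldconfig` was not run.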
- For cuDNN, use these steps for the tar file installation
These are the 4 commands:
tar -xzvf cudnn-10.1-linux-x64-v7.6.5.32.tgz
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
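After copying the files, it can be worth checking which cuDNN version the header actually declares. The sketch below reads the `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` macros from `cudnn.h`, which is where cuDNN 7.x defines them. The function names are hypothetical, the default path assumes `/usr/local/cuda` points at the CUDA 10.1 install, and `sample` is an illustrative excerpt rather than the full header.

```python
import re
from pathlib import Path


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn.h text, or None if a macro is missing."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            rf"^\s*#define\s+{macro}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def installed_cudnn_version(header="/usr/local/cuda/include/cudnn.h"):
    """Read the header at the assumed install path; return None if it is not there."""
    path = Path(header)
    if not path.is_file():
        return None
    return parse_cudnn_version(path.read_text(errors="ignore"))


# Illustrative excerpt of the three macros, not the full header.
sample = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 6\n#define CUDNN_PATCHLEVEL 5\n"
print(parse_cudnn_version(sample))
print(installed_cudnn_version())
```

For the tar file used above, the installed header would be expected to report version 7.6.5.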
- If you get this error while using TensorFlow
failed call to cuInit: CUDA_ERROR_UNKNOWN
#use this
sudo apt install nvidia-modprobe
- If somebody wants to install TensorRT, these links are helpful:
Why do I get "/sbin/ldconfig.real: /usr/local/cuda/lib64/libcudnn.so.7 is not a symbolic link"?