Look what I've done to the servers!

Git and GitHub

  • Git usually comes preinstalled on Ubuntu server installs, so there is normally nothing to install by hand; it is ready to use as is.

git config --global user.name "Your Name"
git config --global user.email "Your Email"

ssh-keygen -t rsa -C "Email of GitHub Account"

(base) houjinliang@3080server:~/userdoc/d2cv$ git config --global user.name 'hjl_3080server'
(base) houjinliang@3080server:~/userdoc/d2cv$ git config --global user.email 'cosmicdustycn@outlook.com'
(base) houjinliang@3080server:~/userdoc/d2cv$ ssh-keygen -t rsa -C "cosmicdustycn@outlook.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/mnt/houjinliang/.ssh/id_rsa):
Created directory '/mnt/houjinliang/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /mnt/houjinliang/.ssh/id_rsa.
Your public key has been saved in /mnt/houjinliang/.ssh/id_rsa.pub.
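  • There is no need to worry that this Git user configuration will affect other users on the server: --global settings are written to your own ~/.gitconfig, and the keys generated by ssh-keygen are stored under your own account's home directory.
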
(dlpy310pth113) houjinliang@3080server:~/.ssh$ pwd
/mnt/houjinliang/.ssh
(dlpy310pth113) houjinliang@3080server:~/.ssh$ ll
total 20
drwx------ 2 houjinliang houjinliang 4096 11月 1 10:19 ./
drwxr-xr-x 12 houjinliang houjinliang 4096 11月 1 10:17 ../
-rw------- 1 houjinliang houjinliang 1675 11月 1 10:17 id_rsa
-rw-r--r-- 1 houjinliang houjinliang 407 11月 1 10:17 id_rsa.pub
-rw-r--r-- 1 houjinliang houjinliang 444 11月 1 10:19 known_hosts
  • Copy the contents of id_rsa.pub and add them as an SSH key in GitHub's Settings, as shown below.

image-20231101105539214

image-20231102171332923

ssh -T git@github.com
Hi murphyhoucn! You've successfully authenticated, but GitHub does not provide shell access.
(base) houjinliang@3080server:~/userdoc$ git clone git@github.com:murphyhoucn/DeepLearningforCV.git
(base) houjinliang@3080server:~/userdoc/DeepLearningforCV$ git status
(base) houjinliang@3080server:~/userdoc/DeepLearningforCV$ git add .
(base) houjinliang@3080server:~/userdoc/DeepLearningforCV$ git commit -m "add new file"
(base) houjinliang@3080server:~/userdoc/DeepLearningforCV$ git push

Checking GPU usage

nvidia-smi

image-20231101110320173

gpustat

GitHub - wookayin/gpustat: 📊 A simple command-line utility for querying and monitoring GPU status

(dlpy310pth113) houjinliang@3080server:~/userdoc$ pip install gpustat
(dlpy310pth113) houjinliang@3080server:~/userdoc$ gpustat

image-20231101110430041

nvitop

GitHub - XuehaiPan/nvitop: An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.

nvitop: the most powerful real-time GPU monitoring tool ever - Zhihu (zhihu.com)

(dlpy310pth113) houjinliang@3080server:~$ pip install nvitop
Requirement already satisfied: nvitop in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (1.3.0)
Requirement already satisfied: nvidia-ml-py<12.536.0a0,>=11.450.51 in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (from nvitop) (12.535.108)
Requirement already satisfied: psutil>=5.6.6 in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (from nvitop) (5.9.5)
Requirement already satisfied: cachetools>=1.0.1 in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (from nvitop) (5.3.1)
Requirement already satisfied: termcolor>=1.0.0 in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (from nvitop) (2.3.0)

image-20231101110731169

Clash for Linux

Tutorial: setting up command-line Clash on Ubuntu - Zhihu (zhihu.com)

The right way to speed up the terminal with a proxy (Clash) | Ln’s Blog (weilining.github.io)

2024.01.10

gunzip clash-linux-amd64-v1.18.0.gz
mv clash-linux-amd64-v1.18.0 clash
chmod u+x clash
./clash

Write the subscription's configuration into ~/.config/clash/config.yaml

image-20240110153702123

Add the following to `~/.bashrc`:

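# Route HTTP(S) traffic from the current shell through the local Clash port
function proxy() {
    export http_proxy=http://127.0.0.1:7890
    export https_proxy=$http_proxy
    echo "proxy on!"
}
# Stop using the proxy in the current shell
function unproxy() {
    unset http_proxy https_proxy
    echo "proxy off"
}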
(base) houjinliang@3080server:~/userdoc$ source ~/.bashrc
(base) houjinliang@3080server:~/userdoc$ proxy
proxy on!
(base) houjinliang@3080server:~/userdoc$ unproxy
proxy off
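
Because proxy and unproxy only change environment variables in the current shell, it is easy to lose track of which state a given terminal is in. A small companion function can report it. This is only a sketch: the name proxy_status is hypothetical and was not part of the original setup.

```shell
# Hypothetical helper (not in the original ~/.bashrc): report the proxy state
# of the current shell by inspecting the variable that proxy/unproxy manage.
proxy_status() {
    if [ -n "${http_proxy:-}" ]; then
        echo "proxy on: $http_proxy"
    else
        echo "proxy off"
    fi
}

proxy_status
```

Placed next to the two functions in ~/.bashrc, it gives a quick check before running wget or pip.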
(base) houjinliang@3080server:~/userdoc$ wget www.zhihu.com
URL transformed to HTTPS due to an HSTS policy
--2024-01-10 15:33:59-- https://www.zhihu.com/
Connecting to 127.0.0.1:7890... connected.
Proxy request sent, awaiting response... 302 Found
Location: //www.zhihu.com/signin?next=%2F [following]
URL transformed to HTTPS due to an HSTS policy
--2024-01-10 15:33:59-- https://www.zhihu.com/signin?next=%2F
Reusing existing connection to www.zhihu.com:443.
Proxy request sent, awaiting response... 200 OK
Length: 39879 (39K) [text/html]
Saving to: ‘index.html’

index.html 100%[===================================================================================================================>] 38.94K --.-KB/s in 0.04s

2024-01-10 15:33:59 (944 KB/s) - ‘index.html’ saved [39879/39879]

(base) houjinliang@3080server:~/userdoc$ wget www.google.com
--2024-01-10 15:34:14-- http://www.google.com/
Connecting to 127.0.0.1:7890... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.1’

index.html.1 [ <=> ] 18.72K --.-KB/s in 0.07s

2024-01-10 15:34:16 (257 KB/s) - ‘index.html.1’ saved [19169]

3080Server - MMDetection

  • Ubuntu 18.04.6 LTS
  • gcc version 7.5.0
  • CUDA 11.3
  • cuDNN 8.9.5

MMDetection

Versions were chosen with reference to this image:

open-mmlab/mmdetection3d/mmdetection3d-1.1: mmdetection3d-1.1 version - CG (codewithgpu.com)

image-20240105160510106

CUDA 11.3.1 & CUDNN 8.9.5

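I had previously installed CUDA 11.6, but it later felt a bit too new; after looking at some examples I decided to go back to CUDA 11.3. The first step was to uninstall CUDA 11.6. After searching I could not find a method that worked for me, so I simply ran rm -rf cuda-11.6 to delete the CUDA files and then reinstalled. (The installer summary further down also mentions a cuda-uninstaller in the toolkit's bin directory.)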

CUDA Toolkit 11.3 Update 1 Downloads | NVIDIA Developer

wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run
sudo sh cuda_11.3.1_465.19.01_linux.run

Installing CUDA and cuDNN as a non-root user - Zhihu (zhihu.com)

image-20240105154958922
image-20240105154857088
image-20240105155054178
image-20240105154820598
image-20240105154838934
image-20240105154907733-17193723323271
image-20240105155142558
(base) houjinliang@3080server:~/userdoc/cuda_and_cudnn$ sh ./cuda_11.3.1_465.19.01_linux.run

= Summary =
Driver: Not Selected Toolkit: Installed in /mnt/houjinliang/cuda-11.3/ Samples: Not Selected
Please make sure that
PATH includes /mnt/houjinliang/cuda-11.3/bin
LD_LIBRARY_PATH includes /mnt/houjinliang/cuda-11.3/lib64, or, add /mnt/houjinliang/cuda-11.3/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /mnt/houjinliang/cuda-11.3/bin ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 465.00 is required for CUDA 11.3 functionality to work. To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file: sudo <CudaInstaller>.run --silent --driver
Logfile is /tmp/cuda-installer.log
vim ~/.bashrc

```
# cuda environment variables
# murphy insert
# CUDA_HOME names a single directory, so assign it directly instead of appending to it
export CUDA_HOME=/mnt/houjinliang/cuda-11.3
export PATH=$PATH:/mnt/houjinliang/cuda-11.3/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/mnt/houjinliang/cuda-11.3/lib64
```
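
One side effect of plain appends such as PATH=$PATH:... is that every `source ~/.bashrc` adds the same directory again. A guard can make the block idempotent. The following is a minimal sketch, not what was actually used on this server: path_append is a hypothetical helper name, while the directories are the ones printed in the installer summary above.

```shell
# Hypothetical helper: append a directory to a colon-separated variable
# only if it is not already listed. Usage: path_append VAR_NAME DIRECTORY
path_append() {
    eval "_current=\${$1:-}"          # read the variable named by $1
    case ":$_current:" in
        *":$2:"*) ;;                  # already present: leave it alone
        *) export "$1=${_current:+$_current:}$2" ;;   # no leading colon when empty
    esac
}

# CUDA_HOME is a single directory, not a list
export CUDA_HOME=/mnt/houjinliang/cuda-11.3
path_append PATH "$CUDA_HOME/bin"
path_append LD_LIBRARY_PATH "$CUDA_HOME/lib64"
```

Sourcing the file repeatedly then leaves PATH and LD_LIBRARY_PATH unchanged after the first time.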
(base) houjinliang@3080server:~$ source ~/.bashrc

(base) houjinliang@3080server:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
(base) houjinliang@3080server:~/userdoc/cuda_and_cudnn$ tar xvJf cudnn-linux-x86_64-8.9.5.29_cuda11-archive.tar.xz


(py38mmdetection) houjinliang@3080server:~/userdoc/cuda_and_cudnn/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ ll
total 48
drwxr-xr-x 4 houjinliang houjinliang 4096 8月 3 2022 ./
drwxrwxr-x 3 houjinliang houjinliang 4096 1月 5 16:32 ../
drwxr-xr-x 2 houjinliang houjinliang 4096 8月 3 2022 include/
drwxr-xr-x 2 houjinliang houjinliang 4096 8月 3 2022 lib/
-rw-r--r-- 1 houjinliang houjinliang 28994 8月 3 2022 LICENSE


(py38mmdetection) houjinliang@3080server:~/userdoc/cuda_and_cudnn/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ cp lib/* ~/cuda-11.3/lib64/
(py38mmdetection) houjinliang@3080server:~/userdoc/cuda_and_cudnn/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ cp include/* ~/cuda-11.3/include

chmod +x ~/cuda-11.3/include/cudnn.h
chmod +x ~/cuda-11.3/lib64/libcudnn*

(base) houjinliang@3080server:~$ cat ~/cuda-11.3/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */
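
The grep above prints the three defines separately. If a single version string is more convenient, for example in a setup script, the same header can be parsed with awk. This is a sketch that assumes the header keeps the `#define CUDNN_MAJOR 8` layout shown above; the function name cudnn_version is made up for illustration.

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCHLEVEL from a cudnn_version.h file.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$1"
}

header="$HOME/cuda-11.3/include/cudnn_version.h"   # path used in this post
if [ -f "$header" ]; then
    cudnn_version "$header"
fi
```

With the header contents shown above, this prints 8.9.5.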

PyTorch 1.11

(base) houjinliang@3080server:~$ conda create -n py38mmdetection python=3.8 -y
(base) houjinliang@3080server:~$ conda activate py38mmdetection
(py38mmdetection) houjinliang@3080server:~$ conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch

(py38mmdetection) houjinliang@3080server:~$ python
Python 3.8.18 (default, Sep 11 2023, 13:40:15)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True

Aliyun (Alibaba Cloud) PyPI mirror

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple

mmdet installation

Get Started - MMDetection 3.3.0 documentation

3080Server - MMYOLO

Overview — MMYOLO 0.6.0 documentation

(base) houjinliang@3080server:~$ conda create -n py38mmyolo python=3.8

(base) houjinliang@3080server:~$ conda activate py38mmyolo
(py38mmyolo) houjinliang@3080server:~$ pip config list
global.index-url='https://mirrors.aliyun.com/pypi/simple'

(py38mmyolo) houjinliang@3080server:~$ conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch

(py38mmyolo) houjinliang@3080server:~$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
1.11.0
True
pip install -U openmim
mim install "mmengine>=0.6.0"
mim install "mmcv>=2.0.0rc4,<2.1.0"
mim install "mmdet>=3.0.0,<4.0.0"
mim install "mmyolo"
git clone https://github.com/open-mmlab/mmyolo.git
cd mmyolo
# Install albumentations
pip install -r requirements/albu.txt
# Install MMYOLO
mim install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.

(base) houjinliang@3080server:~/userdoc/offlinefile$ wget http://images.cocodataset.org/zips/val2017.zip
--2024-01-10 16:17:46-- http://images.cocodataset.org/zips/val2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 3.5.7.141, 52.216.215.25, 52.216.185.83, ...
Connecting to images.cocodataset.org (images.cocodataset.org)|3.5.7.141|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 815585330 (778M) [application/zip]
Saving to: ‘val2017.zip’

val2017.zip 100%[===================================================================================================================>] 777.80M 3.89MB/s in 2m 22s
2024-01-10 16:20:08 (5.48 MB/s) - ‘val2017.zip’ saved [815585330/815585330]

Checking directory disk usage

  • View the sizes of files and folders (for a directory, ll shows only the size of the directory entry itself, not of its contents)

(py38mmyolo) houjinliang@3080server:~/userdoc/offlinefile$ ll
total 26251480
drwxrwxr-x 6 houjinliang houjinliang 4096 1月 10 21:36 ./
drwxrwxr-x 9 houjinliang houjinliang 4096 1月 10 15:59 ../
-rw-rw-r-- 1 houjinliang houjinliang 3996930 1月 10 14:43 clash-linux-amd64-v1.18.0.gz
drwxr-xr-x 5 houjinliang houjinliang 4096 8月 26 2022 coco/
-rw-rw-r-- 1 houjinliang houjinliang 6983030 1月 10 17:00 coco128.zip
-rw-rw-r-- 1 houjinliang houjinliang 48639045 1月 10 16:21 coco2017labels.zip
-rw-rw-r-- 1 houjinliang houjinliang 4372979 1月 10 14:48 curl-8.5.0.tar.gz
-rw-rw-r-- 1 houjinliang houjinliang 12353723 1月 5 16:32 pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
drwxrwxr-x 2 houjinliang houjinliang 1429504 8月 31 2017 test2017/
-rw-rw-r-- 1 houjinliang houjinliang 6646970404 1月 10 17:47 test2017.zip
drwxrwxr-x 2 houjinliang houjinliang 4112384 8月 31 2017 train2017/
-rw-rw-r-- 1 houjinliang houjinliang 19336861798 1月 10 21:35 train2017.zip
drwxrwxr-x 2 houjinliang houjinliang 167936 8月 31 2017 val2017/
-rw-rw-r-- 1 houjinliang houjinliang 815585330 7月 11 2018 val2017.zip

(py38mmyolo) houjinliang@3080server:~/userdoc/offlinefile$ ll -hl
total 26G
drwxrwxr-x 6 houjinliang houjinliang 4.0K 1月 10 21:36 ./
drwxrwxr-x 9 houjinliang houjinliang 4.0K 1月 10 15:59 ../
-rw-rw-r-- 1 houjinliang houjinliang 3.9M 1月 10 14:43 clash-linux-amd64-v1.18.0.gz
drwxr-xr-x 5 houjinliang houjinliang 4.0K 8月 26 2022 coco/
-rw-rw-r-- 1 houjinliang houjinliang 6.7M 1月 10 17:00 coco128.zip
-rw-rw-r-- 1 houjinliang houjinliang 47M 1月 10 16:21 coco2017labels.zip
-rw-rw-r-- 1 houjinliang houjinliang 4.2M 1月 10 14:48 curl-8.5.0.tar.gz
-rw-rw-r-- 1 houjinliang houjinliang 12M 1月 5 16:32 pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
drwxrwxr-x 2 houjinliang houjinliang 1.4M 8月 31 2017 test2017/
-rw-rw-r-- 1 houjinliang houjinliang 6.2G 1月 10 17:47 test2017.zip
drwxrwxr-x 2 houjinliang houjinliang 4.0M 8月 31 2017 train2017/
-rw-rw-r-- 1 houjinliang houjinliang 19G 1月 10 21:35 train2017.zip
drwxrwxr-x 2 houjinliang houjinliang 164K 8月 31 2017 val2017/
-rw-rw-r-- 1 houjinliang houjinliang 778M 7月 11 2018 val2017.zip

  • To see the total space used by the current directory, together with the space used by each first-level subdirectory (add -a to include files as well), use du

(py38mmyolo) houjinliang@3080server:~$ du -h --max-depth=1
6.5M ./.config
8.0K ./.conda
1.1G ./.vscode-server
12G ./cuda-11.3
86G ./userdoc
8.0K ./.gnupg
16K ./.ssh
8.0K ./.nv
2.7G ./.cache
24G ./miniconda3
125G .
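
du lists entries in directory order, which makes the largest ones hard to spot. Piping the output through sort -h, which understands the human-readable suffixes that du -h prints, puts the biggest entries first. A small sketch: the function name du_sorted is only illustrative, and GNU du and sort (the defaults on Ubuntu) are assumed.

```shell
# Illustrative wrapper: per-directory usage one level deep, largest first.
# Assumes GNU du and sort (the defaults on Ubuntu).
du_sorted() {
    du -h --max-depth=1 "${1:-.}" | sort -hr
}

du_sorted .
```

Run in the home directory above, this would list the total first, followed by the subdirectories from largest to smallest.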

3090Server

  • Ubuntu 18.04.6 LTS
  • gcc version 7.5.0
  • CUDA 11.3
  • cuDNN 8.9.5

System details

Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-150-generic x86_64)
Model name: Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)

NV Driver

(base) houjinliang@3090server:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Jul 20 14:00:58 UTC 2022
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

Home directory

houjinliang@3090server:~$ ll
total 40
drwxr-xr-x 4 houjinliang houjinliang 4096 6月 26 10:18 ./
drwxrwxrwx 21 super super 4096 6月 26 10:17 ../
-rw-r--r-- 1 houjinliang houjinliang 220 4月 5 2018 .bash_logout
-rw-r--r-- 1 houjinliang houjinliang 3771 4月 5 2018 .bashrc
drwx------ 2 houjinliang houjinliang 4096 6月 26 10:18 .cache/
-rw-r--r-- 1 houjinliang houjinliang 8980 4月 16 2018 examples.desktop
drwx------ 3 houjinliang houjinliang 4096 6月 26 10:18 .gnupg/
-rw-r--r-- 1 houjinliang houjinliang 807 4月 5 2018 .profile

Miniconda

Download the Miniconda installer script, make it executable, then run it.

houjinliang@3090server:~/MyDownloadFiles$ wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
houjinliang@3090server:~/MyDownloadFiles$ chmod +x Miniconda3-latest-Linux-x86_64.sh
houjinliang@3090server:~/MyDownloadFiles$ ./Miniconda3-latest-Linux-x86_64.sh

During installation you are asked to choose an install location; just accept the default.

# Default install location
Miniconda3 will now be installed into this location:
/mnt/houjinliang/miniconda3

- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below

[/mnt/houjinliang/miniconda3] >>>

image-20241023134128208

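Answer yes here and the installer configures ~/.bashrc automatically; close the terminal and open a new one, and the prompt will be prefixed with (base).

If you answer no, the conda command is not recognized in the terminal after installation, and the block below has to be added to ~/.bashrc by hand.

# Configure the Miniconda environment (adjust the paths to your setup)
(base) houjinliang@3090server:~$ vim ~/.bashrc

# If you are not comfortable with vim, open ~/.bashrc in VS Code instead, then append the Miniconda configuration at the end of the file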


# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/mnt/houjinliang/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/mnt/houjinliang/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/mnt/houjinliang/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/mnt/houjinliang/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<


houjinliang@3090server:~$ source ~/.bashrc
(base) houjinliang@3090server:~$

Check Miniconda's basic information.

# Show basic Miniconda information
(base) houjinliang@3090server:~$ conda info

active environment : base
active env location : /mnt/houjinliang/miniconda3
shell level : 1
user config file : /mnt/houjinliang/.condarc
populated config files :
conda version : 24.4.0
conda-build version : not installed
python version : 3.12.3.final.0
solver : libmamba (default)
virtual packages : __archspec=1=broadwell
__conda=24.4.0=0
__cuda=11.7=0
__glibc=2.27=0
__linux=5.4.0=0
__unix=0=0
base environment : /mnt/houjinliang/miniconda3 (writable)
conda av data dir : /mnt/houjinliang/miniconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /mnt/houjinliang/miniconda3/pkgs
/mnt/houjinliang/.conda/pkgs
envs directories : /mnt/houjinliang/miniconda3/envs
/mnt/houjinliang/.conda/envs
platform : linux-64
user-agent : conda/24.4.0 requests/2.31.0 CPython/3.12.3 Linux/5.4.0-150-generic ubuntu/18.04.6 glibc/2.27 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8 aau/0.4.4 c/. s/. e/.
UID:GID : 1035:1035
netrc file : None
offline mode : False

Switching conda to the Aliyun mirror

# I did not switch it here

Reference: https://developer.aliyun.com/article/1291651

Switching pip to the Aliyun mirror

Directly from the command line:

(base) houjinliang@3090server:~$ pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
Writing to /mnt/houjinliang/.config/pip/pip.conf

Alternatively, edit ~/.config/pip/pip.conf (create it if it does not exist) so that it contains the following:

(base) houjinliang@3090server:~$ cat ~/.config/pip/pip.conf
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/

NV Driver

(base) houjinliang@3080server:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 525.60.11 Wed Nov 23 23:04:03 UTC 2022
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

CUDA 11.3.1 & CUDNN 8.9.5

The CUDA version is the same as on the previous server, so the installation follows the steps above.

(base) houjinliang@3090server:~/MyDownloadFiles$ wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run


(base) houjinliang@3090server:~/MyDownloadFiles$ ll
total 3224920
drwxrwxr-x 2 houjinliang houjinliang 4096 6月 26 11:10 ./
drwxr-xr-x 10 houjinliang houjinliang 4096 6月 26 11:04 ../
-rw-rw-r-- 1 houjinliang houjinliang 3158494112 5月 14 2021 cuda_11.3.1_465.19.01_linux.run
-rwxrwxr-x 1 houjinliang houjinliang 143808873 5月 21 02:15 Miniconda3-latest-Linux-x86_64.sh*

# cuda_11.3.1_465.19.01_linux.run lacks the execute (x) permission, so add it
(base) houjinliang@3090server:~/MyDownloadFiles$ chmod +x cuda_11.3.1_465.19.01_linux.run
(base) houjinliang@3090server:~/MyDownloadFiles$ ll
total 3224920
drwxrwxr-x 2 houjinliang houjinliang 4096 6月 26 11:10 ./
drwxr-xr-x 10 houjinliang houjinliang 4096 6月 26 11:04 ../
-rwxrwxr-x 1 houjinliang houjinliang 3158494112 5月 14 2021 cuda_11.3.1_465.19.01_linux.run*
-rwxrwxr-x 1 houjinliang houjinliang 143808873 5月 21 02:15 Miniconda3-latest-Linux-x86_64.sh*

# Run the installer
(base) houjinliang@3090server:~/MyDownloadFiles$ ./cuda_11.3.1_465.19.01_linux.run
# The steps match the installation screenshots above

image-20241023134348370

Don't be alarmed if this prompt appears; just choose Continue and follow the steps below.

NPU_2024-06-26_11-29-59
NPU_2024-06-26_11-31-04
NPU_2024-06-26_11-31-37
NPU_2024-06-26_11-32-40
NPU_2024-06-26_11-34-55
NPU_2024-06-26_11-35-29
image-20240626114348562
NPU_2024-06-26_11-36-37
NPU_2024-06-26_11-37-23

Installation complete

(base) houjinliang@3090server:~/MyDownloadFiles$ ./cuda_11.3.1_465.19.01_linux.run
===========
= Summary =
===========

Driver: Not Selected
Toolkit: Installed in /mnt/houjinliang/cuda-11.3/
Samples: Not Selected

Please make sure that
- PATH includes /mnt/houjinliang/cuda-11.3/bin
- LD_LIBRARY_PATH includes /mnt/houjinliang/cuda-11.3/lib64, or, add /mnt/houjinliang/cuda-11.3/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /mnt/houjinliang/cuda-11.3/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 465.00 is required for CUDA 11.3 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver

Logfile is /tmp/cuda-installer.log

image-20240626114642588

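After installation, it is best to delete /tmp/cuda-installer.log. If it is left behind, later installations by other users are affected (the file belongs to whoever ran the installer, so another user's installer may be unable to write to it). Delete it so that it does not get in anyone else's way.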
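
A cleanup step like this can be scripted so that it is not forgotten. The sketch below removes the log only when it exists and belongs to the current user; remove_own_file is a hypothetical helper name, and the log path is the one printed in the installer summary.

```shell
# Hypothetical helper: delete a file only if it exists and is owned by the
# current user, so somebody else's file is never touched by accident.
remove_own_file() {
    if [ -e "$1" ] && [ -O "$1" ]; then
        rm -f -- "$1"
    fi
}

remove_own_file /tmp/cuda-installer.log
```

If the log belongs to another user, the function does nothing; in that case the file's owner or an administrator has to remove it.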

Configure the CUDA Toolkit environment variables, using vim or VS Code

(base) houjinliang@3090server:~$ vim ~/.bashrc


# >>> cuda environment variables >>>
# murphy insert
# CUDA_HOME names a single directory, so assign it directly instead of appending to it
export CUDA_HOME=/mnt/houjinliang/cuda-11.3
export PATH=$PATH:/mnt/houjinliang/cuda-11.3/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/mnt/houjinliang/cuda-11.3/lib64
# <<< cuda environment variables <<<

(base) houjinliang@3090server:~$ source ~/.bashrc

# CUDA is now installed and configured
(base) houjinliang@3090server:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Installing cuDNN. Downloading cuDNN requires logging in on the NVIDIA website, so here I simply reused the archive I had already downloaded for the earlier installation.

(base) houjinliang@3090server:~/MyDownloadFiles$ ll
total 4062292
drwxrwxr-x 2 houjinliang houjinliang 4096 6月 26 11:56 ./
drwxr-xr-x 11 houjinliang houjinliang 4096 6月 26 11:48 ../
-rwxrwxr-x 1 houjinliang houjinliang 3158494112 5月 14 2021 cuda_11.3.1_465.19.01_linux.run*
-rw-rw-r-- 1 houjinliang houjinliang 857460936 6月 26 11:57 cudnn-linux-x86_64-8.9.5.29_cuda11-archive.tar.xz
-rwxrwxr-x 1 houjinliang houjinliang 143808873 5月 21 02:15 Miniconda3-latest-Linux-x86_64.sh*

# Extract the cuDNN archive
(base) houjinliang@3090server:~/MyDownloadFiles$ tar xvJf cudnn-linux-x86_64-8.9.5.29_cuda11-archive.tar.xz
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_infer_static.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_infer_static_v8.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_train_static.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_train_static_v8.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_infer_static.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_infer_static_v8.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_train_static.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_train_static_v8.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_infer_static.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_infer_static_v8.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_train_static.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_train_static_v8.a
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn.so.8
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn.so
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn.so.8.9.5
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_infer.so
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_infer.so.8.9.5
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_infer.so.8
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_train.so.8.9.5
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_train.so.8
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_adv_train.so
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_infer.so.8
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_infer.so
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_infer.so.8.9.5
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_train.so.8.9.5
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_train.so.8
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_cnn_train.so
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_infer.so
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_infer.so.8
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_infer.so.8.9.5
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_train.so
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_train.so.8
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/lib/libcudnn_ops_train.so.8.9.5
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_adv_infer_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_adv_train_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_backend_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_cnn_infer_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_cnn_train_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_ops_infer_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_ops_train_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_version_v8.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_adv_infer.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_adv_train.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_backend.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_cnn_infer.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_cnn_train.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_ops_infer.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_ops_train.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/include/cudnn_version.h
cudnn-linux-x86_64-8.9.5.29_cuda11-archive/LICENSE

# Inspect the extracted cuDNN files
(base) houjinliang@3090server:~/MyDownloadFiles$ cd cudnn-linux-x86_64-8.9.5.29_cuda11-archive/
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ ll
total 48
drwxr-xr-x 4 houjinliang houjinliang 4096 9月 7 2023 ./
drwxrwxr-x 3 houjinliang houjinliang 4096 6月 26 11:58 ../
drwxr-xr-x 2 houjinliang houjinliang 4096 9月 7 2023 include/
drwxr-xr-x 2 houjinliang houjinliang 4096 9月 7 2023 lib/
-rw-r--r-- 1 houjinliang houjinliang 29662 9月 7 2023 LICENSE

# Copy the cuDNN files into the CUDA directory
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ cp lib/* ~/cuda-11.3/lib64/
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ cp include/* ~/cuda-11.3/include
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ chmod +x ~/cuda-11.3/include/cudnn.h
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ chmod +x ~/cuda-11.3/lib64/libcudnn*

# Check the cuDNN version to verify that the copy succeeded
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ cat ~/cuda-11.3/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */

Git & GitHub

(base) houjinliang@3090server:~$ git config --global user.name 'hjl_3090server'
(base) houjinliang@3090server:~$ git config --global user.email 'cosmicdustycn@outlook.com'
(base) houjinliang@3090server:~$ ssh-keygen -t rsa -C "cosmicdustycn@outlook.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/mnt/houjinliang/.ssh/id_rsa):
Created directory '/mnt/houjinliang/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /mnt/houjinliang/.ssh/id_rsa.
Your public key has been saved in /mnt/houjinliang/.ssh/id_rsa.pub.
(base) houjinliang@3090server:~/.ssh$ pwd
/mnt/houjinliang/.ssh
(base) houjinliang@3090server:~/.ssh$ ll
total 16
drwx------ 2 houjinliang houjinliang 4096 6月 26 12:11 ./
drwxr-xr-x 12 houjinliang houjinliang 4096 6月 26 12:11 ../
-rw------- 1 houjinliang houjinliang 1679 6月 26 12:11 id_rsa
-rw-r--r-- 1 houjinliang houjinliang 407 6月 26 12:11 id_rsa.pub
(base) houjinliang@3090server:~$ git config user.name
hjl_3090server
(base) houjinliang@3090server:~$ git config user.email
cosmicdustycn@outlook.com
(base) houjinliang@3090server:~$ ssh -T git@github.com
The authenticity of host 'github.com (20.205.243.166)' can't be established.
ECDSA key fingerprint is xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,20.205.243.166' (ECDSA) to the list of known hosts.
Hi murphyhoucn! You've successfully authenticated, but GitHub does not provide shell access.

3090Server2

  • Ubuntu 20.04.5 LTS
  • gcc version 9.4.0
  • CUDA 11.3
  • cuDNN 8.9.5

NV Driver

(base) houjinliang@3090server2:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)

4090Server

  • Ubuntu 22.04.2 LTS
  • gcc 11.4.0
  • CUDA11.6 : cuda_11.6.2_510.47.03_linux.run
  • cuDNN 8.9.5: cudnn-linux-x86_64-8.9.5.29_cuda11-archive.tar.xz

NV Driver

(sr_benchmark) houjinliang@4090server:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.183.06 Wed Jun 26 06:46:07 UTC 2024
GCC version: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

CUDA 11.6 & cuDNN 8.9.5

(base) houjinliang@4090server:~/MyDownloadFiles$ ./cuda_11.6.2_510.47.03_linux.run
(base) houjinliang@3090server:~/MyDownloadFiles$ cd cudnn-linux-x86_64-8.9.5.29_cuda11-archive/

# The CUDA version is 11.6.2
# cuDNN is the same version as before

The installation process is the same as above; remember to replace every 11.3 with 11.6.

image-20241023150732692

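Then configure Git.

As for the conda env, I packed the environment from the previous server with conda-pack, transferred it with scp, and unpacked it into the corresponding folder. The old server had CUDA 11.3 and the torch build is also for CUDA 11.3, yet it works on this CUDA 11.6 server as well (so I'll keep using it for now?!). This is not too surprising: the conda-installed PyTorch uses the cudatoolkit 11.3 runtime bundled inside the environment, so it mainly needs a sufficiently new driver rather than a matching system toolkit.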
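
For reference, the usual conda-pack workflow looks roughly like the sketch below. The exact commands used here were not recorded, so the environment name, the remote destination, and the target directory are placeholders passed as arguments, and the two function names are hypothetical. conda pack, bin/activate, and conda-unpack are the steps described in the conda-pack documentation.

```shell
# Sketch of a conda-pack migration; all arguments are placeholders.

# On the old server: archive the environment and copy it over.
# Usage: pack_and_send ENV_NAME USER@HOST:/remote/dir/
pack_and_send() {
    conda pack -n "$1" -o "$1.tar.gz"
    scp "$1.tar.gz" "$2"
}

# On the new server: unpack into the envs directory and fix path prefixes.
# Usage: unpack_env ARCHIVE.tar.gz /path/to/miniconda3/envs/ENV_NAME
unpack_env() {
    mkdir -p "$2"
    tar -xzf "$1" -C "$2"
    . "$2/bin/activate"      # activate the unpacked environment
    conda-unpack             # rewrite prefixes hard-coded for the old location
}
```

Unpacking under miniconda3/envs lets conda activate find the environment by name afterwards.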

Problem: Failed to initialize NVML: Driver/library version mismatch

The environment had been running fine for a long time, but one day this error suddenly appeared while running a program:

ERROR: cuda is not available, try running on CPU

This message is one I wrote into my own program. So the system's CUDA has become unavailable?! What is going on?!

(base) houjinliang@4090server:~$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.216

(base) houjinliang@4090server:~$ nvitop
NVML ERROR: RM has detected an NVML/RM version mismatch.

(base) houjinliang@4090server:~$ gpustat
Error on querying NVIDIA devices. Use --debug flag to see more details.
RM has detected an NVML/RM version mismatch.

(base) houjinliang@4090server:~$ gpustat --debug
Error on querying NVIDIA devices. Use --debug flag to see more details.
RM has detected an NVML/RM version mismatch.

Traceback (most recent call last):
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/gpustat/cli.py", line 58, in print_gpustat
gpu_stats = GPUStatCollection.new_query(debug=debug, id=id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/gpustat/core.py", line 402, in new_query
N.nvmlInit()
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/pynvml.py", line 1947, in nvmlInit
nvmlInitWithFlags(0)
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/pynvml.py", line 1937, in nvmlInitWithFlags
_nvmlCheckReturn(ret)
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/pynvml.py", line 899, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_LibRmVersionMismatch: RM has detected an NVML/RM version mismatch.


(sr_benchmark) houjinliang@4090server:~$ python
Python 3.8.19 (default, Mar 20 2024, 19:58:24)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/mnt/houjinliang/miniconda3/envs/sr_benchmark/lib/python3.8/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
False

Solution to "Failed to initialize NVML: Driver/library version mismatch" - Zhihu (知乎)
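The mismatch can be read off the outputs recorded in this post: /proc/driver/nvidia/version reported kernel module 535.183.06 earlier, while nvidia-smi now reports NVML library 535.216. Most often this means the user-space driver libraries were upgraded while the older kernel module stayed loaded. The sketch below only illustrates that comparison and is not a fix. The function names are made up for this example, and comparing only the leading components is my assumption, because nvidia-smi appears to print a shortened library version.

```python
import re
from pathlib import Path
from typing import Optional

# File used earlier in this post to read the loaded kernel module version.
PROC_VERSION = Path("/proc/driver/nvidia/version")


def kernel_module_version(text: str) -> Optional[str]:
    """Pull the version number out of the 'NVRM version' line."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def versions_match(kernel_version: str, library_version: str) -> bool:
    """Compare two versions on the components that both strings provide.

    Assumption: the library version may be shortened (e.g. '535.216'),
    so only the shared leading components are compared.
    """
    kernel_parts = kernel_version.split(".")
    library_parts = library_version.split(".")
    shared = min(len(kernel_parts), len(library_parts))
    return kernel_parts[:shared] == library_parts[:shared]


if __name__ == "__main__":
    library_version = "535.216"  # copied by hand from the nvidia-smi error
    if PROC_VERSION.exists():
        kernel_version = kernel_module_version(PROC_VERSION.read_text())
        if kernel_version is None:
            print("could not parse", PROC_VERSION)
        elif versions_match(kernel_version, library_version):
            print("kernel module and library agree:", kernel_version)
        else:
            print("mismatch:", kernel_version, "vs", library_version)
    else:
        print("NVIDIA kernel module is not loaded")
```

When the two disagree, a reboot (or an admin reloading the nvidia kernel modules) normally brings them back in sync.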

4090Server2

  • Ubuntu 22.04.3 LTS
  • gcc version 12.3.0

NV Driver

(base) houjinliang@4090server2:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.107.02 Wed Jul 24 23:53:00 UTC 2024
GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)

CUDA 12.4.1 & cuDNN 8.9.7

CUDA 12.4.1 : CUDA Toolkit 12.4 Update 1 Downloads | NVIDIA Developer

(base) houjinliang@4090server2:~/MyDownloadFiles$ wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
image-20241024224047453
image-20241024224102744
image-20241024224157698
image-20241024224232933
image-20241024224310577
image-20241024224318258
image-20241024224333861

image-20241024224612237

Remember to delete this log file!

Configure the CUDA environment variables

(base) houjinliang@4090server2:~/MyDownloadFiles$ vim ~/.bashrc

# >>> cuda environment variables >>>
# murpy insert
# CUDA_HOME must be a single directory, not a colon-separated list
export CUDA_HOME=/data/houjinliang/cuda-12.4
export PATH=$PATH:/data/houjinliang/cuda-12.4/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/houjinliang/cuda-12.4/lib64
# <<< cuda environment variables <<<

(base) houjinliang@4090server2:~/MyDownloadFiles$ source ~/.bashrc
(base) houjinliang@4090server2:~/MyDownloadFiles$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

CUDNN : cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar

https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.7/local_installers/12.x/cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz/

(base) houjinliang@4090server2:~/MyDownloadFiles$ tar xvJf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz

(base) houjinliang@4090server2:~/MyDownloadFiles$ cd cudnn-linux-x86_64-8.9.7.29_cuda12-archive/
(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ ll
total 48
drwxr-xr-x 4 houjinliang houjinliang 4096 11月 30 2023 ./
drwxrwxr-x 3 houjinliang houjinliang 4096 10月 24 22:53 ../
drwxr-xr-x 2 houjinliang houjinliang 4096 11月 30 2023 include/
drwxr-xr-x 2 houjinliang houjinliang 4096 11月 30 2023 lib/
-rw-r--r-- 1 houjinliang houjinliang 29662 11月 30 2023 LICENSE

(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ cp lib/* ~/cuda-12.4/lib64/
(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ cp include/* ~/cuda-12.4/include
(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ chmod +x ~/cuda-12.4/include/cudnn.h
(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ chmod +x ~/cuda-12.4/lib64/libcudnn*

(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ cat ~/cuda-12.4/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 7
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */
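The same check as the `cat ... | grep` above can be sketched in Python, which is handy when the version is needed inside a script. This assumes the header sits at ~/cuda-12.4/include/cudnn_version.h as in this post; `cudnn_version` is a helper name made up for this example.

```python
import re
from pathlib import Path
from typing import Optional


def cudnn_version(header_text: str) -> Optional[str]:
    """Return 'MAJOR.MINOR.PATCHLEVEL' parsed from cudnn_version.h text."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as '#define CUDNN_MAJOR 8' and nothing longer.
        match = re.search(rf"^#define\s+{name}\s+(\d+)\s*$", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(match.group(1))
    return ".".join(parts)


if __name__ == "__main__":
    # Install location taken from this post; adjust for other machines.
    header = Path.home() / "cuda-12.4" / "include" / "cudnn_version.h"
    if header.exists():
        print("cuDNN", cudnn_version(header.read_text()))
    else:
        print("header not found:", header)
```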

git install

This server has no git, so install one from a deb package.

(base) houjinliang@4090server2:~/MyDownloadFiles$ wget http://archive.ubuntu.com/ubuntu/pool/main/g/git/git_2.34.1-1ubuntu1.11_amd64.deb

(base) houjinliang@4090server2:~/MyDownloadFiles$ cd ~

# Create a folder first, then unpack git into it
(base) houjinliang@4090server2:~$ mkdir git
(base) houjinliang@4090server2:~$ dpkg -x ./MyDownloadFiles/git_2.34.1-1ubuntu1.11_amd64.deb ./git

(base) houjinliang@4090server2:~$ cd git/
(base) houjinliang@4090server2:~/git$ ll
total 20
drwxr-xr-x 5 houjinliang houjinliang 4096 5月 20 20:14 ./
drwxr-x--- 14 houjinliang houjinliang 4096 10月 24 23:22 ../
drwxr-xr-x 3 houjinliang houjinliang 4096 5月 20 20:14 etc/
drwxr-xr-x 5 houjinliang houjinliang 4096 5月 20 20:14 usr/
drwxr-xr-x 3 houjinliang houjinliang 4096 5月 20 20:14 var/

(base) houjinliang@4090server2:~$ vim ~/.bashrc

# >>> git environment variables >>>
# murpy insert
export PATH=$PATH:~/git/usr/bin
export GIT_EXEC_PATH=~/git/usr/lib/git-core
# <<< git environment variables <<<

(base) houjinliang@4090server2:~$ source ~/.bashrc

(base) houjinliang@4090server2:~$ git --version
git version 2.34.1

git configuration

(base) houjinliang@4090server2:~$ git config --global user.name 'hjl_4090server2'
(base) houjinliang@4090server2:~$ git config --global user.email 'cosmicdustycn@outlook.com'
(base) houjinliang@4090server2:~$ ssh-keygen -t rsa -C "cosmicdustycn@outlook.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/data/houjinliang/.ssh/id_rsa):
Created directory '/data/houjinliang/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /data/houjinliang/.ssh/id_rsa
Your public key has been saved in /data/houjinliang/.ssh/id_rsa.pub
(base) houjinliang@4090server2:~$ cat ~/.ssh/id_rsa.pub

(base) houjinliang@4090server2:~$ git config user.name
hjl_4090server2
(base) houjinliang@4090server2:~$ git config user.email
cosmicdustycn@outlook.com
(base) houjinliang@4090server2:~$ ssh -T git@github.com
The authenticity of host 'github.com (20.205.243.166)' can't be established.
ED25519 key fingerprint is SHA256:+DiY3wvvV6TuJJhbpZisF/zLDA0zPMSvHdkr4UvCOqU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'github.com' (ED25519) to the list of known hosts.
Hi murphyhoucn! You've successfully authenticated, but GitHub does not provide shell access.

conda env

Although the CUDA environment on 4090server2 is 12.4, I still use the sr_benchmark environment that was configured on the 3080 server.

(base) houjinliang@4090server2:~$ mkdir ~/miniconda3/envs/sr_benchmark
(base) houjinliang@4090server2:~$ tar -xzvf ./MyDownloadFiles/sr_benchmark.tar.gz -C ~/miniconda3/envs/sr_benchmark
(base) houjinliang@4090server2:~$ conda env list
# conda environments:
#
base * /data/houjinliang/miniconda3
sr_benchmark /data/houjinliang/miniconda3/envs/sr_benchmark

(base) houjinliang@4090server2:~$
(base) houjinliang@4090server2:~$ conda activate sr_benchmark
(sr_benchmark) houjinliang@4090server2:~$ python
Python 3.8.19 (default, Mar 20 2024, 19:58:24)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
>>>

# the torch version is still the previous one
torch 1.10.1+cu113
torchvision 0.11.2+cu113
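A likely reason the cu113 environment still runs on a CUDA 12.4 machine: PyTorch binary builds bundle their own CUDA runtime, so what matters is that the installed driver is new enough, not which toolkit `nvcc` reports. The sketch below prints what the environment actually uses. `split_torch_version` is a helper made up for this example, and the torch part is skipped if torch is not installed.

```python
from typing import Optional, Tuple


def split_torch_version(version: str) -> Tuple[str, Optional[str]]:
    """Split '1.10.1+cu113' into ('1.10.1', 'cu113'); the tag is None if absent."""
    base, _, local = version.partition("+")
    return base, (local or None)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        base, tag = split_torch_version(torch.__version__)
        print("torch release:", base, "build tag:", tag)
        # CUDA runtime the binary was built with (None for CPU-only builds).
        print("bundled CUDA runtime:", torch.version.cuda)
        if torch.cuda.is_available():
            print("device:", torch.cuda.get_device_name(0))
            print("compute capability:", torch.cuda.get_device_capability(0))
```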

Reference links

CUDA Toolkit and Corresponding Driver Versions

CUDA 12.6 Update 2 Release Notes

image-20241030165027775
image-20241030165047346

GCC and CUDA version correspondence

image-20241023135851280

  • 3080Server - gcc 7.5.0 (Ubuntu 18.04.6 LTS) -> CUDA 11.3
  • 3090Server - gcc 7.5.0 (Ubuntu 18.04.6 LTS) -> CUDA 11.3
  • 3090Server2 - gcc 9.4.0 (Ubuntu 20.04.5 LTS) -> CUDA 11.3
  • 4090Server - gcc 11.4.0 (Ubuntu 22.04.2 LTS) -> CUDA 11.6
  • 4090Server2 - gcc 12.3.0 (Ubuntu 22.04.3 LTS) -> CUDA 12.4

image-20241023142730502

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

cuDNN docs

image-20241023145233186

CUDA Toolkit Archive

image-20241023145423648

CUDA Toolkit Archive | NVIDIA Developer

cuDNN Archive

image-20241024223544376

image-20241024223923702

Docker

Docker Install

Requires an admin user!

  • Install with APT (see online tutorials for the detailed steps)

# Step 1
sudo apt update
sudo apt install \
apt-transport-https \
ca-certificates \
curl \
gnupg \
lsb-release

# Step 2
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Step 3
echo \
"deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Step 4
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io

# Step 5
sudo systemctl enable docker
sudo systemctl start docker
  • To let non-admin users run docker as well, create a user group and grant its members permission

# Create the docker group
sudo groupadd docker

# Add the current user to the docker group
sudo usermod -aG docker $USER
# Add user xxx to the docker group
sudo usermod -aG docker xxxxxxxx

# List the members of the docker group - method 1
getent group docker
# List the members of the docker group - method 2
grep '^docker:' /etc/group
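The same membership check can be done from Python's standard library, for example at the top of a script that is about to call docker. This is a minimal sketch; `user_in_group` is a name made up for this example.

```python
import grp
import os
import pwd


def user_in_group(user: str, group: str) -> bool:
    """True if `user` is listed in `group` or has it as primary group."""
    try:
        group_entry = grp.getgrnam(group)
    except KeyError:
        return False  # the group does not exist
    if user in group_entry.gr_mem:
        return True
    try:
        return pwd.getpwnam(user).pw_gid == group_entry.gr_gid
    except KeyError:
        return False  # the user does not exist


if __name__ == "__main__":
    user = os.environ.get("USER", "")
    print(user, "in docker group:", user_in_group(user, "docker"))
```

Note that a change made with `usermod -aG` only applies to new login sessions, so log out and back in before testing docker without sudo.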

Configure the docker proxy

Configuring the docker proxy requires an admin user.

For the network proxy setup, see 瞧瞧我对服务器干了些什么! - MurphyHou (cosmicdusty.cc)

1. Configure registry mirrors (many mirror servers no longer work)

sudo vim /etc/docker/daemon.json

# Put the following configuration in the json file
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://mirror.baidubce.com"
  ]
}

# Then restart the docker service
sudo systemctl daemon-reload
sudo systemctl restart docker
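A typo in daemon.json (a trailing comma, for instance) keeps the docker service from starting, so it can be worth parsing the file before the restart. A minimal sketch, assuming the file lives at /etc/docker/daemon.json as above; `registry_mirrors` is a helper name made up for this example.

```python
import json
from pathlib import Path
from typing import List

DAEMON_JSON = Path("/etc/docker/daemon.json")


def registry_mirrors(config_text: str) -> List[str]:
    """Parse daemon.json text and return its registry-mirrors list.

    Raises json.JSONDecodeError if the text is not valid JSON.
    """
    config = json.loads(config_text)
    return list(config.get("registry-mirrors", []))


if __name__ == "__main__":
    try:
        mirrors = registry_mirrors(DAEMON_JSON.read_text())
    except FileNotFoundError:
        print("no daemon.json found at", DAEMON_JSON)
    except PermissionError:
        print("not allowed to read", DAEMON_JSON)
    except json.JSONDecodeError as error:
        print("daemon.json is not valid JSON:", error)
    else:
        print("configured mirrors:", mirrors)
```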

2. Proxy for docker pull

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo touch /etc/systemd/system/docker.service.d/proxy.conf

# Put the following in proxy.conf (a systemd drop-in, not json) -> (port 7890 because that is the port clash proxies on)
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:7890/"
Environment="HTTPS_PROXY=http://127.0.0.1:7890/"
Environment="NO_PROXY=localhost,127.0.0.1,.example.com"

# Then restart the docker service
sudo systemctl daemon-reload
sudo systemctl restart docker
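To confirm that the service picked up the proxy settings, `systemctl show --property=Environment docker` prints the environment of the docker unit. The sketch below wraps that command and parses its single `Environment=...` line. The function names are made up for this example, and the sample line in the comment is what I would expect from the proxy.conf above, not captured output.

```python
import shlex
import subprocess
from typing import Dict


def parse_environment(show_output: str) -> Dict[str, str]:
    """Turn 'Environment=A=1 B=2' into {'A': '1', 'B': '2'}."""
    line = show_output.strip()
    prefix = "Environment="
    if not line.startswith(prefix):
        return {}
    pairs = shlex.split(line[len(prefix):])
    return dict(pair.split("=", 1) for pair in pairs if "=" in pair)


def docker_service_environment() -> Dict[str, str]:
    """Ask systemd for the environment of the docker unit."""
    result = subprocess.run(
        ["systemctl", "show", "--property=Environment", "docker"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_environment(result.stdout)


# Expected shape of the line being parsed (illustrative):
# Environment=HTTP_PROXY=http://127.0.0.1:7890/ HTTPS_PROXY=http://127.0.0.1:7890/ NO_PROXY=...
```

Calling `docker_service_environment()` on the server should then return a dict containing `HTTP_PROXY` and `HTTPS_PROXY` if the drop-in was loaded.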

3. Proxy for containers

# 1. User-level proxy (no admin user needed here; log in as your own user)
vim ~/.docker/config.json

# Put the following configuration in the json file -> (port 7890 because that is the port clash proxies on)
{
  "proxies": {
    "default": {
      "httpProxy": "http://127.0.0.1:7890",
      "httpsProxy": "http://127.0.0.1:7890",
      "noProxy": "localhost,127.0.0.1,.example.com"
    }
  }
}

Test whether the Docker setup works

Ubuntu | Docker: From Beginner to Practice (Docker 从入门到实践) (gitbook.io)

docker run --rm hello-world

image-20241205160737158

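Setting up Overleaf

Once the docker environment above is in place, Overleaf can be set up. In particular, the network has to be working, otherwise the Docker image cannot be pulled.

Configuration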

# Step 1: download the source
git clone https://github.com/overleaf/toolkit.git ./overleaf-toolkit && cd overleaf-toolkit

# Step 2: initialize the configuration
bin/init

# Step 3: bring up the services
bin/up

image-20241205161651945

# Start the services
bin/start

# Stop the services
bin/stop

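Remote access

Because the service runs on a remote server, the listen IP and port need to be changed so that it can be reached directly from the local machine.

In ./config/overleaf.rc, change the following fields: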

OVERLEAF_LISTEN_IP=xx.xx.xx.xx # IP of the remote server
OVERLEAF_PORT=80 # default is 80

After the Overleaf container has started, open http://xx.xx.xx.xx:xx/launchpad to register the admin account. That account can then be used to log in to the Overleaf platform.
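If the launchpad page does not load, a quick first check from the local machine is whether the configured IP and port accept TCP connections at all. A minimal sketch using only the standard library; `port_open` is a name made up for this example, and the host and port in the usage part are placeholders to replace with the values from overleaf.rc.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    host, port = "127.0.0.1", 80  # placeholders, not the real server address
    print(f"{host}:{port} reachable:", port_open(host, port))
```

A `False` here points at the listen IP, the port or a firewall rather than at Overleaf itself.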

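Online tutorials also describe some more complex settings; I'll configure those later as needed.

Postscript

The official Overleaf site gives free users only 20 s of compile time; a document that exceeds the limit cannot be compiled, and the only way around that is to pay. If I ran into this myself, I would probably pay too. But I saw online that you can host your own Overleaf on a server, so I wanted to follow a tutorial and try it. Following the tutorial step by step, I got it configured successfully in the end. I may never actually use this self-hosted instance, but the tinkering never stops, and what if it comes in handy someday?!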


瞧瞧我对服务器干了些什么!
https://cosmicdusty.cc/post/Tools/WorkingWithGPUServer/
Author
Murphy
Published on
November 1, 2023
License