A complete walkthrough of setting up the environment required by GCNv2_SLAM and running the project on an AutoDL cloud GPU instance.

1. Introduction

A few days ago I wrote a post about running the GCNv2_SLAM project on the CPU inside a local virtual machine: link. For background on GCNv2_SLAM itself, please see that article; it is not repeated here.

In that earlier test, CPU-only performance in the local VM was dismal: inference ran at a pitiful 0.5 Hz. Since I have no GPU machine at hand, I decided to rent a containerized GPU environment online.

AutoDL is such a platform for renting GPU environments: https://www.autodl.com/. GPUs there can be billed by the hour, which works out cheaper than monthly billing and beats buying a GPU and wrestling with a local Ubuntu setup, so I got straight to it!

First register an AutoDL account and top it up with a little money; then you can rent a containerized GPU environment and run GCNv2_SLAM on it.

image.png

2. Choosing an AutoDL environment

Images with older PyTorch versions cannot be selected on a 4090, because the 4090 cannot use such old CUDA versions. If you need an image with an older PyTorch, rent a 2080 Ti or 1080 Ti environment instead.

For a 2080 Ti you can pick the following environment, verified to work:

PyTorch  1.5.1
Python 3.8(ubuntu18.04)
Cuda 10.1

image.png

After creating the instance, I recommend copying the SSH login command shown on the left and running it in a local terminal to log in to the cloud machine. If you have no local SSH terminal, you can also use the terminal inside JupyterLab.

image.png

The steps below involve downloading quite a few files. If downloads from GitHub are slow, you can fetch them locally first and upload them through JupyterLab. Note that before uploading you must select the target directory in the file list.

image.png

You can also try AutoDL's built-in proxy: www.autodl.com/docs/network_turbo/, but when I tried it the proxy kept returning 503 and was unusable.

3. Installing the dependencies

3.1. Installing the required apt packages

Before anything else, update the system; this part is identical to setting things up in a local virtual machine.

sudo apt-get update -y
sudo apt-get upgrade -y

During the upgrade there is a prompt about a new sshd configuration file; just choose 1 to use the maintainer's version.

A new version (/tmp/file1bBLK4) of configuration file /etc/ssh/sshd_config is available, but the version installed currently has been
locally modified.

1. install the package maintainer's version 5. show a 3-way difference between available versions
2. keep the local version currently installed 6. do a 3-way merge between available versions
3. show the differences between the versions 7. start a new shell to examine the situation
4. show a side-by-side difference between the versions
What do you want to do about modified configuration file sshd_config? 1

Since we picked a PyTorch image, the Python toolchain is already included in the system and does not need to be installed.

Install the tools we will need:

# tool packages
sudo apt-get install -y \
apt-utils \
curl wget unzip zip \
cmake make automake \
openssh-server \
net-tools \
vim git gcc g++

Install the X11-related packages:

# x11 for gui
sudo apt-get install -y \
libx11-xcb1 \
libfreetype6 \
libdbus-1-3 \
libfontconfig1 \
libxkbcommon0 \
libxkbcommon-x11-0

Note: two of the X11 packages installed here are too new and cause dependency conflicts later when installing the prerequisites of Pangolin and similar projects. Downgrade these two packages:

apt-get install -y \
libx11-xcb1=2:1.6.4-3ubuntu0.4 \
libx11-6=2:1.6.4-3ubuntu0.4

3.2. Pangolin 0.6

Before installing Pangolin, install the following prerequisites:

# pangolin
sudo apt-get install -y \
libgl1-mesa-dev \
libglew-dev \
libboost-dev \
libboost-thread-dev \
libboost-filesystem-dev \
libpython2.7-dev \
libglu1-mesa-dev freeglut3-dev

If you skip the downgrade commands, installing these Pangolin prerequisites fails with terminal output like this:

root@autodl-container-e39d46b8d3-01da7b14:~# apt-get install -y     libgl1-mesa-dev     libglew-dev     libboost-dev     libboost-thread-dev     libboost-filesystem-dev     libpython2.7-dev     libglu1-mesa-dev freeglut3-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
freeglut3-dev : Depends: libxext-dev but it is not going to be installed
Depends: libxt-dev but it is not going to be installed
libgl1-mesa-dev : Depends: mesa-common-dev (= 20.0.8-0ubuntu1~18.04.1) but it is not going to be installed
Depends: libx11-dev but it is not going to be installed
Depends: libx11-xcb-dev but it is not going to be installed
Depends: libxdamage-dev but it is not going to be installed
Depends: libxext-dev but it is not going to be installed
Depends: libxfixes-dev but it is not going to be installed
Depends: libxxf86vm-dev but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Then build and install Pangolin with the commands below. GitHub: Pangolin-0.6

I suggest downloading and building all these dependencies inside the ~/autodl-tmp data disk, so they survive even if you later switch the instance image.

# download
wget -O Pangolin-0.6.tar.gz https://github.com/stevenlovegrove/Pangolin/archive/refs/tags/v0.6.tar.gz
# extract
tar -zxvf Pangolin-0.6.tar.gz

pushd Pangolin-0.6
rm -rf build
mkdir build && cd build
# build and install
cmake -DCPP11_NO_BOOST=1 ..
make -j$(nproc)
make install
# refresh the dynamic linker cache
ldconfig
popd

The build and install succeeded:

image.png

3.3. OpenCV 3.4.5

First install the prerequisites:

sudo apt-get install -y \
build-essential libgtk2.0-dev \
libavcodec-dev libavformat-dev \
libjpeg-dev libtiff5-dev libswscale-dev \
libcanberra-gtk-module

Because the AutoDL environment is amd64, the commands below work as-is, with no extra handling needed:

# amd64
# add the extra apt source, then continue installing
sudo apt-get install -y software-properties-common
# note: the next command was found not to work on arm64; on this amd64 environment it is fine
sudo add-apt-repository "deb http://security.ubuntu.com/ubuntu xenial-security main"
sudo apt-get -y update
sudo apt-get install -y libjasper1 libjasper-dev

A screenshot of installing libjasper:

image.png

With the prerequisites in place, build OpenCV with the commands below. GitHub: opencv 3.4.5

# download and extract
wget -O opencv-3.4.5.tar.gz https://github.com/opencv/opencv/archive/refs/tags/3.4.5.tar.gz
tar -zxvf opencv-3.4.5.tar.gz
# build and install
pushd opencv-3.4.5
rm -rf build
mkdir build && cd build
# configure and build; -j$(nproc) runs one job per CPU core
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local ..
make -j$(nproc)
make install
# refresh the dynamic linker cache
ldconfig
popd

It compiled and installed without a hitch:

image.png

3.4. Eigen 3.3.7

The Eigen package is downloaded from GitLab: gitlab.com/libeigen/eigen/-/releases/3.3.7

# download
wget -O eigen-3.3.7.tar.gz https://gitlab.com/libeigen/eigen/-/archive/3.3.7/eigen-3.3.7.tar.gz
tar -zxvf eigen-3.3.7.tar.gz
# build and install
cd eigen-3.3.7
mkdir build && cd build
cmake ..
make && make install
# copy the headers so that includes such as <Eigen/Dense> resolve
sudo cp -r /usr/local/include/eigen3/Eigen /usr/local/include

Use the same C++ demo as before to check the installation (it can be compiled directly with g++):

#include <iostream>
// requires the headers copied from /usr/local/include/eigen3/ to /usr/local/include
#include <Eigen/Dense>
//using Eigen::MatrixXd;
using namespace Eigen;
using namespace Eigen::internal;
using namespace Eigen::Architecture;
using namespace std;
int main()
{
    cout << "*******************1D-object****************" << endl;
    Vector4d v1;
    v1 << 1, 2, 3, 4;
    cout << "v1=\n" << v1 << endl;

    VectorXd v2(3);
    v2 << 1, 2, 3;
    cout << "v2=\n" << v2 << endl;

    Array4i v3;
    v3 << 1, 2, 3, 4;
    cout << "v3=\n" << v3 << endl;

    ArrayXf v4(3);
    v4 << 1, 2, 3;
    cout << "v4=\n" << v4 << endl;
}

It compiles and runs normally:

root@autodl-container-e39d46b8d3-01da7b14:~/pkg/eigen-3.3.7/build# g++ test.cpp -o t
root@autodl-container-e39d46b8d3-01da7b14:~/pkg/eigen-3.3.7/build# ./t
*******************1D-object****************
v1=
1
2
3
4
v2=
1
2
3
v3=
1
2
3
4
v4=
1
2
3

3.5. Libtorch 1.5.0

3.5.1. A note on building from source

Because the AutoDL environment we chose already ships with PyTorch, there is no need to build it from source ourselves.

I did try building PyTorch 1.1.0 from source, but the build kept getting killed partway through. I am not sure why; my guess is that the build used too much memory and CPU. The output at the moment it was killed is below: at roughly 74%, with no error before or after, it was simply terminated.

image.png
image.png

3.5.2. The bundled version cannot be used

The AutoDL image we chose does in fact ship with a usable Torch CMake directory, at the following path:

/root/miniconda3/lib/python3.8/site-packages/torch/share/cmake/Torch

However, the prebuilt libtorch this directory refers to was built without the C++11 ABI, which ultimately makes the link against Pangolin fail. The error output is shown below.

This link failure has nothing to do with the Pangolin version; both Pangolin 0.5 and 0.6 fail the same way.

[100%] Linking CXX executable ../GCN2/rgbd_gcn
../lib/libORB_SLAM2.so: undefined reference to `pangolin::Split(std::string const&, char)'
../lib/libORB_SLAM2.so: undefined reference to `pangolin::CreatePanel(std::string const&)'
../lib/libORB_SLAM2.so: undefined reference to `DBoW2::FORB::fromString(cv::Mat&, std::string const&)'
../lib/libORB_SLAM2.so: undefined reference to `pangolin::BindToContext(std::string)'
../lib/libORB_SLAM2.so: undefined reference to `DBoW2::FORB::toString(cv::Mat const&)'
../lib/libORB_SLAM2.so: undefined reference to `pangolin::CreateWindowAndBind(std::string, int, int, pangolin::Params const&)'
collect2: error: ld returned 1 exit status
CMakeFiles/rgbd_gcn.dir/build.make:152: recipe for target '../GCN2/rgbd_gcn' failed
make[2]: *** [../GCN2/rgbd_gcn] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/rgbd_gcn.dir/all' failed
make[1]: *** [CMakeFiles/rgbd_gcn.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

This issue is mentioned in the GCNv2 GitHub README; in short, do not use a prebuilt libtorch, because it produces link errors caused by the CXX11 ABI.

image.png

Since PyTorch 1.3.0, however, official prebuilt packages with the CXX11 ABI are available, so we can simply download one of those. Using the libtorch bundled in the container still leads to the link problem.

3.5.3. Downloading a prebuilt package

I originally picked the PyTorch 1.1.0 image, but since building from source proved impossible I switched to the PyTorch 1.5.1 image. Official CXX11 ABI prebuilt packages only exist from PyTorch 1.3.0 onward; anything older must be compiled by hand, or the link errors appear.

What we need is a CXX11 ABI prebuilt libtorch package from the official site; the download URLs containing cxx11-abi are the ones built with the CXX11 ABI. The 1.5.0 libtorch package is available at the address below, where cu101 means CUDA 10.1 and the trailing version is 1.5.0 (the 1.5.1 libtorch package could not be downloaded):

https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.5.0.zip

Extracting this archive with unzip yields a libtorch folder; the TORCH_PATH needed later is right inside it, at libtorch/share/cmake/Torch:

root@autodl-container-e39d46b8d3-01da7b14:~/autodl-tmp# ls libtorch/share/cmake/Torch
TorchConfig.cmake TorchConfigVersion.cmake

Prebuilt libtorch packages are all quite large, so I recommend downloading them locally beforehand and uploading them to AutoDL; downloading inside AutoDL takes far too long, and that is all money!

image.png

4. Building GCNv2_SLAM

On to the main event. Clone the code:

git clone https://github.com/jiexiong2016/GCNv2_SLAM.git

This time we are running on AutoDL with a GPU, and the PyTorch version is completely different from the one in my earlier post, so the required code changes differ as well. The notes in the post GCNv2_SLAM-CPU详细安装教程(ubuntu18.04)-CSDN博客 can still serve as a reference.

4.1. Modifying build.sh

For the prebuilt package, TORCH_PATH is inside the extracted libtorch directory, namely libtorch/share/cmake/Torch. Point the path in the build.sh script at this directory:

-DTORCH_PATH=/root/autodl-tmp/libtorch/share/cmake/Torch

With that changed, you can start the build and work through the errors that follow.

4.2. Adapting the code to a newer libtorch

All of these changes can be found in my GitHub repository: github.com/musnows/GCNv2_SLAM/tree/pytorch1.5.0

4.2.1. C++14 build configuration

The first build fails with the following error; newer libtorch requires C++14 because it uses C++14 features:

/root/autodl-tmp/libtorch/include/c10/util/C++17.h:27:2: error: #error You need C++14 to compile PyTorch
27 | #error You need C++14 to compile PyTorch
| ^~~~~

We need to adjust the CMake configuration. Edit GCNv2_SLAM/CMakeLists.txt and add the following:

# insert at the top
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# near the bottom, change 11 to 14
# set_property(TARGET rgbd_gcn PROPERTY CXX_STANDARD 11)
set_property(TARGET rgbd_gcn PROPERTY CXX_STANDARD 14)

You also need to comment out the C++11 detection logic in CMake, i.e. this whole block:

#Check C++11 or C++0x support
#include(CheckCXXCompilerFlag)
#CHECK_CXX_COMPILER_FLAG("-std=c++11" COMPILER_SUPPORTS_CXX11)
#CHECK_CXX_COMPILER_FLAG("-std=c++0x" COMPILER_SUPPORTS_CXX0X)
#if(COMPILER_SUPPORTS_CXX11)
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11")
add_definitions(-DCOMPILEDWITHC11)
# message(STATUS "Using flag -std=c++11.")
#elseif(COMPILER_SUPPORTS_CXX0X)
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++0x")
# add_definitions(-DCOMPILEDWITHC0X)
# message(STATUS "Using flag -std=c++0x.")
#else()
# message(FATAL_ERROR "The compiler ${CMAKE_CXX_COMPILER} has no C++11 support. Please use a different C++ compiler.")
#endif()

Do not comment out add_definitions(-DCOMPILEDWITHC11); it is needed!

After changing the CMake files, delete the GCNv2_SLAM/build directory and re-run build.sh, otherwise the changes may not take effect.

4.2.2. No matching operator=

The error reads:

/root/autodl-tmp/GCNv2_SLAM/src/GCNextractor.cc: In constructor ‘ORB_SLAM2::GCNextractor::GCNextractor(int, float, int, int, int)’:
/root/autodl-tmp/GCNv2_SLAM/src/GCNextractor.cc:218:37: error: no match for ‘operator=’ (operand types are ‘std::shared_ptr<torch::jit::Module>’ and ‘torch::jit::Module’)
module = torch::jit::load(net_fn);
^
In file included from /usr/include/c++/7/memory:81:0,
from /root/miniconda3/lib/python3.8/site-packages/torch/include/c10/core/Allocator.h:4,
from /root/miniconda3/lib/python3.8/site-packages/torch/include/ATen/ATen.h:3,
from /root/miniconda3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
from /root/miniconda3/lib/python3.8/site-packages/torch/include/torch/script.h:3,
from /root/autodl-tmp/GCNv2_SLAM/include/GCNextractor.h:24,
from /root/autodl-tmp/GCNv2_SLAM/src/GCNextractor.cc:63:

The root cause is that torch::jit::load now returns a torch::jit::Module by value rather than a pointer, so the shared_ptr has to become a plain object.

Modify line 99 of GCNv2_SLAM/include/GCNextractor.h:

// original code
std::shared_ptr<torch::jit::script::Module> module;
// change to
torch::jit::script::Module module;

And make the matching change at line 270 of GCNv2_SLAM/src/GCNextractor.cc:

// original code
auto output = module->forward(inputs).toTuple();
// change to
auto output = module.forward(inputs).toTuple();

4.2.3. chrono compilation errors

If your CMake changes are wrong, you may also hit compile errors caused by chrono:

/root/autodl-tmp/GCNv2_SLAM/GCN2/rgbd_gcn.cc: In function ‘int main(int, char**)’:
/root/autodl-tmp/GCNv2_SLAM/GCN2/rgbd_gcn.cc:97:22: error: ‘std::chrono::monotonic_clock’ has not been declared
std::chrono::monotonic_clock::time_point t1 = std::chrono::monotonic_clock::now();
^~~~~~~~~~~~~~~
/root/autodl-tmp/GCNv2_SLAM/GCN2/rgbd_gcn.cc:106:22: error: ‘std::chrono::monotonic_clock’ has not been declared
std::chrono::monotonic_clock::time_point t2 = std::chrono::monotonic_clock::now();
^~~~~~~~~~~~~~~
/root/autodl-tmp/GCNv2_SLAM/GCN2/rgbd_gcn.cc:109:84: error: ‘t2’ was not declared in this scope
double ttrack = std::chrono::duration_cast<std::chrono::duration<double> >(t2 - t1).count();
^~
/root/autodl-tmp/GCNv2_SLAM/GCN2/rgbd_gcn.cc:109:84: note: suggested alternative: ‘tm’
double ttrack = std::chrono::duration_cast<std::chrono::duration<double> >(t2 - t1).count();
^~
tm
/root/autodl-tmp/GCNv2_SLAM/GCN2/rgbd_gcn.cc:109:89: error: ‘t1’ was not declared in this scope
double ttrack = std::chrono::duration_cast<std::chrono::duration<double> >(t2 - t1).count();
^~
/root/autodl-tmp/GCNv2_SLAM/GCN2/rgbd_gcn.cc:109:89: note: suggested alternative: ‘tm’
double ttrack = std::chrono::duration_cast<std::chrono::duration<double> >(t2 - t1).count();
^~
tm
^CCMakeFiles/rgbd_gcn.dir/build.make:62: recipe for target 'CMakeFiles/rgbd_gcn.dir/GCN2/rgbd_gcn.cc.o' failed
make[2]: *** [CMakeFiles/rgbd_gcn.dir/GCN2/rgbd_gcn.cc.o] Interrupt
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/rgbd_gcn.dir/all' failed
make[1]: *** [CMakeFiles/rgbd_gcn.dir/all] Interrupt
Makefile:83: recipe for target 'all' failed
make: *** [all] Interrupt

The gist is that std::chrono::monotonic_clock does not exist: it was a pre-standard class that C++11 dropped. Looking at GCN2/rgbd_gcn.cc, a macro distinguishes the two cases:

// GCNv2_SLAM/GCN2/rgbd_gcn.cc
#ifdef COMPILEDWITHC11
std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
#else
std::chrono::monotonic_clock::time_point t1 = std::chrono::monotonic_clock::now();
#endif

This is exactly why add_definitions(-DCOMPILEDWITHC11) must be kept in GCNv2_SLAM/CMakeLists.txt, as noted above. With that macro defined, this code compiles the std::chrono::steady_clock branch and the error disappears.

4.2.4. Modifying the .pt files

The three .pt files still need modifying; note that the change is different from the CPU case!

Modify the contents of gcn2_320x240.pt, gcn2_640x480.pt and gcn2_tiny_320x240.pt under GCNv2_SLAM/GCN2. First unpack the file:

unzip gcn2_320x240.pt

Unpacking produces GCNv2_SLAM/GCN2/gcn/code/gcn.py. The final argument of the grid_sampler call here (align_corners) was implicitly True before PyTorch 1.3.0; since 1.3.0 it defaults to False, so True must now be passed explicitly:

# original code
_32 = torch.squeeze(torch.grid_sampler(input, grid, 0, 0))
# change to
_32 = torch.squeeze(torch.grid_sampler(input, grid, 0, 0, True))

After the replacement, delete the original .pt and zip it back up:

rm -rf gcn2_320x240.pt
zip -r gcn2_320x240.pt gcn
rm -rf gcn # remove the gcn folder we just extracted

That was just one example; the other gcn2 archives must be modified the same way!

unzip gcn2_640x480.pt
rm -rf gcn2_640x480.pt
# modify the following file:
# GCNv2_SLAM/GCN2/gcn2_480x640/code/gcn2_480x640.py
# re-zip
zip -r gcn2_640x480.pt gcn2_480x640
rm -rf gcn2_480x640

unzip gcn2_tiny_320x240.pt
rm -rf gcn2_tiny_320x240.pt
# modify the following file:
# gcnv2slam/GCNv2_SLAM/GCN2/gcn2_tiny/code/gcn2_tiny.py
# re-zip
zip -r gcn2_tiny_320x240.pt gcn2_tiny
rm -rf gcn2_tiny
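For context on why that extra True matters: the argument being added is grid_sampler's align_corners flag, whose default changed in PyTorch 1.3.0. Below is a plain-Python illustration of the coordinate mapping the flag controls (a hypothetical sketch of the documented formula, not libtorch code; the helper name unnormalize is my own):

```python
# Illustration of grid_sampler's align_corners flag: it changes how a normalized
# grid coordinate in [-1, 1] is mapped onto a pixel position along a size-W axis.
def unnormalize(coord, size, align_corners):
    if align_corners:
        # -1 and 1 refer to the centres of the first and last pixel
        return (coord + 1) / 2 * (size - 1)
    # -1 and 1 refer to the outer edges of the first and last pixel
    return ((coord + 1) * size - 1) / 2

# at the right edge (coord = 1.0) of a width-4 feature map:
print(unnormalize(1.0, 4, True))   # 3.0, the last pixel centre
print(unnormalize(1.0, 4, False))  # 3.5, half a pixel past the last centre
```

So a model traced under the old implicit True samples at visibly different locations if run under the new False default, which is why the flag has to be restored explicitly in the serialized code.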

4.3. Building the project

With the fixes above in place, the project builds successfully:

image.png

5. Setting up a VNC environment

5.1. Installing the VNC server

By default an AutoDL instance has no GUI environment, so the project cannot run (it fails with X11 errors).

So we set up a GUI following the official docs: www.autodl.com/docs/gui/

# install the basic dependency packages
apt update && apt install -y libglu1-mesa-dev mesa-utils xterm xauth x11-xkb-utils xfonts-base xkb-data libxtst6 libxv1

# install libjpeg-turbo and turbovnc
export TURBOVNC_VERSION=2.2.5
export LIBJPEG_VERSION=2.0.90
wget http://aivc.ks3-cn-beijing.ksyun.com/packages/libjpeg-turbo/libjpeg-turbo-official_${LIBJPEG_VERSION}_amd64.deb
wget http://aivc.ks3-cn-beijing.ksyun.com/packages/turbovnc/turbovnc_${TURBOVNC_VERSION}_amd64.deb
dpkg -i libjpeg-turbo-official_${LIBJPEG_VERSION}_amd64.deb
dpkg -i turbovnc_${TURBOVNC_VERSION}_amd64.deb
rm -rf *.deb

# start the VNC server; this step may prompt for a VNC password (note: not the instance's account password). If it complains that xauth is missing, install it with apt install xauth and retry
rm -rf /tmp/.X1* # when restarting, remove the previous session's temporary files first, otherwise startup fails
USER=root /opt/TurboVNC/bin/vncserver :1 -desktop X -auth /root/.Xauthority -geometry 1920x1080 -depth 24 -rfbwait 120000 -rfbauth /root/.vnc/passwd -fp /usr/share/fonts/X11/misc/,/usr/share/fonts -rfbport 6006

# check that it started: a vncserver process means it is running
ps -ef | grep vnc | grep -v grep

Starting the VNC server prompts you to set a password; for convenience I simply reused the AutoDL instance password. For the view-only password, answer n to skip it.

[root@autodl-container-e39d46b8d3-01da7b14:~/vnc]$ USER=root /opt/TurboVNC/bin/vncserver :1 -desktop X -auth /root/.Xauthority -geometry 1920x1080 -depth 24 -rfbwait 120000 -rfbauth /root/.vnc/passwd -fp /usr/share/fonts/X11/misc/,/usr/share/fonts -rfbport 6006

You will require a password to access your desktops.

Password:
Warning: password truncated to the length of 8.
Verify:
Would you like to enter a view-only password (y/n)? n
xauth: file /root/.Xauthority does not exist

Desktop 'TurboVNC: autodl-container-e39d46b8d3-01da7b14:1 (root)' started on display autodl-container-e39d46b8d3-01da7b14:1

Creating default startup script /root/.vnc/xstartup.turbovnc
Starting applications specified in /root/.vnc/xstartup.turbovnc
Log file is /root/.vnc/autodl-container-e39d46b8d3-01da7b14:1.log

Once the VNC server is running, its process shows up:

root@autodl-container-e39d46b8d3-01da7b14:~/vnc# ps -ef | grep vnc | grep -v grep
root 28861 1 0 11:22 pts/0 00:00:00 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: autodl-container-64eb44b6f5-c569ba8d:1 (root) -httpd /opt/TurboVNC/bin//../java -auth /root/.Xauthority -geometr

If you stopped the instance and need to restart VNC later, these two commands are all it takes:

rm -rf /tmp/.X1*  # remove the previous session's temporary files first, otherwise startup fails
USER=root /opt/TurboVNC/bin/vncserver :1 -desktop X -auth /root/.Xauthority -geometry 1920x1080 -depth 24 -rfbwait 120000 -rfbauth /root/.vnc/passwd -fp /usr/share/fonts/X11/misc/,/usr/share/fonts -rfbport 6006

5.2. Local port forwarding

Next, bind the port locally over SSH. First copy the SSH login command from the instance list in the AutoDL console; it looks like this:

ssh -p <port> root@<host>

Run the following in a local terminal to forward the remote port to local port 6006:

ssh -CNgv -L 6006:127.0.0.1:6006 root@<host> -p <port>

If the command is right, it will prompt for the AutoDL instance password; copy it from the console and paste it with Ctrl+Shift+V (Cmd+V on macOS).

Keep this terminal open the whole time, otherwise the forwarding stops.

5.3. Connecting via VNC

I used the venerable VNC Viewer to connect to the cloud; it has clients for every platform, just download and install it.

Once installed, enter 127.0.0.1:6006 in the address bar to connect.

image.png

If you get connection closed, most likely the VNC server is not installed properly or the port forwarding failed; retry the steps above. If all is well, a password prompt pops up.

The password here is the one set when starting the VNC server; enter whatever you chose.

On a successful connection you will see a black screen; that is normal.

image.png

5.4. Testing the VNC setup

We can use Pangolin's sample program to check the setup:

cd Pangolin-0.6/examples/HelloPangolin
mkdir build && cd build
cmake ..
make

After building, run export DISPLAY=:1 to enable the GUI before starting any GUI program:

export DISPLAY=:1
./HelloPangolin

Starting it without the export still fails:

root@autodl-container-e39d46b8d3-01da7b14:~/autodl-tmp/Pangolin-0.6/examples/HelloPangolin/build# ./HelloPangolin 
terminate called after throwing an instance of 'std::runtime_error'
what(): Pangolin X11: Failed to open X display
Aborted (core dumped)

After exporting the variable it starts normally, and the picture appears in VNC:

root@autodl-container-e39d46b8d3-01da7b14:~/autodl-tmp/Pangolin-0.6/examples/HelloPangolin/build# export DISPLAY=:1
root@autodl-container-e39d46b8d3-01da7b14:~/autodl-tmp/Pangolin-0.6/examples/HelloPangolin/build# ./HelloPangolin

If the cube below shows up, VNC is installed successfully:

image.png

You can also build OpenCV's demo to test VNC:

cd opencv-3.4.5/samples/cpp/example_cmake
mkdir build && cd build
cmake ..
make
# export the display variable before starting
export DISPLAY=:1
./opencv_example

If everything works, a hello opencv window appears in VNC; it is black because there is no camera.

image.png

6. Running GCNv2_SLAM on a TUM dataset

Now we can run the project. As before, download a TUM dataset; the commands below are copied over from my earlier post.

6.1. Downloading the dataset

Download page: cvg.cit.tum.de/data/datasets/rgbd-dataset/download

Download the fr1/desk dataset, an RGB-D recording of a desk:

image.png

Create datasets/TUM under the GCNv2_SLAM project and download the dataset into it:

# create the datasets/TUM folder
mkdir -p datasets/TUM
cd datasets/TUM
# download the dataset into datasets/TUM
wget -O rgbd_dataset_freiburg1_desk.tgz https://cvg.cit.tum.de/rgbd/dataset/freiburg1/rgbd_dataset_freiburg1_desk.tgz
# extract the dataset
tar -xvf rgbd_dataset_freiburg1_desk.tgz

You also need the associate.py script to preprocess the dataset before it can be used.

Download address: svncvpr.in.tum.de; I have also archived a copy in my GitHub repository.

wget -O associate.py https://svncvpr.in.tum.de/cvpr-ros-pkg/trunk/rgbd_benchmark/rgbd_benchmark_tools/src/rgbd_benchmark_tools/associate.py

This script only runs under Python 2 and needs numpy. Note that in the AutoDL environment python points to python3 and the bundled python2 has been stripped out, so a standalone python2 must be installed.

In the PyTorch 1.5.1 AutoDL image, python2 and pip2 can be installed directly with:

apt-get install -y python-dev python-pip

Then install numpy and you are done:

root@autodl-container-e39d46b8d3-01da7b14:~/autodl-tmp/GCNv2_SLAM/datasets/TUM# pip2 install numpy
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting numpy
Downloading http://mirrors.aliyun.com/pypi/packages/3a/5f/47e578b3ae79e2624e205445ab77a1848acdaa2929a00eeef6b16eaaeb20/numpy-1.16.6-cp27-cp27mu-manylinux1_x86_64.whl (17.0 MB)
|████████████████████████████████| 17.0 MB 21.1 MB/s
Installing collected packages: numpy
Successfully installed numpy-1.16.6

Run the script to associate the two files; execute this inside the dataset folder:

python2 associate.py rgbd_dataset_freiburg1_desk/rgb.txt rgbd_dataset_freiburg1_desk/depth.txt > rgbd_dataset_freiburg1_desk/associate.txt

Afterwards check that the association worked; output like the following means all is well:

1305031472.895713 rgb/1305031472.895713.png 1305031472.892944 depth/1305031472.892944.png
1305031472.927685 rgb/1305031472.927685.png 1305031472.924814 depth/1305031472.924814.png
1305031472.963756 rgb/1305031472.963756.png 1305031472.961213 depth/1305031472.961213.png

Other TUM datasets downloaded from the same site must be processed the same way.
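For the curious, the job associate.py does is just nearest-timestamp matching between the rgb.txt and depth.txt lists. A simplified Python 3 re-implementation (a hypothetical sketch, not the original script; the 0.02 s tolerance mirrors the original's default max_difference):

```python
# Pair rgb and depth frames whose timestamps are closest, greedily, within max_diff seconds.
def associate(rgb, depth, max_diff=0.02):
    # rgb / depth: dicts mapping timestamp (float) -> filename
    candidates = sorted(
        (abs(t1 - t2), t1, t2)
        for t1 in rgb for t2 in depth
        if abs(t1 - t2) < max_diff
    )
    used_rgb, used_depth, pairs = set(), set(), []
    for _, t1, t2 in candidates:  # closest pairs are claimed first
        if t1 not in used_rgb and t2 not in used_depth:
            used_rgb.add(t1)
            used_depth.add(t2)
            pairs.append((t1, rgb[t1], t2, depth[t2]))
    return sorted(pairs)

rgb = {1305031472.895713: "rgb/1305031472.895713.png",
       1305031472.927685: "rgb/1305031472.927685.png"}
depth = {1305031472.892944: "depth/1305031472.892944.png",
         1305031472.924814: "depth/1305031472.924814.png"}
for t1, f1, t2, f2 in associate(rgb, depth):
    print(t1, f1, t2, f2)
```

Each output line has the same four-column shape as the associate.txt sample above: rgb timestamp, rgb file, depth timestamp, depth file.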

6.2. Running the project

Then enter the project's GCN2 directory and run the command; I changed all paths in it to relative paths:

# remember to export the VNC display variable
export DISPLAY=:1
# run the project
cd GCN2
GCN_PATH=gcn2_320x240.pt ./rgbd_gcn ../Vocabulary/GCNvoc.bin TUM3_small.yaml ../datasets/TUM/rgbd_dataset_freiburg1_desk ../datasets/TUM/rgbd_dataset_freiburg1_desk/associate.txt

The project runs, and VNC shows the image output:

image.png

The output after the run finishes:

[root@autodl-container-e39d46b8d3-01da7b14:~/autodl-tmp/GCNv2_SLAM/GCN2]$ GCN_PATH=gcn2_320x240.pt ./rgbd_gcn ../Vocabulary/GCNvoc.bin TUM3_small.yaml ../datasets/TUM/rgbd_dataset_freiburg1_desk ../datasets/TUM/rgbd_dataset_freiburg1_desk/associate.txt

ORB-SLAM2 Copyright (C) 2014-2016 Raul Mur-Artal, University of Zaragoza.
This program comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it
under certain conditions. See LICENSE.txt.

Input sensor was set to: RGB-D

Loading ORB Vocabulary. This could take a while...
Vocabulary loaded!


Camera Parameters:
- fx: 267.7
- fy: 269.6
- cx: 160.05
- cy: 123.8
- k1: 0
- k2: 0
- p1: 0
- p2: 0
- fps: 30
- color order: RGB (ignored if grayscale)

ORB Extractor Parameters:
- Number of Features: 1000
- Scale Levels: 8
- Scale Factor: 1.2
- Initial Fast Threshold: 20
- Minimum Fast Threshold: 7

Depth Threshold (Close/Far Points): 5.97684

-------
Start processing sequence ...
Images in the sequence: 573

Framebuffer with requested attributes not available. Using available framebuffer. You may see visual artifacts.New map created with 251 points
Finished!
-------

median tracking time: 0.0187857
mean tracking time: 0.0193772

Saving camera trajectory to CameraTrajectory.txt ...

trajectory saved!

Saving keyframe trajectory to KeyFrameTrajectory.txt ...

trajectory saved!

A median of 0.0187857 s per frame is about 53 Hz, still some way off the 80 Hz the paper reports on a GTX 1070 laptop GPU.

Later runs came out a bit slower, but overall this is still many times faster than running on the CPU!

median tracking time: 0.0225817
mean tracking time: 0.0236844
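Converting those per-frame times to frame rates is just a reciprocal; a quick arithmetic check of the numbers quoted above:

```python
# tracking times reported by the two runs above, in seconds per frame
first_run = {"median": 0.0187857, "mean": 0.0193772}
second_run = {"median": 0.0225817, "mean": 0.0236844}

for name, times in (("first run", first_run), ("second run", second_run)):
    rates = {k: round(1 / v, 1) for k, v in times.items()}  # Hz = 1 / seconds
    print(name, rates)
```

Even the slower second run lands around 44 Hz at the median, consistent with the conclusion that this is far beyond the 0.5 Hz CPU result.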

7. A failed attempt on a 4090

7.1. Environment setup

I also tried a 4090, with the environment below; the 4090 cannot use an older PyTorch:

PyTorch  1.11.0
Python 3.8(ubuntu20.04)
Cuda 11.3

All dependencies were installed with the same commands; below are some screenshots from that installation.

image.png

The matching libtorch download link for PyTorch 1.11.0:

https://download.pytorch.org/libtorch/cu113/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu113.zip

The package is quite big, 1.6 GB in total, so the download takes a while. Again, better to download it locally first and upload it; every minute on AutoDL is money!

image.png

In the end the project builds fine (the code changes described above are also required):

image.png

7.2. Dataset preprocessing

In the PyTorch 1.11.0 image, python2 has to be installed as follows to process the dataset; the python-pip package is reported as unavailable and cannot be installed directly:

apt-get install -y python-dev-is-python2
wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
python2 get-pip.py

The resulting python2 is shown below; afterwards install numpy as usual and run the script:

root@autodl-container-64eb44b6f5-c569ba8d:~# python2 -V
Python 2.7.18
root@autodl-container-64eb44b6f5-c569ba8d:~# pip2 -V
pip 20.3.4 from /usr/local/lib/python2.7/dist-packages/pip (python 2.7)

7.3. GCN2 coredumps at startup

Start the program with the same command:

cd GCN2
GCN_PATH=gcn2_320x240.pt ./rgbd_gcn ../Vocabulary/GCNvoc.bin TUM3_small.yaml ../datasets/TUM/rgbd_dataset_freiburg1_desk ../datasets/TUM/rgbd_dataset_freiburg1_desk/associate.txt

Disaster: it coredumps!

Camera Parameters: 
- fx: 267.7
- fy: 269.6
- cx: 160.05
- cy: 123.8
- k1: 0
- k2: 0
- p1: 0
- p2: 0
- fps: 30
- color order: RGB (ignored if grayscale)
terminate called after throwing an instance of 'c10::Error'
what(): Legacy model format is not supported on mobile.
Exception raised from deserialize at ../torch/csrc/jit/serialization/import.cpp:267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fefb6de20eb in /root/autodl-tmp/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xd1 (0x7fefb6dddc41 in /root/autodl-tmp/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x35dd53d (0x7feff3ef353d in /root/autodl-tmp/libtorch/lib/libtorch_cpu.so)
frame #3: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1cd (0x7feff3ef48ad in /root/autodl-tmp/libtorch/lib/libtorch_cpu.so)
frame #4: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc1 (0x7feff3ef64c1 in /root/autodl-tmp/libtorch/lib/libtorch_cpu.so)
frame #5: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) + 0x6f (0x7feff3ef65cf in /root/autodl-tmp/libtorch/lib/libtorch_cpu.so)
frame #6: ORB_SLAM2::GCNextractor::GCNextractor(int, float, int, int, int) + 0x670 (0x7ff071e213c0 in /root/autodl-tmp/GCNv2_SLAM/lib/libORB_SLAM2.so)
frame #7: ORB_SLAM2::Tracking::Tracking(ORB_SLAM2::System*, DBoW2::TemplatedVocabulary<cv::Mat, DBoW2::FORB>*, ORB_SLAM2::FrameDrawer*, ORB_SLAM2::MapDrawer*, ORB_SLAM2::Map*, ORB_SLAM2::KeyFrameDatabase*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) + 0x1e7e (0x7ff071dfcf0e in /root/autodl-tmp/GCNv2_SLAM/lib/libORB_SLAM2.so)
frame #8: ORB_SLAM2::System::System(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ORB_SLAM2::System::eSensor, bool) + 0x5ae (0x7ff071de459e in /root/autodl-tmp/GCNv2_SLAM/lib/libORB_SLAM2.so)
frame #9: main + 0x22f (0x5609d811ae2f in ./rgbd_gcn)
frame #10: __libc_start_main + 0xf3 (0x7fefb704a083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #11: _start + 0x2e (0x5609d811c7ce in ./rgbd_gcn)

Aborted (core dumped)

I could not find a solution to this, so I gave up.

GCNv2 is an old project by now, so trouble on a 40-series card is not surprising. There is a blog post out there about running GCNv2 on a Legion laptop with a 4060, but it never mentions this coredump, and GPT could not offer a workable fix either, so I stopped sinking time into it.

8. The end

This post managed to run GCNv2_SLAM on a 2080 Ti environment. The speed still does not match the 80 Hz the paper achieved on a 1070 laptop GPU, but it is far faster than the crawl of local CPU execution.