Setting Up GPU Environment for Model Training

In a previous post, I laid out the process of setting up TensorFlow for CPU-based model training. However, as this graph shows, the GPU is significantly faster at this process. That’s why I’m documenting my attempt to install the GPU-based version of TensorFlow on an HPC system.

Effectiveness of GPU vs CPU in Model Training

First Steps:

1. Start dangerously modifying graphics card drivers straight away.
2. Have your boss wisely recommend the merits of using Docker.
3. Reluctantly begin researching Docker.
4. Gradually realize the merits of using Docker.

Install the necessary prerequisites: Docker, and Nvidia-Docker, which takes care of setting up the Nvidia host driver environment inside the Docker containers, along with a few other things.

Nvidia-Docker has its own prerequisites: an installed Nvidia driver and CUDA. CUDA is Nvidia’s API that allows the GPU to be used for general-purpose computing.

Before installing new drivers, check to see whether they are already installed.

Nvidia Driver check:
$ cat /proc/driver/nvidia/version
or
$ nvidia-smi

CUDA check:
$ cat /usr/local/cuda/version.txt
or
$ nvcc --version
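The checks above can be rolled into one small script. This is only a sketch: the helper name `report_tool` is made up for illustration, and it reports whether each command is on the PATH, not whether the driver behind it actually works.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is on the PATH.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 1
    fi
}

# nvidia-smi ships with the Nvidia driver, nvcc with the CUDA toolkit.
for tool in nvidia-smi nvcc; do
    report_tool "$tool" || true
done

# The driver exposes its version here once its kernel module is loaded.
if [ -r /proc/driver/nvidia/version ]; then
    cat /proc/driver/nvidia/version
else
    echo "No Nvidia driver version file found"
fi
```

If either tool is reported as missing, carry on with the fresh install described next.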

If these drivers do not work, remove them completely before attempting fresh installs. I used:

sudo apt-get remove --purge 'cuda*'
sudo apt-get remove --purge 'nvidia*'

The process of installing these drivers can be found here. Complete steps 2.1-2.4 and then 3.6. For my OS, I followed the recommended specifications from TensorFlow’s own website, found here. Don’t forget to reboot after installing any Nvidia graphics driver; it is essential. Then test that the drivers are working using the commands mentioned above.

The next step is to pull a blank Ubuntu image using the command:

docker pull ubuntu

To run the image:

sudo nvidia-docker run -it -p 6006:6006 -v /sharedfolder:/root/sharedfolder ubuntu:latest bash
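For anyone adapting the run command, here is a rough breakdown of its flags as a dry run that prints the command rather than executing it. The variable names and the `print_run_command` helper are my own invention; the values are the ones used above, and 6006 is TensorBoard's default port.

```shell
# Values taken from the run command in this post; change them to suit your machine.
HOST_DIR=/sharedfolder            # folder on the host to share
CONTAINER_DIR=/root/sharedfolder  # where it appears inside the container
PORT=6006                         # TensorBoard's default port

# Hypothetical helper: print the run command instead of executing it.
#   -it                 interactive session with a terminal attached
#   -p HOST:CONTAINER   publish a container port on the host
#   -v HOST:CONTAINER   mount a host folder inside the container
print_run_command() {
    echo "sudo nvidia-docker run -it" \
         "-p ${PORT}:${PORT}" \
         "-v ${HOST_DIR}:${CONTAINER_DIR}" \
         "ubuntu:latest bash"
}

print_run_command
```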

To save the image after any modifications you make, use the docker commit command. Note that images and containers can take up a lot of space; use df -h and du -h to check how much space is left on your drive.

Once inside the Docker container, I began to set up the necessary libraries, packages and software:

apt update
apt install -y sudo python python-pip nano git
pip install tensorflow-gpu
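Since committed images eat disk space quickly, it can help to check free space before each docker commit. The sketch below is one way to do that; the `has_free_space` helper, the 10 GiB threshold, and the choice of checking `/` are all assumptions for illustration (Docker may store its data on a different filesystem on your machine).

```shell
# Hypothetical helper: succeed only if PATH has at least MIN_KB kilobytes free.
has_free_space() {
    path="$1"
    min_kb="$2"
    # df -P prints one POSIX-format line per filesystem; -k reports kilobytes.
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$min_kb" ]
}

# Run this on the host, not inside the container.
# 10485760 KB = 10 GiB, an arbitrary threshold chosen for this example.
if has_free_space / 10485760; then
    echo "Enough space to commit. Find the container ID with: docker ps"
    echo "Then run: docker commit <container-id> <new-image-name>"
else
    echo "Low on disk space: check df -h and du -h first."
fi
```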

Replace the original models folder with the latest one from TensorFlow:

git clone https://github.com/tensorflow/models

# Dependencies for the object detection code
apt install -y protobuf-compiler python-pil python-lxml python-tk
pip install --user Cython
pip install --user contextlib2
pip install --user jupyter
pip install --user matplotlib

# Build the COCO API and copy it into the models repository
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
make
cp -r pycocotools ../../models/research/

Download the CUDA and cuDNN packages for the Ubuntu image as well and install them:

# CUDA 10.0 from Nvidia's Ubuntu 18.04 repository
dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install -y cuda-10-0
cat /proc/driver/nvidia/version

# cuDNN 7.5 for CUDA 10.0 (-P keeps the libcudnn symlinks as symlinks)
tar -xzvf cudnn-10.0-linux-x64-v7.5.0.56.tgz
cp cuda/include/cudnn.h /usr/local/cuda/include
cp cuda/include/cudnn.h /usr/local/cuda-10.0/include
cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
cp -P cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
chmod a+r /usr/local/cuda-10.0/include/cudnn.h /usr/local/cuda-10.0/lib64/libcudnn*
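A quick way to confirm the cuDNN copy worked is to read the version back out of the installed header. The sketch below assumes the cuDNN 7.x layout, where the version is stored as CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL defines in cudnn.h; the `cudnn_version` helper name is hypothetical.

```shell
# Hypothetical helper: print the cuDNN version recorded in a cudnn.h header.
# Assumes the cuDNN 7.x layout (version defines live in cudnn.h itself).
cudnn_version() {
    header="${1:-/usr/local/cuda/include/cudnn.h}"
    if [ ! -r "$header" ]; then
        echo "cannot read $header" >&2
        return 1
    fi
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END {
            if (major == "") exit 1   # no version defines found
            print major "." minor "." patch
        }
    ' "$header"
}
```

Calling `cudnn_version` with no arguments reads /usr/local/cuda/include/cudnn.h; for the archive used here it should report 7.5.0.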

Configure Paths

export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Run the following from models/research
python setup.py build
python setup.py install
protoc object_detection/protos/*.proto --python_out=.
export PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/slim
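With the library paths set, it is worth confirming that TensorFlow can actually see the GPU before starting a long training run. This is a minimal sketch assuming TensorFlow 1.x, where `tf.test.is_gpu_available()` is the relevant call; the `tf_gpu_check` helper and the `PYTHON` override variable are made up for this example.

```shell
# Hypothetical helper: ask TensorFlow whether it can see a GPU.
# Returns 0 if a GPU is visible, 1 if not, 2 if TensorFlow cannot be imported.
tf_gpu_check() {
    py="${PYTHON:-python}"
    if ! "$py" -c "import tensorflow" >/dev/null 2>&1; then
        echo "TensorFlow is not importable with $py"
        return 2
    fi
    # TensorFlow logs to stderr; keep only the True/False printed on stdout.
    result=$("$py" -c "import tensorflow as tf; print(tf.test.is_gpu_available())" 2>/dev/null | tail -n 1)
    echo "GPU available: $result"
    [ "$result" = "True" ]
}

tf_gpu_check || echo "Recheck the CUDA, cuDNN and LD_LIBRARY_PATH steps."
```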

Carry out training

python object_detection/train.py --logtostderr --train_dir=object_detection/training/ --pipeline_config_path=object_detection/training/ssd_mobilenet_v1_pets.config
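The run command earlier published port 6006, which is TensorBoard's default, so training progress can be watched from the host. The sketch below is one way to start it from a second shell inside the container; the `start_tensorboard` helper is hypothetical, and binding to 0.0.0.0 is my assumption about what is needed to reach it through the published port.

```shell
# Hypothetical helper: start TensorBoard on the training directory.
start_tensorboard() {
    logdir="${1:-object_detection/training/}"
    if [ ! -d "$logdir" ]; then
        echo "No such training directory: $logdir"
        return 1
    fi
    if ! command -v tensorboard >/dev/null 2>&1; then
        echo "tensorboard is not on the PATH"
        return 1
    fi
    # Listen on all interfaces so the host can reach it via the published port.
    tensorboard --logdir="$logdir" --host 0.0.0.0 --port 6006
}
```

Run `start_tensorboard object_detection/training/` from models/research, then open http://localhost:6006 in a browser on the host.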

Export/Freeze model into a usable state (Replace #### with the number of the latest model.ckpt file)

python export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path "training/ssd_mobilenet_v1_pets.config" \
--trained_checkpoint_prefix training/model.ckpt-#### \
--output_directory frozenModel
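To avoid looking up #### by hand, the latest checkpoint number can be read from the file names in the training folder. This sketch assumes TensorFlow 1.x checkpoint naming, where each checkpoint writes a model.ckpt-<step>.index file; the `latest_ckpt_step` helper is a name invented here.

```shell
# Hypothetical helper: print the highest checkpoint step found in a directory.
latest_ckpt_step() {
    dir="${1:-training}"
    for f in "$dir"/model.ckpt-*.index; do
        [ -e "$f" ] || continue        # the glob matched nothing
        step="${f##*model.ckpt-}"      # drop everything up to the step number
        echo "${step%.index}"          # drop the .index suffix
    done | sort -n | tail -n 1
}

step=$(latest_ckpt_step training)
if [ -n "$step" ]; then
    echo "Use: --trained_checkpoint_prefix training/model.ckpt-$step"
else
    echo "No checkpoints found in training/ yet."
fi
```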

You can now test this model using TensorFlow’s own object detection tutorial code. You’ll have to change the paths in boxes 5 and 9.


Andre Foote

13 May, 2019
