A detailed guide to GPU allocation and usage strategies in TensorFlow and PyTorch


Published: 2024-07-25 18:21

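Preface: Many articles on allocating and using multiple GPUs cover only the basics and never explain the underlying behavior in depth. The server used throughout this article has four GeForce RTX 2080 Ti GPUs.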

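Note: a deep learning framework does not work with the GPUs as they are physically counted and numbered in the machine. It works with the GPUs that we make visible to it. What does that mean?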

1. GPU visibility and how visible GPUs map to the devices the framework uses (device mapping)

(1) With no restrictions set, the framework sees all four GPUs, and the mapping is as follows:

/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5

The name on the left is the full device name used by the framework; the part on the right describes the actual hardware.

(2) Specifying the visible GPUs ourselves

Suppose GPU:0 and GPU:3 are in use by someone else, so I can use only GPU:1 and GPU:2. The log then shows:

/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5

Only two GPUs are available now. How to specify the visible devices is covered in section 2.

Pay close attention to the device mapping here. The mapping is now:

/device:GPU:0 -> device: 1

/device:GPU:1 -> device: 2

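When assigning devices in TensorFlow, the names we can actually use are /device:GPU:0 and /device:GPU:1, even though they refer to the second and third physical GPUs. This is easy to get wrong. For example, writing: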

with tf.device("/gpu:2"):

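raises an error. Why, when the second and third GPUs are exactly the ones in use? Because of the mapping: only two GPUs are visible, so the framework knows only "/gpu:0" and "/gpu:1", and "/gpu:2" does not exist.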

Remember: both TensorFlow and PyTorch identify devices through this mapping, that is, through the remapped logical indices (here /device:GPU:0 and /device:GPU:1).

A few more examples:

If only the fourth GPU is visible: /device:GPU:0 -> device: 3

If only the third and fourth GPUs are visible: /device:GPU:0 -> device: 2 and /device:GPU:1 -> device: 3

If only the first GPU is visible: /device:GPU:0 -> device: 0
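The renumbering rule in these examples can be sketched in a few lines of plain Python. This is only an illustration of the rule, not framework code: `device_mapping` is a hypothetical helper, it assumes numeric indices (no GPU UUIDs), and it assumes that parsing stops at the first invalid index, as the CUDA documentation describes for CUDA_VISIBLE_DEVICES.

```python
def device_mapping(cuda_visible_devices, physical_count):
    """Return {logical_index: physical_index} for a CUDA_VISIBLE_DEVICES value.

    Hypothetical helper for illustration. Assumes numeric indices only and
    that the first invalid entry ends the list of visible devices.
    """
    mapping = {}
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            break  # the invalid entry and everything after it are ignored
        mapping[len(mapping)] = int(token)  # logical indices are assigned in order
    return mapping


# Reproduce the examples from the text on a 4-GPU machine.
for value in ["1,2", "3", "2,3", "0"]:
    for logical, physical in device_mapping(value, 4).items():
        print(f"CUDA_VISIBLE_DEVICES={value}: /device:GPU:{logical} -> device: {physical}")
```

The logical index is simply the position in the visible list, which is why "/gpu:2" cannot exist when only two GPUs are visible.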

(3) When GPU memory is completely occupied

When the memory of some GPUs is full, for example the third and fourth GPUs on my server, listing all GPU devices fails, as the following code shows:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
'''
Cause of the error: the memory of the third GPU is full
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 11554717696
'''

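When the third and fourth GPUs are fully occupied and we make only GPUs 0 and 1 visible, the listing succeeds. In the output (a screenshot in the original post), the red box marks the two GPUs that can be used, the blue box marks the third and fourth GPUs whose memory is full, and the green box marks the device mapping.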

Summary: because several people share this server, it is best to follow these steps when training on multiple GPUs:

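(1) Step one: explicitly specify the visible devices. First state which GPUs are visible to TensorFlow or PyTorch; the framework then builds the device mapping for those GPUs, following the rules above. To monitor GPU usage in real time, run: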

watch -n 1 nvidia-smi

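(2) Step two: configure how the mapped GPUs are used. This builds on step one: which tensors and operations go on which GPU, how much of each device may be used, how much memory may be allocated, and so on. Keep in mind that all of these settings operate on top of the device mapping from step one.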

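(3) Put the code that specifies the visible devices at the very top of the program, so that a GPU filled up by someone else does not cause unexpected errors.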
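As a rough sketch of these steps, the script below picks GPUs that look idle and exports them before any framework is imported. The helper names, the 10% threshold and the sample numbers are assumptions made for illustration; the `nvidia-smi` query flags are the standard CSV query options. It also sets CUDA_DEVICE_ORDER=PCI_BUS_ID so that CUDA numbers the GPUs in the same order as nvidia-smi.

```python
import os
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def free_gpus(csv_text, max_used_fraction=0.1):
    """Return indices of GPUs whose used-memory fraction is at most the threshold."""
    free = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        if float(used) / float(total) <= max_used_fraction:
            free.append(int(index))
    return free


def select_visible_gpus(max_gpus=2):
    """Query nvidia-smi and export CUDA_VISIBLE_DEVICES; call before importing the framework."""
    try:
        text = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []  # nvidia-smi missing or failing: leave the environment untouched
    chosen = free_gpus(text)[:max_gpus]
    if chosen:
        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # match nvidia-smi numbering
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, chosen))
    return chosen


# Sample values in MiB, made up for illustration: GPUs 2 and 3 are full.
sample = "0, 12, 11019\n1, 300, 11019\n2, 10900, 11019\n3, 11000, 11019\n"
print(free_gpus(sample))
```

On a shared server a GPU can fill up between the query and the start of training, so this narrows the risk rather than removing it.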

2. Ways to explicitly specify the visible GPUs

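Explicitly specifying GPUs means letting the framework see only the GPUs we name and build the device mapping from them. The framework cannot see the others at all, whether they are idle or their memory is full.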

There are several ways to do this; each is described below.

2.1 Specify in the terminal when launching the script (TensorFlow and PyTorch)

For example (note that there must be no spaces around the = sign):

CUDA_VISIBLE_DEVICES=1 python train_net.py
CUDA_VISIBLE_DEVICES=0,1 python train_net.py
CUDA_VISIBLE_DEVICES=0,2,3 python train_net.py
CUDA_VISIBLE_DEVICES="1,2" python train_net.py
CUDA_VISIBLE_DEVICES="1,2,3" python train_net.py
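When the training job is started from a Python launcher instead of the shell, the same per-process setting can be passed through the child's environment. The sketch below is one possible way to do it; `run_with_gpus` is a hypothetical helper, and the child command is a stand-in for `python train_net.py` that only prints what it sees.

```python
import os
import subprocess
import sys


def run_with_gpus(command, gpu_ids):
    """Run `command` with CUDA_VISIBLE_DEVICES set only for that child process."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(i) for i in gpu_ids))
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


# Stand-in for `python train_net.py`: the child just reports the variable it received.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_with_gpus(child, [1, 2])
print(result.stdout.strip())  # the child sees 1,2; the parent's environment is unchanged
```

As with the shell form, the variable applies only to the launched process, so different jobs can be given different GPUs from the same launcher.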

2.2 Specify the visible devices with the os module at the top of the program (TensorFlow and PyTorch)

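import os

# Set this before the framework initializes CUDA. Use one of the following:
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"  # valid indices on this 4-GPU server are 0 to 3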

2.3 GPU visibility in TensorFlow 1.x (TensorFlow 1.13 and earlier)

import tensorflow as tf

# GPU-related session configuration
gpu_options = tf.GPUOptions()
gpu_options.visible_device_list = "1,2"  # the two visible GPUs are the second and third physical GPUs

2.4 TensorFlow 1.14 and TensorFlow 2.x

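import tensorflow as tf

# Get all physical GPUs (from TF 1.14 to 2.0 use tf.config.experimental.list_physical_devices)
physical_devices = tf.config.list_physical_devices('GPU')
# Make the GPUs visible starting from the second one
tf.config.set_visible_devices(physical_devices[1:], 'GPU')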

The function signature is:

tf.config.set_visible_devices(devices, device_type=None)

2.5 PyTorch: to control device visibility, CUDA_VISIBLE_DEVICES is recommended

import torch

print(torch.cuda.is_available())      # True
print(torch.cuda.device_count())      # 4: there are four GPUs
torch.cuda.set_device(2)              # select the third GPU
device = torch.cuda.current_device()  # the current device is 2, so this returns 2
# One might expect that after selecting a single GPU (the third one) only that GPU
# is usable, so that only cuda:0 would work below.
# In fact cuda:1, cuda:2 and cuda:3 all still work.
cuda = torch.device("cuda:1")         # cuda:1
x = torch.tensor([1, 2, 3], device=cuda)
y = torch.tensor([4, 5, 6], device=cuda)
z = torch.add(x, y)
print(z)

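For this reason the official documentation discourages torch.cuda.set_device(): it only changes the current default device and does not control which devices are visible to the framework. The recommended approach is CUDA_VISIBLE_DEVICES.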

See the following example:

import os
import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # only the second GPU is visible to torch, so only cuda:0 is usable
print(torch.cuda.is_available())          # True
print(torch.cuda.device_count())          # 1 when the variable is set before CUDA is initialized
# torch.cuda.set_device(2)                # error: GPU 2 is not visible to torch
device = torch.cuda.current_device()      # 0, the logical index of the only visible GPU
# Creating a torch.device object never fails, whether or not the GPU is really visible:
# torch.device("cuda:1") would still return cuda:1 here.
# Placing a tensor on cuda:1 then fails with RuntimeError: CUDA error: invalid device ordinal,
# so use cuda:0 to keep the tensor code below working.
cuda = torch.device("cuda:0")             # cuda:0
x = torch.tensor([1, 2, 3], device=cuda)
y = torch.tensor([4, 5, 6], device=cuda)
z = torch.add(x, y)
print(z)

3. Common GPU settings in different TensorFlow versions

3.1 TensorFlow 1.13 and earlier

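# List all devices
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
'''
This method is no longer recommended because of a flaw: when GPUs 0 and 1 are
made visible to TensorFlow while GPUs 2 and 3 have all their memory used by
other people, this call cannot list the devices and instead reports that GPUs 2
and 3 are out of memory. Prefer tf.config.list_physical_devices from newer
versions: even when a GPU's memory is fully occupied, the physical GPU should
at least still be counted.
'''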

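The related settings cover assigning operations and tensors to devices, limiting GPU memory, logging the device of each operation, automatic placement on the visible devices, and so on.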

Note: all of these build on the visible devices configured earlier.

(1) Configuring through GPUOptions, ConfigProto and Session

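import tensorflow as tf

graph = tf.Graph()

# Create a GPUOptions object and set its attributes; everything applies to visible devices only
gpu_options = tf.GPUOptions()
gpu_options.visible_device_list = "1,2"            # which GPUs are visible
gpu_options.allow_growth = True                    # allocate GPU memory on demand instead of reserving it all up front
gpu_options.per_process_gpu_memory_fraction = 0.4  # upper bound on the fraction of GPU memory used
# Create a ConfigProto object and set its gpu_options
config = tf.ConfigProto(gpu_options=gpu_options)
config.log_device_placement = True       # log the device of every operation, based on the visible device mapping
config.allow_soft_placement = True       # place operations automatically on the visible devices when needed
config.inter_op_parallelism_threads = 0  # threads for running independent operations in parallel; 0 lets TensorFlow choose
config.intra_op_parallelism_threads = 0  # threads used inside a single operation; 0 lets TensorFlow choose
# Create the Session and associate it with the graph
with tf.Session(config=config, graph=graph) as sess:
    pass  # run operations here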

(2) Assigning operations to a specific device, based on the visible device mapping

graph = tf.Graph()
with graph.as_default():
    # Enter the device scope inside the graph so that it applies to this graph's operations
    with tf.device("/gpu:0"):  # the first of the visible devices
        a = tf.constant([1.0, 2.0])
        b = tf.constant([3.0, 4.0])
        c = tf.add(a, b, name="a_add_b")
        x = tf.Variable(initial_value=[10.0, 20.0])
        y = tf.Variable(initial_value=[30.0, 40.0])
        z = tf.add(x, y, name="x_add_y")

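Summary: GPU device names in TensorFlow

"/device:CPU:0": the CPU of the machine.

"/GPU:0": the first of the GPUs visible to TensorFlow. This is a shorthand and the form used most often.

"/job:localhost/replica:0/task:0/device:GPU:1": the second of the GPUs visible to TensorFlow. This is the fully qualified name, not a shorthand.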

4. Allocation and usage strategies in TensorFlow 1.14 and later (TF 2.x)

(1) Check the number of GPUs and make sure they are available

import tensorflow as tf

# Number of physical GPUs
print("Number of available GPUs: ", len(tf.config.experimental.list_physical_devices('GPU')))
# List all physical devices
print(tf.config.experimental.list_physical_devices())
'''
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),  # XLA stands for Accelerated Linear Algebra
 PhysicalDevice(name='/physical_device:XLA_GPU:1', device_type='XLA_GPU'),  # my understanding: the GPU supports XLA; it is not used here, so it can be ignored
 PhysicalDevice(name='/physical_device:XLA_GPU:2', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:3', device_type='XLA_GPU')]
'''

The list of physical devices shows four device types in total:

CPU

XLA_CPU

GPU

XLA_GPU

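The server actually has just one CPU and four GPUs, so what is XLA? XLA (Accelerated Linear Algebra) is a compiler that optimizes linear algebra computations. The XLA_CPU and XLA_GPU entries indicate that the devices can be used through XLA; they are not additional cards. The console output shows this: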

2020-05-09 13:57:39.978330: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5608d0ad8a20 executing computations on platform Host. Devices:
2020-05-09 13:57:40.636659: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5608d0b3cff0 executing computations on platform CUDA. Devices:

Since there are four device types, we can list the physical devices of one type at a time:

tf.config.experimental.list_physical_devices("CPU")      # returns one device
tf.config.experimental.list_physical_devices("XLA_CPU")  # returns one device
tf.config.experimental.list_physical_devices("GPU")      # returns four devices
tf.config.experimental.list_physical_devices("XLA_GPU")  # returns four devices

(2) Restrict which GPUs are visible to TensorFlow

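# tf.config.set_visible_devices(devices, device_type=None); see section 2. Put this near the top of the code.
# To use GPUs 0 and 1, filter the device list by type first. In the full listing above,
# GPUs 0 and 1 sit at positions 2 and 3, and those positions can differ between machines.
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(gpus[0:2], 'GPU')  # indices refer to the GPU-only list returned above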

(3) Manual device placement (same as in the earlier versions above)

# Place the tensors on the CPU
with tf.device('/CPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)

(4) Show the device of each operation and tensor

tf.debugging.set_log_device_placement(True)  # logs which device each operation and tensor is placed on; put this at the very top

(5) Check the number of logical GPUs

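A logical GPU is essentially a visible GPU, that is, a GPU visible to the TensorFlow framework. When virtual GPUs are configured, a visible GPU that has been split counts as its virtual GPUs instead of as one device:

logical GPUs = visible GPUs that are not split + virtual GPUs

For example: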

# List all physical GPUs; there should be 4
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Strict restriction: allow only the first GPU
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        # Count the logical GPUs
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before the GPUs have been initialized
        print(e)
'''
4 Physical GPUs, 1 Logical GPU
'''

(6) Limiting GPU memory

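gpus = tf.config.experimental.list_physical_devices('GPU')
# Option 1: let memory use grow on demand
tf.config.experimental.set_memory_growth(gpus[0], True)
# Option 2: cap memory at a fixed amount in MB (pick one of the two options for a given GPU)
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],  # one of the visible GPUs
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)]  # uses the virtual GPU mechanism, covered in (8)
)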

(7) Let TensorFlow place operations automatically on the visible devices

tf.config.set_soft_device_placement(True)
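Steps (2), (6) and (7) are usually applied together at program start. The following is a minimal sketch of that combination, assuming TensorFlow 2.1 or later; `configure_gpus` is a hypothetical helper name, and the guarded import only keeps the sketch runnable on machines without TensorFlow.

```python
try:
    import tensorflow as tf
except ImportError:  # keeps the sketch runnable where TensorFlow is not installed
    tf = None


def configure_gpus(visible_indices, memory_growth=True):
    """Restrict TensorFlow to some physical GPUs; return the logical GPU names."""
    if tf is None:
        return []
    gpus = tf.config.list_physical_devices("GPU")
    selected = [gpus[i] for i in visible_indices if i < len(gpus)]
    try:
        tf.config.set_visible_devices(selected, "GPU")
        for gpu in selected:
            tf.config.experimental.set_memory_growth(gpu, memory_growth)
        tf.config.set_soft_device_placement(True)
    except RuntimeError as err:
        # Raised when the runtime is already initialized; the settings come too late.
        print("GPU configuration skipped:", err)
    return [device.name for device in tf.config.list_logical_devices("GPU")]


logical_names = configure_gpus([0, 1])
print(logical_names)
```

Calling this once before any tensors are created keeps all later device names relative to the chosen GPUs.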

(8) Virtual GPUs: simulating a multi-GPU environment on a single GPU

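When the machine has only one GPU, it is sometimes convenient, for writing distributed multi-GPU code, to split that GPU into several virtual GPUs, as in the following code: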

# Get all physical GPUs; assume there are 2 here
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Split the first GPU into two virtual GPUs
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        # Count the logical GPUs
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be configured before the GPUs have been initialized
        print(e)
'''
2 Physical GPU, 3 Logical GPUs
'''
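The logical GPU counts reported in this section follow a simple counting rule, sketched below in plain Python. `count_logical_gpus` is a hypothetical helper for illustration and does not call TensorFlow: each visible GPU contributes one logical GPU unless it has been split, in which case it contributes its number of virtual GPUs.

```python
def count_logical_gpus(visible_gpus, virtual_splits=None):
    """Count logical GPUs.

    visible_gpus: indices of the GPUs visible to the framework.
    virtual_splits: {gpu_index: number_of_virtual_gpus} for GPUs that were split.
    """
    virtual_splits = virtual_splits or {}
    # An unsplit GPU counts once; a split GPU counts once per virtual GPU.
    return sum(virtual_splits.get(gpu, 1) for gpu in visible_gpus)


print(count_logical_gpus([0, 1], {0: 2}))  # two visible GPUs, the first split in two
print(count_logical_gpus([0]))             # four physical GPUs, only the first visible
```

The first call corresponds to the virtual GPU example (3 logical GPUs) and the second to the example in (5) (1 logical GPU).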

Summary: make sure to understand how these four notions relate to each other: physical GPUs, visible GPUs, virtual GPUs and logical GPUs.

5. Common GPU usage in PyTorch

torch.cuda.current_device()                    # index of the currently selected device
torch.cuda.device_count()                      # number of usable GPUs
torch.cuda.get_device_capability(device=None)  # compute capability of a device
torch.cuda.get_device_name(device=None)        # name of a device
torch.cuda.is_available()                      # whether a GPU is available
torch.cuda.is_initialized()                    # whether PyTorch's CUDA state has been initialized
torch.cuda.set_device(device)                  # discouraged; see section 2.5 on specifying visible GPUs

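The torch.cuda module has many other functions. I have not used most of them and have not found good references for them, so I leave them out for now and will add them as I come across them.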
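The query functions listed above can be combined into a small report of what PyTorch actually sees. This is only a sketch: `describe_visible_gpus` is a hypothetical helper, and the guarded import lets it run (and return an empty list) on machines without PyTorch or without a usable GPU.

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def describe_visible_gpus():
    """Return one dict per GPU visible to PyTorch (empty list if there is none)."""
    if torch is None or not torch.cuda.is_available():
        return []
    info = []
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        info.append({
            "device": f"cuda:{index}",  # logical index after device mapping
            "name": torch.cuda.get_device_name(index),
            "compute_capability": f"{major}.{minor}",
        })
    return info


for entry in describe_visible_gpus():
    print(entry)
```

The indices in the report are logical ones, so with CUDA_VISIBLE_DEVICES set they start again at cuda:0 regardless of the physical GPU numbers.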

6. Simple test code after installing the GPU builds

6.1 TensorFlow

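import tensorflow as tf

tf.__version__
# other attributes of the form tf.__xxx__ and tf.version.xxxx
tf.test.is_built_with_cuda()
tf.test.is_gpu_available()  # deprecated in TF 2.x in favor of tf.config.list_physical_devices('GPU')
tf.test.gpu_device_name()
# plus the 1.x and 2.x methods shown earlier for listing all physical devices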

6.2 PyTorch

import torch
from torch.backends import cudnn

torch.__version__
torch.version.cuda               # 9.0
torch.cuda.is_available()
torch.cuda.get_device_name(0)
torch.cuda.get_device_properties(0)
torch.cuda.device_count()
torch.cuda.current_device()
torch.backends.cudnn.version()   # version 7005

x = torch.Tensor([1.0])
xx = x.cuda()
print(xx)
# Check cuDNN
print(cudnn.is_acceptable(xx))