Python PyTorch Pipe: Usage and Code Examples

This article briefly describes how to use torch.distributed.pipeline.sync.Pipe in Python.

Usage: class torch.distributed.pipeline.sync.Pipe(module, chunks=1, checkpoint='except_last', deferred_batch_norm=False)

Parameters:

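module (torch.nn.Sequential) - The sequential module to parallelize using pipelining. Each module in the sequence must have all of its parameters on a single device. Each module in the sequence must be either an nn.Module or an nn.Sequential (to combine multiple sequential modules on a single device).
chunks (int) - Number of micro-batches (default: 1).
checkpoint (str) - When to enable checkpointing, one of 'always', 'except_last', or 'never' (default: 'except_last'). 'never' disables checkpointing completely, 'except_last' enables checkpointing for all micro-batches except the last one, and 'always' enables checkpointing for all micro-batches.
deferred_batch_norm (bool) - Whether to use deferred BatchNorm moving statistics (default: False). If set to True, statistics are tracked across multiple micro-batches and the running statistics are updated once per mini-batch.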
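
To make the `chunks` parameter concrete, the sketch below computes the micro-batch sizes that a mini-batch would be split into. It assumes the split follows `torch.chunk` semantics along the batch dimension (chunks of size ceil(N / chunks), a smaller final chunk, and possibly fewer chunks than requested). The helper name `micro_batch_sizes` is hypothetical and not part of PyTorch.

```python
import math
from typing import List


def micro_batch_sizes(batch_size: int, chunks: int) -> List[int]:
    """Return the row counts of the micro-batches for one mini-batch.

    Assumption: the split mirrors torch.chunk along dim 0, so every
    micro-batch has ceil(batch_size / chunks) rows except possibly the last.
    """
    if batch_size <= 0 or chunks <= 0:
        raise ValueError("batch_size and chunks must be positive")
    size = math.ceil(batch_size / chunks)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        step = min(size, remaining)
        sizes.append(step)
        remaining -= step
    return sizes


print(micro_batch_sizes(16, 8))  # [2, 2, 2, 2, 2, 2, 2, 2]
print(micro_batch_sizes(10, 4))  # [3, 3, 3, 1]
```

Under this assumption, a batch size that is not a multiple of `chunks` yields uneven micro-batches, and a batch smaller than `chunks` yields fewer micro-batches than requested.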

Raises:

TypeError - the module is not an nn.Sequential.
ValueError - invalid arguments.

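Wraps an arbitrary nn.Sequential module so that it can be trained with synchronous pipeline parallelism. If the module requires a lot of memory and does not fit on a single GPU, pipeline parallelism is a useful technique for training it.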

The implementation is based on the torchgpipe paper.
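
As an illustration of how a GPipe-style pipeline keeps several devices busy, the sketch below enumerates which (micro-batch, stage) pairs can run at each clock tick. It is a simplified model of the forward schedule, not Pipe's internal code; the function name `clock_cycles` and its parameters are chosen here for illustration.

```python
from typing import Iterator, List, Tuple


def clock_cycles(num_micro_batches: int, num_stages: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield, for each clock tick, the (micro_batch, stage) pairs that can run concurrently.

    Stage j can process micro-batch i only after stage j - 1 has finished it,
    so the work advances along diagonals of the micro-batch/stage grid.
    """
    m, n = num_micro_batches, num_stages
    for k in range(m + n - 1):
        yield [(k - j, j) for j in range(max(1 + k - m, 0), min(1 + k, n))]


# 4 micro-batches over 2 stages (for example, 2 GPUs).
for tick, tasks in enumerate(clock_cycles(4, 2)):
    print(tick, tasks)
# 0 [(0, 0)]
# 1 [(1, 0), (0, 1)]
# 2 [(2, 0), (1, 1)]
# 3 [(3, 0), (2, 1)]
# 4 [(3, 1)]
```

In this model the schedule takes m + n - 1 ticks. With a single micro-batch only one stage is ever active per tick, which is why a larger `chunks` value tends to reduce idle time.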

Pipe combines pipeline parallelism with checkpointing to reduce the peak memory required for training while minimizing device under-utilization.
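
The three `checkpoint` modes can be read as a rule for which micro-batches have their activations recomputed during the backward pass. The sketch below encodes that rule as described in the parameter documentation; the helper name `checkpointed_micro_batches` is hypothetical, and the code is an illustration rather than Pipe's actual implementation.

```python
from typing import List


def checkpointed_micro_batches(chunks: int, checkpoint: str = "except_last") -> List[bool]:
    """Return one flag per micro-batch: True if checkpointing applies to it.

    Assumption: micro-batches are indexed 0 .. chunks - 1 in execution order,
    and 'except_last' skips only the final one.
    """
    stops = {"always": chunks, "except_last": chunks - 1, "never": 0}
    if checkpoint not in stops:
        raise ValueError("checkpoint must be 'always', 'except_last', or 'never'")
    stop = stops[checkpoint]
    return [i < stop for i in range(chunks)]


print(checkpointed_micro_batches(4, "except_last"))  # [True, True, True, False]
print(checkpointed_micro_batches(4, "never"))        # [False, False, False, False]
```

Note that with `chunks=1` the default 'except_last' mode checkpoints nothing in this model, because the only micro-batch is also the last one.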

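You should place all the modules on the appropriate devices and wrap them in an nn.Sequential module that defines the desired order of execution. If a module does not contain any parameters or buffers, it is assumed that the module should be executed on the CPU, and the module's input tensors are moved to the CPU before execution. This behavior can be overridden with the WithDevice wrapper, which explicitly specifies the device a module should run on.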

Example:

A pipeline of two FC layers across GPUs 0 and 1.

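>>> import os
>>> import torch
>>> import torch.nn as nn
>>> from torch.distributed.pipeline.sync import Pipe
>>>
>>> # The RPC framework must be initialized before building the Pipe.
>>> os.environ['MASTER_ADDR'] = 'localhost'
>>> os.environ['MASTER_PORT'] = '29500'
>>> torch.distributed.rpc.init_rpc('worker', rank=0, world_size=1)
>>>
>>> # Build the pipe: fc1 lives on GPU 0, fc2 on GPU 1.
>>> fc1 = nn.Linear(16, 8).cuda(0)
>>> fc2 = nn.Linear(8, 4).cuda(1)
>>> model = nn.Sequential(fc1, fc2)
>>> model = Pipe(model, chunks=8)
>>> # The input goes on the device of the first stage.
>>> input = torch.rand(16, 16).cuda(0)
>>> # The forward call returns an RRef, not a tensor.
>>> output_rref = model(input)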

Note

You can wrap a Pipe model with torch.nn.parallel.DistributedDataParallel only when the checkpoint parameter of Pipe is 'never'.

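Note

Pipe currently supports only intra-node pipelining; support for inter-node pipelining is planned for the future. The forward function returns an RRef to allow for inter-node pipelining in the future, where the output might reside on a remote host. For intra-node pipelining, you can use local_value() to retrieve the output locally.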

Warning

Pipe is experimental and subject to change.


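Note: This article was selected and compiled by 纯净天空 from the original English documentation for torch.distributed.pipeline.sync.Pipe on pytorch.org. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.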