Python tf.distribute.MirroredStrategy: usage and code examples

Synchronous training across multiple replicas on one machine.

Inherits from: Strategy

Usage

tf.distribute.MirroredStrategy(
    devices=None, cross_device_ops=None
)

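Args

devices A list of device strings, such as ['/gpu:0', '/gpu:1']. If None, all available GPUs are used. If no GPUs are found, the CPU is used.
cross_device_ops Optional; a descendant of CrossDeviceOps. If not set, NcclAllReduce() is used by default. Customize this if NCCL is not available, or if a special implementation that exploits the particular hardware is available.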
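
As a minimal sketch of the two constructor arguments, the snippet below leaves devices at None and passes an explicit cross_device_ops. The choice of tf.distribute.HierarchicalCopyAllReduce and the PER_REPLICA_BATCH_SIZE constant are assumptions made for illustration only; they are not prescribed by this page.

```python
import tensorflow as tf

# Assumption: HierarchicalCopyAllReduce is chosen only as an illustrative
# alternative to the default NcclAllReduce.
strategy = tf.distribute.MirroredStrategy(
    devices=None,  # None: use every visible GPU, or the CPU if none is found
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(),
)

# Hypothetical per-replica batch size, scaled up to a global batch size.
PER_REPLICA_BATCH_SIZE = 64
global_batch_size = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

print("replicas in sync:", strategy.num_replicas_in_sync)
print("global batch size:", global_batch_size)
```

On a machine without GPUs this should report a single replica, since all CPUs are treated as one device.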

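Attributes

cluster_resolver
Returns the cluster resolver associated with this strategy.
In general, when using a multi-worker tf.distribute strategy such as tf.distribute.experimental.MultiWorkerMirroredStrategy or tf.distribute.TPUStrategy(), there is a tf.distribute.cluster_resolver.ClusterResolver associated with the strategy used, and that instance is returned by this property.

Strategies that intend to have an associated tf.distribute.cluster_resolver.ClusterResolver must set the relevant attribute or override this property; otherwise, None is returned by default. Those strategies should also document what this property returns.

Single-worker strategies usually do not have a tf.distribute.cluster_resolver.ClusterResolver, and in those cases this property returns None.

The tf.distribute.cluster_resolver.ClusterResolver may be useful when the user needs to access information such as the cluster spec, task type, or task ID. For example,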
```
import json
import os

import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
  'cluster': {
      'worker': ["localhost:12345", "localhost:23456"],
      'ps': ["localhost:34567"]
  },
  'task': {'type': 'worker', 'index': 0}
})

# This implicitly uses TF_CONFIG for the cluster and current task info.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

...

if strategy.cluster_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. Since we set this
  # as a worker above, this block will run on this particular instance.
  pass
elif strategy.cluster_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. Since we
  # set this as a worker above, this block will not run on this particular
  # instance.
  pass
```
For more information, see the API docstring of tf.distribute.cluster_resolver.ClusterResolver.
extended tf.distribute.StrategyExtended with additional methods.
num_replicas_in_sync Returns the number of replicas over which gradients are aggregated.
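
The task type and task ID mentioned under cluster_resolver come from the TF_CONFIG environment variable in the multi-worker case. The sketch below reads that JSON with the standard library only, to show which fields carry the information. The helper name read_task_info is hypothetical, and this is not how TensorFlow itself resolves the cluster.

```python
import json
import os


def read_task_info(environ=os.environ):
    """Return (cluster, task_type, task_index) parsed from TF_CONFIG.

    Hypothetical helper for illustration; missing fields come back as
    an empty dict or None.
    """
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return cluster, task.get("type"), task.get("index")


# Same cluster layout as the TF_CONFIG example above, passed in as a plain
# dict so that the real process environment is left untouched.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "worker": ["localhost:12345", "localhost:23456"],
            "ps": ["localhost:34567"],
        },
        "task": {"type": "worker", "index": 0},
    })
}

cluster, task_type, task_index = read_task_info(example_env)
print(task_type, task_index, len(cluster["worker"]))
```

With no TF_CONFIG set, the helper returns an empty cluster and None for both task fields, which mirrors the single-worker case where cluster_resolver is None.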

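This strategy is typically used for training on one machine with multiple GPUs. For TPUs, use tf.distribute.TPUStrategy. To use MirroredStrategy with multiple workers, see tf.distribute.experimental.MultiWorkerMirroredStrategy.

For example, a variable created under a MirroredStrategy is a MirroredVariable. If no devices are specified in the constructor argument of the strategy, it uses all available GPUs. If no GPUs are found, it uses the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device and uses threads internally for parallelism.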

>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> with strategy.scope():
...   x = tf.Variable(1.)
>>> x
MirroredVariable:{
  0:<tf.Variable ... shape=() dtype=float32, numpy=1.0>,
  1:<tf.Variable ... shape=() dtype=float32, numpy=1.0>
}

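When using distribution strategies, all variable creation should be done within the strategy's scope. This replicates the variables across all replicas and keeps them in sync using an all-reduce algorithm.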
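
To make the role of all-reduce concrete, here is a purely conceptual sketch in plain Python. It is not TensorFlow's implementation; the function all_reduce_sum, the two simulated replicas, and all numeric values are assumptions chosen for illustration. The point is that once every replica receives the same aggregated gradient, applying the same update keeps the mirrored copies identical.

```python
def all_reduce_sum(per_replica_values):
    """Give every replica the sum of all replicas' values."""
    total = sum(per_replica_values)
    return [total for _ in per_replica_values]


# Two hypothetical replicas start with identical copies of one variable.
replica_vars = [1.0, 1.0]
# Each replica computes a different local gradient from its own data shard.
local_grads = [0.5, 1.5]
learning_rate = 0.25

# After the all-reduce, every replica holds the same aggregated gradient,
reduced_grads = all_reduce_sum(local_grads)
# so applying the same update rule keeps the copies in sync.
replica_vars = [
    var - learning_rate * grad
    for var, grad in zip(replica_vars, reduced_grads)
]

print(replica_vars)  # both entries are equal
```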

Variables created under a MirroredStrategy inside a function wrapped with tf.function are still MirroredVariables.

>>> x = []
>>> @tf.function  # Wrap the function with tf.function.
... def create_variable():
...   if not x:
...     x.append(tf.Variable(1.))
...   return x[0]
>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> with strategy.scope():
...   _ = create_variable()
...   print(x[0])
MirroredVariable:{
  0:<tf.Variable ... shape=() dtype=float32, numpy=1.0>,
  1:<tf.Variable ... shape=() dtype=float32, numpy=1.0>
}

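experimental_distribute_dataset can be used to distribute the dataset across the replicas when writing your own training loop. If you are using the .fit and .compile methods available in tf.keras, then tf.keras handles the distribution for you.

For example: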

my_strategy = tf.distribute.MirroredStrategy()
with my_strategy.scope():
  @tf.function
  def distribute_train_epoch(dataset):
    def replica_fn(input):
      # process input and return result
      return result

    total_result = 0
    for x in dataset:
      per_replica_result = my_strategy.run(replica_fn, args=(x,))
      total_result += my_strategy.reduce(tf.distribute.ReduceOp.SUM,
                                         per_replica_result, axis=None)
    return total_result

  dist_dataset = my_strategy.experimental_distribute_dataset(dataset)
  for _ in range(EPOCHS):
    train_result = distribute_train_epoch(dist_dataset)
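
The example above leaves dataset, EPOCHS, and the body of replica_fn undefined. A self-contained variant might look like the following sketch, which assumes a toy dataset of the integers 0 to 7 and uses a per-replica sum as a stand-in for real training work. The names distribute_sum_epoch and epoch_total are hypothetical.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Hypothetical toy data: the integers 0..7 in global batches of 4.
dataset = tf.data.Dataset.range(8).batch(4)
dist_dataset = strategy.experimental_distribute_dataset(dataset)


@tf.function
def distribute_sum_epoch(dist_ds):
    def replica_fn(batch):
        # Placeholder per-replica computation: sum this replica's share
        # of the global batch.
        return tf.reduce_sum(batch)

    total = tf.constant(0, dtype=tf.int64)
    for batch in dist_ds:
        per_replica = strategy.run(replica_fn, args=(batch,))
        # Combine the per-replica results into a single tensor.
        total += strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica, axis=None)
    return total


epoch_total = distribute_sum_epoch(dist_dataset)
print(int(epoch_total))
```

Because each global batch is split across the replicas and the per-replica sums are then added back together, the total should not depend on how many devices are available.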

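Note: This article was selected and compiled by 純淨天空 from the original English work tf.distribute.MirroredStrategy on tensorflow.org. Unless otherwise stated, copyright of the original code belongs to the original author. Do not reproduce or copy this translation without permission or authorization.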