Python tf.compat.v1.distribute.experimental.MultiWorkerMirroredStrategy usage and code examples

A distribution strategy for synchronous training on multiple workers.

Inherits from: Strategy

Usage

tf.compat.v1.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.AUTO,
    cluster_resolver=None
)

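Attributes

cluster_resolver
Returns the cluster resolver associated with this strategy.
In general, when using a multi-worker tf.distribute strategy such as tf.distribute.experimental.MultiWorkerMirroredStrategy or tf.distribute.TPUStrategy(), there is a tf.distribute.cluster_resolver.ClusterResolver associated with the strategy used, and this property returns that instance.

Strategies that intend to have an associated tf.distribute.cluster_resolver.ClusterResolver must set the relevant attribute or override this property; otherwise, None is returned by default. Those strategies should also document what this property returns.

Single-worker strategies usually do not have a tf.distribute.cluster_resolver.ClusterResolver, and in those cases this property returns None.

The tf.distribute.cluster_resolver.ClusterResolver may be useful when the user needs to access information such as the cluster spec, task type, or task ID. For example,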
```
import json
import os

import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
  'cluster': {
      'worker': ["localhost:12345", "localhost:23456"],
      'ps': ["localhost:34567"]
  },
  'task': {'type': 'worker', 'index': 0}
})

# This implicitly uses TF_CONFIG for the cluster and current task info.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

...

if strategy.cluster_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. Since we set this
  # as a worker above, this block will run on this particular instance.
  pass
elif strategy.cluster_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. Since we
  # set this as a worker above, this block will not run on this particular
  # instance.
  pass
```
For more information, see the API docstring of tf.distribute.cluster_resolver.ClusterResolver.

extended
tf.distribute.StrategyExtended with additional methods.

num_replicas_in_sync
Returns the number of replicas over which gradients are aggregated.
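
To make the information exposed through cluster_resolver more concrete, the following sketch reads the same fields (cluster spec, task type, task ID) directly from TF_CONFIG using only the standard library. The helper name read_task_info is hypothetical, and this only approximates what a TF_CONFIG-based resolver reports; it is not TensorFlow's implementation.

```python
import json
import os


def read_task_info(environ=os.environ):
    """Return (cluster_spec, task_type, task_id) parsed from TF_CONFIG.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster_spec = config.get("cluster", {})
    task = config.get("task", {})
    # With no TF_CONFIG set, the fields fall back to empty or None values.
    return cluster_spec, task.get("type"), task.get("index")


# Same cluster layout as in the cluster_resolver example.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "worker": ["localhost:12345", "localhost:23456"],
            "ps": ["localhost:34567"],
        },
        "task": {"type": "worker", "index": 0},
    })
}
cluster_spec, task_type, task_id = read_task_info(example_env)
print(task_type, task_id, len(cluster_spec["worker"]))
```

In a real program these values would normally be read from strategy.cluster_resolver rather than parsed by hand.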

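This strategy implements synchronous distributed training across multiple workers, each of which may have multiple GPUs. Similar to tf.distribute.MirroredStrategy, it replicates all variables and computations to each local device. The difference is that it uses a distributed collective implementation (e.g. all-reduce), so that multiple workers can work together.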
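
The role of all-reduce in synchronous training can be illustrated with a toy, single-process simulation. The function all_reduce_sum below is a hypothetical stand-in written in plain Python; the real strategy relies on TensorFlow's collective ops running across processes and devices.

```python
def all_reduce_sum(per_replica_grads):
    """Toy all-reduce: every replica ends up with the element-wise sum.

    per_replica_grads: list of gradient vectors, one per replica.
    Hypothetical illustration only; not how TensorFlow implements collectives.
    """
    num_replicas = len(per_replica_grads)
    # Reduce step: sum each gradient component across replicas.
    summed = [sum(components) for components in zip(*per_replica_grads)]
    # Broadcast step: each replica receives its own copy of the result.
    return [list(summed) for _ in range(num_replicas)]


# Two workers with two GPUs each would give four replicas in sync.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = all_reduce_sum(grads)
print(synced[0])
```

Because every replica applies the same aggregated gradient to identical copies of the variables, the copies stay in sync after each step.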

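You need to launch your program on each worker and configure cluster_resolver correctly. For example, if you are using tf.distribute.cluster_resolver.TFConfigClusterResolver, each worker needs to have its corresponding task_type and task_id set in the TF_CONFIG environment variable. An example TF_CONFIG on worker-0 of a two-worker cluster is: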

TF_CONFIG = '{"cluster":{"worker":["localhost:12345", "localhost:23456"]}, "task":{"type":"worker", "index":0} }'
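
Since every worker shares the same cluster description and differs only in its task index, the TF_CONFIG values can be generated programmatically. The sketch below uses a hypothetical helper, make_tf_config, with the example addresses from above; how each value reaches the environment of its machine is deliberately left out.

```python
import json


def make_tf_config(worker_addresses, task_index):
    """Build the TF_CONFIG JSON string for one worker (hypothetical helper)."""
    if not 0 <= task_index < len(worker_addresses):
        raise ValueError("task_index must refer to an entry in worker_addresses")
    return json.dumps({
        "cluster": {"worker": list(worker_addresses)},
        "task": {"type": "worker", "index": task_index},
    })


workers = ["localhost:12345", "localhost:23456"]
# One TF_CONFIG per worker: identical cluster, different task index.
configs = [make_tf_config(workers, i) for i in range(len(workers))]
for config in configs:
    print(config)
```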

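Your program runs on each worker as-is. Note that collectives require each worker to participate. All tf.distribute and non-tf.distribute APIs may use collectives internally, e.g. checkpointing and saving, since reading a tf.Variable with tf.VariableSynchronization.ON_READ all-reduces the value. It is therefore recommended to run exactly the same program on each worker. Dispatching based on the task_type or task_id of the worker is error-prone.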

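cluster_resolver.num_accelerators() determines the number of GPUs the strategy uses. If it is zero, the strategy uses the CPU. All workers need to use the same number of devices, otherwise the behavior is undefined.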
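
The device-selection rule described above can be sketched in plain Python. This assumes num_accelerators() returns a mapping from accelerator type to count (for example {"GPU": 2}); both helper names are hypothetical, and the logic only approximates what the strategy does internally.

```python
def choose_local_devices(num_accelerators):
    """Return the device names a worker would use (hypothetical helper).

    num_accelerators: mapping such as {"GPU": 2}, assumed to have the shape
    returned by cluster_resolver.num_accelerators().
    """
    num_gpus = num_accelerators.get("GPU", 0)
    if num_gpus > 0:
        return [f"/device:GPU:{i}" for i in range(num_gpus)]
    # No GPUs reported: fall back to the CPU.
    return ["/device:CPU:0"]


def check_uniform_device_counts(devices_per_worker):
    """Raise if workers do not all use the same number of devices."""
    counts = {len(devices) for devices in devices_per_worker.values()}
    if len(counts) > 1:
        raise ValueError(f"workers use differing device counts: {sorted(counts)}")


devices = {
    "worker-0": choose_local_devices({"GPU": 2}),
    "worker-1": choose_local_devices({"GPU": 2}),
}
check_uniform_device_counts(devices)
print(devices["worker-0"])
print(choose_local_devices({}))
```

The strategy itself does not perform the second check. Mismatched counts lead to undefined behavior, so a check of this kind would have to be run by the user before training starts.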

This strategy is not intended for TPU. Use tf.distribute.TPUStrategy instead.

After setting up TF_CONFIG, using this strategy is similar to using tf.distribute.MirroredStrategy and tf.distribute.TPUStrategy.

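import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, input_shape=(5,)),
  ])
  optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

def dataset_fn(ctx):
  x = np.random.random((2, 5)).astype(np.float32)
  y = np.random.randint(2, size=(2, 1))
  dataset = tf.data.Dataset.from_tensor_slices((x, y))
  return dataset.repeat().batch(1, drop_remainder=True)
dist_dataset = strategy.distribute_datasets_from_function(dataset_fn)

# The model outputs logits, so the loss is configured with from_logits=True.
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# The dataset repeats forever, so fit() needs an explicit step count.
# 10 is an arbitrary value chosen for illustration.
model.fit(dist_dataset, steps_per_epoch=10)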

You can also write your own training loop:

@tf.function
def train_step(iterator):

  def step_fn(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(features, training=True)
      # The model outputs logits, hence from_logits=True.
      loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

  strategy.run(step_fn, args=(next(iterator),))

# Reuses dist_dataset from the previous example; NUM_STEP is an arbitrary
# illustrative value.
iterator = iter(dist_dataset)
NUM_STEP = 10
for _ in range(NUM_STEP):
  train_step(iterator)

For a detailed tutorial, see Multi-worker training with Keras.

Saving

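You need to save and checkpoint on all workers, not just one. This is because variables with synchronization=ON_READ trigger aggregation during saving. It is recommended to save to a different path on each worker to avoid race conditions. Each worker saves the same thing. See the Multi-worker training with Keras tutorial for examples.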
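
One way to give each worker its own save path is to derive the directory from the task type and task ID. The helper per_worker_save_dir and the directory layout below are assumptions made for illustration; they are not an API of the strategy.

```python
import os


def per_worker_save_dir(base_dir, task_type, task_id):
    """Return a save directory unique to this worker (hypothetical helper)."""
    # Every worker must save; distinct subdirectories avoid write races.
    return os.path.join(base_dir, f"{task_type}_{task_id}")


# "/tmp/model" is an arbitrary example location.
paths = [per_worker_save_dir("/tmp/model", "worker", i) for i in range(2)]
print(paths)
```

With a real strategy, task_type and task_id would come from strategy.cluster_resolver, and each worker would pass its own path to the save or checkpoint call.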

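Known Issues

tf.distribute.cluster_resolver.TFConfigClusterResolver does not return the correct number of accelerators. The strategy uses all available GPUs if cluster_resolver is tf.distribute.cluster_resolver.TFConfigClusterResolver or None.
In eager mode, the strategy needs to be created before calling any other TensorFlow API.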

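Note: This article was selected and compiled by 純淨天空 from the original English documentation on tensorflow.org for tf.compat.v1.distribute.experimental.MultiWorkerMirroredStrategy. Unless otherwise stated, the copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.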