Python tf.distribute.Strategy: Usage and Code Examples

A state and compute distribution policy on a list of devices.

Usage

tf.distribute.Strategy(
    extended
)

Attributes

cluster_resolver
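Returns the cluster resolver associated with this strategy.
In general, when a multi-worker tf.distribute strategy such as tf.distribute.experimental.MultiWorkerMirroredStrategy or tf.distribute.TPUStrategy() is used, a tf.distribute.cluster_resolver.ClusterResolver is associated with that strategy, and this property returns that instance.

Strategies that intend to have an associated tf.distribute.cluster_resolver.ClusterResolver must set the relevant attribute or override this property; otherwise, None is returned by default. Such strategies should also document what this property returns.

Single-worker strategies usually do not have a tf.distribute.cluster_resolver.ClusterResolver, and in those cases this property returns None.

The tf.distribute.cluster_resolver.ClusterResolver may be useful when the user needs to access information such as the cluster spec, task type, or task ID. For example,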
```
import json
import os

import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
  'cluster': {
      'worker': ["localhost:12345", "localhost:23456"],
      'ps': ["localhost:34567"]
  },
  'task': {'type': 'worker', 'index': 0}
})

# This implicitly uses TF_CONFIG for the cluster and current task info.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

...

if strategy.cluster_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. Since we set this
  # as a worker above, this block will run on this particular instance.
  pass
elif strategy.cluster_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. Since we
  # set this as a worker above, this block will not run on this particular
  # instance.
  pass
```
For more information, see the API docstring of tf.distribute.cluster_resolver.ClusterResolver.
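As a small sketch of the information a cluster resolver exposes, the snippet below constructs a tf.distribute.cluster_resolver.TFConfigClusterResolver directly from TF_CONFIG, without creating a strategy. The addresses are the same placeholder values used in the example above, and the variable names (resolver, task_type, task_id, cluster) are chosen here for illustration only.

```python
import json
import os

import tensorflow as tf

# Placeholder cluster layout, mirroring the example above.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["localhost:12345", "localhost:23456"],
        "ps": ["localhost:34567"],
    },
    "task": {"type": "worker", "index": 0},
})

# The resolver only parses TF_CONFIG; it does not start or contact servers.
resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

task_type = resolver.task_type              # type of the current task
task_id = resolver.task_id                  # index of the current task
cluster = resolver.cluster_spec().as_dict() # job name -> list of addresses

print(task_type, task_id)
print(sorted(cluster))
```

Reading these values through the resolver, rather than re-parsing TF_CONFIG by hand, keeps task-specific branching consistent with what the strategy itself sees.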
extended
tf.distribute.StrategyExtended with additional methods.
num_replicas_in_sync
Returns the number of replicas over which gradients are aggregated.
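One common use of num_replicas_in_sync is deriving a global batch size from a per-replica batch size. The sketch below assumes a tf.distribute.OneDeviceStrategy on the CPU purely so that it runs on any machine; PER_REPLICA_BATCH_SIZE is a hypothetical value.

```python
import tensorflow as tf

# OneDeviceStrategy is used only so the sketch runs on a single CPU.
strategy = tf.distribute.OneDeviceStrategy("/cpu:0")

PER_REPLICA_BATCH_SIZE = 32  # hypothetical value

# Scale the batch size by the number of replicas that are kept in sync.
global_batch_size = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

print(strategy.num_replicas_in_sync, global_batch_size)
```

With a multi-device strategy the same expression would scale the global batch size with the number of replicas.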

See the guide for an overview and examples. See tf.distribute.StrategyExtended and tf.distribute for a glossary of the concepts mentioned on this page, such as "per-replica", replica, and reduce.

In short:

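- To use it with Keras compile/fit, read the relevant guide.
- You may pass a descendant of tf.distribute.Strategy to tf.estimator.RunConfig to specify how a tf.estimator.Estimator should distribute its computation. See the guide.
- Otherwise, use tf.distribute.Strategy.scope to specify that a strategy should be used when building and executing your model. (This puts you in the "cross-replica context" for this strategy, which means the strategy is put in control of things like variable placement.)
- If you are writing a custom training loop, you will need to call a few more methods; see the guide:
  - Start by creating a tf.data.Dataset normally.
  - Use tf.distribute.Strategy.experimental_distribute_dataset to convert a tf.data.Dataset into something that produces "per-replica" values. If you want to manually specify how the dataset should be partitioned across replicas, use tf.distribute.Strategy.distribute_datasets_from_function instead.
  - Use tf.distribute.Strategy.run to run a function once per replica, taking values that may be "per-replica" (e.g. from a tf.distribute.DistributedDataset object) and returning "per-replica" values. This function is executed in "replica context", which means each operation is performed separately on each replica.
  - Finally, use a method (such as tf.distribute.Strategy.reduce) to convert the resulting "per-replica" values into ordinary Tensors.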
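The manual partitioning route mentioned in the steps above can be sketched as follows. This is an illustration under stated assumptions, not a recommended input pipeline: the strategy is a tf.distribute.OneDeviceStrategy on the CPU so the code runs anywhere, GLOBAL_BATCH_SIZE and the toy tf.data.Dataset.range(16) data are hypothetical, and dataset_fn, step, and replica_fn are names chosen here.

```python
import tensorflow as tf

# Single-CPU strategy, chosen only so the sketch is runnable anywhere.
strategy = tf.distribute.OneDeviceStrategy("/cpu:0")
GLOBAL_BATCH_SIZE = 8  # hypothetical value


def dataset_fn(input_context):
  # Each input pipeline shards the data itself and batches per replica.
  batch_size = input_context.get_per_replica_batch_size(GLOBAL_BATCH_SIZE)
  ds = tf.data.Dataset.range(16)
  ds = ds.shard(input_context.num_input_pipelines,
                input_context.input_pipeline_id)
  return ds.batch(batch_size)


dist_dataset = strategy.distribute_datasets_from_function(dataset_fn)


@tf.function
def step(batch):
  def replica_fn(x):
    # Placeholder computation standing in for a real training step.
    return tf.reduce_sum(x)

  per_replica = strategy.run(replica_fn, args=(batch,))
  # Convert the "per-replica" result into an ordinary Tensor.
  return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)


total = sum(int(step(batch)) for batch in dist_dataset)
print(total)
```

Compared with experimental_distribute_dataset, this variant leaves sharding and batching decisions to dataset_fn, which receives a tf.distribute.InputContext describing the current input pipeline.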

A custom training loop can be as simple as:

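import tensorflow as tf

# Assumed setup for illustration; substitute your own strategy and data.
my_strategy = tf.distribute.MirroredStrategy()
dataset = tf.data.Dataset.from_tensor_slices(tf.ones([8, 2])).batch(4)
EPOCHS = 2

with my_strategy.scope():
  @tf.function
  def distribute_train_epoch(dataset):
    def replica_fn(inputs):
      # Process inputs and return a result; the sum is only a placeholder.
      result = tf.reduce_sum(inputs)
      return result

    # Start from a float tensor so the loop variable keeps a single dtype.
    total_result = tf.constant(0.0)
    for x in dataset:
      per_replica_result = my_strategy.run(replica_fn, args=(x,))
      total_result += my_strategy.reduce(tf.distribute.ReduceOp.SUM,
                                         per_replica_result, axis=None)
    return total_result

  dist_dataset = my_strategy.experimental_distribute_dataset(dataset)
  for _ in range(EPOCHS):
    train_result = distribute_train_epoch(dist_dataset)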

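This takes an ordinary dataset and replica_fn and runs it in a distributed fashion using the particular tf.distribute.Strategy named my_strategy above. Any variables created in replica_fn are created using my_strategy's policy, and library functions called by replica_fn can use the get_replica_context() API to implement distribution-specific behavior.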
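To illustrate how code can tell the two contexts apart, here is a minimal sketch using tf.distribute.get_replica_context() and tf.distribute.in_cross_replica_context(). It assumes a tf.distribute.OneDeviceStrategy on the CPU so that it runs on any machine; replica_fn and the variable names are illustrative only.

```python
import tensorflow as tf

# Single-CPU strategy, chosen only so the sketch is runnable anywhere.
strategy = tf.distribute.OneDeviceStrategy("/cpu:0")

with strategy.scope():
  # Inside scope() we are in the cross-replica context, so there is no
  # replica context to return.
  in_cross_replica = tf.distribute.in_cross_replica_context()
  ctx_in_scope = tf.distribute.get_replica_context()

  def replica_fn():
    # Inside strategy.run we are in a replica context.
    ctx = tf.distribute.get_replica_context()
    return ctx.replica_id_in_sync_group, ctx.num_replicas_in_sync

  replica_id, num_replicas = strategy.run(replica_fn)

print(in_cross_replica, ctx_in_scope is None)
print(int(replica_id), int(num_replicas))
```

A library function can branch on whether get_replica_context() returns None to decide between per-replica and cross-replica behavior.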

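You can use the reduce API to aggregate results across replicas and use this as the return value from one iteration over a tf.distribute.DistributedDataset. Alternatively, you can use tf.keras.metrics (such as loss, accuracy, etc.) to accumulate metrics across steps in a given epoch.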

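See the custom training loop tutorial for a more detailed example.

Note: tf.distribute.Strategy currently does not support TensorFlow's partitioned variables (where a single variable is split across multiple devices).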


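Note: This article was compiled by 純淨天空 from the original English documentation for tf.distribute.Strategy on tensorflow.org. Unless otherwise stated, copyright of the original code belongs to its original author; do not reproduce or copy this translation without permission or authorization.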