Python tf.distribute.Strategy: usage and code examples

A state and compute distribution policy on a list of devices.

Usage

tf.distribute.Strategy(
    extended
)

Attributes

cluster_resolver
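Returns the cluster resolver associated with this strategy.
In general, when using a multi-worker tf.distribute strategy such as tf.distribute.experimental.MultiWorkerMirroredStrategy or tf.distribute.TPUStrategy(), there is a tf.distribute.cluster_resolver.ClusterResolver associated with the strategy used, and this property returns that instance.

Strategies that intend to have an associated tf.distribute.cluster_resolver.ClusterResolver must set the relevant attribute or override this property; otherwise, None is returned by default. Those strategies should also document what this property returns.

Single-worker strategies usually do not have a tf.distribute.cluster_resolver.ClusterResolver, and in those cases this property returns None.

The tf.distribute.cluster_resolver.ClusterResolver may be useful when the user needs to access information such as the cluster spec, task type, or task ID. For example,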
```
import json
import os

import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
  'cluster': {
      'worker': ["localhost:12345", "localhost:23456"],
      'ps': ["localhost:34567"]
  },
  'task': {'type': 'worker', 'index': 0}
})

# This implicitly uses TF_CONFIG for the cluster and current task info.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

...

if strategy.cluster_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. Since we set this
  # as a worker above, this block will run on this particular instance.
  pass
elif strategy.cluster_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. Since we
  # set this as a worker above, this block will not run on this particular
  # instance.
  pass
```
For more information, see the API docstring of tf.distribute.cluster_resolver.ClusterResolver.
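As a rough illustration of the kind of information involved, the sketch below reads the same TF_CONFIG layout using only the standard library. It does not call ClusterResolver at all; the helper name `read_tf_config` is hypothetical, and the only assumption is the JSON structure shown in the example above.

```python
import json
import os


def read_tf_config(environ=None):
    """Return (cluster_spec, task_type, task_id) parsed from TF_CONFIG.

    Hypothetical helper for illustration only; not part of TensorFlow.
    """
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    return config.get("cluster", {}), task.get("type"), task.get("index")


# Same cluster layout as in the example above, passed in explicitly so the
# process environment is left untouched.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "worker": ["localhost:12345", "localhost:23456"],
            "ps": ["localhost:34567"],
        },
        "task": {"type": "worker", "index": 0},
    })
}

cluster_spec, task_type, task_id = read_tf_config(example_env)
print(task_type, task_id, len(cluster_spec["worker"]))
```

In real code the strategy's cluster_resolver is the preferred source for these values; the sketch only shows where they originate.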
extended tf.distribute.StrategyExtended with additional methods.
num_replicas_in_sync Returns the number of replicas over which gradients are aggregated.
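One common use of num_replicas_in_sync is deriving a global batch size from a per-replica batch size. The following is a minimal sketch under two assumptions: a single-device tf.distribute.OneDeviceStrategy on CPU is used so that it runs without accelerators, and `PER_REPLICA_BATCH_SIZE` is a hypothetical value chosen for illustration.

```python
import tensorflow as tf

# Assumption: a single CPU device, so exactly one replica is in sync.
strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")

PER_REPLICA_BATCH_SIZE = 32  # hypothetical value for illustration

# Each step processes one per-replica batch on every replica.
global_batch_size = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

print(strategy.num_replicas_in_sync, global_batch_size)
```

With a multi-device strategy the same expression would scale the global batch size with the number of replicas.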

See the guide for an overview and examples. See tf.distribute.StrategyExtended and tf.distribute for a glossary of concepts mentioned on this page, such as "per-replica", replica, and reduce.

In short:

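- To use it with Keras compile/fit, refer to the corresponding guide.
- You may pass a descendant of tf.distribute.Strategy to tf.estimator.RunConfig to specify how a tf.estimator.Estimator should distribute its computation. See the guide.
- Otherwise, use tf.distribute.Strategy.scope to specify that a strategy should be used when building and executing your model. (This puts you in the "cross-replica context" for this strategy, which means the strategy is put in control of things like variable placement.)
- If you are writing a custom training loop, you will need to call a few more methods (see the guide):
  - Start by creating a tf.data.Dataset normally.
  - Use tf.distribute.Strategy.experimental_distribute_dataset to convert the tf.data.Dataset into something that produces "per-replica" values. If you want to specify manually how the dataset should be partitioned across replicas, use tf.distribute.Strategy.distribute_datasets_from_function instead.
  - Use tf.distribute.Strategy.run to run a function once per replica, taking values that may be "per-replica" (for example, from a tf.distribute.DistributedDataset object) and returning "per-replica" values. This function is executed in "replica context", which means each operation is performed separately on each replica.
  - Finally, use a method such as tf.distribute.Strategy.reduce to convert the resulting "per-replica" values into ordinary Tensors.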
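The two contexts mentioned in this list can be observed directly. The following is a minimal sketch, assuming a single-device tf.distribute.OneDeviceStrategy on CPU; the names `observed` and `replica_fn` are hypothetical and exist only for illustration.

```python
import tensorflow as tf

# Assumption: one CPU device, so the sketch runs without accelerators.
strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")

observed = {}

with strategy.scope():
    # Inside scope() we are in the strategy's cross-replica context, and
    # variables created here are placed by the strategy.
    observed["in_scope"] = tf.distribute.in_cross_replica_context()
    weight = tf.Variable(1.0)


def replica_fn():
    # Functions passed to run() execute in replica context instead.
    observed["in_run"] = tf.distribute.in_cross_replica_context()


strategy.run(replica_fn)
print(observed)
```

The flag recorded inside scope() should be true and the one recorded inside run() false, which matches the distinction between "cross-replica context" and "replica context" described above.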

A custom training loop can be as simple as:

with my_strategy.scope():
  @tf.function
  def distribute_train_epoch(dataset):
    def replica_fn(inputs):
      # Process the per-replica inputs and return a result.
      return result

    total_result = 0
    for x in dataset:
      per_replica_result = my_strategy.run(replica_fn, args=(x,))
      total_result += my_strategy.reduce(tf.distribute.ReduceOp.SUM,
                                         per_replica_result, axis=None)
    return total_result

  dist_dataset = my_strategy.experimental_distribute_dataset(dataset)
  for _ in range(EPOCHS):
    train_result = distribute_train_epoch(dist_dataset)

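This takes an ordinary dataset and replica_fn and runs them in a distributed fashion using the particular tf.distribute.Strategy named my_strategy above. Any variables created in replica_fn are created using my_strategy's policy, and library functions called by replica_fn can use the get_replica_context() API to implement distribution-specific behavior.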

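You can use the reduce API to aggregate results across replicas and use the aggregate as the return value of one iteration over a tf.distribute.DistributedDataset. Alternatively, you can use tf.keras.metrics (such as loss, accuracy, etc.) to accumulate metrics across steps within a given epoch.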
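For a version of the loop that runs end to end, here is a minimal sketch under stated assumptions: it uses a single-device tf.distribute.OneDeviceStrategy on CPU, a toy dataset of the numbers 0 through 7, and a stand-in replica_fn that sums squares instead of performing a real training step. None of these choices come from the original page.

```python
import tensorflow as tf

# Assumption: one CPU device, so the sketch runs without accelerators.
my_strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")

# Toy data (hypothetical): the numbers 0..7 in two batches of four.
dataset = tf.data.Dataset.from_tensor_slices(
    tf.range(8, dtype=tf.float32)).batch(4)
dist_dataset = my_strategy.experimental_distribute_dataset(dataset)


@tf.function
def distribute_train_epoch(dist_dataset):
    def replica_fn(inputs):
        # Stand-in for a real training step: sum of squares of the batch.
        return tf.reduce_sum(tf.square(inputs))

    total_result = 0.0
    for x in dist_dataset:
        per_replica_result = my_strategy.run(replica_fn, args=(x,))
        # Collapse the per-replica values into one ordinary scalar Tensor.
        total_result += my_strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_result, axis=None)
    return total_result


train_result = distribute_train_epoch(dist_dataset)
print(float(train_result))
```

Because the sum of the squares of 0 through 7 is 140, the returned total should equal that value regardless of how the batches are split.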

For a more detailed example, see the custom training loop tutorial.

Note: tf.distribute.Strategy currently does not support TensorFlow's partitioned variables (where a single variable is split across multiple devices).

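Note: This article was selected and compiled by 纯净天空 from the original English work tf.distribute.Strategy by the tensorflow.org authors. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.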