Python tf.compat.v1.distribute.experimental.ParameterServerStrategy: usage and code examples

An asynchronous multi-worker parameter server tf.distribute strategy.

Inherits from: Strategy

Usage

tf.compat.v1.distribute.experimental.ParameterServerStrategy(
    cluster_resolver=None
)

Args

cluster_resolver Optional tf.distribute.cluster_resolver.ClusterResolver object. Defaults to a tf.distribute.cluster_resolver.TFConfigClusterResolver.

Attributes

cluster_resolver
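Returns the cluster resolver associated with this strategy.
In general, when using a multi-worker tf.distribute strategy such as tf.distribute.experimental.MultiWorkerMirroredStrategy or tf.distribute.TPUStrategy(), there is a tf.distribute.cluster_resolver.ClusterResolver associated with the strategy used, and that instance is returned by this property.

Strategies that intend to have an associated tf.distribute.cluster_resolver.ClusterResolver must set the relevant attribute, or override this property; otherwise, None is returned by default. Those strategies should also document what this property returns.

Single-worker strategies usually do not have a tf.distribute.cluster_resolver.ClusterResolver, and in those cases this property returns None.

The tf.distribute.cluster_resolver.ClusterResolver may be useful when the user needs to access information such as the cluster spec, task type or task ID. For example: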
```
import json
import os

import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
  'cluster': {
      'worker': ["localhost:12345", "localhost:23456"],
      'ps': ["localhost:34567"]
  },
  'task': {'type': 'worker', 'index': 0}
})

# This implicitly uses TF_CONFIG for the cluster and current task info.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

...

if strategy.cluster_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. Since we set this
  # as a worker above, this block will run on this particular instance.
  pass
elif strategy.cluster_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. Since we
  # set this as a worker above, this block will not run on this particular
  # instance.
  pass
```
For more information, see the API docstring of tf.distribute.cluster_resolver.ClusterResolver.
extended tf.distribute.StrategyExtended with additional methods.
num_replicas_in_sync Returns the number of replicas over which gradients are aggregated.

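This strategy requires two roles: workers and parameter servers. Variables and updates to those variables are assigned to parameter servers; all other operations are assigned to workers.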

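When each worker has more than one GPU, operations are replicated on all GPUs. Even though operations may be replicated, variables are not, and every worker shares a common view of which parameter server each variable is assigned to.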
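The "common view" of variable placement can be illustrated with a small plain-Python sketch. It does not use TensorFlow, and the round-robin policy, the helper name assign_variables_round_robin and the variable names are assumptions made for illustration only; the placement logic the strategy actually uses may differ.

```python
from typing import Dict, List


def assign_variables_round_robin(
    variable_names: List[str], num_ps_tasks: int
) -> Dict[str, str]:
    """Map each variable name to a parameter server device string.

    Hypothetical helper: round-robin is only an illustrative policy.
    """
    if num_ps_tasks < 1:
        raise ValueError("at least one parameter server is required")
    placement = {}
    for i, name in enumerate(variable_names):
        # Variables are placed once, on a single parameter server task.
        placement[name] = "/job:ps/task:%d" % (i % num_ps_tasks)
    return placement


# Illustrative variable names, listed in creation order.
variables = ["dense/kernel", "dense/bias", "logits/kernel", "logits/bias"]

# Every worker runs the same deterministic function over the same creation
# order, so all workers agree on where each variable lives.
view_on_worker_0 = assign_variables_round_robin(variables, num_ps_tasks=2)
view_on_worker_1 = assign_variables_round_robin(variables, num_ps_tasks=2)
```

Because the mapping depends only on the creation order and the number of parameter servers, no coordination between workers is needed for them to reach the same placement.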

By default it uses TFConfigClusterResolver to detect the configuration for multi-worker training. This requires a 'TF_CONFIG' environment variable, and 'TF_CONFIG' must contain a cluster spec.
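As a minimal sketch of what such a 'TF_CONFIG' value looks like, the snippet below builds one with the standard library only. The helper make_tf_config is hypothetical, and the addresses are the placeholder localhost ports used in the cluster_resolver example on this page; a real job would substitute its own hosts.

```python
import json
import os


def make_tf_config(cluster, task_type, task_index):
    """Return a TF_CONFIG JSON string for one task (hypothetical helper)."""
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index out of range: %r" % task_index)
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# The cluster spec lists both roles the strategy needs.
cluster = {
    "worker": ["localhost:12345", "localhost:23456"],
    "ps": ["localhost:34567"],
}

# Each process sets its own TF_CONFIG before the strategy is created;
# only the "task" entry differs from process to process.
os.environ["TF_CONFIG"] = make_tf_config(cluster, "worker", 0)
parsed = json.loads(os.environ["TF_CONFIG"])
```

The second worker would call make_tf_config(cluster, "worker", 1) and the parameter server make_tf_config(cluster, "ps", 0), with the same cluster dictionary in every process.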

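This class assumes that each worker runs the same code independently, while parameter servers run a standard server. This means that although each worker synchronously computes a single gradient update across all of its GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step) occur on the first replica of every worker.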
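The difference between the synchronous step inside one worker and the asynchronous updates across workers can be shown with a toy simulation. This is plain Python, not TensorFlow: the function worker_step, the averaging of per-GPU gradients, the arrival order and all numeric values are assumptions chosen for illustration.

```python
def worker_step(param, per_gpu_gradients, learning_rate):
    """One worker step: combine the local per-GPU gradients, apply once."""
    # Synchronous part: every GPU on this worker contributes to one update.
    gradient = sum(per_gpu_gradients) / len(per_gpu_gradients)
    return param - learning_rate * gradient


# Asynchronous part: workers apply their updates to the shared parameter in
# whatever order they finish; no worker waits for another worker.
shared_param = 1.0
arrival_order = [
    ("worker0", [0.25, 0.75]),  # made-up gradients from two GPUs
    ("worker1", [0.5, 0.5]),
    ("worker0", [0.25, 0.25]),
]

history = []
for worker, grads in arrival_order:
    shared_param = worker_step(shared_param, grads, learning_rate=0.5)
    history.append((worker, shared_param))
```

In this sketch worker0 contributes two updates while worker1 contributes one, which a fully synchronous scheme would not allow.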

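You are expected to call call_for_each_replica(fn, ...) for any operations that can potentially be replicated across replicas (i.e. multiple GPUs), even if there is only a CPU or a single GPU. When defining fn, take extra care with the following:

1) Opening a device scope under the strategy's scope is generally not recommended. A device scope (i.e. calling tf.device) will be merged with, or override, the device for operations, but it will not change the device for variables.

2) Opening a colocation scope (i.e. calling tf.compat.v1.colocate_with) under the strategy's scope is also not recommended. To colocate variables, use strategy.extended.colocate_vars_with instead. Colocating ops may create device assignment conflicts.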

Note: This strategy only works with the Estimator API. When creating the RunConfig, pass an instance of this strategy as the training distribution strategy (the train_distribute setting). This RunConfig instance should then be passed to the Estimator instance on which train_and_evaluate is called.

For example:

strategy = tf.distribute.experimental.ParameterServerStrategy()
run_config = tf.estimator.RunConfig(
    train_distribute=strategy)
estimator = tf.estimator.Estimator(config=run_config)
tf.estimator.train_and_evaluate(estimator,...)

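Note: This article was compiled by 纯净天空 from the original English documentation for tf.compat.v1.distribute.experimental.ParameterServerStrategy on tensorflow.org. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.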