Python tf.distribute.experimental.ParameterServerStrategy: Usage and Code Examples

A multi-worker tf.distribute strategy with parameter servers.

Inherits from: Strategy

Usage

tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver, variable_partitioner=None
)

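Parameters

cluster_resolver A tf.distribute.cluster_resolver.ClusterResolver object.
variable_partitioner
A distribute.experimental.partitioners.Partitioner that specifies how to partition variables. If None, variables will not be partitioned.
- Predefined partitioners in tf.distribute.experimental.partitioners can be used for this argument. A commonly used partitioner is MinSizePartitioner(min_shard_bytes = 256 << 10, max_shards = num_ps), which allocates at least 256K per shard, with each ps getting at most one shard.
- variable_partitioner is called for each variable created under the strategy scope to indicate how the variable should be partitioned. Variables that have only one partition along the partitioning axis (i.e., no partitioning is needed) are created as a normal tf.Variable.
- Only partitioning along the first / outermost axis is supported.
- The div partition strategy is used to partition variables. Assume consecutive integer ids are assigned along the first axis of a variable; the ids are then assigned to shards contiguously, keeping the shard sizes as equal as possible. If the number of ids is not evenly divisible by the number of shards, each of the first several shards is assigned one more id. For instance, a variable whose first dimension is 13 has 13 ids, and they are split across 5 shards as: [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]].
- Variables created under strategy.extended.colocate_vars_with will not be partitioned.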
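
The div partitioning rule described above can be sketched in plain Python. This is only an illustration of the arithmetic, not TensorFlow's implementation; the helper name `div_partition` is hypothetical.

```python
def div_partition(num_ids, num_shards):
    """Split ids 0..num_ids-1 into contiguous shards; the first shards get the extra ids."""
    base, extra = divmod(num_ids, num_shards)
    shards = []
    start = 0
    for shard_index in range(num_shards):
        # The first `extra` shards each hold one more id than the rest.
        size = base + (1 if shard_index < extra else 0)
        shards.append(list(range(start, start + size)))
        start += size
    return shards


print(div_partition(13, 5))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]]
```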

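Attributes

cluster_resolver
Returns the cluster resolver associated with this strategy.
In general, when using a multi-worker tf.distribute strategy such as tf.distribute.experimental.MultiWorkerMirroredStrategy or tf.distribute.TPUStrategy(), there is a tf.distribute.cluster_resolver.ClusterResolver associated with the strategy used, and such an instance is returned by this property.

Strategies that intend to have an associated tf.distribute.cluster_resolver.ClusterResolver must set the relevant attribute or override this property; otherwise, None is returned by default. Those strategies should also provide information about what this property returns.

Single-worker strategies usually do not have a tf.distribute.cluster_resolver.ClusterResolver, and in those cases this property returns None.

The tf.distribute.cluster_resolver.ClusterResolver may be useful when the user needs to access information such as the cluster spec, task type, or task id. For example,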
```
import json
import os

import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
  'cluster':{
      'worker':["localhost:12345", "localhost:23456"],
      'ps':["localhost:34567"]
  },
  'task':{'type':'worker', 'index':0}
})

# This implicitly uses TF_CONFIG for the cluster and current task info.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

...

if strategy.cluster_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. Since we set this
  # as a worker above, this block will run on this particular instance.
  pass
elif strategy.cluster_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. Since we
  # set this as a worker above, this block will not run on this particular
  # instance.
  pass
```
For more information, see the API docstring of tf.distribute.cluster_resolver.ClusterResolver.
extended tf.distribute.StrategyExtended with additional methods.
num_replicas_in_sync Returns the number of replicas over which gradients are aggregated.

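Parameter server training is a common data-parallel method for scaling up a machine learning model on multiple machines. A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers, and they are read and updated by workers in each step. By default, workers read and update these variables independently, without synchronizing with each other. This configuration is known as asynchronous training.

In TensorFlow 2, we recommend an architecture based on central coordination for parameter server training. Each worker and parameter server runs a tf.distribute.Server, and on top of that, a coordinator task is responsible for creating resources on workers and parameter servers, dispatching functions, and coordinating the training. The coordinator uses a tf.distribute.experimental.coordinator.ClusterCoordinator to coordinate the cluster, and a tf.distribute.experimental.ParameterServerStrategy to define variables on parameter servers and computation on workers.

For the training to work, the coordinator dispatches tf.functions to be executed on remote workers. Upon receiving a request from the coordinator, a worker executes the tf.function by reading the variables from the parameter servers, executing the ops, and updating the variables on the parameter servers. Each worker only processes requests from the coordinator and communicates with the parameter servers, without interacting directly with other workers in the cluster.

As a result, failures of some workers do not prevent the cluster from continuing its work, which allows the cluster to train on instances that can be occasionally unavailable (e.g. preemptible or spot instances). The coordinator and the parameter servers, however, must be available at all times for the cluster to make progress.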
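
The fault-tolerance idea above can be illustrated with a toy, single-process simulation. This is not how ClusterCoordinator is implemented; it is a minimal sketch in which a hypothetical `run_coordinator` helper hands closures to workers in turn and puts a closure back in the queue when its worker fails, so the remaining work still completes.

```python
from collections import deque


def run_coordinator(closures, workers, flaky_worker=None):
    """Dispatch (name, fn) closures to workers in turn.

    `flaky_worker` names a worker that fails on its first assignment only;
    the closure it was given is re-queued and retried later.
    """
    pending = deque(closures)
    results = {}
    failed_once = False
    turn = 0
    while pending:
        name, fn = pending.popleft()
        worker = workers[turn % len(workers)]
        turn += 1
        if worker == flaky_worker and not failed_once:
            failed_once = True
            pending.append((name, fn))  # Re-queue; a later turn retries it.
            continue
        results[name] = (worker, fn())
    return results


closures = [(f"step{i}", lambda i=i: i * i) for i in range(4)]
results = run_coordinator(closures, ["worker0", "worker1"], flaky_worker="worker1")
for name in sorted(results):
    print(name, results[name])
```

Every closure still finishes even though one worker failed once, which mirrors why individual worker failures need not stall training.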

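Note that the coordinator is not one of the training workers. Instead, it creates resources such as variables and datasets, dispatches tf.functions, saves checkpoints, and so on. In addition to workers, parameter servers, and the coordinator, an optional evaluator can be run on the side; it periodically reads the checkpoints saved by the coordinator and runs an evaluation against each checkpoint.

ParameterServerStrategy is supported with two training APIs: the Custom Training Loop (CTL) and the Keras training API, also known as Model.fit. CTL is recommended when users prefer to define the details of their training loop, and Model.fit is recommended when users prefer a high-level abstraction that handles training for them.

When using a CTL, ParameterServerStrategy must be used together with a tf.distribute.experimental.coordinator.ClusterCoordinator object.

When using Model.fit, currently only the tf.keras.utils.experimental.DatasetCreator input type is supported.

Example code for the coordinator

This section provides code snippets that are intended to run on the (one and only) task designated as the coordinator. Note that the cluster_resolver, variable_partitioner, and dataset_fn arguments are explained in the "Cluster setup", "Variable partitioning", and "Dataset preparation" sections below.

With a CTL,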

# Prepare a strategy to use with the cluster and variable partitioning info.
strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver=...,
    variable_partitioner=...)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(
    strategy=strategy)

# Prepare a distribute dataset that will place datasets on the workers.
distributed_dataset = coordinator.create_per_worker_dataset(dataset_fn=...)

with strategy.scope():
  model = ...
  optimizer, metrics = ...  # Keras optimizer/metrics are great choices
  checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
  checkpoint_manager = tf.train.CheckpointManager(
      checkpoint, checkpoint_dir, max_to_keep=2)
  # `load_checkpoint` infers initial epoch from `optimizer.iterations`.
  initial_epoch = load_checkpoint(checkpoint_manager) or 0

@tf.function
def worker_fn(iterator):

  def replica_fn(inputs):
    batch_data, labels = inputs
    # Calculate gradients, apply gradients, update metrics, etc.
    ...

  strategy.run(replica_fn, args=(next(iterator),))

for epoch in range(initial_epoch, num_epoch):
  distributed_iterator = iter(distributed_dataset)  # Reset iterator state.
  for step in range(steps_per_epoch):

    # Asynchronously schedule the `worker_fn` to be executed on an arbitrary
    # worker. This call returns immediately.
    coordinator.schedule(worker_fn, args=(distributed_iterator,))

  # `join` blocks until all scheduled `worker_fn`s finish execution. Once it
  # returns, we can read the metrics and save checkpoints as needed.
  coordinator.join()
  logging.info('Metric result: %r', metrics.result())
  metrics.reset_states()
  checkpoint_manager.save()

With Model.fit,

# Prepare a strategy to use with the cluster and variable partitioning info.
strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver=...,
    variable_partitioner=...)

# A dataset function takes an `input_context` and returns a `Dataset`.
def dataset_fn(input_context):
  dataset = tf.data.Dataset.from_tensors(...)
  return dataset.repeat().shard(...).batch(...).prefetch(...)

# With `Model.fit`, a `DatasetCreator` needs to be used.
input = tf.keras.utils.experimental.DatasetCreator(dataset_fn=...)

with strategy.scope():
  model = ...  # Make sure the `Model` is created within scope.
model.compile(optimizer="rmsprop", loss="mse", steps_per_execution=..., ...)

# Optional callbacks to checkpoint the model, back up the progress, etc.
callbacks = [tf.keras.callbacks.ModelCheckpoint(...), ...]

# `steps_per_epoch` is required with `ParameterServerStrategy`.
model.fit(input, epochs=..., steps_per_epoch=..., callbacks=callbacks)

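Example code for workers and parameter servers

In addition to the coordinator, there should be tasks designated as "worker" or "ps". They should run the following code to start a TensorFlow server that waits for requests from the coordinator: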

# Provide a `tf.distribute.cluster_resolver.ClusterResolver` that serves
# the cluster information. See below "Cluster setup" section.
cluster_resolver = ...

server = tf.distribute.Server(
    cluster_resolver.cluster_spec(),
    job_name=cluster_resolver.task_type,
    task_index=cluster_resolver.task_id,
    protocol="grpc")

# Blocking the process that starts a server from exiting.
server.join()

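Cluster setup

For the tasks in the cluster to know the addresses of the other tasks, a tf.distribute.cluster_resolver.ClusterResolver must be used in the coordinator, workers, and ps. The tf.distribute.cluster_resolver.ClusterResolver is responsible for providing the cluster information, as well as the task type and id of the current task. See tf.distribute.cluster_resolver.ClusterResolver for more information.

If the TF_CONFIG environment variable is set, a tf.distribute.cluster_resolver.TFConfigClusterResolver should be used as well.

Because tf.distribute.experimental.ParameterServerStrategy makes assumptions about the naming of task types, "chief", "ps", and "worker" should be used in the tf.distribute.cluster_resolver.ClusterResolver to refer to the coordinator, parameter servers, and workers, respectively.

The following example demonstrates setting TF_CONFIG for the task designated as a parameter server (task type "ps") with index 1 (the second task), in a cluster with 1 chief, 2 parameter servers, and 3 workers. Note that it must be set before tf.distribute.cluster_resolver.TFConfigClusterResolver is used.

Example code for cluster setup: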

os.environ['TF_CONFIG'] = '''
{
  "cluster":{
    "chief":["chief.example.com:2222"],
    "ps":["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker":["worker0.example.com:2222", "worker1.example.com:2222",
               "worker2.example.com:2222"]
  },
  "task":{
    "type":"ps",
    "index":1
  }
}
'''
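
Since every task needs the same "cluster" section and differs only in its "task" section, the TF_CONFIG strings can be generated rather than written by hand. The following is a minimal sketch using only the standard library; the helper name `make_tf_config` is hypothetical, and the host names are the placeholder addresses from the example above.

```python
import json

CLUSTER = {
    "chief": ["chief.example.com:2222"],
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222",
               "worker2.example.com:2222"],
}


def make_tf_config(task_type, task_index, cluster=CLUSTER):
    """Return the TF_CONFIG JSON string for one task of the cluster."""
    if task_type not in cluster:
        raise ValueError(f"Unknown task type: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"No task {task_index} of type {task_type!r}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# On the second parameter server, this string would be assigned to
# os.environ["TF_CONFIG"] before any cluster resolver is created.
tf_config = make_tf_config("ps", 1)
print(tf_config)
```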

If you prefer to run the same binary for all tasks, you will need to let the binary branch into different roles at the beginning of the program:

# If coordinator, create a strategy and start the training program.
if cluster_resolver.task_type == 'chief':
  strategy = tf.distribute.experimental.ParameterServerStrategy(
      cluster_resolver)
  ...

# If worker/ps, create a server
elif cluster_resolver.task_type in ("worker", "ps"):
  server = tf.distribute.Server(...)
  ...
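
To make the branching concrete without starting any servers, the sketch below reads a TF_CONFIG-style JSON string with the standard library and reports which branch a task would take and which address it owns. The function name `describe_task` and the role labels are hypothetical; in a real program the task type would come from the cluster resolver, as in the snippet above.

```python
import json


def describe_task(tf_config_json):
    """Return (role, address) for the task described by a TF_CONFIG string."""
    config = json.loads(tf_config_json)
    task_type = config["task"]["type"]
    task_index = config["task"]["index"]
    address = config["cluster"][task_type][task_index]
    if task_type == "chief":
        role = "coordinator"  # Would create the strategy and run training.
    elif task_type in ("worker", "ps"):
        role = "server"       # Would start a server and wait for requests.
    else:
        raise ValueError(f"Unexpected task type: {task_type!r}")
    return role, address


example_config = json.dumps({
    "cluster": {
        "chief": ["chief.example.com:2222"],
        "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
        "worker": ["worker0.example.com:2222"],
    },
    "task": {"type": "ps", "index": 1},
})
print(describe_task(example_config))
```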

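Alternatively, you can start a number of TensorFlow servers in advance and connect to them later. The coordinator can be in the same cluster or on any machine that has connectivity to the workers and parameter servers. This is covered in our guide and tutorial.

Variable creation with strategy.scope()

tf.distribute.experimental.ParameterServerStrategy follows the tf.distribute API contract, in which variables are expected to be created inside the context manager returned by strategy.scope() so that they are correctly placed on parameter servers in a round-robin manner: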

# In this example, we're assuming having 3 ps.
strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver=...)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(
    strategy=strategy)

# Variables should be created inside scope to be placed on parameter servers.
# If created outside scope such as `v1` here, it would be placed on the
# coordinator.
v1 = tf.Variable(initial_value=0.0)

with strategy.scope():
  v2 = tf.Variable(initial_value=1.0)
  v3 = tf.Variable(initial_value=2.0)
  v4 = tf.Variable(initial_value=3.0)
  v5 = tf.Variable(initial_value=4.0)

# v2 through v5 are created in scope and are distributed on parameter servers.
# Default placement is round-robin but the order should not be relied on.
assert v2.device == "/job:ps/replica:0/task:0/device:CPU:0"
assert v3.device == "/job:ps/replica:0/task:1/device:CPU:0"
assert v4.device == "/job:ps/replica:0/task:2/device:CPU:0"
assert v5.device == "/job:ps/replica:0/task:0/device:CPU:0"

See distribute.Strategy.scope for more information.
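
The round-robin placement shown in the assertions above can be mimicked in plain Python. This is only a sketch of the default pattern, which the text says should not be relied on; the helper name `round_robin_devices` is hypothetical, and the device string format is copied from the example.

```python
from itertools import cycle


def round_robin_devices(num_variables, num_ps):
    """Assign variables to parameter server tasks in creation order, cycling over the ps."""
    ps_tasks = cycle(range(num_ps))
    return [
        f"/job:ps/replica:0/task:{next(ps_tasks)}/device:CPU:0"
        for _ in range(num_variables)
    ]


# Four variables (v2 through v5) on three parameter servers.
for device in round_robin_devices(4, 3):
    print(device)
```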

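Variable partitioning

Having dedicated servers to store variables makes it possible to divide up, or "shard", the variables across the ps. Partitioning large variables among the ps is a commonly used technique to boost training throughput and mitigate memory constraints. It enables parallel computation and updates on different shards of a variable, and often yields better load balancing across parameter servers. Without sharding, models with variables too large to fit in one machine's memory (e.g. embeddings) could not be trained.

With tf.distribute.experimental.ParameterServerStrategy, if a variable_partitioner is provided to __init__ and certain conditions are satisfied, the variables created in scope are sharded across the parameter servers in a round-robin fashion. The variable reference returned from tf.Variable becomes a type that serves as the container of the sharded variables. The variables attribute of this container gives access to the actual variable components. If the model is built with tf.Module or Keras, the variable components are collected in the variables-like attributes.

It is recommended to use a size-based partitioner such as tf.distribute.experimental.partitioners.MinSizePartitioner to avoid partitioning small variables, which could have a negative impact on model training speed.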

# Partition the embedding layer into 2 shards.
variable_partitioner = (
  tf.distribute.experimental.partitioners.MinSizePartitioner(
    min_shard_bytes=(256 << 10),
    max_shards = 2))
strategy = tf.distribute.experimental.ParameterServerStrategy(
  cluster_resolver=...,
  variable_partitioner = variable_partitioner)
with strategy.scope():
  embedding = tf.keras.layers.Embedding(input_dim=1024, output_dim=1024)
assert len(embedding.variables) == 2
assert isinstance(embedding.variables[0], tf.Variable)
assert isinstance(embedding.variables[1], tf.Variable)
assert embedding.variables[0].shape == (512, 1024)
assert embedding.variables[1].shape == (512, 1024)
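
A rough way to see why the embedding above ends up in 2 shards is to estimate the shard count from the variable size. The sketch below is a simplified model of a size-based partitioner, not TensorFlow's actual code: it assumes the count is limited by `max_shards`, by the length of the first axis, and by how many shards of at least `min_shard_bytes` fit in the variable. The helper name `estimate_num_shards` is hypothetical.

```python
def estimate_num_shards(shape, bytes_per_element, min_shard_bytes, max_shards):
    """Estimate how many shards a size-based partitioner would create along axis 0."""
    total_bytes = bytes_per_element
    for dim in shape:
        total_bytes *= dim
    # Number of shards that can each hold at least `min_shard_bytes`.
    by_size = total_bytes // min_shard_bytes
    return max(1, min(shape[0], max_shards, by_size))


# A 1024 x 1024 float32 embedding (4 bytes per element), as in the example.
print(estimate_num_shards((1024, 1024), 4, 256 << 10, 2))
# A tiny variable is below the minimum shard size and is left unpartitioned.
print(estimate_num_shards((16, 8), 4, 256 << 10, 2))
```

Under this model the 4 MiB embedding could be cut into up to 16 shards of 256 KiB, so `max_shards=2` is the binding limit, giving two shards of 512 rows each.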

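The sharded variable container can be converted to a Tensor via tf.convert_to_tensor. This means the container can be used directly in most Python ops, where such a Tensor conversion happens automatically. For example, an expression such as x * w, where w is a sharded variable container, implicitly applies this tensor conversion. Note that the conversion can be expensive, because the variable components need to be transferred from multiple parameter servers to where the value is used.

tf.nn.embedding_lookup, on the other hand, does not apply the tensor conversion and instead performs parallel lookups on the variable components. This is crucial for scaling up embedding lookups when the embedding table variable is large.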
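
To illustrate why a lookup does not need the full table, the sketch below maps a row id to the shard that holds it and the row's offset within that shard, assuming the div partitioning described earlier. It is a conceptual illustration only and does not reflect the internals of tf.nn.embedding_lookup; the helper name `locate_id` is hypothetical.

```python
def locate_id(row_id, num_ids, num_shards):
    """Return (shard_index, offset_in_shard) for `row_id` under div partitioning."""
    if not 0 <= row_id < num_ids:
        raise IndexError(f"row_id {row_id} out of range for {num_ids} ids")
    base, extra = divmod(num_ids, num_shards)
    # The first `extra` shards hold `base + 1` ids each; ids below `boundary`
    # live in those larger shards.
    boundary = extra * (base + 1)
    if row_id < boundary:
        return row_id // (base + 1), row_id % (base + 1)
    rest = row_id - boundary
    return extra + rest // base, rest % base


# 13 ids over 5 shards: id 10 is the second entry of shard 3, i.e. [9, 10].
print(locate_id(10, 13, 5))
# A 1024-row embedding in 2 shards: row 700 is row 188 of the second shard.
print(locate_id(700, 1024, 2))
```

Each requested id touches only one shard, so lookups for different ids can be served by different parameter servers in parallel.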

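When a partitioned variable is saved to a SavedModel, it is saved as if it were one single variable. This improves serving efficiency by eliminating a number of ops that handle the partitioning aspects.

Known limitations of variable partitioning:

- The number of partitions must not change between checkpoint saving and loading.
- After partitioned variables are saved to a SavedModel, the SavedModel cannot be loaded via tf.saved_model.load.
- A partitioned variable does not work directly with tf.GradientTape; use the variables attribute to get the actual variable components and use them in the gradient APIs instead.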

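Dataset preparation

With tf.distribute.experimental.ParameterServerStrategy, a dataset is created on each worker to be used for training. This is done by creating a dataset_fn that takes no arguments and returns a tf.data.Dataset, and passing the dataset_fn to tf.distribute.experimental.coordinator.ClusterCoordinator.create_per_worker_dataset. We recommend shuffling and repeating the dataset so that the examples run through training as evenly as possible.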

def dataset_fn():
  filenames = ...
  dataset = tf.data.Dataset.from_tensor_slices(filenames)

  # Dataset is recommended to be shuffled, and repeated.
  return dataset.shuffle(buffer_size=...).repeat().batch(batch_size=...)

coordinator = (
    tf.distribute.experimental.coordinator.ClusterCoordinator(strategy=...))
distributed_dataset = coordinator.create_per_worker_dataset(dataset_fn)

Limitations

- tf.distribute.experimental.ParameterServerStrategy in TF2 is experimental, and the API is subject to further changes.
- When using Model.fit, tf.distribute.experimental.ParameterServerStrategy must be used with a tf.keras.utils.experimental.DatasetCreator, and steps_per_epoch must be specified.


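Note: This article was selected and compiled by 纯净天空 from the original English work tf.distribute.experimental.ParameterServerStrategy on tensorflow.org. Unless otherwise stated, copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.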