Python tf.distribute.experimental.ParameterServerStrategy: usage and code examples

A multi-worker tf.distribute strategy with parameter servers.

Inherits from: Strategy

Usage

tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver, variable_partitioner=None
)

Args

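cluster_resolver a tf.distribute.cluster_resolver.ClusterResolver object.
variable_partitioner
a distribute.experimental.partitioners.Partitioner that specifies how to partition variables. If None, variables will not be partitioned.
- Predefined partitioners in tf.distribute.experimental.partitioners can be used for this argument. A commonly used partitioner is MinSizePartitioner(min_shard_bytes = 256 << 10, max_shards = num_ps), which allocates at least 256K per shard, with each ps getting at most one shard.
- variable_partitioner is called for each variable created under the strategy scope to determine how that variable should be partitioned. Variables that have only one partition along the partitioning axis (that is, no partitioning is needed) are created as a normal tf.Variable.
- Only partitioning along the first (outermost) axis is supported.
- The div partition strategy is used to partition variables. Consecutive integer ids are assigned along the first axis of a variable, and the ids are then assigned to shards contiguously while keeping the shard sizes as equal as possible. If the number of ids is not evenly divisible by the number of shards, each of the first few shards is assigned one extra id. For example, a variable whose first dimension is 13 has 13 ids, and they are split across 5 shards as: [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]].
- Variables created under strategy.extended.colocate_vars_with will not be partitioned.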
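
The div assignment described above can be sketched in plain Python. This is only an illustration of the rule as stated in this section, not TensorFlow's implementation; the helper name `div_partition` is hypothetical.

```python
def div_partition(num_ids, num_shards):
    """Split ids 0..num_ids-1 into contiguous shards of near-equal size."""
    base, extra = divmod(num_ids, num_shards)
    shards = []
    start = 0
    for shard_index in range(num_shards):
        # The first `extra` shards each receive one additional id.
        size = base + (1 if shard_index < extra else 0)
        shards.append(list(range(start, start + size)))
        start += size
    return shards


shards = div_partition(13, 5)
print(shards)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]]
```

With 13 ids and 5 shards, the first three shards hold three ids each and the last two hold two each, matching the example in the text.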

Attributes

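cluster_resolver
Returns the cluster resolver associated with this strategy.
In general, when using a multi-worker tf.distribute strategy such as tf.distribute.experimental.MultiWorkerMirroredStrategy or tf.distribute.TPUStrategy(), there is a tf.distribute.cluster_resolver.ClusterResolver associated with the strategy used, and that instance is returned by this property.

Strategies that intend to have an associated tf.distribute.cluster_resolver.ClusterResolver must set the relevant attribute or override this property; otherwise, None is returned by default. Those strategies should also document what this property returns.

Single-worker strategies usually do not have a tf.distribute.cluster_resolver.ClusterResolver, and in those cases this property returns None.

The tf.distribute.cluster_resolver.ClusterResolver may be useful when the user needs to access information such as the cluster spec, task type, or task id. For example,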
```
import json
import os

import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
  'cluster':{
      'worker':["localhost:12345", "localhost:23456"],
      'ps':["localhost:34567"]
  },
  'task':{'type':'worker', 'index':0}
})

# This implicitly uses TF_CONFIG for the cluster and current task info.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

...

if strategy.cluster_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. Since we set this
  # as a worker above, this block will run on this particular instance.
  pass
elif strategy.cluster_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. Since we
  # set this as a worker above, this block will not run on this particular
  # instance.
  pass
```
For more information, see the API docstring of tf.distribute.cluster_resolver.ClusterResolver.
extended tf.distribute.StrategyExtended with additional methods.
num_replicas_in_sync Returns the number of replicas over which gradients are aggregated.

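Parameter server training is a common data-parallel method for scaling up a machine learning model across multiple machines. A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and are read and updated by workers in each step. By default, workers read and update these variables independently, without synchronizing with each other. This configuration is known as asynchronous training.

In TensorFlow 2, we recommend an architecture based on central coordination for parameter server training. Each worker and parameter server runs a tf.distribute.Server, and on top of that, a coordinator task is responsible for creating resources on workers and parameter servers, dispatching functions, and coordinating the training. The coordinator uses a tf.distribute.experimental.coordinator.ClusterCoordinator to coordinate the cluster, and a tf.distribute.experimental.ParameterServerStrategy to define variables on parameter servers and computation on workers.

For the training to work, the coordinator dispatches tf.functions to be executed on remote workers. Upon receiving a request from the coordinator, a worker executes the tf.function by reading the variables from parameter servers, executing the ops, and updating the variables on the parameter servers. Each worker only processes requests from the coordinator and communicates with parameter servers, without interacting directly with other workers in the cluster.

As a result, failures of some workers do not prevent the cluster from continuing its work, which allows the cluster to train with instances that can be occasionally unavailable (for example, preemptible or spot instances). The coordinator and parameter servers, however, must be available at all times for the cluster to make progress.

Note that the coordinator is not one of the training workers. Instead, it creates resources such as variables and datasets, dispatches tf.functions, saves checkpoints, and so on. In addition to workers, parameter servers, and the coordinator, an optional evaluator can be run on the side; it periodically reads the checkpoints saved by the coordinator and runs evaluations against each checkpoint.

ParameterServerStrategy is supported with two training APIs: Custom Training Loop (CTL) and the Keras Training API, also known as Model.fit. CTL is recommended when users prefer to define the details of their training loop, and Model.fit is recommended when users prefer a high-level abstraction that handles training for them.

When using a CTL, ParameterServerStrategy has to work in conjunction with a tf.distribute.experimental.coordinator.ClusterCoordinator object.

When using Model.fit, currently only the tf.keras.utils.experimental.DatasetCreator input type is supported.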

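Example code for the coordinator

This section provides code snippets that are intended to be run on the one (and only) task designated as the coordinator. Note that the cluster_resolver, variable_partitioner, and dataset_fn arguments are explained in the "Cluster setup", "Variable partitioning", and "Dataset preparation" sections below.

With a CTL,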

# Prepare a strategy to use with the cluster and variable partitioning info.
strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver=...,
    variable_partitioner=...)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(
    strategy=strategy)

# Prepare a distribute dataset that will place datasets on the workers.
distributed_dataset = coordinator.create_per_worker_dataset(dataset_fn=...)

with strategy.scope():
  model = ...
  optimizer, metrics = ...  # Keras optimizer/metrics are great choices
  checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
  checkpoint_manager = tf.train.CheckpointManager(
      checkpoint, checkpoint_dir, max_to_keep=2)
  # `load_checkpoint` infers initial epoch from `optimizer.iterations`.
  initial_epoch = load_checkpoint(checkpoint_manager) or 0

@tf.function
def worker_fn(iterator):

  def replica_fn(inputs):
    batch_data, labels = inputs
    # calculate gradient, applying gradient, metrics update etc.

  strategy.run(replica_fn, args=(next(iterator),))

for epoch in range(initial_epoch, num_epoch):
  distributed_iterator = iter(distributed_dataset)  # Reset iterator state.
  for step in range(steps_per_epoch):

    # Asynchronously schedule the `worker_fn` to be executed on an arbitrary
    # worker. This call returns immediately.
    coordinator.schedule(worker_fn, args=(distributed_iterator,))

  # `join` blocks until all scheduled `worker_fn`s finish execution. Once it
  # returns, we can read the metrics and save checkpoints as needed.
  coordinator.join()
  logging.info('Metric result:%r', metrics.result())
  metrics.reset_states()
  checkpoint_manager.save()

With Model.fit,

# Prepare a strategy to use with the cluster and variable partitioning info.
strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver=...,
    variable_partitioner=...)

# A dataset function takes an `input_context` and returns a `Dataset`.
def dataset_fn(input_context):
  dataset = tf.data.Dataset.from_tensors(...)
  return dataset.repeat().shard(...).batch(...).prefetch(...)

# With `Model.fit`, a `DatasetCreator` needs to be used.
input = tf.keras.utils.experimental.DatasetCreator(dataset_fn=...)

with strategy.scope():
  model = ...  # Make sure the `Model` is created within scope.
model.compile(optimizer="rmsprop", loss="mse", steps_per_execution=..., ...)

# Optional callbacks to checkpoint the model, back up the progress, etc.
callbacks = [tf.keras.callbacks.ModelCheckpoint(...), ...]

# `steps_per_epoch` is required with `ParameterServerStrategy`.
model.fit(input, epochs=..., steps_per_epoch=..., callbacks=callbacks)

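Example code for workers and parameter servers

In addition to the coordinator, there should be tasks designated as "worker" or "ps". They should run the following code to start a TensorFlow server and wait for the coordinator's requests: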

# Provide a `tf.distribute.cluster_resolver.ClusterResolver` that serves
# the cluster information. See below "Cluster setup" section.
cluster_resolver = ...

server = tf.distribute.Server(
    cluster_resolver.cluster_spec(),
    job_name=cluster_resolver.task_type,
    task_index=cluster_resolver.task_id,
    protocol="grpc")

# Blocking the process that starts a server from exiting.
server.join()

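Cluster setup

For the tasks in the cluster to know the addresses of the other tasks, a tf.distribute.cluster_resolver.ClusterResolver must be used in the coordinator, workers, and ps. The tf.distribute.cluster_resolver.ClusterResolver is responsible for providing the cluster information, as well as the task type and id of the current task. See tf.distribute.cluster_resolver.ClusterResolver for more information.

If the TF_CONFIG environment variable is set, a tf.distribute.cluster_resolver.TFConfigClusterResolver should be used as well.

Because tf.distribute.experimental.ParameterServerStrategy makes assumptions about the naming of task types, "chief", "ps", and "worker" should be used in the tf.distribute.cluster_resolver.ClusterResolver to refer to the coordinator, parameter servers, and workers, respectively.

The following example demonstrates setting TF_CONFIG for the task designated as a parameter server (task type "ps") with index 1 (the second task), in a cluster with 1 chief, 2 parameter servers, and 3 workers. Note that it needs to be set before tf.distribute.cluster_resolver.TFConfigClusterResolver is used.

Example code for cluster setup: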

os.environ['TF_CONFIG'] = '''
{
  "cluster":{
    "chief":["chief.example.com:2222"],
    "ps":["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker":["worker0.example.com:2222", "worker1.example.com:2222",
               "worker2.example.com:2222"]
  },
  "task":{
    "type":"ps",
    "index":1
  }
}
'''
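
As a complementary sketch, the same TF_CONFIG value can be built from a Python dictionary with the standard json module and read back to check the current task's own address. The helper `make_tf_config` is hypothetical and is not part of TensorFlow; the host names are the illustrative ones from the example above.

```python
import json
import os


def make_tf_config(cluster, task_type, task_index):
    """Serialize a TF_CONFIG value after checking the task exists in the cluster."""
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index %d out of range" % task_index)
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


cluster = {
    "chief": ["chief.example.com:2222"],
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222",
               "worker2.example.com:2222"],
}
os.environ["TF_CONFIG"] = make_tf_config(cluster, "ps", 1)

# Read the value back and look up the address of the current task.
parsed = json.loads(os.environ["TF_CONFIG"])
task = parsed["task"]
own_address = parsed["cluster"][task["type"]][task["index"]]
print(own_address)  # ps1.example.com:2222
```

Building the value from a dictionary avoids hand-written JSON and catches a mismatched task type or index before any server is started.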

If you prefer to run the same binary for all tasks, the binary needs to branch into the different roles at the beginning of the program:

# If coordinator, create a strategy and start the training program.
if cluster_resolver.task_type == 'chief':
  strategy = tf.distribute.experimental.ParameterServerStrategy(
      cluster_resolver)
  ...

# If worker/ps, create a server
elif cluster_resolver.task_type in ("worker", "ps"):
  server = tf.distribute.Server(...)
  ...

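Alternatively, you can start a set of TensorFlow servers in advance and connect to them later. The coordinator can be in the same cluster or on any machine that has connectivity to the workers and parameter servers. This is covered in our guide and tutorial.

Variable creation with strategy.scope()

tf.distribute.experimental.ParameterServerStrategy follows the tf.distribute API contract, under which variables are expected to be created inside the context manager returned by strategy.scope() so that they are correctly placed on parameter servers in a round-robin manner: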

# This example assumes a cluster with 3 ps.
strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver=...)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(
    strategy=strategy)

# Variables should be created inside scope to be placed on parameter servers.
# If created outside scope such as `v1` here, it would be placed on the
# coordinator.
v1 = tf.Variable(initial_value=0.0)

with strategy.scope():
  v2 = tf.Variable(initial_value=1.0)
  v3 = tf.Variable(initial_value=2.0)
  v4 = tf.Variable(initial_value=3.0)
  v5 = tf.Variable(initial_value=4.0)

# v2 through v5 are created in scope and are distributed on parameter servers.
# Default placement is round-robin but the order should not be relied on.
assert v2.device == "/job:ps/replica:0/task:0/device:CPU:0"
assert v3.device == "/job:ps/replica:0/task:1/device:CPU:0"
assert v4.device == "/job:ps/replica:0/task:2/device:CPU:0"
assert v5.device == "/job:ps/replica:0/task:0/device:CPU:0"
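
The placement asserted above can be mimicked in plain Python to show where the device strings come from. This is a sketch of a simple round-robin rule and not TensorFlow's placement logic; as the comment in the example says, the actual order should not be relied on. The helper name `round_robin_devices` is hypothetical.

```python
def round_robin_devices(num_variables, num_ps):
    """Return one device string per variable, cycling through the ps tasks."""
    return [
        "/job:ps/replica:0/task:%d/device:CPU:0" % (index % num_ps)
        for index in range(num_variables)
    ]


# Four variables (v2 through v5) on three parameter servers.
devices = round_robin_devices(4, 3)
for name, device in zip(["v2", "v3", "v4", "v5"], devices):
    print(name, device)
```

The fourth variable wraps around to task 0, which is why v5 shares a parameter server with v2 in the example.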

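See distribute.Strategy.scope for more information.

Variable partitioning

Having dedicated servers to store variables makes it possible to divide up, or "shard", the variables across the ps. Partitioning large variables among ps is a commonly used technique to boost training throughput and mitigate memory constraints. It enables parallel computations and updates on different shards of a variable, and often yields better load balancing across parameter servers. Without sharding, models with large variables (for example, embeddings) that cannot fit into one machine's memory could not be trained at all.

With tf.distribute.experimental.ParameterServerStrategy, if a variable_partitioner is provided to __init__ and certain conditions are satisfied, the resulting variables created in scope are sharded across the parameter servers in a round-robin fashion. The variable reference returned from tf.Variable becomes a type that serves as the container of the sharded variables. The variables attribute of this container gives access to the actual variable components. If the model is built with tf.Module or Keras, the variable components are collected in the variables and similar attributes.

It is recommended to use size-based partitioners such as tf.distribute.experimental.partitioners.MinSizePartitioner to avoid partitioning small variables, which could have a negative impact on model training speed.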

# Partition the embedding layer into 2 shards.
variable_partitioner = (
  tf.distribute.experimental.partitioners.MinSizePartitioner(
    min_shard_bytes=(256 << 10),
    max_shards = 2))
strategy = tf.distribute.experimental.ParameterServerStrategy(
  cluster_resolver=...,
  variable_partitioner = variable_partitioner)
with strategy.scope():
  embedding = tf.keras.layers.Embedding(input_dim=1024, output_dim=1024)
assert len(embedding.variables) == 2
assert isinstance(embedding.variables[0], tf.Variable)
assert isinstance(embedding.variables[1], tf.Variable)
assert embedding.variables[0].shape == (512, 1024)
assert embedding.variables[1].shape == (512, 1024)
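
To see why the embedding above ends up in two shards, the shard count can be estimated by hand. The sketch below is a back-of-the-envelope estimate based on the description in this document (shards of at least min_shard_bytes, at most max_shards shards, first axis only); TensorFlow's exact rounding may differ. The helper name `estimate_num_shards` is hypothetical, and 4-byte float32 elements are assumed.

```python
def estimate_num_shards(shape, bytes_per_element, min_shard_bytes, max_shards):
    """Estimate how many first-axis shards a size-based partitioner would use."""
    total_bytes = bytes_per_element
    for dim in shape:
        total_bytes *= dim
    # Largest shard count that keeps the average shard at or above
    # min_shard_bytes (never fewer than one shard).
    by_size = max(1, total_bytes // min_shard_bytes)
    # Cannot exceed max_shards or the number of rows along the first axis.
    return min(by_size, max_shards, shape[0])


# Assumption: float32 elements, 4 bytes each.
embedding_shards = estimate_num_shards(
    (1024, 1024), 4, min_shard_bytes=256 << 10, max_shards=2)
bias_shards = estimate_num_shards(
    (10,), 4, min_shard_bytes=256 << 10, max_shards=2)
print(embedding_shards, bias_shards)  # 2 1
```

The 4 MiB embedding table could be cut into up to 16 pieces of 256 KiB, so max_shards=2 is the binding limit and each shard holds 512 rows. The 40-byte variable stays in one piece, which is the case a size-based partitioner is meant to protect.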

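A sharded variable container can be converted to a Tensor via tf.convert_to_tensor. This means the container can be used directly in most Python Ops where such Tensor conversion happens automatically. For example, an expression such as x * self.w, where self.w is a sharded variable container, implicitly applies the tensor conversion. Note that this conversion can be expensive, because the variable components need to be transferred from multiple parameter servers to where the value is used.

tf.nn.embedding_lookup, on the other hand, does not apply the tensor conversion and instead performs parallel lookups on the variable components. This is crucial for scaling up embedding lookups when the embedding table variable is large.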
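
To illustrate what a lookup on components involves, the sketch below routes row ids to (shard, local row) pairs, assuming the div layout described under variable_partitioner. It is a plain-Python illustration and not the code path tf.nn.embedding_lookup takes; `locate_row` and `group_by_shard` are hypothetical helpers.

```python
from collections import defaultdict


def locate_row(row_id, num_rows, num_shards):
    """Map a global row id to (shard index, row offset within that shard)."""
    if not 0 <= row_id < num_rows:
        raise IndexError("row_id out of range")
    base, extra = divmod(num_rows, num_shards)
    # The first `extra` shards hold base + 1 rows and cover `boundary` rows.
    boundary = extra * (base + 1)
    if row_id < boundary:
        return row_id // (base + 1), row_id % (base + 1)
    offset = row_id - boundary
    return extra + offset // base, offset % base


def group_by_shard(row_ids, num_rows, num_shards):
    """Group local row offsets by the shard that owns them."""
    groups = defaultdict(list)
    for row_id in row_ids:
        shard, local = locate_row(row_id, num_rows, num_shards)
        groups[shard].append(local)
    return dict(groups)


groups = group_by_shard([3, 600, 1023], num_rows=1024, num_shards=2)
print(groups)  # {0: [3], 1: [88, 511]}
```

Each group can then be served by the parameter server holding that shard, so only the requested rows travel over the network rather than the whole table.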

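When a partitioned variable is saved to a SavedModel, it is saved as if it were one single variable. This improves serving efficiency by eliminating a number of Ops that handle the partitioning.

Known limitations of variable partitioning:

- The number of partitions must not change across Checkpoint saving/loading.
- After partitioned variables are saved to a SavedModel, the SavedModel cannot be loaded via tf.saved_model.load.
- A partitioned variable does not work directly with tf.GradientTape; use the variables attribute to get the actual variable components and use those in gradient APIs instead.

Dataset preparation

With tf.distribute.experimental.ParameterServerStrategy, a dataset is created on each worker to be used for training. This is done by creating a dataset_fn that takes no argument and returns a tf.data.Dataset, and passing the dataset_fn to tf.distribute.experimental.coordinator.ClusterCoordinator.create_per_worker_dataset. We recommend shuffling and repeating the dataset so that the examples run through training as evenly as possible.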

def dataset_fn():
  filenames = ...
  dataset = tf.data.Dataset.from_tensor_slices(filenames)

  # Dataset is recommended to be shuffled, and repeated.
  return dataset.shuffle(buffer_size=...).repeat().batch(batch_size=...)

coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(
    strategy=...)
distributed_dataset = coordinator.create_per_worker_dataset(dataset_fn)

Limitations

- tf.distribute.experimental.ParameterServerStrategy in TF2 is experimental, and the API is subject to further changes.
- When using Model.fit, tf.distribute.experimental.ParameterServerStrategy must be used with a tf.keras.utils.experimental.DatasetCreator, and steps_per_epoch must be specified.

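Note: This article was selected and compiled by 純淨天空 from the original English work tf.distribute.experimental.ParameterServerStrategy on tensorflow.org. Unless otherwise stated, copyright of the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.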