Python tf.distribute.cluster_resolver.SlurmClusterResolver: usage and code examples

ClusterResolver for systems with the Slurm workload manager.

Inherits from: ClusterResolver

Usage

tf.distribute.cluster_resolver.SlurmClusterResolver(
    jobs=None, port_base=8888, gpus_per_node=None, gpus_per_task=None,
    tasks_per_node=None, auto_set_gpu=True, rpc_layer='grpc'
)

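Args

jobs A dictionary with job names as keys and the number of tasks in each job as values. Defaults to as many 'worker' tasks as there are (Slurm) tasks.
port_base The first port number to use for processes on a node.
gpus_per_node Number of GPUs available on each node. Defaults to the number of GPUs reported by nvidia-smi.
gpus_per_task Number of GPUs to use for each task. Defaults to gpus_per_node divided evenly among tasks_per_node.
tasks_per_node Number of tasks running on each node. Can be an integer if the number of tasks per node is constant, or a dictionary mapping host names to the number of tasks on that node. If not set, the Slurm environment is queried for the correct mapping.
auto_set_gpu Automatically set the visible CUDA devices when the cluster is resolved, by setting the CUDA_VISIBLE_DEVICES environment variable. Defaults to True.
rpc_layer The protocol TensorFlow uses to communicate between nodes. Defaults to 'grpc'.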
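
To make the interaction between port_base, tasks_per_node, gpus_per_node, and gpus_per_task more concrete, here is a minimal pure-Python sketch of one plausible layout of task addresses and GPU ids. It does not import TensorFlow and should not be read as the library's implementation. The function name `sketch_task_layout` and the host names are hypothetical, and the sketch assumes that each task on a node takes the next port after port_base and the next contiguous block of GPU ids.

```python
def sketch_task_layout(tasks_per_node, port_base=8888,
                       gpus_per_node=4, gpus_per_task=None):
    """Return a list of (address, visible_gpus) pairs, one per task.

    tasks_per_node: dict mapping host name -> number of tasks on that host.
    Assumption (illustration only): tasks on a host get consecutive ports
    starting at port_base and consecutive, non-overlapping GPU id blocks.
    """
    max_tasks = max(tasks_per_node.values())
    if gpus_per_task is None:
        # Default described above: split the node's GPUs evenly over its tasks.
        gpus_per_task = gpus_per_node // max_tasks

    layout = []
    for host in sorted(tasks_per_node):
        for local_index in range(tasks_per_node[host]):
            address = f"{host}:{port_base + local_index}"
            first_gpu = local_index * gpus_per_task
            gpu_ids = range(first_gpu, first_gpu + gpus_per_task)
            # A task could export this string as CUDA_VISIBLE_DEVICES.
            layout.append((address, ",".join(str(g) for g in gpu_ids)))
    return layout


# Hypothetical allocation: two nodes, two tasks each, four GPUs per node.
layout = sketch_task_layout({"node1": 2, "node2": 2}, gpus_per_node=4)
for address, gpus in layout:
    print(address, "CUDA_VISIBLE_DEVICES=" + gpus)
```

With two tasks sharing four GPUs, each task in this sketch sees two devices, which mirrors the documented default of dividing gpus_per_node evenly among the tasks on a node.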

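Raises

RuntimeError If more GPUs per node are requested than are available, if more tasks are requested than Slurm assigned, or if resolving missing values from the environment fails.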
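
The two over-subscription conditions can be illustrated with a small stand-alone check. This is a sketch under stated assumptions, not the resolver's own validation code: `check_request` and its parameters are hypothetical names, and the error messages are invented for the example.

```python
def check_request(gpus_per_node, gpus_per_task, tasks_per_node,
                  requested_tasks, assigned_tasks):
    """Raise RuntimeError for the two over-subscription cases (hypothetical helper)."""
    if gpus_per_task * tasks_per_node > gpus_per_node:
        raise RuntimeError(
            f"Requested {gpus_per_task * tasks_per_node} GPUs per node, "
            f"but only {gpus_per_node} are available.")
    if requested_tasks > assigned_tasks:
        raise RuntimeError(
            f"Requested {requested_tasks} tasks, "
            f"but Slurm assigned only {assigned_tasks}.")


# A request that fits the allocation passes silently.
check_request(gpus_per_node=4, gpus_per_task=2, tasks_per_node=2,
              requested_tasks=4, assigned_tasks=4)

# Asking for more tasks than the allocation provides fails.
try:
    check_request(gpus_per_node=4, gpus_per_task=2, tasks_per_node=2,
                  requested_tasks=6, assigned_tasks=4)
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print(error_message)
```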

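Attributes

environment Returns the current environment in which TensorFlow is running.
There are two possible return values: "google" (when TensorFlow is running in a Google-internal environment) or an empty string (when TensorFlow is running elsewhere).

If you are implementing a ClusterResolver that works in both the Google environment and the open-source world (for example, a TPU ClusterResolver or similar), you will have to return the appropriate string depending on the environment, which you will have to detect.

Otherwise, if you are implementing a ClusterResolver that will only work in open-source TensorFlow, you do not need to implement this property.

task_id Returns the task id this ClusterResolver indicates.

In a TensorFlow distributed environment, each job may have an applicable task id, which is the index of the instance within its task type. This is useful when the user needs to run specific code according to task index. For example,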

import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["localhost:2222", "localhost:2223"],
    "worker": ["localhost:2224", "localhost:2225", "localhost:2226"]
})

# SimpleClusterResolver is used here for illustration; other cluster
# resolvers may be used for other sources of task type/id.
cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="worker", task_id=0)

...

if cluster_resolver.task_type == 'worker' and cluster_resolver.task_id == 0:
  # Perform something that's only applicable on 'worker' type, id 0. This
  # block will run on this particular instance since we've specified this
  # task to be a 'worker', id 0 in the cluster resolver above.
  pass
else:
  # Perform something that's only applicable on other ids. This block will
  # not run on this particular instance.
  pass

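If such information is not available or is not applicable in the current distributed environment (for example, when training with tf.distribute.cluster_resolver.TPUClusterResolver), this returns None.

For more information, see the class docstring of tf.distribute.cluster_resolver.ClusterResolver.

task_type Returns the task type this ClusterResolver indicates.
In a TensorFlow distributed environment, each job may have an applicable task type. Valid task types in TensorFlow include 'chief': a worker that is designated with more responsibility, 'worker': a regular worker for training/evaluation, 'ps': a parameter server, or 'evaluator': an evaluator that evaluates checkpoints for metrics.

For more information about the most commonly used 'chief' and 'worker' task types, see Multi-worker configuration.

Having access to such information is useful when the user needs to run specific code according to task type. For example,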
```
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["localhost:2222", "localhost:2223"],
    "worker": ["localhost:2224", "localhost:2225", "localhost:2226"]
})

# SimpleClusterResolver is used here for illustration; other cluster
# resolvers may be used for other sources of task type/id.
cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="worker", task_id=1)

...

if cluster_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. This block
  # will run on this particular instance since we've specified this task to
  # be a worker in the cluster resolver above.
  pass
elif cluster_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. This
  # block will not run on this particular instance.
  pass
```
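If such information is not available or is not applicable in the current distributed environment (for example, when training with tf.distribute.experimental.TPUStrategy), this returns None.

For more information, see the class docstring of tf.distribute.cluster_resolver.ClusterResolver.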
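
On a Slurm cluster the task type and task id have to be derived from something the scheduler provides. The sketch below assumes, purely for illustration, that each process knows a global rank (for example from Slurm's `SLURM_PROCID` environment variable) and that ranks are handed out to job names in sorted order. `sketch_task_for_rank` is a hypothetical helper, not a TensorFlow function.

```python
def sketch_task_for_rank(jobs, rank):
    """Map a global rank to (task_type, task_id).

    jobs: dict mapping job name -> number of tasks, shaped like the `jobs`
    argument. Assumption: ranks are assigned to job names in sorted order.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    offset = 0
    for task_type in sorted(jobs):
        num_tasks = jobs[task_type]
        if rank < offset + num_tasks:
            # The task id is the index of this process within its task type.
            return task_type, rank - offset
        offset += num_tasks
    raise ValueError(f"rank {rank} exceeds the {offset} tasks in jobs")


jobs = {"ps": 2, "worker": 3}
assignments = [sketch_task_for_rank(jobs, rank) for rank in range(5)]
print(assignments)
```

Under these assumptions ranks 0 and 1 become the two parameter servers and ranks 2 to 4 become workers 0 to 2, so a check such as `task_type == 'worker' and task_id == 0` would select exactly one process.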

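This is an implementation of ClusterResolver for Slurm clusters. It allows the specification of jobs and task counts, the number of tasks per node, the number of GPUs on each node, and the number of GPUs for each task. It retrieves system attributes through Slurm environment variables, resolves the allocated compute node names, constructs a cluster, and returns a ClusterResolver object that can be used for distributed TensorFlow.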
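
Slurm describes an allocation with compact strings, for example a node list such as `n[1-3,5]` and a tasks-per-node list such as `2(x3),1`. The sketch below shows one way such strings could be expanded into a host-to-task-count mapping of the kind accepted by tasks_per_node. Which environment variables the resolver actually reads is not stated on this page; `SLURM_STEP_NODELIST` and `SLURM_STEP_TASKS_PER_NODE` would be typical sources, but that is an assumption. Both helper functions are hypothetical and deliberately simplified.

```python
import re


def expand_tasks_per_node(spec):
    """Expand a string such as '2(x3),1' into [2, 2, 2, 1]."""
    counts = []
    for part in spec.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", part.strip())
        if match is None:
            raise ValueError(f"unrecognised tasks-per-node entry: {part!r}")
        repeat = int(match.group(2)) if match.group(2) else 1
        counts.extend([int(match.group(1))] * repeat)
    return counts


def expand_simple_hostlist(spec):
    """Expand 'n[1-3,5]' into ['n1', 'n2', 'n3', 'n5'].

    Simplification: handles a single bracket group only; zero-padded
    numbers and comma-separated lists of names are not supported.
    """
    match = re.fullmatch(r"([^\[\],]+)\[([\d,\-]+)\]", spec)
    if match is None:
        return [spec]  # treated as a plain host name such as 'n1'
    prefix, ranges = match.groups()
    hosts = []
    for item in ranges.split(","):
        first, _, last = item.partition("-")
        for number in range(int(first), int(last or first) + 1):
            hosts.append(f"{prefix}{number}")
    return hosts


# Literal strings stand in for values that would come from the environment.
hosts = expand_simple_hostlist("n[1-3,5]")
counts = expand_tasks_per_node("2(x3),1")
task_map = dict(zip(hosts, counts))
print(task_map)
```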


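Note: This article was selected and compiled by 純淨天空 from the original English work tf.distribute.cluster_resolver.SlurmClusterResolver on tensorflow.org. Unless otherwise stated, copyright of the original code belongs to the original author. Do not reproduce or copy this translation without permission or authorization.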