Python tf.distribute.cluster_resolver.KubernetesClusterResolver: Usage and Code Examples

ClusterResolver for Kubernetes.

Inherits From: ClusterResolver

Usage

tf.distribute.cluster_resolver.KubernetesClusterResolver(
    job_to_label_mapping=None, tf_server_port=8470, rpc_layer='grpc',
    override_client=None
)

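Args

job_to_label_mapping
A mapping from TensorFlow jobs to label selectors. This lets users specify multiple TensorFlow jobs in one Cluster Resolver, and each job can have pods that belong to different label selectors. For example, a sample mapping might be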
```
{'worker':['job-name=worker-cluster-a', 'job-name=worker-cluster-b'],
 'ps':['job-name=ps-1', 'job-name=ps-2']}
```
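tf_server_port The port the TensorFlow server is listening on.
rpc_layer (Optional) The RPC layer TensorFlow should use to communicate between tasks in Kubernetes. Defaults to 'grpc'.
override_client The Kubernetes client (usually retrieved automatically using from kubernetes import client as k8sclient). If you pass this in, you are responsible for setting up Kubernetes credentials manually.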
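
As a rough illustration of how tf_server_port and rpc_layer might shape the addresses a resolver hands out, the sketch below joins a pod IP with the port and optionally prefixes the RPC layer. The helper name format_task_address and the sample IP are hypothetical, and the exact address format the library produces is an assumption here.

```python
def format_task_address(pod_ip, tf_server_port=8470, rpc_layer=None):
    """Hypothetical helper: build 'ip:port', optionally prefixed by the RPC layer."""
    address = f"{pod_ip}:{tf_server_port}"
    if rpc_layer:
        # Assumed convention: '<rpc_layer>://<ip>:<port>', e.g. for a master URL.
        return f"{rpc_layer}://{address}"
    return address


plain = format_task_address("10.0.0.5")
master_url = format_task_address("10.0.0.5", rpc_layer="grpc")
print(plain)       # 10.0.0.5:8470
print(master_url)  # grpc://10.0.0.5:8470
```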

Raises

ImportError If the Kubernetes Python client is not installed and no override_client is passed in.
RuntimeError If autoresolve_task is not a boolean or a callable.
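
The ImportError rule above can be pictured with a small stand-alone sketch. The helper resolve_k8s_client and the FakeClient class are hypothetical names used only for illustration. CoreV1Api is a client class provided by the kubernetes package, but whether the resolver builds exactly this client internally is an assumption.

```python
import importlib.util


def resolve_k8s_client(override_client=None):
    """Hypothetical helper mirroring the documented ImportError rule."""
    if override_client is not None:
        # The caller supplied a client and is responsible for its credentials.
        return override_client
    if importlib.util.find_spec("kubernetes") is None:
        raise ImportError(
            "The Kubernetes Python client is not installed and no "
            "override_client was passed in.")
    from kubernetes import client as k8sclient
    return k8sclient.CoreV1Api()


class FakeClient:
    """Stand-in object used only to exercise the override path."""


fake = FakeClient()
chosen = resolve_k8s_client(override_client=fake)
print(chosen is fake)  # True
```

Passing a stand-in client like this is also a convenient way to exercise code in tests without a live cluster.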

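Attributes

environment Returns the current environment which TensorFlow is running in.
There are two possible return values: "google" (when TensorFlow is running in a Google-internal environment) or an empty string (when TensorFlow is running elsewhere).

If you are implementing a ClusterResolver that works in both the Google environment and the open-source world (for instance, a TPU ClusterResolver or similar), you will have to detect the environment and return the appropriate string.

Otherwise, if you are implementing a ClusterResolver that will only work in open-source TensorFlow, you do not need to implement this property.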

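task_id Returns the task id this ClusterResolver indicates.

In a TensorFlow distributed environment, each job may have an applicable task id, which is the index of the instance within its task type. This is useful when the user needs to run specific code according to the task index. For example,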

```
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["localhost:2222", "localhost:2223"],
    "worker": ["localhost:2224", "localhost:2225", "localhost:2226"]
})

# SimpleClusterResolver is used here for illustration; other cluster
# resolvers may be used for other sources of task type/id.
simple_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="worker", task_id=0)

...

if simple_resolver.task_type == 'worker' and simple_resolver.task_id == 0:
  # Perform something that's only applicable on 'worker' type, id 0. This
  # block will run on this particular instance since we've specified this
  # task to be a 'worker', id 0 in the above cluster resolver.
  pass
else:
  # Perform something that's only applicable on other ids. This block will
  # not run on this particular instance.
  pass
```

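Returns None if such information is not available or is not applicable in the current distributed environment, such as training with tf.distribute.cluster_resolver.TPUClusterResolver.

For more information, see the class docstring of tf.distribute.cluster_resolver.ClusterResolver.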
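
One common reason to branch on the task index is to give each task its own slice of the input. The sketch below is a minimal illustration of that idea in plain Python. The helper shard_for_task and the file names are hypothetical, and treating a task id of None as "process everything" is a design choice made here, not library behavior.

```python
def shard_for_task(items, task_id, num_tasks):
    """Hypothetical helper: every num_tasks-th item, starting at task_id."""
    if task_id is None:
        # No task id available (not applicable here): process everything.
        return list(items)
    return list(items)[task_id::num_tasks]


files = [f"part-{i}.tfrecord" for i in range(7)]
worker0 = shard_for_task(files, task_id=0, num_tasks=3)
worker2 = shard_for_task(files, task_id=2, num_tasks=3)
print(worker0)  # ['part-0.tfrecord', 'part-3.tfrecord', 'part-6.tfrecord']
print(worker2)  # ['part-2.tfrecord', 'part-5.tfrecord']
```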

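task_type
Returns the task type this ClusterResolver indicates.
In a TensorFlow distributed environment, each job may have an applicable task type. Valid task types in TensorFlow include 'chief': a worker that is designated with more responsibility, 'worker': a regular worker for training/evaluation, 'ps': a parameter server, or 'evaluator': an evaluator that evaluates the checkpoints for metrics.

See Multi-worker configuration for more information about the 'chief' and 'worker' task types, which are the most commonly used.

Having access to such information is useful when the user needs to run specific code according to task type. For example,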
```
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["localhost:2222", "localhost:2223"],
    "worker": ["localhost:2224", "localhost:2225", "localhost:2226"]
})

# SimpleClusterResolver is used here for illustration; other cluster
# resolvers may be used for other sources of task type/id.
simple_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="worker", task_id=1)

...

if simple_resolver.task_type == 'worker':
  # Perform something that's only applicable on workers. This block
  # will run on this particular instance since we've specified this task to
  # be a worker in the above cluster resolver.
  pass
elif simple_resolver.task_type == 'ps':
  # Perform something that's only applicable on parameter servers. This
  # block will not run on this particular instance.
  pass
```
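Returns None if such information is not available or is not applicable in the current distributed environment, such as training with tf.distribute.experimental.TPUStrategy.

For more information, see the class documentation of tf.distribute.cluster_resolver.ClusterResolver.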
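
A frequent use of the task type is deciding which task performs chief-only duties such as writing checkpoints. The following is a minimal sketch of that decision in plain Python. The helper is_chief and the sample addresses are hypothetical, and the fallback "worker 0 acts as chief when no 'chief' job exists" is an assumed convention rather than something this class enforces.

```python
def is_chief(task_type, task_id, cluster):
    """Hypothetical helper: should this task take on chief-only duties?"""
    if task_type is None:
        # No task information is available, so treat this process as chief.
        return True
    if task_type == "chief":
        return True
    # Assumed fallback: with no dedicated chief, worker 0 acts as chief.
    return task_type == "worker" and task_id == 0 and "chief" not in cluster


cluster = {"worker": ["10.0.0.1:8470", "10.0.0.2:8470"], "ps": ["10.0.0.3:8470"]}
print(is_chief("worker", 0, cluster))  # True
print(is_chief("worker", 1, cluster))  # False
print(is_chief("ps", 0, cluster))      # False
```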

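This is an implementation of a cluster resolver for Kubernetes. When given a Kubernetes namespace and label selectors for pods, it retrieves the pod IP addresses of all running pods matching the selectors and returns a ClusterSpec based on that information.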
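
The resolution step described above can be sketched without a cluster by using in-memory pod records. This is an illustration under stated assumptions, not the library's code: FakePod, matches, and build_cluster_dict are hypothetical names, only simple key=value selectors are handled, and in real use the Kubernetes API performs the label matching.

```python
from dataclasses import dataclass


@dataclass
class FakePod:
    """Hypothetical stand-in for a Kubernetes pod record."""
    name: str
    labels: dict
    phase: str
    ip: str


def matches(pod, selector):
    """Match a simple 'key=value' selector against a pod's labels."""
    key, value = selector.split("=", 1)
    return pod.labels.get(key) == value


def build_cluster_dict(pods, job_to_label_mapping, tf_server_port=8470):
    """Map each job to the 'ip:port' addresses of its running pods."""
    cluster = {}
    for job, selectors in job_to_label_mapping.items():
        addresses = []
        for selector in selectors:
            # Sort by name so the task order is stable from call to call.
            selected = sorted(
                (p for p in pods if matches(p, selector) and p.phase == "Running"),
                key=lambda p: p.name)
            addresses.extend(f"{p.ip}:{tf_server_port}" for p in selected)
        cluster[job] = addresses
    return cluster


pods = [
    FakePod("worker-b", {"job-name": "worker-cluster-a"}, "Running", "10.0.0.2"),
    FakePod("worker-a", {"job-name": "worker-cluster-a"}, "Running", "10.0.0.1"),
    FakePod("worker-c", {"job-name": "worker-cluster-a"}, "Pending", "10.0.0.9"),
    FakePod("ps-a", {"job-name": "ps-1"}, "Running", "10.0.0.3"),
]
mapping = {"worker": ["job-name=worker-cluster-a"], "ps": ["job-name=ps-1"]}
cluster = build_cluster_dict(pods, mapping)
print(cluster)
```

A dictionary of this shape is what tf.train.ClusterSpec accepts, as in the examples elsewhere on this page.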

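Note: it cannot retrieve task_type, task_id or rpc_layer. To use it with some distribution strategies, such as tf.distribute.experimental.MultiWorkerMirroredStrategy, you need to specify task_type and task_id by setting these attributes.

Usage example with tf.distribute.Strategy: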

```
import tensorflow as tf

KubernetesClusterResolver = (
    tf.distribute.cluster_resolver.KubernetesClusterResolver)

# On worker 0
cluster_resolver = KubernetesClusterResolver(
    {"worker": ["job-name=worker-cluster-a", "job-name=worker-cluster-b"]})
cluster_resolver.task_type = "worker"
cluster_resolver.task_id = 0
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    cluster_resolver=cluster_resolver)

# On worker 1
cluster_resolver = KubernetesClusterResolver(
    {"worker": ["job-name=worker-cluster-a", "job-name=worker-cluster-b"]})
cluster_resolver.task_type = "worker"
cluster_resolver.task_id = 1
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    cluster_resolver=cluster_resolver)
```
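
Since task_type and task_id must be set by hand, one option is to derive them from the pod's hostname instead of hardcoding them per worker. The sketch below assumes pods are named <task_type>-<index> (for example, by a StatefulSet named after the task type). That naming scheme and the helper task_from_hostname are assumptions for illustration. The index is only meaningful if it agrees with the position of the pod's address in the resolved ClusterSpec.

```python
import re


def task_from_hostname(hostname):
    """Hypothetical helper: split '<task_type>-<index>' into its parts."""
    match = re.fullmatch(r"(?P<task_type>[a-z]+)-(?P<index>\d+)", hostname)
    if match is None:
        raise ValueError(f"Unexpected pod hostname: {hostname!r}")
    return match.group("task_type"), int(match.group("index"))


task_type, task_id = task_from_hostname("worker-1")
print(task_type, task_id)  # worker 1
```

The two values could then be assigned to cluster_resolver.task_type and cluster_resolver.task_id. Inside a real pod, the hostname might come from socket.gethostname().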

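Note: This article was selected and compiled by 純淨天空 from the original English work tf.distribute.cluster_resolver.KubernetesClusterResolver on tensorflow.org. Unless otherwise stated, copyright of the original code belongs to the original authors. Please do not reprint or copy this translation without permission or authorization.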