Python tf.lookup.StaticVocabularyTable: usage and code examples

String to Id table that assigns out-of-vocabulary keys to hash buckets.

Inherits from: TrackableResource

Usage

tf.lookup.StaticVocabularyTable(
    initializer, num_oov_buckets, lookup_key_dtype=None, name=None,
    experimental_is_anonymous=False
)

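Args

initializer: A TableInitializerBase object that contains the data used to initialize the table. If None, only the out-of-vocabulary buckets are used.
num_oov_buckets: Number of buckets to use for out-of-vocabulary keys. Must be greater than zero.
lookup_key_dtype: Data type of the keys passed to lookup. Defaults to initializer.key_dtype if initializer is specified, otherwise tf.string. Must be string or integer, and must be castable to initializer.key_dtype.
name: A name for the operation (optional).
experimental_is_anonymous: Whether to use anonymous mode for the table (default is False). In anonymous mode, the table resource can only be accessed via a resource handle. It cannot be looked up by name. When all resource handles pointing to the resource are gone, the resource is deleted automatically.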
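
As a minimal sketch of the integer-key case described for lookup_key_dtype, the snippet below builds a table over int64 keys. The key values 10, 20, 30 and the out-of-vocabulary key 99 are made up for illustration, and lookup_key_dtype is passed explicitly only to make the default visible.

```python
import tensorflow as tf

# Hypothetical integer vocabulary: keys 10, 20, 30 map to ids 0, 1, 2.
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant([10, 20, 30], dtype=tf.int64),
    values=tf.constant([0, 1, 2], dtype=tf.int64))

# lookup_key_dtype equals initializer.key_dtype here, which is also the default.
table = tf.lookup.StaticVocabularyTable(
    init, num_oov_buckets=2, lookup_key_dtype=tf.int64)

# 10 and 30 are in the vocabulary; 99 is not and lands in an OOV bucket.
ids = table.lookup(tf.constant([10, 30, 99], dtype=tf.int64)).numpy()
print(ids)
```

The first two ids come from the table itself; the third should fall in the out-of-vocabulary range, which is 3 to 4 for a vocabulary of three keys and two buckets.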

Raises

ValueError: when num_oov_buckets is not positive.
TypeError: when lookup_key_dtype or initializer.key_dtype is not integer or string. Also when initializer.value_dtype != int64.
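
The two error conditions above can be probed with a short sketch like the following. The flag names raised_value_error and raised_type_error are illustrative only, and the keys are placeholder strings.

```python
import tensorflow as tf

# num_oov_buckets must be positive, so zero should be rejected.
try:
    tf.lookup.StaticVocabularyTable(None, num_oov_buckets=0)
    raised_value_error = False
except ValueError:
    raised_value_error = True

# Table values must be int64; this initializer deliberately uses int32.
bad_init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake"]),
    values=tf.constant([0, 1], dtype=tf.int32))
try:
    tf.lookup.StaticVocabularyTable(bad_init, num_oov_buckets=1)
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(raised_value_error, raised_type_error)
```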

Attributes

key_dtype: The table key dtype.
name: The name of the table.
resource_handle: Returns the resource handle associated with this resource.
value_dtype: The table value dtype.

For example, if an instance of StaticVocabularyTable is initialized with a string-to-id initializer that maps:

import tensorflow as tf

# Map three in-vocabulary terms to ids 0, 1, 2.
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(['emerson', 'lake', 'palmer']),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(
    init,
    num_oov_buckets=5)

The Vocabulary object will perform the following mapping:

emerson -> 0
lake -> 1
palmer -> 2
<other term> -> bucket_id, where bucket_id is between 3 and 3 + num_oov_buckets - 1 = 7, computed as: hash(<term>) % num_oov_buckets + vocab_size

If input_tensor is:

input_tensor = tf.constant(["emerson", "lake", "palmer",
                            "king", "crimson"])
table[input_tensor].numpy()
# array([0, 1, 2, 6, 7])
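
The mapping rule hash(<term>) % num_oov_buckets + vocab_size can be sketched in plain Python, without TensorFlow. Note the assumptions: stand_in_hash and term_to_id are hypothetical helpers, and the hash is a SHA-256-based stand-in rather than the hash TensorFlow uses, so the out-of-vocabulary ids it produces will generally differ from the real table's. Only the range of those ids carries over.

```python
import hashlib


def stand_in_hash(term: str) -> int:
    """Stand-in 64-bit hash (NOT the hash TensorFlow uses)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def term_to_id(term: str, vocab: dict, num_oov_buckets: int) -> int:
    """Return the vocabulary id, or an OOV bucket id placed after the vocabulary."""
    if term in vocab:
        return vocab[term]
    return stand_in_hash(term) % num_oov_buckets + len(vocab)


vocab = {"emerson": 0, "lake": 1, "palmer": 2}
terms = ["emerson", "lake", "palmer", "king", "crimson"]
ids = [term_to_id(t, vocab, num_oov_buckets=5) for t in terms]
print(ids)  # the first three ids are 0, 1, 2; the rest lie in 3..7
```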

If initializer is None, only out-of-vocabulary buckets are used.
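
A minimal sketch of the initializer-is-None case follows. With no vocabulary, every key is hashed into one of the num_oov_buckets buckets, so the ids should lie in 0 to num_oov_buckets - 1. The input terms are arbitrary, and lookup_key_dtype is spelled out explicitly even though tf.string is the documented default.

```python
import tensorflow as tf

num_oov_buckets = 4

# No initializer: the table consists only of hash buckets.
table = tf.lookup.StaticVocabularyTable(
    None, num_oov_buckets, lookup_key_dtype=tf.string)

ids = table.lookup(tf.constant(["emerson", "lake", "palmer", "king"])).numpy()
print(ids)  # each id is a bucket index in the range 0..3
```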

Example usage:

import tempfile

import tensorflow as tf

num_oov_buckets = 3
vocab = ["emerson", "lake", "palmer", "crimson"]

# Write the vocabulary to a temporary file, one term per line.
f = tempfile.NamedTemporaryFile(delete=False)
f.write('\n'.join(vocab).encode('utf-8'))
f.close()

# Each whole line is a key; its zero-based line number is the id.
init = tf.lookup.TextFileInitializer(
    f.name,
    key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
    value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER)
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
table.lookup(tf.constant(["palmer", "crimson", "king",
                          "tarkus", "black", "moon"])).numpy()
# array([2, 3, 5, 6, 6, 4])

The hash function used to generate out-of-vocabulary bucket IDs is Fingerprint64.
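
One way to inspect the hashing is to compare the table's out-of-vocabulary ids with tf.strings.to_hash_bucket_fast, which also hashes strings into a fixed number of buckets. That the two agree is an assumption about the current implementation rather than something this page guarantees, so treat the sketch below as a check you can run, not as a specification. The variable names are illustrative.

```python
import tensorflow as tf

vocab_size = 3
num_oov_buckets = 5

init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake", "palmer"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

# Terms that are not in the vocabulary.
oov_terms = tf.constant(["king", "crimson"])
table_ids = table.lookup(oov_terms).numpy()

# Assumed equivalent: hash into buckets, then shift past the vocabulary ids.
manual_ids = (
    tf.strings.to_hash_bucket_fast(oov_terms, num_oov_buckets) + vocab_size
).numpy()

print(table_ids, manual_ids)
```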

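Note that out-of-vocabulary bucket IDs always range from the table size up to size + num_oov_buckets - 1, regardless of the table values, which could cause unexpected collisions: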

# Values start at 1, so the largest value (3) equals the table size.
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake", "palmer"]),
    values=tf.constant([1, 2, 3], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(
    init,
    num_oov_buckets=1)
input_tensor = tf.constant(["emerson", "lake", "palmer", "king"])
table[input_tensor].numpy()
# array([1, 2, 3, 3])
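
A possible way to guard against this kind of collision is to check whether any table value falls inside the out-of-vocabulary range, and to remap the values to 0 through size - 1 if so. This is a sketch of one approach, not a recommendation from the original documentation; the names colliding and safe_values are illustrative.

```python
import tensorflow as tf

keys = tf.constant(["emerson", "lake", "palmer"])
values = tf.constant([1, 2, 3], dtype=tf.int64)
num_oov_buckets = 1

# OOV ids occupy [size, size + num_oov_buckets - 1].
size = int(keys.shape[0])
oov_low, oov_high = size, size + num_oov_buckets - 1
colliding = [int(v) for v in values.numpy() if oov_low <= int(v) <= oov_high]
print(colliding)  # [3]: the value for "palmer" overlaps the OOV range

# Remap in-vocabulary ids to 0..size-1 so they cannot overlap the OOV range.
safe_values = tf.range(size, dtype=tf.int64)
table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(keys, safe_values), num_oov_buckets)
ids = table.lookup(
    tf.constant(["emerson", "lake", "palmer", "king"])).numpy()
print(ids)  # [0 1 2 3]: "king" gets the only OOV id, 3, with no collision
```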

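Note: This article was curated by 純淨天空 from the original English work tf.lookup.StaticVocabularyTable by the authors at tensorflow.org. Unless otherwise stated, copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.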