Python tf.lookup.StaticVocabularyTable usage and code examples

String to Id table that assigns out-of-vocabulary keys to hash buckets.

Inherits From: TrackableResource

Usage

tf.lookup.StaticVocabularyTable(
    initializer, num_oov_buckets, lookup_key_dtype=None, name=None,
    experimental_is_anonymous=False
)

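Args

initializer A TableInitializerBase object that contains the data used to initialize the table. If None, only out-of-vocabulary buckets are used.
num_oov_buckets Number of buckets to use for out-of-vocabulary keys. Must be greater than zero.
lookup_key_dtype Data type of the keys passed to lookup. Defaults to initializer.key_dtype if initializer is specified, otherwise tf.string. Must be string or integer, and must be castable to initializer.key_dtype.
name A name for the operation (optional).
experimental_is_anonymous Whether to use anonymous mode for the table (default is False). In anonymous mode, the table resource can only be accessed via a resource handle; it cannot be looked up by name. When all resource handles pointing to the resource are gone, the resource is deleted automatically.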

Raises

ValueError When num_oov_buckets is not positive.
TypeError When lookup_key_dtype or initializer.key_dtype is not integer or string. Also when initializer.value_dtype != int64.
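
The two error conditions above can be triggered with a short sketch like the following. It assumes TensorFlow 2.x running in eager mode; the variable names `zero_buckets_error` and `bad_value_dtype_error` are only illustrative.

```python
import tensorflow as tf

keys = tf.constant(["emerson", "lake", "palmer"])

# num_oov_buckets must be positive, so 0 is rejected with a ValueError.
init = tf.lookup.KeyValueTensorInitializer(
    keys=keys, values=tf.constant([0, 1, 2], dtype=tf.int64))
zero_buckets_error = None
try:
    tf.lookup.StaticVocabularyTable(init, num_oov_buckets=0)
except ValueError as e:
    zero_buckets_error = e

# The initializer's value_dtype must be int64, so int32 values are
# rejected with a TypeError.
bad_init = tf.lookup.KeyValueTensorInitializer(
    keys=keys, values=tf.constant([0, 1, 2], dtype=tf.int32))
bad_value_dtype_error = None
try:
    tf.lookup.StaticVocabularyTable(bad_init, num_oov_buckets=1)
except TypeError as e:
    bad_value_dtype_error = e

print(type(zero_buckets_error).__name__)
print(type(bad_value_dtype_error).__name__)
```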

Attributes

key_dtype The table key dtype.
name The name of the table.
resource_handle Returns the resource handle associated with this resource.
value_dtype The table value dtype.

For example, if an instance of StaticVocabularyTable is initialized with a string-to-id initializer that maps:

import tensorflow as tf

# Map three in-vocabulary strings to the IDs 0, 1 and 2.
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(['emerson', 'lake', 'palmer']),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(
    init,
    num_oov_buckets=5)

The Vocabulary object will perform the following mapping:

emerson -> 0
lake -> 1
palmer -> 2
<other term> -> bucket_id, where bucket_id will be between 3 and 3 + num_oov_buckets - 1 = 7, calculated by: hash(<term>) % num_oov_buckets + vocab_size

If input_tensor is:

input_tensor = tf.constant(["emerson", "lake", "palmer",
                            "king", "crimson"])
table[input_tensor].numpy()
# array([0, 1, 2, 6, 7])
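
The bucket formula used in this mapping can be sketched in plain Python. This is a minimal illustration only: `stable_hash` is a hypothetical stand-in built on `hashlib`, not TensorFlow's hash function, so the out-of-vocabulary IDs it yields will generally differ from the ones TensorFlow returns. Only the structure of the computation and the resulting ID range carry over.

```python
import hashlib

vocab = {"emerson": 0, "lake": 1, "palmer": 2}
num_oov_buckets = 5


def stable_hash(term: str) -> int:
    """Hypothetical stand-in hash (NOT TensorFlow's hash function)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def lookup(term: str) -> int:
    """Return the vocabulary ID, or an OOV bucket ID for unknown terms."""
    if term in vocab:
        return vocab[term]
    # hash(<term>) % num_oov_buckets + vocab_size
    return stable_hash(term) % num_oov_buckets + len(vocab)


ids = [lookup(t) for t in ["emerson", "lake", "palmer", "king", "crimson"]]
print(ids)  # the first three are 0, 1, 2; the last two fall in [3, 7]
```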

If initializer is None, only out-of-vocabulary buckets are used.
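
A minimal sketch of the initializer-less case, assuming TensorFlow 2.x in eager mode. With no vocabulary, every key is hashed directly into one of the buckets, so the IDs start at 0; the exact IDs depend on the hash and are not shown here.

```python
import tensorflow as tf

# With no initializer there is no vocabulary, so every key is hashed
# straight into one of the num_oov_buckets buckets.
num_oov_buckets = 4
table = tf.lookup.StaticVocabularyTable(None, num_oov_buckets=num_oov_buckets)
ids = table.lookup(tf.constant(["emerson", "king", "crimson"])).numpy()
print(ids)  # each ID lies in [0, num_oov_buckets - 1]
```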

Example usage:

import tempfile

num_oov_buckets = 3
vocab = ["emerson", "lake", "palmer", "crimnson"]

# Write the vocabulary to a temporary file, one term per line.
f = tempfile.NamedTemporaryFile(delete=False)
f.write('\n'.join(vocab).encode('utf-8'))
f.close()

# Each whole line is a key; its zero-based line number is the ID.
init = tf.lookup.TextFileInitializer(
    f.name,
    key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
    value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER)
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
table.lookup(tf.constant(["palmer", "crimnson", "king",
                          "tarkus", "black", "moon"])).numpy()
# array([2, 3, 5, 6, 6, 4])

The hash function used for generating out-of-vocabulary bucket IDs is Fingerprint64.

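Note that the out-of-vocabulary bucket IDs always range from the table size up to size + num_oov_buckets - 1, regardless of the table values, which could cause unexpected collisions: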

# The values start at 1, but the table size is still 3,
# so the single OOV bucket gets ID 3 and collides with "palmer".
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake", "palmer"]),
    values=tf.constant([1, 2, 3], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(
    init,
    num_oov_buckets=1)
input_tensor = tf.constant(["emerson", "lake", "palmer", "king"])
table[input_tensor].numpy()
# array([1, 2, 3, 3])
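
One way to guard against such collisions is to compare the initializer's values with the out-of-vocabulary range before building the table. The helper below is a hypothetical sketch in plain Python: `overlapping_ids` is not part of TensorFlow, and it assumes the vocabulary values are available as a Python list.

```python
def overlapping_ids(values, num_oov_buckets):
    """Return vocabulary IDs that also fall inside the OOV bucket range.

    The OOV range is [size, size + num_oov_buckets - 1], where size is the
    number of entries in the table, regardless of the values themselves.
    """
    size = len(values)
    oov_ids = range(size, size + num_oov_buckets)
    return sorted(v for v in values if v in oov_ids)


# Values from the example above: 3 entries, so the only OOV ID is 3.
collisions = overlapping_ids([1, 2, 3], num_oov_buckets=1)
print(collisions)  # [3]

# Zero-based contiguous IDs never overlap with the OOV range.
safe = overlapping_ids([0, 1, 2], num_oov_buckets=5)
print(safe)  # []
```

Using zero-based, contiguous IDs for the vocabulary, as in the first example on this page, keeps the two ranges disjoint.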

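Note: This article was compiled by 纯净天空 from the original English documentation for tf.lookup.StaticVocabularyTable on tensorflow.org. Unless otherwise stated, copyright in the original code belongs to the original authors; please do not reproduce or copy this translation without permission.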