Python tf.keras.layers.TextVectorization usage and code examples

A preprocessing layer that maps text features to integer sequences.

Inherits from: PreprocessingLayer, Layer, Module

Usage

tf.keras.layers.TextVectorization(
    max_tokens=None, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False, **kwargs
)

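Parameters

max_tokens Maximum size of the vocabulary for this layer. This should only be specified when adapting a vocabulary or when setting pad_to_max_tokens=True. Note that this vocabulary contains 1 OOV token, so the effective number of tokens is (max_tokens - 1 - (1 if output_mode == "int" else 0)).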
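standardize
Optional specification for standardization to apply to the input text. Values can be:
- None: No standardization.
- "lower_and_strip_punctuation": Text will be lowercased and all punctuation removed.
- "lower": Text will be lowercased.
- "strip_punctuation": All punctuation will be removed.
- Callable: Inputs will be passed to the callable, which should standardize them and return the result.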
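split
Optional specification for splitting the input text. Values can be:
- None: No splitting.
- "whitespace": Split on whitespace.
- "character": Split on each unicode character.
- Callable: Standardized inputs will be passed to the callable, which should split them and return the result.
ngrams Optional specification for ngrams to create from the possibly-split input text. Values can be None, an integer, or a tuple of integers; passing an integer will create ngrams up to that integer, and passing a tuple of integers will create ngrams for the specified values in the tuple. Passing None means that no ngrams will be created.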
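output_mode
Optional specification for the output of the layer. Values can be "int", "multi_hot", "count", or "tf_idf", configuring the layer as follows:
- "int": Outputs integer indices, one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the vocabulary size to max_tokens - 2 instead of max_tokens - 1.
- "multi_hot": Outputs a single int array per example (batch item), of either vocab_size or max_tokens size, containing 1s in all elements where the token mapped to that index exists at least once in the batch item.
- "count": Like "multi_hot", but the int array contains a count of the number of times the token at that index appeared in the batch item.
- "tf_idf": Like "multi_hot", but the TF-IDF algorithm is applied to find the value in each token slot.
For "int" output, any shape of input and output is supported. For all other output modes, currently only rank 1 inputs (and rank 2 outputs after splitting) are supported.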
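output_sequence_length Only valid in "int" mode. If set, the output will have its time dimension padded or truncated to exactly output_sequence_length values, resulting in a tensor of shape (batch_size, output_sequence_length) regardless of how many tokens resulted from the splitting step. Defaults to None.
pad_to_max_tokens Only valid in "multi_hot", "count", and "tf_idf" modes. If True, the output will have its feature axis padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens, resulting in a tensor of shape (batch_size, max_tokens) regardless of vocabulary size. Defaults to False.
vocabulary Optional. Either an array of strings or a string path to a text file. If passing an array, you can pass a tuple, list, 1D numpy array, or 1D tensor containing the string vocabulary terms. If passing a file path, the file should contain one line per term in the vocabulary. If this argument is set, there is no need to adapt() the layer.
idf_weights Only valid when output_mode is "tf_idf". A tuple, list, 1D numpy array, or 1D tensor of the same length as the vocabulary, containing the floating point inverse document frequency weights, which will be multiplied by per-sample term counts to obtain the final tf_idf weight. If the vocabulary argument is set and output_mode is "tf_idf", this argument must be supplied.
ragged Boolean. Only applicable to "int" output mode. If True, returns a RaggedTensor instead of a dense Tensor, where each sequence may have a different length after string splitting. Defaults to False.
sparse Boolean. Only applicable to "multi_hot", "count", and "tf_idf" output modes. If True, returns a SparseTensor instead of a dense Tensor. Defaults to False.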
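
To make the difference between the non-"int" output modes concrete, here is a minimal sketch that feeds the same two strings through a "count" layer, a "multi_hot" layer, and a padded "count" layer. The vocabulary, the input strings, and the value max_tokens=8 are illustrative assumptions, not values from the reference above.

```python
import tensorflow as tf

# Illustrative vocabulary and inputs (assumptions for this sketch).
vocab = ["earth", "wind", "and", "fire"]
text = tf.constant(["earth wind earth", "fire water"])

# In the non-"int" modes no mask token is reserved, so index 0 is the OOV
# slot and the feature axis has len(vocab) + 1 entries.
count_layer = tf.keras.layers.TextVectorization(
    output_mode="count", vocabulary=vocab)
multi_hot_layer = tf.keras.layers.TextVectorization(
    output_mode="multi_hot", vocabulary=vocab)

# With pad_to_max_tokens=True the feature axis is padded out to max_tokens.
padded_layer = tf.keras.layers.TextVectorization(
    max_tokens=8, pad_to_max_tokens=True,
    output_mode="count", vocabulary=vocab)

counts = count_layer(text).numpy()
multi_hot = multi_hot_layer(text).numpy()
padded = padded_layer(text).numpy()

print(counts)        # per-token occurrence counts, OOV in column 0
print(multi_hot)     # 1 where the token occurs at least once
print(padded.shape)  # feature axis padded to max_tokens
```

Here "water" is not in the vocabulary, so it lands in the OOV column rather than being dropped.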

Attributes

is_adapted Whether the layer has already been fit to data.

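This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one example = one string) into either a list of token indices (one example = 1D tensor of integer token indices) or a dense representation (one example = 1D tensor of float values representing data about the example's tokens). This layer is meant to handle natural language inputs. To handle simple string inputs (categorical strings or pre-tokenized strings), see tf.keras.layers.StringLookup.

The vocabulary for the layer must be either supplied on construction or learned via adapt(). When this layer is adapted, it will analyze the dataset, determine the frequency of individual string values, and create a vocabulary from them. This vocabulary can have unlimited size or be capped, depending on the configuration options for this layer; if there are more unique values in the input than the maximum vocabulary size, the most frequent terms will be used to create the vocabulary.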
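
As a small illustration of a capped vocabulary, the sketch below adapts a layer with max_tokens=4 on a tiny made-up corpus. In "int" mode two slots are reserved (mask and OOV), so only the two most frequent terms should survive; the corpus and the cap are assumptions chosen for the example.

```python
import tensorflow as tf

# Tiny illustrative corpus: "the" appears 3 times, "cat" twice, the rest once.
corpus = tf.data.Dataset.from_tensor_slices(
    ["the cat sat", "the cat ran", "the dog"]).batch(2)

# max_tokens=4 in "int" mode leaves room for 4 - 2 = 2 learned terms.
layer = tf.keras.layers.TextVectorization(max_tokens=4, output_mode="int")
layer.adapt(corpus)

vocab = layer.get_vocabulary()
print(vocab)

# Terms that did not make it into the vocabulary map to the OOV index 1.
encoded = layer(tf.constant(["the bird sat"])).numpy()
print(encoded)
```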

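The processing of each example contains the following steps:

1. Standardize each example (usually lowercasing + punctuation stripping)
2. Split each example into substrings (usually words)
3. Recombine substrings into tokens (usually ngrams)
4. Index tokens (associate a unique int value with each token)
5. Transform each example using this index, either into a vector of ints or a dense float vector.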
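
The steps above can be sketched in plain Python to show how they fit together. This is not the layer's implementation: the helper names (standardize, split, make_ngrams, build_vocabulary, encode), the alphabetical tie-breaking between equally frequent tokens, and the sample sentences are all assumptions made for illustration.

```python
import string
from collections import Counter

MASK, OOV = "", "[UNK]"  # reserved slots 0 and 1, mirroring "int" mode


def standardize(text):
    """Step 1: lowercase and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def split(text):
    """Step 2: split on whitespace."""
    return text.split()


def make_ngrams(tokens, sizes=(1,)):
    """Step 3: join consecutive substrings into ngrams of the given sizes."""
    return [" ".join(tokens[i:i + n])
            for n in sizes
            for i in range(len(tokens) - n + 1)]


def tokenize(text, sizes=(1,)):
    return make_ngrams(split(standardize(text)), sizes)


def build_vocabulary(corpus, sizes=(1,)):
    """Step 4: rank tokens by frequency (ties broken alphabetically here)."""
    counts = Counter(tok for text in corpus for tok in tokenize(text, sizes))
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return [MASK, OOV] + ranked


def encode(text, vocabulary, sizes=(1,)):
    """Step 5: map each token to its index, unknown tokens to OOV (1)."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in tokenize(text, sizes)]


vocabulary = build_vocabulary(["The cat sat.", "The cat ran!"])
print(vocabulary)
print(encode("The dog sat?", vocabulary))
print(make_ngrams(["a", "b", "c"], sizes=(1, 2)))
```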

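Some notes on passing callables to customize splitting and normalization for this layer:

- Any callable can be passed to this layer, but if you want to serialize this object you should only pass functions that are registered as Keras serializable (see tf.keras.utils.register_keras_serializable for more details).
- When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.
- When using a custom callable for split, the data received by the callable will have the 1st dimension squeezed out: instead of [["string to split"], ["another string to split"]], the callable will see ["string to split", "another string to split"]. The callable should return a tensor with the first dimension containing the split tokens; in this example, we should see something like [["string", "to", "split"], ["another", "string", "to", "split"]]. This makes the callable site natively compatible with tf.strings.split().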
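
A minimal sketch of custom callables, assuming a made-up task of parsing comma-separated tags that may contain HTML markup. The function names, the package name "example", the regular expression, and the vocabulary are hypothetical; only the TensorFlow calls themselves are real.

```python
import tensorflow as tf


@tf.keras.utils.register_keras_serializable(package="example")
def lower_and_strip_tags(x):
    # Returns a tensor with the same shape as the input, as required.
    x = tf.strings.lower(x)
    return tf.strings.regex_replace(x, "<[^>]+>", "")


@tf.keras.utils.register_keras_serializable(package="example")
def split_on_comma(x):
    # Receives a rank-1 batch of strings and returns one row of tokens per
    # string, which is what tf.strings.split produces.
    return tf.strings.split(x, sep=",")


layer = tf.keras.layers.TextVectorization(
    standardize=lower_and_strip_tags,
    split=split_on_comma,
    vocabulary=["red", "green", "blue"])

# "pink" is not in the vocabulary, so it maps to the OOV index 1.
encoded = layer(tf.constant(["Red,<b>GREEN</b>,pink"])).numpy()
print(encoded)
```

Registering the functions is only needed if the layer has to be serialized; for in-memory use, plain functions behave the same way.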

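For an overview and full list of preprocessing layers, see the preprocessing guide.

Example:

This example instantiates a TextVectorization layer that lowercases text, splits on whitespace, strips punctuation, and outputs integer vocabulary indices.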

import tensorflow as tf

text_dataset = tf.data.Dataset.from_tensor_slices(["foo", "bar", "baz"])
max_features = 5000  # Maximum vocab size.
max_len = 4  # Sequence length to pad the outputs to.

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
 max_tokens=max_features,
 output_mode='int',
 output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(text_dataset.batch(64))

# Create the model that uses the vectorize text layer
model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)

# Now, the model can map strings to integers, and you can add an embedding
# layer to map these integers to learned embeddings.
input_data = [["foo qux bar"], ["qux baz"]]
model.predict(input_data)
# Output:
# array([[2, 1, 4, 0],
#        [1, 3, 0, 0]])

Example:

This example instantiates a TextVectorization layer by passing a list of vocabulary terms to the layer's __init__() method.

vocab_data = ["earth", "wind", "and", "fire"]
max_features = 5000  # Maximum vocab size (same value as in the previous example).
max_len = 4  # Sequence length to pad the outputs to.

# Create the layer, passing the vocab directly. You can also pass the
# vocabulary arg a path to a file containing one vocabulary word per
# line.
vectorize_layer = tf.keras.layers.TextVectorization(
 max_tokens=max_features,
 output_mode='int',
 output_sequence_length=max_len,
 vocabulary=vocab_data)

# Because we've passed the vocabulary directly, we don't need to adapt
# the layer - the vocabulary is already set. The vocabulary contains the
# padding token ('') and OOV token ('[UNK]') as well as the passed tokens.
vectorize_layer.get_vocabulary()
# Output:
# ['', '[UNK]', 'earth', 'wind', 'and', 'fire']
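
When a fixed vocabulary is combined with output_mode="tf_idf", the idf_weights argument must be supplied as well. The sketch below reuses the vocabulary from the example above with made-up weights; the weight values and the input sentence are assumptions chosen only to make the arithmetic easy to follow.

```python
import tensorflow as tf

vocab = ["earth", "wind", "and", "fire"]
# Illustrative inverse document frequency weights, one per vocabulary term.
idf_weights = [0.25, 0.75, 0.6, 0.4]

layer = tf.keras.layers.TextVectorization(
    output_mode="tf_idf", vocabulary=vocab, idf_weights=idf_weights)

# Columns are [OOV, earth, wind, and, fire]; each value is the term count
# multiplied by that term's idf weight.
weights = layer(tf.constant(["earth earth fire"])).numpy()
print(weights)
```

With "earth" appearing twice and "fire" once, the "earth" column should hold 2 * 0.25 and the "fire" column 1 * 0.4, with zeros elsewhere.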

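Note: This article was compiled by 纯净天空 from the original English documentation for tf.keras.layers.TextVectorization on tensorflow.org. Unless otherwise stated, copyright of the original code belongs to the original authors; please do not reproduce or copy this translation without permission.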