Python tf.keras.layers.TextVectorization usage and code examples

A preprocessing layer that maps text features to integer sequences.

Inherits from: PreprocessingLayer, Layer, Module

Usage

tf.keras.layers.TextVectorization(
    max_tokens=None, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False, **kwargs
)

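Args

max_tokens Maximum size of the vocabulary for this layer. This should only be specified when adapting a vocabulary or when setting pad_to_max_tokens=True. Note that this vocabulary contains 1 OOV token, so the effective number of tokens is (max_tokens - 1 - (1 if output_mode == "int" else 0)).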
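standardize
Optional specification for standardization to apply to the input text. Values can be:
- None: No standardization.
- "lower_and_strip_punctuation": Text will be lowercased and all punctuation removed.
- "lower": Text will be lowercased.
- "strip_punctuation": All punctuation will be removed.
- Callable: Inputs will be passed to the callable, which should standardize them and return the result.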
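split
Optional specification for splitting the input text. Values can be:
- None: No splitting.
- "whitespace": Split on whitespace.
- "character": Split on each unicode character.
- Callable: Standardized inputs will be passed to the callable, which should split them and return the result.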
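ngrams Optional specification for ngrams to create from the possibly-split input text. Values can be None, an integer, or a tuple of integers; passing an integer creates ngrams up to that integer, and passing a tuple of integers creates ngrams for the specified values in the tuple. Passing None means that no ngrams will be created.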
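output_mode
Optional specification for the output of the layer. Values can be "int", "multi_hot", "count" or "tf_idf", configuring the layer as follows:
- "int": Outputs integer indices, one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the vocabulary size to max_tokens - 2 instead of max_tokens - 1.
- "multi_hot": Outputs a single int array per sample, of either vocab_size or max_tokens size, containing 1s in all elements where the token mapped to that index exists at least once in the batch item.
- "count": Like "multi_hot", but the int array contains a count of the number of times the token at that index appeared in the batch item.
- "tf_idf": Like "multi_hot", but the TF-IDF algorithm is applied to find the value in each token slot.
For "int" output, any shape of input and output is supported. For all other output modes, currently only rank 1 inputs (and rank 2 outputs after splitting) are supported.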
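output_sequence_length Only valid in "int" mode. If set, the output will have its time dimension padded or truncated to exactly output_sequence_length values, resulting in a tensor of shape (batch_size, output_sequence_length) regardless of how many tokens resulted from the splitting step. Defaults to None.
pad_to_max_tokens Only valid in "multi_hot", "count" and "tf_idf" modes. If True, the output will have its feature axis padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens, resulting in a tensor of shape (batch_size, max_tokens) regardless of vocabulary size. Defaults to False.
vocabulary Optional. Either an array of strings or a string path to a text file. If passing an array, you can pass a tuple, list, 1D numpy array, or 1D tensor containing the string vocabulary terms. If passing a file path, the file should contain one line per term in the vocabulary. If this argument is set, there is no need to adapt() the layer.
idf_weights Only valid when output_mode is "tf_idf". A tuple, list, 1D numpy array, or 1D tensor of the same length as the vocabulary, containing the floating point inverse document frequency weights, which will be multiplied by per-sample term counts to obtain the final tf_idf weight. If the vocabulary argument is set and output_mode is "tf_idf", this argument must be supplied.
ragged Boolean. Only applicable to "int" output mode. If True, returns a RaggedTensor instead of a dense Tensor, where each sequence may have a different length after string splitting. Defaults to False.
sparse Boolean. Only applicable to "multi_hot", "count" and "tf_idf" output modes. If True, returns a SparseTensor instead of a dense Tensor. Defaults to False.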
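
As a rough illustration of how output_mode changes the result for the same input, here is a minimal sketch assuming TensorFlow 2.x is installed. The vocabulary, the sample sentences, and the helper name vectorize are assumptions made for this sketch and do not come from the original page. Outside "int" mode the vocabulary has no padding entry, so slot 0 holds the OOV token.

```python
import tensorflow as tf

# Illustrative vocabulary and inputs (assumptions made for this sketch).
vocab = ["earth", "wind", "and", "fire"]
samples = tf.constant(["earth wind earth", "fire water"])


def vectorize(mode, **kwargs):
    """Hypothetical helper: build a layer with a fixed vocabulary and apply it."""
    layer = tf.keras.layers.TextVectorization(
        output_mode=mode, vocabulary=vocab, **kwargs)
    return layer(samples)


# "int": one index per token; 0 is padding, 1 is OOV, vocabulary terms start at 2.
int_out = vectorize("int").numpy().tolist()

# "multi_hot" and "count": one vector per sample; slot 0 is the OOV token.
multi_hot_out = vectorize("multi_hot").numpy().tolist()
count_out = vectorize("count").numpy().tolist()

# pad_to_max_tokens widens the feature axis to max_tokens.
padded_shape = tuple(
    vectorize("count", max_tokens=8, pad_to_max_tokens=True).shape)

print(int_out)
print(multi_hot_out)
print(count_out)
print(padded_shape)
```

The word "water" is not in the vocabulary, so it lands on the OOV index in every mode.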

Attributes

is_adapted Whether the layer has already been fit to data.

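This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one example = one string) into either a list of token indices (one example = 1D tensor of integer token indices) or a dense representation (one example = 1D tensor of float values representing data about the example's tokens). This layer is meant to handle natural language inputs. To handle simple string inputs (categorical strings or pre-tokenized strings), see tf.keras.layers.StringLookup.

The vocabulary for the layer must be either supplied on construction or learned via adapt(). When this layer is adapted, it analyzes the dataset, determines the frequency of individual string values, and creates a vocabulary from them. This vocabulary can have unlimited size or be capped, depending on the configuration options for this layer; if there are more unique values in the input than the maximum vocabulary size, the most frequent terms are used to create the vocabulary.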
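
To make the capping behavior concrete, the sketch below adapts a layer on a tiny made-up corpus with max_tokens=4. The corpus and variable names are assumptions for illustration only. In "int" mode two slots are taken by the padding and OOV entries, so only the two most frequent terms survive.

```python
import tensorflow as tf

# Illustrative corpus: "a" appears 4 times, "b" 3 times, "c" and "d" once each.
corpus = tf.data.Dataset.from_tensor_slices(["a a a b b c", "a b d"]).batch(2)

# max_tokens=4 in "int" mode leaves room for 4 - 2 = 2 learned tokens.
layer = tf.keras.layers.TextVectorization(max_tokens=4, output_mode="int")
layer.adapt(corpus)

vocab = layer.get_vocabulary()
print(vocab)

# Terms that were cut from the vocabulary map to the OOV index 1.
indices = layer(tf.constant(["a c b"])).numpy().tolist()
print(indices)
```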

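The processing of each example contains the following steps:

1. Standardize each example (usually lowercasing + punctuation stripping)
2. Split each example into substrings (usually words)
3. Recombine substrings into tokens (usually ngrams)
4. Index tokens (associate a unique int value with each token)
5. Transform each example using this index, either into a vector of ints or a dense float vector.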
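
The steps above can be approximated by hand with tf.strings operations, which may help when debugging what the layer sees at each stage. This is a sketch under stated assumptions, not the layer's actual implementation: the sample text and vocabulary are invented, and the [[:punct:]] pattern only approximates the layer's own punctuation rule.

```python
import tensorflow as tf

text = tf.constant(["Hello, World! Hello."])

# Step 1: standardize (lowercase, then drop punctuation).
# Assumption: the POSIX class below approximates the layer's punctuation regex.
standardized = tf.strings.regex_replace(
    tf.strings.lower(text), r"[[:punct:]]", "")

# Step 2: split on whitespace; the result is a RaggedTensor of tokens.
tokens = tf.strings.split(standardized)

# Step 3: recombine tokens into ngrams (here: bigrams only).
bigrams = tf.strings.ngrams(tokens, ngram_width=2)

# Steps 4-5: let the layer index the tokens and build the output. With a
# fixed vocabulary, index 0 is padding, 1 is OOV, and the terms start at 2.
layer = tf.keras.layers.TextVectorization(vocabulary=["hello", "world"])
indices = layer(text)

print(tokens.to_list())
print(bigrams.to_list())
print(indices.numpy().tolist())
```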

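Some notes on passing callables to customize splitting and normalization for this layer:

1. Any callable can be passed to this layer, but if you want to serialize this object you should only pass functions that are registered as Keras serializable (see tf.keras.utils.register_keras_serializable for more details).
2. When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.
3. When using a custom callable for split, the data received by the callable will have the first dimension squeezed out: instead of [["string to split"], ["another string to split"]], the callable will see ["string to split", "another string to split"]. The callable should return a tensor whose first dimension contains the split tokens; in this example, we should see something like [["string", "to", "split"], ["another", "string", "to", "split"]]. This makes the callable site natively compatible with tf.strings.split().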
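
A minimal sketch of the notes above, assuming TensorFlow 2.x: both callables are registered as Keras serializable, the standardize callable keeps the input shape, and the split callable delegates to tf.strings.split(). The package name "demo", the function names, the vocabulary, and the inputs are all hypothetical.

```python
import tensorflow as tf


# Registering the callables lets Keras serialize a layer that uses them.
@tf.keras.utils.register_keras_serializable(package="demo")
def lower_and_drop_br(x):
    # Returns a tensor with the same shape as the input.
    return tf.strings.regex_replace(tf.strings.lower(x), "<br />", "")


@tf.keras.utils.register_keras_serializable(package="demo")
def split_on_comma(x):
    # Receives a 1D batch of strings; returns one list of tokens per string.
    return tf.strings.split(x, sep=",")


layer = tf.keras.layers.TextVectorization(
    standardize=lower_and_drop_br,
    split=split_on_comma,
    vocabulary=["red", "green", "blue"])

# "pink" is not in the vocabulary, so it maps to the OOV index 1.
result = layer(tf.constant(["Red,GREEN<br />", "blue,pink,red"]))
print(result.numpy().tolist())
```

The shorter row is padded with 0 because "int" mode returns a dense tensor unless ragged=True is set.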

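For an overview and full list of preprocessing layers, see the preprocessing guide.

Example:

This example instantiates a TextVectorization layer that lowercases text, splits on whitespace, strips punctuation, and outputs integer vocabulary indices.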

import tensorflow as tf

text_dataset = tf.data.Dataset.from_tensor_slices(["foo", "bar", "baz"])
max_features = 5000  # Maximum vocab size.
max_len = 4  # Sequence length to pad the outputs to.

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
 max_tokens=max_features,
 output_mode='int',
 output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(text_dataset.batch(64))

# Create the model that uses the vectorize text layer
model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)

# Now, the model can map strings to integers, and you can add an embedding
# layer to map these integers to learned embeddings.
input_data = [["foo qux bar"], ["qux baz"]]
model.predict(input_data)
# Output:
# array([[2, 1, 4, 0],
#        [1, 3, 0, 0]])

Example:

This example instantiates a TextVectorization layer by passing a list of vocabulary terms to the layer's __init__() method.

vocab_data = ["earth", "wind", "and", "fire"]
max_features = 5000  # Maximum vocab size (same value as in the previous example).
max_len = 4  # Sequence length to pad the outputs to.

# Create the layer, passing the vocab directly. You can also pass the
# vocabulary arg a path to a file containing one vocabulary word per
# line.
vectorize_layer = tf.keras.layers.TextVectorization(
 max_tokens=max_features,
 output_mode='int',
 output_sequence_length=max_len,
 vocabulary=vocab_data)

# Because we've passed the vocabulary directly, we don't need to adapt
# the layer - the vocabulary is already set. The vocabulary contains the
# padding token ('') and OOV token ('[UNK]') as well as the passed tokens.
vectorize_layer.get_vocabulary()
# Output:
# ['', '[UNK]', 'earth', 'wind', 'and', 'fire']


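Note: This article was selected and compiled by 純淨天空 from the original English documentation for tf.keras.layers.TextVectorization on tensorflow.org. Unless otherwise stated, copyright in the original code belongs to the original author; please do not reproduce or copy this translation without permission or authorization.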