Python tf.strings.unicode_decode_with_offsets: usage and code examples

Decodes each string into a sequence of code points, together with the start offset of each code point.

Usage

tf.strings.unicode_decode_with_offsets(
    input, input_encoding, errors='replace', replacement_char=65533,
    replace_control_characters=False, name=None
)

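Parameters

input An N-dimensional, potentially ragged string tensor with shape [D1...DN]. N must be statically known.
input_encoding String name of the Unicode encoding used to decode each string.
errors
Specifies the response when an input string cannot be converted using the indicated encoding. One of:
- 'strict': Raise an exception for any illegal substring.
- 'replace': Replace illegal substrings with replacement_char.
- 'ignore': Skip illegal substrings.
replacement_char The replacement code point used in place of invalid substrings in input when errors='replace', and in place of C0 control characters in input when replace_control_characters=True.
replace_control_characters Whether to replace the C0 control characters (U+0000 - U+001F) with replacement_char.
name A name for the operation (optional).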
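
To make the errors, replacement_char, and replace_control_characters parameters concrete, here is a minimal sketch. The byte strings b'A\xffB' and b'a\x01b' are made-up samples (0xFF is never valid in UTF-8, and 0x01 is a C0 control character), and the sketch assumes TensorFlow 2.x with eager execution, so that 'strict' surfaces as a tf.errors.InvalidArgumentError at call time.

```python
import tensorflow as tf

bad = b'A\xffB'  # hypothetical sample: 0xFF is not a valid UTF-8 byte

# errors='replace' (the default): the invalid byte becomes replacement_char.
cp_replace, off_replace = tf.strings.unicode_decode_with_offsets(
    bad, 'UTF-8', errors='replace')
print(cp_replace.numpy().tolist())   # [65, 65533, 66]
print(off_replace.numpy().tolist())  # [0, 1, 2]

# errors='ignore': the invalid byte is skipped; the remaining offsets
# still refer to positions in the original byte string.
cp_ignore, off_ignore = tf.strings.unicode_decode_with_offsets(
    bad, 'UTF-8', errors='ignore')
print(cp_ignore.numpy().tolist())    # [65, 66]
print(off_ignore.numpy().tolist())   # [0, 2]

# errors='strict': decoding fails instead of substituting or skipping.
try:
    tf.strings.unicode_decode_with_offsets(bad, 'UTF-8', errors='strict')
    strict_raised = False
except tf.errors.InvalidArgumentError:
    strict_raised = True
print(strict_raised)

# replace_control_characters=True: C0 control characters are replaced
# with replacement_char (here '?', chosen only for illustration).
cp_ctrl, _ = tf.strings.unicode_decode_with_offsets(
    b'a\x01b', 'UTF-8', replacement_char=ord('?'),
    replace_control_characters=True)
print(cp_ctrl.numpy().tolist())      # [97, 63, 98]
```

With 'ignore', the gap between offsets 0 and 2 is the only sign that a byte was dropped. This is one reason to keep the offsets when invalid input is expected.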

Returns

A tuple of N+1 dimensional tensors (codepoints, start_offsets).
- codepoints is an int32 tensor with shape [D1...DN, (num_chars)].
- start_offsets is an int64 tensor with shape [D1...DN, (num_chars)].
The returned tensors are tf.Tensor if input is a scalar, and tf.RaggedTensor otherwise.
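
The difference between scalar and non-scalar input can be checked directly. The following is a small sketch rather than part of the official documentation; the sample strings are arbitrary, and it assumes TensorFlow 2.x with eager execution.

```python
import tensorflow as tf

# Scalar input: both results are ordinary dense tensors.
cp_scalar, off_scalar = tf.strings.unicode_decode_with_offsets(
    tf.constant('héllo'), 'UTF-8')
print(isinstance(cp_scalar, tf.Tensor))        # True
print(cp_scalar.dtype, off_scalar.dtype)       # int32 and int64

# Vector input: rows can have different lengths, so the results are ragged.
cp_ragged, off_ragged = tf.strings.unicode_decode_with_offsets(
    tf.constant(['héllo', 'hi']), 'UTF-8')
print(isinstance(cp_ragged, tf.RaggedTensor))  # True
print(off_ragged.to_list())                    # [[0, 1, 3, 4, 5], [0, 1]]
```

The 'é' occupies two bytes in UTF-8, which is why the offset jumps from 1 to 3 in the first row.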

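This op is similar to tf.strings.unicode_decode(...), but it also returns the start offset of each character within its string. This information can be used to align the characters with the original byte sequence.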

Returns a tuple (codepoints, start_offsets) where:

codepoints[i1...iN, j] is the Unicode code point of the jth character in input[i1...iN], when decoded using input_encoding.
start_offsets[i1...iN, j] is the start byte offset of the jth character in input[i1...iN], when decoded using input_encoding.
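
One way to use the start offsets for alignment is to slice the original bytes into one piece per character. This is only a sketch: the helper logic (taking each character's end as the next character's start) is an assumption that holds when no bytes are skipped, for example with valid UTF-8 input.

```python
import tensorflow as tf

text = 'G\xf6\xf6dnight'           # sample string with two 2-byte characters
raw = text.encode('utf-8')

codepoints, start_offsets = tf.strings.unicode_decode_with_offsets(raw, 'UTF-8')
starts = start_offsets.numpy().tolist()

# Each character ends where the next one starts; the last ends at len(raw).
ends = starts[1:] + [len(raw)]
pieces = [raw[s:e] for s, e in zip(starts, ends)]

print(pieces[1])                                   # b'\xc3\xb6'
print([p.decode('utf-8') for p in pieces] == list(text))  # True
```

Joining the pieces reproduces the original byte string, which confirms that the slices cover it without gaps or overlap.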

Example:

import tensorflow as tf

# Encode two Unicode strings as UTF-8 byte strings.
input = [s.encode('utf8') for s in (u'G\xf6\xf6dnight', u'\U0001f60a')]
result = tf.strings.unicode_decode_with_offsets(input, 'UTF-8')
result[0].to_list()  # codepoints
# [[71, 246, 246, 100, 110, 105, 103, 104, 116], [128522]]
result[1].to_list()  # start offsets, in bytes
# [[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]

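Note: This article was selected and compiled by 純淨天空 from the original English documentation for tf.strings.unicode_decode_with_offsets on tensorflow.org. Unless otherwise stated, copyright in the original code belongs to the original author. Please do not reproduce or copy this translation without permission.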