Python cudf.core.subword_tokenizer.SubwordTokenizer.__call__ usage and code examples

Usage: SubwordTokenizer.__call__(text, max_length: int, max_num_rows: int, add_special_tokens: bool = True, padding: str = 'max_length', truncation: Union[bool, str] = False, stride: int = 0, return_tensors: str = 'cp', return_token_type_ids: bool = False)

Runs the CUDA BERT subword tokenizer on a cuDF string column. Words are encoded to token IDs using the vocabulary of a pretrained tokenizer.
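To make "encoding words to token IDs" concrete, here is a minimal plain-Python sketch of greedy longest-match-first subword splitting in the style of BERT WordPiece. It is an illustration only, not the CUDA implementation used by cuDF; the function name `wordpiece` and the tiny `toy_vocab` are hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest match.

    Pieces that do not start the word carry the "##" continuation prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix matched: the whole word maps to the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary: piece -> token ID.
toy_vocab = {"[UNK]": 0, "token": 1, "##izer": 2, "##s": 3, "best": 4}

pieces = wordpiece("tokenizers", toy_vocab)
ids = [toy_vocab[p] for p in pieces]
print(pieces, ids)
```

A real vocabulary file has tens of thousands of entries; the toy one here only exists to show how a single word can map to several token IDs.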

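Parameters:

text: cudf string series: The batch of sequences to be encoded.
max_length: int: Controls the maximum length to use or pad to.
max_num_rows: int: Maximum number of rows of output token-ids the tokenizer is expected to generate. Used to allocate temporary working memory on the GPU device. If the output generates more rows than this, behavior is undefined. The actual number of rows varies with stride, truncation, and max_length. For example, for non-overlapping sequences the number of output rows equals the number of input rows. A good default can be twice the max_length.
add_special_tokens: bool, optional, default True: Whether to encode the sequences with the special tokens of the BERT classification model.
padding: "max_length": Pad to the maximum length specified by the max_length argument.
truncation: bool or str, default False: True: truncate to the maximum length specified by the max_length argument; False or 'do_not_truncate' (default): no truncation (output differs from HuggingFace).
stride: int, optional, default 0: The value of this argument defines the number of overlapping tokens. Information about the overlapping tokens is present in the metadata output.
return_tensors: str, {"cp", "pt", "tf"}, default "cp": "cp": return cupy cp.ndarray objects; "tf": return TensorFlow tf.constant objects; "pt": return PyTorch torch.Tensor objects.
return_token_type_ids: bool, optional: Only False is currently supported.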
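Because max_num_rows depends on stride, truncation, and max_length, it can help to estimate the row count before calling the tokenizer. The sketch below uses a simplified sliding-window model: each additional row is assumed to advance by max_length - stride tokens, and special tokens are ignored. The helper `estimate_output_rows` is hypothetical and is not part of cuDF; treat its result as a rough lower bound and add headroom when sizing max_num_rows.

```python
import math


def estimate_output_rows(num_tokens, max_length, stride=0, truncation=False):
    """Estimate how many output rows one input string produces.

    Assumption: without truncation, a string longer than max_length is
    split into windows of max_length tokens that overlap by `stride`.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if truncation or num_tokens <= max_length:
        # Truncated or short inputs always fit in a single row.
        return 1
    step = max_length - stride  # new tokens contributed by each extra row
    return 1 + math.ceil((num_tokens - max_length) / step)


# Token counts per input string (made-up numbers for illustration).
token_counts = [5, 20, 3]
total_rows = sum(
    estimate_output_rows(n, max_length=8, stride=2) for n in token_counts
)
print(total_rows)
```

With truncation=True the estimate collapses to one row per input string, which matches the non-overlapping case described for max_num_rows.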

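Returns:

An encoding with the following fields:

input_ids: (type defined by return_tensors): A tensor of token IDs to be fed to the model.
attention_mask: (type defined by return_tensors): A tensor of indices specifying which tokens the model should attend to.
metadata: (type defined by return_tensors): Each row contains the index of the original string, followed by the first and last index of the token-ids that are non-padded and non-overlapping.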
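The relationship between input_ids and attention_mask can be sketched in plain Python: a row is padded up to max_length, and the mask is 1 for real tokens and 0 for padding. This is an illustration only, not cuDF code; the helper `pad_and_mask` is hypothetical, and the pad ID of 0 is an assumption that matches the usual BERT vocabulary.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad one row of token IDs to max_length and build its attention mask."""
    if len(token_ids) > max_length:
        # Truncating or splitting long rows is out of scope for this sketch.
        raise ValueError("row is longer than max_length")
    num_real = len(token_ids)
    num_pad = max_length - num_real
    padded = list(token_ids) + [pad_id] * num_pad
    mask = [1] * num_real + [0] * num_pad  # 1 = attend, 0 = ignore padding
    return padded, mask


# Five token IDs, including the two BERT special tokens 101 and 102.
row_ids, row_mask = pad_and_mask([101, 1142, 1110, 1103, 102], max_length=8)
print(row_ids)
print(row_mask)
```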

Examples:

>>> import cudf
>>> from cudf.utils.hash_vocab_utils import hash_vocab
>>> hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')

>>> from cudf.core.subword_tokenizer import SubwordTokenizer
>>> cudf_tokenizer = SubwordTokenizer('voc_hash.txt',
...                                    do_lower_case=True)
>>> str_series = cudf.Series(['This is the', 'best book'])
>>> tokenizer_output = cudf_tokenizer(str_series,
...                                   max_length=8,
...                                   max_num_rows=len(str_series),
...                                   padding='max_length',
...                                   return_tensors='pt',
...                                   truncation=True)
>>> tokenizer_output['input_ids']
tensor([[ 101, 1142, 1110, 1103,  102,    0,    0,    0],
        [ 101, 1436, 1520,  102,    0,    0,    0,    0]],
       device='cuda:0', dtype=torch.int32)
>>> tokenizer_output['attention_mask']
tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0]],
       device='cuda:0', dtype=torch.int32)
>>> tokenizer_output['metadata']
tensor([[0, 1, 3],
        [1, 1, 2]], device='cuda:0', dtype=torch.int32)
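The metadata rows can be used to recover the content tokens of each original string. The sketch below copies the values shown in the example output into plain Python lists, so it runs without a GPU. It assumes the last index is inclusive, which is consistent with the example (the special tokens 101 and 102 fall outside the range); the helper `content_tokens` is hypothetical.

```python
# Values copied from the example output above, as plain Python lists.
input_ids = [
    [101, 1142, 1110, 1103, 102, 0, 0, 0],
    [101, 1436, 1520, 102, 0, 0, 0, 0],
]
metadata = [
    [0, 1, 3],  # original string 0, content tokens at positions 1..3
    [1, 1, 2],  # original string 1, content tokens at positions 1..2
]


def content_tokens(input_ids, metadata):
    """Group the non-padded, non-overlapping token IDs by source string."""
    grouped = {}
    for row, (source, first, last) in zip(input_ids, metadata):
        # Assumption: `last` is an inclusive index into the row.
        grouped.setdefault(source, []).extend(row[first:last + 1])
    return grouped


tokens_by_string = content_tokens(input_ids, metadata)
print(tokens_by_string)
```

When an input string spans several output rows, its rows share the same source index, so the grouping stitches the non-overlapping parts back together in row order.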

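Note: This article was compiled by 纯净天空 from the original English documentation by rapids.ai, cudf.core.subword_tokenizer.SubwordTokenizer.__call__. Unless otherwise stated, copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission.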