Python dask.bag.read_text usage and code examples

Usage: dask.bag.read_text(urlpath, blocksize=None, compression='infer', encoding='utf-8', errors='strict', linedelimiter=None, collection=True, storage_options=None, files_per_partition=None, include_path=False)

Read lines from text files.

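Parameters:

urlpath: str or list: Absolute or relative file path(s). Prefix with a protocol such as s3:// to read from alternative filesystems. To read from multiple files, pass a globstring or a list of paths, with the caveat that they must all use the same protocol.
blocksize: None, int, or str: Size (in bytes) at which to split larger files. Streams by default. Can be None (for streaming), an integer number of bytes, or a string such as "128MiB".
compression: string: Compression format such as 'gzip' or 'xz'. Defaults to 'infer'.
encoding: string
errors: string
linedelimiter: string or None
collection: bool, optional: If True, return a dask.bag; if False, return a list of delayed values.
storage_options: dict: Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.
files_per_partition: None or int: If set, group input files into partitions of the requested size, instead of one partition per file. Mutually exclusive with blocksize.
include_path: bool: Whether to include the path in the bag. If True, elements are tuples of (line, path). Defaults to False.

Returns:

dask.bag.Bag or list: dask.bag.Bag if collection is True, otherwise a list of delayed values.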
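
The partitioning parameters above can be illustrated with a minimal sketch. It assumes a scratch directory created with tempfile and three small files whose names and contents are invented for this illustration. The synchronous scheduler is selected only to keep the example self-contained; it is not required by read_text.

```python
import os
import tempfile

import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch needs no worker processes.
dask.config.set(scheduler="synchronous")

# Hypothetical sample data: three two-line files in a scratch directory.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, f"myfiles.{i}.txt"), "w", encoding="utf-8") as f:
        f.write(f"file {i} line 0\nfile {i} line 1\n")

pattern = os.path.join(tmpdir, "myfiles.*.txt")

# Default: one partition per matched file.
b = db.read_text(pattern)
per_file_partitions = b.npartitions

# files_per_partition groups the three files two at a time.
grouped = db.read_text(pattern, files_per_partition=2)
grouped_partitions = grouped.npartitions

# collection=False returns a list of delayed values instead of a bag.
parts = db.read_text(pattern, collection=False)

# Lines keep their trailing newline, so strip before comparing.
lines = sorted(b.map(str.strip).compute())

# blocksize and files_per_partition cannot be combined.
try:
    db.read_text(pattern, blocksize=16, files_per_partition=2)
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```

Here per_file_partitions is 3 and grouped_partitions is 2, since the third file ends up in a partition of its own.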

Examples:

>>> from dask.bag import read_text
>>> b = read_text('myfiles.1.txt')
>>> b = read_text('myfiles.*.txt')
>>> b = read_text('myfiles.*.txt.gz')
>>> b = read_text('s3://bucket/myfiles.*.txt')
>>> b = read_text('s3://key:secret@bucket/myfiles.*.txt')
>>> b = read_text('hdfs://namenode.example.com/myfiles.*.txt')
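
The remote paths above need real storage, so the following is a local sketch of the globstring and compressed-file case only. It assumes two gzip files written with the standard gzip module; the file names and contents are invented for this illustration. With the default compression='infer', the .gz extension is what selects gzip decompression.

```python
import gzip
import os
import tempfile

import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch needs no worker processes.
dask.config.set(scheduler="synchronous")

# Hypothetical sample data: two gzip-compressed text files.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    path = os.path.join(tmpdir, f"myfiles.{i}.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(f"alpha {i}\nbeta {i}\n")

# The globstring matches both files; compression is inferred from ".gz".
b = db.read_text(os.path.join(tmpdir, "myfiles.*.txt.gz"))
n_partitions = b.npartitions

# Strip the trailing newline that each line keeps.
lines = sorted(b.map(str.strip).compute())
```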

Parallelize a large file by providing the number of uncompressed bytes to load into each partition.

>>> b = read_text('largefile.txt', blocksize='10MB')
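
To see the effect of blocksize without a genuinely large file, the sketch below assumes a 900-byte file and an artificially small integer blocksize of 100 bytes. The file name and contents are invented for this illustration, and the exact number of partitions is deliberately not relied on.

```python
import os
import tempfile

import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch needs no worker processes.
dask.config.set(scheduler="synchronous")

# Hypothetical sample data: 100 lines of 9 bytes each (900 bytes in total).
expected = [f"line {i:03d}" for i in range(100)]
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "largefile.txt")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(expected) + "\n")

# A tiny blocksize (in bytes) forces the single file into several partitions.
b = db.read_text(path, blocksize=100)
n_partitions = b.npartitions

# Blocks are split on line boundaries, so every line comes back whole.
lines = [line.strip() for line in b.compute()]
```

In practice blocksize would be far larger, for example the '10MB' string used above; the small value here only makes the splitting visible.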

Get the file path of each line in the bag by setting include_path=True.

>>> b = read_text('myfiles.*.txt', include_path=True) 
>>> b.take(1) 
(('first line of the first file', '/home/dask/myfiles.0.txt'),)
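
Because each element is a (line, path) tuple, the path can be used in downstream operations. The sketch below counts lines per source file. It assumes two small local files invented for this illustration, and it compares only the base name of each path, since the directory prefix depends on the machine.

```python
import os
import tempfile

import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch needs no worker processes.
dask.config.set(scheduler="synchronous")

# Hypothetical sample data: two files with two lines each.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmpdir, f"myfiles.{i}.txt"), "w", encoding="utf-8") as f:
        f.write(f"first line of file {i}\nsecond line of file {i}\n")

b = db.read_text(os.path.join(tmpdir, "myfiles.*.txt"), include_path=True)

# take(1) returns a tuple holding one (line, path) element.
first_line, first_path = b.take(1)[0]

# Count how many lines came from each file, keyed by base name.
counts = dict(b.map(lambda pair: os.path.basename(pair[1])).frequencies().compute())
```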


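Note: This article was selected and compiled by 純淨天空 from the original English work dask.bag.read_text by the authors at dask.org. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.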
