Python dask.bag.Bag.to_avro: usage and code examples

Usage:
Bag.to_avro(filename, schema, name_function=None, storage_options=None, codec='null', sync_interval=16000, metadata=None, compute=True, **kwargs)

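Write the bag to a set of Avro files.

The schema is a dictionary describing the data and can be fairly complex; see https://avro.apache.org/docs/1.8.2/gettingstartedpython.html#Defining+a+schema and https://fastavro.readthedocs.io/en/latest/writer.html. Its structure is as follows: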

{'name': 'Test',
 'namespace': 'Test',
 'doc': 'Descriptive text',
 'type': 'record',
 'fields': [
    {'name': 'a', 'type': 'int'},
 ]}

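Here the "name" key is required, while "namespace" and "doc" are optional descriptors; "type" must always be "record". The list of fields should have one entry for every key of the input records, and each field's type should be one of the primitive, complex, or logical types of the Avro specification (https://avro.apache.org/docs/1.8.2/spec.html).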
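The schema rules described here can be checked by hand before writing. The following is a minimal sketch using only the standard library; the helper name `check_schema_covers_records` is hypothetical and is not part of dask or fastavro. It checks only the three rules stated in the text (required "name", "type" equal to "record", one field per record key) and does not validate field types.

```python
def check_schema_covers_records(schema, records):
    """Return a list of problems; an empty list means the basic checks passed.

    Hypothetical helper for illustration only, not a dask or fastavro API.
    """
    problems = []
    # "name" is the only required descriptor besides "type" and "fields".
    if "name" not in schema:
        problems.append("schema is missing the required 'name' key")
    # The top-level type must always be "record".
    if schema.get("type") != "record":
        problems.append("schema 'type' must be 'record'")
    # Every key of every input record needs a matching field entry.
    field_names = {field["name"] for field in schema.get("fields", [])}
    for index, record in enumerate(records):
        missing = sorted(set(record) - field_names)
        if missing:
            problems.append(
                f"record {index} has keys with no schema field: {missing}"
            )
    return problems


schema = {
    "name": "People",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "value", "type": "int"},
    ],
}
records = [{"name": "Alice", "value": 100}, {"name": "Bob", "value": 200}]

print(check_schema_covers_records(schema, records))
```

A check like this catches a mismatch between records and schema early, before any file is opened for writing.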

One Avro file is produced per input partition.

Parameters:

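b (dask.bag.Bag)
filename (list of str or str): File names to write to. If a list, its length must match the number of partitions. If a string, it must include the glob character "*", which is expanded using name_function.
schema (dict): Avro schema dictionary, see above.
name_function (None or callable): Expands integers into strings, see dask.bytes.utils.build_name_function.
storage_options (None or dict): Extra key/value options to pass to the backend file system.
codec ('null', 'deflate', or 'snappy'): Compression algorithm.
sync_interval (int): Number of records to include in each block within a file.
metadata (None or dict): Included in the file header.
compute (bool): If True, the files are written immediately and the call blocks until writing completes. If False, returns delayed objects, which the user can compute when convenient.
kwargs: Passed to compute(), if compute=True.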
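To illustrate how a string filename containing "*" relates to name_function, here is a minimal standard-library sketch. The helper `expand_filename` is hypothetical and is not dask's implementation; it assumes a plain `str` default, whereas the real default built by build_name_function may differ (for example by zero-padding the partition index).

```python
def expand_filename(template, npartitions, name_function=None):
    """Expand a '*' template into one file name per partition.

    Hypothetical illustration only; dask performs its own expansion.
    """
    if "*" not in template:
        raise ValueError("a string filename must contain the glob character '*'")
    if name_function is None:
        # Assumption for this sketch: use the bare partition index.
        name_function = str
    # Partition i is written to the template with '*' replaced by name_function(i).
    return [template.replace("*", name_function(i)) for i in range(npartitions)]


print(expand_filename("my-data.*.avro", 2))
print(expand_filename("part-*.avro", 3, name_function=lambda i: f"{i:04d}"))
```

A custom name_function that zero-pads, as in the second call, keeps the output files in partition order when they are sorted by name.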

Examples:

>>> import dask.bag as db
>>> b = db.from_sequence([{'name': 'Alice', 'value': 100},
...                       {'name': 'Bob', 'value': 200}])
>>> schema = {'name': 'People', 'doc': "Set of people's scores",
...           'type': 'record',
...           'fields': [
...               {'name': 'name', 'type': 'string'},
...               {'name': 'value', 'type': 'int'}]}
>>> b.to_avro('my-data.*.avro', schema)  
['my-data.0.avro', 'my-data.1.avro']

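Note: This article was selected and compiled by 純淨天空 from the original English documentation for dask.bag.Bag.to_avro published at dask.org. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.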