Python pyspark DataFrame usage and code examples

This article briefly introduces the usage of pyspark.pandas.DataFrame.

Usage: class pyspark.pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

A pandas-on-Spark DataFrame corresponds logically to a pandas DataFrame. Internally, it holds a Spark DataFrame.

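Variables:

_internal - an internal immutable Frame used to manage metadata.

Parameters:

data: numpy ndarray (structured or homogeneous), dict, pandas DataFrame, Spark DataFrame or pandas-on-Spark Series: A dict can contain Series, arrays, constants, or list-like objects. If data is a dict, argument order is preserved in Python 3.6 and later. Note that if data is a pandas DataFrame, a Spark DataFrame, or a pandas-on-Spark Series, the other arguments should not be used.
index: Index or array-like: Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.
columns: Index or array-like: Column labels to use for the resulting frame. Defaults to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
dtype: dtype, default None: Data type to force. Only a single dtype is allowed. If None, the dtype is inferred.
copy: boolean, default False: Copy data from the inputs. Only affects DataFrame / 2d ndarray input.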

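Examples:

Constructing a DataFrame from a dictionary.

>>> import numpy as np
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> d = {'col1': [1, 2], 'col2': [3, 4]}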
>>> df = ps.DataFrame(data=d, columns=['col1', 'col2'])
>>> df
   col1  col2
0     1     3
1     2     4

Constructing a DataFrame from a pandas DataFrame:

>>> df = ps.DataFrame(pd.DataFrame(data=d, columns=['col1', 'col2']))
>>> df
   col1  col2
0     1     3
1     2     4

Note that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = ps.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing a DataFrame from a numpy ndarray:

>>> df2 = ps.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> df2  # values are random, so the output below will differ between runs
   a  b  c  d  e
0  3  1  4  9  8
1  4  8  4  8  4
2  7  6  5  6  7
3  8  7  9  1  0
4  2  5  4  3  9

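Note: This article was selected and compiled by 純淨天空 from the original English work pyspark.pandas.DataFrame published at spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author. Do not reproduce or copy this translation without permission or authorization.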