

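Pandas Made Easy: 21 Beginner Operations and Tips!

Introduction

Pandas is an easy-to-use yet powerful data analysis library. Like NumPy, it vectorizes most basic operations, so they run in optimized compiled code instead of Python-level loops, which speeds up computation. The operations covered here are very basic, but they are essential if you are just getting started with Pandas. Here we import Pandas as "pd" (import pandas as pd) and then use "pd" to perform the other basic Pandas operations.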

1. How do you read data from a CSV file or a text file?

CSV files are comma-separated, so to read one, do the following:

df = pd.read_csv(file_path, sep=',', header=0, index_col=False, names=None)
Explanation: the read_csv function has a plethora of parameters; only a few of the most commonly used ones are specified here. A few key points:
a) header=0 means the column names are in the first row of the file; if the file has no header row, specify header=None.
b) index_col=False means the first column of the data is not used as the index of the data frame; if the first column really is an index, pass its position or name instead (for example index_col=0).
c) names=None means you are not specifying the column names and want them inferred from the CSV file, that is, the row given by header contains the column names. Otherwise, pass the names here in the same order as the columns appear in the CSV file.

If you are reading a text file separated by spaces or tabs, simply change sep to sep=" " or sep='\t'.
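
As a minimal sketch of the calls above, the following reads CSV and tab-separated text from in-memory buffers instead of files on disk. The sample data and the names `csv_text` and `tsv_text` are assumptions made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "name,age,height\nAnn,20,180\nBob,18,165\n"

# header=0: the first line holds the column names.
df = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)

# Tab-separated text without a header row, so the names are supplied.
tsv_text = "Ann\t20\t180\nBob\t18\t165\n"
df_tab = pd.read_csv(
    io.StringIO(tsv_text), sep="\t", header=None,
    names=["name", "age", "height"],
)

print(df.columns.tolist())      # ['name', 'age', 'height']
print(df_tab.columns.tolist())  # ['name', 'age', 'height']
```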

2. How do you create a DataFrame from pre-existing columns or a NumPy 2D array?

Using a dictionary

# c1, c2, c3 hold the column values (lists or arrays of equal length).
d_dic = {'first_col_name': c1, 'second_col_name': c2, '3rd_col_name': c3}
df = pd.DataFrame(data=d_dic)

Using a NumPy array

np_data = np.zeros((no_of_samples, no_of_features))  # any 2D NumPy array
df = pd.DataFrame(data=np_data, columns=list_of_col_names)
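
A small runnable sketch of both approaches follows; the column names and values are hypothetical and chosen only to make the shapes easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical column values; each list becomes one column.
names = ["Ann", "Bob", "Cid"]
ages = [20, 18, 16]
df_from_dict = pd.DataFrame(data={"name": names, "age": ages})

# A 3x2 array of zeros; the column labels are supplied separately.
np_data = np.zeros((3, 2))
df_from_array = pd.DataFrame(data=np_data, columns=["feature_1", "feature_2"])

print(df_from_dict.shape)   # (3, 2)
print(df_from_array.shape)  # (3, 2)
```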

3. How do you view the top and bottom x rows of a DataFrame?

df.head(num_of_rows_to_view)  # top rows
df.tail(num_of_rows_to_view)  # bottom rows
col = list_of_columns_to_view
df[col].head(num_of_rows_to_view)
df[col].tail(num_of_rows_to_view)

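4. How do you rename one or more columns?

df = pd.DataFrame(data={'a':[1,2,3,4,5],'b':[0,1,5,10,15]})
new_df = df.rename(columns={'a':'new_a','b':'new_b'})

It is important to store the returned data frame in a new one, because rename does not operate in place by default.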
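
A minimal check of the rename behavior is sketched below. Note that the mapping is passed through the `columns` keyword so that it applies to column labels; a bare mapping would be applied to the row index instead.

```python
import pandas as pd

df = pd.DataFrame(data={"a": [1, 2, 3, 4, 5], "b": [0, 1, 5, 10, 15]})

# columns= targets column labels rather than row labels.
new_df = df.rename(columns={"a": "new_a", "b": "new_b"})

print(new_df.columns.tolist())  # ['new_a', 'new_b']
print(df.columns.tolist())      # ['a', 'b'] (the original is unchanged)
```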

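5. How do you get the column names as a list?

df.columns.tolist()

If you only want to iterate over the names, df.columns works without tolist() as well, but it returns them as an Index object rather than a list.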

6. How do you get the frequency of each value in a series?

df[col].value_counts()       # returns a Series mapping each value to its frequency
df[col].value_counts()[key]  # frequency of a single key value
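
For instance, with a hypothetical age column, the counts can be looked up as sketched here:

```python
import pandas as pd

# Hypothetical data: the value 18 appears three times.
df = pd.DataFrame({"age": [20, 18, 18, 16, 18]})

counts = df["age"].value_counts()  # Series indexed by the distinct values
freq_18 = counts[18]               # label lookup of a single value

print(freq_18)  # 3
```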

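7. How do you reset the index to an existing column or another list or array?

new_df = df.reset_index(drop=True, inplace=False)

If you pass inplace=True, there is no need to store the result in new_df. Also, when you reset the index to a pandas RangeIndex(), you can choose to keep the old index as a column or discard it with the "drop" parameter.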
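
The effect of the `drop` parameter can be sketched as follows; the data frame and its string index are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 18]}, index=["x", "y"])

dropped = df.reset_index(drop=True)  # old index is discarded
kept = df.reset_index(drop=False)    # old index becomes a column named "index"

print(dropped.index.tolist())  # [0, 1]
print(kept.columns.tolist())   # ['index', 'age']
```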

8. How do you drop columns?

new_df = df.drop(columns=list_of_cols_to_drop)  # returns a new DataFrame; df itself is unchanged

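9. How do you change the index of a DataFrame?

df.set_index(col_name, inplace=True)

This sets the col_name column as the index. You can pass a list of columns to set several of them as the index. The inplace keyword works the same way as described earlier.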
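
As a rough illustration with made-up data, setting one column and then two columns as the index looks like this (the non-in-place form is used so both results can be compared):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob"], "age": [20, 18], "height": [180, 165]}
)

by_name = df.set_index("name")               # single-column index
by_name_age = df.set_index(["name", "age"])  # two columns form a MultiIndex

print(by_name.loc["Bob", "height"])     # 165
print(list(by_name_age.index.names))    # ['name', 'age']
```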

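10. How do you drop rows or columns that contain NaN values?

df.dropna(axis=0, inplace=True)

axis=0 drops every row that has a NaN value in any column. axis=1 drops every column that has a NaN value in any row.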
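
The two axis settings can be compared on a small hypothetical data frame with one missing value: axis=0 removes the affected row, and axis=1 removes the affected column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20, np.nan, 16], "height": [180, 175, 170]})

rows_dropped = df.dropna(axis=0)  # removes the row whose age is NaN
cols_dropped = df.dropna(axis=1)  # removes the age column, which contains NaN

print(rows_dropped.shape)             # (2, 2)
print(cols_dropped.columns.tolist())  # ['height']
```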

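11. How do you slice a DataFrame given a condition?

You always need to specify a mask in the form of a logical condition.
For example, if you have an age column and want to select the rows of the DataFrame where age has a specific value or is in a list, you can slice as follows:

mask = df['age'] == age_value
or
mask = df['age'].isin(list_of_age_values)
result = df[mask]

With multiple conditions: for example, to select the rows where both height and age match specific values.

mask = (df['age'] == age_value) & (df['height'] == height_value)
result = df[mask]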
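
The three kinds of mask described in this section can be tried on a small hypothetical data frame:

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 18, 18, 16], "height": [180, 165, 163, 170]})

equal_mask = df["age"] == 18
isin_mask = df["age"].isin([16, 20])
# Each condition needs its own parentheses because & binds tighter than ==.
both_mask = (df["age"] == 18) & (df["height"] == 165)

print(df[equal_mask]["height"].tolist())  # [165, 163]
print(df[isin_mask]["height"].tolist())   # [180, 170]
print(df[both_mask].index.tolist())       # [1]
```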

12. How do you slice a DataFrame given column names or row index values?

There are four options here: at, iat, loc, and iloc. "iat" and "iloc" are similar in that they provide integer-position-based indexing, while "loc" and "at" provide label-based indexing.

Another thing to note is that "at" and "iat" access a single element, while "loc" and "iloc" can slice multiple elements.

Examples:
a)
df.iat[1,2] returns the element at row position 1 and column position 2 (counting from 0, so the second row and third column). It is important to note that the number 1 does not correspond to the label 1 in the index of the data frame; the index may not contain 1 at all. It works like Python list indexing.
b)
df.at[first,col_name] returns the value in the row whose index value is first and whose column name is col_name.
c)
df.loc[list_of_indices,list_of_cols]
e.g. df.loc[[4,5],['age','height']]
Slices the data frame for the matching index labels and column names.
d)
df.iloc[[0,1],[5,6]] uses integer-based indexing and returns the rows at positions 0 and 1 for the columns at positions 5 and 6.
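
To make the difference between positions and labels concrete, here is a sketch on a hypothetical data frame whose index labels (10, 11, 12) deliberately differ from the row positions (0, 1, 2):

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [20, 18, 16], "height": [180, 165, 170], "weight": [70, 60, 55]},
    index=[10, 11, 12],
)

by_position = df.iat[1, 2]       # second row, third column (positions)
by_label = df.at[11, "weight"]   # row labeled 11, column "weight"
label_slice = df.loc[[10, 12], ["age", "height"]]  # several labels at once
position_slice = df.iloc[[0, 1], [1, 2]]           # several positions at once

print(by_position, by_label)              # 60 60
print(label_slice.index.tolist())         # [10, 12]
print(position_slice.columns.tolist())    # ['height', 'weight']
```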

13. How do you iterate over rows?

Use iterrows() or itertuples().
total_height = 0
for i, row in df.iterrows():
    total_height += row['height']
iterrows() returns an iterator over the rows, each yielded as an (index, Series) pair. Do not rely on modifying a row to change the data frame: depending on the data types, the row may be a copy rather than a view, in which case the change is not reflected in the data frame.
itertuples() returns named tuples:
for row in df.itertuples():
    print(row.age)
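
Both iteration styles are sketched below on made-up data; the vectorized `sum()` call is included only as a cross-check of the loop result.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 18, 16], "height": [180, 165, 170]})

# iterrows(): each row arrives as an (index, Series) pair.
total_height = 0
for i, row in df.iterrows():
    total_height += row["height"]

# itertuples(): each row arrives as a named tuple with attribute access.
ages = [row.age for row in df.itertuples()]

print(total_height)                        # 515
print(total_height == df["height"].sum())  # True
```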

14. How do you sort by columns?

df.sort_values(by = list_of_cols,ascending=True) 

15. How do you apply a function to each element of a series?

df['series_name'].apply(f)
where f is the function you want to apply to each element of the series. If you also want to pass arguments to the custom function, you can modify it like this:
def f(x, **kwargs):
    # do something with x and kwargs
    return value_to_store
df['series_name'].apply(f, a=1, b=2, c=3)
If you want to apply a function to more than one series at a time, then:
def f(row):
    age = row['age']
    height = row['height']
    # compute value_to_store from age and height
    return value_to_store
df[['age','height']].apply(f, axis=1)
If you don't use axis=1, f is applied to each column as a whole rather than to each row. axis=1 passes the age and height of each row to f for any manipulation you want.
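
A concrete version of both patterns might look like the following; the functions `scale` and `height_minus_age` and their parameters are hypothetical helpers written for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 18, 16], "height": [180, 165, 170]})

def scale(x, factor=1, offset=0):
    # Called once per element of the series; extra keywords come from apply().
    return x * factor + offset

scaled = df["height"].apply(scale, factor=2, offset=1)

def height_minus_age(row):
    # With axis=1, row is a Series holding one row's age and height.
    return row["height"] - row["age"]

diff = df[["age", "height"]].apply(height_minus_age, axis=1)

print(scaled.tolist())  # [361, 331, 341]
print(diff.tolist())    # [160, 147, 154]
```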

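16. How do you apply a function or method to all elements of a data frame?

new_df = df.applymap(f)  # in pandas 2.1 and later, applymap is deprecated in favor of df.map(f)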

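17. How do you slice a DataFrame when a series' values are in a list?

Use masking and isin. To select the data samples whose age is in the list:

df[df['age'].isin(age_list)]

To select the opposite, that is, the data samples whose age is not in the list:

df[~df['age'].isin(age_list)]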

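18. How do you group by a column's values and aggregate or apply a function on another column?

df.groupby(['age']).agg({'height':'mean'})

This groups the DataFrame by the "age" series, and the height column holds the mean of each group's values. Sometimes you want to group by one column and collect all the corresponding elements of the other columns into lists. You can achieve this as follows:

df.groupby(['age']).agg(list)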
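
Both aggregations are sketched here on hypothetical age and height data:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [20, 20, 18, 18, 16], "height": [180, 175, 165, 163, 170]}
)

# One row per distinct age; height holds the group mean.
mean_height = df.groupby(["age"]).agg({"height": "mean"})

# One row per distinct age; height holds the list of grouped values.
heights_as_lists = df.groupby(["age"]).agg(list)

print(mean_height.loc[20, "height"])  # 177.5
print(len(heights_as_lists))          # 3
```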

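19. How do you create duplicate rows of the other columns for each element in a list-valued column?

This question may be a bit confusing. What I actually mean is this: suppose you have the following data frame df:

Age  Height (in cm)
20   180
20   175
18   165
18   163
16   170

After using group-by with a list aggregator, you may get something like the following:

Age  Height (in cm)
20   [180, 175]
18   [165, 163]
16   [170]

Now, what if you want to undo the previous operation and get back to the original data frame? You can do this with an operation called explode, newly introduced in Pandas version 0.25.

df['height'].explode() gives the desired outcome for the height column as a series; df.explode('height') returns the whole data frame with one row per list element.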
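
The round trip can be sketched as follows, assuming the columns are named `age` and `height`. The `sort=False` argument and the final `astype` call are choices made for this sketch so that the restored frame lines up with the original; the exploded column otherwise typically has object dtype.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [20, 20, 18, 18, 16], "height": [180, 175, 165, 163, 170]}
)

# Group heights into lists; sort=False keeps groups in order of first appearance.
grouped = df.groupby("age", sort=False).agg(list).reset_index()

# explode() emits one row per list element, repeating the other columns.
restored = grouped.explode("height").reset_index(drop=True)
restored["height"] = restored["height"].astype("int64")

print(restored["age"].tolist())     # [20, 20, 18, 18, 16]
print(restored["height"].tolist())  # [180, 175, 165, 163, 170]
```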

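20. How do you concatenate two DataFrames?

Suppose you have two data frames, df1 and df2, with the given columns name, age, and height, and you want to concatenate them. axis=0 is the vertical axis. Here, the resulting data frame has the rows of both data frames stacked on top of each other:

df1 --> name, age, height
df2 --> name, age, height
result = pd.concat([df1, df2], axis=0)

For horizontal concatenation:

df1 --> name, age
df2 --> height, salary
result = pd.concat([df1, df2], axis=1)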
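
A short sketch of both directions follows, using made-up frames. The `ignore_index=True` argument is an addition for this example so that the stacked result gets a fresh 0..n-1 index.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Ann", "Bob"], "age": [20, 18]})
df2 = pd.DataFrame({"name": ["Cid"], "age": [16]})
df3 = pd.DataFrame({"height": [180, 165], "salary": [1000, 2000]})

# Vertical: rows of df2 are appended below the rows of df1.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# Horizontal: columns of df3 are placed next to df1, aligned on the index.
side_by_side = pd.concat([df1, df3], axis=1)

print(stacked.shape)                  # (3, 2)
print(side_by_side.columns.tolist())  # ['name', 'age', 'height', 'salary']
```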

21. How do you merge two DataFrames?

Continuing the previous example, assume you have an employee database split into two data frames:
df1 --> name, age, height
df2 --> name, salary, pincode, sick_leaves_taken
You may want to combine these two data frames so that each row has all the details of an employee. To achieve this, you perform a merge operation.
df1.merge(df2, on=['name'], how='inner')
This operation returns a data frame where each row consists of name, age, height, salary, pincode, and sick_leaves_taken.
how='inner' means a row is included in the result only if there is a matching name in both data frames. For more, read: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge
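
To see what how='inner' keeps, here is a sketch with hypothetical employee data in which one name appears only in df1 and another only in df2; a left merge is shown alongside for contrast.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [20, 18, 16]})
df2 = pd.DataFrame({"name": ["Ann", "Bob", "Dee"], "salary": [100, 200, 300]})

# inner: only names present in both frames survive.
inner = df1.merge(df2, on=["name"], how="inner")

# left: every row of df1 is kept; a missing salary becomes NaN.
left = df1.merge(df2, on=["name"], how="left")

print(inner["name"].tolist())  # ['Ann', 'Bob']
print(len(left))               # 3
```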

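Summary

For any beginner data analysis project, you will likely need to know these operations well. I have always found Pandas to be a very useful library, and it can now be integrated with a variety of other data analysis tools and languages. Knowing Pandas operations may even help when learning languages that support distributed algorithms.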

This article is published by 《純淨天空》. Article URL: https://vimsky.com/zh-tw/article/4324.html. Do not reproduce without permission.