Selecting subsets in pandas

Source: Internet | Published by: 网络nat类型 | Editor: 程序博客网 | Time: 2024/06/05 04:47

I spent some time learning pandas recently and worked through a few exercises. This post is a summary of what I covered.

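Understanding axis in pandas

Many DataFrame operations take an axis argument. I found a diagram that makes axis easy to understand:
[Figure: diagram of the DataFrame axes; image not included]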
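Since the diagram is not available here, the idea can also be sketched in code. The two-row frame below is a made-up stand-in (an assumption for illustration only), and sum and drop are just two common methods that accept axis. axis=0 runs down the rows, and axis=1 runs across the columns.

```python
import pandas as pd

# Tiny hypothetical frame, used only to illustrate axis
df = pd.DataFrame(
    {"A1": [102, 13], "A2": [111, 1]},
    index=["OTU_1", "OTU_2"],
)

# axis=0: aggregate down the rows, giving one value per column
col_sums = df.sum(axis=0)

# axis=1: aggregate across the columns, giving one value per row
row_sums = df.sum(axis=1)

# For drop, axis says where the label lives: 0 = row labels, 1 = column labels
no_otu2 = df.drop("OTU_2", axis=0)
no_a2 = df.drop("A2", axis=1)

print(col_sums)
print(row_sums)
```

A rough rule of thumb: the axis you pass to an aggregation is the one that gets collapsed.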

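Data preparation

Read the otu_taxon.txt table, keep the first 5 rows and 5 columns, and save the result as otu.txt:

import pandas as pd

# Read the tab-separated table; the first row is the header, the first column is the index
df = pd.read_csv("otu_taxon.txt", header=0, index_col=0, sep="\t")

# Keep the first 5 rows and the first 5 columns
df = df.iloc[0:5, 0:5]

# Write the subset back out as a tab-separated file
df.to_csv("otu.txt", sep="\t")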

The resulting otu.txt:

$ cat otu.txt
OTU     A1      A2      A3      A4      A5
OTU_1   102     111     221     98      70
OTU_2   13      1       39      22      1
OTU_3   8508    8208    8165    8882    7499
OTU_4   2122    1881    2414    2520    1923
OTU_5   7700    7442    11718   6392    7546

Selecting a single column

df["colname"] or df.colname

import pandas as pd

df = pd.read_csv("otu.txt", header=0, index_col=0, sep="\t")

# Bracket notation
A1 = df["A1"]
print("A1 = df['A1']:", A1)

# Attribute notation
A1 = df.A1
print("A1 = df.A1", A1)

Both forms produce the same output:

$ python pand.py
A1 = df['A1']: OTU
OTU_1     102
OTU_2      13
OTU_3    8508
OTU_4    2122
OTU_5    7700
Name: A1, dtype: int64
A1 = df.A1 OTU
OTU_1     102
OTU_2      13
OTU_3    8508
OTU_4    2122
OTU_5    7700
Name: A1, dtype: int64
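One detail worth checking is the type that comes back. The sketch below rebuilds the first three rows of otu.txt inline (an assumption made so that it runs without the file) and compares a single column name with a list of column names.

```python
import pandas as pd

# First three rows of otu.txt, rebuilt inline so the snippet is self-contained
df = pd.DataFrame(
    {"A1": [102, 13, 8508], "A2": [111, 1, 8208], "A3": [221, 39, 8165]},
    index=pd.Index(["OTU_1", "OTU_2", "OTU_3"], name="OTU"),
)

one = df["A1"]          # a single name returns a Series
same = df.A1            # attribute access returns the same Series
sub = df[["A1", "A3"]]  # a list of names returns a DataFrame

print(type(one).__name__)  # Series
print(type(sub).__name__)  # DataFrame
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.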

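Selecting rows with [ ]:
df[rowindex_1:rowindex_2] or df["row_name_1":"row_name_2"]
A positional slice excludes the end position, while a label slice includes the end label.

# Slice by position: rows 0, 1 and 2
rows = df[0:3]
print("rows = df[0:3]")
print(rows)

# Slice by label: OTU_1 through OTU_4, both ends included
rows = df["OTU_1":"OTU_4"]
print("df['OTU_1':'OTU_4']")
print(rows)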

Output:

$ python pand.py
rows = df[0:3]
         A1    A2    A3    A4    A5
OTU
OTU_1   102   111   221    98    70
OTU_2    13     1    39    22     1
OTU_3  8508  8208  8165  8882  7499
df['OTU_1':'OTU_4']
         A1    A2    A3    A4    A5
OTU
OTU_1   102   111   221    98    70
OTU_2    13     1    39    22     1
OTU_3  8508  8208  8165  8882  7499
OTU_4  2122  1881  2414  2520  1923
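The two slices in the output above differ in how they treat the end point. A minimal sketch of that difference follows, with the table rebuilt inline (an assumption, so the snippet does not need otu.txt). It also compares each slice with its explicit .iloc and .loc counterpart.

```python
import pandas as pd

# Same values as otu.txt, rebuilt inline
df = pd.DataFrame(
    {
        "A1": [102, 13, 8508, 2122, 7700],
        "A2": [111, 1, 8208, 1881, 7442],
    },
    index=pd.Index(["OTU_1", "OTU_2", "OTU_3", "OTU_4", "OTU_5"], name="OTU"),
)

by_position = df[0:3]            # end position 3 is excluded
by_label = df["OTU_1":"OTU_4"]   # end label OTU_4 is included

print(len(by_position))  # 3
print(len(by_label))     # 4

# The same selections written with explicit indexers
same_position = df.iloc[0:3]
same_label = df.loc["OTU_1":"OTU_4"]
```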

Selecting subsets by label

df.loc[] selects rows by default; to select columns, pass a second indexer after a comma.
df.loc["rowname","colname"]
df.loc["rowname"]
df.loc[:,"colname"]
df.loc[[row_list],[col_list]]
df.loc[:,[col_list]]

import pandas as pd

df = pd.read_csv("otu.txt", header=0, index_col=0, sep="\t")

# One row, returned as a Series
sub_df = df.loc["OTU_3"]
print("sub_df = df.loc['OTU_3']")
print(sub_df)

# One cell, returned as a scalar
sub_df = df.loc["OTU_3", "A2"]
print("sub_df = df.loc['OTU_3','A2']")
print(sub_df)

# All rows, a list of columns
sub_df = df.loc[:, ["A1", "A4"]]
print("sub_df = df.loc[:,['A1','A4']]")
print(sub_df)

# All rows, a label slice of columns (A4 is included)
sub_df = df.loc[:, "A1":"A4"]
print("sub_df = df.loc[:,'A1':'A4']")
print(sub_df)

Output:

$ python pand.py
sub_df = df.loc['OTU_3']
A1    8508
A2    8208
A3    8165
A4    8882
A5    7499
Name: OTU_3, dtype: int64
sub_df = df.loc['OTU_3','A2']
8208
sub_df = df.loc[:,['A1','A4']]
         A1    A4
OTU
OTU_1   102    98
OTU_2    13    22
OTU_3  8508  8882
OTU_4  2122  2520
OTU_5  7700  6392
sub_df = df.loc[:,'A1':'A4']
         A1    A2     A3    A4
OTU
OTU_1   102   111    221    98
OTU_2    13     1     39    22
OTU_3  8508  8208   8165  8882
OTU_4  2122  1881   2414  2520
OTU_5  7700  7442  11718  6392
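The df.loc[[row_list],[col_list]] form is listed above but not exercised, so here is a small sketch of it. The table is rebuilt inline (an assumption, to avoid depending on otu.txt). The boolean-mask row indexer in the second half is an extra that the original text does not cover.

```python
import pandas as pd

# Same values as otu.txt, rebuilt inline
df = pd.DataFrame(
    {
        "A1": [102, 13, 8508, 2122, 7700],
        "A2": [111, 1, 8208, 1881, 7442],
        "A4": [98, 22, 8882, 2520, 6392],
    },
    index=pd.Index(["OTU_1", "OTU_2", "OTU_3", "OTU_4", "OTU_5"], name="OTU"),
)

# A list of row labels and a list of column labels
picked = df.loc[["OTU_1", "OTU_3"], ["A1", "A4"]]
print(picked)

# A boolean mask as the row indexer, combined with a column list
masked = df.loc[df["A1"] > 15, ["A1", "A2"]]
print(masked)
```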

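Selecting subsets by integer position

df.iloc[row_index,col_index]

import pandas as pd

df = pd.read_csv("otu.txt", header=0, index_col=0, sep="\t")

# The row at position 3, returned as a Series
sub_df = df.iloc[3]
print("df.iloc[3]")
print(sub_df)

# Rows 2-3 and columns 0-1 (end positions are excluded)
sub_df = df.iloc[2:4, 0:2]
print("df.iloc[2:4,0:2]")
print(sub_df)

# Explicit lists of row and column positions
sub_df = df.iloc[[1, 3, 4], [1, 3, 4]]
print("df.iloc[[1,3,4],[1,3,4]]")
print(sub_df)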

Output:

$ python pand.py
df.iloc[3]
A1    2122
A2    1881
A3    2414
A4    2520
A5    1923
Name: OTU_4, dtype: int64
df.iloc[2:4,0:2]
         A1    A2
OTU
OTU_3  8508  8208
OTU_4  2122  1881
df.iloc[[1,3,4],[1,3,4]]
         A2    A4    A5
OTU
OTU_2     1    22     1
OTU_4  1881  2520  1923
OTU_5  7442  6392  7546
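Because .iloc follows the usual Python positional rules, negative positions and slice steps should work as well. This is a small sketch that goes slightly beyond the original examples, with the table rebuilt inline (an assumption, to keep it self-contained).

```python
import pandas as pd

# Same values as otu.txt, rebuilt inline
df = pd.DataFrame(
    {
        "A1": [102, 13, 8508, 2122, 7700],
        "A4": [98, 22, 8882, 2520, 6392],
        "A5": [70, 1, 7499, 1923, 7546],
    },
    index=pd.Index(["OTU_1", "OTU_2", "OTU_3", "OTU_4", "OTU_5"], name="OTU"),
)

last_row = df.iloc[-1]          # last row, as a Series
last_two_cols = df.iloc[:, -2:] # all rows, last two columns
every_other = df.iloc[::2]      # rows at positions 0, 2, 4

print(last_row.name)
print(list(last_two_cols.columns))
print(list(every_other.index))
```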

Use df.iat[row,col] to get the value at a given position

# Row position 3 (OTU_4), column position 4 (A5)
value = df.iat[3, 4]
print("value")
print(value)

Result:

$ python pand.py
value
1923
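pandas also provides .at, the label-based counterpart of .iat. The original text does not mention it, so treat this as a side note. The sketch below rebuilds the table inline (an assumption) and reads the same cell both ways.

```python
import pandas as pd

# Same values as otu.txt, rebuilt inline
df = pd.DataFrame(
    {
        "A1": [102, 13, 8508, 2122, 7700],
        "A2": [111, 1, 8208, 1881, 7442],
        "A3": [221, 39, 8165, 2414, 11718],
        "A4": [98, 22, 8882, 2520, 6392],
        "A5": [70, 1, 7499, 1923, 7546],
    },
    index=pd.Index(["OTU_1", "OTU_2", "OTU_3", "OTU_4", "OTU_5"], name="OTU"),
)

by_position = df.iat[3, 4]       # row position 3, column position 4
by_label = df.at["OTU_4", "A5"]  # the same cell, addressed by labels

print(by_position, by_label)
```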

Filtering a DataFrame by condition

Select the rows where column A1 is greater than 15:

sub_df = df[df.A1 > 15]
print(sub_df)

Output:

$ python pand.py
         A1    A2     A3    A4    A5
OTU
OTU_1   102   111    221    98    70
OTU_3  8508  8208   8165  8882  7499
OTU_4  2122  1881   2414  2520  1923
OTU_5  7700  7442  11718  6392  7546
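Row conditions like this can be combined. The sketch below uses the table rebuilt inline (an assumption), and the thresholds are arbitrary values picked for illustration. Each condition needs its own parentheses, and the element-wise operators & and | are used instead of Python's and / or.

```python
import pandas as pd

# Same values as otu.txt, rebuilt inline
df = pd.DataFrame(
    {
        "A1": [102, 13, 8508, 2122, 7700],
        "A5": [70, 1, 7499, 1923, 7546],
    },
    index=pd.Index(["OTU_1", "OTU_2", "OTU_3", "OTU_4", "OTU_5"], name="OTU"),
)

# Both conditions must hold
both = df[(df["A1"] > 15) & (df["A5"] < 7000)]

# At least one condition must hold
either = df[(df["A1"] > 8000) | (df["A5"] < 10)]

print(both)
print(either)
```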

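Condition: values in df greater than 100.
With this kind of filter, every value in df that is not greater than 100 is set to NaN, and the integer columns become floats as a result.

sub_df = df[df > 100]
print(sub_df)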

Output:

$ python pand.py
           A1      A2       A3      A4      A5
OTU
OTU_1   102.0   111.0    221.0     NaN     NaN
OTU_2     NaN     NaN      NaN     NaN     NaN
OTU_3  8508.0  8208.0   8165.0  8882.0  7499.0
OTU_4  2122.0  1881.0   2414.0  2520.0  1923.0
OTU_5  7700.0  7442.0  11718.0  6392.0  7546.0
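If the goal is to keep or drop whole rows rather than to blank out individual cells, one possible approach (not from the original text) is to reduce the boolean frame along axis=1 first. The sketch below rebuilds the table inline, which is an assumption made so that it runs on its own.

```python
import pandas as pd

# Same values as otu.txt, rebuilt inline
df = pd.DataFrame(
    {
        "A1": [102, 13, 8508, 2122, 7700],
        "A2": [111, 1, 8208, 1881, 7442],
        "A3": [221, 39, 8165, 2414, 11718],
        "A4": [98, 22, 8882, 2520, 6392],
        "A5": [70, 1, 7499, 1923, 7546],
    },
    index=pd.Index(["OTU_1", "OTU_2", "OTU_3", "OTU_4", "OTU_5"], name="OTU"),
)

mask = df > 100  # boolean DataFrame with the same shape as df

# Rows where every value is above 100 (no NaN introduced, dtypes unchanged)
all_above = df[mask.all(axis=1)]

# Rows where at least one value is above 100
any_above = df[mask.any(axis=1)]

print(all_above)
print(any_above)
```

Because no cell is replaced with NaN here, the columns keep their integer dtype.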

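df.isin()

I find this function hard to describe in words. In short, isin() checks, element by element, whether each value appears in a given list. The example shows how it is used.

# Add a string column E, then keep the rows whose E value is in ls
df["E"] = ["one", "one", "two", "three", "four"]
ls = ["one", "two", "three"]
sub_df = df[df["E"].isin(ls)]
print(sub_df)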

Output:

$ python pand.py
         A1    A2    A3    A4    A5      E
OTU
OTU_1   102   111   221    98    70    one
OTU_2    13     1    39    22     1    one
OTU_3  8508  8208  8165  8882  7499    two
OTU_4  2122  1881  2414  2520  1923  three
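It may help to look at the boolean mask that isin() returns, and at its negation with ~, which keeps the rows whose value is not in the list. This is a minimal sketch with a reduced inline table (an assumption, to keep it self-contained).

```python
import pandas as pd

# Column A1 of otu.txt plus the E column from the example, rebuilt inline
df = pd.DataFrame(
    {
        "A1": [102, 13, 8508, 2122, 7700],
        "E": ["one", "one", "two", "three", "four"],
    },
    index=pd.Index(["OTU_1", "OTU_2", "OTU_3", "OTU_4", "OTU_5"], name="OTU"),
)

ls = ["one", "two", "three"]

mask = df["E"].isin(ls)  # boolean Series: True where E is in ls
kept = df[mask]          # rows whose E value is in ls
dropped = df[~mask]      # rows whose E value is NOT in ls

print(mask.tolist())
print(dropped)
```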
