Python Large-Scale Data Processing Tips, Part 2: Common Operations in Machine Learning

Source: Internet  Editor: 程序博客网  Time: 2024/04/27 19:54

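1. Data Preprocessing


Randomization

Most randomization steps in machine learning can be built on the random package: generate a set of non-repeating random integers and use them as indices to pick the corresponding rows of the dataset. The short snippet below covers nearly all of the randomized preprocessing operations that follow.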

    import random
    samp_ids = sorted(random.sample(range(nItem), nSample))  # nItem: dataset size; nSample: number of samples to draw

Random sampling of data:

    import random
    nItem = len(df)
    nSample = 1000
    samp_ids = sorted(random.sample(range(nItem), nSample))  # nSample: number of samples to draw
    samp_idList = df.id.isin(samp_ids)  # boolean mask; assumes the id column holds the values 0..nItem-1
    df_sample = df[samp_idList]
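
The snippet above relies on the id column matching the row positions. As an alternative that avoids this assumption, pandas provides DataFrame.sample. The following is a minimal sketch only: the toy frame df_demo and the name n_sample are illustrative and do not come from the original article.

```python
import pandas as pd

# Hypothetical toy data; the article does not show the real df.
df_demo = pd.DataFrame({"id": range(100, 110), "value": range(10)})

n_sample = 4
# DataFrame.sample draws rows without replacement by default;
# random_state makes the draw reproducible.
df_sample = df_demo.sample(n=n_sample, random_state=0)

# Put the sampled rows back into their original order,
# which plays the role of the sorted() call in the snippet above.
df_sample = df_sample.sort_index()
print(df_sample)
```

Because rows are selected by position rather than by matching id values, this works even when the ids are not consecutive integers starting at zero.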

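Splitting the dataset into training and test sets:

    import random
    rowId = df.row_id.tolist()  # assumed: rowId holds the row_id value of every row
    nRatio = 2
    nTest = int(nSample / nRatio)
    nTrain = nSample - nTest
    samp_ix = [rowId[i] for i in sorted(random.sample(range(nSample), nTest))]  # randomly pick the row ids to hold out
    list_testSamp = df.row_id.isin(samp_ix)  # boolean mask of the test rows
    list_trainSamp = ~list_testSamp  # the complement gives the training rows
    samp_test = df[list_testSamp]
    samp_train = df[list_trainSamp]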
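
The same split can be sketched more compactly with DataFrame.sample and DataFrame.drop. This is an illustrative variant, not the author's code: df_demo and test_fraction are made-up names, and test_fraction = 0.5 is chosen to correspond to nRatio = 2 above.

```python
import pandas as pd

# Hypothetical toy data with a row_id column, as in the snippet above.
df_demo = pd.DataFrame({"row_id": range(10), "x": [i * 0.5 for i in range(10)]})

test_fraction = 0.5  # corresponds to nRatio = 2
# Draw the test rows at random, without replacement.
samp_test = df_demo.sample(frac=test_fraction, random_state=42)
# Everything that was not drawn becomes the training set.
samp_train = df_demo.drop(samp_test.index)

print(len(samp_train), len(samp_test))
```

Dropping by index keeps the two sets disjoint by construction, provided the index of the frame is unique.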

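Shuffling the order of samples in a dataset:

  • Recommended: produce the shuffled indices in one line:
    random.sample(range(nSample), nSample)
    then traverse the dataset in the order of these random indices. The indices must stay unsorted; sorting them would simply restore the original order 0..nSample-1.
  • Use scikit-learn's built-in support: many of its estimators and utilities (for example train_test_split and sklearn.utils.shuffle) accept a random_state parameter that seeds their random number generator.
  • If neither of the above is used, the following approach works: give each sample two values a and b, where a is a random value and b is the sample's index.
    • Sort the (a, b) pairs by the random value a (ascending or descending, either works).
    • Use the sorted b values to index into the original sample set, processing the samples in that order.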

    import numpy as np
    import pandas as pd

    def randomlizeSample(X_row, y_row):
        nSample, nFeat = np.shape(X_row)
        inx1 = pd.DataFrame(np.random.randn(nSample), columns=['randVal'])  # random sort key a
        inx2 = pd.DataFrame(range(nSample), columns=['inxVal'])  # original index b
        inx = pd.concat([inx1, inx2], axis=1)
        inx = inx.sort_values(by='randVal', ascending=False)  # sort the pairs by the random key
        cnt = 0
        X = np.zeros((nSample, nFeat))
        y = np.zeros(nSample)  # preallocate; do not initialise X and y as []
        for line in inx['inxVal']:
            X[cnt] = X_row[line]
            y[cnt] = y_row[line]
            cnt += 1
        return X, y
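
The sort-by-random-key idea above is equivalent to applying one random permutation to both arrays. A minimal numpy sketch of that idea follows; the function name shuffle_together and the demo arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def shuffle_together(X, y, seed=None):
    """Return X and y reordered by the same random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))  # shuffled copy of 0..len(X)-1
    return X[perm], y[perm]         # fancy indexing keeps rows and labels paired

# Toy data: the label of each row is derived from the row itself,
# so the pairing can be checked after shuffling.
X_demo = np.arange(12).reshape(6, 2)
y_demo = X_demo[:, 0] // 2
X_shuf, y_shuf = shuffle_together(X_demo, y_demo, seed=0)
print(X_shuf)
print(y_shuf)
```

Indexing with a permutation array avoids the explicit Python loop and the intermediate DataFrame, which matters once the number of samples grows.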

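Sampling for imbalanced classification:

  • Keep all samples of the rare class and randomly draw majority-class samples according to a chosen ratio.

A relatively efficient implementation: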

    import random
    import pandas as pd

    def sampleBalance(df, lableColumn, th):
        label_counts = df[lableColumn].value_counts()  # current label counts
        mask = (label_counts[df[lableColumn].values] >= th).values
        df = df.loc[mask]  # drop samples whose label occurs fewer than th times
        label_counts = df[lableColumn].value_counts()  # recount after filtering
        labels = label_counts.sort_values(ascending=False).index
        nSampPerLabel = label_counts[labels[-1]]  # size of the rarest remaining label
        balanced = []
        for label in labels:
            df_label = df[df[lableColumn] == label].reset_index(drop=True)  # renumber rows 0..n-1
            nItem = len(df_label)
            samp_index = sorted(random.sample(range(nItem), nSampPerLabel))
            balanced.append(df_label.iloc[samp_index])  # select the sampled rows by position
        balancedSamples = pd.concat(balanced, axis=0)  # stack the per-label samples row-wise
        return balancedSamples, list(labels)
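
For comparison, the downsampling step can also be sketched with a grouped sample, which pandas supports from version 1.1 through DataFrameGroupBy.sample. This is an assumption-laden illustration rather than the author's method: the helper name downsample_to_rarest and the toy data are hypothetical, and the threshold filter th from the function above is left out.

```python
import pandas as pd

def downsample_to_rarest(df, label_col, seed=None):
    """Draw the same number of rows from every label, namely the rarest label's count."""
    n_min = df[label_col].value_counts().min()
    # groupby(...).sample draws n rows from each group without replacement.
    return df.groupby(label_col).sample(n=n_min, random_state=seed)

# Toy data: label "a" has 6 rows, "b" has 3, "c" has 2.
df_demo = pd.DataFrame({"label": ["a"] * 6 + ["b"] * 3 + ["c"] * 2,
                        "x": range(11)})
balanced = downsample_to_rarest(df_demo, "label", seed=0)
print(balanced["label"].value_counts())
```

The returned frame keeps the original index labels, so the sampled rows can still be traced back to the source data.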

A clumsy earlier version (not recommended): it makes no use of Python's features and is written purely in a C/C++ mindset, so it runs inefficiently.

    def sampleBalance(X_row, y_row):
        rate_np = 1  # ratio of negative samples to positive samples
        nSample, nFeat = np.shape(X_row)
        nSample_pos = int(np.sum(y_row == 1))
        nSample_neg = int(np.sum(y_row == 0))
        nSample_negOnRate = int(np.floor(nSample_pos * rate_np))  # array sizes must be integers
        print(nSample, nSample_pos, nSample_neg)
        X = np.zeros((nSample_pos + nSample_negOnRate, nFeat))
        y = np.zeros((nSample_pos + nSample_negOnRate))
        # get pos sample
        id_pos = 0
        id_neg = 0
        X_neg = np.zeros((nSample_neg, nFeat))
        for i in range(nSample):
            if y_row[i] == 1:
                X[id_pos] = X_row[i]
                y[id_pos] = 1
                id_pos += 1
            else:
                X_neg[id_neg] = X_row[i]
                id_neg += 1
        # pick negative samples in random order
        inx1 = pd.DataFrame(np.random.randn(nSample_neg), columns=['randVal'])
        inx2 = pd.DataFrame(range(nSample_neg), columns=['inxVal'])
        inx = pd.concat([inx1, inx2], axis=1)
        inx = inx.sort_values(by='randVal', ascending=False)
        cnt = 0
        for line in inx['inxVal']:
            if cnt >= nSample_negOnRate:
                break
            X[nSample_pos + cnt] = X_neg[line]
            y[nSample_pos + cnt] = 0
            cnt += 1
        X_rand, y_rand = randomlizeSample(X, y)
        return X_rand, y_rand

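2. Dataset Conversion


Format conversion between packages

  • A brief note on what the built-in, numpy, and pandas data structures are good for (storage layouts are not covered here):
    • Built-in structures such as list, dict, set, and tuple are the most general-purpose and are convenient to use.
    • numpy's data structures closely resemble MATLAB's and suit matrix operations and other arithmetic. They are also the data structures that the machine learning package scikit-learn supports.
    • A DataFrame behaves somewhat like a database table and suits large-scale data processing and analysis.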

Converting between numpy and list:

  • list to numpy:
    import numpy as np
    data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
    arr = np.array(data)
  • numpy to list:
    ## Using the numpy method:
    data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
    arr = np.array(data)
    data = arr.tolist()
    ## Brute-force method:
    data = [[elem for elem in line] for line in arr]
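
One detail worth seeing in a round trip is what happens to the element types. The sketch below reuses the data list from above; the names back and brute are illustrative.

```python
import numpy as np

data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
arr = np.array(data)

# Mixing ints and floats makes numpy choose one common dtype.
print(arr.dtype)          # float64

# tolist() converts every element back to a plain Python float.
back = arr.tolist()
print(back[0])            # [6.0, 7.5, 8.0, 0.0, 1.0]
print(type(back[0][0]))   # <class 'float'>

# The nested comprehension keeps numpy scalar objects instead.
brute = [[elem for elem in line] for line in arr]
print(isinstance(brute[0][0], np.floating))  # True
```

So the two methods produce lists that compare equal, but only tolist() yields native Python numbers, which can matter for JSON serialization and similar tasks.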

Converting between DataFrame and numpy:

  • DataFrame to numpy:
    X_train = df.values.astype(int)  # convert df to a numpy ndarray with dtype int
  • numpy to DataFrame:
    columns = ['c0', 'c1', 'c2', 'c3', 'c4']
    df = pd.DataFrame(X_train, columns=columns)
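
Casting with astype(int) silently discards the fractional part, which is easy to overlook. A small sketch with made-up data (df_demo and the other names are hypothetical) shows the effect; it uses DataFrame.to_numpy, which returns the same array as .values here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer column and one float column.
df_demo = pd.DataFrame({"c0": [6, 6], "c1": [7.5, 7.9]})

as_float = df_demo.to_numpy()   # mixed columns share a common dtype: float64
as_int = as_float.astype(int)   # truncates toward zero: 7.5 -> 7, 7.9 -> 7

# Going back to a DataFrame needs the column names supplied again.
df_back = pd.DataFrame(as_int, columns=df_demo.columns)
print(as_float.dtype)
print(df_back)
```

If the fractional values matter, it is safer to keep the float array, or to round explicitly before casting.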

Converting between DataFrame, Series, and list:

  • list to DataFrame and Series:
    data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
    columns = ['c0', 'c1', 'c2', 'c3', 'c4']
    df = pd.DataFrame(data, columns=columns)
    data = [6, 7.5, 8, 0, 1]
    ser = pd.Series(data)
  • DataFrame and Series to list:
    ## DataFrame to list
    df['c0'].values.tolist()  # convert one column to a list
    df.values.tolist()  # convert the whole DataFrame to a list of rows
    ## Series to list
    ser.values.tolist()  # convert the Series values to a list
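
Putting these conversions together, the following sketch runs the same data through each step and prints the shapes that come out. The variable names col_list, row_lists, and ser_list are illustrative only.

```python
import pandas as pd

data = [[6, 7.5, 8, 0, 1], [6, 7.5, 8, 0, 1]]
columns = ['c0', 'c1', 'c2', 'c3', 'c4']
df_demo = pd.DataFrame(data, columns=columns)
ser = pd.Series([6, 7.5, 8, 0, 1])

col_list = df_demo['c0'].tolist()     # one column -> flat list
row_lists = df_demo.values.tolist()   # whole frame -> list of rows
ser_list = ser.tolist()               # Series -> flat list

print(col_list)    # [6, 6]
print(row_lists)   # [[6.0, 7.5, 8.0, 0.0, 1.0], [6.0, 7.5, 8.0, 0.0, 1.0]]
print(ser_list)    # [6.0, 7.5, 8.0, 0.0, 1.0]
```

The integer column keeps its ints when converted alone, while converting the whole frame goes through a single float array and therefore turns every value into a float.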

<To be continued>
