Linear Classifiers: Parameter Learning


These notes are adapted from Coursera; feel free to share and repost.

1 How to determine the coefficients

We can use the likelihood $\ell(w)$ to measure how well a given set of coefficients classifies the data.

2 Data likelihood

MLE (maximum likelihood estimation)

Note: when computing each probability, check whether the true label $y_i$ is $+1$ or $-1$, and use the corresponding probability accordingly.

With that in place, the quality of a model is scored by the data likelihood (illustrated here with $N = 4$ data points):

$$\ell(w) = P(y_1 \mid x_1, w)\cdot P(y_2 \mid x_2, w)\cdot P(y_3 \mid x_3, w)\cdot P(y_4 \mid x_4, w) = \prod_{i=1}^{N} P(y_i \mid x_i, w).$$

We want to find the coefficients $w$ that make $\ell(w)$ as large as possible.
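As a minimal numeric sketch of this quantity (the scores and labels below are hypothetical toy values, assuming the sigmoid link function introduced in the previous post):

```python
import numpy as np

def data_likelihood(scores, y):
    # scores[i] is a hypothetical w^T h(x_i); y[i] is the true label in {+1, -1}
    p_plus = 1.0 / (1.0 + np.exp(-scores))              # P(y = +1 | x_i, w)
    per_example = np.where(y == +1, p_plus, 1.0 - p_plus)
    return np.prod(per_example)                         # l(w): product over all points

scores = np.array([2.0, -1.0, 0.5, 3.0])
y = np.array([+1, -1, +1, +1])
print(data_likelihood(scores, y))                       # the larger, the better the fit
```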

3 Gradient ascent

Recall the gradient descent method from earlier lessons; by direct analogy, the general form of gradient ascent is:

$$w^{(t+1)} \leftarrow w^{(t)} + \eta \left.\frac{\partial \ell}{\partial w}\right|_{w^{(t)}}$$
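A minimal generic sketch of this update rule, where `grad` is a hypothetical caller-supplied function returning $\partial\ell/\partial w$ at $w$:

```python
import numpy as np

def gradient_ascent(grad, w0, eta=0.1, eps=1e-6, max_iter=1000):
    w = np.array(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < eps:   # stop once the gradient is near zero
            break
        w = w + eta * g               # ascend: step *along* the gradient
    return w

# Toy concave objective l(w) = -(w - 3)^2, whose gradient is -2(w - 3)
print(gradient_ascent(lambda w: -2 * (w - 3), w0=[0.0]))  # converges to ~[3.]
```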

4 Derivative of the log-likelihood

  
$$\frac{\partial \ell\ell(w)}{\partial w_j} = \sum_{i=1}^{N} h_j(x_i)\,\big(\mathbf{1}[y_i = +1] - P(y = +1 \mid x_i, w)\big)$$

where:

$$\mathbf{1}[y_i = +1] = \begin{cases} 1 & \text{if } y_i = +1 \\ 0 & \text{if } y_i = -1 \end{cases}$$

The full algorithm then proceeds as follows:

```
init w = 0, t = 0
while ||∇ℓℓ(w^(t))|| > ε:
    for j = 0, 1, 2, ..., D:
        partial[j] = Σ_{i=1}^N h_j(x_i) · (1[y_i = +1] − P(y = +1 | x_i, w^(t)))
    w^(t+1) ← w^(t) + η · partial
    t ← t + 1
```
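The inner `for` loop above vectorizes naturally. A sketch, assuming a feature matrix `H` with `H[i, j] = h_j(x_i)` and labels `y` in {+1, -1} (the names are illustrative):

```python
import numpy as np

def log_likelihood_gradient(H, y, w):
    scores = H.dot(w)
    p_plus = 1.0 / (1.0 + np.exp(-scores))   # P(y = +1 | x_i, w)
    indicator = (y == +1).astype(float)      # 1[y_i = +1]
    errors = indicator - p_plus
    return H.T.dot(errors)                   # partial[j] = sum_i h_j(x_i) * errors[i]
```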

4* How the derivative formula is obtained (optional)

First, recall two basic algebra facts:

$$\ln(a \cdot b) = \ln a + \ln b, \qquad \ln\frac{a}{b} = \ln a - \ln b$$
We already have:

$$\ell(w) = P(y_1 \mid x_1, w)\cdot P(y_2 \mid x_2, w)\cdots P(y_N \mid x_N, w) = \prod_{i=1}^{N} P(y_i \mid x_i, w).$$

Now, to simplify the computations that follow, define:

$$\ell\ell(w) = \ln \ell(w) = \ln \prod_{i=1}^{N} P(y_i \mid x_i, w)$$

Here we rely on the fact that taking the logarithm of a function does not change where it attains its maximum, so it does not change the optimal coefficients we are solving for.
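A tiny numeric illustration of this property (the values are arbitrary positive numbers):

```python
import numpy as np

vals = np.array([0.2, 0.9, 0.5])
# The log transform is monotonic, so the position of the maximum is unchanged
print(np.argmax(vals), np.argmax(np.log(vals)))  # both are 1
```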

First, applying the algebra facts above:

$$\ell\ell(w) = \ln\prod_{i=1}^{N} P(y_i \mid x_i, w) = \sum_{i=1}^{N} \ln P(y_i \mid x_i, w) = \sum_{i=1}^{N} \Big\{ \mathbf{1}[y_i = +1]\,\ln P(y_i = +1 \mid x_i, w) + \mathbf{1}[y_i = -1]\,\ln P(y_i = -1 \mid x_i, w) \Big\}$$

From the previous blog post we have:

$$P(y = +1 \mid x, w) = \frac{1}{1 + e^{-w^T h(x)}}$$

and it follows that:

$$P(y = -1 \mid x, w) = 1 - P(y = +1 \mid x, w) = \frac{e^{-w^T h(x)}}{1 + e^{-w^T h(x)}}$$

Substituting these in, the contribution of a single data point $x_i$ to $\ell\ell(w)$ simplifies to:

$$\begin{aligned}
\ell\ell_i(w) &= \mathbf{1}[y_i = +1]\,\ln P(y_i = +1 \mid x_i, w) + \mathbf{1}[y_i = -1]\,\ln P(y_i = -1 \mid x_i, w) \\
&= \mathbf{1}[y_i = +1]\cdot\ln\frac{1}{1 + e^{-w^T h(x_i)}} + \big(1 - \mathbf{1}[y_i = +1]\big)\cdot\ln\frac{e^{-w^T h(x_i)}}{1 + e^{-w^T h(x_i)}} \\
&= \big(\mathbf{1}[y_i = +1] - 1\big)\,w^T h(x_i) - \ln\big(1 + e^{-w^T h(x_i)}\big)
\end{aligned}$$

Then:

$$\begin{aligned}
\frac{\partial \ell\ell_i}{\partial w_j} &= \big(\mathbf{1}[y_i = +1] - 1\big)\,\frac{\partial\, w^T h(x_i)}{\partial w_j} - \frac{\partial}{\partial w_j}\ln\big(1 + e^{-w^T h(x_i)}\big) \\
&= \big(\mathbf{1}[y_i = +1] - 1\big)\,h_j(x_i) + h_j(x_i)\,P(y = -1 \mid x_i, w) \\
&= h_j(x_i)\,\big(\mathbf{1}[y_i = +1] - P(y = +1 \mid x_i, w)\big)
\end{aligned}$$

where the last step uses $P(y = -1 \mid x_i, w) = 1 - P(y = +1 \mid x_i, w)$. Summing the contributions of all data points gives:

$$\frac{\partial \ell\ell(w)}{\partial w_j} = \sum_{i=1}^{N} h_j(x_i)\,\big(\mathbf{1}[y_i = +1] - P(y = +1 \mid x_i, w)\big)$$
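To sanity-check the derivation, here is a finite-difference comparison on random data (purely illustrative; `H`, `y`, and `w` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 3))       # H[i, j] = h_j(x_i)
y = rng.choice([-1, 1], size=5)
w = rng.normal(size=3)

def ll(w):
    # Log-likelihood in the (1[y=+1] - 1) * score - ln(1 + e^{-score}) form derived above
    scores = H.dot(w)
    indicator = (y == +1).astype(float)
    return np.sum((indicator - 1) * scores - np.log(1 + np.exp(-scores)))

def analytic_grad(w):
    scores = H.dot(w)
    errors = (y == +1).astype(float) - 1.0 / (1.0 + np.exp(-scores))
    return H.T.dot(errors)

eps = 1e-6
numeric = np.array([(ll(w + eps * np.eye(3)[j]) - ll(w - eps * np.eye(3)[j])) / (2 * eps)
                    for j in range(3)])
print(np.allclose(numeric, analytic_grad(w)))  # True: the formula checks out
```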

5 How to choose the step size $\eta$

If $\eta$ is too small, convergence is slow; if $\eta$ is too large, the algorithm oscillates and fails to converge.
A practical approach is to first find an $\eta$ that is clearly too large and one that is clearly too small, then search between them. This takes repeated trials, plus some experience and skill.
When possible, start with a relatively large $\eta$ and gradually decrease it as the number of iterations grows, as sketched below.
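One common decay schedule (the constants here are illustrative, not tuned values):

```python
eta0, decay, max_iter = 1.0, 0.01, 500
for t in range(max_iter):
    eta = eta0 / (1.0 + decay * t)   # eta shrinks from eta0 toward 0 as t grows
    # w = w + eta * gradient(w)      # hypothetical update using the current eta
```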

6 Code implementation

The code and data files can be downloaded here. The implementation below is written in Python 2 and uses GraphLab Create.

```python
import graphlab
products = graphlab.SFrame('amazon_baby_subset.gl/')

import json
with open('important_words.json', 'r') as f:  # reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

# Remove punctuation
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)

# Add one column per important word, holding its count in each review
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s: s.split().count(word))

# Question 1: count the reviews containing the word "perfect"
products['contains_perfect'] = products['perfect'] >= 1
print sum(products['contains_perfect'])

import numpy as np

# Convert the SFrame into NumPy arrays
def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return (feature_matrix, label_array)

# Warning: This may take a few minutes...
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')

import math

def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate for P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    # Take the dot product of feature_matrix and coefficients
    score = np.dot(feature_matrix, coefficients)
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1 / (1 + math.e ** (-score))
    return predictions

def feature_derivative(errors, feature):
    # Compute the dot product of errors and feature
    derivative = np.dot(errors, feature)
    return derivative

def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == +1)
    scores = np.dot(feature_matrix, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    # Simple check to prevent overflow
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    lp = np.sum((indicator - 1) * scores - logexp)
    return lp

def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients)  # make sure it's a numpy array
    for itr in xrange(max_iter):
        # Predict P(y_i = +1 | x_i, w) using predict_probability()
        predictions = predict_probability(feature_matrix, coefficients)
        # Compute the indicator value for (y_i = +1)
        indicator = (sentiment == +1)
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)):  # loop over each coefficient
            # feature_matrix[:, j] is the feature column associated with coefficients[j]
            derivative = feature_derivative(errors, feature_matrix[:, j])
            # Add the step size times the derivative to the current coefficient
            coefficients[j] += step_size * derivative
        # Check whether the log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print 'iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients
```
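A typical call of the trainer above might look like this (the zero initialization, step size, and iteration count are illustrative choices, not prescribed values):

```python
initial_coefficients = np.zeros(feature_matrix.shape[1])
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients,
                                   step_size=1e-7, max_iter=301)
```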