Online Learning: the Stochastic Gradient Method


This material is compiled from Coursera; discussion and reposting are welcome.

1 The problem with large datasets

  As datasets grow larger, the original gradient descent (or ascent) algorithm becomes very slow: every update of the coefficients ŵ requires a pass over all of the data. The rest of this post is about how to fix that.

2 Updating with one data point at a time

  Recall the original gradient-ascent update, whose partial derivative of the log likelihood sums over all data points:

$$\frac{\partial \ell(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\right) = \sum_{i=1}^{N} \frac{\partial \ell_i(\mathbf{w})}{\partial w_j}$$

  Stochastic gradient approximates this sum with a single term $\frac{\partial \ell_i(\mathbf{w})}{\partial w_j}$ at each update, which drastically reduces the amount of computation.
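To make this concrete, here is a minimal NumPy sketch (not from the original post) of a single-example gradient-ascent step for logistic regression; the names `x_i`, `y_i`, `w`, and `eta` are illustrative:

```python
import numpy as np

def sgd_update_one_point(w, x_i, y_i, eta):
    """One stochastic gradient-ascent step on the log likelihood,
    using only the single example (x_i, y_i) with y_i in {-1, +1}."""
    p = 1. / (1. + np.exp(-np.dot(x_i, w)))   # P(y = +1 | x_i, w)
    indicator = 1. if y_i == +1 else 0.       # 1[y_i = +1]
    gradient_i = (indicator - p) * x_i        # d l_i(w) / d w
    return w + eta * gradient_i
```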

3 Comparing gradient & stochastic gradient

| Algorithm | Time per iteration | Time to convergence (theory) | Time to convergence (practice) | Sensitivity to parameters |
| --- | --- | --- | --- | --- |
| Gradient | Slow for large data | Slower | Often slower | Moderate |
| Stochastic gradient | Always fast | Faster | Often faster | Very high |

A figure from the course illustrates this comparison (figure not preserved).

4 How does stochastic gradient work?

  An analogy with force decomposition from high-school mechanics helps: at each step we effectively take one component of the full gradient. Because the full gradient points toward the optimum, each individual component usually still moves us in a good direction, so we make progress along a zig-zag path.
  In the end the iterates oscillate around the optimum, so we take the average of the last stretch of iterates as the final coefficient vector:

$$\hat{\mathbf{w}} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}^{(t)}$$
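A minimal sketch of this averaging step, assuming the iterates have been collected in a list `coefficients_history` and we average the last `T` of them:

```python
import numpy as np

def average_last_iterates(coefficients_history, T):
    """Average the last T coefficient vectors to smooth out the
    oscillation around the optimum."""
    tail = np.array(coefficients_history[-T:])
    return tail.mean(axis=0)
```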

5 Choosing the step size η

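The original figure for this section is not preserved. A common rule of thumb: a step size that is too small makes convergence very slow, while one that is too large makes the log likelihood oscillate or diverge; a widely used remedy is to decrease η over time, for example η_t = η_0 / t. A minimal sketch of such a schedule (the names and default values are illustrative, not from the original post):

```python
def step_size_schedule(eta0, t):
    """Decreasing step size: large steps early on, smaller steps
    as the iterates start oscillating around the optimum."""
    return eta0 / t   # e.g. eta0 = 0.1 and t = 1, 2, 3, ...
```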

6 Improving the algorithm: one mini-batch at a time

```
shuffle the data
init w^(1) = 0, t = 1
until converged:
    for k = 0, 1, 2, ..., N/B - 1:
        for j = 0, ..., D:
            w_j^(t+1) <- w_j^(t) + eta * sum_{i = k*B+1}^{(k+1)*B} d l_i(w) / d w_j
        t = t + 1
```
With a batch size of one data point:
  Online learning lets us update the model at regular intervals as new data arrives!
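As an illustration (a sketch, not from the original post), an online learner can keep the current coefficients and apply one stochastic gradient-ascent step whenever a new batch of examples arrives; `online_update`, `new_X`, and `new_y` are hypothetical names:

```python
import numpy as np

def online_update(w, new_X, new_y, eta):
    """One mini-batch gradient-ascent step using only the newly arrived
    examples (new_X: rows of features, new_y: labels in {-1, +1})."""
    scores = np.dot(new_X, w)
    p = 1. / (1. + np.exp(-scores))            # P(y = +1 | x, w)
    indicator = (new_y == +1).astype(float)    # 1[y = +1]
    gradient = np.dot(new_X.T, indicator - p) / len(new_y)
    return w + eta * gradient
```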

7 Code implementation

Click here to download the data files and code.

```python
from __future__ import division
import graphlab
import json
import numpy as np

products = graphlab.SFrame('amazon_baby_subset.gl/')

with open('important_words.json', 'r') as f:
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

# Remove punctuation
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)

# Split out the words into individual columns
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s: s.split().count(word))

train_data, validation_data = products.random_split(.9, seed=1)

def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return (feature_matrix, label_array)

feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment')

def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate of P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    # Take dot product of feature_matrix and coefficients
    score = np.dot(feature_matrix, coefficients)
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1. / (1. + np.exp(-score))
    return predictions

def feature_derivative(errors, feature):
    # Compute the dot product of errors and feature
    derivative = np.dot(errors, feature)
    return derivative

def compute_avg_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == +1)
    scores = np.dot(feature_matrix, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    # Simple check to prevent overflow
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    lp = np.sum((indicator - 1) * scores - logexp) / len(feature_matrix)
    return lp

def logistic_regression_SG(feature_matrix, sentiment, initial_coefficients, step_size, batch_size, max_iter):
    log_likelihood_all = []
    # make sure it's a numpy array
    coefficients = np.array(initial_coefficients)
    # set seed=1 to produce consistent results
    np.random.seed(seed=1)
    # Shuffle the data before starting
    permutation = np.random.permutation(len(feature_matrix))
    feature_matrix = feature_matrix[permutation, :]
    sentiment = sentiment[permutation]
    i = 0  # index of current batch
    # Do a linear scan over the data
    for itr in xrange(max_iter):
        # Predict P(y_i = +1 | x_i, w) for the current batch
        predictions = predict_probability(feature_matrix[i:i+batch_size, :], coefficients)
        # Compute indicator value for (y_i = +1) over the current batch
        indicator = (sentiment[i:i+batch_size] == +1)
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)):  # loop over each coefficient
            # feature_matrix[:, j] is the feature column associated with coefficients[j]
            derivative = feature_derivative(errors, feature_matrix[i:i+batch_size, j])
            # Update with step size * derivative * normalization constant (1./batch_size)
            coefficients[j] += step_size * derivative * (1. / batch_size)
        # Check whether the log likelihood is increasing:
        # print the log likelihood over the *current batch*
        lp = compute_avg_log_likelihood(feature_matrix[i:i+batch_size, :], sentiment[i:i+batch_size],
                                        coefficients)
        log_likelihood_all.append(lp)
        if itr <= 15 or (itr <= 1000 and itr % 100 == 0) or (itr <= 10000 and itr % 1000 == 0) \
           or itr % 10000 == 0 or itr == max_iter - 1:
            data_size = len(feature_matrix)
            print 'Iteration %*d: Average log likelihood (of data points in batch [%0*d:%0*d]) = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr,
                 int(np.ceil(np.log10(data_size))), i,
                 int(np.ceil(np.log10(data_size))), i + batch_size, lp)
        # If we made a complete pass over the data, shuffle and restart
        i += batch_size
        if i + batch_size > len(feature_matrix):
            permutation = np.random.permutation(len(feature_matrix))
            feature_matrix = feature_matrix[permutation, :]
            sentiment = sentiment[permutation]
            i = 0
    # Return the list of log likelihoods for plotting purposes.
    return coefficients, log_likelihood_all
```
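The post does not show how the function is invoked; a plausible call, with illustrative (not original) hyperparameters, might look like this:

```python
# Illustrative hyperparameters -- the original post does not specify them.
initial_coefficients = np.zeros(feature_matrix_train.shape[1])
coefficients_sgd, log_likelihood_sgd = logistic_regression_SG(
    feature_matrix_train, sentiment_train, initial_coefficients,
    step_size=1e-1, batch_size=100, max_iter=500)
```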