Online Learning: the Stochastic Gradient Method


This material is compiled from Coursera; discussion and reposting are welcome.

1 The problem with large datasets

  As datasets grow larger, the original gradient descent (or ascent) algorithm becomes very slow: every update of the coefficients ŵ requires a pass over all of the data. The rest of this post is about how to fix that.

2 Updating with one data point at a time

  Recall the original gradient-ascent update, whose partial derivative of the log likelihood sums over all data points:

$$\frac{\partial \ell(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\right) = \sum_{i=1}^{N} \frac{\partial \ell_i(\mathbf{w})}{\partial w_j}$$

  Stochastic gradient approximates this sum with a single term $\frac{\partial \ell_i(\mathbf{w})}{\partial w_j}$ at each update, which drastically reduces the amount of computation.
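To make this concrete, here is a minimal NumPy sketch (not from the original post) of a single-example gradient-ascent step for logistic regression; the names `x_i`, `y_i`, `w`, and `eta` are illustrative:

```python
import numpy as np

def sgd_update_one_point(w, x_i, y_i, eta):
    """One stochastic gradient-ascent step on the log likelihood,
    using only the single example (x_i, y_i) with y_i in {-1, +1}."""
    p = 1. / (1. + np.exp(-np.dot(x_i, w)))   # P(y = +1 | x_i, w)
    indicator = 1. if y_i == +1 else 0.       # 1[y_i = +1]
    gradient_i = (indicator - p) * x_i        # d l_i(w) / d w
    return w + eta * gradient_i
```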

3 Comparing gradient & stochastic gradient

| Algorithm | Time per iteration | Time to convergence (theory) | Time to convergence (practice) | Sensitivity to parameters |
| --- | --- | --- | --- | --- |
| Gradient | Slow for large data | Slower | Often slower | Moderate |
| Stochastic gradient | Always fast | Faster | Often faster | Very high |

A figure from the course illustrates this comparison (figure not preserved).

4 How does stochastic gradient work?

  An analogy with force decomposition from high-school mechanics helps: at each step we effectively take one component of the full gradient. Because the full gradient points toward the optimum, each individual component usually still moves us in a good direction, so we make progress along a zig-zag path.
  In the end the iterates oscillate around the optimum, so we take the average of the last stretch of iterates as the final coefficient vector:

$$\hat{\mathbf{w}} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}^{(t)}$$
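A minimal sketch of this averaging step, assuming the iterates have been collected in a list `coefficients_history` and we average the last `T` of them:

```python
import numpy as np

def average_last_iterates(coefficients_history, T):
    """Average the last T coefficient vectors to smooth out the
    oscillation around the optimum."""
    tail = np.array(coefficients_history[-T:])
    return tail.mean(axis=0)
```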

5 Choosing the step size η

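The original figure for this section is not preserved. A common rule of thumb: a step size that is too small makes convergence very slow, while one that is too large makes the log likelihood oscillate or diverge; a widely used remedy is to decrease η over time, for example η_t = η_0 / t. A minimal sketch of such a schedule (the names and default values are illustrative, not from the original post):

```python
def step_size_schedule(eta0, t):
    """Decreasing step size: large steps early on, smaller steps
    as the iterates start oscillating around the optimum."""
    return eta0 / t   # e.g. eta0 = 0.1 and t = 1, 2, 3, ...
```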

6 Improving the algorithm: one mini-batch at a time

```
shuffle the data
init w^(1) = 0, t = 1
until converged:
    for k = 0, 1, 2, ..., N/B - 1:
        for j = 0, ..., D:
            w_j^(t+1) <- w_j^(t) + eta * sum_{i = k*B+1}^{(k+1)*B} d l_i(w) / d w_j
        t = t + 1
```
With a batch size of one data point:
  Online learning lets us update the model at regular intervals as new data arrives!
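As an illustration (a sketch, not from the original post), an online learner can keep the current coefficients and apply one stochastic gradient-ascent step whenever a new batch of examples arrives; `online_update`, `new_X`, and `new_y` are hypothetical names:

```python
import numpy as np

def online_update(w, new_X, new_y, eta):
    """One mini-batch gradient-ascent step using only the newly arrived
    examples (new_X: rows of features, new_y: labels in {-1, +1})."""
    scores = np.dot(new_X, w)
    p = 1. / (1. + np.exp(-scores))            # P(y = +1 | x, w)
    indicator = (new_y == +1).astype(float)    # 1[y = +1]
    gradient = np.dot(new_X.T, indicator - p) / len(new_y)
    return w + eta * gradient
```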

7 Code implementation

Click here to download the data files and code.

```python
from __future__ import division
import graphlab
import json
import numpy as np

products = graphlab.SFrame('amazon_baby_subset.gl/')

with open('important_words.json', 'r') as f:
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

# Remove punctuation
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)

# Split out the words into individual columns
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s: s.split().count(word))

train_data, validation_data = products.random_split(.9, seed=1)

def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return (feature_matrix, label_array)

feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment')

def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate of P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    # Take dot product of feature_matrix and coefficients
    score = np.dot(feature_matrix, coefficients)
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1. / (1. + np.exp(-score))
    return predictions

def feature_derivative(errors, feature):
    # Compute the dot product of errors and feature
    derivative = np.dot(errors, feature)
    return derivative

def compute_avg_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == +1)
    scores = np.dot(feature_matrix, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    # Simple check to prevent overflow
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    lp = np.sum((indicator - 1) * scores - logexp) / len(feature_matrix)
    return lp

def logistic_regression_SG(feature_matrix, sentiment, initial_coefficients, step_size, batch_size, max_iter):
    log_likelihood_all = []
    # make sure it's a numpy array
    coefficients = np.array(initial_coefficients)
    # set seed=1 to produce consistent results
    np.random.seed(seed=1)
    # Shuffle the data before starting
    permutation = np.random.permutation(len(feature_matrix))
    feature_matrix = feature_matrix[permutation, :]
    sentiment = sentiment[permutation]
    i = 0  # index of current batch
    # Do a linear scan over the data
    for itr in xrange(max_iter):
        # Predict P(y_i = +1 | x_i, w) for the current batch
        predictions = predict_probability(feature_matrix[i:i+batch_size, :], coefficients)
        # Compute indicator value for (y_i = +1) over the current batch
        indicator = (sentiment[i:i+batch_size] == +1)
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)):  # loop over each coefficient
            # feature_matrix[:, j] is the feature column associated with coefficients[j]
            derivative = feature_derivative(errors, feature_matrix[i:i+batch_size, j])
            # Update with step size * derivative * normalization constant (1./batch_size)
            coefficients[j] += step_size * derivative * (1. / batch_size)
        # Check whether the log likelihood is increasing:
        # print the log likelihood over the *current batch*
        lp = compute_avg_log_likelihood(feature_matrix[i:i+batch_size, :], sentiment[i:i+batch_size],
                                        coefficients)
        log_likelihood_all.append(lp)
        if itr <= 15 or (itr <= 1000 and itr % 100 == 0) or (itr <= 10000 and itr % 1000 == 0) \
           or itr % 10000 == 0 or itr == max_iter - 1:
            data_size = len(feature_matrix)
            print 'Iteration %*d: Average log likelihood (of data points in batch [%0*d:%0*d]) = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr,
                 int(np.ceil(np.log10(data_size))), i,
                 int(np.ceil(np.log10(data_size))), i + batch_size, lp)
        # If we made a complete pass over the data, shuffle and restart
        i += batch_size
        if i + batch_size > len(feature_matrix):
            permutation = np.random.permutation(len(feature_matrix))
            feature_matrix = feature_matrix[permutation, :]
            sentiment = sentiment[permutation]
            i = 0
    # Return the list of log likelihoods for plotting purposes.
    return coefficients, log_likelihood_all
```
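The post does not show how the function is invoked; a plausible call, with illustrative (not original) hyperparameters, might look like this:

```python
# Illustrative hyperparameters -- the original post does not specify them.
initial_coefficients = np.zeros(feature_matrix_train.shape[1])
coefficients_sgd, log_likelihood_sgd = logistic_regression_SG(
    feature_matrix_train, sentiment_train, initial_coefficients,
    step_size=1e-1, batch_size=100, max_iter=500)
```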