TF/05_Nearest_Neighbor_Methods/01 02_Working_with_Nearest_Neighbors

Source: Internet | Published by: 淘宝的林国庆 | Editor: 程序博客网 | Time: 2024/05/20 08:25

Ch 5: Nearest Neighbor Methods

Nearest neighbor methods are a popular family of ML algorithms. We show how to implement k-Nearest Neighbors, weighted k-Nearest Neighbors, and k-Nearest Neighbors with mixed distance functions. In this chapter we also show how to use the Levenshtein distance (edit distance) in TensorFlow to calculate the distance between strings. We end the chapter by showing how to use k-Nearest Neighbors for categorical prediction on the MNIST handwritten digit data.

Introduction
- We introduce the concepts and methods needed for performing k-Nearest Neighbors in TensorFlow.
Working with Nearest Neighbors
- We create a nearest neighbor algorithm that tries to predict housing values (regression).
Working with Text Based Distances
- In order to use a distance function on text, we show how to use edit distances in TensorFlow.
Computing with Mixed Distance Functions
- Here we scale the distance function by the standard deviation of each input feature for k-Nearest Neighbors.
Using Address Matching
- We use a mixed distance function to match addresses. We use numerical distance for zip codes, and string edit distance for street names. The street names are allowed to have typos.
Using Nearest Neighbors for Image Recognition
- The MNIST digit image collection is a good dataset for illustrating how to perform k-Nearest Neighbors on an image classification task.
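
The edit-distance and address-matching items in the outline above can be illustrated without TensorFlow. What follows is a minimal sketch in plain Python, not the chapter's implementation: `levenshtein` and `address_distance` are hypothetical helper names, the sample addresses are made up, and the `zip_scale` parameter and the simple sum of the two terms are assumptions chosen only for illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def address_distance(street_a, zip_a, street_b, zip_b, zip_scale=100.0):
    """Hypothetical mixed distance: string term for streets, numeric term for zip codes."""
    # Normalize the edit distance by the longer street name so it lies in [0, 1]
    street_term = levenshtein(street_a, street_b) / max(len(street_a), len(street_b), 1)
    # Numeric distance between zip codes, divided by an assumed scale
    zip_term = abs(zip_a - zip_b) / zip_scale
    return street_term + zip_term


# Made-up reference addresses and a query whose street name contains a typo
reference = [("abbey road", 65802), ("baker street", 65803), ("canal street", 65804)]
query = ("bakr street", 65803)
best = min(reference, key=lambda r: address_distance(query[0], query[1], r[0], r[1]))
print(best)  # ('baker street', 65803)
```

Normalizing the street term keeps a long street name from dominating the zip-code term; how the two terms should be balanced in practice depends on the data.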

Working with Nearest Neighbors

# k-Nearest Neighbor
#----------------------------------
#
# This script illustrates how to use
# k-nearest neighbors in tensorflow
# (TensorFlow 1.x graph API: tf.Session, tf.placeholder)
#
# We will use the 1970s Boston housing dataset
# which is available through the UCI
# ML data repository.
#
# Data:
#----------x-values-----------
# CRIM   : per capita crime rate by town
# ZN     : prop. of res. land zones
# INDUS  : prop. of non-retail business acres
# CHAS   : Charles river dummy variable
# NOX    : nitric oxides concentration / 10 M
# RM     : Avg. # of rooms per building
# AGE    : prop. of buildings built prior to 1940
# DIS    : Weighted distances to employment centers
# RAD    : Index of radial highway access
# TAX    : Full tax rate value per $10k
# PTRATIO: Pupil/Teacher ratio by town
# B      : 1000*(Bk-0.63)^2, Bk=prop. of blacks
# LSTAT  : % lower status of pop
#------------y-value-----------
# MEDV   : Median Value of homes in $1,000's

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import requests
from tensorflow.python.framework import ops
ops.reset_default_graph()

# Create graph
sess = tf.Session()

# Load the data
housing_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
cols_used = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
num_features = len(cols_used)
housing_file = requests.get(housing_url)
housing_data = [[float(x) for x in y.split(' ') if len(x)>=1] for y in housing_file.text.split('\n') if len(y)>=1]

# Target (MEDV) as a column vector; features restricted to cols_used
y_vals = np.transpose([np.array([y[13] for y in housing_data])])
x_vals = np.array([[x for i,x in enumerate(y) if housing_header[i] in cols_used] for y in housing_data])

## Min-Max Scaling (np.ptp replaces the ndarray.ptp method, which NumPy 2.0 removed)
x_vals = (x_vals - x_vals.min(0)) / np.ptp(x_vals, axis=0)

# Split the data into train and test sets
np.random.seed(13)  # make results reproducible
train_indices = np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]

# Declare k-value and batch size
k = 4
batch_size = len(x_vals_test)

# Placeholders
x_data_train = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 1], dtype=tf.float32)

# Declare distance metric
# L1: result has shape [num_test, num_train]
distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2)

# L2 (sum over the feature axis, axis=2, as for L1)
#distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2))

# Predict: Get min distance index (Nearest neighbor)
#prediction = tf.arg_min(distance, 0)
# top_k on the negated distances selects the k smallest distances
top_k_xvals, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
x_sums = tf.expand_dims(tf.reduce_sum(top_k_xvals, 1),1)
x_sums_repeated = tf.matmul(x_sums,tf.ones([1, k], tf.float32))
# Each neighbor's weight is its share of the summed top-k distances
x_val_weights = tf.expand_dims(tf.div(top_k_xvals,x_sums_repeated), 1)

top_k_yvals = tf.gather(y_target_train, top_k_indices)
prediction = tf.squeeze(tf.matmul(x_val_weights,top_k_yvals), axis=[1])
# Unweighted alternative: plain mean of the k neighbors' targets
#prediction = tf.reduce_mean(top_k_yvals, 1)

# Calculate MSE
mse = tf.div(tf.reduce_sum(tf.square(tf.subtract(prediction, y_target_test))), batch_size)

# Calculate how many loops over test data
num_loops = int(np.ceil(len(x_vals_test)/batch_size))

for i in range(num_loops):
    min_index = i*batch_size
    max_index = min((i+1)*batch_size,len(x_vals_test))
    x_batch = x_vals_test[min_index:max_index]
    y_batch = y_vals_test[min_index:max_index]
    predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                         y_target_train: y_vals_train, y_target_test: y_batch})
    batch_mse = sess.run(mse, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                         y_target_train: y_vals_train, y_target_test: y_batch})

    print('Batch #' + str(i+1) + ' MSE: ' + str(np.round(batch_mse,3)))

# Plot prediction and actual distribution
bins = np.linspace(5, 50, 45)

plt.hist(predictions, bins, alpha=0.5, label='Prediction')
plt.hist(y_batch, bins, alpha=0.5, label='Actual')
plt.title('Histogram of Predicted and Actual Values')
plt.xlabel('Med Home Value in $1,000s')
plt.ylabel('Frequency')
plt.legend(loc='upper right')
plt.show()
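
To make the tensor shapes in the listing above easier to follow, here is a NumPy sketch of the same prediction step on toy data. It is an illustration under stated assumptions rather than the author's code: `knn_predict` is a hypothetical helper name, the toy arrays are made up, and `np.argsort` stands in for `tf.nn.top_k` on the negated distances.

```python
import numpy as np


def knn_predict(x_train, y_train, x_test, k=4, weighted=True):
    """Sketch of the listing's prediction step (hypothetical helper, not from the source)."""
    # Broadcast to (n_test, n_train, n_features), then sum over features: L1 distances
    dist = np.abs(x_train[None, :, :] - x_test[:, None, :]).sum(axis=2)
    # Indices of the k smallest distances for each test row
    idx = np.argsort(dist, axis=1)[:, :k]
    top_dist = np.take_along_axis(dist, idx, axis=1)  # shape (n_test, k)
    top_y = y_train[idx, 0]                           # shape (n_test, k)
    if not weighted:
        # Counterpart of the commented-out tf.reduce_mean line
        return top_y.mean(axis=1, keepdims=True)
    # Same weighting as the listing: each neighbor's share of the summed distances
    # (undefined if all k distances are zero, as in the listing)
    weights = top_dist / top_dist.sum(axis=1, keepdims=True)
    return (weights * top_y).sum(axis=1, keepdims=True)


# Made-up toy data: one feature, four training points, one test point
x_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([[10.0], [20.0], [30.0], [100.0]])
x_test = np.array([[0.25]])

print(knn_predict(x_train, y_train, x_test, k=2, weighted=False))  # [[15.]]
print(knn_predict(x_train, y_train, x_test, k=2, weighted=True))   # [[17.5]]
```

In the toy run the two nearest neighbors lie at distances 0.25 and 0.75, so the weights are 0.25 and 0.75: with this scheme the farther of the selected neighbors receives the larger weight. Inverse-distance weighting is a common alternative, but it is not what the listing computes.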

Batch #1 MSE: 14.38

[Figure: Histogram of Predicted and Actual Values (median home value in $1,000s vs. frequency)]
