TensorFlow for Hackers
In this one, you will learn how to create a Neural Network (NN) and use it for deciding whether a student has alcohol consumption problems.
Do students drink too much? How can you predict that? What predicts it best? How much is too much, exactly?
Those questions might be difficult to answer, yet we can start somewhere. We can use a very limited dataset to get a sense of what the answers might look like. Something like this one.
The dataset contains 1044 instances and 32 variables (most of which are binary or categorical). It is actually a combination of two datasets. The first provides data for students enrolled in a Portuguese class. The second describes students enrolled in a math course. The datasets overlap (yep, I know): some students attend both classes.
Let’s build an NN model for classifying whether a student has an alcohol consumption problem. For that, we will use our trusty old friend - TensorFlow.
Before getting there, we have a bit of dirty work to do. Our dataset is not clean enough to just start and feed the data to our NN model. A bit of wrangling is required. But first, let’s start with some setting up:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from math import floor, ceil
from pylab import rcParams

%matplotlib inline
Some styling and making our experiments reproducible:
sns.set(style='ticks', palette='Spectral', font_scale=1.5)

material_palette = ["#4CAF50", "#2196F3", "#9E9E9E", "#FF9800", "#607D8B", "#9C27B0"]
sns.set_palette(material_palette)
rcParams['figure.figsize'] = 16, 8

plt.xkcd();
random_state = 42
np.random.seed(random_state)
tf.set_random_seed(random_state)
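To see what seeding buys us, here is a minimal sketch using only NumPy (the TensorFlow seed plays the same role for TensorFlow's own random ops). The variable names first_draw and second_draw are made up for this illustration:

```python
import numpy as np

random_state = 42

# Seed, draw, then re-seed and draw again: both draws should be identical.
np.random.seed(random_state)
first_draw = np.random.rand(3)

np.random.seed(random_state)
second_draw = np.random.rand(3)

same = bool(np.array_equal(first_draw, second_draw))
print(same)  # True
```

Without the second call to np.random.seed, the two draws would differ.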
1. Preparing the data
Remember, our data is stored in two separate files. Let’s load them, assign proper course attendance to each student and merge them into one:
math_df = pd.read_csv("data/student/student-mat.csv", sep=";")
port_df = pd.read_csv("data/student/student-por.csv", sep=";")

math_df["course"] = "math"
port_df["course"] = "portuguese"

merged_df = math_df.append(port_df)
merged_df.shape
(1044, 34)
Exactly as promised - 1044 rows, but we have duplicates. The dataset archive contains instructions on how to find them: merging the two files on a set of identifying attributes yields 382 students who appear in both. We will keep a single row for each of those students and set their course column to "both":
merge_vector = ["school","sex","age","address",
                "famsize","Pstatus","Medu","Fedu",
                "Mjob","Fjob","reason","nursery","internet"]

duplicated_mask = merged_df.duplicated(keep=False, subset=merge_vector)
duplicated_df = merged_df[duplicated_mask]
unique_df = merged_df[~duplicated_mask]
both_courses_mask = duplicated_df.duplicated(subset=merge_vector)
both_courses_df = duplicated_df[~both_courses_mask].copy()
both_courses_df["course"] = "both"
students_df = unique_df.append(both_courses_df)
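If the two-step masking feels opaque, here is a small sketch of the same idea on a made-up toy frame. The rows are hypothetical, and "school" and "age" stand in for the full merge_vector. The sketch uses pd.concat rather than DataFrame.append, because append was removed in pandas 2.0:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
math_toy = pd.DataFrame({"school": ["GP", "GP"], "age": [15, 16], "course": "math"})
port_toy = pd.DataFrame({"school": ["GP", "MS"], "age": [15, 17], "course": "portuguese"})
merged_toy = pd.concat([math_toy, port_toy], ignore_index=True)

key = ["school", "age"]

# keep=False flags every row that has a twin, not only the later copies.
dup_mask = merged_toy.duplicated(keep=False, subset=key)
unique_toy = merged_toy[~dup_mask]
dup_toy = merged_toy[dup_mask]

# Among the duplicated rows, keep only the first copy and relabel it.
both_toy = dup_toy[~dup_toy.duplicated(subset=key)].copy()
both_toy["course"] = "both"

students_toy = pd.concat([unique_toy, both_toy], ignore_index=True)
course_counts = students_toy["course"].value_counts().to_dict()
print(course_counts)
```

The student who shows up in both toy files ends up as a single row labelled "both", while the other two rows keep their original course.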
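We will use the following formula to quantify the amount of alcohol taken during the week per student:
$$\text{alcohol} = \frac{2 \cdot \text{Walc} + 5 \cdot \text{Dalc}}{7}$$
Walc is the weekend consumption (2 days) and Dalc is the workday consumption (5 days). Both are rated from 1 to 5 in the dataset, so the new value lies in the interval $[1, 5]$. We round the value up and classify a student as a drinker if the result is greater than 2: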
students_df = students_df.sample(frac=1)
students_df['alcohol'] = (students_df.Walc * 2 + students_df.Dalc * 5) / 7
students_df['alcohol'] = students_df.alcohol.map(lambda x: ceil(x))
students_df['drinker'] = students_df.alcohol.map(lambda x: "yes" if x > 2 else "no")
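A few worked values make the labelling rule concrete. This is a small sketch in plain Python; the helper names weekly_alcohol and is_drinker are made up for this illustration, and the (Walc, Dalc) pairs are hypothetical students:

```python
from math import ceil

def weekly_alcohol(walc, dalc):
    """Weighted average of the weekend (2 days) and workday (5 days) ratings."""
    return (walc * 2 + dalc * 5) / 7

def is_drinker(walc, dalc):
    """Round the score up, then apply the same > 2 threshold as the code above."""
    return "yes" if ceil(weekly_alcohol(walc, dalc)) > 2 else "no"

# Hypothetical (Walc, Dalc) pairs.
examples = [(1, 1), (3, 1), (5, 1), (5, 5)]
labels = [is_drinker(w, d) for w, d in examples]
print(labels)  # ['no', 'no', 'yes', 'yes']
```

Note the effect of rounding up: a raw score of 15/7 ≈ 2.14 becomes 3, so any student whose weighted average is even slightly above 2 is labelled a drinker.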
2. Exploration
Finally, we can get a feel for our data. Let’s take a look at the course distribution:
students_df.course.value_counts().plot(kind="bar", rot=0);
And the alcohol consumption from the formula:
students_df.alcohol.value_counts().plot(kind="bar", rot=0);
The actual variable that we are going to predict:
students_df.drinker.value_counts().plot(kind="bar", rot=0);
Somewhat more comprehensive overview:
sns.pairplot(students_df[['age', 'absences', 'G3', 'goout', 'freetime', 'studytime', 'drinker']], hue='drinker');
Let’s have a look at a general correlations matrix:
corr_mat = students_df.corr()
fig, ax = plt.subplots(figsize=(20, 12))
sns.heatmap(corr_mat, vmax=1.0, square=True, ax=ax);
3. Building our model
It is time for the fun part. Well, not just yet.
3.1 Encoding the data
Most of our variables are categorical and we must one-hot encode them for our NN to work properly. First, let’s define a little helper function:
def encode(series):
    return pd.get_dummies(series.astype(str))
Our features and target variable using our little helper function:
train_x = pd.get_dummies(students_df.school)
train_x['age'] = students_df.age
train_x['absences'] = students_df.absences
train_x['g1'] = students_df.G1
train_x['g2'] = students_df.G2
train_x['g3'] = students_df.G3
train_x = pd.concat([train_x,
                     encode(students_df.sex),
                     encode(students_df.Pstatus),
                     encode(students_df.Medu),
                     encode(students_df.Fedu),
                     encode(students_df.guardian),
                     encode(students_df.studytime),
                     encode(students_df.failures),
                     encode(students_df.activities),
                     encode(students_df.higher),
                     encode(students_df.romantic),
                     encode(students_df.reason),
                     encode(students_df.paid),
                     encode(students_df.goout),
                     encode(students_df.health),
                     encode(students_df.famsize),
                     encode(students_df.course)
                    ], axis=1)

train_y = encode(students_df.drinker)
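To make the encoding concrete, here is what the helper does to a tiny, made-up column. The values are hypothetical, and the extra astype(int) is only there so the result prints as 0/1 regardless of pandas version:

```python
import pandas as pd

def encode(series):
    return pd.get_dummies(series.astype(str))

# Hypothetical values for an ordinal column such as studytime.
toy = pd.Series([1, 2, 1, 3])
encoded = encode(toy).astype(int)

column_names = list(encoded.columns)
rows = encoded.values.tolist()
print(column_names)  # ['1', '2', '3']
print(rows)          # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each distinct value becomes its own column and every row has exactly one 1. One side effect of concatenating many such frames is that column names repeat (several yes/no variables all produce columns named "no" and "yes"). That is harmless here because we only use the underlying values, but pd.get_dummies also accepts a prefix argument if you want distinguishable names.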
3.2 Splitting the data
Let’s allocate 90% of the data for training and use 10% for testing:
train_size = 0.9

train_cnt = floor(train_x.shape[0] * train_size)
x_train = train_x.iloc[0:train_cnt].values
y_train = train_y.iloc[0:train_cnt].values
x_test = train_x.iloc[train_cnt:].values
y_test = train_y.iloc[train_cnt:].values
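As a quick sanity check of the arithmetic, here is the same slicing applied to row indices only. This sketch assumes the 662 rows that remain after de-duplication (the figure quoted in the conclusion of this post):

```python
from math import floor
import numpy as np

n_rows = 662  # assumed dataset size after de-duplication
rows = np.arange(n_rows)

train_size = 0.9
train_cnt = floor(n_rows * train_size)

train_rows = rows[:train_cnt]
test_rows = rows[train_cnt:]
print(len(train_rows), len(test_rows))  # 595 67
```

Slicing by position only gives a random split because students_df was shuffled earlier with sample(frac=1).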
3.3 Building our Neural Network
Our NN consists of an input layer, an output layer and 1 hidden layer. We are using ReLU as the activation function of the hidden layer and softmax for our output layer. As an additional bonus we will use Dropout - a simple way to reduce overfitting during the training of our network. Let’s wrap our model in a little helper function:
def multilayer_perceptron(x, weights, biases, keep_prob):
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    layer_1 = tf.nn.dropout(layer_1, keep_prob)
    out_layer = tf.matmul(layer_1, weights['out']) + biases['out']
    return out_layer
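For intuition, here is a NumPy sketch of what the helper above computes. It is an illustration under assumptions, not the TensorFlow implementation: the layer sizes are made-up toy values, and forward and softmax are names chosen for this sketch. Note that the helper returns raw scores (logits); the softmax is applied later, inside the cost function.

```python
import numpy as np

def forward(x, weights, biases, keep_prob, rng):
    """NumPy mirror of the helper above; returns logits, not probabilities."""
    layer_1 = np.maximum(x @ weights["h1"] + biases["b1"], 0.0)  # ReLU
    # Inverted dropout: drop units with probability 1 - keep_prob and
    # scale the survivors by 1 / keep_prob, as tf.nn.dropout does.
    mask = rng.random(layer_1.shape) < keep_prob
    layer_1 = layer_1 * mask / keep_prob
    return layer_1 @ weights["out"] + biases["out"]

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
n_input, n_hidden, n_classes = 5, 4, 2  # hypothetical toy sizes
weights = {"h1": rng.normal(size=(n_input, n_hidden)),
           "out": rng.normal(size=(n_hidden, n_classes))}
biases = {"b1": rng.normal(size=n_hidden),
          "out": rng.normal(size=n_classes)}

x = rng.normal(size=(3, n_input))  # 3 made-up input rows
logits = forward(x, weights, biases, keep_prob=1.0, rng=rng)
probs = softmax(logits)
print(probs.shape)  # (3, 2)
```

With keep_prob=1.0 nothing is dropped, which is exactly what we want at test time.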
Let’s set the number of neurons in the hidden layer to 38 and randomly initialize the weights and biases considering their proper dimensions:
n_hidden_1 = 38
n_input = train_x.shape[1]
n_classes = train_y.shape[1]

weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'out': tf.Variable(tf.random_normal([n_hidden_1, n_classes]))
}

biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

keep_prob = tf.placeholder("float")
We will train our model for 5,000 epochs with a batch size of 32. That is, in each epoch we pass over the whole training set in chunks of roughly 32 rows, and each chunk triggers one update of the weights. Granted, in our case you can just train on the whole dataset at once. However, when the data is huge and you can’t fit it in memory, you would love to split it and feed it to the model in batches (chunks):
training_epochs = 5000
display_step = 1000
batch_size = 32

x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
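The training loop later in this post splits the data with np.array_split, so the chunks are not exactly 32 rows long. A minimal sketch, assuming 595 training rows (90% of 662, rounded down):

```python
import numpy as np

n_train = 595  # assumed number of training rows
batch_size = 32

# Same computation as in the training loop: number of batches, then an even split.
total_batch = int(n_train / batch_size)
batches = np.array_split(np.arange(n_train), total_batch)

sizes = sorted({len(b) for b in batches})
print(total_batch, sizes)  # 18 [33, 34]
```

np.array_split spreads the leftover rows across the chunks instead of dropping them, so every training row is used once per epoch.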
3.4 Training
In order for our model to learn, we need to define what is good. Actually, we will define what is bad and try to minimize it. We will call the “badness” - error or cost (hence, the cost function). It represents how far off of the true result our model is at some point during training. We would love that error to be 0 for all possible inputs. Currently, that happens only in Sci-Fi novels (not that I discourage dreaming about it).
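The cost function that we are going to use is called “Cross-Entropy”. It is defined as:
$$H_{y'}(y) = -\sum_i y'_i \log(y_i)$$
Where $y$ is the predicted probability distribution and $y'$ is the ground truth (the one-hot encoded drinker label). TensorFlow provides a helper function for this, softmax_cross_entropy_with_logits. It applies softmax to the raw output of our output layer and uses Cross-Entropy as the error function.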
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=predictions, labels=y))
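To see the numbers behind that one-liner, here is a NumPy sketch of softmax followed by cross-entropy. The function name softmax_cross_entropy and the logits are made up for this illustration; it is not TensorFlow's implementation, only the same math:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-row cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    # log(softmax) computed directly, which is more stable than log(exp/sum).
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

# Two hypothetical rows with the same true class (index 0):
# the first prediction leans the right way, the second the wrong way.
logits = np.array([[2.0, 0.0], [0.0, 2.0]])
labels = np.array([[1.0, 0.0], [1.0, 0.0]])

losses = softmax_cross_entropy(logits, labels)
mean_cost = losses.mean()
print(losses, mean_cost)
```

The confident correct prediction costs roughly 0.127, while the confident wrong one costs roughly 2.127. Averaging over the rows corresponds to the tf.reduce_mean call.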
Now, for the actual workhorse - Adam (nope, not the one from the Bible - although, that would’ve been fun). Adam is a type of gradient descent optimization algorithm which essentially tries as hard as it can to find proper weights and biases for our network by minimizing the cost function that we specified above. It is well beyond the scope of this post to describe Adam in detail, but you can find all the necessary information over here - with tons of nice pictures!
Using Adam in TensorFlow is quite easy. We just have to specify the learning rate (you can fiddle with that one) and pass the cost function we defined above:
optimizer = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(cost)
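For the curious, the Adam update rule fits in a few lines. The sketch below applies it to a made-up one-dimensional problem, minimizing (w - 3)^2, rather than to our network. The function name adam_minimize, the toy objective and the larger learning rate of 0.1 are assumptions chosen for illustration:

```python
import numpy as np

def adam_minimize(grad_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run the standard Adam update on a single scalar parameter."""
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both averages start at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Gradient of the toy cost (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

Starting from 0, the parameter should end up very close to 3. TensorFlow's AdamOptimizer does the same kind of bookkeeping for every weight and bias in the network.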
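Our model is created by just calling our helper function with the proper arguments. Note that this line must run before the cost and optimizer definitions above, because both depend on predictions: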
predictions = multilayer_perceptron(x, weights, biases, keep_prob)
Our finished NN looks something like this (much reduced input and hidden layer sizes):
4. Evaluation
Time to see how well our model can predict. During the training, we will set the keep probability of the Dropout to 0.8 and reset it to 1.0 during test time:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for epoch in range(training_epochs):
        avg_cost = 0.0
        total_batch = int(len(x_train) / batch_size)
        x_batches = np.array_split(x_train, total_batch)
        y_batches = np.array_split(y_train, total_batch)
        for i in range(total_batch):
            batch_x, batch_y = x_batches[i], y_batches[i]
            _, c = sess.run([optimizer, cost],
                            feed_dict={
                                x: batch_x,
                                y: batch_y,
                                keep_prob: 0.8
                            })
            avg_cost += c / total_batch
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch+1), "cost=", \
                "{:.9f}".format(avg_cost))
    print("Optimization Finished!")
    correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print("Accuracy:", accuracy.eval({x: x_test, y: y_test, keep_prob: 1.0}))
Epoch: 0001 cost= 103.346587711
Epoch: 1001 cost= 2.053295698
Epoch: 2001 cost= 0.464109008
Epoch: 3001 cost= 0.304592287
Epoch: 4001 cost= 0.284183074
Optimization Finished!
Accuracy: 0.731343
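The accuracy computation boils down to comparing the position of the largest score with the position of the 1 in the one-hot label. A NumPy sketch on made-up scores and labels:

```python
import numpy as np

# Hypothetical model outputs (logits) and one-hot labels for 4 students.
logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 0.2]])
labels = np.array([[1, 0], [0, 1], [1, 0], [1, 0]])

# Same idea as tf.equal(tf.argmax(...), tf.argmax(...)) followed by the mean.
correct = np.argmax(logits, axis=1) == np.argmax(labels, axis=1)
accuracy = correct.astype(float).mean()
print(accuracy)  # 0.75

# If the test set has 67 rows (10% of 662), the reported 0.731343
# corresponds to 49 correct predictions.
reported = round(49 / 67, 6)
print(reported)  # 0.731343
```

With such a small test set, every single prediction shifts the accuracy by about 1.5 percentage points.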
5. Conclusion(s)
Yes, you did it! You survived another part of this tutorial. But what did you achieve? Our model got roughly 73% accuracy on the test set. Is this good? Well… no, it is not!
How is that possible? The authors of the paper linked from the dataset attained 92% accuracy, which is (as they state) acceptable. So, why does our model perform so badly?
For one thing, we excluded overlapping student data, which made our dataset considerably smaller - from 1044 to just 662 instances (I haven’t found any type of duplicate reduction technique used by the authors. Please, write me a comment if I am wrong about that one). Due to the high prevalence of non-drinkers, this might have a detrimental effect on our model performance.
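One way to put an accuracy figure in context when one class dominates is to compare it with a model that always predicts the majority class. A minimal sketch with made-up labels (they are NOT taken from the dataset):

```python
from collections import Counter

# Hypothetical test labels, for illustration only.
y_true = ["no"] * 7 + ["yes"] * 3

# A "model" that always answers with the most common label.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)
print(majority_label, baseline_accuracy)  # no 0.7
```

Running the same check on the real y_test would show how much of our 73% comes from simply guessing "no".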
Of course, you can try different parameters, architecture, training epochs etc… Feel free to do so! Till next time!
References
Student Alcohol Consumption - Description of the used dataset
Using Data Mining to Predict Secondary School Student Alcohol Consumption - A paper using this dataset and comparing 3 different models on it (including NN)
Student Alcohol Consumption Prediction - Possibly source code used in the previous paper
MNIST classification using TensorFlow - Use Deep Neural Network to classify handwritten digits
How to choose the number of hidden layers and neurons in NN?
How to handle ordinal data in NN models - Lots of the variables are ordinal. This paper presents an approach to handling that kind of data in NN models
Simpler way to handle ordinal data in NN models