Logistic Regression


Agenda

  1. Refresh your memory on how to do linear regression in scikit-learn
  2. Attempt to use linear regression for classification
  3. Show you why logistic regression is a better alternative for classification
  4. Brief overview of probability, odds, e, log, and log-odds
  5. Explain the form of logistic regression
  6. Explain how to interpret logistic regression coefficients
  7. Demonstrate how logistic regression works with categorical features
  8. Compare logistic regression with other models

Part 1: Predicting a Continuous Response

In [1]:
# glass identification dataset
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass.sort('al', inplace=True)
glass.head()
Out[1]:
      ri       na     mg    al    si     k     ca     ba    fe   glass_type
id
22    1.51966  14.77  3.75  0.29  72.02  0.03   9.00  0.00  0.00          1
185   1.51115  17.38  0.00  0.34  75.41  0.00   6.65  0.00  0.00          6
40    1.52213  14.21  3.82  0.47  71.77  0.11   9.57  0.00  0.00          1
39    1.52213  14.21  3.82  0.47  71.77  0.11   9.57  0.00  0.00          1
51    1.52320  13.72  3.72  0.51  71.75  0.09  10.06  0.00  0.16          1

Question: Pretend that we want to predict ri, and our only feature is al. How could we do it using machine learning?

Answer: We could frame it as a regression problem, and use a linear regression model with al as the only feature and ri as the response.

Question: How would we visualize this model?

Answer: Create a scatter plot with al on the x-axis and ri on the y-axis, and draw the line of best fit.

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(font_scale=1.5)
In [3]:
sns.lmplot(x='al', y='ri', data=glass, ci=None)
Out[3]:
<seaborn.axisgrid.FacetGrid at 0x4136358>

Question: How would we draw this plot without using Seaborn?

In [4]:
# scatter plot using Pandas
glass.plot(kind='scatter', x='al', y='ri')
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x18395d30>
In [5]:
# equivalent scatter plot using Matplotlib
plt.scatter(glass.al, glass.ri)
plt.xlabel('al')
plt.ylabel('ri')
Out[5]:
<matplotlib.text.Text at 0x187b42b0>
In [6]:
# fit a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.ri
linreg.fit(X, y)
Out[6]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [7]:
# make predictions for all values of X
glass['ri_pred'] = linreg.predict(X)
glass.head()
Out[7]:
      ri       na     mg    al    si     k     ca     ba    fe   glass_type  ri_pred
id
22    1.51966  14.77  3.75  0.29  72.02  0.03   9.00  0.00  0.00          1  1.521227
185   1.51115  17.38  0.00  0.34  75.41  0.00   6.65  0.00  0.00          6  1.521103
40    1.52213  14.21  3.82  0.47  71.77  0.11   9.57  0.00  0.00          1  1.520781
39    1.52213  14.21  3.82  0.47  71.77  0.11   9.57  0.00  0.00          1  1.520781
51    1.52320  13.72  3.72  0.51  71.75  0.09  10.06  0.00  0.16          1  1.520682
In [8]:
# plot those predictions connected by a line
plt.plot(glass.al, glass.ri_pred, color='red')
plt.xlabel('al')
plt.ylabel('Predicted ri')
Out[8]:
<matplotlib.text.Text at 0x1a1fbda0>
In [9]:
# put the plots together
plt.scatter(glass.al, glass.ri)
plt.plot(glass.al, glass.ri_pred, color='red')
plt.xlabel('al')
plt.ylabel('ri')
Out[9]:
<matplotlib.text.Text at 0x1a21d7b8>

Refresher: interpreting linear regression coefficients

Linear regression equation: $y = \beta_0 + \beta_1x$

In [10]:
# compute prediction for al=2 using the equation
linreg.intercept_ + linreg.coef_ * 2
Out[10]:
array([ 1.51699012])
In [11]:
# compute prediction for al=2 using the predict method
linreg.predict(2)
Out[11]:
array([ 1.51699012])
In [12]:
# examine the coefficient for al
zip(feature_cols, linreg.coef_)
Out[12]:
[('al', -0.002477606387469623)]

Interpretation: A 1 unit increase in 'al' is associated with a 0.0025 unit decrease in 'ri'.

In [13]:
# increasing al by 1 (so that al=3) decreases ri by 0.0025
1.51699012 - 0.0024776063874696243
Out[13]:
1.5145125136125304
In [14]:
# compute prediction for al=3 using the predict method
linreg.predict(3)
Out[14]:
array([ 1.51451251])

Part 2: Predicting a Categorical Response

In [15]:
# examine glass_type
glass.glass_type.value_counts().sort_index()
Out[15]:
1    70
2    76
3    17
5    13
6     9
7    29
dtype: int64
In [16]:
# types 1, 2, 3 are window glass
# types 5, 6, 7 are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
glass.head()
Out[16]:
      ri       na     mg    al    si     k     ca     ba    fe   glass_type  ri_pred   household
id
22    1.51966  14.77  3.75  0.29  72.02  0.03   9.00  0.00  0.00          1  1.521227          0
185   1.51115  17.38  0.00  0.34  75.41  0.00   6.65  0.00  0.00          6  1.521103          1
40    1.52213  14.21  3.82  0.47  71.77  0.11   9.57  0.00  0.00          1  1.520781          0
39    1.52213  14.21  3.82  0.47  71.77  0.11   9.57  0.00  0.00          1  1.520781          0
51    1.52320  13.72  3.72  0.51  71.75  0.09  10.06  0.00  0.16          1  1.520682          0

Let's change our task, so that we're predicting household using al. Let's visualize the relationship to figure out how to do this:

In [17]:
plt.scatter(glass.al, glass.household)
plt.xlabel('al')
plt.ylabel('household')
Out[17]:
<matplotlib.text.Text at 0x1a570cf8>

Let's draw a regression line, like we did before:

In [18]:
# fit a linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
linreg.fit(X, y)
glass['household_pred'] = linreg.predict(X)
In [19]:
# scatter plot that includes the regression line
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[19]:
<matplotlib.text.Text at 0x1a87ddd8>

If al=3, what class do we predict for household? 1

If al=1.5, what class do we predict for household? 0

We predict the 0 class for lower values of al, and the 1 class for higher values of al. What's our cutoff value? Around al=2, because that's where the linear regression line crosses the midpoint between predicting class 0 and class 1.

Therefore, we'll say that if household_pred >= 0.5, we predict a class of 1, else we predict a class of 0.

In [20]:
# understanding np.where
import numpy as np
nums = np.array([5, 15, 8])

# np.where returns the first value if the condition is True, and the second value if the condition is False
np.where(nums > 10, 'big', 'small')
Out[20]:
array(['small', 'big', 'small'], dtype='|S5')
In [21]:
# transform household_pred to 1 or 0
glass['household_pred_class'] = np.where(glass.household_pred >= 0.5, 1, 0)
glass.head()
Out[21]:
      ri       na     mg    al    si     k     ca     ba    fe   glass_type  ri_pred   household  household_pred  household_pred_class
id
22    1.51966  14.77  3.75  0.29  72.02  0.03   9.00  0.00  0.00          1  1.521227          0       -0.340495                     0
185   1.51115  17.38  0.00  0.34  75.41  0.00   6.65  0.00  0.00          6  1.521103          1       -0.315436                     0
40    1.52213  14.21  3.82  0.47  71.77  0.11   9.57  0.00  0.00          1  1.520781          0       -0.250283                     0
39    1.52213  14.21  3.82  0.47  71.77  0.11   9.57  0.00  0.00          1  1.520781          0       -0.250283                     0
51    1.52320  13.72  3.72  0.51  71.75  0.09  10.06  0.00  0.16          1  1.520682          0       -0.230236                     0
In [22]:
# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[22]:
<matplotlib.text.Text at 0x1a8af550>

Part 3: Using Logistic Regression Instead

Logistic regression can do what we just did:

In [23]:
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
logreg.fit(X, y)
glass['household_pred_class'] = logreg.predict(X)
In [24]:
# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[24]:
<matplotlib.text.Text at 0x1ace2080>

What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?

In [25]:
# store the predicted probabilities of class 1
glass['household_pred_prob'] = logreg.predict_proba(X)[:, 1]
In [26]:
# plot the predicted probabilities
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[26]:
<matplotlib.text.Text at 0x1accc550>
In [27]:
# examine some example predictions
print logreg.predict_proba(1)
print logreg.predict_proba(2)
print logreg.predict_proba(3)
[[ 0.97161726  0.02838274]]
[[ 0.34361555  0.65638445]]
[[ 0.00794192  0.99205808]]

The first column indicates the predicted probability of class 0, and the second column indicates the predicted probability of class 1.
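A quick way to confirm that ordering rather than assume it (a small check, not part of the original notebook): the fitted model's classes_ attribute lists the classes in the same order as the columns of predict_proba, and each row of probabilities sums to 1.

# the columns of predict_proba follow the order of logreg.classes_
print(logreg.classes_)                           # [0 1] for this problem
# every row of predicted probabilities sums to 1
print(logreg.predict_proba(X).sum(axis=1)[:5])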

Part 4: Probability, odds, e, log, log-odds

$$probability = \frac {one\ outcome} {all\ outcomes}$$

$$odds = \frac {one\ outcome} {all\ other\ outcomes}$$

Examples:

  • Dice roll of 1: probability = 1/6, odds = 1/5
  • Even dice roll: probability = 3/6, odds = 3/3 = 1
  • Dice roll less than 5: probability = 4/6, odds = 4/2 = 2
$$odds = \frac {probability} {1 - probability}$$

$$probability = \frac {odds} {1 + odds}$$
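As a quick check of these two conversion formulas (not in the original notebook), the "dice roll less than 5" example round-trips cleanly:

$$odds = \frac {4/6} {1 - 4/6} = 2 \qquad probability = \frac {2} {1 + 2} = \frac {2} {3} = \frac {4} {6}$$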
In [28]:
# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table
Out[28]:
   probability      odds
0         0.10  0.111111
1         0.20  0.250000
2         0.25  0.333333
3         0.50  1.000000
4         0.60  1.500000
5         0.80  4.000000
6         0.90  9.000000

What is e? It is the base rate of growth shared by all continually growing processes:

In [29]:
# exponential function: e^1
np.exp(1)
Out[29]:
2.7182818284590451

What is a (natural) log? It gives you the time needed to reach a certain level of growth:

In [30]:
# time needed to grow 1 unit to 2.718 units
np.log(2.718)
Out[30]:
0.99989631572895199

It is also the inverse of the exponential function:

In [31]:
np.log(np.exp(5))
Out[31]:
5.0
In [32]:
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table
Out[32]:
   probability      odds   logodds
0         0.10  0.111111 -2.197225
1         0.20  0.250000 -1.386294
2         0.25  0.333333 -1.098612
3         0.50  1.000000  0.000000
4         0.60  1.500000  0.405465
5         0.80  4.000000  1.386294
6         0.90  9.000000  2.197225

Part 5: What is Logistic Regression?

Linear regression: continuous response is modeled as a linear combination of the features:

$$y = \beta_0 + \beta_1x$$

Logistic regression: log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:

$$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$

This is called the logit function.

Probability is sometimes written as pi:

$$\log \left({\pi\over 1-\pi}\right) = \beta_0 + \beta_1x$$

The equation can be rearranged into the logistic function:

$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$
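The rearrangement itself is one algebraic step (spelled out here for completeness; not in the original notebook): exponentiate both sides of the logit equation, then solve for $\pi$:

$$\frac{\pi}{1-\pi} = e^{\beta_0 + \beta_1x} \quad\Rightarrow\quad \pi = (1 - \pi)\,e^{\beta_0 + \beta_1x} \quad\Rightarrow\quad \pi = \frac{e^{\beta_0 + \beta_1x}}{1 + e^{\beta_0 + \beta_1x}}$$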

In other words:

  • Logistic regression outputs the probabilities of a specific class
  • Those probabilities can be converted into class predictions

The logistic function has some nice properties:

  • Takes on an "s" shape
  • Output is bounded by 0 and 1

We have covered how this works for binary classification problems (two response classes). But what about multi-class classification problems (more than two response classes)?

  • The most common solution is "one-vs-all" (also known as "one-vs-rest"): decompose the problem into multiple binary classification problems
  • Multinomial logistic regression can instead solve it as a single problem (a sketch of both approaches follows this list)
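Not part of the original notebook: a minimal sketch of both approaches on the original glass_type column, assuming a scikit-learn version that supports the multi_class parameter of LogisticRegression.

# one-vs-rest: fits one binary logistic regression per class
from sklearn.linear_model import LogisticRegression
X = glass[['al']]
y = glass.glass_type
ovr = LogisticRegression(multi_class='ovr', C=1e9)
ovr.fit(X, y)

# multinomial: fits a single model over all classes (requires a solver such as 'lbfgs')
multi = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=1e9)
multi.fit(X, y)

# both expose one predict_proba column per class, in the order given by classes_
print(ovr.classes_)
print(multi.predict_proba(X[:3]))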

Part 6: Interpreting Logistic Regression Coefficients

In [33]:
# plot the predicted probabilities again
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[33]:
<matplotlib.text.Text at 0x1b302a58>
In [34]:
# compute predicted log-odds for al=2 using the equation
logodds = logreg.intercept_ + logreg.coef_[0] * 2
logodds
Out[34]:
array([ 0.64722323])
In [35]:
# convert log-odds to odds
odds = np.exp(logodds)
odds
Out[35]:
array([ 1.91022919])
In [36]:
# convert odds to probability
prob = odds/(1 + odds)
prob
Out[36]:
array([ 0.65638445])
In [37]:
# compute predicted probability for al=2 using the predict_proba method
logreg.predict_proba(2)[:, 1]
Out[37]:
array([ 0.65638445])
In [38]:
# examine the coefficient for al
zip(feature_cols, logreg.coef_[0])
Out[38]:
[('al', 4.1804038614510901)]

Interpretation: A 1 unit increase in 'al' is associated with a 4.18 unit increase in the log-odds of 'household'.

In [39]:
# increasing al by 1 (so that al=3) increases the log-odds by 4.18
logodds = 0.64722323 + 4.1804038614510901
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob
Out[39]:
0.99205808391674566
In [40]:
# compute predicted probability for al=3 using the predict_proba method
logreg.predict_proba(3)[:, 1]
Out[40]:
array([ 0.99205808])

Bottom line: Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).

In [41]:
# examine the intercept
logreg.intercept_
Out[41]:
array([-7.71358449])

Interpretation: For an 'al' value of 0, the log-odds of 'household' is -7.71.

In [42]:
# convert log-odds to probability
logodds = logreg.intercept_
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob
Out[42]:
array([ 0.00044652])

That makes sense from the plot above, because the probability of household=1 should be very low for such a low 'al' value.

Logistic regression beta values

Changing the $\beta_0$ value shifts the curve horizontally, whereas changing the $\beta_1$ value changes the slope of the curve.
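A minimal sketch (not from the original notebook) that plots the logistic curve for a few hand-picked beta values makes this easy to see; the logistic() helper below is purely illustrative, not something defined earlier in the notebook.

import numpy as np
import matplotlib.pyplot as plt

def logistic(x, b0, b1):
    # logistic function: converts b0 + b1*x into a probability between 0 and 1
    return np.exp(b0 + b1*x) / (1 + np.exp(b0 + b1*x))

x = np.linspace(-10, 10, 200)

# varying beta_0 shifts the curve horizontally; varying beta_1 changes its steepness
for b0, b1 in [(0, 1), (2, 1), (0, 3)]:
    plt.plot(x, logistic(x, b0, b1), label='b0=%s, b1=%s' % (b0, b1))
plt.legend()
plt.xlabel('x')
plt.ylabel('predicted probability')

The same plot also illustrates the two properties listed above: every curve has the "s" shape and stays between 0 and 1.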

Part 7: Using Logistic Regression with Categorical Features

Logistic regression can still be used with categorical features. Let's see what that looks like:

In [43]:
# create a categorical feature
glass['high_ba'] = np.where(glass.ba > 0.5, 1, 0)

Let's use Seaborn to draw the logistic curve:

In [44]:
# original (continuous) feature
sns.lmplot(x='ba', y='household', data=glass, ci=None, logistic=True)
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x1a16bda0>
In [45]:
# categorical feature
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True)
Out[45]:
<seaborn.axisgrid.FacetGrid at 0x1b308e48>
In [46]:
# categorical feature, with jitter added
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True, x_jitter=0.05, y_jitter=0.05)
Out[46]:
<seaborn.axisgrid.FacetGrid at 0x1bc03710>
In [47]:
# fit a logistic regression model
feature_cols = ['high_ba']
X = glass[feature_cols]
y = glass.household
logreg.fit(X, y)
Out[47]:
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)
In [48]:
# examine the coefficient for high_ba
zip(feature_cols, logreg.coef_[0])
Out[48]:
[('high_ba', 4.4273153450187195)]

Interpretation: Having a high 'ba' value is associated with a 4.43 unit increase in the log-odds of 'household' (as compared to a low 'ba' value).
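Equivalently (a conversion not shown in the original notebook), exponentiating the coefficient turns it into an odds ratio:

# exp(coefficient) is the multiplicative change in the odds of 'household'
np.exp(4.4273153450187195)    # roughly 83.7: high 'ba' multiplies the odds by about 84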

Part 8: Comparing Logistic Regression with Other Models

Advantages of logistic regression:

  • Highly interpretable (if you remember how)
  • Model training and prediction are fast
  • No tuning is required (excluding regularization)
  • Features don't need scaling
  • Can perform well with a small number of observations
  • Outputs well-calibrated predicted probabilities

Disadvantages of logistic regression:

  • Presumes a linear relationship between the features and the log-odds of the response
  • Performance is (generally) not competitive with the best supervised learning methods
  • Can't automatically learn feature interactions (you have to construct them yourself; see the sketch below)
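For that last point, here is a minimal sketch (not from the original notebook) of adding a hand-crafted interaction term before fitting; the column name al_x_ba is purely illustrative:

# create an interaction feature manually and include it alongside the original features
glass['al_x_ba'] = glass.al * glass.ba
feature_cols = ['al', 'ba', 'al_x_ba']
X = glass[feature_cols]
y = glass.household
logreg.fit(X, y)
list(zip(feature_cols, logreg.coef_[0]))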
