Document classification by inversion of distributed language representations


The skip-gram probabilistic language model is trained to maximize, for window size $b$ and each word $w_t$ in the sentence,

$$\sum_{-b \le j \le b,\; j \neq 0} \log \mathrm{p}\left(w_{t+j} \mid w_t\right).$$

This probability is calculated for all possible word pairs over a sentence (some defined short chunk of language).

In the word2vec formulation, each probability is represented as

$$\mathrm{p}\left(w \mid w_t\right) = \prod_{k=1}^{L(w)-1} \sigma\!\left( \mathrm{ch}\!\left[\eta_{k+1}(w)\right]\, \mathbf{u}_{\eta_k(w)}^{\top} \mathbf{v}_{w_t} \right),$$

where

  •  $\eta_k(w)$ is the $k^{\text{th}}$ node in the binary Huffman tree representation (path) for word $w$, and $L(w)$ is the length of that path,
  •  $\sigma(x) = 1/\left(1 + e^{-x}\right)$, with $\mathbf{u}_{\eta}$ and $\mathbf{v}_{w}$ the node and input-word vectors,
  •  and $\mathrm{ch}\left[\cdot\right]$ translates from left/right child to +/- one.

The binary Huffman tree represents each word as a series of bits, so that the probability of word $w$ given word $w_t$ can be written as the product of probabilities that each bit is either on or off (represented above through the $\sigma$ function).
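To make the bit-product concrete, here is a small illustrative sketch (not from the original notebook): it assumes a made-up three-node Huffman path, with hypothetical node vectors `u_path`, left/right signs `signs`, and an input word vector `v_in`, and multiplies the per-node sigmoid probabilities.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # hypothetical Huffman path for one target word: three internal nodes,
    # each with a node vector and a +/-1 sign for its left/right child
    rng = np.random.RandomState(0)
    u_path = rng.normal(size=(3, 5))   # node vectors along the path
    signs = np.array([+1, -1, +1])     # ch(.): left/right child -> +/- one
    v_in = rng.normal(size=5)          # vector for the conditioning word w_t

    # p(w | w_t) is the product of the per-node (per-bit) probabilities
    p = np.prod(sigmoid(signs * u_path.dot(v_in)))
    print(p)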

In [1]:
# An example binary huffman tree.!dot -Tpng paper/graphs/bht.dot -o paper/graphs/bht.pngfrom IPython.display import ImageImage(filename='paper/graphs/bht.png') # Note that it is a prefix tree: you know the total length after each point given bits to that point.
Out[1]:

Given a fitted representation, we can score any new sentence (and sum across them for documents) according to the model implied by the training process. This gives a document (log) probability under that representation. This can be calculated for language representations trained on each of the corpora associated with some class labels. When combined with priors for each class label and inversion via Bayes rule, the document probabilities provide class probabilities.
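As a minimal sketch of that inversion step (illustrative only, with made-up numbers; the actual scoring happens in `getprobs` below): given per-class document log likelihoods and class priors, Bayes' rule gives the class posteriors.

    import numpy as np

    def bayes_invert(loglik, prior):
        # loglik: class -> document log likelihood under that class's language model
        # prior:  class -> prior class probability
        classes = sorted(loglik)
        eta = np.array([loglik[c] + np.log(prior[c]) for c in classes])
        eta -= eta.max()                      # stabilize before exponentiating
        post = np.exp(eta) / np.exp(eta).sum()
        return dict(zip(classes, post))

    # the document is slightly more likely under the 'pos' representation
    print(bayes_invert({'neg': -105.2, 'pos': -103.7}, {'neg': 0.5, 'pos': 0.5}))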

Comparators would be any linear model regressing labels onto phrase counts (e.g., the logistic lasso below), the doc2vec machinery (also built into gensim), and the Socher et al. RNTN. The doc2vec tool maps from documents to a vector space of fixed dimension, which can then be used as input to off-the-shelf machine learners. The RNTN instead builds the sentiment/meaning into the model itself and conditions upon this information during training.

Advantages of the inversion framework include:

  •  modularity: it works for any language model that can be interpreted (or whose training can be interpreted) as a probabilistic model.
  • transparency and replicability: a simple extension of any software for training distributed language models.
  •  performance (consider also 'training' the class priors, so as to correct for generative model misspecification; see the sketch after this list).
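For the last point, one simple way to 'train' the priors (a sketch under assumptions, not what the notebook below does; the notebook keeps flat priors) is to pick the prior weight that minimizes misclassification on a validation sample.

    import numpy as np

    def tune_pos_prior(loglik, labels, grid=np.linspace(0.05, 0.95, 19)):
        # loglik: (n_docs, 2) array of document log likelihoods under the [neg, pos] models
        # labels: length-n_docs array of true classes coded 0 (neg) / 1 (pos)
        best_p, best_mcr = 0.5, 1.0
        for p in grid:
            eta = loglik + np.log([1.0 - p, p])   # add the log prior to each column
            yhat = eta.argmax(axis=1)             # posterior-mode class
            mcr = (yhat != labels).mean()
            if mcr < best_mcr:
                best_p, best_mcr = p, mcr
        return best_p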

Example: yelp data

We'll run a proof of concept on the Kaggle yelp recruiting contest data, split into star-rating files via parseyelp.py (a rough sketch of that step follows).
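The splitting script itself isn't shown here; this is a rough sketch of what it does, assuming the Kaggle review dump is JSON-lines with 'stars' and 'text' fields (the input file name below is hypothetical, and the real parseyelp.py also does heavier cleaning, e.g. replacing numbers with 000).

    import json

    # split the raw Kaggle reviews into one plain-text file per star rating
    outfiles = {s: open("data/yelptrain%dstar.txt" % s, "w") for s in range(1, 6)}
    for line in open("data/yelp_training_set_review.json"):
        rev = json.loads(line)
        text = " ".join(rev["text"].lower().split())   # collapse whitespace/newlines
        outfiles[int(rev["stars"])].write(text + "\n")
    for f in outfiles.values():
        f.close()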

In [2]:
import sys
import numpy as np
import pandas as pd
from copy import deepcopy
from gensim.models import Word2Vec
from gensim.models import Phrases

fin = open("data/yelptrain1star.txt")
firstbadreview = fin.readline()
print(firstbadreview)
 u can go there n check car out . if u wanna buy 000 there ? thats wrong move ! if u even want car service from there ? u made biggest mistake of ur life !! i had 000 time asked my girlfriend take my car there oil service guess what ? they ripped my girlfriend off by lying how bad my car now . if without fixing problem . might bring some serious accident . then she did what they said . 000 brand new tires timing belt 000 new brake pads . u know whys worst ? all of those above i had just changed 000 months before !!! what trashy dealer that ? people better off go somewhere ! 
In [3]:
## define a review generator
import re
alteos = re.compile(r'( [!\?] )')

def revsplit(l):
    l = alteos.sub(r' \1 . ', l).rstrip("( \. )*\n")
    return [s.split() for s in l.split(" . ")]

def YelpReviews( stars = [1,2,3,4,5], prefix="train" ):
    for nstar in stars:
        for line in open("data/yelp%s%dstar.txt"%(prefix,nstar)):
            yield revsplit(line)
In [4]:
## grab all sentences; good bad and ugly
allsentences = [ s for r in YelpReviews() for s in r ]
len(allsentences)
Out[4]:
2027394
In [5]:
docgrp = {'neg': [1,2], 'pos': [5]}
[g for g in docgrp]
Out[5]:
['neg', 'pos']
In [6]:
reviews = { g: list(YelpReviews(docgrp[g])) for g in docgrp }
ndoc = pd.Series( {g: len(reviews[g]) for g in docgrp} , dtype="float64" )
In [7]:
reviews['neg'][0][6:10]
Out[7]:
[['if', 'without', 'fixing', 'problem'], ['might', 'bring', 'some', 'serious', 'accident'], ['then', 'she', 'did', 'what', 'they', 'said'], ['000',  'brand',  'new',  'tires',  'timing',  'belt',  '000',  'new',  'brake',  'pads']]
In [8]:
jointmodel = Word2Vec(workers=4)
np.random.shuffle(allsentences)
jointmodel.build_vocab(allsentences)
In [9]:
model = { g: deepcopy(jointmodel) for g in docgrp }
In [10]:
def trainW2V(g, T=20):
    sent = [l for r in reviews[g] for l in r]
    model[g].min_alpha = model[g].alpha
    for epoch in range(T):
        print(epoch, end=" ")
        np.random.shuffle(sent)
        model[g].train(sent)
        model[g].alpha *= 0.9
        model[g].min_alpha = model[g].alpha
    print(".")
In [11]:
for g in docgrp:
    print(g, end=": ")
    trainW2V( g )
neg: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
In [12]:
def nearby(word, g):
    print(word)
    print( "%s:"%str(g), end=" ")
    for (w,v) in model[g].most_similar([word]):
        print(w, end=" ")
    print("\n")
In [13]:
for g in docgrp: nearby("food", g)for g in docgrp: nearby("service", g)for g in docgrp: nearby("value", g)
food
neg: pizzabytheslice meals service fazolis astoundingly absentminded cuisine slooow unmistakably meal

food
pos: witha cuisine value bolting foood authentic breakfest grubs etcetera waite

service
neg: sevice craftsmanship arianas !!!!!!!!!!!!!!!!! unexceptional barbarian tapinos meek recork inconsistant

service
pos: breakfest sevice waite kendall devin serivce bolting ambience value atmoshphere

value
neg: price quality qualtiy insider pricing craftsmanship appallingly %! prices quantity

value
pos: price quality pricing prices rages feri overall service quantity food

Everything to this point uses standard gensim. For the next bit, we're using the score functions implemented in the taddylab fork.
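(If you are replicating this on a recent gensim release rather than the fork, a `Word2Vec.score()` method serving the same purpose has, as far as I know, since been merged upstream; it requires a hierarchical-softmax skip-gram model. A minimal sketch under that assumption, on a toy corpus:)

    from gensim.models import Word2Vec

    # score() needs hierarchical softmax (hs=1, negative=0) and skip-gram (sg=1)
    toy = Word2Vec([["good", "tasty", "food"], ["bad", "rude", "service"]] * 50,
                   sg=1, hs=1, negative=0, min_count=1)
    print(toy.score([["good", "food"], ["bad", "service"]]))  # per-sentence log likelihoods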

In [14]:
testrev = { g: list(YelpReviews(docgrp[g], "test")) for g in docgrp }
In [15]:
def getprobs(rev, grp):
    sentences = [(i,s) for i,r in enumerate(rev) for s in r]
    eta = pd.DataFrame(
            { g: model[g].score([s for i,s in sentences])
              for g in grp } )
    probs = eta.subtract( eta.max('columns'), 'rows')
    probs = np.exp( probs )
    probs = probs.divide(probs.sum('columns'), "rows")
    probs['cnt'] = 1
    probs = probs.groupby([i for i,s in sentences]).sum()
    probs = probs.divide(probs["cnt"], 'rows').drop("cnt", 1)
    return(probs)
In [16]:
probs = {g: getprobs(testrev[g], docgrp) for g in docgrp }
In [17]:
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=(7,4))
plt.hist(probs['neg']['pos'], normed=1,
    color="red", alpha=.6, label="true negative", linewidth=1)
plt.hist(probs['pos']['pos'], normed=1,
    color="yellow", alpha=.6, label="true positive", linewidth=1)
plt.xlim([0,1])
plt.ylim([0,5])
plt.legend(frameon=False, loc='upper center')
plt.xlabel("prob positive")
plt.ylabel("density")
#fig.savefig("graphs/coarseprobs.pdf", format="pdf", bbox_inches="tight")
Out[17]:
<matplotlib.text.Text at 0x7f71dda4c0f0>
In [18]:
yhat = {g: probs[g].idxmax('columns') for g in docgrp}
mc = pd.DataFrame({
    'mcr': {g: (yhat[g] != g).mean() for g in docgrp},
    'n': {g: len(testrev[g]) for g in docgrp}
    })
print(mc)
overall = mc.product("columns").sum()/mc['n'].sum()
print("\nOverall MCR: %.3f" %overall)
          mcr     n
neg  0.064604  4427
pos  0.055019  8797

Overall MCR: 0.058

So the fit looks nice and tight: out of sample, we get around a 6% misclassification rate on the reviews.

In [19]:
svec = np.concatenate((probs['neg']['pos'], probs['pos']['pos']), axis=0)
allrev = [[w for s in r for w in s] for r in testrev['neg']+testrev['pos']]
In [20]:
import pandas as pd
diff = pd.Series( svec )

tops = diff.order(ascending=False)[:5]
print("TOPS\n")
for i in tops.index:
    print( " ".join(allrev[i]), end="\n\n")

bots = diff.order()[:5]
print("BOTTOMS\n")
for i in bots.index:
    print( " ".join(allrev[i]), end="\n\n")
TOPS

yvonne at enlighten salon gave me my first first perm in over 000 years !! shhh dont tell because i recieved so many compliments that were follows you should quit straightening your hair it looks beautiful natural !" awesome !!!!

everything about this place top notch atmosphere service food second none will definately be coming back time time again

very yummy food friendly staff quick service good coffee ... great spot breakfast !

i went in get fitted running shoes decided check them out im soo glad i did they were helpful knowledgable i left with some great running tips shoes i chose were great price i didnt have buy into any membership best of all i love my shoes !!

000 years randy courtney valleywide properties has been serving clients find their dream home not just house that in area they take time listen your needs find area of valley that will be suited you our families lifestyle whether you looking buy sell your home let expert team at courtney valleywide take care of you

BOTTOMS

went lunch place smelled bad like cig smoke we ordered we were told that they didnt have food i ordered so we offered pay our drinks waitress just we were leaving stuck our money in her pocket not register i will rate this joint 000 bad service no food stunk

i went this bar new years eve they had put up ad online saying that was suppose be free before certain time me my friends got there well before time slot they said that they were charging all guys $ 000 still even though it said something different on their own website they played all guys that went before that time slot with classic bait switch scheme which definitely illegal but there no reason go court over $ 000 which they know which why they thought did get away with it because we all had planned go there some of our friends were already in there so we kind of were forced pay extra $ 000 get in even though we were some of first about 000 people go in there

awful no service food not good anymore plus they give old stale bread dust webs on ceiling fans no one cares

this was worst customer service ever ..... owners son was extremely rude disrespectful ..... will never go back !!!!

was going order online .. entire process was such hassle in end i was told i cannot order online .. i guess i can take my business money elsewhere .. if there was way rate with negative stars i certainly would !!

Now, the same thing but on a fine scale

This re-uses most of the code from above, replacing the coarse pos/neg grouping with classification amongst the 1-5 star ratings.

In [21]:
docgrp_fine = {str(i) : [i] for i in range(1,6)}
docgrp_fine
Out[21]:
{'1': [1], '2': [2], '3': [3], '4': [4], '5': [5]}
In [22]:
for g in docgrp_fine:
    print(g, end=": ")
    reviews[g] = list(YelpReviews(docgrp_fine[g]))
    model[g] = deepcopy(jointmodel)
    trainW2V( g )
1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
5: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
4: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
3: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
In [23]:
for g in docgrp_fine:
    testrev[g] = list( YelpReviews(docgrp_fine[g], "test") )
    probs[g] = getprobs(testrev[g], docgrp_fine)
    yhat[g] = probs[g].idxmax("columns")
In [24]:
mc_fine = pd.DataFrame({
    'mcr': {g: (yhat[g] != g).mean() for g in docgrp_fine},
    'n': {g: len(testrev[g]) for g in docgrp_fine}
    })
print(mc_fine)
ntest = mc_fine['n'].sum()
overall_fine = mc_fine.product("columns").sum()/ntest
print("\nOverall Fine-Scale MCR: %.3f" %overall_fine)
        mcr     n
1  0.265126  2380
2  0.624328  2047
3  0.681643  2849
4  0.548017  6883
5  0.268160  8797

Overall Fine-Scale MCR: 0.435

Finally, write the phrases to disk for linear model analysis. We use R, because scikit-learn doesn't seem to have a fast L1 path algorithm for logistic regression. The models are then fit via the gamlr package, with the penalty size selected via corrected AIC (AICc), following the code in linmod.R.

In [25]:
phraser = Phrases(allsentences,threshold=5.0)
In [26]:
for w in phraser[allsentences[0]]:
    print(w, end=" ")
mostly good service all over !! i also love walk over urban outfitter !! my_favorite of them all !!! * then h_& m so on !! : 
In [27]:
i = 0
fout = open("data/yelp_phrases.txt", "w")
for samp in ["train","test"]:
    for stars in range(1,6):
        if samp == "train":
            rev = reviews[str(stars)]
        else:
            rev = testrev[str(stars)]
        for r in rev:
            for s in r:
                for w in phraser[s]:
                    if "|" not in w:
                        fout.write("%d|%s|%d|%s\n" % (i,w,stars,samp))
            i += 1

Also, output the per-review scores for reference.

In [28]:
Pfine = {}
for stars in range(1,6):
    print(stars)
    Pfine['train%d'%stars] = getprobs(reviews[str(stars)], docgrp_fine)
    Pfine['test%d'%stars] = getprobs(testrev[str(stars)], docgrp_fine)

pmatfine = pd.concat( [Pfine['train%d'%s] for s in range(1,6)] +
                      [Pfine['test%d'%s] for s in range(1,6)] )
pmatfine.to_csv("data/yelpw2vprobs.csv", index=False)
1
2
3
4
5
In [29]:
ntrain = [len(reviews[g]) for g in docgrp_fine]
ntest = [len(testrev[g]) for g in docgrp_fine]
In [30]:
sum(ntrain)+sum(ntest)
Out[30]:
252863
In [31]:
i
Out[31]:
252863

And finally, fit a full word2vec model on all sentences and export per-review average word vectors to R.

In [32]:
fullmodel = deepcopy(jointmodel)
In [33]:
fullmodel.min_alpha = fullmodel.alpha
for epoch in range(10):
    print(epoch, end=" ")
    np.random.shuffle(allsentences)
    fullmodel.train(allsentences)
    fullmodel.alpha *= 0.9
    fullmodel.min_alpha = fullmodel.alpha
print(".")
0 1 2 3 4 5 6 7 8 9 .
In [34]:
def aggvec(rev):
    ## average the word vectors over each sentence, then over the review
    av = np.zeros(fullmodel.layer1_size)
    ns = 0.0
    for s in rev:
        sv = np.zeros(fullmodel.layer1_size)
        nw = 0.0
        for w in s:
            nw += 1.0
            try:
                sv += fullmodel[w]
            except KeyError:
                # print("%s is not in vocab"%w)
                pass
        if nw > 0.0:
            av += sv/nw
            ns += 1.0  # count contributing sentences, so av below is an average not a sum
    if ns > 0:
        av = av/ns
    return av
In [35]:
i = 0
AV = np.zeros((sum(ntrain)+sum(ntest), fullmodel.layer1_size))
for samp in ["train","test"]:
    for stars in range(1,6):
        if samp == "train":
            rev = reviews[str(stars)]
        else:
            rev = testrev[str(stars)]
        for r in rev:
            AV[i,:] = aggvec(r)
            i += 1
            if np.remainder(i,10000) == 0:
                print(i)
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000 110000 120000 130000 140000 150000 160000 170000 180000 190000 200000 210000 220000 230000 240000 250000
In [36]:
np.savetxt("data/yelp_vectors.txt", AV, delimiter="|", fmt='%.6f')
In [37]:
i
Out[37]:
252863

Alternatively, try doc2vec analysis

In [38]:
from gensim.models.doc2vec import *

def YelpLabeledSentence( stars = [1,2,3,4,5], prefix="train" ):
    for nstar in stars:
        i = 0
        for line in open("data/yelp%s%dstar.txt"%(prefix,nstar)):
            line = alteos.sub(r' \1 . ', line).rstrip("( \. )*\n")
            lab = "%s-%d-%d" % (prefix, nstar, i)
            rev = [s.split() for s in line.split(" . ")]
            i += 1
            for s in rev:
                yield LabeledSentence(s, [lab])
In [39]:
trainsent = list(YelpLabeledSentence())
testsent = list(YelpLabeledSentence(prefix="test"))
In [40]:
mdm0 = Doc2Vec(workers=4, size=100, window=5, dm=0)
mdm1 = Doc2Vec(workers=4, size=100, window=5, dm=1)
%time mdm0.build_vocab(trainsent+testsent)
%time mdm1.build_vocab(trainsent+testsent)
CPU times: user 23.1 s, sys: 111 ms, total: 23.2 s
Wall time: 23.2 s
CPU times: user 23.1 s, sys: 38.7 ms, total: 23.1 s
Wall time: 23.1 s
In [41]:
def trainD2V(mod, sent, T=20):
    mod.min_alpha = mod.alpha
    for epoch in range(T):
        print(epoch, end=" ")
        np.random.shuffle(sent)
        mod.train(sent)
        mod.alpha *= 0.9
        mod.min_alpha = mod.alpha
    print(".")
In [42]:
%time trainD2V(mdm0, trainsent)
%time trainD2V(mdm1, trainsent)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
CPU times: user 49min 10s, sys: 4min 30s, total: 53min 40s
Wall time: 26min 29s
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
CPU times: user 56min 4s, sys: 4min 4s, total: 1h 8s
Wall time: 27min 39s
In [43]:
## turn off training of word vecs, just score label vecs
mdm0.train_words = False
mdm1.train_words = False
%time trainD2V(mdm0, testsent)
%time trainD2V(mdm1, testsent)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
CPU times: user 4min 2s, sys: 29.6 s, total: 4min 31s
Wall time: 2min 30s
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 .
CPU times: user 4min 6s, sys: 26.7 s, total: 4min 33s
Wall time: 2min 23s
In [44]:
def writeD2V(mod, fname, prefix):
    v = []
    y = np.empty(0)
    x = np.empty([0, mod.syn0.shape[1]])
    for stars in range(1,6):
        labs = [ w for w in mod.vocab if re.match("%s-%d-\d+"%(prefix,stars), w) ]
        v += labs
        i = [mod.vocab[w].index for w in labs]
        y = np.append(y, np.repeat(stars,len(i)))
        x = np.vstack( (x, mod.syn0[i,:]) )
        veclab = ["x%d"%d for d in range(1,x.shape[1]+1)]
    df = pd.DataFrame( x, index=v, columns=veclab )
    df["stars"] = y
    df.to_csv("data/%s.csv"%fname, index_label="id")
In [45]:
for prefix in ["train","test"]:    writeD2V(mdm0, "yelpD2V%s0"%prefix, prefix)    writeD2V(mdm1, "yelpD2V%s1"%prefix, prefix)

linear modelling in R

The rest -- forward linear and logistic modelling -- will happen in R to make sure we're comparing linear apples to linear apples.

In [50]:
!Rscript code/linmod.R
[1] 252855
Read 228235 rows and 102 (of 102) columns from 0.433 GB file in 00:00:04
Read 228235 rows and 102 (of 102) columns from 0.434 GB file in 00:00:04

**** W2V INVERSION ****
** COARSE **
mcr 0:0.247, 1:0.064, overall: 0.099 diff: 0.099 deviance: 0.6818865
** NNP **
mcr (0,2]:0.154, (2,3]:0.783, (3,5]:0.093, overall: 0.19 diff: 0.25 deviance: 1.385381
** FINE **
mcr 1:0.265, 2:0.624, 3:0.682, 4:0.548, 5:0.268, overall: 0.435 diff: 0.598 deviance: 2.912396

*** COUNTREG ***
** COARSE **
Warning message:
In gamlr(x[-test, ], ycoarse[-test], family = "binomial", lmr = 0.001) :
  numerically perfect fit for some observations.
mcr 0:0.324, 1:0.027, overall: 0.084 diff: 0.084 deviance: 0.4127019
** NNP **
mcr (0,2]:0.453, (2,3]:0.844, (3,5]:0.012, overall: 0.2 diff: 0.282 deviance: 1.023203
** FINE **
mcr 1:0.479, 2:0.754, 3:0.773, 4:0.359, 5:0.233, overall: 0.41 diff: 0.575 deviance: 2.022462

*** W2V and COUNTREG NNP ***
mcr (0,2]:0.31, (2,3]:0.787, (3,5]:0.035, overall: 0.181 diff: 0.235 deviance: 0.9588185

*** D2V ***
** COARSE
dm0 **
mcr 0:0.731, 1:0.005, overall: 0.145 diff: 0.145 deviance: 0.5881288
dm1 **
mcr 0:0.929, 1:0.002, overall: 0.181 diff: 0.181 deviance: 0.7750106
dm both **
mcr 0:0.751, 1:0.004, overall: 0.149 diff: 0.149 deviance: 0.593779
** NNP
dm0 **
mcr (0,2]:0.82, (2,3]:0.991, (3,5]:0.001, overall: 0.282 diff: 0.441 deviance: 1.233308
dm1 **
mcr (0,2]:0.95, (2,3]:1, (3,5]:0.001, overall: 0.308 diff: 0.492 deviance: 1.418819
dm both **
mcr (0,2]:0.836, (2,3]:0.992, (3,5]:0.001, overall: 0.285 diff: 0.447 deviance: 1.245474
** FINE
dm0 **
mcr 1:0.779, 2:0.962, 3:0.967, 4:0.17, 5:0.425, overall: 0.501 diff: 0.773 deviance: 2.311966
dm1 **
mcr 1:0.93, 2:0.995, 3:0.994, 4:0.208, 5:0.465, overall: 0.549 diff: 0.872 deviance: 2.557261
dm both **
mcr 1:0.799, 2:0.961, 3:0.969, 4:0.168, 5:0.43, overall: 0.504 diff: 0.781 deviance: 2.327122

*** MNIR ***
fitting 5 observations on 72484 categories, 6 covariates.
converting counts matrix to column list...
distributed run.
socket cluster with 6 nodes on host ‘localhost’
** COARSE **
Warning message:
In gamlr(zir[-test, ], ycoarse[-test], lmr = 1e-04, family = "binomial") :
  numerically perfect fit for some observations.
mcr 0:0.301, 1:0.047, overall: 0.096 diff: 0.096 deviance: 0.4768655
** NNP **
mcr (0,2]:0.65, (2,3]:0.986, (3,5]:0.009, overall: 0.254 diff: 0.383 deviance: 1.224557
** FINE **
mcr 1:0.61, 2:0.971, 3:0.983, 4:0.295, 5:0.313, overall: 0.48 diff: 0.731 deviance: 2.299092

*** W2V AGGREGATION ***
** COARSE **
Warning message:
In gamlr(aggvec[-test, ], ycoarse[-test], family = "binomial", lambda.min.ratio = 0.001) :
  numerically perfect fit for some observations.
mcr 0:0.501, 1:0.027, overall: 0.118 diff: 0.118 deviance: 0.5421111
** NNP **
mcr (0,2]:0.649, (2,3]:0.919, (3,5]:0.013, overall: 0.248 diff: 0.37 deviance: 1.176924
** FINE **
mcr 1:0.635, 2:0.869, 3:0.85, 4:0.361, 5:0.27, overall: 0.461 diff: 0.695 deviance: 2.215192

null device
          1
null device
          1
null device
          1
null device
          1
In [55]:
Image(filename='paper/graphs/yelp_logistic.png', width=600,height=300)
Out[55]:
In [ ]:
 
