eBay User feedback clustering in R

Author: Zhao, Kevin

Abstract

Understanding eBay users' feedback is important for improving our site service. However, extracting useful information from a large volume of user feedback is not an easy task. Some PMs and site analysts sample hundreds of feedback entries and read them one by one to get a general idea of what users are saying, which is extremely time-consuming and inefficient.

To deal with this problem, we use machine learning and natural language processing (NLP) methods to cluster user feedback into groups and then generate the main topics of each cluster. We also built a service tool with a user interface that shows the main topics of each cluster, basic statistics for each experiment, and sampled user feedback. With this tool, site analysts and PMs can get a clear general idea of what our customers are talking about in just 5 seconds. A new feature of the tool is a word cloud section: users can see which words are mentioned most after an experiment is launched, which helps them better understand the experiment's effects.

This article explains the clustering algorithm and the natural language processing method, and introduces some useful functions of our feedback clustering tool.

Keywords: nlp, k-means clustering, feedback, word cloud

User Story and Request

PMs and site analysts often complain that it is very hard to learn from the comments of our eBay site users: some users talk about shipping, others talk about item image problems, and others mention bad search results. After launching an experiment, people need to go through a lot of user feedback to get a general idea of which feature makes users unhappy or which part users complain about most. It is tedious and time-consuming.

Methodology

Based on this real problem, our basic idea is to cluster user feedback into groups, find the topics within each group, and then sample data from each group and present it to PMs and analysts. The methodology is explained in two main sections: NLP processing of user feedback, and the k-means method for clustering.

1. NLP processing of user feedback

A. Dump user feedback from Oracle DB into R

The first step is to get all user feedback from the Oracle DB and load it into R. We chose R to implement our machine learning algorithms because it is commonly used in both academia and industry, and it has many useful packages that we can use directly to process data.

R package: RJDBC

Description: the RJDBC package connects R to the Oracle DB. We can write SQL directly in R and load all the data into R with a few lines of R code.

The R code is provided here; you can apply it whenever you want to load data from a Teradata DB or Oracle DB into R.

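library(RJDBC)  # also attaches rJava, which provides .jinit()

# Path to the Oracle JDBC driver jar
cp <- c("classes12-10.2.jar")

.jinit(classpath=cp)

drv <- JDBC("oracle.jdbc.driver.OracleDriver")

conn <- dbConnect(drv, "jdbc:oracle:thin:@qudb.vip.arch.ebay.com:1521/QUDB", "name", "password")

# Keep only responses whose Q2 (free-text feedback) column is not blank
sql <- "select * from survey_response_detail_v where survey_id=5000001476 and Q2 <> ' ' "

res <- dbSendQuery(conn, statement=sql)

# n = -1 fetches all rows; the result is named datasheet to match the cleaning code below
datasheet <- fetch(res, n = -1)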

B. Clean data and generate feature matrix using NLP

Once we have the feedback data, we need to clean it before using it for clustering, for example by removing stop words and stemming. R packages: tm and Snowball

Description of packages: tm and Snowball provide many ready-made functions for cleaning natural language text in R.

We process the data in 7 steps:

1. Put all words into lower case

We convert all words to lower case.

2. Remove stop words

We remove all stop words, such as "I'm" and "who's".

3. Remove punctuation

Remove all punctuation. We can create a separate feature column for special punctuation marks such as "!" and "?", because they are important indicators of the user's mood.

4. Remove numbers

We do not think numbers are useful for clustering user feedback.

5. Do stemming

We keep only the stem of each word. For example, "running" and "runs" are both reduced to "run", so feedback containing any of these words can be grouped together when we do the clustering.

6. Eliminate extra white space

This step makes the user feedback cleaner.

7. Generate data matrix for next clustering step

Finally, we generate a document-term matrix in R, with one row per feedback and one column per term.

R code is provided here for your reference:

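library(tm)

library(SnowballC)  # stemmer used by stemDocument() in current tm releases

## datasheet is an R object that contains all user feedback; we take the Q2 column, which holds the feedback sentences

reuters <- Corpus(VectorSource(datasheet$Q2))

## Convert to lower case (content_transformer keeps the corpus structure intact in tm 0.6 and later)

reuters <- tm_map(reuters, content_transformer(tolower))

## Remove Stopwords

reuters <- tm_map(reuters, removeWords, stopwords("english"))

## Remove Punctuations

reuters <- tm_map(reuters, removePunctuation)

## Remove Numbers

reuters <- tm_map(reuters, removeNumbers)

## Stemming

reuters <- tm_map(reuters, stemDocument)

## Eliminating Extra White Spaces

reuters <- tm_map(reuters, stripWhitespace)

## Inspect the first few raw feedback sentences

head(datasheet$Q2)

## Generate the document-term matrix used for clustering

dtm <- DocumentTermMatrix(reuters)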
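Step 3 mentions keeping "!" and "?" as separate feature columns. The article does not show how this was done, so the following is only a minimal base-R sketch of one way to do it. The `feedback` vector and the `count_char` helper are hypothetical names used for illustration, and the counts have to be taken from the raw text before punctuation is removed.

```r
# Hypothetical sample standing in for the raw datasheet$Q2 column.
feedback <- c("Why are there toys in my search results?!",
              "Shipping was slow.",
              "Great!!! Love it!")

# Count how many times a fixed character occurs in each string.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

# One extra feature column per punctuation mark of interest.
punct_features <- data.frame(
  n_exclaim  = count_char(feedback, "!"),
  n_question = count_char(feedback, "?")
)

print(punct_features)
```

These columns could then be bound to the term matrix (for example with `cbind`) before normalization, if the punctuation signal turns out to be useful.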

2. K-means clustering method to cluster user feedback

A. Find the appropriate number of clusters

For any clustering problem, the first step is to choose how many clusters to use. We cannot simply guess a value; we need to let the data tell us how many clusters fit it best.

Here is how we let the data tell us the number of clusters to use:

We iterate the number of clusters from 1 to 20 and calculate the within-group sum of squares (wss) for each value, then plot all the wss values to see which number of clusters to choose. The main idea is that if the wss is small, the items within each group are very similar to each other. Note that the more clusters we chop the data into, the smaller the wss becomes, because each cluster contains less data and therefore has a smaller wss. So we choose the number of clusters based on two criteria: first, the wss should not change much when we add more clusters; second, the number of clusters should not be too large, because fewer clusters give a more general picture of the data.

Figure 1 is the within-group sum of squares plot for a sample experiment. Here we might use 13 clusters, since that is a break point in the plot and it has a very low within-group sum of squares.

Figure 1: Finding the best number of clusters
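The procedure described above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the original script: the matrix `m_norm` is filled with synthetic data standing in for the normalized document-term matrix, and `nstart = 5` is an added choice to make the random starts more stable.

```r
set.seed(42)  # k-means starts from random centers

# Hypothetical stand-in for the normalized document-term matrix:
# 60 documents, 5 term columns, three well-separated groups.
m_norm <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 5),
  matrix(rnorm(100, mean = 4), ncol = 5),
  matrix(rnorm(100, mean = 8), ncol = 5)
))

# Total within-group sum of squares for k = 1..20.
max_k <- 20
wss <- numeric(max_k)
for (k in 1:max_k) {
  wss[k] <- kmeans(m_norm, centers = k, nstart = 5)$tot.withinss
}

# Look for the point where the curve flattens out.
plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Within-group sum of squares")
```

On real feedback data the curve is usually less clean than on this synthetic example, so the break point is a judgment call rather than an exact answer.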

B. K-means clustering

We use the kmeans function in R to do the clustering, sort the clusters by within-group sum of squares, and then output the clustered data.

1. Normalize the data before clustering. We need to normalize our data so that Euclidean distance makes sense when we do the clustering. This step is done before we choose the number of clusters. The scale function in R does this easily.

2. Do clustering using 13 clusters

Sample R code is provided below:

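# m_norm is the normalized feature matrix from step 1, for example scale(as.matrix(dtm))

cl <- kmeans(m_norm, 13)

# Within-group sum of squares divided by cluster size

norm.withinss <- cl$withinss/cl$size

rank0 <- data.frame(ss=norm.withinss, cluster=c(1:13))

# Sort clusters from tightest to loosest

rank0 <- rank0[order(rank0$ss),]

ssorder <- rank0$cluster

datasheet$cluster <- cl$cluster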
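Once every feedback has a cluster id, sampling a few comments per cluster is straightforward. The sketch below assumes a small hypothetical `datasheet` and `ssorder` in place of the real objects, and the helper `sample_cluster` is an illustrative name, not part of the original tool.

```r
set.seed(1)

# Hypothetical stand-ins for the objects built by the clustering code.
datasheet <- data.frame(
  Q2 = paste("comment", 1:12),
  cluster = rep(1:3, times = c(6, 4, 2)),
  stringsAsFactors = FALSE
)
ssorder <- c(2, 3, 1)  # cluster ids, tightest cluster first

# Draw up to n random rows from one cluster.
sample_cluster <- function(df, id, n = 3) {
  rows <- which(df$cluster == id)
  picked <- rows[sample.int(length(rows), min(n, length(rows)))]
  df[picked, , drop = FALSE]
}

# Stack the samples in ssorder order.
samples <- do.call(rbind, lapply(ssorder, function(id) sample_cluster(datasheet, id)))

print(samples)
```

Using `sample.int` on the row count avoids the surprise that `sample(x, ...)` treats a single number `x` as the range `1:x`.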

C. More methods on clustering

We also use LDA (a topic model) to do the natural language clustering. You can try the "topicmodels" package in R, which is easy to use and implement. Other clustering methods could also be used and might give better clustering results.
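As a rough illustration of the topic-model alternative, here is a minimal sketch that assumes the tm and topicmodels packages are installed. The sample comments, the choice of `k = 2` topics, and the seed are assumptions made for the example and do not come from the original analysis.

```r
library(tm)
library(topicmodels)

# Hypothetical feedback comments.
feedback <- c("shipping was slow and the package arrived late",
              "late shipping again, the package took weeks",
              "search results show toys when looking for cars",
              "bad search results, wrong cars and wrong items",
              "package shipping cost too high",
              "search for cars returns parts and toys")

corpus <- VCorpus(VectorSource(feedback))
dtm <- DocumentTermMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         stopwords = TRUE))

# LDA needs at least one non-zero entry per document.
dtm <- dtm[slam::row_sums(dtm) > 0, ]

lda <- LDA(dtm, k = 2, control = list(seed = 1))

print(terms(lda, 3))  # top 3 terms per topic
print(topics(lda))    # most likely topic for each document
```

Unlike k-means, LDA works on raw term counts, so the matrix is not passed through `scale` first.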

Feedback Clustering Tool/User Interface

To present the results to PMs and analysts, I built a user interface on top of the clustering results using only R. People can select a cluster and see sample feedback from it. Several other useful features are built into the tool. For example, a PM can see a snapshot of the page the customer was viewing when leaving the comment, along with the current state of that page, and compare them to check whether the problem has been solved. Users can also sort the feedback and download the feedback data of each cluster with one click. It is worth mentioning that the whole user interface is built in R, and the code is fairly simple and easy to implement. Because the entire workflow is in R, we can also output statistical analysis directly into this UI, and the approach generalizes easily.

A. Random Sample section

[description]

Figure 2 is a snapshot of the UI. In the left panel, PMs and analysts select the feedback data set they want to use, the clusters they want to see, and other options; the sampled user comments then appear on the right. They can also get a general idea of what people are talking about just by looking at the cluster names, because we use the most frequently mentioned words as the title of each cluster.

[real user cases]

We take search feedback as an example. If you click on the cluster drop-down box, it lists all the topics found by the clustering. In Figure 3, you can see a number of clusters such as cluster1. cheap.game and cluster3. car.look. Each topic is represented by the top 2 keywords in its group. If you choose cluster3 (car.look), you can expect complaints about cars, since we only collect low-score feedback in this tool. In the samples in the right panel, you can then see feedback on the search results such as "IF A PERSON TYPES IN CARS, THEY DON.T WANT TO SEE PARTS OF A CAR." (the last one), or the one above it: "when im looking for mustang wheels i dont need to see a matchbox toy car". These users are not happy with the search results, so this view is a great help in finding problematic search results.
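Cluster titles of the form cluster3. car.look can be derived from the term counts of each cluster. The following base-R sketch shows one way to do this; the small matrix `m`, the `cluster` vector, and the helper `top_terms_label` are hypothetical and only illustrate the idea of naming a cluster after its top 2 terms.

```r
# Hypothetical stand-in for as.matrix(dtm):
# rows are feedbacks, columns are stemmed terms.
m <- matrix(c(3, 2, 0, 0,
              2, 1, 0, 1,
              0, 0, 4, 1,
              0, 1, 1, 3),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("car", "look", "cheap", "game")))
cluster <- c(1, 1, 2, 2)  # cluster id of each feedback

# Build a title from the n most frequent terms of one cluster.
top_terms_label <- function(m, cluster, id, n = 2) {
  totals <- colSums(m[cluster == id, , drop = FALSE])
  top <- names(sort(totals, decreasing = TRUE))[seq_len(n)]
  paste0("cluster", id, ". ", paste(top, collapse = "."))
}

labels <- vapply(sort(unique(cluster)),
                 function(id) top_terms_label(m, cluster, id),
                 character(1))

print(labels)
```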

Figure 2: Feedback Clustering User Interface

B. Statistics summary section

In Figure 4, the main statistics of the user feedback scores are generated directly from R. The tool gives the mean feedback score for the test and control groups and runs a t-test on the feedback scores. For this specific test (experiment 7451), the test group has a significantly worse user feedback score than the control group, which means users were unhappy after this test was launched and left bad feedback.
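The comparison described above maps onto the `t.test` function in base R. The sketch below uses simulated scores on an assumed 1 to 5 scale; the numbers are not data from experiment 7451, and the variable names are illustrative.

```r
set.seed(7)

# Hypothetical feedback scores on a 1-5 scale; not real experiment data.
control_score <- sample(1:5, 200, replace = TRUE,
                        prob = c(0.10, 0.15, 0.25, 0.30, 0.20))
test_score    <- sample(1:5, 200, replace = TRUE,
                        prob = c(0.25, 0.25, 0.20, 0.20, 0.10))

# Welch two-sample t-test (the default when var.equal is not set).
result <- t.test(test_score, control_score)

print(result$estimate)  # mean score of test and control
print(result$p.value)   # a small p-value suggests the means differ
```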

Figure 3: search category user story

Figure 4: Feedback Clustering Statistical Summary Section

Figure 5: Feedback Clustering Word Cloud Section

C. Test/control word difference section

Figure 5 shows the word cloud part of the tool. The word cloud shows the words that appear more often in feedback from the test group than in feedback from the control group. For example, the term "jewelry" appears in the word cloud, which means that after this experiment was launched, people mentioned jewelry more than before. Since we only collected bad feedback (feedback score = 1), this suggests that people complained more about jewelry after the experiment was launched. PMs can use this word cloud to check whether the experiment affected jewelry items.
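The article does not show how the test/control word difference is computed, so the following is a sketch of one plausible approach in base R: compare the relative frequency of each word in the two groups and keep the words that are more frequent in the test group. The comments and the helper names `tokenize` and `rel_freq` are hypothetical.

```r
# Split text into lower-case words, dropping empty tokens.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z]+"))
  words[nchar(words) > 0]
}

# Hypothetical low-score comments from each group.
control <- c("shipping too slow",
             "search shows wrong items",
             "bad search results")
test <- c("jewelry search is wrong",
          "cannot find jewelry",
          "jewelry photos too small",
          "shipping slow")

vocab <- sort(unique(c(tokenize(control), tokenize(test))))

# Share of all tokens in a group taken up by each vocabulary word.
rel_freq <- function(x) {
  counts <- table(factor(tokenize(x), levels = vocab))
  as.numeric(counts) / sum(counts)
}

freq_diff <- setNames(rel_freq(test) - rel_freq(control), vocab)

# Words that are relatively more frequent in the test group.
more_in_test <- sort(freq_diff[freq_diff > 0], decreasing = TRUE)

print(head(more_in_test, 5))
```

The resulting scores could then be passed to a plotting package to draw the cloud, with word size driven by the size of the difference.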

Summary

This article described a real use case at eBay, in which natural language processing and k-means clustering are used to group user feedback and give PMs and site analysts clearer, more general information. R was also used to build a user interface that presents the results and adds statistical features such as the "statistical summary" and the "word cloud". If you are interested in how to build the whole statistical analysis system and user interface in R, please stay tuned for my next article, "using R to build user interface and adding statistical analysis into your tool".

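* The copyright and/or intellectual property rights of this article belong to eBay Inc. If you wish to quote it, please contact us at DL-eBay-CCOE-Tech@ebay.com. This article is intended for academic discussion and exchange. If you believe that any information in it infringes your legal rights, please contact us at DL-eBay-CCOE-Tech@ebay.com and include in your notice the information required by national laws and regulations; after receiving your notice we will take action as soon as possible in accordance with national laws and regulations.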

0 0