Pivot tables in R with melt and cast

来源:互联网 发布:图片识别软件 编辑:程序博客网 时间:2024/06/16 15:40

Pivot Tables in R with melt and cast

If you are making the transition from Excel to R and still can’t figure out how to quickly obtain pivot tables like Excel has, this article is for you!

Actually it is pretty easy to produce Pivot Tables in R. All you need is a package called reshape by Hadley Wickham (yes, the same prolific author of plyr and ggplot2) and some understanding of how reshape “thinks” and works.

If you are interested in an alternative, easier but less powerful, method to create Pivot Tables in R using aggregate, you can read my other post here.

For this tutorial, we will be using a sample data set called Salespeople PivotTable report, which is used in many articles by Microsoft teaching the basics of Pivot Tables in Excel. The link to the sample data file can be found at the bottom of this article.

We will first obtain each result in Excel and then reproduce it with R. I recommend that you download the sample data file and follow along.

1. Download the sample data file

Use the link at the bottom of the article and save the file somewhere on your hard drive. So far so good!

2. Open the file in Excel and save a CSV copy of the first sheet

Double-click the file from the location where you saved it to open it inside Excel. The Excel worksheet has 3 tabs. We will be using the first one, containing the source data, and the last one, which contains a ready-made Pivot Table to play with. If you do not know how to make Pivot Tables in Excel, I recommend you read the Microsoft tutorial. In this article we will focus on how to obtain the same results with R and we will use the ready-made pivot table on sheet 3.

Once you are on the first sheet of the sample data set (the one with the source data), save a copy of it as CSV (Comma Separated Values) file. This will generate a file which is easier to import into R. R has packages which allow you to read directly Excel files without converting them to CSV, but for the sake of simplicity we will stick to CSV.

Before saving as CSV, change the format of column E (Order Amount) to General so that the orders amounts appear as numbers with 2 decimals and not with a $ sign in front and comma separated thousands. If you export to CSV without doing this first, there will be extra steps required in R to convert the sales amounts to numeric (see the note at the end of the article).

3. Load the CSV file in R

This is an easy step accomplished with the read.CSV function as follows.

file_name.txt is the name of the CSV file you created from Excel. If it is not in the current working directory in R, you can either change R’s working directory with setwd or move the file to the current working directory or add a path to the file before the file name.

At this point you should have the same base data open in Excel and loaded into R. In R the first lines will look like this:

4. Your first Pivot Table in R with melt and cast

Here we go. This is what we have been waiting for, the rest was just preparation to get to this point. Ready?

We will re-produce in R a pivot table like the default one that appears in the 3rd tab of the sample Excel file and which looks like this.

Capture_1

Basically it shows how much each sales person has sold in total across all orders and and countries (note that the Country filter is set to All). It also shows a Grand Total which is the sum of all sales.

To get the job done in R we will use the reshape package. There are other packages available which can achieve the same results, but reshape is particularly versatile and easy to use once you have grasped some basic concepts. (And, hey, it has been written by Hadley Wickham, so we _have_ to use it!)

We can install (if not already installed) and load the reshape package with:

If you have ggplot2 installed and loaded, reshape will already be available.

The reshape package contains two functions that are key to easily generate pivot tables in R. They are melt and cast.

melt transforms a data frame from the original format to a so called long format, where all the observed variables (called measures) appear, together with their respective value, in two adjacent columns named variable and value. Each row of this new data format is identified by a unique combination of the id variables, also part of the original data frame.

An example will clarify how melt works. Let’s take the first 6 rows of our data set.

It contains 5 columns. The first 4 identify each order by a combination of Country, Salesperson, Order.Date and OrderID. These are all non-numeric and there are no calculations we can do on them except, maybe, counting their frequency. Using melt‘s terminology, Country, Salesperson, Order.Date, OrderID are id variables, while Order.Amount, which is a numeric and which is the one we would like to sum up in our pivot table, is a measure.

When melting your data, you can indicate multiple id variables and also multiple measure variables. The id variable will appear as-is in the resulting melted format, while the measure variable will be stacked in the variable column with their respective value in the valuecolumn.

Note that melt preserves all data. Nothing is lost or modified, just the way the data are collected within the data frame. It is therefore possible to “un-melt” the melted data and go back to the original format at any time.

Let’s melt our sales data frame and see what the result looks like.

Pretty simple. Instead of listing the column names we have used their numerical identifiers. We have indicated to melt that columns 1 to 4 ( Country, Salesperson, Order.Date, OrderID) are id variables while column 5 (Order.Amount) is a measure variable.

Here is how the first 6 rows of data.m look like at this point:

As said the column Order.Amount has been “melted” into a variable column with the respective values in the value column.

Now that our data have been melted (and you have to do this only once, unless you change idea on what you want as id and as measure), we are ready to cast them to build the pivot table.

cast requires us to indicate, beside a reference to the melted data, how we want to re-aggregate the values.

The basic syntax is (and, yes, we are omitting some parameters that we won’t need in this tutorial):

formula and fun.aggregate are the most important two because they indicate how we want to reshape the data and which functions to use for aggregating the values.

Think of  formula as saying:

While fun.aggregate says how you want to aggregate the values of the variable(s) in order to reshape them as described by formula. This is the equivalent of selecting “Count of”, or “Sum of” etc. in Excel.

Going back to our example will make this point clearer.

We want to produce a pivot table which contains the total sales for each sales person. Therefore we chose Salesperson as row and variable (which is Sales.Amount) as column. The aggregate function will be sum to obtain the sum of the variable (Sales.Amount) for each sales person. Here the command:

And here is the resulting pivot table:

As expected (or not?!?) is identical to the Excel one. Congratulations on your first pivot table in R!

But wait before you pat yourself on the shoulder. In Excel there is an extra row showing the Grand Total and we don’t have it in R. How can we fix it?

5. Adding Grand Totals

That’s easy actually. Look back at the syntax for cast. There is an argument called margins which comes to rescue. margins accepts a vector that can contain the strings “grand_col” and “grand_row”. This adds an extra column or extra row (or both) with a grand total to the resulting data frame. The grand total is obtained through the same fun.aggregate.

Let’s add a row with the grand total.

And here it is, identified by (all):

Well done!

6. And how to Filter

Ok, ok. You are a picky one! I know you noted that our pivot table lacks one more feature respect to the Excel one. It has in fact no ability to Filter by Country. Did you get it yet? No!?!

Yes, the solution is in the other parameter for cast which we did not consider so far (beside those we omitted): subset.

subset allows us to limit the casting to only a subset of the melted data selected according to certain criteria. It requires a vector of TRUE or FALSE corresponding to the rows of the melted data we want to select (TRUE) or exclude (FALSE) before the casting is performed.

Say that we want to calculate the performance only for the American sales people. We need to select only those rows for which Country==”USA”. This is exactly the additional condition we need to specify within cast and it corresponds to selecting the Filter Country = USA a the top of the Excel pivot table.

And here is the filtered pivot. The grand total has also been adjusted accordingly.

Congratulations, now we are really done! :-)

I hope you have enjoyed this tutorial on how to make simple pivot tables in R. Please leave a comment below if you would like to see more tutorials on the same topic.

Till next time!

 

Note: Stripping $ and commas and converting to numeric

Ok, so you decided not to change the format of Column E (Amount) to General before saving the table as CSV. I had warned you. Now your punishment will be terrible and will include… regular expressions! Read on.

Once you open the CSV file in R, you can verify that the column Order.Amount has indeed been imported as a Factor. In order to be able to do calculations on it, we need to convert it to numeric first. This can be easily achieved using the function as.numeric, however the $ sign at the beginning and the commas separating the thousands will cause it to generate unexpected results. Before running through as.numeric we need therefore to strip both the $ sign and the commas. This can be quickly done using gsub and regular expressions. Teaching you regular expressions goes beyond the scope of this article, but the following code does the job.

Obviously the two steps above can be combined into one by nesting gsub into as.numeric.


Reference: http://marcoghislanzoni.com/blog/2013/10/11/pivot-tables-in-r-with-melt-and-cast/

0 0
原创粉丝点击