data.table Way - Learning Note Part 1


This is a learning note for the course Data Analysis in R, the data.table Way.
The course introduces the data.table package, a very powerful tool for data manipulation. I use two built-in data frames, mtcars and iris, throughout the article.

library('data.table')

Startup

If this is your first time using the data.table package, a good starting point is to compare it with data.frame. The fundamental functionality of the two is quite similar.

Create a data.table

To create a data.table, we use the same method as we do for creating a data.frame.

dt1 <- data.table(A = letters[1:6], B = 1:2, C = 8L)
df1 <- data.frame(A = letters[1:6], B = 1:2, C = 8L)
dt1

##    A B C
## 1: a 1 8
## 2: b 2 8
## 3: c 1 8
## 4: d 2 8
## 5: e 1 8
## 6: f 2 8

df1

##   A B C
## 1 a 1 8
## 2 b 2 8
## 3 c 1 8
## 4 d 2 8
## 5 e 1 8
## 6 f 2 8

Columns B and C are filled by recycling: the shorter inputs (1:2 and 8L) are repeated to match the 6 rows of A.

Now let's check the class of dt1:

class(dt1)

## [1] "data.table" "data.frame"

A data.table inherits from the base class data.frame: the data.table package defines data.table as a subclass of data.frame.
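Because of this inheritance, functions written for data.frame generally accept a data.table as well. A minimal sketch to check this (dt_demo is a hypothetical name used only for illustration):

```r
library(data.table)

# dt_demo is a hypothetical name used only in this sketch
dt_demo <- data.table(A = letters[1:6], B = 1:2, C = 8L)

# A data.table is also a data.frame
is.data.frame(dt_demo)
inherits(dt_demo, "data.frame")

# so base functions written for data.frame still work
nrow(dt_demo)
summary(dt_demo$B)
```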

Convert a data.frame or a matrix

dt2 <- as.data.table(df1)                     # convert a data.frame to data.table
dt3 <- as.data.table(matrix(1:9, ncol= 3))    # convert a matrix to data.table
class(dt2)

## [1] "data.table" "data.frame"

class(dt3)

## [1] "data.table" "data.frame"

Take a look

Head, tail, fire at will.

# now we start using mtcars & iris
dt_mtcars <- as.data.table(mtcars)
dt_iris <- as.data.table(iris)
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(iris)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

BEFORE MOVING TO THE NEXT PART
Several things need to be mentioned:
1. dt[i, j, by] - the standard subsetting form, where dt is a data.table;
2. .() - an alias for list();
3. c() - combines values into a vector;
4. := - the assignment operator, which modifies columns by reference;
5. data.table special symbols:
- .N - number of rows (within the current group when by is used)
- .SD - Subset of Data (the S and D in .SD), excluding cols used in by (or keyby)
- .SDcols - an argument rather than a variable; it selects, by name or index, the columns included in .SD
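To see how these pieces fit together, here is a minimal sketch of the dt[i, j, by] form on mtcars. The names dt_demo, res, n and avg_mpg are chosen only for this illustration.

```r
library(data.table)

dt_demo <- as.data.table(mtcars)

# i:  keep the rows where cyl is 4
# j:  return a list (via .()) with the group size .N and the mean mpg
# by: compute j separately for each value of gear
res <- dt_demo[cyl == 4, .(n = .N, avg_mpg = mean(mpg)), by = gear]
res
```

Read it as: take dt_demo, subset rows with i, then compute j, grouped by by.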

Subsetting

Prepare the example data.table:

dtm <- copy(dt_mtcars)
dti <- copy(dt_iris)

Remember that subsetting a data.table is quite different from subsetting a data.frame. (dtm here is not a Document-Term Matrix, LOL.)

Subset the rows

Subset normally

Return the 3rd row of the mtcars data, then several rows at once:

dtm[3]               # mtcars[3, ]

##     mpg cyl disp hp drat   wt  qsec vs am gear carb
## 1: 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1

dtm[2:3]             # mtcars[2:3, ]

##     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 2: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

dtm[c(1,4,5)]        # mtcars[c(1,4,5), ]

##     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 3: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Because i is the first argument, we don't need to name it, and we don't need the trailing comma that data.frame requires. Keep in mind that the same three commands, written without a comma on a data.frame, subset columns rather than rows.
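The difference is easy to verify side by side. A small sketch (df_demo and dt_demo are names introduced only for this comparison):

```r
library(data.table)

df_demo <- mtcars
dt_demo <- as.data.table(mtcars)

# On a data.frame, a single index without a comma selects columns
names(df_demo[3])      # "disp"
dim(df_demo[2:3])      # 32 rows, 2 columns

# On a data.table, the same index selects rows
dim(dt_demo[3])        # 1 row, 11 columns
dim(dt_demo[2:3])      # 2 rows, 11 columns
```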

Subset by internal variable .N

.N is an internal variable holding the number of rows, so it can be used in i. Return the penultimate row of dtm:

dtm[.N-1]

##    mpg cyl disp  hp drat   wt qsec vs am gear carb
## 1:  15   8  301 335 3.54 3.57 14.6  0  1    5    8

Note that when you convert a data.frame to a data.table, you lose the row names.
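If the row names carry information, one option is the keep.rownames argument of as.data.table(), which stores them in a regular column. A minimal sketch (dt_rn is a hypothetical name):

```r
library(data.table)

# keep.rownames = TRUE stores the row names in a column called "rn"
dt_rn <- as.data.table(mtcars, keep.rownames = TRUE)
head(dt_rn[, .(rn, mpg)], 3)
```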

Subset the columns

j should be either a single column name or a list of column names wrapped in .(). To subset columns only, leave the first argument i empty but keep the comma.

dtm[, .(mpg, cyl)]

##      mpg cyl
##  1: 21.0   6
##  2: 21.0   6
##  3: 22.8   4
##  4: 21.4   6
##  5: 18.7   8
##  6: 18.1   6
##  7: 14.3   8
##  8: 24.4   4
##  9: 22.8   4
## 10: 19.2   6
## 11: 17.8   6
## 12: 16.4   8
## 13: 17.3   8
## 14: 15.2   8
## 15: 10.4   8
## 16: 10.4   8
## 17: 14.7   8
## 18: 32.4   4
## 19: 30.4   4
## 20: 33.9   4
## 21: 21.5   4
## 22: 15.5   8
## 23: 15.2   8
## 24: 13.3   8
## 25: 19.2   8
## 26: 27.3   4
## 27: 26.0   4
## 28: 30.4   4
## 29: 15.8   8
## 30: 19.7   6
## 31: 15.0   8
## 32: 21.4   4
##      mpg cyl

Note that the result is still a data.table. What if you want to drop the structure and keep only a vector? Pass the bare column name in j:

dtm[1:10, mpg]

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2

# to compare with:
# dtm[1:10, .(mpg)]

What do you think dtm[, c(mpg, cyl)] returns?
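For the question above, a quick check on a fresh copy of mtcars (dt_demo and v are names used only here) shows that c() concatenates the columns instead of keeping them side by side:

```r
library(data.table)

dt_demo <- as.data.table(mtcars)

# c() concatenates the two columns into one plain vector
v <- dt_demo[, c(mpg, cyl)]
is.data.table(v)   # FALSE
length(v)          # 64: 32 mpg values followed by 32 cyl values

# .() (or list()) keeps them as two columns of a data.table
dim(dt_demo[, .(mpg, cyl)])   # 32 rows, 2 columns
```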

Subset conditionally

Recall how conditional subsetting works on a data.frame:

head(iris[iris$Species == 'setosa',])

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

How do we do this with a data.table? Column names can be used directly inside [ ]. Return the first 6 rows of Species 'setosa':

head(dti[Species == 'setosa'])

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4         0.2  setosa
## 2:          4.9         3.0          1.4         0.2  setosa
## 3:          4.7         3.2          1.3         0.2  setosa
## 4:          4.6         3.1          1.5         0.2  setosa
## 5:          5.0         3.6          1.4         0.2  setosa
## 6:          5.4         3.9          1.7         0.4  setosa

Select all the rows where Species is either 'setosa' or 'virginica':

temp <- dti[Species %in% c('setosa', 'virginica')]
table(temp$Species)

## 
##     setosa versicolor  virginica 
##         50          0         50

Select the rows where the sepal area (Sepal.Width times Sepal.Length) is greater than 20:

dti[Sepal.Width*Sepal.Length > 20]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##  1:          5.4         3.9          1.7         0.4     setosa
##  2:          5.8         4.0          1.2         0.2     setosa
##  3:          5.7         4.4          1.5         0.4     setosa
##  4:          5.4         3.9          1.3         0.4     setosa
##  5:          5.7         3.8          1.7         0.3     setosa
##  6:          5.2         4.1          1.5         0.1     setosa
##  7:          5.5         4.2          1.4         0.2     setosa
##  8:          7.0         3.2          4.7         1.4 versicolor
##  9:          6.4         3.2          4.5         1.5 versicolor
## 10:          6.9         3.1          4.9         1.5 versicolor
## 11:          6.3         3.3          4.7         1.6 versicolor
## 12:          6.7         3.1          4.4         1.4 versicolor
## 13:          6.7         3.0          5.0         1.7 versicolor
## 14:          6.0         3.4          4.5         1.6 versicolor
## 15:          6.7         3.1          4.7         1.5 versicolor
## 16:          6.3         3.3          6.0         2.5  virginica
## 17:          7.1         3.0          5.9         2.1  virginica
## 18:          7.6         3.0          6.6         2.1  virginica
## 19:          7.3         2.9          6.3         1.8  virginica
## 20:          7.2         3.6          6.1         2.5  virginica
## 21:          6.5         3.2          5.1         2.0  virginica
## 22:          6.8         3.0          5.5         2.1  virginica
## 23:          6.4         3.2          5.3         2.3  virginica
## 24:          7.7         3.8          6.7         2.2  virginica
## 25:          7.7         2.6          6.9         2.3  virginica
## 26:          6.9         3.2          5.7         2.3  virginica
## 27:          7.7         2.8          6.7         2.0  virginica
## 28:          6.7         3.3          5.7         2.1  virginica
## 29:          7.2         3.2          6.0         1.8  virginica
## 30:          7.2         3.0          5.8         1.6  virginica
## 31:          7.4         2.8          6.1         1.9  virginica
## 32:          7.9         3.8          6.4         2.0  virginica
## 33:          7.7         3.0          6.1         2.3  virginica
## 34:          6.3         3.4          5.6         2.4  virginica
## 35:          6.9         3.1          5.4         2.1  virginica
## 36:          6.7         3.1          5.6         2.4  virginica
## 37:          6.9         3.1          5.1         2.3  virginica
## 38:          6.8         3.2          5.9         2.3  virginica
## 39:          6.7         3.3          5.7         2.5  virginica
## 40:          6.7         3.0          5.2         2.3  virginica
## 41:          6.2         3.4          5.4         2.3  virginica
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species

Manipulating data.table

Add

As mentioned before, converting a data.frame to a data.table discards its row names. Here, we add the vehicle model names back to dtm.

dtm <- data.table(model = rownames(mtcars), dtm)
head(dtm)

##                model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1:         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2:     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3:        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4:    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5: Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6:           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

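We now introduce the assignment operator := as a simpler way to do the step above.

dtm <- dtm[,  model := rownames(mtcars)]

When the column does not exist yet, := adds it as the last column. Here dtm already has a model column from the previous step, so that column is overwritten in place and stays first.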
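As a small illustration of where := puts a new column, the sketch below starts from a fresh table without a model column (dt_demo is a hypothetical name). It then uses setcolorder(), another data.table function that works by reference, to move the column to the front.

```r
library(data.table)

dt_demo <- as.data.table(mtcars)

# := appends the new column at the end, by reference
dt_demo[, model := rownames(mtcars)]
tail(names(dt_demo), 1)   # "model"

# setcolorder() reorders columns in place; here model is moved to the front
setcolorder(dt_demo, c("model", setdiff(names(dt_demo), "model")))
names(dt_demo)[1]         # "model"
```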

Select

To check the vehicle models and their fuel efficiency (return only the first 10 rows):

dtm[1:10, .(model, mpg)]

##                 model  mpg
##  1:         Mazda RX4 21.0
##  2:     Mazda RX4 Wag 21.0
##  3:        Datsun 710 22.8
##  4:    Hornet 4 Drive 21.4
##  5: Hornet Sportabout 18.7
##  6:           Valiant 18.1
##  7:        Duster 360 14.3
##  8:         Merc 240D 24.4
##  9:          Merc 230 22.8
## 10:          Merc 280 19.2

Also, if I want to stack Sepal.Length and Petal.Length into one column, data.table can do that: the shorter column (Species) is recycled to match the combined length.

temp <- dti[1:5]
temp

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4         0.2  setosa
## 2:          4.9         3.0          1.4         0.2  setosa
## 3:          4.7         3.2          1.3         0.2  setosa
## 4:          4.6         3.1          1.5         0.2  setosa
## 5:          5.0         3.6          1.4         0.2  setosa

# compare the command below with temp
temp[, .(Species, Length = c(Sepal.Length, Petal.Length))]

##     Species Length
##  1:  setosa    5.1
##  2:  setosa    4.9
##  3:  setosa    4.7
##  4:  setosa    4.6
##  5:  setosa    5.0
##  6:  setosa    1.4
##  7:  setosa    1.4
##  8:  setosa    1.3
##  9:  setosa    1.5
## 10:  setosa    1.4

Remove

To drop a column, we can simply assign NULL to this column with :=.

head(temp)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4         0.2  setosa
## 2:          4.9         3.0          1.4         0.2  setosa
## 3:          4.7         3.2          1.3         0.2  setosa
## 4:          4.6         3.1          1.5         0.2  setosa
## 5:          5.0         3.6          1.4         0.2  setosa

# remove the 1st column
temp[, Sepal.Length:=NULL]
head(temp)

##    Sepal.Width Petal.Length Petal.Width Species
## 1:         3.5          1.4         0.2  setosa
## 2:         3.0          1.4         0.2  setosa
## 3:         3.2          1.3         0.2  setosa
## 4:         3.1          1.5         0.2  setosa
## 5:         3.6          1.4         0.2  setosa

Notice that the removal is applied to the data.table itself, without assigning the result back to temp.
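This in-place behaviour is specific to :=. A minimal sketch contrasting it with an ordinary column selection in j, which returns a new object (temp_demo and smaller are hypothetical names):

```r
library(data.table)

temp_demo <- as.data.table(iris)[1:5]
names(temp_demo)

# := modifies temp_demo in place; nothing is assigned back
temp_demo[, Sepal.Length := NULL]
names(temp_demo)

# Selecting columns in j, by contrast, returns a new data.table
# and leaves temp_demo untouched
smaller <- temp_demo[, .(Species)]
ncol(temp_demo)   # still 4
```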

There is also an equivalent way to remove a column, by specifying its column number.

# remove the 1st column again
temp[, 1:=NULL]
head(temp)

##    Petal.Length Petal.Width Species
## 1:          1.4         0.2  setosa
## 2:          1.4         0.2  setosa
## 3:          1.3         0.2  setosa
## 4:          1.5         0.2  setosa
## 5:          1.4         0.2  setosa

To remove multiple columns, pass the column names or indexes as a vector.

temp[, c('Petal.Length', 'Petal.Width'):=NULL]
temp

##    Species
## 1:  setosa
## 2:  setosa
## 3:  setosa
## 4:  setosa
## 5:  setosa

# restore temp
temp <- dti[1:5]
temp[, (1:2):=NULL] # or temp[, 1:2:=NULL]
tbr <- c('Petal.Length', 'Petal.Width')
temp[, (tbr):=NULL]

or like this:

# restore temp
temp <- dti[1:5]
temp[, (1:2):=NULL] # or temp[, 1:2:=NULL]
temp[, grep('^Petal', names(temp)):=NULL]

Update

Following the same logic, we use := to assign new values to existing columns. In column Species, we keep only the first 3 letters:

temp <- dti[1:5]
temp[, Species:=substr(Species, 1, 3)]
temp

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4         0.2     set
## 2:          4.9         3.0          1.4         0.2     set
## 3:          4.7         3.2          1.3         0.2     set
## 4:          4.6         3.1          1.5         0.2     set
## 5:          5.0         3.6          1.4         0.2     set

How do we update multiple columns? We can add 1 to the columns vs and am by calling := as a function, with one name = value pair per column:

temp <- dtm[1:32]
temp[, `:=`(vs = vs + 1, am = am + 1)]
head(temp)

##                model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1:         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  1  2    4    4
## 2:     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  1  2    4    4
## 3:        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  2  2    4    1
## 4:    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  2  1    3    1
## 5: Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  1  1    3    2
## 6:           Valiant 18.1   6  225 105 2.76 3.460 20.22  2  1    3    1

Summarise

First of all, let's look at the simplest way of summarising. To check the fuel efficiency per cylinder for each model, we select the model column of the updated dtm and create a new variable, mpg divided by cyl (return only the first 10 rows).

dtm[1:10, .(model, efcy_cly = round(mpg/cyl, 3))]

##                 model efcy_cly
##  1:         Mazda RX4    3.500
##  2:     Mazda RX4 Wag    3.500
##  3:        Datsun 710    5.700
##  4:    Hornet 4 Drive    3.567
##  5: Hornet Sportabout    2.337
##  6:           Valiant    3.017
##  7:        Duster 360    1.788
##  8:         Merc 240D    6.100
##  9:          Merc 230    5.700
## 10:          Merc 280    3.200

To keep the column names in the result, wrap the selected columns in .().

Introduce basic by

by is the third argument and defines the groups. Now we check the average fuel efficiency (mpg) for each number of cylinders.

dtm[, mean(mpg), by = cyl]

##    cyl       V1
## 1:   6 19.74286
## 2:   4 26.66364
## 3:   8 15.10000

The second column only gets the default name V1. To give it a proper name, try the command below:

dtm[, .(avg_mpg = mean(mpg)), by = cyl]

##    cyl  avg_mpg
## 1:   6 19.74286
## 2:   4 26.66364
## 3:   8 15.10000

Apply function on by

by also accepts an expression, not just a column name. For example, I want to check the mean value of qsec (the time taken to cover 1/4 mile) grouped by the first 3 letters of the vehicle model.

dtm[, .(avg_qsec = mean(qsec)), by = substr(model, 1, 3)]

##     substr avg_qsec
##  1:    Maz 16.74000
##  2:    Dat 18.61000
##  3:    Hor 18.23000
##  4:    Val 20.22000
##  5:    Dus 15.84000
##  6:    Mer 19.01429
##  7:    Cad 17.98000
##  8:    Lin 17.82000
##  9:    Chr 17.42000
## 10:    Fia 19.18500
## 11:    Hon 18.52000
## 12:    Toy 19.95500
## 13:    Dod 16.87000
## 14:    AMC 17.30000
## 15:    Cam 15.41000
## 16:    Pon 17.05000
## 17:    Por 16.70000
## 18:    Lot 16.90000
## 19:    For 14.50000
## 20:    Fer 15.50000
## 21:    Mas 14.60000
## 22:    Vol 18.60000
##     substr avg_qsec

What if you want the number of rows in each group under a certain grouping rule? Use .N in j:

dti[, .N, by = .(area_group = 10*round(Sepal.Length*Sepal.Width/10))]

##    area_group   N
## 1:         20 117
## 2:         10  29
## 3:         30   4

If you are familiar with the dplyr package, you can try to write the equivalent command using n().

Return multiple numbers in j

So far, j has returned a single number for each group. data.table also allows returning multiple values per group.

Check the first 2 mpg records in each group defined by cyl and am (transmission type). Note that head() takes the first 2 rows in their current order, not the 2 largest values.

dtm[, .(mpg = head(mpg, 2)), by = .(cyl, am)]

##     cyl am  mpg
##  1:   6  1 21.0
##  2:   6  1 21.0
##  3:   4  1 22.8
##  4:   4  1 32.4
##  5:   6  0 21.4
##  6:   6  0 18.1
##  7:   8  0 18.7
##  8:   8  0 14.3
##  9:   4  0 24.4
## 10:   4  0 22.8
## 11:   8  1 15.8
## 12:   8  1 15.0

Calculate the cumulative weight of vehicles grouped by cyl and gear.

dtm[, .(cumulative_su = cumsum(wt)), by = .(cyl, gear)]

##     cyl gear cumulative_su
##  1:   6    4         2.620
##  2:   6    4         5.495
##  3:   6    4         8.935
##  4:   6    4        12.375
##  5:   4    4         2.320
##  6:   4    4         5.510
##  7:   4    4         8.660
##  8:   4    4        10.860
##  9:   4    4        12.475
## 10:   4    4        14.310
## 11:   4    4        16.245
## 12:   4    4        19.025
## 13:   6    3         3.215
## 14:   6    3         6.675
## 15:   8    3         3.440
## 16:   8    3         7.010
## 17:   8    3        11.080
## 18:   8    3        14.810
## 19:   8    3        18.590
## 20:   8    3        23.840
## 21:   8    3        29.264
## 22:   8    3        34.609
## 23:   8    3        38.129
## 24:   8    3        41.564
## 25:   8    3        45.404
## 26:   8    3        49.249
## 27:   4    3         2.465
## 28:   4    5         2.140
## 29:   4    5         3.653
## 30:   8    5         3.170
## 31:   8    5         6.740
## 32:   6    5         2.770
##     cyl gear cumulative_su

Chaining

List mpg in descending order within groups of cyl and am, then show only the 2 most fuel-efficient records in each group. Done step by step, this takes several statements:

temp <- dtm[, .(model, mpg), by = .(cyl, am)]
temp <- temp[order(-mpg)]
temp[, .(model = head(model, 2), mpg = head(mpg, 2)), by = .(cyl, am)]

##     cyl am             model  mpg
##  1:   4  1    Toyota Corolla 33.9
##  2:   4  1          Fiat 128 32.4
##  3:   4  0         Merc 240D 24.4
##  4:   4  0          Merc 230 22.8
##  5:   6  0    Hornet 4 Drive 21.4
##  6:   6  0          Merc 280 19.2
##  7:   6  1         Mazda RX4 21.0
##  8:   6  1     Mazda RX4 Wag 21.0
##  9:   8  0  Pontiac Firebird 19.2
## 10:   8  0 Hornet Sportabout 18.7
## 11:   8  1    Ford Pantera L 15.8
## 12:   8  1     Maserati Bora 15.0

We can simplify the process by chaining:

dtm[, .(model, mpg), by = .(cyl, am)][order(-mpg)][, .(model = head(model, 2), mpg = head(mpg, 2)), by = .(cyl, am)]

##     cyl am             model  mpg
##  1:   4  1    Toyota Corolla 33.9
##  2:   4  1          Fiat 128 32.4
##  3:   4  0         Merc 240D 24.4
##  4:   4  0          Merc 230 22.8
##  5:   6  0    Hornet 4 Drive 21.4
##  6:   6  0          Merc 280 19.2
##  7:   6  1         Mazda RX4 21.0
##  8:   6  1     Mazda RX4 Wag 21.0
##  9:   8  0  Pontiac Firebird 19.2
## 10:   8  0 Hornet Sportabout 18.7
## 11:   8  1    Ford Pantera L 15.8
## 12:   8  1     Maserati Bora 15.0
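As a possible alternative that is not part of the course material, the chain can often be shortened further by sorting in i and taking the first rows of .SD per group. This is only a sketch: dt_demo and top2 are hypothetical names, and the model names come from the rn column that keep.rownames = TRUE creates.

```r
library(data.table)

dt_demo <- as.data.table(mtcars, keep.rownames = TRUE)

# Sort once in i, then take the first two rows of each group in j;
# .SDcols limits .SD to the model name (rn) and mpg
top2 <- dt_demo[order(-mpg), head(.SD, 2), by = .(cyl, am),
                .SDcols = c("rn", "mpg")]
top2
```

Whether this reads better than the explicit chain is a matter of taste; the chained form makes each intermediate step easier to inspect.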

Summarise with .SD and .SDcols

Calculate the median of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width by species:

dti[, .(Sepal.Length = median(Sepal.Length),
        Sepal.Width = median(Sepal.Width),
        Petal.Length = median(Petal.Length),
        Petal.Width = median(Petal.Width)), by = Species][order(-Species)]

##       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:  virginica          6.5         3.0         5.55         2.0
## 2: versicolor          5.9         2.8         4.35         1.3
## 3:     setosa          5.0         3.4         1.50         0.2

We can use .SD to make life easier: it holds all columns except the grouping ones, so lapply() applies median to each of them.

dti[, lapply(.SD, median), by = Species][order(-Species)]

##       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:  virginica          6.5         3.0         5.55         2.0
## 2: versicolor          5.9         2.8         4.35         1.3
## 3:     setosa          5.0         3.4         1.50         0.2

How do we restrict .SD to specific columns? For example, I want to select the 2nd to 5th columns in dtm and calculate their average values.

dtm[, lapply(.SD, mean), .SDcols = 2:5]

##         mpg    cyl     disp       hp
## 1: 20.09062 6.1875 230.7219 146.6875

You could also directly specify the column names:

dtm[, lapply(.SD, mean), .SDcols = names(dtm)[2:5]]

##         mpg    cyl     disp       hp
## 1: 20.09062 6.1875 230.7219 146.6875
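Since .SDcols accepts a character vector, the names can also be computed. A minimal sketch that selects columns by pattern with base R's grep() and combines .SDcols with by (dt_demo, sepal_cols and res are names used only for this illustration):

```r
library(data.table)

dt_demo <- as.data.table(iris)

# Pick the columns by pattern rather than by position
sepal_cols <- grep("^Sepal", names(dt_demo), value = TRUE)

# Mean of the selected columns for each species
res <- dt_demo[, lapply(.SD, mean), by = Species, .SDcols = sepal_cols]
res
```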

To combine the internal variables, let's try a more interesting exercise. We calculate the average per-cylinder fuel efficiency, together with the average disp and hp, grouped by carb > 3 and wt > 3.3.

dtm[, efcy_cly := round(mpg/cyl, 3)][, lapply(.SD, mean), by = .(carb > 3, wt > 3.3), .SDcols = c('efcy_cly', 'disp', 'hp')]

##     carb    wt efcy_cly    disp        hp
## 1:  TRUE FALSE 3.064500 204.000 164.75000
## 2: FALSE FALSE 6.407667 117.875  84.91667
## 3: FALSE  TRUE 2.213125 304.300 161.87500
## 4:  TRUE  TRUE 1.991250 339.775 215.12500

Some things to note

The problem of assignment

To copy a data.frame, we just assign it to a new symbol. You may have noticed that we used copy() to make a copy of a data.table. Why don't we use <-?

DT <- as.data.table(mtcars[1:5, 1:5])
DT1 <- DT
# Check DT1
DT1

##     mpg cyl disp  hp drat
## 1: 21.0   6  160 110 3.90
## 2: 21.0   6  160 110 3.90
## 3: 22.8   4  108  93 3.85
## 4: 21.4   6  258 110 3.08
## 5: 18.7   8  360 175 3.15

# Now we change DT1 - add the vehicle model name
DT1[, model:=rownames(mtcars)[1:5]]
# Check DT
DT

If you know Python, you'll recognise the same logic: different names point to the same memory location, so changing DT1 also changes DT. To get an independent copy, as assigning a data.frame would give, there are at least 2 ways.

# Use the DT just created
# 1. copy()
DT2 <- copy(DT)
# 2. subsetting []
DT3 <- DT[1:nrow(DT)]
DT2[, model:=NULL]
DT3[, model:=NULL]
# Check DT, DT2 and DT3
DT # DT not changed

##     mpg cyl disp  hp drat             model
## 1: 21.0   6  160 110 3.90         Mazda RX4
## 2: 21.0   6  160 110 3.90     Mazda RX4 Wag
## 3: 22.8   4  108  93 3.85        Datsun 710
## 4: 21.4   6  258 110 3.08    Hornet 4 Drive
## 5: 18.7   8  360 175 3.15 Hornet Sportabout
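The whole behaviour can be reproduced on a tiny table. In this sketch DT_a, DT_b and DT_c are hypothetical names chosen to avoid clashing with the objects above:

```r
library(data.table)

DT_a <- data.table(x = 1:3)
DT_b <- DT_a          # another name for the same object
DT_c <- copy(DT_a)    # an independent copy

DT_b[, y := x * 2L]   # modifies the shared object in place

names(DT_a)   # "x" "y": changed through DT_b
names(DT_c)   # "x":     the copy is unaffected
```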

Difference between `=` and `:=` in data.table

Both operators can assign a calculation result to a new column. However, = (used inside .()) does not change the data.table itself; it generates and displays a temporary data.table. := does change it. Therefore, when you use = to create a new data.table, you need to assign the result to a symbol (new or existing) to keep it, while := needs no assignment.
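A short sketch of the difference, using a small slice of mtcars (dt_demo, res and ratio are names introduced only for this example):

```r
library(data.table)

dt_demo <- as.data.table(mtcars)[1:5, .(mpg, cyl)]

# '=' inside .() builds a new data.table; dt_demo itself is unchanged
res <- dt_demo[, .(ratio = mpg / cyl)]
names(dt_demo)   # "mpg" "cyl"

# ':=' adds the column to dt_demo by reference
dt_demo[, ratio := mpg / cyl]
names(dt_demo)   # "mpg" "cyl" "ratio"
```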

0 0