In my recent post I have written about the aggregate function in base R and gave some examples on its use. This post repeats the same examples using data.table instead, the most efficient implementation of the aggregation logic in R, plus some additional use cases showing the power of the data.table package.
This post focuses on the aggregation aspect of the data.table and only touches upon all other uses of this versatile tool. For a great resource on everything data.table, head to the authors’ own free training material.
All code snippets below require the data.table package to be installed and loaded:
install.packages("data.table") library(data.table)
Basic examples
Here is the example for the number of appearances of the unique values in the data:
values <- data.table(value = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c")) values nr.of.appearances <- values[, list(nr.appearances=length(value)), by = list(unique.values = value)] nr.of.appearances value 1: a 2: a 3: a ..... unique.values nr.appearances 1: a 5 2: b 3 3: c 4
You can notice a lot of differences here. First of all, no additional function was invoke. Instead, the []
operator has been overloaded for the data.table
class allowing for a different signature: it has three inputs instead of the usual two for a data.frame
. We will return to this in a moment. Secondly, the columns of the data.table
were not referenced by their name as a string
, but as a variable instead. This is a very important aspect of the data.table
syntax. Last but not least as implied by the fact that both the aggregating function and the grouping variable are passed on as a list
one can not only group by multiple variables as in aggregate
but you can also use multiple aggregation functions at the same time. I will show an example of that later.
Coming back to the overloading of the []
operator: a data.table
is at the same time also a data.frame
. See e.g.
class(values) "data.table" "data.frame"
This means that you can use all (or at least most of) the data.frame
functionality as well. Among others you can use aggregate
like you would use for a data.frame
:
values <- data.frame(value = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c")) nr.of.appearances <- aggregate(x = values, by = list(unique.values = values$value), FUN = length)
EDIT (02/12/2015): Matt Dowle from the data.table team suggested a more efficient implementation for this in the comments (thanks, Matt!):
nr.of.appearances <- values[, list(nr.appearances=.N), by = list(unique.values = value)]
You can also use the []
operator in the classic data.frame
way by passing on only two input variables:
values[values$value == "a",] value 1: a 2: a 3: a ...
UPDATE 02/12/2015
Matt Dowle from the data.table team warned in the comments against this way of filtering a data.table and suggested an alternative (thanks, Matt!):
values[value=="a",] value 1: a 2: a 3: a ...
Another exciting possibility with data.table is creating a new column in a data.table derived from existing columns – with or without aggregation. Examples of both are shown below:
values[, new.col := paste0(value, value)] values values[, new.col := paste0(value, length(value)), by = list(unique.values = value)] values value new.col 1: a aa 2: a aa 3: a aa 4: a aa ..... value new.col 1: a a5 2: a a5 3: a a5 4: a a5 .....
Notice that in both cases the data.table
was directly modified, rather than left unchanged with the results returned. That’s right: data.table
creates side effect by using copy-by-reference rather than copy-by-value as (almost) everything else in R. It is arguable whether this is alien to the nature of a (more or less) functional language like R but one thing is sure: it is extremely efficient, especially when the variable hardly fits the memory to start with.
Back to the basic examples, here is the last (and first) day of the months in your data
dates <- data.frame(date = as.Date("2001-01-01", format = "%Y-%m-%d") + 0:729) dates dates <- as.data.table(dates) dates special.days <- dates[, list(first.day = min(date), last.day = max(date)), by = list(month = substr(date, 1, 7))] special.days date 1 2001-01-01 2 2001-01-02 3 2001-01-03 4 2001-01-04 ..... date 1: 2001-01-01 2: 2001-01-02 3: 2001-01-03 --- 726: 2002-12-27 727: 2002-12-28 728: 2002-12-29 month first.day last.day 1: 2001-01 2001-01-01 2001-01-31 2: 2001-02 2001-02-01 2001-02-28 3: 2001-03 2001-03-01 2001-03-31 .....
As you can see the syntax is the same as above – but now we can get the first and last days in a single command! Also note that you don’t have to know up front that you want to use data.table
: the as.data.table
command allows you to cast a data.frame
into a data.table
. Finally, notice how data.table
creates a summary of the head and the tail of the variable if it’s too long to show.
Advanced Uses
Just like in case of aggregate
, you can use anonymous functions to aggregate in data.table
as well. Let’s have a look at the example for fitting a Gaussian distribution to observations by categories:
library(MASS) categories <- data.table(category = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c")) observations <- data.table(observation = c(rnorm(5, mean = 3, sd = 0.2), rnorm(5, mean = -2, sd = 0.4), rnorm(4, mean = 0, sd = 1))) data <- cbind(categories, observations) data distr.estimate <- data[, list(mean = fitdistr(observation, densfun = "normal")$estimate[[1]], sd = fitdistr(observation, densfun = "normal")$estimate[[2]]), by = list(category)] distr.estimate category observation 1: a 2.7446816 2: a 2.8853469 3: a 2.7550775 ..... category mean sd 1: a 2.8332705 0.06882552 2: b -1.9678460 0.37420857 3: c 0.9233108 0.47680978
or equivalently
distr.estimate <- data[, list(mean = fitdistr(observation, densfun = "normal")$estimate, sd = fitdistr(observation, densfun = "normal")$estimate[[2]]), by = list(category)]
This example shows some weaknesses of using data.table
compared to aggregate
, but it also shows that those weaknesses are nicely balanced by the strength of data.table
. One such weakness is that by design data.table
aggregation requires the variables to be coming from the same data.table
, so we had to cbind
the two variables. Also, the aggregation in data.table
returns only the first variable if the function invoked returns more than variable, hence the equivalence of the two syntaxes showed above. However, as multiple calls can be submitted in the list, this can easily be overcome. Finally note how much simpler the anonymous function construction works: rather than defining the function itself, we can simply pass the relevant variable.
UPDATE 02/12/2015
As kindly noted by Jan Gorecki in the comments (thanks, Jan!), the weakness I mention above can be overcome by using the {}
operator for the inut variable j
:
distr.estimate <- data[, {est <- fitdistr(observation, densfun ="normal")$estimate; list(mean = est[[1]], sd = est[[2]])}, by = list(category)] distr.estimate category mean sd 1: a 2.8332705 0.06882552 2: b -1.9678460 0.37420857 3: c 0.9233108 0.47680978
Notice that as opposed to the anonymous function definition in aggregate
, you don’t have to use the return()
command, data.table
simply returns with the result of the last command.
If you have any question about this post please leave a comment below.