My question involves writing code using the dplyr package in R
I have a relatively large dataframe (approx 5 million rows) with 2 columns: the first with an individual identifier (id
), and a second with a date (date
). At present, each row indicates the occurrence of an action (taken by the individual in the id column) on the date in the date column. There are about 300,000 unique individuals, and about 2600 unique dates. For example, the beginning of the data look like this:
id date
John12 2006-08-03
Tom2993 2008-10-11
Lisa825 2009-07-03
Tom2993 2008-06-12
Andrew13 2007-09-11
I'd like to reshape the data so that I have a row for every possible id
x date
pair, with an additional column which counts the total number of events that occurred (perhaps taking the value 0) for the listed individual on the given date.
I've had some success with the dplyr package, which I've used to tabulate the id x date counts which are observed in the data.
Here's the code I've used to tabulate id
x date
counts so far: (my dataframe is called df)
reduced = df %.%
group_by(id, date) %.%
summarize(length(date))
My problem is that (as I said above) I'd like to have a dataset that also includes 0s for id x date pairs that don't have any associated actions. For example, if there's no observed action for John12 on 2007-10-10, I'd like the output to return a row for that id
x date
pair, with a count of 0.
I considered creating the frame above, then mergine with an empty frame, but I'm convinced there must be a simpler solution. Any suggestions much appreciated!
dcast.data.table(dt[, bla := "count"], id + date ~ bla, drop=FALSE, fun.agg=length)
, although I like yours better (and is likely to be faster due toeval(length(.))
for each group here. – Skin