I'll take a shot at it with my own horrible workaround, because I think this needs stimulation. I agree with OP that filling in data based on statistical assumptions or a chosen hack is a terrible idea for exploratory analysis, and I think it's guaranteed to fail as soon as you forget how it works (about five days for me) and need to adjust it for something else.
Disclaimer
This is a terrible way to do things, and I hate it. It's useful for when you have a systematic source of NAs coming from something like sparse sampling of a high-dimensional dataset, which maybe the OP has.
Example
Say you have a small subset of some vastly larger dataset, making some of your columns sparsely represented:
| Sample (0:350)| Channel(1:118)| Trial(1:10)| Voltage|Class (1:2)| Subject (1:3)|
|---------------:|---------------:|------------:|-----------:|:-----------|--------------:|
| 1| 1| 1| 0.17142245|1 | 1|
| 2| 2| 2| 0.27733185|2 | 2|
| 3| 1| 3| 0.33203066|1 | 3|
| 4| 2| 1| 0.09483775|2 | 1|
| 5| 1| 2| 0.79609409|1 | 2|
| 6| 2| 3| 0.85227987|2 | 3|
| 7| 1| 1| 0.52804960|1 | 1|
| 8| 2| 2| 0.50156096|2 | 2|
| 9| 1| 3| 0.30680522|1 | 3|
| 10| 2| 1| 0.11250801|2 | 1|
require(data.table) # needs the latest rForge version of data.table for dcast
sample.table <- data.table(Sample = seq_len(10), Channel = rep(1:2,length.out=10),
Trial = rep(1:3, length.out=10), Voltage = runif(10),
Class = as.factor(rep(1:2,length.out=10)),
Subject = rep(1:3, length.out=10))
The example is hokey, but pretend the columns are uniformly sampled from the larger ranges shown in the table header.
Let's say you want to cast the data to wide format along all channels to plot with `ggpairs`. Now, a canonical `dcast` back to wide format will not work, with an `id` column or otherwise, because the column ranges are sparsely (and never completely) represented:
wide.table <- dcast.data.table(sample.table, Sample ~ Channel,
value.var="Voltage",
drop=TRUE)
> wide.table
Sample 1 2
1: 1 0.1714224 NA
2: 2 NA 0.27733185
3: 3 0.3320307 NA
4: 4 NA 0.09483775
5: 5 0.7960941 NA
6: 6 NA 0.85227987
7: 7 0.5280496 NA
8: 8 NA 0.50156096
9: 9 0.3068052 NA
10: 10 NA 0.11250801
It's obvious in this case what `id` column would work because it's a toy example (`sample.table[,index:=seq_len(nrow(sample.table)/2)]`), but it's basically impossible in the case of a tiny uniform sample of a huge data.table to find a sequence of `id` values that will thread through every hole in your data when applied to the formula argument. This kludge will work:
setkey(sample.table,Class)
We'll need this at the end to ensure the ordering is fixed.
chan.split <- split(sample.table,sample.table$Channel)
That gets you a list of data.frames for each unique Channel.
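# Number of rows in the smallest per-channel table
cut.fringes <- min(sapply(chan.split, nrow))
# Keep only the first cut.fringes Voltage values of each channel
chan.dt <- cbind(lapply(chan.split, function(x) {
  x[seq_len(cut.fringes), ]$Voltage}))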
There has to be a better way to ensure each data.frame has an equal number of rows, but for my application, I can guarantee they're only a few rows different, so I just trim off the excess rows.
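# unlist() returns all of the first channel's values, then all of the next, and so on,
# so the matrix has to be filled column by column (byrow=FALSE) to get one column per channel
chan.dt <- as.data.table(matrix(unlist(chan.dt),
                                ncol = length(unique(sample.table$Channel)),
                                byrow=FALSE))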
This will get you back to a big data.table, with Channels as columns.
chan.dt[,Class:=
as.factor(rep(0:1,each=sampling.factor/2*nrow(original.table)/ncol(chan.dt))[1:cut.fringes])]
Finally, I rebind my categorical variable back on. The tables should be sorted by category already so this will match. This assumes you have the original table with all the data: original.table and sampling.factor come from my full dataset and are not defined in the toy example. There are other ways to do it.
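library(GGally) # provides ggpairs
# Newer GGally versions may expect mapping = ggplot2::aes(colour = Class) instead of colour=
ggpairs(data=chan.dt,
        columns=1:length(unique(sample.table$Channel)), colour="Class", axisLabels="show")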
Now it's plottable with the above.
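As a possible answer to my own "there has to be a better way" above, here is a minimal sketch of an alternative to the split-and-trim steps. It is not what I actually ran. It numbers the rows within each channel and casts on that counter, so the k-th observation of every channel lands in the same row. The `obs` column is a hypothetical helper introduced only for this sketch, and `set.seed` is there only to make it reproducible. Note that it pairs values by position within each channel, which is the same arbitrary pairing the kludge makes.

```r
library(data.table)

set.seed(1)  # only so the sketch is reproducible
sample.table <- data.table(Sample  = seq_len(10),
                           Channel = rep(1:2, length.out = 10),
                           Trial   = rep(1:3, length.out = 10),
                           Voltage = runif(10),
                           Class   = as.factor(rep(1:2, length.out = 10)),
                           Subject = rep(1:3, length.out = 10))

# Number the rows within each channel: the k-th row of every channel gets obs = k.
# "obs" is a hypothetical helper column, not part of the original data.
sample.table[, obs := seq_len(.N), by = Channel]

# Cast on the per-channel counter instead of Sample, so rows with the same
# position in each channel share a row of the wide table.
wide.table <- dcast(sample.table, obs ~ Channel, value.var = "Voltage")

# Channels with fewer rows than the longest one are padded with NA by dcast;
# dropping incomplete rows has the same effect as trimming to the shortest channel.
wide.table <- na.omit(wide.table)
```

The channel columns of `wide.table` (named after the channel values) can then be passed to ggpairs in the same way as chan.dt.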