Using ggpairs with NA-containing data
ggpairs in the GGally package seems pretty useful, but it appears to fail when an NA is present anywhere in the data set:

require(GGally)
data(tips, package="reshape")
pm <- ggpairs(tips[,1:3]) #works just fine

#introduce NA
tips[1,1] <- NA
ggpairs(tips[,1:3])
> Error in if (lims[1] > lims[2]) { : missing value where TRUE/FALSE needed

I don't see any documentation for dealing with NA values, and solutions like ggpairs(tips[,1:3], na.rm=TRUE) (unsurprisingly) don't change the error message.

I have a data set in which perhaps 10% of values are NA, randomly scattered throughout the dataset. Therefore na.omit(myDataSet) will remove much of the data. Is there any way around this?

Secession answered 26/10, 2012 at 20:26 Comment(2)
There's no default way to handle NA values within GGally, at least that I've found. What I've done in the past is simply replace NA values with 0. Is that feasible for the data set you have? – Coakley
It's not really accurate, unfortunately. My NAs are generally due to lost/faulty samples, and the true value was unlikely to be 0. – Secession
Some functions in GGally, such as ggparcoord(), support handling NAs via a missing = [exclude, mean, median, min10, random] parameter. Unfortunately, this is not the case for ggpairs().

What you can do is replace the NAs with a good estimate of your data, i.e. the kind of value you might have expected ggpairs() to fill in automatically. Established options include replacing them with row means, zeros, the column median, or the value of the closest point (the original answer linked to a resource for each of these).
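For instance, a minimal sketch of median imputation on the tips data from the question (the choice of statistic, and whether imputation is acceptable at all, is up to you; see the caveats in the comments below):

```r
library(GGally)

data(tips, package = "reshape")
tips[1, 1] <- NA  # introduce an NA, as in the question

# Replace NAs in numeric columns with the column median
# (any of the estimates above could be swapped in here).
num.cols <- sapply(tips, is.numeric)
tips[num.cols] <- lapply(tips[num.cols], function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
})

ggpairs(tips[, 1:3])  # no longer errors
```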

Duala answered 27/10, 2012 at 1:2 Comment(8)
I think it is very dangerous to invent data points where real data are missing. The techniques you mention are easy in R, and we may be able to get away with them when the number of replacements is small relative to the total number of observations, but when the number of replacements becomes reasonably large, as in my data set, the techniques you mention amount to data fabrication. – Secession
@DrewSteen So what do you expect ggpairs() would do in the very best case with the NAs? It cannot conjure up the real value that should be filled in. There is no choice but to remove the NAs along with the related data, or to cleverly replace them with estimates that do not change the results. – Duala
Yes, I would like it to remove the NAs and the paired data point, but to keep all data from the same data frame row in plots that don't include the NA. Thus, scatter plots in a given row or column might contain different numbers of data points. Instead it just throws an error message. – Secession
If you replace your NAs with 0, you will then have some points on the axes that you can ignore, provided you don't have zeros in your original data. – Duala
But I do have zeros in the original data! – Secession
What about a big number that does not appear in your data? – Duala
I'm really resistant to inserting fake data into my data sets. In science, making up data (even if you're clear about doing it) is really, really not OK. – Secession
Do you know if changes have been made to make ggpairs work with NA-containing data? – Brashy

I see that this is an old post. Recently I encountered the same problem but still could not find a solution on the Internet. So I provide my workaround below FYI.

I think the aim is to use pairwise-complete observations for plotting (i.e. in a manner specific to each panel/facet of the ggpairs grid), instead of using only observations that are complete across all variables. The former keeps usable observations to the maximal extent, without introducing artificial data by imputing missing values. To date, it seems that ggpairs still does not support this. My workaround is to:

  1. Encode NA with another value not present in the data; e.g. for numerical variables, I replaced NAs with -666 for my dataset. For each dataset you can always pick something that is out of the range of its data values. (Incidentally, Inf doesn't seem to work.)
  2. Then retrieve the pairwise-complete cases with user-created plotting functions. For example, for scatter plots of continuous variables in the lower triangle, I do something like:
library(data.table)
library(ggplot2)

scat.my <- function(data, mapping, ...) {
  # Parse the x and y variable names from `mapping`
  # (my way of doing it; there may be a better way)
  x <- as.character(unclass(mapping$x))[2]
  y <- as.character(unclass(mapping$y))[2]
  # Keep only pairwise-complete rows, assuming NAs were replaced with -666
  dat <- data.table(x = data[[x]], y = data[[y]])[x != -666 & y != -666]
  ggplot(dat, aes(x = x, y = y)) +
    geom_point()
}

ggpairs(my.data, lower=list(continuous=scat.my), ...)

This can be similarly done for the upper triangle and the diagonal. It is somewhat labor-intensive as all the plotting functions need to be re-done manually with customized modifications as above. But it did work.
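The same idea can be sketched without the sentinel value by dropping incomplete rows inside the panel function itself. This is an assumption-laden sketch: it relies on a recent ggplot2 where the mapping entries are quosures, parsed here with rlang::as_label():

```r
library(GGally)
library(ggplot2)
library(rlang)

# Custom panel: keep only rows complete in this panel's x/y pair,
# leaving NAs elsewhere in the data untouched.
scat.pairwise <- function(data, mapping, ...) {
  x <- as_label(mapping$x)
  y <- as_label(mapping$y)
  dat <- data[complete.cases(data[, c(x, y)]), ]
  ggplot(dat, mapping) + geom_point(...)
}

data(tips, package = "reshape")
tips[1, 1] <- NA
ggpairs(tips[, 1:3], lower = list(continuous = scat.pairwise))
```

As in the answer above, each panel then shows a different (maximal) subset of the rows.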

Barber answered 23/12, 2021 at 9:43 Comment(0)

I'll take a shot at it with my own horrible workaround, because I think this question needs more attention. I agree with the OP that filling in data based on statistical assumptions or a chosen hack is a terrible idea for exploratory analysis, and I think it's guaranteed to fail as soon as you forget how it works (about five days, for me) and need to adjust it for something else.

Disclaimer

This is a terrible way to do things, and I hate it. It's useful for when you have a systematic source of NAs coming from something like sparse sampling of a high-dimensional dataset, which maybe the OP has.

Example

Say you have a small subset of some vastly larger dataset, making some of your columns sparsely represented:

| Sample (0:350) | Channel (1:118) | Trial (1:10) | Voltage | Class (1:2) | Subject (1:3) |
|---------------:|---------------:|------------:|-----------:|:-----------|--------------:|
|               1|               1|            1|  0.17142245|1           |              1|
|               2|               2|            2|  0.27733185|2           |              2|
|               3|               1|            3|  0.33203066|1           |              3|
|               4|               2|            1|  0.09483775|2           |              1|
|               5|               1|            2|  0.79609409|1           |              2|
|               6|               2|            3|  0.85227987|2           |              3|
|               7|               1|            1|  0.52804960|1           |              1|
|               8|               2|            2|  0.50156096|2           |              2|
|               9|               1|            3|  0.30680522|1           |              3|
|              10|               2|            1|  0.11250801|2           |              1|

require(data.table) # needs the latest rForge version of data.table for dcast
sample.table <- data.table(Sample = seq_len(10), Channel = rep(1:2,length.out=10),
                           Trial = rep(1:3, length.out=10), Voltage = runif(10), 
                           Class = as.factor(rep(1:2,length.out=10)),
                           Subject = rep(1:3, length.out=10))

The example is hokey, but pretend the columns are uniformly sampled from their much larger ranges.

Let's say you want to cast the data to wide format along all channels to plot with ggpairs. Now, a canonical dcast back to wide format will not work, with an id column or otherwise, because the column ranges are sparsely (and never completely) represented:

wide.table <- dcast.data.table(sample.table, Sample ~ Channel,
                                   value.var="Voltage",
                                   drop=TRUE)

> wide.table
        Sample         1          2
     1:      1 0.1714224         NA
     2:      2        NA 0.27733185
     3:      3 0.3320307         NA
     4:      4        NA 0.09483775
     5:      5 0.7960941         NA
     6:      6        NA 0.85227987
     7:      7 0.5280496         NA
     8:      8        NA 0.50156096
     9:      9 0.3068052         NA
    10:     10        NA 0.11250801

It's obvious in this toy example which id column would work (sample.table[,index:=seq_len(nrow(sample.table)/2)]), but with a tiny uniform sample of a huge data.table it's basically impossible to find a sequence of id values that will thread through every hole in your data when applied to the formula argument. This kludge will work:

setkey(sample.table,Class)

We'll need this at the end to ensure the ordering is fixed.

chan.split <- split(sample.table,sample.table$Channel)

That gets you a list of data.frames for each unique Channel.

cut.fringes <- min(sapply(chan.split, nrow))
chan.dt <- lapply(chan.split, function(x) {
  x[1:cut.fringes, ]$Voltage
})

There has to be a better way to ensure each data.frame has an equal number of rows, but for my application, I can guarantee they're only a few rows different, so I just trim off the excess rows.

chan.dt <- as.data.table(matrix(unlist(chan.dt),
                 ncol = length(unique(sample.table$Channel)), 
                 byrow=TRUE))

This will get you back to a big data.table, with Channels as columns.

chan.dt[, Class := as.factor(
  rep(0:1, each = sampling.factor/2 * nrow(original.table)/ncol(chan.dt))[1:cut.fringes])]

Finally, I rebind my categorical variable back on. The tables should be sorted by category already so this will match. This assumes you have the original table with all the data; there are other ways to do it.

ggpairs(data = chan.dt,
        columns = 1:length(unique(sample.table$Channel)),
        colour = "Class", axisLabels = "show")

Now it's plottable with the above.

Snocat answered 20/1, 2014 at 10:42 Comment(1)
The way I have it set up above, you're still throwing away rows, but now you have a choice of what to do there. Those more skilled than I am probably know some trick with ragged tables that will make it work. – Snocat

As far as I can tell, there is no way around this with ggpairs(). Also, you are absolutely correct to not fill in with 'fake' data. If it is appropriate to suggest here, I would recommend using a different plotting method. For example

 cor.data <- cor(data, use = "pairwise.complete.obs") # correlations ignoring pairwise NAs
 library(PerformanceAnalytics)
 chart.Correlation(data) # note: takes the raw data, not the correlation matrix, and computes its own pairwise correlations

or using code from here http://hlplab.wordpress.com/2012/03/20/correlation-plot-matrices-using-the-ellipse-library/
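The linked post's approach can be roughly sketched with the ellipse package's plotcorr(), again feeding it pairwise-complete correlations (a minimal sketch; the post itself adds colour scales and upper/lower customization):

```r
library(ellipse)

data(tips, package = "reshape")
tips[1, 1] <- NA

# Correlations of the numeric columns, ignoring NAs pairwise
num.data <- tips[sapply(tips, is.numeric)]
cor.data <- cor(num.data, use = "pairwise.complete.obs")

# Draws one ellipse per cell; the narrower the ellipse, the stronger the correlation
plotcorr(cor.data)
```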

Lepore answered 1/7, 2014 at 18:3 Comment(0)

11 years after you posted, I believe I found your answer:

library(GGally)
data(tips, package = "reshape")
pm <- ggpairs(tips[, 1:3]) # works just fine
tips[1, 1] <- NA
ggpairs(tips[, 1:3],
        upper = list(continuous = "cor", combo = "box_no_facet", discrete = "count", na = "cor"),
        lower = list(continuous = "points", combo = "facethist", discrete = "facetbar", na = "points"))

Here I just changed each na value to match the corresponding continuous value, rather than leaving it at its default of "na".

Loosen answered 3/10, 2023 at 21:59 Comment(0)

You can adjust the parameters of the ggally_cor() function (used for continuous = "cor"); some of them are passed on to the stats::cor() function.

In stats::cor() you have the option to use only pairwise-complete observations, as pointed out in the documentation:

If use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables.

To use this functionality with ggpairs, instead of the default

upper = list(continuous = "cor", combo = "box_no_facet", discrete = "count", na = "na")

use

upper = list(continuous = wrap(ggally_cor, use = "pairwise.complete.obs"), combo = "box_no_facet", discrete = "count", na = "na")
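Put together with the tips example from the question, the full call might look like this (only the continuous entry differs from the defaults):

```r
library(GGally)

data(tips, package = "reshape")
tips[1, 1] <- NA

# wrap() passes use = "pairwise.complete.obs" through to the correlation panel
ggpairs(tips[, 1:3],
        upper = list(continuous = wrap(ggally_cor, use = "pairwise.complete.obs"),
                     combo = "box_no_facet", discrete = "count", na = "na"))
```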

Roadwork answered 13/1 at 15:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.