I have a rather huge dataset from which I would like to exclude columns with rather low variance, which is why I would like to use the function nearZeroVar. However, I have some trouble understanding what freqCut and uniqueCut do and how they influence each other. I have already read the explanation in the R documentation, but that does not really help me here. If anyone could explain it to me, I would be very thankful!
If a variable has very little change or variation, it is effectively a constant and not useful for prediction. Such a variable has close to zero variance, hence the name of the function.
The two parameters do not influence each other; they are there to catch common scenarios that give rise to variables of near zero variance. A column needs to fail both criteria (a high frequency ratio and a low unique-value percentage) to be excluded.
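A minimal base-R sketch of that rule, per column. The exact comparisons (strictly greater than freqCut, at most uniqueCut) and the separate handling of constant columns are my reading of caret's documented behavior, so treat them as assumptions:

```r
# Sketch of the per-column decision nearZeroVar() makes (my reading of caret):
# flag a column when it is constant, OR when its frequency ratio exceeds
# freqCut AND its unique-value percentage is at most uniqueCut.
near_zero <- function(x, freqCut = 95/5, uniqueCut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  zeroVar <- length(tab) <= 1                       # constant column
  freqRatio <- if (zeroVar) 0 else unname(tab[1] / tab[2])
  percentUnique <- 100 * length(unique(x)) / length(x)
  zeroVar || (freqRatio > freqCut && percentUnique <= uniqueCut)
}

near_zero(rep(1, 9))                                # constant: TRUE
near_zero(rep(c(1, 2), c(8, 1)))                    # ratio 8, 22% unique: FALSE
near_zero(rep(c(1, 2), c(8, 1)), freqCut = 7, uniqueCut = 30)  # TRUE
```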
Let's use an example:
mat = cbind(1, rep(c(1, 2), c(8, 1)), rep(1:3, 3), 1:9)
mat
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    2    2
[3,]    1    1    3    3
[4,]    1    1    1    4
[5,]    1    1    2    5
[6,]    1    1    3    6
[7,]    1    1    1    7
[8,]    1    1    2    8
[9,]    1    2    3    9
If we use the defaults, freqCut = 95/5 (the ratio of the most common value to the second most common value) and uniqueCut = 10 (the percentage of unique values), you can see that only the 1st column is taken out:
nearZeroVar(mat)
[1] 1
Let's look at the 2nd column: the ratio of the most common to the second most common value is 8/1 = 8, and it has 2 unique values out of 9, i.e. 100 * 2/9 ≈ 22.2%. So for this column to be filtered out, you need to change both settings:
nearZeroVar(mat,freqCut=7/1,uniqueCut=30)
[1] 1 2
Lastly, columns 3 and 4 are something you most likely should not filter out, yet even column 3 gets flagged when we set the thresholds to nonsensical values (column 4 still survives because every one of its values is unique, i.e. percentUnique = 100):
nearZeroVar(mat,freqCut=0.1,uniqueCut=50)
[1] 1 2 3
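To see the numbers behind these decisions, you can recompute the two metrics per column in base R (caret's nearZeroVar(mat, saveMetrics = TRUE) reports the same quantities, if I recall the API correctly):

```r
# Recompute freqRatio and percentUnique for every column of mat in base R.
mat <- cbind(1, rep(c(1, 2), c(8, 1)), rep(1:3, 3), 1:9)
metrics <- t(apply(mat, 2, function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  c(freqRatio     = if (length(tab) > 1) unname(tab[1] / tab[2]) else 0,
    percentUnique = 100 * length(unique(x)) / length(x))
}))
metrics
# column 1: freqRatio 0 (constant), percentUnique ~11.1
# column 2: freqRatio 8,            percentUnique ~22.2
# column 3: freqRatio 1,            percentUnique ~33.3
# column 4: freqRatio 1,            percentUnique 100
```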
Cockspur, I assume you are talking about the function present in mixOmics, which is referring to a similar function in the caret package.
The idea behind that function is to identify predictors (columns of a matrix) that are mostly invariable (have "near zero variance"), e.g. having almost exclusively the value 0 and only a low fraction of non-zero values; those would be uninteresting as predictors.
Their example of an uninteresting predictor would be 1000 values, of which 999 are 0 and one is 1. To define what you would call "near zero variance", the authors used two filters:
1) the ratio of frequencies for the most common value (0, in this example) over the second most common value (1) (freqRatio)
2) the percentage of unique data points out of the total number of data points (percentUnique)
In the above example, the frequency ratio is 999/1 = 999 and the unique value percentage is 100 * 2/1000 = 0.2%, both tripping the default thresholds used in the function (freqCut = 95/5 = 19 and uniqueCut = 10).
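That arithmetic can be checked directly in base R (same computation as above, no caret needed):

```r
# 1000 values: 999 zeros and a single 1
x <- rep(c(0, 1), c(999, 1))
tab <- sort(table(x), decreasing = TRUE)
unname(tab[1] / tab[2])                  # freqRatio: 999 / 1 = 999
100 * length(unique(x)) / length(x)      # percentUnique: 100 * 2/1000 = 0.2
```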
You can imagine data with few discrete values, e.g. 500 0's and 500 1's. You might want to keep this as an informative predictor: its percentUnique would be very low (satisfying that filter criterion), but its freqRatio of 500/500 = 1 would be too low to flag the predictor.
On the other end of the spectrum, you could have something like 500 0's and 500 distinct non-zero values, which might have useful predictive properties; that predictor is characterized by a high freqRatio but also a high percentUnique, so it is also not flagged.
These two parameters allow you some flexibility in drawing the line of what you consider to still be useful as a predictor.
Depending on your data, you could also use matrixStats::colVars (or matrixStats::rowVars, depending on your data structure) to get a distribution of the variances of individual predictors, and then define a cutoff based on that. When you plot the density distribution of the variance, you might see a good cutpoint, e.g. in a bimodal distribution, or just pick a percentile of variance that you want to use as a cutoff.
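A sketch of that variance-based approach on toy data, in base R (matrixStats::colVars is a faster drop-in for large matrices; the 10th-percentile cutoff is an arbitrary illustrative choice, not a recommendation):

```r
# Variance-based filtering sketch on random toy data
set.seed(1)
X <- matrix(rnorm(1000), nrow = 100)   # toy data: 100 samples x 10 predictors
v <- apply(X, 2, var)                  # per-predictor variance
plot(density(v))                       # inspect the distribution for a cutpoint
keep <- v > quantile(v, 0.10)          # e.g. drop the lowest 10% of variances
X_filtered <- X[, keep]
```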
new <- c(1,1,3700,1,1,1,1,1,1,2600,1,1,3000,1,1,1,1,1,1,1,1,1). Now set nearZeroVar(df, freqCut = 18/4, uniqueCut = 15), which, as far as I understand, means that if more than 18 values are equal and you have fewer than 2 unique values (2/22 ≈ 0.091) the column should be left out. What do I get wrong? – Defecate
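For reference, the metrics for that vector work out as follows in base R (the same computation nearZeroVar performs internally, as far as I can tell; note that both conditions must hold for a column to be flagged):

```r
new <- c(1,1,3700,1,1,1,1,1,1,2600,1,1,3000,1,1,1,1,1,1,1,1,1)
tab <- sort(table(new), decreasing = TRUE)
unname(tab[1] / tab[2])                   # freqRatio: 19 ones / 1 -> 19
100 * length(unique(new)) / length(new)   # percentUnique: 100 * 4/22 ~ 18.2
# freqCut = 18/4 = 4.5 is exceeded (19 > 4.5), but with uniqueCut = 15 the
# column is NOT flagged, because 18.2 > 15 and both conditions must hold.
```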