Correct way to standardize/scale/normalize multiple variables following power law distribution for use in linear combination

I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:

in_degree + betweenness_centrality = informal_power_index

The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs. 0-35000, and both follow a power law distribution (or at least definitely not a normal distribution).

Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?

Three obvious approaches are:

  • Standardizing the variables (subtract the mean and divide by the standard deviation). It seems this would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
  • Rescaling the variables to the range [0,1] by subtracting min(variable) and dividing by max(variable) - min(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular, the means will be different.
  • Equalizing the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
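
For comparison, here is a minimal sketch of the three rescalings in plain Python. The sample values are hypothetical and chosen only to mimic a heavy right tail; the function names are illustrative, and simple addition is assumed as the way of combining the two metrics.

```python
from statistics import mean, pstdev

# Hypothetical sample values (not real network data), heavy right tail.
in_degree = [0, 1, 1, 2, 3, 15]
betweenness = [0.0, 10.0, 40.0, 200.0, 1500.0, 35000.0]

def z_score(xs):
    """Subtract the mean and divide by the population standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def min_max(xs):
    """Map the values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def mean_scale(xs):
    """Divide by the mean so that the rescaled variable has mean 1."""
    m = mean(xs)
    return [x / m for x in xs]

# Combine the two rescaled metrics by simple addition (an assumption).
for name, rescale in [("z-score", z_score), ("min-max", min_max), ("mean", mean_scale)]:
    index = [a + b for a, b in zip(rescale(in_degree), rescale(betweenness))]
    print(name, [round(v, 3) for v in index])
```

Note that all three are linear transformations, so each one only shifts and stretches the values; none of them changes the shape of the distribution.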

Any other ideas?

Kraska answered 1/4, 2009 at 3:9 Comment(0)
4

You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10-point scale based on whether it falls in the 0-10%, 10-20%, ..., or 90-100% percentile range. These transformed variates have, by construction, a uniform distribution on 1,2,...,10, and you can combine them however you wish.
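
A minimal sketch of the percentile idea, assuming the empirical distribution of the data stands in for the model. The sample values and the function names are hypothetical. With tied values (common for in_degree) the scores are only approximately uniform.

```python
from bisect import bisect_right

def rank_counts(xs):
    """For each value, count the observations less than or equal to it."""
    ordered = sorted(xs)
    return [bisect_right(ordered, x) for x in xs]

def percentile_ranks(xs):
    """Empirical CDF value of each observation, in (0, 1]."""
    n = len(xs)
    return [c / n for c in rank_counts(xs)]

def decile_scores(xs):
    """Map each value to 1..10 according to the decile of its rank."""
    n = len(xs)
    # Integer ceiling of 10 * count / n avoids floating-point edge cases.
    return [(10 * c + n - 1) // n for c in rank_counts(xs)]

# Hypothetical sample values (not real network data).
in_degree = [0, 1, 1, 2, 3, 15, 0, 4, 2, 7]
betweenness = [0.0, 10.0, 40.0, 200.0, 1500.0, 35000.0, 5.0, 900.0, 60.0, 12000.0]

# Combine the two decile scores by addition (one possible choice).
index = [a + b for a, b in zip(decile_scores(in_degree), decile_scores(betweenness))]
print(index)
```

Because only the rank of each value is used, the size of the gap between neighboring values is discarded, which may or may not be acceptable for a rank ordering.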

Clarion answered 1/4, 2009 at 3:30 Comment(0)
1

You could translate each to a percentage and then apply each to a known quantity. Then use the sum of the new values.

((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?

Suter answered 1/4, 2009 at 3:18 Comment(2)
Won't this approach have the same problem as the standardization method? It will squash the distribution so that percentiles 95 and 99 look pretty close even though they are worlds apart (think Bill Gates's bank account versus ... mine!) Kraska
This method places everything within a percentage. It is not based on how the number deviates from the mean. But I may not be clear on your methodology there. 2000 was arbitrary. The larger it is, the more unique values can be created. Suter
1

Very interesting question. Could something like this work:

Let's assume that we want to scale both variables to the range [-1,1]. Take the example of betweenness_centrality, which has a range of 0-35000.

  1. Choose a large number on the order of the range of the variable. As an example, let's choose 25,000.
  2. Create 25,000 bins in the original range [0,35000] and 25,000 bins in the new range [-1,1].
  3. For each number x-i, find the bin it falls into in the original range. Let this be B-i.
  4. Find the interval that corresponds to B-i in the range [-1,1].
  5. Use either the max or the min of that interval as the scaled version of x-i.

This preserves the power law distribution while also scaling it down to [-1,1], and it does not have the problem experienced by (x-mean)/sd.
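
The steps above can be sketched as follows. This is a minimal illustration that assumes equal-width bins and uses the lower edge of the target bin in step 5; the function name and default parameters are hypothetical.

```python
def bin_rescale(x, lo=0.0, hi=35000.0, n_bins=25000, new_lo=-1.0, new_hi=1.0):
    """Scale x from [lo, hi] to [new_lo, new_hi] via n_bins equal-width bins."""
    # Step 3: index of the bin x falls into in the original range.
    # x == hi is clamped into the last bin.
    b = min(int((x - lo) * n_bins / (hi - lo)), n_bins - 1)
    # Steps 4-5: locate the matching bin in the new range and
    # return its lower edge (the "min" choice).
    new_width = (new_hi - new_lo) / n_bins
    return new_lo + b * new_width

for value in (0.0, 10.0, 17500.0, 35000.0):
    print(value, bin_rescale(value))
```

With equal-width bins on both sides, this amounts to a linear rescaling to [-1,1] rounded down to the bin width, so the shape of the distribution is kept up to that rounding.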

Unplumbed answered 28/6, 2012 at 18:44 Comment(0)
0

Normalizing to [0,1] would be my short-answer recommendation for combining the two values, as it will maintain the distribution shape, as you mentioned, and should solve the problem of combining the values.

If the distributions of the two variables are different, which sounds likely, this won't really give you what I think you're after, which is a combined measure of where each variable is within its given distribution. You would have to come up with a metric that determines where in the given distribution the value lies. This could be done in many ways, one of which would be to determine how many standard deviations away from the mean the given value is. You could then combine these two values in some way to get your index (addition may no longer be sufficient).

You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.

You answered 1/4, 2009 at 3:35 Comment(2)
Your second paragraph seems to describe the standardizing approach, where you go from the raw metric value to the number of standard deviations the value is from the mean. This all seems to work best with normal distributions, and less well with other distributions. Kraska
Agreed. As I indicated in the third paragraph, you need to look at statistical measurements that pertain to your data set. If they are power law distributions, these are variance, moments, skewness, and possibly kurtosis. You
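
As a rough illustration of the measures named in the comment above, here is a minimal sketch that estimates skewness and excess kurtosis from sample central moments. The helper names and the sample values are hypothetical. For heavy-tailed data these estimates can be unstable, since the higher moments of a power law may not be finite.

```python
from statistics import mean

def central_moment(xs, k):
    """k-th sample central moment (population form, dividing by n)."""
    m = mean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third standardized moment; positive for a long right tail."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

# Hypothetical sample values (not real network data).
betweenness = [0.0, 10.0, 40.0, 200.0, 1500.0, 35000.0]
print(skewness(betweenness), excess_kurtosis(betweenness))
```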
