Normalisation with a zero in the standard deviation
I'm trying to centre and normalise a data set in Python with the following code:

import numpy as np

mean = np.mean(train, axis=0)        # per-feature (column) means
std = np.std(train, axis=0)          # per-feature standard deviations
norm_train = (train - mean) / std    # z-score normalisation

The problem is that I get a division by zero error. Two of the values in the data set end up having a zero std. The data set is of shape (3750, 55). My stats skills are not so strong, so I'm not sure how to overcome this. Any suggestions?

Samuel asked 7/4, 2016 at 20:4 Comment(7)
By two of the values, do you mean two variables/features? If they have zero standard deviation, all their values are the same, so they are basically useless for any kind of analysis. If you have to keep them, then, considering all the other variables will have 0 mean, you can just convert them to zero as well.Babineaux
What I mean by two values is that np.std(trainData, axis=0)[28] = 0 and np.std(trainData, axis=0)[49] = 0, to be explicit. I just had another look at the data and I can see that trainData[:, 28] and trainData[:, 49] are all zeros. So are you suggesting that I remove them from the dataset?Samuel
Whether you should remove them depends on what you want to do with them. But you can't divide by their standard deviation, as they have none.Jordanjordana
Well, the aim is to train a k-means classifier. Would it be valid to override the zero values for those specific indices to 1 in the result of std, so that the resulting values after the division will just be 0?Samuel
Since this is called train, can you also check whether those columns are all zeros in the test dataset too? If that is the case, they have no discriminative power, so I'd say it's safe to remove them (most algorithms will remove them or not work with them anyway).Babineaux
Yes, they happen to be all zeros in the test data as well. I will remove them then. Thanks a lot! You should move your last comment to an answer to my question so I can mark it as answered.Samuel
Sure, I added it as an answer.Babineaux
Since the standard deviation is the square root of the average squared deviation from the mean, it can only be zero when all the values of a variable are the same (all equal to the mean). In that case, those variables have no discriminative power, so they can be removed from the analysis: they cannot improve any classification, clustering or regression task. Many implementations will drop them for you, or throw an error during a matrix computation.
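
For example, a minimal sketch of dropping the constant columns (assuming train and test are NumPy arrays, with the names taken from the thread):

import numpy as np

std = np.std(train, axis=0)
keep = std != 0                  # mask of non-constant (informative) columns
train_reduced = train[:, keep]   # drop zero-variance features from train
test_reduced = test[:, keep]     # apply the same mask to the test set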

Babineaux answered 7/4, 2016 at 20:40 Comment(0)
One standard approach is to include an epsilon term that prevents division by zero. In theory it is not needed, because it makes no logical sense to do such calculations. In reality, machines are just calculators, and division by zero becomes either NaN or +/-Inf.

In short, define your function like this:

def z_norm(arr, epsilon=1e-100):
    # epsilon keeps the denominator non-zero when arr has zero std
    return (arr - arr.mean()) / (arr.std() + epsilon)

This assumes a 1D array, but it would be easy to change to row-wise or column-wise calculation of a 2D array.
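
For instance, a column-wise sketch for a 2D array might look like this (assuming features are stored in columns, as in the question):

import numpy as np

def z_norm_2d(arr, epsilon=1e-100):
    # Normalise each column: subtract the column mean and divide by the
    # column standard deviation; epsilon guards against zero-std columns.
    return (arr - arr.mean(axis=0)) / (arr.std(axis=0) + epsilon)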

Epsilon is an intentional error added to the calculation to prevent creating NaN or Inf. Where you would otherwise get Inf, you will still end up with numbers that are really large, but later calculations will not propagate Inf and may still retain some meaning.

The value of 1/(1 x 10^100) is incredibly small and will not change your result much. You can go down to 1e-300 or so if you want, but you risk hitting the smallest value your floating-point type can represent after further calculation. Be aware of the precision you use and the smallest value it can handle. I was using float64.
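
If you want to check those limits for your dtype, NumPy exposes them through np.finfo, for example:

import numpy as np

info = np.finfo(np.float64)
print(info.tiny)   # smallest positive normal float64, about 2.2e-308
print(info.eps)    # machine epsilon, about 2.2e-16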

Update 2021-11-03: Adding test code. The objective of this epsilon is to minimize damage and remove the chance of random NaNs in your data pipeline. Setting epsilon to a positive value fixes the problem.

import numpy as np

for arr in [
        np.array([0, 0]),
        np.array([1e-300, 1e-300]),
        np.array([1, 1]),
        np.array([1, 2])
    ]:
    for epi in [1e-100, 0, 1e100]:
        stdev = arr.std()
        mean = arr.mean()
        result = z_norm(arr, epsilon=epi)
        print(f' z_norm(np.array({str(arr):<21}),{epi:<7}) ### stdev={stdev}; mean={mean:<6}; becomes --> {str(result):<19} (float-64) --> Truncate to 32 bits. =', result.astype(np.float32))

z_norm(np.array([0 0]                ),1e-100 ) ### stdev=0.0; mean=0.0   ; becomes --> [0. 0.]             (float-64) --> Truncate to 32 bits. = [0. 0.]
z_norm(np.array([0 0]                ),0      ) ### stdev=0.0; mean=0.0   ; becomes --> [nan nan]           (float-64) --> Truncate to 32 bits. = [nan nan]
z_norm(np.array([0 0]                ),1e+100 ) ### stdev=0.0; mean=0.0   ; becomes --> [0. 0.]             (float-64) --> Truncate to 32 bits. = [0. 0.]
z_norm(np.array([1.e-300 1.e-300]    ),1e-100 ) ### stdev=0.0; mean=1e-300; becomes --> [0. 0.]             (float-64) --> Truncate to 32 bits. = [0. 0.]
z_norm(np.array([1.e-300 1.e-300]    ),0      ) ### stdev=0.0; mean=1e-300; becomes --> [nan nan]           (float-64) --> Truncate to 32 bits. = [nan nan]
z_norm(np.array([1.e-300 1.e-300]    ),1e+100 ) ### stdev=0.0; mean=1e-300; becomes --> [0. 0.]             (float-64) --> Truncate to 32 bits. = [0. 0.]
z_norm(np.array([1 1]                ),1e-100 ) ### stdev=0.0; mean=1.0   ; becomes --> [0. 0.]             (float-64) --> Truncate to 32 bits. = [0. 0.]
z_norm(np.array([1 1]                ),0      ) ### stdev=0.0; mean=1.0   ; becomes --> [nan nan]           (float-64) --> Truncate to 32 bits. = [nan nan]
z_norm(np.array([1 1]                ),1e+100 ) ### stdev=0.0; mean=1.0   ; becomes --> [0. 0.]             (float-64) --> Truncate to 32 bits. = [0. 0.]
z_norm(np.array([1 2]                ),1e-100 ) ### stdev=0.5; mean=1.5   ; becomes --> [-1.  1.]           (float-64) --> Truncate to 32 bits. = [-1.  1.]
z_norm(np.array([1 2]                ),0      ) ### stdev=0.5; mean=1.5   ; becomes --> [-1.  1.]           (float-64) --> Truncate to 32 bits. = [-1.  1.]
z_norm(np.array([1 2]                ),1e+100 ) ### stdev=0.5; mean=1.5   ; becomes --> [-5.e-101  5.e-101] (float-64) --> Truncate to 32 bits. = [-0.  0.]
Stonybroke answered 10/7, 2019 at 17:12 Comment(2)
Your epsilon of 1e-100 is incredibly small, leading to the term shooting up immensely. You probably want something more like 1e100.Unfathomable
I added test code and clarification.Stonybroke
You could just replace a zero std with 1 for that feature. This would basically mean that the scaled value is zero for all the data points for that feature. This makes sense, as it implies the feature values do not deviate even a bit from the mean (the values are constant, and the constant is the mean).

FYI- This is what sklearn does! https://github.com/scikit-learn/scikit-learn/blob/7389dbac82d362f296dc2746f10e43ffa1615660/sklearn/preprocessing/data.py#L70
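
A minimal sketch of that idea (assuming train is a NumPy array, as in the question):

import numpy as np

mean = np.mean(train, axis=0)
std = np.std(train, axis=0)
std[std == 0] = 1.0                  # constant features: avoid division by zero
norm_train = (train - mean) / std    # constant columns become all zeros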

Maverick answered 11/2, 2019 at 3:53 Comment(0)
Going back to its definition, the idea behind the z-score is to give the distance between an element and the mean of the sample in terms of standard deviations. If all elements are the same, their distance to the mean is 0, and therefore the z-score is 0 times the standard deviation: all your data points are at the mean. The division by the standard deviation is a way to relate the distance to the dispersion of the data. Visually it is easy to understand and come to this conclusion: https://en.wikipedia.org/wiki/Standard_score#/media/File:The_Normal_Distribution.svg
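
In symbols, z = (x - mean) / std: the numerator measures the distance from the mean, and the denominator expresses it in units of dispersion. When every element equals the mean, the numerator is 0 for every point, so the only sensible z-score is 0 regardless of the degenerate denominator.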

Alpine answered 15/6, 2021 at 18:2 Comment(0)
