Effect of feature scaling on accuracy

I am working on image classification using Gaussian Mixture Models (GMMs). I have around 34,000 features, belonging to three classes, all lying in a 23-dimensional space. I performed feature scaling on both the training and testing data using different methods, and I observed that accuracy actually decreases after scaling. I performed feature scaling because many features differed by several orders of magnitude. I am curious why this is happening; I thought that feature scaling would increase the accuracy, especially given the large differences in features.

Maryrose answered 31/10, 2014 at 5:52 Comment(3)
1) How many training samples do you have? 2) Are you testing on data that was not seen during training? (In this case, since you are doing GMM, you are just doing clustering.) 3) What is the accuracy before and after your change?Midriff
I should have added this to the question itself, sorry about that. I am using the GMM to build a Bayes classifier. Before the scaling I was getting an accuracy of 70%, and after the scaling the accuracy drops to 45%.Maryrose
@Raghuram Do you have the original images? What happens if you enlarge the original image? Accuracy should not be lost there.Veneration

I thought that feature scaling would increase the accuracy, especially given the large differences in features.

Welcome to the real world, buddy.

In general, it is quite true that you want features to be on the same "scale" so that no feature "dominates" the others. This is especially so if your machine learning algorithm is inherently "geometrical" in nature. By "geometrical", I mean it treats the samples as points in a space and relies on distances between points (usually Euclidean/L2, as in your case) to make its predictions, i.e., the spatial relationships of the points matter. GMM and SVM are algorithms of this nature.
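
To make "dominating" concrete, here is a minimal sketch (my own made-up numbers, nothing from the question): two samples that differ by 1 unit in a small-scale feature and by 1,000 units in a large-scale one. The Euclidean distance such geometric methods rely on is driven almost entirely by the large-scale feature until the features are standardized.

```python
import numpy as np

# [height in metres, yearly income in dollars] -- hypothetical features
a = np.array([1.0, 50_000.0])
b = np.array([2.0, 51_000.0])

# Raw Euclidean distance: ~1000.0005, i.e. income completely dominates.
print(np.linalg.norm(a - b))

# Standardize with assumed means and standard deviations (0.1 m and 20,000 $):
mean, std = np.array([1.7, 40_000.0]), np.array([0.1, 20_000.0])
a_s, b_s = (a - mean) / std, (b - mean) / std

# Scaled Euclidean distance: ~10.0, and now the height difference matters.
print(np.linalg.norm(a_s - b_s))
```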

However, feature scaling can screw things up, especially if some features are categorical/ordinal in nature and you didn't properly preprocess them when you appended them to the rest of your features. Furthermore, depending on your feature scaling method, the presence of outliers in a particular feature can also screw up the scaling for that feature. For example, "min/max" or "unit variance" scaling is sensitive to outliers (e.g., if one of your features encodes yearly income or cash balance and there are a few millionaires/billionaires in your dataset).
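
Here is a hedged illustration of the outlier point (made-up incomes, assuming scikit-learn is available): a single billionaire squashes a min/max-scaled income feature into a corner of [0, 1], whereas a median/IQR-based scaler is much less affected.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Nine ordinary yearly incomes and one billionaire (values are made up).
income = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70, 1_000_000.0]).reshape(-1, 1) * 1000

# Min/max scaling: every ordinary income lands below ~4e-05 of the range,
# so they become practically indistinguishable from one another.
print(MinMaxScaler().fit_transform(income).ravel())

# Median/IQR-based scaling: ordinary incomes stay spread out (roughly -1 to 0.8),
# and only the outlier itself gets a huge value.
print(RobustScaler().fit_transform(income).ravel())
```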

Also, when you experience a problem like this, the cause may not be obvious. Performing feature scaling and getting a worse result does not mean feature scaling is at fault. It could be that your method was screwed up to begin with, and the result after feature scaling just happens to be even more screwed up.

So what could be the other cause(s) of your problem?

  1. My guess for the most likely cause is that you have high-dimensional data and not enough training samples. Your GMM is going to be estimating covariance matrices from data that, by your description, is 34,000-dimensional. Unless you have a lot of data, chances are one or more of your covariance matrices (one per Gaussian) will be singular or near-singular. This means the predictions from your GMM are nonsense to begin with, because your Gaussians "blew up" and/or the EM algorithm simply gave up after a predefined number of iterations (see the sketch after this list).
  2. Poor testing methodology. You may not have divided your data into proper training/validation/test sets and performed the testing properly, in which case the "good" performance you had in the beginning was not credible. This is actually very common, as the natural tendency is to test on the training data the model was fitted on rather than on a validation or test set.
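
Here is a rough sketch of both points (synthetic stand-in data, not the poster's features; the class count, component count, and reg_covar value are illustrative assumptions): hold out a test set, and constrain/regularize the per-class GMM covariances so they cannot become singular.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 23))            # synthetic 23-D "features"
y = rng.integers(0, 3, size=600)          # three hypothetical classes

# Evaluate on data the model has never seen.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One GMM per class (a GMM-based Bayes classifier). Diagonal covariances and
# reg_covar keep the covariance estimates away from singularity.
models = {
    c: GaussianMixture(n_components=2, covariance_type="diag",
                       reg_covar=1e-4, random_state=0).fit(X_tr[y_tr == c])
    for c in np.unique(y_tr)
}

# Predict by maximum per-class log-likelihood, scored on the held-out set only.
scores = np.column_stack([models[c].score_samples(X_te) for c in sorted(models)])
print((scores.argmax(axis=1) == y_te).mean())   # held-out accuracy
```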

So what can you do?

  1. Don't use a GMM for image categorization. Use a proper supervised learning algorithm, especially since you have known image categories as labels. In particular, to sidestep feature scaling altogether, use a random forest or one of its variants (e.g., extremely randomized trees); see the sketch after this list.
  2. Get more training data. Unless you are classifying "simple" (i.e., "toy"/synthetic) images or classifying them into only a few classes (e.g., <= 5; note this is just a small number I pulled out of the air), you really need a good number of images per class. A good starting point is at least a couple of hundred per class, or use a more sophisticated algorithm that exploits the structure within your data to arrive at better performance.
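
Below is a hedged sketch of the random-forest point (synthetic data from make_classification, not image features): because trees split on per-feature thresholds, multiplying a feature by a huge constant does not change the predictions, so feature scaling stops being an issue.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem with 23 features, loosely mimicking the setup.
X, y = make_classification(n_samples=900, n_features=23, n_informative=10,
                           n_classes=3, random_state=0)

# Blow up the scale of one feature by a factor of a million.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1e6

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
print(cross_val_score(clf, X_rescaled, y, cv=5).mean())  # same score: splits are
                                                         # threshold-based, so
                                                         # rescaling is harmless
```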

Basically, my point is: don't (just) treat machine learning algorithms as black boxes and a bag of tricks that you memorize and try at random. Try to understand the algorithm/math under the hood. That way, you'll be better able to diagnose the problems you encounter.


EDIT (in response to the request for clarification by @Communard)

For papers, the only one I can recall off the top of my head is A Practical Guide to Support Vector Classification by the authors of LibSVM. The examples therein show the importance of feature scaling for SVM on various datasets. Consider, e.g., the RBF/Gaussian kernel, which is computed from the squared L2 norm: if your features are on different scales, this directly affects the kernel value.
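
A small sketch of that point (my own made-up numbers, not taken from the guide): the RBF kernel k(x, z) = exp(-gamma * ||x - z||^2) is driven almost entirely by whichever feature has the largest scale until the features are rescaled.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.70, 50_000.0]])   # [height in metres, income] -- hypothetical
z = np.array([[1.80, 52_000.0]])

# ||x - z||^2 is about 4e6, so the kernel value underflows to 0:
# every pair of samples looks "infinitely far apart".
print(rbf_kernel(x, z, gamma=1.0))

# Divide each feature by a rough, assumed scale (0.1 m and 20,000):
scale = np.array([0.1, 20_000.0])
print(rbf_kernel(x / scale, z / scale, gamma=1.0))   # ~exp(-1.01), about 0.36
```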

Also, how you represent your features matters. E.g., changing a variable that represents height from metres to centimetres or inches will affect algorithms such as PCA (because the variance along that feature's direction changes). Note this is different from "typical" scaling (e.g., min/max, Z-score, etc.) in that it is a matter of representation: the person is still the same height regardless of the unit, whereas typical feature scaling transforms the data, which changes the "height" of the person. Prof. David MacKay, on the Amazon page of his book Information Theory, Inference, and Learning Algorithms, makes a comment in this vein when asked why he did not include PCA in his book.
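
A hedged sketch of the representation point (made-up heights and weights with assumed spreads): simply re-expressing height in centimetres instead of metres flips which direction PCA considers the most variable, even though no explicit "scaling" step was applied.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=200)    # heights in metres (std ~0.1 m)
weight_kg = rng.normal(70.0, 5.0, size=200)  # weights in kilograms (std ~5 kg)

X_m = np.column_stack([height_m, weight_kg])
X_cm = np.column_stack([height_m * 100, weight_kg])  # same people, height in cm

# First principal component: ~[0, ±1] (weight dominates) vs ~[±1, 0] (height dominates).
print(PCA(n_components=1).fit(X_m).components_)
print(PCA(n_components=1).fit(X_cm).components_)
```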

Ordinal and categorical variables are mentioned briefly in Bayesian Reasoning and Machine Learning and The Elements of Statistical Learning. They describe ways to encode them as features, e.g., replacing a variable that can take 3 categories with 3 binary variables, with exactly one set to "1" to indicate which category the sample has. This is important for methods such as linear regression (or linear classifiers). Note this is about encoding categorical variables/features, not scaling per se, but it is part of the feature preprocessing setup and hence useful to know. More can be found in Hal Daumé III's book below.
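
A small sketch of that encoding (the column values are made up; scikit-learn's OneHotEncoder is just one of several ways to do it): a 3-category variable becomes 3 binary indicator features.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colour = np.array([["red"], ["green"], ["blue"], ["green"]])   # hypothetical column

onehot = OneHotEncoder().fit_transform(colour).toarray()
print(onehot)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]   -- columns are [blue, green, red], exactly one "1" per sample
```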

The book A Course in Machine Learning by Hal Daumé III: search for "scaling". One of the earliest examples in the book is how scaling affects KNN (which relies on L2 distance, just as GMM and SVM with the RBF/Gaussian kernel do). More details are given in Chapter 4, "Machine Learning in Practice". Unfortunately the images/plots are not shown in the PDF. This book has one of the nicest treatments of feature encoding and scaling, especially if you work in Natural Language Processing (NLP). E.g., see his explanation of applying the logarithm to features (i.e., a log transform): sums of logs become the log of a product of features, so the "effects"/"contributions" of these features are tapered by the logarithm.
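
A minimal sketch of that log transform (made-up counts, e.g. word frequencies): the heavy tail is tamed, and a sum of logged features corresponds to the log of the product of the raw features.

```python
import numpy as np

counts = np.array([1.0, 3.0, 10.0, 10_000.0])   # hypothetical term counts

# log1p = log(1 + x), a common variant that keeps zero counts finite.
print(np.log1p(counts))                          # ~[0.69, 1.39, 2.40, 9.21]

# Sum of logs equals log of the product (shown with the plain log on positives).
print(np.log(counts).sum(), np.log(counts.prod()))   # both ~12.61
```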

Note that all the aforementioned textbooks are freely downloadable online.

Midriff answered 31/10, 2014 at 7:37 Comment(6)
Well, I did test on the training data. I am using GMMs for the image categorization because I was curious to see how it would work out. Thanks for the advice!!Maryrose
This is a pretty good answer; however, can you reference some papers? I am trying to find scientific evidence of the effect of feature scaling.Communard
@Communard I've included more details. Feature scaling is more of an "art" or preprocessing step in research and is mentioned sporadically in different papers, often investigated in the context of a particular application. I don't recall offhand any theoretical research on it, although if you understand the math behind the algorithms, it is not hard to see how the scale changes things. Results in the literature are mostly empirical (i.e., test on datasets and report the best scaling method), which makes the procedures largely dependent on the nature of the data. You'll see some in the books above.Midriff
@Midriff thanks for the refs. I am familiar with basic normalisation and standardisation, but I wanted some more info, especially examples of successful uses. Thank you for your answer; it is exactly what I was looking for.Communard
The part on representation of features could be better explained by simply saying that a feature with large variance dominates features with smaller variance. E.g., suppose we want to classify people based on height (in metres) and weight (in kilograms). The height attribute has low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 50 kg to 250 kg. If the scales of the attributes are not taken into consideration, the distance measure may be dominated by differences in the weights of people. Source: Introduction to Data Mining, Ch. 5, Pang-Ning Tan.Tinsley
Thanks for the detailed answer. Regarding "the presence of outliers in a particular feature can also screw up the scaling for that feature": won't outliers hurt performance for distance-based methods even if the data is not normalized? I cannot see why normalization would worsen performance when there are outliers in the data.Herminahermine
