pyspark OneHotEncoded vectors appear to be missing categories?

I'm seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder): it looks like the one-hot vectors are missing some categories (or maybe are just formatted oddly when displayed?).

Update: having now answered this question myself (see below), it turns out the dataset details that follow are not strictly necessary for understanding the problem.

I have a dataset of the form

1. Wife's age                     (numerical)
2. Wife's education               (categorical)      1=low, 2, 3, 4=high
3. Husband's education            (categorical)      1=low, 2, 3, 4=high
4. Number of children ever born   (numerical)
5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
6. Wife's now working?            (binary)           0=Yes, 1=No
7. Husband's occupation           (categorical)      1, 2, 3, 4
8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
9. Media exposure                 (binary)           0=Good, 1=Not good
10. Contraceptive method used     (class attribute)  1=No-use, 2=Long-term, 3=Short-term  

with the actual data looking like

wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1

sourced from here: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.

After doing some other preprocessing on the data, I try to encode the categorical and binary features (the binary ones just for the sake of practice) as one-hot vectors via...

from pyspark.ml.feature import OneHotEncoder

for col_name in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index',
                 'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    # input column must already be numeric (double-typed) for OneHotEncoder
    encoder = OneHotEncoder(inputCol=col_name, outputCol='%s_1hot' % col_name)
    dataset = encoder.transform(dataset)

produces a row that looks like

Row(
    ...., 
    numeric_features=DenseVector([24.0, 3.0]), numeric_features_normalized=DenseVector([-1.0378, -0.1108]), 
    wife_edu_1hot=SparseVector(4, {2: 1.0}), 
    husband_edu_1hot=SparseVector(4, {3: 1.0}), 
    husband_occupation_1hot=SparseVector(4, {2: 1.0}), 
    SoL_index_1hot=SparseVector(4, {3: 1.0}), 
    wife_religion_1hot=SparseVector(1, {0: 1.0}),
    wife_working_1hot=SparseVector(1, {0: 1.0}),
    media_exposure_1hot=SparseVector(1, {0: 1.0}),
    contraceptive_1hot=SparseVector(2, {0: 1.0})
)

My understanding of the sparse vector format is that SparseVector(S, {i1: v1, i2: v2, ..., in: vn}) denotes a vector of length S in which all values are 0 except for indices i1, ..., in, which hold the corresponding values v1, ..., vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
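
For example, constructing one of these by hand (using pyspark.ml.linalg) matches that description:

from pyspark.ml.linalg import SparseVector

v = SparseVector(4, {2: 1.0})
print(v.toArray())  # [0. 0. 1. 0.] -- length 4, all zeros except a 1.0 at index 2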

Based on the encoder output, though, it seems like the first argument to SparseVector here actually denotes the highest index in the vector rather than its size (e.g., the binary wife_religion feature has two categories but comes out as SparseVector(1, {0: 1.0})). Furthermore, combining all the features (via pyspark's VectorAssembler) and checking the resulting dataset.head(n=1) vector shows

input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})

which corresponds to a vector looking like

indices:  0        1       2  3  4 ...          9        12             17 18 19 20 21
        [-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
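
For reference, the assembly step looks roughly like this (the input column list is reconstructed from my pipeline, so names may differ slightly):

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=['numeric_features_normalized', 'wife_edu_1hot', 'husband_edu_1hot',
               'husband_occupation_1hot', 'SoL_index_1hot', 'wife_religion_1hot',
               'wife_working_1hot', 'media_exposure_1hot', 'contraceptive_1hot'],
    outputCol='input_features')
dataset = assembler.transform(dataset)
# sizes: 2 + 4 + 4 + 4 + 4 + 1 + 1 + 1 + 2 = 23, matching SparseVector(23, ...) above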

I would think it should be impossible to have a sequence of three or more consecutive 1s (as seen near the tail of the vector above), since that would imply that one of the one-hot vectors (e.g., the one contributing the middle 1) has size 1, which would not make sense for any of the features here.

I'm very new to machine learning, so I may be confused about some basic concepts, but does anyone know what's going on here?

Zoography answered 31/7, 2018 at 1:9

Found this in the pyspark docs (https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder):

...with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
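
This is easy to verify directly (a minimal sketch; assumes an active SparkSession named spark, and Spark 2.x, where OneHotEncoder is a plain Transformer):

from pyspark.ml.feature import OneHotEncoder

df = spark.createDataFrame([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)], ['category'])
encoder = OneHotEncoder(inputCol='category', outputCol='category_1hot')  # dropLast=True by default
encoder.transform(df).show(truncate=False)
# 2.0 -> (4,[2],[1.0])  i.e. [0.0, 0.0, 1.0, 0.0]
# 4.0 -> (4,[],[])      i.e. [0.0, 0.0, 0.0, 0.0] -- the dropped last category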

More discussion about why this kind of last-category-dropping would be done can be found here (http://www.algosome.com/articles/dummy-variable-trap-regression.html) and here (https://stats.stackexchange.com/q/290526/167299).

I am pretty new to machine learning of any kind, but it seems that, for regression models at least, dropping the last category is done to avoid something called the dummy variable trap, where "the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others". In other words, you'd end up with a redundant feature, which I assume is not good for fitting a model's weights.

E.g., you don't need a one-hot encoding of [isBoy, isGirl, unspecified] when an encoding of [isBoy, isGirl] communicates the same information about someone's gender: here [1,0]=isBoy, [0,1]=isGirl, and [0,0]=unspecified.
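
To make that concrete (a hypothetical 3-category gender feature, again assuming a SparkSession named spark; dropLast toggled for comparison):

from pyspark.ml.feature import OneHotEncoder

# hypothetical codes: 0.0=isBoy, 1.0=isGirl, 2.0=unspecified
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ['gender'])

OneHotEncoder(inputCol='gender', outputCol='gender_1hot', dropLast=False).transform(df).show()
# size-3 vectors: [1,0,0], [0,1,0], [0,0,1] -- entries always sum to 1, so one column is redundant
OneHotEncoder(inputCol='gender', outputCol='gender_1hot', dropLast=True).transform(df).show()
# size-2 vectors: [1,0], [0,1], [0,0] -- same information, no linear dependence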

This link (http://www.algosome.com/articles/dummy-variable-trap-regression.html) provides a good example, with the conclusion being

The solution to the dummy variable trap is to drop one of the categorical variables (or alternatively, drop the intercept constant) - if there are m number of categories, use m-1 in the model, the value left out can be thought of as the reference value and the fit values of the remaining categories represent the change from this reference.

Note: while looking for an answer to the original question, I found this similar SO post (Why does Spark's OneHotEncoder drop the last category by default?). I still think this post warrants existing: the linked post covers why this behavior happens, whereas this one is about being confused as to what was going on in the first place, and googling this question's title does not surface the linked post.

Zoography answered 31/7, 2018 at 1:9
