I'm seeing a weird problem when trying to generate one-hot encoded vectors for categorical features with pyspark's OneHotEncoder
(https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder): the one-hot vectors seem to be missing some categories (or are maybe just formatted oddly when displayed?).
Update: having now posted an answer to this question, it appears that the details below are not strictly necessary for understanding the problem.
I have a dataset of the form
1. Wife's age (numerical)
2. Wife's education (categorical) 1=low, 2, 3, 4=high
3. Husband's education (categorical) 1=low, 2, 3, 4=high
4. Number of children ever born (numerical)
5. Wife's religion (binary) 0=Non-Islam, 1=Islam
6. Wife's now working? (binary) 0=Yes, 1=No
7. Husband's occupation (categorical) 1, 2, 3, 4
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high
9. Media exposure (binary) 0=Good, 1=Not good
10. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term
with the actual data looking like
wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1
sourced from here: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.
After doing some other preprocessing on the data, I then try to encode the categorical and binary features (the binaries just for the sake of practice) to one-hot vectors via...
from pyspark.ml.feature import OneHotEncoder

for inds in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index',
             'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)
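(One thing worth keeping in mind here: Spark's OneHotEncoder has a dropLast parameter that defaults to True, so the last category is represented by the all-zeros vector and the output has one fewer slot than there are categories. A rough plain-Python sketch of that behaviour; one_hot is just an illustrative helper, not the actual Spark implementation:)

```python
def one_hot(value, num_categories, drop_last=True):
    """Sketch of Spark-style one-hot encoding. `value` is a 0-based
    category index; with drop_last=True the last category maps to the
    all-zeros vector, so the output has num_categories - 1 slots."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if value < size:  # the dropped (last) category stays all zeros
        vec[value] = 1.0
    return vec

# A 5-level feature (indices 0..4) encodes into only 4 slots:
print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4, 5))  # [0.0, 0.0, 0.0, 0.0]  <- last category dropped
```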
produces a row that looks like
Row(
....,
numeric_features=DenseVector([24.0, 3.0]), numeric_features_normalized=DenseVector([-1.0378, -0.1108]),
wife_edu_1hot=SparseVector(4, {2: 1.0}),
husband_edu_1hot=SparseVector(4, {3: 1.0}),
husband_occupation_1hot=SparseVector(4, {2: 1.0}),
SoL_index_1hot=SparseVector(4, {3: 1.0}),
wife_religion_1hot=SparseVector(1, {0: 1.0}),
wife_working_1hot=SparseVector(1, {0: 1.0}),
media_exposure_1hot=SparseVector(1, {0: 1.0}),
contraceptive_1hot=SparseVector(2, {0: 1.0})
)
My understanding of the sparse vector format is that SparseVector(S, {i1: v1, i2: v2, ..., in: vn})
implies a vector of length S where all values are 0 except for indices i1, ..., in, which have corresponding values v1, ..., vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
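That convention is easy to check by hand; sparse_to_dense below is just an illustrative helper (not a pyspark API) that expands the (size, {index: value}) spec printed in the Row output above:

```python
def sparse_to_dense(size, entries):
    """Expand a (size, {index: value}) sparse spec, as shown in
    pyspark's SparseVector repr, into a dense list of length `size`."""
    dense = [0.0] * size
    for idx, val in entries.items():
        dense[idx] = val
    return dense

# SparseVector(4, {2: 1.0}) from wife_edu_1hot above:
print(sparse_to_dense(4, {2: 1.0}))  # [0.0, 0.0, 1.0, 0.0]
```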
Based on this, it seems like the first argument of SparseVector in this case actually denotes the highest index in the vector (not its size). Furthermore, combining all the features (via pyspark's VectorAssembler) and checking the array version of the resulting dataset.head(n=1) vector shows
input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})
which would correspond to a vector like
indices: 0 1 2 3 4... 9 12 17 18 19 20 21
[-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
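Expanding the assembled SparseVector directly (again just plain Python for illustration, copying the indices and values from the Row output above) makes the layout easy to inspect:

```python
# The assembled input_features SparseVector from dataset.head(n=1):
size = 23
entries = {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0,
           17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0}

# Dense expansion: 0.0 everywhere except the listed indices.
dense = [entries.get(i, 0.0) for i in range(size)]

print(len(dense))    # 23
print(dense[17:22])  # [1.0, 1.0, 1.0, 1.0, 1.0] <- the run of 1s in question
```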
I would think it should be impossible to have a run of three or more consecutive 1s (as can be seen near the tail of the vector above), since that would mean one of the one-hot vectors (e.g. the one contributing the middle 1) has size 1, which would not make sense for any of these features.
I'm very new to machine learning, so I may be confused about some basic concepts here, but does anyone know what could be going on?