Clustering algorithm in R for missing categorical and numerical values

Asked 3/6, 2014 at 23:26 Answered 24/11, 2017 at 6:26

r machine-learning cluster-analysis missing-data

I want to perform marketing segmentation clustering on a dataset with missing categorical and numerical values in R. I cannot perform k-means clustering because of the missing values.

R version 3.1.0 (2014-04-10)

Platform: x86_64-apple-darwin13.1.0 (64-bit)

Mac OSX 10.9.3 4GB hard drive

Is there a clustering package available in R that can accommodate a partial fill rate? In the scholarly articles on missing values that I have looked at, the researchers create a new algorithm for their special use case, and the implementations are not available as R packages. Examples are k-means with soft constraints and k-means clustering with a partial distance strategy.

I have 36 variables, but here is a description of the first 5:

head(df)

  user_id   Age Gender Household.Income Marital.Status
1   12945         Male
2   12947         Male
3   12990
4   13160 25-34   Male        100k-125k         Single
5   13195         Male         75k-100k         Single
6   13286

Please let me know if I can provide additional information.

Forelli answered 3/6, 2014 at 23:26 Comment(1)

@EDi, there have been scalability issues with matrix-oriented approaches before traditional clustering methods. I got an error about being unable to allocate a vector of a certain size. – Forelli 13/6, 2014 at 17:57

The k-means algorithm is usually not preferred in the presence of categorical variables. There is a variant of k-means, called k-prototypes, that can handle mixed data types. You can find more about the package that can do this here.
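As a rough sketch of what k-prototypes looks like in practice: one R implementation is kproto() in the clustMixType package. Whether that is the package this answer links to is an assumption, and the toy data frame below is made up for illustration. The call is guarded so the snippet still runs if the package is not installed.

```r
# Hypothetical toy data: one numeric and one categorical variable
set.seed(1)
n <- 40
toy <- data.frame(
  spend  = c(rnorm(n / 2, mean = 100, sd = 10),
             rnorm(n / 2, mean = 300, sd = 10)),
  gender = factor(rep(c("Male", "Female"), each = n / 2))
)

kp_clusters <- NULL
# Assumption: clustMixType is installed; skip the fit otherwise
if (requireNamespace("clustMixType", quietly = TRUE)) {
  # k-prototypes with k = 2; lambda (numeric vs. categorical weight)
  # is left at its default and estimated from the data
  fit <- clustMixType::kproto(toy, k = 2, verbose = FALSE)
  kp_clusters <- fit$cluster
  print(table(kp_clusters, toy$gender))
}
```

The categorical columns need to be factors, not character vectors, for the mixed-type distance to be used.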

For missing values, you may either remove those rows (which is usually not preferred) or impute suitable values. Generally, the mean can be imputed for a numeric variable and the mode for a categorical variable. Alternatively, standard imputation packages such as mice can be used.
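A minimal base R sketch of the mean/mode imputation described above. The data frame and the helper names stat_mode() and impute_simple() are hypothetical, and it assumes missing cells are already coded as NA and categorical columns are factors.

```r
# Hypothetical toy data; NA marks a missing cell
toy <- data.frame(
  income = c(50, NA, 70, 90, NA, 60),
  gender = factor(c("Male", NA, "Female", "Male", "Male", NA))
)

# Most frequent non-missing level of a factor (first one wins on ties)
stat_mode <- function(f) {
  counts <- table(f)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Fill numeric columns with the column mean, other columns with the mode
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- stat_mode(df[[col]])
    }
  }
  df
}

imputed <- impute_simple(toy)
print(imputed)
```

Filling every gap with a single value shrinks the variance of that column, which is one reason a multiple-imputation package such as mice may be preferable when many cells are missing.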

Ref:

Z. Huang (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables. Data Mining and Knowledge Discovery 2, 283-304.

Gayton answered 24/11, 2017 at 6:26 Comment(0)

I'd suggest using hierarchical clustering (HC) with Gower's metric. Check whether the empty cells can be recoded as NAs, which Gower's metric can accommodate.

HC with Gower's metric can handle categorical and numerical values. Check out the daisy function in the cluster package in R.

daisy(x, metric = "gower", stand = FALSE, type = list(), weights = rep.int(1, p))

For more info, here it is: https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html
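A small sketch of this approach on made-up data, assuming the empty strings stand for missing values. daisy() skips variables that are NA for a given pair of rows, but a pair of rows with no observed variable in common gets an NA dissimilarity, which hclust() cannot accept. The toy rows below are chosen so that every pair overlaps in at least one column.

```r
library(cluster)  # provides daisy()

# Hypothetical toy data with blanks and NA for missing cells
toy <- data.frame(
  age    = c("25-34", "", "35-44", "25-34", "35-44", "35-44"),
  gender = c("Male", "Male", "", "Male", "Female", "Female"),
  spend  = c(120, 100, 300, NA, 280, 310),
  stringsAsFactors = FALSE
)

# Recode empty strings as NA and turn character columns into factors
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], function(x) {
  x[x == ""] <- NA
  factor(x)
})

# Gower dissimilarities on mixed data, then average-linkage clustering
d <- daisy(toy, metric = "gower")
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)  # cut the tree into two segments
print(groups)
```

The dissimilarity object grows quadratically with the number of rows, so on a large data set this can run into the memory limits mentioned in the question's comment.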

Mephistopheles answered 27/6, 2016 at 6:7 Comment(0)

A variant of Eduardo's answer would be to use a sparse matrix approximation to fill in the missing cells, and then to cluster. Once you have estimates for all values, you can use either hierarchical clustering or k-means. See the Amelia or softImpute packages.

Scutum answered 7/10, 2016 at 23:52 Comment(0)
