Mclust: Order of input parameters affecting clustering results
Asked Answered
I

4

6

I am using mclust to see various clusters in my data set using various numbers of input (X,Y,Z,R, and S in the script below):

e.g.

elements<-cbind(X,Y,Z,R,S)
dataclust<-Mclust(elements)

I just find out that the order of the input parameters matters and affect the results; in other words elements <- cbind(X,Y,Z,R,S) gives a different clusters than say elements-<cbind(Y,Z,X,R,S). My understanding is that all the input parameters have the same weight and importance in the clustering analysis. am I wrong or is it a bug?

I have seen that in R 2.15.3 and 2 other R versions.

Any comment on or explanation of the above is appreciated.

Instal answered 5/12, 2013 at 5:50 Comment(7)
Are these (X,Y,Z,R,S) whole instances used in the algorithm? How many clusters are appear in "dataclust" after algorithm run?Giese
I cannot reproduce your problem. Please provide an example, as described here.Wage
This still appears to be an issue that can come up. I realize this is an old thread, but I saw a 'good' example was never provided, so I'm posting one in hopes of getting clarification on how/why this issue arises. This link is to an an example. Mclust is run on the variables X1, X3, X5, X7, X9, and X11 in two different orders, producing two different solutions. gist.github.com/codydh/bb027d23705b5c2a61ec00afe07fe03eNinon
Could not reproduce the two different solutions provided by @Cody. I downloaded a fresh install of mclust and ran the commands (see my gist ) but both provided exact clustering solutions. Initially I thought the comment by @Anony-Mousse made sense because of the random nature of the Gaussian Model, but from the documentation, Mclust computes the most optimal model over various ones, hence it must provide same results (Mclust tries 9 different models)Calliopsis
By "9 different models", what I mean is it tries to cluster from G=1 cluster to G=9 clusters.Calliopsis
relevant seesionInfo(): R version 3.3.2 (2016-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: macOS Sierra 10.12.5, mclust_5.3Calliopsis
@Cody, could you please provide sessionInfo for your gist?Aesculapius
N
2

Unfortunately, I am unable to comment or edit my previous comment, so I'm posting an answer. @m-dz set me on a path that I think has revealed a possible answer. Specifically:

> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /   
 / /  / / /___/ /___/ /_/ /___/ // /    
/_/  /_/\____/_____/\____//____//_/    version 5.2.2
Type 'citation("mclust")' for citing this R package in publications.

> testDataA <- read.table("http://fimi.ua.ac.be/data/chess.dat")

> summary(Mclust(subset(testDataA, select = c(V1, V3, V5, V7, V9, V11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> summary(Mclust(subset(testDataA, select = c(V11, V9, V1, V3, V5, V7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mclust_5.2.2

loaded via a namespace (and not attached):
[1] tools_3.3.2

As you can see, this produces two solutions that match @m-dz's! However, what I was previously doing was loading the psych package. I'm seeing now this is masking sim from mclust. I'm guessing this then causes the incorrect solutions:

> library(psych)

Attaching package: ‘psych’

The following object is masked from ‘package:mclust’:

    sim

> testDataB <- read.file(f = "http://fimi.ua.ac.be/data/chess.dat")
Data from the .data file http://fimi.ua.ac.be/data/chess.dat has been loaded.

> summary(Mclust(subset(testDataB, select = c(X1, X3, X5, X7, X9, X11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 2 components:

 log.likelihood    n df      BIC      ICL
       3547.068 3195 49 6698.738 6692.126

Clustering table:
   1    2 
2759  436 

> summary(Mclust(subset(testDataB, select = c(X11, X9, X1, X3, X5, X7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 6 components:

 log.likelihood    n  df      BIC      ICL
       18473.94 3195 137 35842.37 35834.51

Clustering table:
  1   2   3   4   5   6 
431 932 210 881 524 217 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] psych_1.6.9  mclust_5.2.2

loaded via a namespace (and not attached):
[1] parallel_3.3.2 tools_3.3.2    foreign_0.8-67 mnormt_1.5-5  
Ninon answered 25/6, 2017 at 12:50 Comment(2)
It's not the psych package itself, looks like it is the read.file function, which is reading in the first row as a header and messing with the rest of the script - see my edit.Aesculapius
Small update, looks like columns 1:6 give different results. Need to investigate further.Aesculapius
A
1

Often, Gaussian Mixture Model clustering is initialized randomly, as it will only find a local maximum.

Don't expect it to return the same result all the time.

Awfully answered 6/12, 2013 at 9:26 Comment(3)
The package may have been updated over the years. Using the main Mclust function in the package gives "The optimal model according to BIC for EM initialized by hierarchical clustering..." Hierarchical clustering should in theory give same outputs over repeated runs.Calliopsis
Hmmmm. The Mclust package apparently follows an unusual approach here. Hierarchical clustering is very expensive to use for initialization. I'm mostly using the ELKI EM implementation, and it uses the same initialization as k-means.Awfully
But even hclust may yield different results, although I'd expect this to be rare (on discrete or low resolution fatal only), and to happen on permutation of rows not columns.Awfully
A
1

Edit:

My previous edit re. read.file treating the first row as a header is correct, but this is not the case. Apparently columns 1 to 6, regardless whether called V1, V2, V3, V4, V5, V6 or X1, X3, X5, X7, X9, X11, do give different results. I will investigate further slightly later.

library(mclust)
library(psych)
library(magrittr)
# sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_United Kingdom.1252 
# [2] LC_CTYPE=English_United Kingdom.1252   
# [3] LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods  
# [7] base     
# 
# other attached packages:
#   [1] magrittr_1.5 psych_1.7.5  mclust_5.3  
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.4.0    parallel_3.4.0    tools_3.4.0      
# [4] foreign_0.8-68    rstudioapi_0.6    mdaddins_0.0.0001
# [7] nlme_3.1-131      mnormt_1.5-5      grid_3.4.0       
# [10] lattice_0.20-35  

testData_rt <- read.table("http://fimi.ua.ac.be/data/chess.dat")
testData_rf <- read.file("http://fimi.ua.ac.be/data/chess.dat", header = FALSE)  # Without this read.file is skipping first row
testData_rf_head <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rf_head %<>%set_names(names(testData_rf))
testData_rf_head_V2 <- read.file("http://fimi.ua.ac.be/data/chess.dat")

testData_rt %>% str()
testData_rf %>% str()
testData_rf_head %>% str()

# Same res.:
summary(Mclust(subset(testData_rt, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rt, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf_head, select = c(V11, V9, V1, V3, V5, V7))))

# Different res.:
summary(Mclust(subset(testData_rf_head_V2, select = c(X1, X3, X5, X7, X9, X11))))
summary(Mclust(subset(testData_rf_head_V2, select = c(X11, X9, X1, X3, X5, X7))))

# Different res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V2, V3, V4, V5, V6))))
summary(Mclust(subset(testData_rf_head, select = c(V6, V5, V1, V2, V3, V4))))

Old answer:

Have done my best to investigate the issue:

  • Current R (3.4.0) and mclust (5.3) tested: order and seed had no effect;
  • mclust 4.2 (current on Dec 5 '13 when the question was asked), the same, no effect;
  • R 2.25.3 mentioned by @user3068797: could not compile mclust 4.2, gave up as it would take too long to debug this;
  • @Cody did not provide a sessionInfo(), so do not know where to dig more.

To the code:

library(mclust)
sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# other attached packages:
# [1] mclust_5.3

testData <- read.table("http://fimi.ua.ac.be/data/chess.dat")

## Seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 



## Question asked asked Dec 5 '13
## mclust 4.2 modified on 2013-07-19, 4.3 introduced on 2014-03-31
devtools::install_version(package = 'mclust', version = 4.2)

## Fix mclust:::unchol
# mclust:::unchol
unchol <- function(x, upper = NULL)
{
  if(is.null(upper)) {
    upper <- any(x[row(x) < col(x)])
    lower <- any(x[row(x) > col(x)])
    if(upper && lower)
      stop("not a triangular matrix")
    if(!(upper || lower)) {
      x <- diag(x)
      return(diag(x * x))
    }
  }
  dimx <- dim(x)
  storage.mode(x) <- "double"
  .Fortran("uncholf",
           as.logical(upper),
           x,
           as.integer(nrow(x)),
           as.integer(ncol(x)),
           integer(1),
           PACKAGE = "mclust")[[2]]
}
assignInNamespace("unchol", unchol, ns = "mclust")
# fixInNamespace(unchol, pos = "package:mclust")
mclust:::unchol

## Again, seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))) :
#   optimal number of clusters occurs at max choice

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
# log.likelihood    n df      BIC       ICL
#      -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))) :
#   optimal number of clusters occurs at max choice



## Check R 2.15.3 from https://cran.r-project.org/bin/windows/base/old/2.15.3/
## Trued with fixing con <- gzcon(url("http://cran.rstudio.com/src/contrib/Meta/archive.rds", 'rb')), but compile...
devtools::install_version(package = 'mclust', version = 4.2)

Edit:

Fortran functions unchol (mclust 4.2) and uncholf (mclust 5.3) do not differ: uncholf 5.3, unchol 4.3

Mclust does differ, but provide same results, so I guess changes were simply fixing errors etc.: Mclust 5.3 , Mclust 4.3

Aesculapius answered 24/6, 2017 at 12:20 Comment(6)
Thanks for this! I've given Cody the Bounty seeing as he doesn't have enough points to effectively operate his SO account. Hope this is okay!Folkmoot
I was able to replicate this same issue. The interesting part here is really the difference between your third "Same result" example and your second "Different result" example, given that they're using the same data frame. Using the 1st, 3rd, 5th, 7th, 9th, and 11th variables in two orders produces the same solution, but the 1st, 2nd, 3rd, 4th, 5th, and 6th variables in two orders does not.Ninon
@Pi, yeap, it is fine. I should shed a bit of light to the question in the next 5-6 hours.Aesculapius
@Cody, it is the same data frame, but different columns. The header row really does not make the difference. To quickly sum up what I have found - it is all about mclust::me() function with model type 'EEV' (off the top of my head). Will see later!Aesculapius
@m-dz: Looking forward to what you've found. I have discovered that randomizing the row order of my data.frame and then passing it to Mclust also produces different solutions. Perhaps what you've found will shed some light on that.Ninon
Apologies, I got really busy at work and just could not fully wrap my head around this and transfer my findings into a neat piece of code... If you check the output of Mclust() there is an object called BIC with all BIC scores and they differed only for one of the models (EEV if I remember right). The function calculating them is mclust::me(), would be worth to dig a bit more into it.Aesculapius
P
1

I notice this is a very old thread but I think it's still worth to post an (official) answer. The issue was addressed in mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models from the R Journal with suggested solutions ([https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf][1]), Pages 305-307. In short, "A problem with the MBHAC approach may arise in the presence of coarse data, resulting from the discrete nature of the data or from continuous data that are rounded when measured. In this case, ties must be broken by choosing the pair of entities that will be merged. This is often done at random, but regardless of which method is adopted for breaking ties, this choice can have important consequences because it changes the clustering of the remaining observations. Moreover, the final EM solution may depend on the ordering of the variables."

Pepe answered 6/6, 2021 at 0:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.