Edit:
My previous edit about read.file treating the first row as a header is correct, but that is not what causes the difference. Apparently columns 1 to 6, regardless of whether they are called V1, V2, V3, V4, V5, V6 or X1, X3, X5, X7, X9, X11, do give different results depending on their order. I will investigate further slightly later.
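To make the naming issue concrete, here is a minimal sketch using base read.table on a toy string. It stands in for read.file, and the assumption is that read.file consumes the first row as a header by default, as the comment in the code below reports. The toy values are made up; they only mimic a numeric file whose first row starts 1 3 5.

```r
# Toy stand-in for the first rows of a purely numeric, space-separated file.
# (Made-up values, not the real chess.dat.)
raw <- "1 3 5\n2 4 6\n7 8 9"

with_header    <- read.table(text = raw, header = TRUE)
without_header <- read.table(text = raw, header = FALSE)

names(with_header)     # values of the first row, made syntactic: X1, X3, X5
names(without_header)  # default positional names: V1, V2, V3
nrow(with_header)      # one data row fewer than without_header
nrow(without_header)
```

If the same thing happens with the real file, X1, X3, X5, ... are simply the first, second, third, ... columns, which would be consistent with them behaving like V1 to V6 rather than like V1, V3, V5, ...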
library(mclust)
library(psych)
library(magrittr)
# sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252
# [2] LC_CTYPE=English_United Kingdom.1252
# [3] LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United Kingdom.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods
# [7] base
#
# other attached packages:
# [1] magrittr_1.5 psych_1.7.5 mclust_5.3
#
# loaded via a namespace (and not attached):
# [1] compiler_3.4.0 parallel_3.4.0 tools_3.4.0
# [4] foreign_0.8-68 rstudioapi_0.6 mdaddins_0.0.0001
# [7] nlme_3.1-131 mnormt_1.5-5 grid_3.4.0
# [10] lattice_0.20-35
testData_rt <- read.table("http://fimi.ua.ac.be/data/chess.dat")
testData_rf <- read.file("http://fimi.ua.ac.be/data/chess.dat", header = FALSE) # Without header = FALSE, read.file treats the first row as a header and skips it
testData_rf_head <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rf_head %<>% set_names(names(testData_rf))
testData_rf_head_V2 <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rt %>% str()
testData_rf %>% str()
testData_rf_head %>% str()
# Same res.:
summary(Mclust(subset(testData_rt, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rt, select = c(V11, V9, V1, V3, V5, V7))))
# Same res.:
summary(Mclust(subset(testData_rf, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf, select = c(V11, V9, V1, V3, V5, V7))))
# Same res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf_head, select = c(V11, V9, V1, V3, V5, V7))))
# Different res.:
summary(Mclust(subset(testData_rf_head_V2, select = c(X1, X3, X5, X7, X9, X11))))
summary(Mclust(subset(testData_rf_head_V2, select = c(X11, X9, X1, X3, X5, X7))))
# Different res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V2, V3, V4, V5, V6))))
summary(Mclust(subset(testData_rf_head, select = c(V6, V5, V1, V2, V3, V4))))
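Rather than comparing the printed summaries by eye, the "same" versus "different" verdicts above could be checked programmatically. The helper below is a hypothetical sketch in base R (same_partition is not part of mclust): it tests whether two label vectors describe the same grouping even if the clusters are numbered differently.

```r
# Hypothetical helper (not part of mclust): TRUE when two label vectors
# describe the same partition, ignoring how the clusters are numbered.
same_partition <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)
  # Identical partitions leave exactly one non-empty cell in every row
  # and every column of the cross-tabulation.
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c(2, 2, 3, 3, 1, 1)  # same grouping, clusters renumbered
labels_c <- c(1, 2, 2, 2, 3, 3)  # one observation moved to another cluster

same_partition(labels_a, labels_b)
same_partition(labels_a, labels_c)
```

With two fitted models, say fit1 and fit2 (names assumed here), the call would be same_partition(fit1$classification, fit2$classification), alongside a comparison of fit1$modelName, fit1$G and fit1$bic with their fit2 counterparts. Both fits must come from data with the same number of rows.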
Old answer:
Have done my best to investigate the issue:
- Current R (3.4.0) and mclust (5.3) tested: order and seed had no effect;
- mclust 4.2 (current on Dec 5 '13 when the question was asked), the same, no effect;
- R 2.15.3 mentioned by @user3068797: could not compile mclust 4.2, gave up as it would take too long to debug this;
- @Cody did not provide a sessionInfo(), so do not know where to dig more.
To the code:
library(mclust)
sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# other attached packages:
# [1] mclust_5.3
testData <- read.table("http://fimi.ua.ac.be/data/chess.dat")
## Seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
## Question asked Dec 5 '13
## mclust 4.2 modified on 2013-07-19, 4.3 introduced on 2014-03-31
devtools::install_version(package = 'mclust', version = 4.2)
## Fix mclust:::unchol
# mclust:::unchol
unchol <- function(x, upper = NULL)
{
  if(is.null(upper)) {
    upper <- any(x[row(x) < col(x)])
    lower <- any(x[row(x) > col(x)])
    if(upper && lower)
      stop("not a triangular matrix")
    if(!(upper || lower)) {
      x <- diag(x)
      return(diag(x * x))
    }
  }
  dimx <- dim(x)
  storage.mode(x) <- "double"
  .Fortran("uncholf",
           as.logical(upper),
           x,
           as.integer(nrow(x)),
           as.integer(ncol(x)),
           integer(1),
           PACKAGE = "mclust")[[2]]
}
assignInNamespace("unchol", unchol, ns = "mclust")
# fixInNamespace(unchol, pos = "package:mclust")
mclust:::unchol
## Again, seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
#
# Warning messages:
# 1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
# best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))) :
# optimal number of clusters occurs at max choice
set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
#
# Warning messages:
# 1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
# best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))) :
# optimal number of clusters occurs at max choice
## Check R 2.15.3 from https://cran.r-project.org/bin/windows/base/old/2.15.3/
## Tried fixing con <- gzcon(url("http://cran.rstudio.com/src/contrib/Meta/archive.rds", 'rb')), but compile...
devtools::install_version(package = 'mclust', version = 4.2)
Edit:
Fortran functions unchol (mclust 4.2) and uncholf (mclust 5.3) do not differ: uncholf 5.3, unchol 4.3.
Mclust does differ, but provides the same results, so I guess the changes were simply fixing errors etc.: Mclust 5.3, Mclust 4.3.
mclust and ran the commands (see my gist), but both provided exactly the same clustering solutions. Initially I thought the comment by @Anony-Mousse made sense because of the random nature of the Gaussian model, but according to the documentation, Mclust computes the optimal model over various candidates, hence it must provide the same results (Mclust tries 9 different models). – Calliopsis
G=1 cluster to G=9 clusters. – Calliopsis
sessionInfo(): R version 3.3.2 (2016-10-31), Platform: x86_64-apple-darwin13.4.0 (64-bit), Running under: macOS Sierra 10.12.5, mclust_5.3 – Calliopsis