Identifying points by color
Asked Answered
C

3

1

I am following the tutorial over here : https://www.rpubs.com/loveb/som . This tutorial shows how to use the Kohonen Network (also called SOM, a type of machine learning algorithm) on the iris data.

I ran this code from the tutorial:

library(kohonen) #fitting SOMs
library(ggplot2) #plots
library(GGally) #plots
library(RColorBrewer) #colors, using predefined palettes

iris_complete <-iris[complete.cases(iris),] 
iris_unique <- unique(iris_complete) # Remove duplicates

#scale data
iris.sc = scale(iris_unique[, 1:4]) #Levels/Factors cannot be scaled... But used in predictive SOM:s using xyf. Later.

#build grid
iris.grid = somgrid(xdim = 10, ydim=10, topo="hexagonal", toroidal = TRUE)

set.seed(33) #for reproducability
iris.som <- som(iris.sc, grid=iris.grid, rlen=700, alpha=c(0.05,0.01), keep.data = TRUE)

#plot 1
plot(iris.som, type="count")

#plot2
var <- 1 #define the variable to plot
plot(iris.som, type = "property", property = getCodes(iris.som)[,var], main=colnames(getCodes(iris.som))[var], palette.name=terrain.colors)

The above code fits a Kohonen Network on the iris data. Each observation from the data set is assigned to each one of the "colorful circles" (also called "neurons") in the below pictures.

My question: In these plots, how would you identify which observations were assigned to which circles? Suppose I wanted to know which observations belong in the circles outlined in with the black triangles below:

enter image description here enter image description here

Is it possible to do this? Right now, I am trying to use iris.som$classif to somehow trace which points are in which circle. Is there a better way to do this?

UPDATE: @Jonny Phelps showed me how to identify observations within a triangular form (see answer below). But i am still not sure if it possible to identify irregular shaped forms. E.g. enter image description here

In a previous post (Labelling Points on a Plot (R Language)), a user showed me how to assign arbitrary numbers to each circle on the grid:

enter image description here

Based on the above plot, how could you use the "som$classif" statement to find out which observations were in circles 92,91,82,81,72 and 71?

Thanks

Capriccioso answered 23/1, 2021 at 21:0 Comment(0)
D
2

EDIT: Now with Shiny App!

A plotly solution is also possible, where you can mouse over individual neurons to display the associated iris rownames (called id here). Based on your iris.som data and Jonny Phelps' grid approach, you can just assign the row numbers as concatenated strings to the individual neurons and have these shown upon mouseover:

library(ggplot2)
library(plotly)
ga <- data.frame(g=iris.som$unit.classif, 
                 sample=seq_len(dim(iris.som$data[[1]])[1]))
grid_pts <- as.data.frame(iris.som$grid$pts)
grid_pts$column <- rep(1:iris.som$grid$xdim, by=iris.som$grid$ydim)
grid_pts$row <- rep(1:iris.som$grid$ydim, each=iris.som$grid$xdim)
grid_pts$classif <- 1:nrow(grid_pts)
grid_pts$id <- sapply(seq_along(grid_pts$classif), 
                      function(x) paste(ga$sample[ga$g==x], collapse=", "))
grid_pts$count <- sapply(seq_along(grid_pts$classif), 
                         function(x) length(ga$sample[ga$g==x]))
grid_pts$count <- factor(grid_pts$count, levels=0:max(grid_pts$count))
p1 <- ggplot(grid_pts, aes(x=x, y=y, colour=count, row=row, column=column, id=id)) +
    geom_point(size=8) +
    scale_colour_manual(values=c("grey50", heat.colors(length(unique(grid_pts$count))))) +
    theme_void() +
    theme(plot.margin=unit(c(1,rep(.3, 3)),"cm"))
ggplotly(p1)

Here is a full Shiny app that allows lasso selection and shows a table with the data:

invisible(suppressPackageStartupMessages(
    lapply(c("shiny","dplyr","ggplot2", "plotly", "kohonen", "GGally", "DT"),
           require, character.only=TRUE)))

iris_complete <- iris[complete.cases(iris),] 
iris_unique <- unique(iris_complete) # Remove duplicates

#scale data
iris.sc = scale(iris_unique[, 1:4]) #Levels/Factors cannot be scaled... But used in predictive SOM:s using xyf. Later.

#build grid
iris.grid = somgrid(xdim = 10, ydim=10, topo="hexagonal", toroidal = TRUE)

set.seed(33) #for reproducability
iris.som <- som(iris.sc, grid=iris.grid, rlen=700, alpha=c(0.05,0.01), keep.data = TRUE)

ga <- data.frame(g=iris.som$unit.classif, 
                 sample=seq_len(dim(iris.som$data[[1]])[1]))
grid_pts <- as.data.frame(iris.som$grid$pts)
grid_pts$column <- rep(1:iris.som$grid$xdim, by=iris.som$grid$ydim)
grid_pts$row <- rep(1:iris.som$grid$ydim, each=iris.som$grid$xdim)
grid_pts$classif <- 1:nrow(grid_pts)
grid_pts$id <- sapply(seq_along(grid_pts$classif), 
                      function(x) paste(ga$sample[ga$g==x], collapse=", "))
grid_pts$count <- sapply(seq_along(grid_pts$classif), 
                         function(x) length(ga$sample[ga$g==x]))
grid_pts$count <- factor(grid_pts$count, levels=0:max(grid_pts$count))

# Shiny app, adapted from https://gist.github.com/dgrapov/128e3be71965bf00495768e47f0428b9

ui <- fluidPage(
    fluidRow(
        column(12, plotlyOutput("plot", height = "600px")),
        column(12, DT::dataTableOutput('data_table'))
    )
)


server <- function(input, output){
    
    output$plot <- renderPlotly({
        req(data()) 
        p <- ggplot(data = data()$data, 
            aes(x=x, y=y, classif=classif, colour=count, row=row, column=column, id=id)) +
            geom_point(size=8) +
            scale_colour_manual(
                values=c("grey50", heat.colors(length(unique(grid_pts$count))))
            ) +
            theme_void() +
            theme(plot.margin=unit(c(1, rep(.3, 3)), "cm"))
        
        obj <- data()$sel
        if(nrow(obj) != 0) {
            p <- p + geom_point(data=obj, mapping=aes(x=x, y=y, classif=classif, 
                    count=count, row=row, column=column, id=id), color="blue", 
                    size=5, inherit.aes=FALSE)
        }
        ggplotly(p, source="p1") %>% layout(dragmode = "lasso")
    })
   
    selected <- reactive({
        event_data("plotly_selected", source = "p1")
    })

    output$data_table <- DT::renderDataTable(
        data()$sel, filter='top', options=list(  
            pageLength=5, autoWidth=TRUE
        )
    )
    
    data <- reactive({
        tmp <- grid_pts 
        sel <- tryCatch(filter(grid_pts, paste(x, y, sep="_") %in% 
                paste(selected()$x, selected()$y, sep="_")),
            error=function(e){NULL})
        list(data=tmp, sel=sel)
    })
}  

shinyApp(ui,server)

Decennary answered 1/2, 2021 at 22:50 Comment(0)
F
1

From what I can see, using iris.som$unit.classif & iris.som$grid is the way to go in isolating circles within the plotting grid. I have made an assumption that the classifier value matches the row index of iris.som$grid so this will need some more validation. Let me know if this helps your problem :)

findTriangle <- function(top_row, top_column, side_length, iris.som,
                         reverse=FALSE){
  
  # top_row: row index of the top most triangle value
  # top_column: column index...
  # side_length: how many rows does the triangle occupy?
  # iris.som: the som object
  # reverse: set to TRUE to flip the triangle
  
  # make the grid
  grid_pts <- as.data.frame(iris.som$grid$pts)
  grid_pts$column <-  rep(1:iris.som$grid$xdim, by=iris.som$grid$ydim)
  grid_pts$row <- rep(1:iris.som$grid$ydim, each=iris.som$grid$xdim)
  grid_pts$classif <- 1:nrow(grid_pts)
  
  # starting point - top most point of the triangle
  # use reverse for triangles the other way around
  grid_pts$triangle <- FALSE
  grid_pts[grid_pts$column == top_column & grid_pts$row == top_row, ][["triangle"]] <- TRUE
  
  # loop through the remaining rows and fill out the triangle
  value_row <- top_row
  value_start_column <- grid_pts[grid_pts$triangle == TRUE,]$x
  value_end_column <- grid_pts[grid_pts$triangle == TRUE,]$x
  if(reverse){
    row_move <- -1
  }else{
    row_move <- 1
  }
  
  # update triangle
  for(row in 1:(side_length-1)){
    value_row <- value_row + row_move
    value_start_column <- value_start_column - 0.5
    value_end_column <- value_end_column + 0.5
    grid_pts[grid_pts$row == value_row & 
               grid_pts$x >= value_start_column & 
               grid_pts$x <= value_end_column, ]$triangle <- TRUE
  }

  # visualise
  pl <- ggplot(grid_pts, aes(x=x, y=rev(row), col=as.factor(triangle))) + 
    geom_point(size=7) + 
    scale_color_manual(values=c("grey", "indianred")) + 
    theme_void()
  print(pl)
  
  return(grid_pts)
}

# take the grid and pick out the triangle
top_row <- 2
top_column <- 6
side_length <- 4
reverse <- FALSE # set to TRUE to flip the triangle ie go from the bottom
grid_pts <- findTriangle(top_row, top_column, side_length, iris.som, reverse)

# now add the classifier and merge to get the co-ordinates
iris.sc2 <- as.data.frame(iris.sc)
iris.sc2$classif <- iris.som$unit.classif
iris.sc2 <- merge(iris.sc2, grid_pts, by=c("classif"), all.x=TRUE)

# filter to the points in the triangle
iris.sc2[iris.sc2$triangle==TRUE,]

Output data:

   classif Sepal.Length Sepal.Width Petal.Length Petal.Width   x        y column row triangle
21      16  -1.01537328   0.5506423   -1.3287735  -1.3042249 6.0 1.732051      6   2     TRUE
22      16  -1.01537328   0.3214643   -1.4419091  -1.3042249 6.0 1.732051      6   2     TRUE
39      25  -0.89501479   1.0089981   -1.3287735  -1.3042249 5.5 2.598076      5   3     TRUE
40      25  -0.77465630   1.0089981   -1.2722057  -1.3042249 5.5 2.598076      5   3     TRUE
41      25  -0.77465630   0.7798202   -1.3287735  -1.3042249 5.5 2.598076      5   3     TRUE
42      25  -1.01537328   0.7798202   -1.2722057  -1.3042249 5.5 2.598076      5   3     TRUE
43      25  -0.89501479   0.7798202   -1.2722057  -1.3042249 5.5 2.598076      5   3     TRUE
44      26  -0.89501479   0.5506423   -1.1590702  -0.9108454 6.5 2.598076      6   3     TRUE
45      26  -1.01537328   0.7798202   -1.2156380  -1.0419719 6.5 2.598076      6   3     TRUE
58      36  -0.53393933   0.7798202   -1.2722057  -1.0419719 6.0 3.464102      6   4     TRUE
59      36  -0.41358084   1.0089981   -1.3853413  -1.3042249 6.0 3.464102      6   4     TRUE
60      36  -0.53393933   0.7798202   -1.1590702  -1.3042249 6.0 3.464102      6   4     TRUE
61      37  -1.01537328   1.0089981   -1.2156380  -0.7797188 7.0 3.464102      7   4     TRUE
62      37  -1.01537328   1.0089981   -1.3853413  -1.1730984 7.0 3.464102      7   4     TRUE
63      37  -0.89501479   1.0089981   -1.3287735  -1.1730984 7.0 3.464102      7   4     TRUE
74      44   0.06785311   0.3214643    0.5945312   0.7937995 4.5 4.330127      4   5     TRUE
75      46  -0.65429782   1.4673539   -1.2722057  -1.3042249 6.5 4.330127      6   5     TRUE
76      46  -0.53393933   1.4673539   -1.2722057  -1.3042249 6.5 4.330127      6   5     TRUE
77      47  -0.89501479   1.6965319   -1.0459346  -1.0419719 7.5 4.330127      7   5     TRUE
78      47  -0.89501479   1.6965319   -1.2156380  -1.3042249 7.5 4.330127      7   5     TRUE
79      47  -0.89501479   1.4673539   -1.2722057  -1.0419719 7.5 4.330127      7   5     TRUE
80      47  -0.89501479   1.6965319   -1.2722057  -1.1730984 7.5 4.330127      7   5     TRUE

Validation plotting on the grid: Marked output on the grid

Floppy answered 26/1, 2021 at 17:11 Comment(7)
Thank you so much for your answer! This is incredible - i never knew this problem required such a detailed solution! The only thing I have in mind : what happens if you want to select an irregular shape on the grid? I was trying to use plotly::ggplotly(plot(iris.som, type="count")) to make the plot interactive. From here, i thought the user could select whichever the circles on the grid they wanted, and find out which observations were in those circles. I can not thank you enough for your help!Capriccioso
I had posted a similar question over here: #65798982 . Here, someone showed me how to arbitrarily label the circles on the grid. I was trying to continue the same logic: suppose i wanted to know which observations (according to the way they have been identified by the previous user who answered my question) were in circles "87 86 85 84" ... is this possible? "Classif" tells you which observations are in a given circle... but you have no idea of knowing where that circle is on the grid, and which circles are neighboring itCapriccioso
If only there was a way of combining both of your answers! :)Capriccioso
You need to make a clustering on the "codes" and then see which neuron in map$unit.classif to which cluster belongs to. See my post for details.Downright
It won't be enough to use plotly on its own. Building a shiny app may be possible to select points and populate the table eg by following #57128622 but it's a lot more advanced. I'm not sure I understand "you have no idea of knowing where that circle is on the grid". In my function I attach the columns x & y, the coordinates on the grid and column, row, these converted to indexes. I may have got these the wrong way round as in the other article 1 is at the bottomFloppy
@JonnyPhelps: thank you for your reply! If you look at my question, i updated it: please look at the last picture i uploaded (4th picture). Suppose i want to know which observations are in circles 92,91, 82, 81, 72, 71. Is this possible?Capriccioso
Yes, it's simply iris.sc2[iris.sc2$classif %in% c(92,91,82,81,72,71),]Floppy
D
1

I elaborated the example in my post, however, not on the iris data set but I suppose it is no problem: R, SOM, Kohonen Package, Outlier Detection and also added code snippets you might need. They show

  1. How to generate data, add outliers and depict them on plots
  2. How to train the SOM
  3. How to do the clustering
  4. How to use hierarchic clustering to add the cluster boundaries to the SOM plots
  5. Finally, I added the clusters predicted by SOM to compare them with the real clusters in which I generated the data

I think this answers your questions. It would also be nice to compare the performance of SOM with t-SNE. I have only used SOM as an experiment on the data I generated and on the real wine data set. It would also be nice to prepare heat maps if you have more than 2 variables. All the best to you analysis!

Downright answered 26/1, 2021 at 18:5 Comment(1)
Thank you for your answer Tamas! This is a lot of good information! I am still trying to figure out how to identify observations belong in which neurons on the grid. Do you know how to do this?Capriccioso

© 2022 - 2024 — McMap. All rights reserved.