Creating a Sankey Diagram using NetworkD3 package in R

Asked 23/5, 2017 at 10:36 Answered 8/9, 2018 at 16:21

r plot sankey-diagram htmlwidgets networkd3

Currently I am trying to create an interactive Sankey diagram with the networkD3 package, following the instructions by Christopher Gandrud (https://christophergandrud.github.io/networkD3/).
What I don't understand is the table format, since he uses just two columns to visualise multiple transitions. To be more specific, I have a dataset containing four columns which represent four years. Inside these columns are different hotel names, and each row represents one customer, who is "tracked" over these four years.

    library(networkD3)

    URL <- paste0(
        "https://cdn.rawgit.com/christophergandrud/networkD3/",
        "master/JSONdata/energy.json")
    Energy <- jsonlite::fromJSON(URL)

    sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
         Target = "target", Value = "value", NodeID = "name",
         units = "TWh", fontSize = 12, nodeWidth = 30)

To give you an overview of my data here is a screenshot:

SampleDataScreenshot

I would give you more "coded" information, but since I am very new to R I hope you can follow my train of thought on this problem. If not, please do not hesitate to ask.

Thank you :)

Spitfire asked 23/5, 2017 at 10:36 Comment(1)

please make a minimal reproducible example so that it's easier to help you – Coldiron 23/5, 2017 at 11:25

You need two data frames: one listing all nodes (containing the names) and one listing the links. The latter contains three columns: the source node, the target node, and a value indicating the strength (width) of the link. In the links data frame you refer to each node by its zero-based position in the nodes data frame.
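
As a rough illustration of that layout, here is a minimal sketch with made-up data (the node names A, B, C and the values are hypothetical, not taken from the question). It only shows how the zero-based indices in the links data frame point at rows of the nodes data frame:

```r
# Hypothetical toy data: three nodes and two links, for illustration only
nodes <- data.frame(name = c("A", "B", "C"), stringsAsFactors = FALSE)

# source/target are zero-based row positions in `nodes`:
# 0 = "A", 1 = "B", 2 = "C"
links <- data.frame(
  source = c(0, 0),  # both links start at "A"
  target = c(1, 2),  # ...and end at "B" and "C"
  value  = c(10, 5)  # link widths
)

# Translate the indices back to names (add 1 because R itself is one-based)
readable <- data.frame(
  from  = nodes$name[links$source + 1],
  to    = nodes$name[links$target + 1],
  value = links$value,
  stringsAsFactors = FALSE
)
print(readable)
```

These two data frames have the shape that sankeyNetwork() expects for its Links and Nodes arguments.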

Assuming your data looks like:

df <- data.frame(Year1=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year2=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year3=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year4=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 stringsAsFactors = FALSE)

For the diagram you need to differentiate not only between the hotels but between the hotel/year combination since each of them should be one node:

df$Year1 <- paste0("Year1_", df$Year1)
df$Year2 <- paste0("Year2_", df$Year2)
df$Year3 <- paste0("Year3_", df$Year3)
df$Year4 <- paste0("Year4_", df$Year4)

The links are the "transitions" between the hotels from one year to the next:

library(dplyr)
trans1_2 <- df %>% group_by(Year1, Year2) %>% summarise(sum=n())
trans2_3 <- df %>% group_by(Year2, Year3) %>% summarise(sum=n())
trans3_4 <- df %>% group_by(Year3, Year4) %>% summarise(sum=n())

colnames(trans1_2)[1:2] <- colnames(trans2_3)[1:2] <- colnames(trans3_4)[1:2] <- c("source","target")

links <- rbind(as.data.frame(trans1_2), 
               as.data.frame(trans2_3), 
               as.data.frame(trans3_4))

Finally, the links need to refer to the nodes by their zero-based position in the nodes data frame:

nodes <- data.frame(name=unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

Then the diagram can be drawn:

library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "sum", NodeID = "name",
              fontSize = 12, nodeWidth = 30)

There might be more elegant solutions, but this could be a starting point for your problem. If you don't like the "Year..." in the nodes' names, you can remove it after setting up the data frames.
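
One possible way to drop the prefix, sketched under the assumption that nodes and links were built as above: keep nodes$name unchanged (it was used for the matching) and put the cleaned text into a separate display column. The column name label and the small stand-in data below are only for illustration:

```r
# Stand-in for the `nodes` and `links` built above, so the snippet runs on its own
nodes <- data.frame(name = c("Year1_Hotel1", "Year2_Hotel1", "Year2_Hotel2"),
                    stringsAsFactors = FALSE)
links <- data.frame(source = c(0, 0), target = c(1, 2), sum = c(3, 7))

# Strip the "YearN_" prefix into a separate column used only for display;
# as.character() makes this work whether `name` is a factor or a character vector
nodes$label <- sub("^Year[0-9]+_", "", as.character(nodes$name))

# Draw the diagram only if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  p <- networkD3::sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
                                Target = "target", Value = "sum", NodeID = "label",
                                fontSize = 12, nodeWidth = 30)
}
```

The links still point at row positions, so changing the displayed text does not affect which nodes are connected.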

Highchair answered 26/5, 2017 at 20:19 Comment(5)

Thanks for your reply, it was really useful for me in trying to visualize user journeys. I'm struggling to remove the Year (StepX in my case) after the data frames are created and how to visualize the empty cases (when users exit the website). Do you have any suggestions? – Hydrophobic 19/9, 2017 at 14:32

you can just modify the levels of nodes$name, e.g. – Highchair 20/9, 2017 at 9:4

you can just modify nodes$name, e.g. by nodes$name <- sub("Year[1-4]_", "", nodes$name). For the users who left, you could add a new node at each step that grows from step to step. By using the LinkGroup parameter you can distinguish the links to the "leaving node" from the other links – Highchair 20/9, 2017 at 9:12

Hey, If I wish to implement onClick on clicking these lines, can you help me with this problem here. – Aeromarine 11/10, 2017 at 11:19

Unfortunately not. Maybe that is worth a new question. – Highchair 12/10, 2017 at 9:0

This question comes up a lot... how to convert a dataset that has multiple links/edges defined on each row across several columns. Here's how I convert that into the type of dataset that sankeyNetwork (and many other packages that deal with edges/links/network data) uses... a dataset with one edge/link per row.

Starting with an example dataset...

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
name  year1           year2         year3           year4
Bob   Hilton          Sheraton      Westin          Hyatt
John  "Four Seasons"  Ritz-Carlton  Westin          Sheraton
Tom   Ritz-Carlton    Westin        Sheraton        Hyatt
Mary  Westin          Sheraton      "Four Seasons"  Ritz-Carlton
Sue   Hyatt           Ritz-Carlton  Hilton          Sheraton
Barb  Hilton          Sheraton      Ritz-Carlton    "Four Seasons"
')
    
#   name        year1        year2        year3        year4
# 1  Bob       Hilton     Sheraton       Westin        Hyatt
# 2 John Four Seasons Ritz-Carlton       Westin     Sheraton
# 3  Tom Ritz-Carlton       Westin     Sheraton        Hyatt
# 4 Mary       Westin     Sheraton Four Seasons Ritz-Carlton
# 5  Sue        Hyatt Ritz-Carlton       Hilton     Sheraton
# 6 Barb       Hilton     Sheraton Ritz-Carlton Four Seasons

1. Create a row number so that you'll still be able to determine which row/observation each individual link came from when you convert the data to long format.
2. Use tidyr's pivot_longer() function to convert the dataset to long format.
3. Convert the column name variable to the index/number of the column in the original dataset.
4. Grouped by row (each observation in the original dataset), create a variable for each source node's "target" by setting it to the node following it in the next column.
5. Filter out any rows that have NA for "target" (nodes in the last column of the original dataset will not have a "target", and therefore those rows do not specify a link).

library(dplyr)
library(tidyr)

links <-
  df %>%
  mutate(row = row_number()) %>%  # add a row id
  pivot_longer(-row, names_to = "column", values_to = "source") %>%  # gather all columns
  mutate(column = match(column, names(df))) %>%  # convert col names to col ids
  group_by(row) %>%
  mutate(target = lead(source, order_by = column)) %>%  # get target from following node in row
  ungroup() %>% 
  filter(!is.na(target))  # remove links from last column in original data

# # A tibble: 24 x 4
#      row column source       target      
#    <int>  <int> <chr>        <chr>       
#  1     1      1 Bob          Hilton      
#  2     1      2 Hilton       Sheraton    
#  3     1      3 Sheraton     Westin      
#  4     1      4 Westin       Hyatt       
#  5     2      1 John         Four Seasons
#  6     2      2 Four Seasons Ritz-Carlton
#  7     2      3 Ritz-Carlton Westin      
#  8     2      4 Westin       Sheraton    
#  9     3      1 Tom          Ritz-Carlton
# 10     3      2 Ritz-Carlton Westin      
# # … with 14 more rows

Now the data is in the typical network data format of one link per row, defined by "source" and "target" columns, and it could already be used with sankeyNetwork(). However, you will likely want nodes that refer to the same thing to appear multiple times within your plot: if someone visited the Hilton in year 1 and then visited the Hilton again in year 3, you will probably want 2 separate nodes, both named Hilton, but appearing in different parts of the plot. In order to do that, you will have to identify each node in your "source" and "target" columns with the year in which it was visited. That's where keeping the "row" and "column" variables around comes in handy.

Append the column index to the "source" name, and append the column index + 1 to the "target" name, and now you will be able to distinguish, for instance, between the node for Hilton which was visited in year 1 and the node for Hilton that was visited in year 3.

links <-
  links %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(source, target)

# # A tibble: 24 x 2
#    source         target        
#    <chr>          <chr>         
#  1 Bob_1          Hilton_2      
#  2 Hilton_2       Sheraton_3    
#  3 Sheraton_3     Westin_4      
#  4 Westin_4       Hyatt_5       
#  5 John_1         Four Seasons_2
#  6 Four Seasons_2 Ritz-Carlton_3
#  7 Ritz-Carlton_3 Westin_4      
#  8 Westin_4       Sheraton_5    
#  9 Tom_1          Ritz-Carlton_2
# 10 Ritz-Carlton_2 Westin_3      
# # … with 14 more rows

Now you can follow the rather standard procedure for using a source-target list of links to build the necessary data frames for sankeyNetwork().

Create a nodes data frame with all the unique nodes found in the "source" and "target" vectors. You can also create a label vector in the nodes data frame that does not include the year/column id suffix.

nodes <- data.frame(name = unique(c(links$source, links$target)))
nodes$label <- sub('_[0-9]*$', '', nodes$name) # remove column id from node label

# # A tibble: 23 x 2
#    name           label       
#    <chr>          <chr>       
#  1 Bob_1          Bob         
#  2 Hilton_2       Hilton      
#  3 Sheraton_3     Sheraton    
#  4 Westin_4       Westin      
#  5 John_1         John        
#  6 Four Seasons_2 Four Seasons
#  7 Ritz-Carlton_3 Ritz-Carlton
#  8 Tom_1          Tom         
#  9 Ritz-Carlton_2 Ritz-Carlton
# 10 Westin_3       Westin      
# # … with 13 more rows

Convert the "source" and "target" vectors in the links data frame to be the 0-based-index of the node in the nodes data frame. Add an arbitrary value for each link in the links data frame since it's required by sankeyNetwork(). Then plot it with sankeyNetwork()!

links$source_id <- match(links$source, nodes$name) - 1
links$target_id <- match(links$target, nodes$name) - 1
links$value <- 1

library(networkD3)

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source_id',
              Target = 'target_id', Value = 'value', NodeID = 'label')
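
One optional refinement, not part of the approach above: with value = 1 per row, two people who make the same move (e.g. Bob and Barb both go from Hilton to Sheraton) produce two identical links. If you would rather have a single link whose width is the number of people, you could sum the values per source/target pair before plotting. A minimal sketch with a hypothetical stand-in for links, using base R's aggregate():

```r
# Hypothetical stand-in for `links`: the first two rows describe the same move
links <- data.frame(source_id = c(0, 0, 1),
                    target_id = c(1, 1, 2),
                    value     = 1)

# Sum `value` for each distinct source/target pair
links_agg <- aggregate(value ~ source_id + target_id, data = links, FUN = sum)
print(links_agg)
```

links_agg can then be passed as Links instead of links. Whether that is preferable depends on whether you need to keep individual rows distinguishable in the plot.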

Coldiron answered 8/9, 2018 at 16:21 Comment(0)
