Hierarchical (categorical) data to tree plot

Asked 28/3, 2020 at 17:14 Answered 29/3, 2020 at 15:11

r ggplot2 tree cluster-analysis taxonomy

Data
I have the following (simplified) dataset, we call df from now on:

                     species    rank                   value
1           Pseudomonas putida  family        Pseudomonadaceae
2       Pseudomonas aeruginosa  family        Pseudomonadaceae
3  Enterobacter xiangfangensis  family      Enterobacteriaceae
4          Salmonella enterica  family      Enterobacteriaceae
5        Klebsiella pneumoniae  family      Enterobacteriaceae
6           Pseudomonas putida   genus             Pseudomonas
7       Pseudomonas aeruginosa   genus             Pseudomonas
8  Enterobacter xiangfangensis   genus            Enterobacter
9          Salmonella enterica   genus              Salmonella
10       Klebsiella pneumoniae   genus              Klebsiella
11          Pseudomonas putida species      Pseudomonas putida
12      Pseudomonas aeruginosa species  Pseudomonas aeruginosa
13 Enterobacter xiangfangensis species Enterobacter hormaechei
14         Salmonella enterica species     Salmonella enterica
15       Klebsiella pneumoniae species   Klebsiella pneumoniae

What I want to achieve

This data is taxonomy data that shows the species classification, where the rank is in order of family > genus > species. Due to the hierarchical nature I want to show this as a tree, preferentially in ggplot2 like so:

What I tried
While I found a package, taxize written to convert this (actually the full classification - only partially shown here) to a tree, using class2tree:

class.dat <- classification(c("Pseudomonas putida", "Pseudomonas aeruginosa","Enterobacter xiangfangensis","Salmonella enterica","Klebsiella pneumoniae"), db = 'ncbi')
taxize::class2tree(class.dat)

This does not show the ranks like in my hand made graph, that I need in my visualization:

EDIT: dput of data

structure(list(species = c("Pseudomonas putida", "Pseudomonas putida", 
"Pseudomonas putida", "Pseudomonas aeruginosa", "Pseudomonas aeruginosa", 
"Pseudomonas aeruginosa", "Enterobacter xiangfangensis", "Enterobacter xiangfangensis", 
"Enterobacter xiangfangensis", "Salmonella enterica", "Salmonella enterica", 
"Salmonella enterica", "Klebsiella pneumoniae", "Klebsiella pneumoniae", 
"Klebsiella pneumoniae"), rank = c("family", "genus", "species", 
"family", "genus", "species", "family", "genus", "species", "family", 
"genus", "species", "family", "genus", "species"), value = c("Pseudomonadaceae", 
"Pseudomonas", "Pseudomonas putida", "Pseudomonadaceae", "Pseudomonas", 
"Pseudomonas aeruginosa", "Enterobacteriaceae", "Enterobacter", 
"Enterobacter hormaechei", "Enterobacteriaceae", "Salmonella", 
"Salmonella enterica", "Enterobacteriaceae", "Klebsiella", "Klebsiella pneumoniae"
)), row.names = c(NA, -15L), class = "data.frame", .Names = c("species", 
"rank", "value"))

EDIT: Response to @StupidWolf
I was able to convert the class.data to a dataframe and then into a parent-child dataframe to use it as input for the ggraph. The only thing left is having the xlabel, in this case the interest vector. However I'm not sure if that's possible in ggraph:

# Retreive data
class.dat <- classification(c("Pseudomonas putida", "Pseudomonas aeruginosa","Enterobacter xiangfangensis","Salmonella enterica","Klebsiella pneumoniae"), db = 'ncbi')

# Specify interest
interest <- c('superkingdom', 'phylum','class','order','genus','species')

# Convert to wide matrix
df2 <- bind_rows(class.dat, .id = "column_label") %>%
  dplyr::select(-id) %>% 
  filter(rank %in% interest) %>%
  spread(rank, name) %>%
  dplyr::select(-column_label) %>%
  dplyr::select(interest) %>% # we need the order
  as.matrix()

# Empty parent child matrix
parent.child <- matrix(nrow=0,ncol=2)

# Add data to parent child
for (i in 1:(ncol(df2)-1)){
  parent.child <- rbind(parent.child,df2[,c(i,i+1)])
}

# To dataframe and add colnmaes
parent.child <- as.data.frame(parent.child)
colnames(parent.child) <- c('from', 'to')

# Convert this to a ggraph
g <- graph_from_data_frame(parent.child)
ggraph(g,layout='dendrogram',circular=FALSE) + 
  geom_edge_link() + 
  geom_node_label(aes(label=names(V(g))),size=3,nudge_y=-0.1) + 
  scale_y_reverse(labels = interest)  + coord_flip() +
  theme_classic()

Proportioned answered 28/3, 2020 at 17:14 Comment(4)

can you add a dput of your data? – Primateship 28/3, 2020 at 17:51

Of course! See the edit @Primateship – Proportioned 28/3, 2020 at 18:15

have a look at this thread here: gastonsanchez.com/visually-enforced/how-to/2014/06/29/… and the herein mentioned ape package cran.r-project.org/web/packages/ape/ape.pdf. this might be of hlep – Foist 28/3, 2020 at 18:43

Thankyou for pointing that out @Tjebo, but as far as I can see those produce similar plots as I could produce using taxize and thereby not explicitly show at what rank a species deviates – Proportioned 28/3, 2020 at 18:47

Then we create a hierarchical bundling

d1 = data.frame(from="origin",to=c("Pseudomonadaceae","Enterobacteriaceae"))
d2 = data.frame(from=c("Pseudomonadaceae","Pseudomonadaceae","Enterobacteriaceae","Enterobacteriaceae","Enterobacteriaceae"),to=c("Pseudomonas","Pseudomonas","Enterobacter","Salmonella","Klebsiella"))
d3 = data.frame(from=c("Pseudomonas","Pseudomonas","Enterobacter","Salmonella","Klebsiella"),to=c("Pseudomonas putida","Pseudomonas aeruginosa","Enterobacter hormaechei","Salmonella enterica","Klebsiella pneumoniae"))

hierarchy <- rbind(d1, d2,d3)

vertices <- data.frame(name = unique(c(as.character(hierarchy$from), as.character(hierarchy$to))) )

Then we either plot them using igraph:

g <- graph_from_data_frame( hierarchy, vertices=vertices )
lay = layout.reingold.tilford(g) 
par(mar=c(0,0,0,0))
plot(g, layout=-lay[, 2:1],vertex.label.cex=0.7,
vertex.size=1,edge.arrow.size= 0.4)

Or something like this in ggraph:

library(ggraph)
ggraph(g,layout='dendrogram',circular=FALSE) + 
geom_edge_link() + 
geom_node_label(aes(label=names(V(g))),size=2,nudge_y=-0.1) + 
scale_y_reverse()  + coord_flip() + theme_void()

Mccourt answered 28/3, 2020 at 19:59 Comment(7)

Would suit my need if we could generate the variables (like d1, d2, and d3) generic rather than defining their structure by hand. I only gave an example dataset here but mine consists of 7 ranks and more species – Proportioned 28/3, 2020 at 21:17

I get the data from

class.dat <- classification(c("Pseudomonas putida", "Pseudomonas aeruginosa","Enterobacter xiangfangensis","Salmonella enterica","Klebsiella pneumoniae"), db = 'ncbi')

which already provides the dataframe. – Proportioned 28/3, 2020 at 21:49

Precisely my point. the output from classification(...) gives you a list, not a dataframe, and you did some conversion to get a data.frame. Do you mind trying to at least get a similar dataframe like d1,d2 and d3 from that? – Mccourt 28/3, 2020 at 22:4

for example, using your code above, if I do, t(sapply(class.dat,function(i)i[i$rank %in% c("family","genus"),"name"])), it's something similar to d2 right? – Mccourt 28/3, 2020 at 22:7

Found it! the default theme just hides it, so had to add geom_classic() and scale_y_reverse(labels = rev(interest) – Proportioned 29/3, 2020 at 11:1

Ok so if you find problems fitting the label in, one way is to use stringr::str_wrap(), so for example Vlab = str_wrap(names(V(g)),6) ; then geom_node_label(aes(label=Vlab),size=3,nudge_y=-0.1) – Mccourt 29/3, 2020 at 11:5

Thankyou! was just searching for that haha – Proportioned 29/3, 2020 at 11:12

Here's a graph based approach.

df = do.call(rbind, lapply(split(d, d$species), function(x){
    data.frame(rbind(c(x$value[match(c("family", "genus"), x$rank)], "root"),
                     c(x$value[match(c("genus", "species"), x$rank)], NA)),
               stringsAsFactors = FALSE)
}))
df = unique(df)
rownames(df) = NULL
df

library(igraph)

g = graph.data.frame(df, directed = FALSE)

plot(g, layout = layout_as_tree(g, root = which(V(g)$name %in% sort(unique(df[,1][df[,3] == "root"])))))

and ggplot

d2 = d %>%
    spread(rank, value) %>%
    arrange(family, genus, species) %>%
    mutate(species = sapply(strsplit(species, " "), "[", 2),
           y3 = row_number(),
           grp = row_number(),
           y2 = ave(y3, genus, FUN = function(x) mean(x)),
           y1 = ave(y2, family, FUN = function(x) mean(x))) %>%
    gather(key, y, -family, -genus, -species, -grp) %>%
    mutate(x = as.numeric(factor(key, c("y1", "y2", "y3"))),
           lbl = case_when(
               key == "y1" ~ family,
               key == "y2" ~ genus,
               key == "y3" ~ species,
               TRUE ~ NA_character_)) %>%
    arrange(x, y)

graphics.off()
ggplot(d2, aes(x, y, group = grp, label = lbl)) +
    geom_point(size = 2, shape = 21) +
    geom_line() +
    geom_text(hjust = "inward", vjust = "inward")

Seidler answered 28/3, 2020 at 19:9 Comment(1)

Is it possible to add an orgin here and the xl-lables as in my question example? besides this I like the results, although I would prefer something more generic instead of having to hardcode things like "y1" ~ family as my original data consists of 7 ranks rather than 3 – Proportioned 28/3, 2020 at 21:13

Great source for Phylogenetic trees with R by Prof.Guangchuang Yu:

https://yulab-smu.top/treedata-book/index.html

Heres my solution using ggtree:

# Packages :

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("ggtree")

library(ggplot2)
library(ggtree)
library(treeio)
library(ape)
library(tidytree)

its better to use a dedicated Format when computing phylogenies (for example NEXUS)


# New Hampshire eXtended format :

treetext="(((P.putida:1[&&NHX:S=S],P.aerufgiosa:1[&&NHX:S=S:B=])
:1.3[&&NHX:D=Pseudomonas:S=G]):1[&&NHX:D=Pseudonadaceae:S=F],
((K.pneumoniae:1[&&NHX:]):1.3[&&NHX:D=Klebsiella],(S.enterica:1[&&NHX:])
:1.3[&&NHX:D=Salmonella],(E.xiangfangensis:1[&&NHX:]):1.3[&&NHX:D=Enterobacter])
:1[&&NHX:D=Enterobacteriaceae])
:1[&&NHX:D=Gammaproteobacteria];"


tree <- read.nhx(textConnection(treetext))

# Plot Stuff

d <- data.frame(.panel = c('Tree','Tree','Tree','Tree'), 
                lab = c("Class","Family" ,"Genus", "Species"), 
                x=c(0,1,2,3), y=-2)

p<-ggtree(tree) + geom_tiplab(offset = F) + 
  geom_label(aes(x=branch, label=S), fill='lightgreen') + 
  geom_label(aes(label=D), fill='lightblue') + coord_cartesian(clip = 'off') + 
  theme_tree2(plot.margin=margin(3, 3, 3, 3 ,"cm"), axis.ticks = element_blank(), axis.text.x = element_blank())

p+geom_text(aes(label=lab), data=d)

Camara answered 29/3, 2020 at 15:11 Comment(0)

A solution with ggplot2:

# library
library(taxize)
library(ape)
library(ggdendro)
library(DECIPHER)
library(ggplot2)


# get data
class.dat <- classification(c("Pseudomonas putida", "Pseudomonas aeruginosa","Enterobacter xiangfangensis","Salmonella enterica","Klebsiella pneumoniae"), db = 'ncbi')

#make tree
taxize::class2tree(class.dat, varstep=FALSE,check=TRUE) -> tree

#adjust length
tree$phylo <- compute.brlen(tree$phylo, 10)

#convert tree to Dendrogram
ape::write.tree(tree$phylo, file = "./data/test", append = FALSE,
           digits = 10, tree.names = FALSE)
dend <- DECIPHER::ReadDendrogram("./data/test")

#get data from the dendrogram
dend_data <- dendro_data(dend, type = "rectangle")

# plot it with ggplot2
ggplot() + 
  geom_segment(data=segment(dend_data), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=dend_data$labels, aes(x=x, y=y, label=label, hjust=0), size=3) +
  coord_flip() + 
  scale_y_reverse(limits=c(20,-12),expand=c(0.1,1),breaks=c(20,10,0), labels=c("Family","Genus","Species")) +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.x=element_blank())

Primateship answered 28/3, 2020 at 23:44 Comment(1)

Thankyou @ava, I lik this solution but here is seems that they share the rank family which they do not (see my example tree), also I provided a simple example of my data and in reality I use almost all ranks: Kingdom, phylum, class, order, family, genus, species – Proportioned 29/3, 2020 at 9:25

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags