Hierarchical (categorical) data to tree plot

Data
I have the following (simplified) dataset, which we call df from now on:

                     species    rank                   value
1           Pseudomonas putida  family        Pseudomonadaceae
2       Pseudomonas aeruginosa  family        Pseudomonadaceae
3  Enterobacter xiangfangensis  family      Enterobacteriaceae
4          Salmonella enterica  family      Enterobacteriaceae
5        Klebsiella pneumoniae  family      Enterobacteriaceae
6           Pseudomonas putida   genus             Pseudomonas
7       Pseudomonas aeruginosa   genus             Pseudomonas
8  Enterobacter xiangfangensis   genus            Enterobacter
9          Salmonella enterica   genus              Salmonella
10       Klebsiella pneumoniae   genus              Klebsiella
11          Pseudomonas putida species      Pseudomonas putida
12      Pseudomonas aeruginosa species  Pseudomonas aeruginosa
13 Enterobacter xiangfangensis species Enterobacter hormaechei
14         Salmonella enterica species     Salmonella enterica
15       Klebsiella pneumoniae species   Klebsiella pneumoniae

What I want to achieve

This is taxonomy data that shows the classification of each species, where the ranks are ordered family > genus > species. Because the data is hierarchical, I want to show it as a tree, preferably in ggplot2, like so: enter image description here


What I tried
I found a package, taxize, that can convert this (actually the full classification, only partially shown here) to a tree using class2tree:

library(taxize)

class.dat <- classification(c("Pseudomonas putida", "Pseudomonas aeruginosa","Enterobacter xiangfangensis","Salmonella enterica","Klebsiella pneumoniae"), db = 'ncbi')
taxize::class2tree(class.dat)

However, this does not show the ranks the way my hand-made graph does, and I need them in my visualization:

enter image description here


EDIT: dput of data

structure(list(species = c("Pseudomonas putida", "Pseudomonas putida", 
"Pseudomonas putida", "Pseudomonas aeruginosa", "Pseudomonas aeruginosa", 
"Pseudomonas aeruginosa", "Enterobacter xiangfangensis", "Enterobacter xiangfangensis", 
"Enterobacter xiangfangensis", "Salmonella enterica", "Salmonella enterica", 
"Salmonella enterica", "Klebsiella pneumoniae", "Klebsiella pneumoniae", 
"Klebsiella pneumoniae"), rank = c("family", "genus", "species", 
"family", "genus", "species", "family", "genus", "species", "family", 
"genus", "species", "family", "genus", "species"), value = c("Pseudomonadaceae", 
"Pseudomonas", "Pseudomonas putida", "Pseudomonadaceae", "Pseudomonas", 
"Pseudomonas aeruginosa", "Enterobacteriaceae", "Enterobacter", 
"Enterobacter hormaechei", "Enterobacteriaceae", "Salmonella", 
"Salmonella enterica", "Enterobacteriaceae", "Klebsiella", "Klebsiella pneumoniae"
)), row.names = c(NA, -15L), class = "data.frame", .Names = c("species", 
"rank", "value"))

EDIT: Response to @StupidWolf
I was able to convert class.dat to a data frame and then into a parent-child data frame to use as input for ggraph. The only thing left is the x-axis labels, in this case the interest vector. However, I'm not sure whether that's possible in ggraph:

library(taxize)
library(dplyr)
library(tidyr)
library(igraph)
library(ggraph)

# Retrieve data
class.dat <- classification(c("Pseudomonas putida", "Pseudomonas aeruginosa","Enterobacter xiangfangensis","Salmonella enterica","Klebsiella pneumoniae"), db = 'ncbi')

# Specify interest
interest <- c('superkingdom', 'phylum','class','order','genus','species')

# Convert to wide matrix
df2 <- bind_rows(class.dat, .id = "column_label") %>%
  dplyr::select(-id) %>% 
  filter(rank %in% interest) %>%
  spread(rank, name) %>%
  dplyr::select(-column_label) %>%
  dplyr::select(all_of(interest)) %>% # we need the order
  as.matrix()

# Empty parent child matrix
parent.child <- matrix(nrow=0,ncol=2)

# Add data to parent child
for (i in 1:(ncol(df2)-1)){
  parent.child <- rbind(parent.child,df2[,c(i,i+1)])
}

# To dataframe and add colnames
parent.child <- as.data.frame(parent.child)
colnames(parent.child) <- c('from', 'to')

# Convert this to a ggraph
g <- graph_from_data_frame(parent.child)
ggraph(g,layout='dendrogram',circular=FALSE) + 
  geom_edge_link() + 
  geom_node_label(aes(label=names(V(g))),size=3,nudge_y=-0.1) + 
  scale_y_reverse(labels = rev(interest))  + coord_flip() +
  theme_classic()
Proportioned answered 28/3, 2020 at 17:14 Comment(4)
Can you add a dput of your data? - Primateship
Of course! See the edit, @Primateship. - Proportioned
Have a look at this thread: gastonsanchez.com/visually-enforced/how-to/2014/06/29/… and the ape package mentioned there: cran.r-project.org/web/packages/ape/ape.pdf. This might be of help. - Foist
Thank you for pointing that out, @Tjebo, but as far as I can see those produce plots similar to what I could produce using taxize, and so they do not explicitly show at what rank a species deviates. - Proportioned
M
5

First we build the hierarchy as a parent-child edge list:

d1 = data.frame(from="origin",to=c("Pseudomonadaceae","Enterobacteriaceae"))
d2 = data.frame(from=c("Pseudomonadaceae","Pseudomonadaceae","Enterobacteriaceae","Enterobacteriaceae","Enterobacteriaceae"),to=c("Pseudomonas","Pseudomonas","Enterobacter","Salmonella","Klebsiella"))
d3 = data.frame(from=c("Pseudomonas","Pseudomonas","Enterobacter","Salmonella","Klebsiella"),to=c("Pseudomonas putida","Pseudomonas aeruginosa","Enterobacter hormaechei","Salmonella enterica","Klebsiella pneumoniae"))

# unique() drops the repeated Pseudomonadaceae -> Pseudomonas edge
hierarchy <- unique(rbind(d1, d2, d3))

vertices <- data.frame(name = unique(c(as.character(hierarchy$from), as.character(hierarchy$to))) ) 
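The edge list above is typed by hand. As a rough sketch of how the same structure might be derived from the long-format data in the question, for any number of ranks, the base R snippet below walks each species' lineage and pairs every rank with the next one down. The helper name lineage_edges and the rank_order vector are illustrative assumptions, not part of any package, and the data is an abbreviated copy of the question's table.

```r
# Long-format data as in the question (species, rank, value); abbreviated here
df <- data.frame(
  species = rep(c("Pseudomonas putida", "Pseudomonas aeruginosa",
                  "Salmonella enterica"), each = 3),
  rank    = rep(c("family", "genus", "species"), times = 3),
  value   = c("Pseudomonadaceae", "Pseudomonas", "Pseudomonas putida",
              "Pseudomonadaceae", "Pseudomonas", "Pseudomonas aeruginosa",
              "Enterobacteriaceae", "Salmonella", "Salmonella enterica"),
  stringsAsFactors = FALSE
)

# Ranks from highest to lowest; extend this vector for deeper taxonomies
rank_order <- c("family", "genus", "species")

# Hypothetical helper: edges origin -> family -> genus -> species for one species
lineage_edges <- function(x) {
  path <- c("origin", x$value[match(rank_order, x$rank)])
  path <- path[!is.na(path)]  # skip ranks that are missing for this species
  data.frame(from = head(path, -1), to = tail(path, -1),
             stringsAsFactors = FALSE)
}

# One edge set per species, stacked, with shared edges kept only once
hierarchy <- unique(do.call(rbind, lapply(split(df, df$species), lineage_edges)))
rownames(hierarchy) <- NULL

vertices <- data.frame(name = unique(c(hierarchy$from, hierarchy$to)))
hierarchy
```

With the full seven-rank data, only rank_order should need to change, assuming the rank names match those returned by classification().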

Then we either plot them using igraph:

library(igraph)

g <- graph_from_data_frame(hierarchy, vertices = vertices)
lay <- layout_as_tree(g)
par(mar = c(0, 0, 0, 0))
# swap and negate the coordinates so the tree grows from left to right
plot(g, layout = -lay[, 2:1], vertex.label.cex = 0.7,
     vertex.size = 1, edge.arrow.size = 0.4)

enter image description here

Or something like this in ggraph:

library(ggraph)
ggraph(g,layout='dendrogram',circular=FALSE) + 
geom_edge_link() + 
geom_node_label(aes(label=names(V(g))),size=2,nudge_y=-0.1) + 
scale_y_reverse()  + coord_flip() + theme_void()

enter image description here

Mccourt answered 28/3, 2020 at 19:59 Comment(7)
This would suit my needs if we could generate the variables (like d1, d2, and d3) generically rather than defining their structure by hand. I only gave an example dataset here, but mine consists of 7 ranks and more species. - Proportioned
I get the data from class.dat <- classification(c("Pseudomonas putida", "Pseudomonas aeruginosa","Enterobacter xiangfangensis","Salmonella enterica","Klebsiella pneumoniae"), db = 'ncbi'), which already provides the data frame. - Proportioned
Precisely my point. The output from classification(...) gives you a list, not a data frame, and you did some conversion to get a data.frame. Do you mind trying to at least get data frames similar to d1, d2 and d3 from that? - Mccourt
For example, using your code above, if I do t(sapply(class.dat, function(i) i[i$rank %in% c("family","genus"), "name"])), it's something similar to d2, right? - Mccourt
Found it! The default theme just hides it, so I had to add theme_classic() and scale_y_reverse(labels = rev(interest)). - Proportioned
OK, so if you have problems fitting the labels in, one way is to use stringr::str_wrap(); for example, Vlab = str_wrap(names(V(g)), 6), then geom_node_label(aes(label = Vlab), size = 3, nudge_y = -0.1). - Mccourt
Thank you! Was just searching for that, haha. - Proportioned
S
3

Here's a graph-based approach, where d is the data frame from the question:

df = do.call(rbind, lapply(split(d, d$species), function(x){
    data.frame(rbind(c(x$value[match(c("family", "genus"), x$rank)], "root"),
                     c(x$value[match(c("genus", "species"), x$rank)], NA)),
               stringsAsFactors = FALSE)
}))
df = unique(df)
rownames(df) = NULL
df

library(igraph)

g = graph_from_data_frame(df, directed = FALSE)

plot(g, layout = layout_as_tree(g, root = which(V(g)$name %in% sort(unique(df[,1][df[,3] == "root"])))))

And with ggplot2, using dplyr and tidyr for the reshaping:

library(dplyr)
library(tidyr)
library(ggplot2)

d2 = d %>%
    spread(rank, value) %>%
    arrange(family, genus, species) %>%
    mutate(species = sapply(strsplit(species, " "), "[", 2),
           y3 = row_number(),
           grp = row_number(),
           y2 = ave(y3, genus, FUN = function(x) mean(x)),
           y1 = ave(y2, family, FUN = function(x) mean(x))) %>%
    gather(key, y, -family, -genus, -species, -grp) %>%
    mutate(x = as.numeric(factor(key, c("y1", "y2", "y3"))),
           lbl = case_when(
               key == "y1" ~ family,
               key == "y2" ~ genus,
               key == "y3" ~ species,
               TRUE ~ NA_character_)) %>%
    arrange(x, y)

graphics.off()
ggplot(d2, aes(x, y, group = grp, label = lbl)) +
    geom_point(size = 2, shape = 21) +
    geom_line() +
    geom_text(hjust = "inward", vjust = "inward")
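The positions above are hardcoded for three ranks (y1, y2, y3). A possible generalisation, sketched in base R under the assumption that the data is already in a wide table with one column per rank ordered from highest to lowest, is to loop over the ranks from the bottom up and give each group the mean position of its rows. The names wide, ranks, pos and nodes are illustrative only.

```r
# Wide table: one row per species, columns ordered from highest to lowest rank
wide <- data.frame(
  family  = c("Pseudomonadaceae", "Pseudomonadaceae", "Enterobacteriaceae",
              "Enterobacteriaceae", "Enterobacteriaceae"),
  genus   = c("Pseudomonas", "Pseudomonas", "Enterobacter",
              "Salmonella", "Klebsiella"),
  species = c("Pseudomonas putida", "Pseudomonas aeruginosa",
              "Enterobacter hormaechei", "Salmonella enterica",
              "Klebsiella pneumoniae"),
  stringsAsFactors = FALSE
)
ranks <- names(wide)

# Sort so that members of the same group end up on neighbouring rows
wide <- wide[do.call(order, unname(as.list(wide))), ]

# Walk from the lowest rank upwards: leaves keep their row number,
# every higher rank gets the mean position of the rows beneath it
y <- as.numeric(seq_len(nrow(wide)))
pos <- list()
for (i in rev(seq_along(ranks))) {
  # group rows by their full path down to rank i
  group <- do.call(paste, c(unname(as.list(wide[ranks[seq_len(i)]])), sep = "|"))
  y <- ave(y, group)
  pos[[ranks[i]]] <- data.frame(x = i, y = y, lbl = wide[[ranks[i]]],
                                grp = seq_len(nrow(wide)),
                                stringsAsFactors = FALSE)
}
nodes <- do.call(rbind, pos)
```

The nodes table has the same x, y, lbl and grp columns that the ggplot call above expects, so it should be usable in place of d2. Adding scale_x_continuous(breaks = seq_along(ranks), labels = ranks) is one way to label the x-axis with the rank names.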
Seidler answered 28/3, 2020 at 19:9 Comment(1)
Is it possible to add an origin here and the x-axis labels as in my question's example? Besides this I like the result, although I would prefer something more generic instead of having to hardcode things like "y1" ~ family, as my original data consists of 7 ranks rather than 3. - Proportioned
C
3

A great source on phylogenetic trees in R, by Prof. Guangchuang Yu:

https://yulab-smu.top/treedata-book/index.html

Here's my solution using ggtree:

# Packages :

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("ggtree")

library(ggplot2)
library(ggtree)
library(treeio)
library(ape)
library(tidytree)


It's better to use a dedicated format when computing phylogenies (for example NEXUS).


# New Hampshire eXtended format :

treetext="(((P.putida:1[&&NHX:S=S],P.aeruginosa:1[&&NHX:S=S:B=])
:1.3[&&NHX:D=Pseudomonas:S=G]):1[&&NHX:D=Pseudomonadaceae:S=F],
((K.pneumoniae:1[&&NHX:]):1.3[&&NHX:D=Klebsiella],(S.enterica:1[&&NHX:])
:1.3[&&NHX:D=Salmonella],(E.xiangfangensis:1[&&NHX:]):1.3[&&NHX:D=Enterobacter])
:1[&&NHX:D=Enterobacteriaceae])
:1[&&NHX:D=Gammaproteobacteria];"


tree <- read.nhx(textConnection(treetext))

# Plot Stuff

d <- data.frame(.panel = c('Tree','Tree','Tree','Tree'), 
                lab = c("Class","Family" ,"Genus", "Species"), 
                x=c(0,1,2,3), y=-2)

p <- ggtree(tree) + geom_tiplab(offset = 0) + 
  geom_label(aes(x=branch, label=S), fill='lightgreen') + 
  geom_label(aes(label=D), fill='lightblue') + coord_cartesian(clip = 'off') + 
  theme_tree2(plot.margin=margin(3, 3, 3, 3 ,"cm"), axis.ticks = element_blank(), axis.text.x = element_blank())

p+geom_text(aes(label=lab), data=d)

Image

Camara answered 29/3, 2020 at 15:11 Comment(0)
P
2

A solution with ggplot2:

# library
library(taxize)
library(ape)
library(ggdendro)
library(DECIPHER)
library(ggplot2)


# get data
class.dat <- classification(c("Pseudomonas putida", "Pseudomonas aeruginosa","Enterobacter xiangfangensis","Salmonella enterica","Klebsiella pneumoniae"), db = 'ncbi')

#make tree
taxize::class2tree(class.dat, varstep=FALSE,check=TRUE) -> tree

#adjust length
tree$phylo <- compute.brlen(tree$phylo, 10)

#convert tree to dendrogram via a temporary Newick file
tree_file <- tempfile(fileext = ".nwk")
ape::write.tree(tree$phylo, file = tree_file, append = FALSE,
           digits = 10, tree.names = FALSE)
dend <- DECIPHER::ReadDendrogram(tree_file)

#get data from the dendrogram
dend_data <- dendro_data(dend, type = "rectangle")

# plot it with ggplot2
ggplot() + 
  geom_segment(data=segment(dend_data), aes(x=x, y=y, xend=xend, yend=yend)) + 
  geom_text(data=dend_data$labels, aes(x=x, y=y, label=label, hjust=0), size=3) +
  coord_flip() + 
  scale_y_reverse(limits=c(20,-12),expand=c(0.1,1),breaks=c(20,10,0), labels=c("Family","Genus","Species")) +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.x=element_blank())
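The axis breaks of 20, 10 and 0 in the plot call depend on the compute.brlen(tree$phylo, 10) step. As far as I understand ape's documentation, a numeric method argument is used directly as the branch lengths and recycled over all edges, so each rank level ends up 10 units apart. A minimal check on a toy tree (the tree text is made up for illustration):

```r
library(ape)

# Toy tree with four tips and no branch lengths
toy <- read.tree(text = "((a,b),(c,d));")

# A numeric 'method' is taken as the branch lengths, recycled over all edges
toy <- compute.brlen(toy, 10)
toy$edge.length
```

With more than three ranks, the limits, breaks and labels in scale_y_reverse would presumably need one entry per level, spaced by the same branch length.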

enter image description here

Primateship answered 28/3, 2020 at 23:44 Comment(1)
Thank you @ava, I like this solution, but here it seems that they all share the rank family, which they do not (see my example tree). Also, I provided a simple example of my data; in reality I use almost all ranks: kingdom, phylum, class, order, family, genus, species. - Proportioned
