Chapter 2 Manipulating Tree with Data

2.1 Manipulating tree data using tidy interface

All the tree data parsed/merged by treeio can be converted to tidy data frame using the tidytree package. The tidytree package provides tidy interfaces to manipulate tree with associated data. For instances, external data can be linked to phylogeny or evolutionary data obtained from different sources can be merged using tidyverse verbs. After the tree data was manipulated, it can be converted back to treedata object and exported to a single tree file, further analyzed in R or visualized using ggtree (Yu et al. 2017).

2.1.1 The phylo object

The phylo class defined in ape is fundamental for phylogenetic analysis in R. Most of the R packages in this field rely extensively on phylo object. The tidytree package provides as_tibble method to convert the phylo object to tidy data frame, a tbl_tree object.

library(ape)
set.seed(2017)
tree <- rtree(4)
tree
## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
## 
## Rooted; includes branch lengths.
x <- as_tibble(tree)
x
## # A tibble: 7 x 4
##   parent  node branch.length label
##    <int> <int>         <dbl> <chr>
## 1      5     1       0.435   t4   
## 2      7     2       0.674   t1   
## 3      7     3       0.00202 t3   
## 4      6     4       0.0251  t2   
## 5      5     5      NA       <NA> 
## 6      5     6       0.472   <NA> 
## 7      6     7       0.274   <NA>

The tbl_tree object can be converted back to a phylo object.

as.phylo(x)
## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
## 
## Rooted; includes branch lengths.

Using tbl_tree object makes tree and data manipulation more effective and easier. For example, we can link evolutionary trait to phylogeny using dplyr verbs full_join:

d <- tibble(label = paste0('t', 1:4),
            trait = rnorm(4))

y <- full_join(x, d, by = 'label')
y
## # A tibble: 7 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     5      NA       <NA>  NA    
## 6      5     6       0.472   <NA>  NA    
## 7      6     7       0.274   <NA>  NA

2.1.2 The treedata object

The tidytree package defines treedata class to store phylogenetic tree with associated data. After mapping external data to the tree structure, the tbl_tree object can be converted to a treedata object.

as.treedata(y)
## 'treedata' S4 object'.
## 
## ...@ phylo: 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
## [1] "t4" "t1" "t3" "t2"
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##  'trait'.

The treedata class is also used in treeio package to store evolutionary evidences inferred by commonly used software (BEAST, EPA, HYPHY, MrBayes, PAML, PHYLDOG, pplacer, r8s, RAxML and RevBayes, etc.) (see details in Chapter 1).

The tidytree package also provides as_tibble to convert treedata object to a tidy data frame. The phylogentic tree structure and the evolutionary inferences were stored in the tbl_tree object, making it consistent and easier for manipulating evolutionary statistics inferred by different software as well as linking external data to the same tree structure.

y %>% as.treedata %>% as_tibble
## # A tibble: 7 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     5      NA       <NA>  NA    
## 6      5     6       0.472   <NA>  NA    
## 7      6     7       0.274   <NA>  NA

2.1.3 Access related nodes

dplyr verbs can be applied to tbl_tree directly to manipulate tree data. In addition, tidytree provides several verbs to filter related nodes, including child, parent, offspring, ancestor, sibling and MRCA.

These verbs accept a tbl_tree and a selected node which can be node number or label.

child(y, 5)
## # A tibble: 2 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1         0.435 t4     0.943
## 2      5     6         0.472 <NA>  NA
parent(y, 2)
## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      6     7         0.274 <NA>     NA
offspring(y, 5)
## # A tibble: 6 x 5
##   parent  node branch.length label  trait
##    <int> <int>         <dbl> <chr>  <dbl>
## 1      5     1       0.435   t4     0.943
## 2      7     2       0.674   t1    -0.171
## 3      7     3       0.00202 t3     0.570
## 4      6     4       0.0251  t2    -0.283
## 5      5     6       0.472   <NA>  NA    
## 6      6     7       0.274   <NA>  NA
ancestor(y, 2)
## # A tibble: 3 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      5     5        NA     <NA>     NA
## 2      5     6         0.472 <NA>     NA
## 3      6     7         0.274 <NA>     NA
sibling(y, 2)
## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      7     3       0.00202 t3    0.570
MRCA(y, 2, 3)
## # A tibble: 1 x 5
##   parent  node branch.length label trait
##    <int> <int>         <dbl> <chr> <dbl>
## 1      6     7         0.274 <NA>     NA

All these methods also implemented in treeio for working with phylo and treedata objects. You can try accessing related nodes using the tree object. For instance, the following command will output child nodes of the selected internal node 5:

child(tree, 5)
## [1] 1 6

Beware that the methods work for tree objects output related node numbers, while the methods implemented for tbl_tree object output a tibble object that contains related information.

2.2 Data Integration

2.2.1 Combining tree data

The treeio package serves as an infrastructure that enables various types of phylogenetic data inferred from common analysis programs to be imported and used in R. For instance dN/dS or ancestral sequences estimated by CODEML, and clade support values (posterior) inferred by BEAST/MrBayes. In addition, treeio package supports linking external data to phylogeny. It brings these external phylogenetic data (either from software output or exteranl sources) to the R community and make it available for further analysis in R. Furthermore, treeio can combine multiple phylogenetic trees together into one with their node/branch-specific attribute data. Essentially, as a result, one such attribute (e.g., substitution rate) can be mapped to another attribute (e.g., dN/dS) of the same node/branch for comparison and further computations (Yu et al. 2017).

A previously published data set, seventy-six H3 hemagglutinin gene sequences of a lineage containing swine and human influenza A viruses (Liang et al. 2014), was here to demonstrate the utilities of comparing evolutionary statistics inferred by different software. The dataset was re-analyzed by BEAST for timescale estimation and CODEML for synonymous and non-synonymous substitution estimation. In this example, we first parsed the outputs from BEAST using read.beast and from CODEML using read.codeml into two treedata objects. Then the two objects containing separate sets of node/branch-specific data were merged via the merge_tree function.

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
rst_file <- system.file("examples/rst", package="ggtree")
mlc_file <- system.file("examples/mlc", package="ggtree")
beast_tree <- read.beast(beast_file)
codeml_tree <- read.codeml(rst_file, mlc_file)

merged_tree <- merge_tree(beast_tree, codeml_tree)
merged_tree
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/ggtree/examples/MCC_FluA_H3.tree',
##  '/home/ygc/R/library/ggtree/examples/rst',
##  '/home/ygc/R/library/ggtree/examples/mlc'.
## 
## ...@ phylo: 
## Phylogenetic tree with 76 tips and 75 internal nodes.
## 
## Tip labels:
##  A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##  'height',   'height_0.95_HPD',  'height_median',
##  'height_range', 'length',   'length_0.95_HPD',
##  'length_median',    'length_range', 'posterior',    'rate',
##  'rate_0.95_HPD',    'rate_median',  'rate_range',   'subs',
##  'AA_subs',  't',    'N',    'S',    'dN_vs_dS', 'dN',   'dS',   'N_x_dN',
##  'S_x_dS'.

After merging the beast_tree and codeml_tree objects, all node/branch-specific data imported from BEAST and CODEML output files are all available in the merged_tree object. The tree object was converted to tidy data frame using tidytree package and visualized as hexbin scatterplot of dN/dS, dN and dS inferred by CODEML versus rate (substitution rate in unit of substitutions/site/year) inferred by BEAST on the same branches.

library(dplyr)
df <- merged_tree %>% 
  as_tibble() %>%
  select(dN_vs_dS, dN, dS, rate) %>%
  subset(dN_vs_dS >=0 & dN_vs_dS <= 1.5) %>%
  tidyr::gather(type, value, dN_vs_dS:dS)
df$type[df$type == 'dN_vs_dS'] <- 'dN/dS'
df$type <- factor(df$type, levels=c("dN/dS", "dN", "dS"))
ggplot(df, aes(rate, value)) + geom_hex() + 
  facet_wrap(~type, scale='free_y') 
Correlation of dN/dS, dN and dS versus substitution rate. After merging the BEAST and CodeML outputs, the branch-specific estimates (substitution rate, dN/dS , dN and dS) from the two analysis programs are compared on the same branch basis. The associations of dN/dS, dN and dS vs. rate are visualized in hexbin scatter plots.

Figure 2.1: Correlation of dN/dS, dN and dS versus substitution rate. After merging the BEAST and CodeML outputs, the branch-specific estimates (substitution rate, dN/dS , dN and dS) from the two analysis programs are compared on the same branch basis. The associations of dN/dS, dN and dS vs. rate are visualized in hexbin scatter plots.

The output is illustrated in Fig. 2.1. We can then test the association of these node/branch-specific data using Pearson correlation, which in this case showed that dN and dS, but not dN/dS are significantly (p-values) associated with rate.

Using merge_tree, we are able to compare analysis results using identical model from different software packages or different models using different or identical software. It also allows users to integrate different analysis finding from different software packages. Merging tree data is not restricted to software findings, associating external data to analysis findings is also granted. The merge_tree function is chainable and allows several tree objects to be merged into one.

phylo <- as.phylo(beast_tree)
N <- Nnode2(phylo)
d <- tibble(node = 1:N, fake_trait = rnorm(N), another_trait = runif(N))
fake_tree <- treedata(phylo = phylo, data = d)
triple_tree <- merge_tree(merged_tree, fake_tree)
triple_tree
## 'treedata' S4 object that stored information of
##  '/home/ygc/R/library/ggtree/examples/MCC_FluA_H3.tree',
##  '/home/ygc/R/library/ggtree/examples/rst',
##  '/home/ygc/R/library/ggtree/examples/mlc'.
## 
## ...@ phylo: 
## Phylogenetic tree with 76 tips and 75 internal nodes.
## 
## Tip labels:
##  A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##  'height',   'height_0.95_HPD',  'height_median',
##  'height_range', 'length',   'length_0.95_HPD',
##  'length_median',    'length_range', 'posterior',    'rate',
##  'rate_0.95_HPD',    'rate_median',  'rate_range',   'subs',
##  'AA_subs',  't',    'N',    'S',    'dN_vs_dS', 'dN',   'dS',   'N_x_dN',
##  'S_x_dS',   'fake_trait',   'another_trait'.

The triple_tree object showed above contains analysis results obtained from BEAST and CODEML, and evolutionary trait from external sources. All these information can be used to annotate the tree using ggtree (Yu et al. 2017).

2.2.3 Grouping taxa

The groupOTU and groupClade methods are designed for adding taxa grouping information to the input tree object. The methods were implemented in tidytree, treeio and ggtree respectively to support adding grouping information at tbl_tree, phylo and treedata, and ggtree levels. These grouping information can be used directly in tree visualization (e.g. coloring tree based on grouping) with ggtree (Figure 6.5).

2.2.3.1 groupClade

The groupClade method accepts an internal node or a vector of internal nodes to add grouping information of clade/clades.

nwk <- '(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);'
tree <- read.tree(text=nwk)

groupClade(as_tibble(tree), c(17, 21))
## # A tibble: 25 x 5
##    parent  node branch.length label group
##     <int> <int>         <dbl> <chr> <fct>
##  1     20     1             4 A     1    
##  2     20     2             4 B     1    
##  3     19     3             5 C     1    
##  4     18     4             6 D     1    
##  5     17     5            21 E     1    
##  6     22     6             4 F     2    
##  7     22     7            12 G     2    
##  8     21     8             8 H     2    
##  9     24     9             5 I     0    
## 10     24    10             2 J     0    
## # … with 15 more rows

2.2.3.2 groupOTU

set.seed(2017)
tr <- rtree(4)
x <- as_tibble(tr)
## the input nodes can be node ID or label
groupOTU(x, c('t1', 't4'), group_name = "fake_group")
## # A tibble: 7 x 5
##   parent  node branch.length label fake_group
##    <int> <int>         <dbl> <chr> <fct>     
## 1      5     1       0.435   t4    1         
## 2      7     2       0.674   t1    1         
## 3      7     3       0.00202 t3    0         
## 4      6     4       0.0251  t2    0         
## 5      5     5      NA       <NA>  1         
## 6      5     6       0.472   <NA>  1         
## 7      6     7       0.274   <NA>  1

Both groupClade() and groupOTU() work with tbl_tree, phylo and treedata, and ggtree objects. Here is an example of using groupOTU() with phylo tree object.

groupOTU(tr, c('t1', 't4'), group_name = "fake_group") %>%
  as_tibble
## # A tibble: 7 x 5
##   parent  node branch.length label fake_group
##    <int> <int>         <dbl> <chr> <fct>     
## 1      5     1       0.435   t4    1         
## 2      7     2       0.674   t1    1         
## 3      7     3       0.00202 t3    0         
## 4      6     4       0.0251  t2    0         
## 5      5     5      NA       <NA>  1         
## 6      5     6       0.472   <NA>  1         
## 7      6     7       0.274   <NA>  1

Another example of working with ggtree object can be found in session 6.5.

The groupOTU will trace back from input nodes to most recent common ancestor. In this example, nodes 2, 3, 7 and 6 (2 (t1) -> 7 -> 6 and 3 (t4) -> 6) are grouping together.

Related OTUs are grouping together and they are not necessarily within a clade. They can be monophyletic (clade), polyphyletic or paraphyletic.

cls <- list(c1=c("A", "B", "C", "D", "E"),
            c2=c("F", "G", "H"),
            c3=c("L", "K", "I", "J"),
            c4="M")

as_tibble(tree) %>% groupOTU(cls)
## # A tibble: 25 x 5
##    parent  node branch.length label group
##     <int> <int>         <dbl> <chr> <fct>
##  1     20     1             4 A     c1   
##  2     20     2             4 B     c1   
##  3     19     3             5 C     c1   
##  4     18     4             6 D     c1   
##  5     17     5            21 E     c1   
##  6     22     6             4 F     c2   
##  7     22     7            12 G     c2   
##  8     21     8             8 H     c2   
##  9     24     9             5 I     c3   
## 10     24    10             2 J     c3   
## # … with 15 more rows

If there are conflicts when tracing back to mrca, user can set overlap parameter to “origin” (the first one counts), “overwrite” (default, the last one counts) or “abandon” (un-selected for grouping)8.

2.3 Rescaling Tree Branches

Phylogenetic data can be merged for joint analysis (Figure 2.1). They can be displayed on the same tree structure as more complex annotation to help visually inspection of their evolutionary patterns. All the numerical data stored in treedata object can be used to re-scale tree branches. For example, CodeML infers dN/dS, dN and dS, all these statistics can be used as branch lengths. All these values can also be used to color the tree (session 4.3.4) and can be project to vertical dimension to create two-dimensional tree or phenogram (session 4.2.2 and Figure 4.5 and 4.11).

p1 <- ggtree(merged_tree) + theme_tree2()
p2 <- ggtree(rescale_tree(merged_tree, 'dN')) + theme_tree2()
p3 <- ggtree(rescale_tree(merged_tree, 'rate')) + theme_tree2()

cowplot::plot_grid(p1, p2, p3, ncol=3, labels = LETTERS[1:3])
Re-scaling tree branches. The tree with branches scaled in time (year from the root) (A). The tree was re-scaled using dN as branch lengths (B). The tree was re-scaled using substitution rates (C).

Figure 2.2: Re-scaling tree branches. The tree with branches scaled in time (year from the root) (A). The tree was re-scaled using dN as branch lengths (B). The tree was re-scaled using substitution rates (C).

2.4 Subsetting Tree with Data

2.4.1 Removing tips in a phylogenetic tree

Sometimes we want to remove selected tips from a phylogenetic tree. This is due to several reasons, including low sequence quality, errors in sequence assembly, an alignment error in part of the sequence and an error in phylogenetic inference etc.

Let’s say that we want to remove three tips (colored by red) from the tree (Figure 2.3A), the drop.tip() method removes specified tips and update tree (Figure 2.3B). All associated data will be maintained in the updated tree.

f <- system.file("extdata/NHX", "phyldog.nhx", package="treeio")
nhx <- read.nhx(f)
to_drop <- c("Physonect_sp_@2066767",
            "Lychnagalma_utricularia@2253871",
            "Kephyes_ovata@2606431")
p1 <- ggtree(nhx) + geom_tiplab(aes(color = label %in% to_drop)) +
  scale_color_manual(values=c("black", "red")) + xlim(0, 0.8)

nhx_reduced <- drop.tip(nhx, to_drop)
p2 <- ggtree(nhx_reduced) + geom_tiplab() + xlim(0, 0.8)  
plot_grid(p1, p2, ncol=2, labels = c("A", "B"))
Removing tips from tree. Original tree with three tips (colored by red) to remove (A). Updated tree that removed selected tips (B).

Figure 2.3: Removing tips from tree. Original tree with three tips (colored by red) to remove (A). Updated tree that removed selected tips (B).

2.4.2 Subsetting tree by tip label

Tree can be large and difficult to look at only the portions of interest. The tree_subset() function was created in treeio package to extract a subset of the tree portion while still maintaining the structure of the tree portion. The beast_tree in Figure 2.4A is slighly crowded. Obviously, we can make the figure taller to allow more space for the labels (similar to use “Expansion” slider in FigTree) or we can make the text smaller. However, these solutions are not always applicable when you have a lot of tips (e.g. hundreds or thousands of tips). In particuarly, when you are only interested in the portion of the tree around a particular tip, you certainly don’t want to explore a large tree to find centain species you are interested in.

Let’s say you are interested in tip A/Swine/HK/168/2012 from the tree (Figure 2.4A) and you want to look at the immediate relatives of this tip.

The tree_subset() function allows for you to look at the portions of the tree that are of interest. By default, tree_subset() will internally call groupOTU() to assign group specified tip from the rest of other tips (Figure 2.4B). Additionally, the branch lengths and related associated data are maintained after subsetting (Figure 2.4C). The root of the tree is always anchored at zero for the subset tree by default and all the distances are relative to this root. If you want all the distances are relative to the original root, you can specify the root position (by root.position parameter) to the root edge of the subset tree, which is the sum of branch lengths from the original root to the root of the subset tree (Figure 2.4D and E).

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)

p1 = ggtree(beast_tree) + 
  geom_tiplab() +  xlim(0, 40) + theme_tree2()

tree2 = tree_subset(beast_tree, "A/Swine/HK/168/2012", levels_back=4)  
p2 <- ggtree(tree2, aes(color=group)) +
  scale_color_manual(values = c("black", "red")) +
  geom_tiplab() +  xlim(0, 4) + theme_tree2() 

p3 <- ggtree(tree2, aes(color=group)) +
  geom_tiplab(hjust = -.1) + xlim(0, 5) + 
  geom_point(aes(fill = rate), shape = 21, size = 4) +
  scale_color_manual(values = c("black", "red"), guide = FALSE) +
  scale_fill_continuous(low = 'blue', high = 'red') +
  theme_tree2() + theme(legend.position = 'right')


p4 <- ggtree(tree2, aes(color=group), 
          root.position = as.phylo(tree2)$root.edge) +
  geom_tiplab() + xlim(18, 24) + 
  scale_color_manual(values = c("black", "red")) +
  theme_tree2()

p5 <- ggtree(tree2, aes(color=group), 
          root.position = as.phylo(tree2)$root.edge) +
  geom_rootedge() + geom_tiplab() + xlim(0, 40) + 
  scale_color_manual(values = c("black", "red")) +
  theme_tree2()

plot_grid(p2, p3, p4, p5, ncol=2, labels=LETTERS[2:5]) %>%
  plot_grid(p1, ., ncol=2, labels=c("A", ""), rel_widths=c(.5, 1))
Subsetting tree for specific tip. The original tree (A). The subset tree (B). Subset tree with data (C). Visualize the subset tree relative to original position, without root edge (D) and with root edge (E).

Figure 2.4: Subsetting tree for specific tip. The original tree (A). The subset tree (B). Subset tree with data (C). Visualize the subset tree relative to original position, without root edge (D) and with root edge (E).

2.4.3 Subsetting tree by internal node number

If you are interesting at certain clade, you can specify the input node as an internal node number. The tree_subset() function will take the clade as a whole and also trace back to particular levels to look at the immediate relatives of the clade (Figure 2.5A and B). We can use tree_subset to zoom in selected portions and plot a whole tree with the portion of it, that is similar to the ape::zoom() function to explore very large tree (Figure 2.5C and D).

clade <- tree_subset(beast_tree, node=121, levels_back=0)
clade2 <- tree_subset(beast_tree, node=121, levels_back=2)
p1 <- ggtree(clade) + geom_tiplab() + xlim(0, 5)
p2 <- ggtree(clade2, aes(color=group)) + geom_tiplab() + 
  xlim(0, 8) + scale_color_manual(values=c("black", "red"))


library(ape)
library(tidytree)
library(treeio)

data(chiroptera)

nodes <- grep("Plecotus", chiroptera$tip.label)
chiroptera <- groupOTU(chiroptera, nodes)

clade <- MRCA(chiroptera, nodes)
x <- tree_subset(chiroptera, clade, levels_back = 0)

p3 <- ggtree(chiroptera, aes(colour = group)) + 
  scale_color_manual(values=c("black", "red")) +
  theme(legend.position = "none")
p4 <- ggtree(x) + geom_tiplab() + xlim(0, 5)
plot_grid(p1, p2, p3, p4, 
  ncol=2, labels=LETTERS[1:4])
Subsetting tree for specific clade. Extracting a clade (A). Extracting a clade and trace back to look at its immediate relatives (B). Viewing a very large tree (C) and a selected portion of it (D).

Figure 2.5: Subsetting tree for specific clade. Extracting a clade (A). Extracting a clade and trace back to look at its immediate relatives (B). Viewing a very large tree (C) and a selected portion of it (D).

2.5 Manipulating tree data for visualization

Tree visualization is supported by ggtree. Although ggtree implemented several methods for visual exploration of tree with data, you may want to do something that is not supported directly. In this case, you need to manipulate tree data with node coordination positions that used for visualization. This is quite easy with ggtree. User can use foritfy() method which internally call tidytree::as_tibble() to convert the tree to tidy data frame and add columns of coordination positions (i.e. x, y, branch and angle) that are used to plot the tree. You can also access the data via ggtree(tree)$data.

Here is an example to plot two trees face to face that is similar to a ape::cophyloplot().

library(dplyr)
library(ggtree)

x <- rtree(30)
y <- rtree(30)
p1 <- ggtree(x)
p2 <- ggtree(y)

d1 <- p1$data
d2 <- p2$data

## reverse x-axis and 
## set offset to make the tree in the right hand side of the first tree
d2$x <- max(d2$x) - d2$x + max(d1$x) + 1

pp <- p1 + geom_tiplab() + geom_tree(data=d2) + geom_tiplab(data = d2, hjust=1)

dd <- bind_rows(d1, d2) %>% 
  filter(!is.na(label))

pp + geom_line(aes(x, y, group=label), data=dd, color='grey')
Plot two phylogenetic trees face to face. Plotting a tree using ggtree() (left hand side) and subsequently add another layer of tree by geom_tree() (right hand side). The relative positions of the plotted trees can be manual adjusted and adding layers to each of the tree (e.g. tip labels) is independent.

Figure 2.6: Plot two phylogenetic trees face to face. Plotting a tree using ggtree() (left hand side) and subsequently add another layer of tree by geom_tree() (right hand side). The relative positions of the plotted trees can be manual adjusted and adding layers to each of the tree (e.g. tip labels) is independent.

It is quite easy to plot multiple trees and connect taxa in one figure. For instance, plotting trees contructed from all internal gene segments of influenza virus and connecting equivalent strans across the trees (Venkatesh et al. 2018).

z <- rtree(30)
d2 <- fortify(y)
d3 <- fortify(z)
d2$x <- d2$x + max(d1$x) + 1
d3$x <- d3$x + max(d2$x) + 1

dd = bind_rows(d1, d2, d3) %>% 
  filter(!is.na(label))

p1 + geom_tree(data = d2) + geom_tree(data = d3) + geom_tiplab(data=d3) + 
  geom_line(aes(x, y, group=label, color=node < 15), data=dd, alpha=.3)
Plot multiple phylogenetic trees side by side. Plotting a tree using ggtree() and subsequently add multiple layers of trees by geom_tree().

Figure 2.7: Plot multiple phylogenetic trees side by side. Plotting a tree using ggtree() and subsequently add multiple layers of trees by geom_tree().

2.6 Summary

The treeio package allows us to import diverse phylogeny associated data into R. However, phylogenetic tree is stored in way to facilitate computational processing which is not human fridenly and need expertise to manipulate and explore tree data. The tidytree package provides tidy interface for exploring tree data, while ggtree provides a set of utilitise to visualize and explore tree data using grammar of graphics. This full suit of packages make it easy for ordinary users to interact with tree data, and allow us to integrate phylogeny associated data from different sources (e.g. experimental result or analysis finding), which creates the possibility of comparative study.

References

Liang, Huyi, Tommy Tsan-Yuk Lam, Xiaohui Fan, Xinchun Chen, Yu Zeng, Ji Zhou, Lian Duan, et al. 2014. “Expansion of Genotypic Diversity and Establishment of 2009 H1N1 Pandemic-Origin Internal Genes in Pigs in China.” Journal of Virology, July, JVI.01327–14. https://doi.org/10.1128/JVI.01327-14.

Venkatesh, Divya, Marjolein J. Poen, Theo M. Bestebroer, Rachel D. Scheuer, Oanh Vuong, Mzia Chkhaidze, Anna Machablishvili, et al. 2018. “Avian Influenza Viruses in Wild Birds: Virus Evolution in a Multihost Ecosystem.” Journal of Virology 92 (15). https://doi.org/10.1128/JVI.00433-18.

Yu, Guangchuang, Tommy Tsan-Yuk Lam, Huachen Zhu, and Yi Guan. 2018. “Two Methods for Mapping and Visualizing Associated Data on Phylogeny Using Ggtree.” Molecular Biology and Evolution 35 (12): 3041–3. https://doi.org/10.1093/molbev/msy194.

Yu, Guangchuang, David K. Smith, Huachen Zhu, Yi Guan, and Tommy Tsan-Yuk Lam. 2017. “Ggtree: An R Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data.” Methods in Ecology and Evolution 8 (1): 28–36. https://doi.org/10.1111/2041-210X.12628.