Chapter 5 Gene Ontology Analysis

5.1 Supported organisms

GO analyses (groupGO(), enrichGO() and gseGO()) support organisms that have an OrgDb object available.

Bioconductor have already provide OrgDb for about 20 species. User can query OrgDb online by AnnotationHub or build their own by AnnotationForge. An example can be found in the vignette of GOSemSim.

If user have GO annotation data (in data.frame format with first column of gene ID and second column of GO ID), they can use enricher() and gseGO() functions to perform over-representation test and gene set enrichment analysis.

If genes are annotated by direction annotation, it should also annotated by its ancestor GO nodes (indirect annation). If user only has direct annotation, they can pass their annotation to buildGOmap function, which will infer indirection annotation and generate a data.frame that suitable for both enricher() and gseGO().

5.2 GO classification

In clusterProfiler, groupGO is designed for gene classification based on GO distribution at a specific level. Here we use dataset geneList provided by DOSE. Please refer to vignette of DOSE for more details.

library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
gene.df <- bitr(gene, fromType = "ENTREZID",
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)
head(gene.df)
##   ENTREZID         ENSEMBL SYMBOL
## 1     4312 ENSG00000196611   MMP1
## 2     8318 ENSG00000093009  CDC45
## 3    10874 ENSG00000109255    NMU
## 4    55143 ENSG00000134690  CDCA8
## 5    55388 ENSG00000065328  MCM10
## 6      991 ENSG00000117399  CDC20
ggo <- groupGO(gene     = gene,
               OrgDb    = org.Hs.eg.db,
               ont      = "CC",
               level    = 3,
               readable = TRUE)

head(ggo)
##                    ID                    Description Count
## GO:0005886 GO:0005886                plasma membrane    55
## GO:0005628 GO:0005628              prospore membrane     0
## GO:0005789 GO:0005789 endoplasmic reticulum membrane     8
## GO:0019867 GO:0019867                 outer membrane     3
## GO:0031090 GO:0031090             organelle membrane    16
## GO:0034357 GO:0034357        photosynthetic membrane     0
##            GeneRatio
## GO:0005886    55/207
## GO:0005628     0/207
## GO:0005789     8/207
## GO:0019867     3/207
## GO:0031090    16/207
## GO:0034357     0/207
##                                                                                                                                                                                                                                                                                                                                                      geneID
## GO:0005886 S100A9/MELK/S100A8/MARCO/ASPM/CXCL10/LAMP3/CEP55/UGT8/UBE2C/SLC7A5/CXCL9/FADS2/MSLN/IL1R2/KIF18A/S100P/GZMB/TRAT1/GABRP/AQP9/GPR19/SLC2A6/KIF20A/LAG3/NUDT1/CACNA1D/VSTM4/ITPR1/SYT17/SLC16A4/CORIN/KCNK15/CA12/KCNE4/HLA-DQA1/ADH1B/PDZK1/C7/ACKR1/COL17A1/PSD3/EMCN/SLC44A4/LRP2/NLGN4X/MAPT/ERBB4/CX3CR1/LAMP5/ABCA8/STEAP4/PTPRT/TMC5/CYBRD1
## GO:0005628                                                                                                                                                                                                                                                                                                                                                 
## GO:0005789                                                                                                                                                                                                                                                                                               FADS2/CDK1/CHODL/ITPR1/HLA-DQA1/CYP4F8/CYP4B1/FMO5
## GO:0019867                                                                                                                                                                                                                                                                                                                                  BCL2A1/MAOB/PGR
## GO:0031090                                                                                                                                                                                                                                                MARCO/BCL2A1/LAMP3/DUSP2/SLC2A6/DTL/NUDT1/MAOB/ITPR1/GASK1B/HLA-DQA1/LRP2/LAMP5/STEAP4/PGR/CYBRD1
## GO:0034357

The input parameters of gene is a vector of gene IDs (can be any ID type that supported by corresponding OrgDb).

If readable is setting to TRUE, the input gene IDs will be converted to gene symbols.

5.3 GO over-representation test

Over-representation test(Boyle et al. 2004) were implemented in clusterProfiler. For calculation details and explanation of paramters, please refer to the vignette of DOSE.

ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
        readable      = TRUE)
head(ego)
##                    ID
## GO:0005819 GO:0005819
## GO:0005876 GO:0005876
## GO:0000779 GO:0000779
## GO:0072686 GO:0072686
## GO:0000775 GO:0000775
## GO:0000776 GO:0000776
##                                         Description
## GO:0005819                                  spindle
## GO:0005876                      spindle microtubule
## GO:0000779 condensed chromosome, centromeric region
## GO:0072686                          mitotic spindle
## GO:0000775           chromosome, centromeric region
## GO:0000776                              kinetochore
##            GeneRatio   BgRatio       pvalue     p.adjust
## GO:0005819    25/200 272/11816 4.695505e-12 1.493171e-09
## GO:0005876    12/200  48/11816 1.623758e-11 2.581776e-09
## GO:0000779    15/200  91/11816 2.808998e-11 2.628919e-09
## GO:0072686    15/200  92/11816 3.306816e-11 2.628919e-09
## GO:0000775    18/200 154/11816 1.061302e-10 6.749884e-09
## GO:0000776    15/200 107/11816 3.047081e-10 1.614953e-08
##                  qvalue
## GO:0005819 1.378996e-09
## GO:0005876 2.384361e-09
## GO:0000779 2.427899e-09
## GO:0072686 2.427899e-09
## GO:0000775 6.233756e-09
## GO:0000776 1.491466e-08
##                                                                                                                                                         geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005876                                                                             CENPE/SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0000779                                                            CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/AURKA/CCNB1
## GO:0072686                                                           KIF23/CENPE/ASPM/SKA1/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/AURKB/KIFC1/KIF18B/AURKA
## GO:0000775                                           CDCA8/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1
## GO:0000776                                                             CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1
##            Count
## GO:0005819    25
## GO:0005876    12
## GO:0000779    15
## GO:0072686    15
## GO:0000775    18
## GO:0000776    15

As I mentioned before, any gene ID type that supported in OrgDb can be directly used in GO analyses. User need to specify the keyType parameter to specify the input gene ID type.

ego2 <- enrichGO(gene         = gene.df$ENSEMBL,
                OrgDb         = org.Hs.eg.db,
                keyType       = 'ENSEMBL',
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)

Gene ID can be mapped to gene Symbol by using paramter readable=TRUE or setReadable function.

ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)

5.3.1 drop specific GO terms or level

enrichGO test the whole GO corpus and enriched result may contains very general terms. With dropGO function, user can remove specific GO terms or GO level from results obtained from both enrichGO and compareCluster.

5.3.2 test GO at sepcific level

enrichGO doesn’t contain parameter to restrict the test at specific GO level. Instead, we provide a function gofilter to restrict the result at specific GO level. It works with results obtained from both enrichGO and compareCluster.

5.3.3 reduce redundancy of enriched GO terms

GO is organized in parent-child structure, thus a parent term can be overlap with a large proportion with all its child terms. This can result in redundant findings. To solve this issue, clusterProfiler implement simplify method to reduce redundant GO terms from the outputs of enrichGO and gseGO. The function internally called GOSemSim (Yu et al. 2010) to calculate semantic similarity among GO terms and remove those highly similar terms by keeping one representative term. An example can be found in the blog post.

5.4 GO Gene Set Enrichment Analysis

A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)(Subramanian et al. 2005) directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

For algorithm details, please refer to the vignette of DOSE.

ego3 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              nPerm        = 1000,
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)

GSEA use permutation test, user can set nPerm for number of permutations. Only gene Set size in [minGSSize, maxGSSize] will be tested.

If you have issues in preparing your own geneList, please refer to the wiki page.

5.5 GO Semantic Similarity Analysis

GO semantic similarity can be calculated by GOSemSim(Yu et al. 2010). We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.

5.5.1 GO analysis for non-model organisms

Both enrichGO and gseGO functions require an OrgDb object as background annotation. For organisms that don’t have OrgDb provided by Bioconductor, users can query one (if available) online via AnnotationHub. If there is no OrgDb available, users can obtain GO annotation from other sources, e.g. from biomaRt or Blast2GO. Then using enricher or GSEA function to analyze, similar to the examples using wikiPathways and MSigDB. Another solution is to create OrgDb by your own using AnnotationForge package.

References

Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. 2004. “GO::TermFinder–open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes.” Bioinformatics (Oxford, England) 20 (18): 3710–5. https://doi.org/10.1093/bioinformatics/bth456.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An R Package for Measuring Semantic Similarity Among Go Terms and Gene Products.” Bioinformatics 26 (7): 976–78. https://doi.org/10.1093/bioinformatics/btq064.