ANNOVAR is an efficient software tool that uses up-to-date information to functionally
annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, as
well as mouse, worm, fly, yeast and many others). Given a list of variants with
chromosome, start position, end position, reference nucleotide and observed
nucleotides, ANNOVAR can perform: (i) Gene-based annotation: identify whether
SNPs or CNVs cause protein coding changes and the amino acids that are affected.
(ii) Region-based annotations: identify variants in specific genomic regions, for
example, conserved regions among 44 species, predicted transcription factor binding
sites, segmental duplication regions, GWAS hits, the Database of Genomic Variants, DNase
I hypersensitivity sites, ENCODE H3K4Me1/H3K4Me3/H3K27Ac/CTCF sites, ChIP-Seq peaks,
RNA-Seq peaks, or many other annotations on genomic intervals. (iii) Filter-based
annotation: identify variants that are reported in dbSNP, identify the subset
of common SNPs (MAF>1%) in the 1000 Genomes Project, identify the subset of non-synonymous
SNPs with SIFT score>0.05, find intergenic variants with GERP++ score>2, or many
other annotations on specific mutations.
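
As a rough illustration of the filter-based idea, the sketch below looks up each variant in a small in-memory frequency table and keeps those with MAF above 1%. It is not ANNOVAR code: the `Variant` type, the `maf_table` dictionary and the example values are hypothetical stand-ins for a real frequency database.

```python
from typing import Dict, List, NamedTuple


class Variant(NamedTuple):
    """The five input fields listed above (names are illustrative)."""
    chrom: str
    start: int
    end: int
    ref: str
    obs: str


def common_variants(variants: List[Variant],
                    maf_table: Dict[Variant, float],
                    threshold: float = 0.01) -> List[Variant]:
    """Keep variants whose minor allele frequency exceeds the threshold.

    Variants missing from the table are treated as unobserved and dropped.
    """
    return [v for v in variants if maf_table.get(v, 0.0) > threshold]


# Hypothetical toy data, not taken from any real database.
variants = [
    Variant("1", 1000, 1000, "A", "G"),
    Variant("1", 2000, 2000, "C", "T"),
    Variant("2", 500, 500, "G", "A"),
]
maf_table = {variants[0]: 0.25, variants[1]: 0.002}
common = common_variants(variants, maf_table)
```

Here only the first variant is kept: the second is below the threshold and the third is absent from the table.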
Collectively, the bedtools utilities are a Swiss Army knife of tools for
a wide range of genomics analysis tasks. The most widely used tools enable
genome arithmetic: that is, set theory on the genome. For example, bedtools
allows one to intersect, merge, count, complement, and shuffle genomic
intervals from multiple files in widely-used genomic file formats such
as BAM, BED, GFF/GTF, and VCF.
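
To make "set theory on the genome" concrete, here is a minimal sketch of two such operations, merge and intersect, on in-memory intervals. It is an illustration only, not how bedtools is implemented; the function names are hypothetical, and intervals are assumed to be half-open (chromosome, start, end) tuples as in BED.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping part of every pair (one from a, one from b).

    Quadratic in the input size; fine for a sketch, too slow for real data.
    """
    out: List[Interval] = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a == chrom_b:
                start, end = max(start_a, start_b), min(end_a, end_b)
                if start < end:
                    out.append((chrom_a, start, end))
    return sorted(out)
```

For example, merging `("chr1", 1, 5)` with `("chr1", 4, 10)` gives `("chr1", 1, 10)`, and intersecting `("chr1", 1, 10)` with `("chr1", 5, 20)` gives `("chr1", 5, 10)`.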
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short
DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp
reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep
its memory footprint small: typically about 2.2 GB for the human genome
(2.9 GB for paired-end).
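
The idea behind a Burrows-Wheeler index can be sketched in a few lines. The code below builds the transform naively from sorted rotations and counts exact matches of a pattern by backward search. It is a toy under stated assumptions: the function names are hypothetical, the construction is quadratic in memory, and Bowtie's actual index and its handling of mismatches are far more elaborate.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform using '$' as the end-of-text sentinel."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact occurrences of pattern by backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c
    counts = {}
    for ch in bwt_text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch: str, i: int) -> int:
        # occurrences of ch in bwt_text[:i]; a real index precomputes this
        return bwt_text[:i].count(ch)

    lo, hi = 0, len(bwt_text)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

With this sketch, `bwt("banana")` is `"annb$aa"`, and searching that string for `"ana"` reports two occurrences.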
Circos is a software package for visualizing data and information. It
visualizes data in a circular layout for exploring relationships between
objects or positions. Circos creates publication-quality infographics and
illustrations with a high data-to-ink ratio, layered data and symmetries.
Cluster 3.0 is an implementation of k-means clustering, hierarchical clustering and self-organizing
maps in a single multi-purpose open-source library of C routines, callable
from other C and C++ programs. This library is an improved version of
Michael Eisen's well-known Cluster program for Windows, Mac OS X and
Linux/Unix. Additionally a Python and a Perl interface to the C Clustering
Library is implemented to combine the flexibility of a scripting language
with the speed of C.
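
For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of the algorithm. It does not use the C Clustering Library or its Python interface; the function names are hypothetical, and the initial centers are passed in explicitly to keep the example deterministic.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Sequence[float], centers: Sequence[Point]) -> int:
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(point, centers[k])),
    )


def kmeans(points: Sequence[Point], centers: Sequence[Point],
           iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Alternate between assigning points and recomputing cluster means."""
    centers = [tuple(float(c) for c in center) for center in centers]
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centers)
        ]
    labels = [nearest(p, centers) for p in points]
    return centers, labels
```

A fixed iteration count is used instead of a convergence test to keep the sketch short.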
DAVID extracts biological features and meaning associated with large gene lists.
It can handle any type of gene list, no matter which genomic platform or software
package generated it. DAVID systematically maps a large number of interesting genes in a
list to the associated biological annotation (e.g., gene ontology terms), and then
statistically highlights the most overrepresented (enriched) biological annotation out
of thousands of linked terms and contents.
FANMOD is a tool for fast network motif detection. It relies on recently developed
algorithms to improve the efficiency of network motif detection by orders of magnitude.
This facilitates the detection of larger motifs in bigger networks than previously
possible. Additional benefits of FANMOD are the ability to analyze colored networks,
a graphical user interface and the ability to export results to a variety of machine-readable
and human-readable file formats, including comma-separated values and HTML.
F-seq is a software package that generates a continuous density estimation of sequence
tags mapped to a reference genome, which can be displayed using the UCSC Genome Browser.
The continuous density plots are more intuitive than discrete histogram-like plots used
by some applications. Using kernel density estimation, F-seq can aid the identification
of biologically meaningful sites.
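
The kernel density idea can be sketched as follows. This is not F-seq code: the Gaussian kernel, the `bandwidth` parameter and the function name are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def tag_density(tag_positions: Sequence[float],
                query_positions: Sequence[float],
                bandwidth: float) -> List[float]:
    """Gaussian kernel density estimate of tag positions at each query point."""
    norm = 1.0 / (len(tag_positions) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2)
                   for t in tag_positions)
        for x in query_positions
    ]
```

Each mapped tag contributes a smooth bump centered on its position, so the summed curve varies continuously along the genome instead of jumping at bin edges as a histogram does.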
GERP identifies constrained elements in multiple alignments by quantifying substitution deficits.
These deficits represent substitutions that would have occurred if the element were neutral DNA,
but did not occur because the element has been under functional constraint. These deficits,
or rejected substitutions, are a natural measure of constraint that reflects the strength of
past purifying selection on the element. GERP estimates constraint for each alignment column;
elements are identified as excess aggregations of constrained columns. A false-positive rate
(which is user-settable) is calculated using 'shuffled' alignments in which the order of columns is randomized.
GFS is a program that maps peptide mass fingerprint data directly to raw genomic sequence,
enabling rapid low-cost identification of proteins in genomes for which annotation is lacking.
An experimentally obtained peptide mass fingerprint is entered into the program, which then scans
a genome sequence of interest and outputs the most likely regions of the genome from which
the mass fingerprint is derived.
GOrilla is a web-based application that identifies enriched GO terms in ranked lists of genes,
without requiring the user to provide explicit target and background sets. It also employs a
flexible threshold statistical approach to discover GO terms that are significantly enriched
at the top of a ranked gene list. Building on a complete theoretical characterization of
the underlying distribution, GOrilla computes an exact p-value for the observed enrichment,
taking threshold multiple testing into account without the need for simulations. The output
of the enrichment analysis is visualized as a hierarchical structure, providing a clear view
of the relations between enriched GO terms.
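
The flexible-threshold idea can be illustrated with a minimum hypergeometric score: for every cutoff in the ranked list, compute the hypergeometric tail probability of seeing at least the observed number of annotated genes above the cutoff, and keep the smallest value. The sketch below computes only this score; the function names are hypothetical, and the correction for having tried every cutoff (GOrilla's exact p-value) is not implemented here.

```python
from math import comb
from typing import Sequence, Tuple


def hypergeom_tail(N: int, B: int, n: int, b: int) -> float:
    """P(X >= b) when n genes are drawn from N, of which B are annotated."""
    return sum(comb(B, i) * comb(N - B, n - i)
               for i in range(b, min(B, n) + 1)) / comb(N, n)


def min_hypergeom_score(ranked_labels: Sequence[int]) -> Tuple[float, int]:
    """Return (smallest tail probability, cutoff at which it occurs).

    ranked_labels[i] is 1 if the gene at rank i carries the GO term, else 0.
    """
    N, B = len(ranked_labels), sum(ranked_labels)
    best_score, best_cutoff, hits = 1.0, 0, 0
    for n, label in enumerate(ranked_labels, start=1):
        hits += label
        score = hypergeom_tail(N, B, n, hits)
        if score < best_score:
            best_score, best_cutoff = score, n
    return best_score, best_cutoff
```

The returned score is not itself a p-value, because the cutoff was chosen after looking at the data.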
GOstats is a set of tools implemented in R Bioconductor for interacting with GO and microarray data.
It provides a variety of basic manipulation tools for graphs, hypothesis testing including
hypergeometric tests, and visualization tools.
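
The hypergeometric test mentioned above can be written out directly. This is a Python sketch rather than GOstats (which is an R package); the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_pvalue(universe_size: int, annotated_in_universe: int,
                      list_size: int, annotated_in_list: int) -> float:
    """One-sided hypergeometric p-value for over-representation.

    Probability of drawing at least `annotated_in_list` annotated genes when
    `list_size` genes are sampled without replacement from the universe.
    """
    upper = min(annotated_in_universe, list_size)
    favourable = sum(
        comb(annotated_in_universe, i)
        * comb(universe_size - annotated_in_universe, list_size - i)
        for i in range(annotated_in_list, upper + 1)
    )
    return favourable / comb(universe_size, list_size)
```

For instance, if 5 of 10 genes in the universe carry a term and all 5 genes in the list carry it, the p-value is 1/252, about 0.004.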
GREAT assigns biological meaning to a set of non-coding genomic regions by analyzing the
annotations of the nearby genes. Thus, it is particularly useful in studying cis functions
of sets of non-coding genomic regions. Cis-regulatory regions can be identified by both
experimental methods (e.g., ChIP-seq) and computational methods (e.g., comparative genomics).
GSC (Genome Structure Correction)
Assessing the significance of observations within large-scale genomic studies using
randomly subsampled genomic regions is a difficult problem because there often exists a
complex dependency structure between observations. GSC is a data subsampling approach
based on a block stationary model for genomic features to alleviate the hidden dependencies.
This model is motivated by earlier studies of DNA sequences, which show that there are global
shifts in base composition, but that certain sequence characteristics are locally unchanging.
The hive plot is a visualization method for drawing networks. Nodes are mapped to and
positioned on radially distributed linear axes. Edges are drawn as curved links. Hive
plots can give a quantitative understanding of important aspects of a network's structure.
Hive plots can also manage the visual complexity arising from a large number of edges and
expose both trends and outlier patterns in a network structure.
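
The node layout described above reduces to a small coordinate calculation. The sketch below is an assumption-laden illustration: the function name, the evenly spaced axes and the radius parameters are hypothetical choices, and the rule for assigning a node to an axis and a position along it is left to the caller.

```python
import math
from typing import Tuple


def hive_position(axis_index: int, n_axes: int, fraction: float,
                  inner_radius: float = 1.0,
                  outer_radius: float = 10.0) -> Tuple[float, float]:
    """Cartesian coordinates of a node on one of n_axes radial axes.

    `fraction` in [0, 1] places the node between the inner and outer end
    of its axis, e.g. according to a normalized node property.
    """
    angle = 2.0 * math.pi * axis_index / n_axes
    radius = inner_radius + fraction * (outer_radius - inner_radius)
    return radius * math.cos(angle), radius * math.sin(angle)
```

Because each node's position depends only on its own properties and not on a force-directed simulation, the same network always yields the same picture.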
Java Treeview is an open source, cross-platform gene expression visualization tool
and an interactive display of clustered gene expression data, similar to Eisen's treeview.
It is also an extensible starting point for other gene expression visualization tools.
KING is a rapid algorithm for relationship inference using high-throughput genotype
data typical of GWAS that allows the presence of an unknown population substructure.
The relationship of any pair of individuals can be precisely inferred by robust
estimation of their kinship coefficient, independent of sample composition or population
structure (sample invariance). KING performs properly even under extreme population
stratification, while algorithms assuming a homogeneous population give systematically
biased results. KING performs relationship inference on millions of pairs of individuals
on the order of minutes.
The lumi package in R provides an integrated solution for Illumina microarray data
analysis. It includes functions for Illumina BeadStudio (GenomeStudio) data input, quality
control, BeadArray-specific variance stabilization, normalization and gene annotation at
the probe level. It also includes functions for processing Illumina methylation microarrays,
especially Illumina Infinium methylation microarrays.
mfinder is a software tool for network motif detection. Network motifs are defined as
basic interaction patterns that recur throughout biological networks, much more often
than in random networks. In order to detect network motifs mfinder implements two methods:
a full enumeration of subgraphs and a sampling of subgraphs for estimation of subgraph
concentrations. mfinder generates random networks based on the switching method,
the stubs method and the "Go with the winners" algorithm.
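
The switching method can be sketched as repeated edge swaps that preserve every node's in-degree and out-degree. The code below is an illustrative sketch for a directed network without self-loops or duplicate edges; the function name, the `n_swaps` parameter and the seeding are hypothetical and not taken from mfinder.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]  # directed edge (source, target)


def switch_randomize(edges: Sequence[Edge], n_swaps: int,
                     seed: int = 0) -> List[Edge]:
    """Randomize a directed network by degree-preserving edge swaps.

    Picks two edges a->b and c->d and rewires them to a->d and c->b,
    skipping swaps that would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b:
            continue  # would create a self-loop
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list
```

Motif counts in the real network can then be compared with counts in many such randomized networks to judge whether a pattern recurs more often than expected by chance.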
Peppy is software that integrates several critical tasks of proteogenomic searching and
mapping, such as full six-frame translation and digestion of a genome, peptide/spectrum
matching and quality assessment, and calculation of false discovery rates.
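
Six-frame translation, the first of these tasks, can be sketched as follows. This is not Peppy's code: it assumes the standard genetic code, upper-case DNA input, and hypothetical function names; codons containing characters other than A, C, G or T are rendered as "X".

```python
from typing import Dict

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate consecutive codons; a trailing partial codon is ignored."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> Dict[str, str]:
    """Translate all three forward and all three reverse-strand frames."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

For the sequence `ATGGCCTAA`, frame +1 reads `MA*` (where `*` marks a stop codon) and frame -1 reads `LGH`.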
RuleFit3 is a predictive learning method and interpretational tool. It is based on
general regression and classification models, which are constructed as linear combinations
of simple rules derived from the data. Each rule consists of a conjunction of a small number
of simple statements concerning the values of individual input variables.
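
The structure of such a model can be sketched as follows. Only the form of the prediction is shown: each rule is a conjunction of range conditions on individual variables, and the prediction is an intercept plus a weighted sum of the rules that fire. The function names, feature names and coefficient values are hypothetical, and the coefficients are supplied by hand here rather than learned from data.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Condition = Tuple[str, float, float]  # (variable, lower bound, upper bound)
Rule = Callable[[Dict[str, float]], float]


def make_rule(conditions: Sequence[Condition]) -> Rule:
    """Build a rule that returns 1.0 when every condition holds, else 0.0.

    Each condition is the simple statement lower <= x[variable] < upper.
    """
    def rule(x: Dict[str, float]) -> float:
        return 1.0 if all(lo <= x[name] < hi for name, lo, hi in conditions) else 0.0
    return rule


def predict(x: Dict[str, float], intercept: float,
            rules: List[Rule], coefficients: List[float]) -> float:
    """Linear combination of rule outputs."""
    return intercept + sum(c * r(x) for c, r in zip(coefficients, rules))


# Hypothetical example: two rules over made-up variables "age" and "dose".
inf = float("inf")
rules = [
    make_rule([("age", 40.0, inf), ("dose", 0.0, 5.0)]),
    make_rule([("dose", 5.0, inf)]),
]
coefficients = [2.0, -1.5]
```

Because each rule is a readable statement about a few variables, the size of its coefficient gives a direct indication of how much that rule matters to the prediction.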
WebGestalt is a "WEB-based GEne SeT AnaLysis Toolkit". It is designed for
functional genomic, proteomic and large-scale genetic studies from which a large number
of gene lists (e.g., differentially expressed gene sets, co-expressed gene sets, etc.)
are continuously generated. WebGestalt incorporates information from different public
resources and provides an easy way for biologists to make sense out of gene lists.
Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010 Sep;38(16):e164. PMID: 20601685; PMCID: PMC2938201
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. PMID: 20110278; PMCID: PMC2832824
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. PMID: 19261174; PMCID: PMC2690996
Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA. Circos: an information aesthetic for comparative genomics. Genome Res. 2009 Sep;19(9):1639-45. PMID: 19541911; PMCID: PMC2752132
de Hoon MJ, Imoto S, Nolan J, Miyano S. Open source clustering software. Bioinformatics. 2004 Jun 12;20(9):1453-4. PMID: 14871861
Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44-57. PMID: 19131956
Wernicke S, Rasche F. FANMOD: a tool for fast network motif detection. Bioinformatics. 2006 May 1;22(9):1152-3. PMID: 16455747
Boyle AP, Guinney J, Crawford GE, Furey TS. F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics. 2008 Nov 1;24(21):2537-8. PMID: 18784119; PMCID: PMC2732284
Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol. 2010 Dec 2;6(12):e1001025. PMID: 21152010; PMCID: PMC2996323
Giddings MC, Shah AA, Gesteland R, Moore B. Genome-based peptide fingerprint scanning. Proc Natl Acad Sci U S A. 2003 Jan 7;100(1):20-5. PMID: 12518051; PMCID: PMC140871
Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics. 2009 Feb 3;10:48. PMID: 19192299; PMCID: PMC2644678
Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics. 2007 Jan 15;23(2):257-8. PMID: 17098774
McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010 May;28(5):495-501. PMID: 20436461
Bickel PJ, Boley N, Brown JB, Huang H, Zhang NR. Subsampling methods for genomic inference. Annals of Applied Statistics. 2010;4(4):1660-1697
Krzywinski M, Birol I, Jones SJ, Marra MA. Hive plots--rational approach to visualizing networks. Brief Bioinform. 2012 Sep;13(5):627-44. PMID: 22155641
Saldanha AJ. Java Treeview--extensible visualization of microarray data. Bioinformatics. 2004 Nov 22;20(17):3246-8. PMID: 15180930
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010 Nov 15;26(22):2867-73. PMID: 20926424; PMCID: PMC3025716
Du P, Kibbe WA, Lin SM. lumi: a pipeline for processing Illumina microarray. Bioinformatics. 2008 Jul 1;24(13):1547-8. PMID: 18467348
Kashtan N, Itzkovitz S, Milo R, Alon U. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics. 2004 Jul 22;20(11):1746-58. PMID: 15001476
Zhang B, Kirov S, Snoddy J. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W741-8. PMID: 15980575; PMCID: PMC1160236