Warren Richard Gish | |
---|---|
Nationality | American |
Alma mater | University of California, Berkeley |
Known for | BLAST |
Scientific career | |
Fields | Bioinformatics |
Institutions | National Center for Biotechnology Information; Washington University in St. Louis; Advanced Biocomputing LLC; University of California, Berkeley |
Thesis | I. SV40 mutants isolated from transformed human cells. II. Methods for sequence analysis (1988) |
Doctoral advisor | Michael Botchan[1] |
Warren Richard Gish is the owner of Advanced Biocomputing LLC. He joined Washington University in St. Louis as a junior faculty member in 1994, and was a Research Associate Professor of Genetics from 2002 to 2007.[2][3]
Education
After initially studying physics, Gish obtained an A.B. degree in Biochemistry from the University of California, Berkeley, and completed work for his Ph.D. degree in Molecular Biology at the same institution in 1988.[1]
Research
Gish is primarily known for his contributions to NCBI BLAST,[4][5] his creation of the BLAST Network Service and nr (non-redundant) databases, his 1996 release of the original gapped BLAST (WU-BLAST 2.0), and most recently his development and support of AB-BLAST. At Washington University in St. Louis, Gish also led the genome analysis group which annotated all finished human, mouse and rat genome data produced by the University's Genome Sequencing Center from 1995 through 2002.
As a graduate student, Gish applied the Quine-McCluskey algorithm to the analysis of splice site recognition sequences. In 1985, with a view toward rapid identification of restriction enzyme recognition sites in DNA, Gish developed a DFA function library in the C language. The idea to apply a finite-state machine to this problem had been suggested by fellow graduate student and BSD UNIX developer Mike Karels. Gish's DFA implementation was that of a Mealy machine architecture, which is more compact than an equivalent Moore machine and hence faster. Construction of the DFA was O(n), where n is the sum of the lengths of the query sequences. The DFA could then be used to scan subject sequences in a single pass with no backtracking in O(m) time, where m is the total length of the subject(s). The method of DFA construction was recognized later as being a consolidation of two algorithms, Algorithms 3 and 4 described by Alfred V. Aho and Margaret J. Corasick.[6]
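The general technique can be illustrated with a minimal Python sketch of an Aho–Corasick-style automaton. It is not Gish's C library: the function names `build_dfa` and `scan` and the four-letter `ALPHABET` are assumptions made for the example, and outputs are attached to states for simplicity rather than to transitions as in a Mealy machine. Failure links and the full transition table are computed in a single breadth-first pass, in the spirit of the consolidation described above.

```python
from collections import deque

ALPHABET = "ACGT"  # assumed alphabet for this illustration


def build_dfa(patterns):
    """Build a full transition table for a set of patterns over ALPHABET."""
    goto = [{}]  # trie edges: goto[state][symbol] -> state
    out = [[]]   # out[state] -> patterns recognized on entering state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    fail = [0] * len(goto)            # failure links
    delta = [{} for _ in goto]        # deterministic transitions
    queue = deque()
    for ch in ALPHABET:
        nxt = goto[0].get(ch, 0)
        delta[0][ch] = nxt
        if nxt:
            queue.append(nxt)
    # Breadth-first pass: shallower states are always completed first,
    # so delta[fail[state]] is fully defined when state is processed.
    while queue:
        state = queue.popleft()
        out[state] = out[state] + out[fail[state]]
        for ch in ALPHABET:
            if ch in goto[state]:
                nxt = goto[state][ch]
                fail[nxt] = delta[fail[state]][ch]
                delta[state][ch] = nxt
                queue.append(nxt)
            else:
                delta[state][ch] = delta[fail[state]][ch]
    return delta, out


def scan(delta, out, text):
    """Scan text once, with no backtracking; return (start, pattern) hits."""
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        state = delta[state].get(ch, 0)  # symbols outside ALPHABET reset the scan
        for pat in out[state]:
            hits.append((pos - len(pat) + 1, pat))
    return hits


# Example: two restriction sites (EcoRI, BamHI) plus an overlapping short pattern.
delta, out = build_dfa(["GAATTC", "GGATCC", "AATT"])
hits = scan(delta, out, "TTGAATTCGGATCCA")
```

In this sketch, construction time is proportional to the total pattern length multiplied by the alphabet size, and each subject symbol costs one table lookup during the scan.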
While working for U.C. Berkeley in December 1986, Gish sped up the FASTP program[7] (later known as FASTA[8]) of William R. Pearson and David J. Lipman by 2- to 3-fold without altering the results. When the performance modifications were communicated to Pearson and Lipman, Gish further suggested that a DFA (rather than a lookup table) would yield faster k-tuple identification and improve the overall speed of the program by perhaps as much as 10% in some cases; however, the authors judged that such a marginal improvement, even in the best case, was not worth the added code complexity. At this time Gish also envisioned a centralized search service in which all nucleotide sequences from GenBank would be held in memory to eliminate I/O bottlenecks, stored in compressed form to conserve memory, and searched by clients invoking FASTN remotely via the Internet.
Gish's earliest contributions to BLAST were made while working at the NCBI, starting in July 1989. Even in early prototypes BLAST was typically much faster than FASTA. Gish recognized the potential added benefit in this application of using a DFA for word-hit recognition. He adapted his earlier DFA code into a flexible form that he incorporated into all BLAST search modes. His other contributions to BLAST include: the use of compressed nucleotide sequences, both as an efficient storage format and as a rapid, native search format; parallel processing; memory-mapped I/O; the use of sentinel bytes and sentinel words at the start and end of sequences to improve the speed of word-hit extension; the original implementations of BLASTX,[9] TBLASTN[4] and TBLASTX (unpublished); the transparent use of external (plug-in) programs such as seg, xnu, and dust to mask low-complexity regions in query sequences at run time; the NCBI BLAST E-mail Service with optional public key-encrypted communications; the NCBI Experimental BLAST Network Service; and the NCBI non-redundant (nr) protein and nucleotide sequence databases, typically updated on a daily basis with all data from GenBank, Swiss-Prot, and the PIR. Gish developed the first BLAST API, which was used in EST[10] annotation and Entrez data production, as well as in the NCBI BLAST version 1.4 application suite (Gish, unpublished). Gish was also the creator of and project manager for the earliest NCBI Dispatcher for distributed services (inspired by CORBA's Object Request Broker). First opened to outside users in December 1989, the NCBI Experimental BLAST Network Service, running the latest BLAST software on SMP hardware against the latest releases of the major sequence databases, quickly established the NCBI as a convenient, one-stop shop for sequence similarity searching.
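The idea of a compressed nucleotide format can be illustrated with a short Python sketch that packs four bases into each byte. The 2-bit code (A=0, C=1, G=2, T=3), the bit order, and the function names `pack` and `unpack` are assumptions chosen for the example; the sketch does not reproduce the on-disk format of any BLAST release, and it ignores ambiguity codes such as N.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code
BASES = "ACGT"


def pack(seq):
    """Pack an ACGT string into bytes, four bases per byte, first base in the high bits."""
    data = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        data[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(data)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )


packed = pack("GATTACA")           # 7 bases occupy 2 bytes
restored = unpack(packed, len("GATTACA"))
```

Because the sequence length is not recoverable from the packed bytes alone, it has to be stored separately, as the `length` argument here suggests.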
At Washington University in St. Louis, Gish revolutionized similarity searching by developing the first BLAST suite of programs to combine rapid gapped sequence alignment with statistical evaluation methods appropriate for gapped alignment scores. The resulting search programs were significantly more sensitive but only marginally slower than ungapped BLAST, due to novel application of the BLAST dropoff score X during gapped alignment extension. Sensitivity of gapped BLAST was further improved by the novel application of Karlin-Altschul Sum statistics[11] to the evaluation of multiple, gapped alignment scores in all BLAST search modes. Sum statistics were originally developed analytically for the evaluation of multiple, ungapped alignment scores. The empirical use of Sum statistics in the treatment of gapped alignment scores was validated in collaboration with Stephen Altschul from 1994 to 1995. In May 1996, WU-BLAST version 2.0 with gapped alignments was publicly released in the form of a drop-in upgrade for existing users of ungapped NCBI BLAST and WU-BLAST (both at version 1.4, after having forked in 1994). Gish's WU-BLAST development received little NIH funding: an average of 20% FTE, starting in November 1995 and ending shortly after the September 1997 release of the NCBI gapped BLAST (“blastall”). As an option in WU-BLAST, Gish implemented a two-hit BLAST algorithm that was faster, more memory-efficient and more sensitive than the one used by the NCBI software for many years. In 1999, Gish added support to WU-BLAST for the Extended Database Format (XDF), the first BLAST database format capable of accurately representing the entire draft sequence of the human genome in full-length chromosome sequence objects. This was also the first time any BLAST package introduced a new database format transparently to existing users, without abandoning support for prior formats, as a result of abstracting the database I/O functions away from the data analysis functions.
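The role of the dropoff score X is easiest to see in the simpler ungapped case, sketched below in Python. This is not the gapped extension used in WU-BLAST 2.0: the function name `extend_right`, the +5/-4 match/mismatch scores, and the parameter names are all assumptions made for the illustration. Extension stops as soon as the running score falls more than X below the best score seen so far, so unpromising extensions are abandoned early.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=5, mismatch=-4):
    """Ungapped rightward extension with an X-drop cutoff.

    Returns (best_score, best_length), where best_length is the number of
    aligned positions in the highest-scoring extension found.
    """
    best = score = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score has dropped more than X below the best: give up
    return best, best_len


# A small X abandons the extension at the first mismatch;
# a larger X tolerates the mismatch and recovers the longer alignment.
short = extend_right("AAAACAAAA", "AAAAGAAAA", 0, 0, x_drop=3)
longer = extend_right("AAAACAAAA", "AAAAGAAAA", 0, 0, x_drop=10)
```

The choice of X therefore trades speed against sensitivity: a small value prunes aggressively, while a larger value lets an extension cross low-scoring regions.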
WU-BLAST with XDF was the first BLAST suite to support indexed retrieval of NCBI standard FASTA-format sequence identifiers (including the entire range of NCBI identifiers); the first to allow retrieval of individual sequences in part or in whole, natively, translated or reverse-complemented; and the first able to dump the entire contents of a BLAST database back into human-readable FASTA format. In 2000, unique support for reporting of links (consistent sets of HSPs; also called chains in some later software packages) was added, along with the ability for users to limit the distance between HSPs allowed in the same set to a biologically relevant length (e.g., the length of the expected longest intron in the species of interest) and with the distance limitation entering into the calculation of E-values. Between 2001 and 2003, Gish improved the speed of the DFA code used in WU-BLAST. Gish also proposed multiplexing query sequences to speed up BLAST searches by an order of magnitude or more (MPBLAST); implemented segmented sequences with internal sentinel bytes, in part to aid multiplexing with MPBLAST and in part to aid analysis of segmented query sequences from shotgun sequencing assemblies; and directed use of WU-BLAST as a fast, flexible search engine for accurately identifying and masking genome sequences for repetitive elements and low-complexity sequences (the MaskerAid[12] package for RepeatMasker). With doctoral student Miao Zhang, Gish directed development of EXALIN,[13] which significantly improved the accuracy of spliced alignment predictions by a novel approach that combined information from donor and acceptor splice site models with information from sequence conservation. Although EXALIN performed full dynamic programming by default, it could optionally utilize the output from WU-BLAST to seed the dynamic programming and speed up the process by about 100-fold with little loss of sensitivity or accuracy.
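The notion of a link, a consistent set of HSPs subject to a distance limit, can be pictured with a small Python sketch. It is a generic collinear-chaining example and not WU-BLAST's procedure: the tuple layout, the function name `best_chain`, the use of summed raw scores, and the `max_gap` parameter (standing in for a limit such as the longest expected intron) are all assumptions, and the effect of the limit on E-values is not modeled.

```python
def best_chain(hsps, max_gap):
    """Return (total_score, chain) for the best-scoring consistent set of HSPs.

    Each HSP is (q_start, q_end, s_start, s_end, score) with half-open
    coordinates. Two HSPs are consistent when the second starts at or after
    the end of the first in both sequences and the gap between them on the
    subject does not exceed max_gap.
    """
    if not hsps:
        return 0, []
    order = sorted(range(len(hsps)), key=lambda k: hsps[k][0])
    best = {}  # HSP index -> (best chain score ending here, previous index)
    for pos, j in enumerate(order):
        qs, _, ss, _, sc = hsps[j]
        best[j] = (sc, None)
        for i in order[:pos]:
            _, prev_qe, _, prev_se, _ = hsps[i]
            if prev_qe <= qs and prev_se <= ss and ss - prev_se <= max_gap:
                candidate = best[i][0] + sc
                if candidate > best[j][0]:
                    best[j] = (candidate, i)
    end = max(best, key=lambda k: best[k][0])
    total = best[end][0]
    chain = []
    while end is not None:  # walk back through the predecessors
        chain.append(hsps[end])
        end = best[end][1]
    return total, chain[::-1]


# Three hypothetical exon-like HSPs; the third lies far downstream on the subject.
hsps = [
    (0, 50, 1000, 1050, 60),
    (60, 100, 1500, 1540, 45),
    (110, 150, 90000, 90040, 50),
]
tight = best_chain(hsps, max_gap=5000)
loose = best_chain(hsps, max_gap=100000)
```

With the tighter limit the distant third HSP is excluded from the set, which is the kind of biologically motivated restriction the distance option was meant to provide.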
In 2008, Gish founded Advanced Biocomputing, LLC, where he continues to improve and support the AB-BLAST package.
References
1. Gish, Warren Richard (1988). I. SV40 mutants isolated from transformed human cells. II. Methods for sequence analysis (PhD thesis). University of California, Berkeley. ProQuest 303669506.
2. Warren Gish publications indexed by Microsoft Academic
3. Warren Gish at DBLP Bibliography Server
4. Altschul, S.; Gish, W.; Miller, W.; Myers, E.; Lipman, D. (1990). "Basic Local Alignment Search Tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712. S2CID 14441902.
5. Sense from Sequences: Stephen F. Altschul on Bettering BLAST
6. Aho, Alfred V.; Corasick, Margaret J. (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM. 18 (6): 333–340. doi:10.1145/360825.360855. S2CID 207735784.
7. Lipman, D. J.; Pearson, W. R. (1985). "Rapid and sensitive protein similarity searches". Science. 227 (4693): 1435–1441. Bibcode:1985Sci...227.1435L. doi:10.1126/science.2983426. PMID 2983426.
8. Pearson, W. R.; Lipman, D. J. (1988). "Improved tools for biological sequence comparison". Proceedings of the National Academy of Sciences of the United States of America. 85 (8): 2444–2448. Bibcode:1988PNAS...85.2444P. doi:10.1073/pnas.85.8.2444. PMC 280013. PMID 3162770.
9. Gish, W.; States, D. J. (1993). "Identification of protein coding regions by database similarity search". Nature Genetics. 3 (3): 266–272. doi:10.1038/ng0393-266. PMID 8485583. S2CID 15295142.
10. Boguski, M. S.; Lowe, T. M.; Tolstoshev, C. M. (1993). "dbEST--database for 'expressed sequence tags'". Nature Genetics. 4 (4): 332–333. doi:10.1038/ng0893-332. PMID 8401577. S2CID 40138950.
11. Karlin, S.; Altschul, S. F. (1993). "Applications and statistics for multiple high-scoring segments in molecular sequences". Proceedings of the National Academy of Sciences of the United States of America. 90 (12): 5873–5877. Bibcode:1993PNAS...90.5873K. doi:10.1073/pnas.90.12.5873. PMC 46825. PMID 8390686.
12. Bedell, J. A.; Korf, I.; Gish, W. (2000). "MaskerAid: A performance enhancement to RepeatMasker". Bioinformatics. 16 (11): 1040–1041. doi:10.1093/bioinformatics/16.11.1040. PMID 11159316.
13. Zhang, M.; Gish, W. (2005). "Improved spliced alignment from an information theoretic approach". Bioinformatics. 22 (1): 13–20. doi:10.1093/bioinformatics/bti748. PMID 16267086.