Molecular Evolution: A Statistical Approach (English) Paperback – May 15, 2014
I think Molecular Evolution: A Statistical Approach would also work very well as a text for a graduate level course in statistical phylogenetics ... The exercises at the end of each chapter would be useful for academics wanting to use the book as a course text; the questions cover an interesting range of problems that would get the class both thinking and programming. (Barbara R. Holland, Systematic Biology)
Ziheng Yang is currently RA Fisher Professor of Statistical Genetics at University College London. He obtained a Ph.D. in agronomy from Beijing Agricultural University in 1992. He then held a few postdoctoral researcher positions in the UK and US. He joined UCL in 1997, first as a lecturer, then reader and professor. He teaches statistical genetics. He has published about 150 research papers and book chapters in molecular evolution, phylogenetics, population genetics, and computational biology. His program package paml is widely used in the molecular evolution community. He was elected a Fellow of the Royal Society in 2006.
Sequence analysis today is based on stochastic models of sequence evolution, and many biological researchers have had this as a central part of their work for years. These methods have assumptions and pitfalls, and if you don't know them, you are bound to make mistakes and misinterpretations that could have been avoided. So spending 60-100 hours with this book would be a good investment of time. I probably only spent 40 hours on it, since I am busy and had the idea that I probably know most of it already - not pure arrogance, since I read the 1st edition with some care.
The book has 12 chapters.
Chapter 1 - Models of Nucleotide Substitution - goes through the basic models [rate matrices and transition probability functions] and how to estimate parameters and sequence distances under these models. Models with variable substitution rates across sites are also presented.
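For concreteness, the two basic calculations of this chapter - transition probabilities and distance estimation - can be sketched under the simplest model, Jukes-Cantor (JC69). This is my own minimal Python sketch, not code from the book:

```python
import math

def jc69_transition(t, mu=1.0):
    """JC69 transition probabilities after time t, with total substitution
    rate mu per site. Returns (p_same, p_diff): the probability that a site
    is unchanged, and that it has changed to one specific other base."""
    e = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

def jc69_distance(p_hat):
    """JC69 distance estimate from the observed proportion p_hat of
    differing sites between two aligned sequences."""
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)
```

The distance formula is just the transition probability inverted: observing 30% differing sites gives an estimated distance of about 0.38 substitutions per site, the correction accounting for multiple hits at the same site.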
Chapter 2 - Models of amino acid and codon substitutions - these models are much more complex than nucleotide models, since their state space is either 20 amino acids or 61 codons, which would allow 190 or 1830 parameters in the corresponding rate matrices. Largely empirical approaches are typically taken to fix these parameters. The distinction between amino acid conserving [silent] and amino acid changing [replacement] substitutions is discussed - one of the most practical and important distinctions in molecular evolution, since it is used to measure selection strength and functionality.
Chapter 3 - Phylogeny Reconstruction: Overview - introduces basic tree concepts, tree counting, tree comparison, consensus trees and the reconstruction methods based on distance and parsimony. The chapter discusses searches in tree space and the distinction between gene and species trees.
Chapter 4 - Maximum Likelihood Methods - is a natural extension of chapter 3 and shows how to calculate the probability of a set of states observed at the leaves of a tree. Topics such as time reversibility, placing the root of an unrooted tree and the molecular clock are discussed. Issues such as alignment and missing data are covered, and a series of more advanced substitution models - varying rates, non-homogeneous, non-stationary and covarion models - are presented. Finally, searching tree space, ancestral reconstruction, and testing non-nested models against each other are covered.
For a long time, phylogeny inference was viewed by biologists as a very special statistical problem, due to the combination of a discrete component [topologies] with the estimation of continuous parameters. This is the source of many of the technical problems in the field, but such problems abound in statistics.
Chapter 5 - Comparison of phylogenetic methods and tests on trees - discusses concepts relating to the quality of a method, such as consistency [does the estimator converge to the true answer as data accumulate], identifiability [will different models give different distributions on data], robustness [if the model is only close to correct, do the conclusions still hold] and efficiency [does the estimator have the smallest possible variance]. A series of specific issues relating to parsimony and likelihood methods is discussed. Parsimony has been maligned but is still widely used; it is computationally fast and can perform well under certain conditions. Likelihood (and Bayesian) methods have a better statistical foundation but are slower, and formulating optimal tests is a challenge. Bootstrap methods and their properties are discussed.
Chapter 6 - Bayesian Theory - has a different conceptual foundation than likelihood methods. In likelihood methods, the underlying parameter and model space has no probability associated with it, while the data have a probability/density depending on the parameters/models. In Bayesian analysis, the parameter and model space has a distribution associated with it - the prior. After you have observed data, the distribution on parameters/models changes - the posterior. This setup naturally leads to a set of technical issues concerning priors and posteriors, and conjugate, non-informative and uniform priors are discussed. There is also a section on integration.
This chapter is almost molecule-free.
Chapter 7 - Bayesian computation (MCMC - Markov chain Monte Carlo) - discusses a long series of technical issues related to stochastic integration. Tremendous strides have been made in this area over the last decades, and increasingly fast computers have allowed the use of increasingly complex models. Such techniques are essential for all but the very simplest models. [We had a student study group in Oxford on this topic 10 years ago based on Jun Liu's (2001) Monte Carlo Strategies in Scientific Computing. Although there were many errors and redundant notation in its 1st edition, it was very rewarding to read.]
Despite the many technical issues, the underlying problems and techniques are simple. The basic problem is integration - asking what the area/volume/... under a function in a high-dimensional space is, or comparing such measures. Three key advances were made in 1953, 1970 and 1995: Metropolis proposed the simplest chains, in which the forward and backward jumping probabilities must be the same; Hastings relaxed that constraint; and Green allowed jumps between spaces of different dimension.
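As a toy illustration of the 1953 idea: a bare-bones Metropolis sampler with a symmetric proposal (so the Hastings correction cancels), targeting an unnormalized log density. My own sketch, not from the book:

```python
import math
import random

def metropolis(log_target, x0, step, n, seed=0):
    """Metropolis sampler with a symmetric uniform proposal.
    log_target: unnormalized log density. Returns the list of samples."""
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    out = []
    for _ in range(n):
        y = x + rng.uniform(-step, step)      # symmetric jump
        ly = log_target(y)
        if math.log(rng.random()) < ly - lp:  # accept with prob min(1, pi(y)/pi(x))
            x, lp = y, ly
        out.append(x)
    return out
```

Run against a standard normal log density (log_target = lambda x: -0.5 * x * x), the sample mean and variance converge to 0 and 1. Hastings' 1970 generalization simply adds the ratio of proposal densities to the acceptance probability, allowing asymmetric proposals.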
The moment problems like "How do you avoid being trapped in one of several islands of high probability?", "How do you terminate chains of low probability?" and "How do you determine that the chain has converged?" enter the picture, techniques to address them - parallel chains, clever chain design and convergence criteria - are developed.
Again, an almost molecule-free chapter.
Chapter 8 - Bayesian phylogenetics - uses what has been presented in the two previous chapters. Central to phylogenetics is the dichotomy between the discrete component (the tree topology) and the continuous parameters of the evolutionary process. This chapter discusses priors on both, which stochastic jumps between topologies and continuous parameter values one can use, and gives a series of specific examples.
Chapter 9 - Coalescent theory and species trees - this topic is covered in at least two 250+ page textbooks, so this chapter can only be a summary. However, it is a good summary, with a final emphasis on the problems arising from the fact that sequence phylogenies can differ from species phylogenies.
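The core of the standard coalescent is simple enough to state in a few lines: with k lineages present, the waiting time to the next coalescence is exponential with rate k(k-1)/2, time being measured in units of N generations. A minimal sketch of my own:

```python
import random

def coalescent_times(n, seed=1):
    """Sample the waiting times between successive coalescent events for a
    sample of n lineages; time is in units of N generations. The sum of the
    returned times is the time to the most recent common ancestor (TMRCA)."""
    rng = random.Random(seed)
    times = []
    for k in range(n, 1, -1):                          # k lineages -> k - 1
        times.append(rng.expovariate(k * (k - 1) / 2.0))
    return times
```

The expected TMRCA is 2(1 - 1/n): it is 1 for two samples and approaches 2 for large n, so most of the waiting happens while only two lineages remain.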
Chapter 10 - Molecular clock and estimation of species divergence times. Trees inferred only from data taken in the present often have no root, and the rates of change on different branches might be very different. Adding a root suddenly gives all events a time direction, and adding constant rates gives the possibility of dating all divergences in the tree. The key methods for this, and advanced extensions such as the concept of a local clock, are discussed.
Chapter 11 - Neutral and adaptive protein evolution. Distinguishing selected variants and measuring the strength of selection is crucial, since selected variants relate to the function of the organism. This chapter discusses tests based on population data [all selection takes place in a population] and models analyzing only single sequences from different species. One of the most important and useful distinctions in sequence analysis is between silent and replacement substitutions in protein-coding DNA sequences, and many models have been created to analyze their respective rates and possible branch dependencies. Despite their ubiquity and usefulness, these models make biologically counterintuitive assumptions, such as their lack of an explicit formulation of selection: all they care about is whether an event changes the protein or not, and no protein is better than another. It seems to me that fitness landscape models are underexplored, probably because they typically necessitate a large number of parameters and create a population size dependency in substitution rates.
Chapter 12 - Simulating molecular evolution - describes how to simulate the evolution of sequences based on continuous-time Markov models.
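As a flavor of what such a simulator does, here is a minimal sketch (mine, not the book's) that evolves a sequence along one branch under JC69 by sampling each site from the model's transition distribution:

```python
import math
import random

def evolve_jc69(seq, t, seed=0):
    """Evolve a nucleotide sequence for time t under JC69 (one expected
    substitution per site per unit time), sampling each site independently."""
    rng = random.Random(seed)
    # probability that a site differs from its ancestor after time t
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

Simulating along a whole tree is just applying this recursively from the root down each child branch. For long times the observed difference saturates at 75% of sites, which is exactly why raw difference counts underestimate evolutionary distance.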
Despite its 450 pages, there are many topics that aren't discussed, and I feel covering them would have been a better investment of the text than the more than 100 pages on purely statistical methodology that can be found elsewhere anyway. To mention some of the missing topics:
1. Comparative Annotation is a huge field - possibly the most successful application of molecular evolution - and is based on the fact that the evolution of a position or region depends on its function or some feature [annotation] that cannot be observed directly, the most obvious example being whether it is part of a protein-coding gene or not. This is one of the most important functional uses of molecular evolution, and it is totally absent.
2. Neighbor Dependence in sequence evolution is an important real phenomenon that leads to interesting modeling problems. The classic example is CpG avoidance, but there are others, such as overlapping codons and overlapping sets of interacting sites in a protein.
3. Recombination is central in the analysis of population genomes, and without recombination there would be no concept of genetic mapping. So its omission from chapter 9 is a major one.
4. Models for structures more complicated than sequences. Sequence data is a golden kind of data: it is easy to get, easy to represent, its error level can be made very low, and the total amount of it is very large. However, it is also challenging in the sense that translation into functional interpretation is very difficult. There is a hierarchy of data types of great and increasing interest: networks, structures (for instance protein structures), forms and, more generally, phenotypes would be the main classes. A lot of the experience from sequence analysis carries over to these data types, since their evolution is also described by Markov models.
5. Problems posed by next generation sequencing (NGS) are important in present-day data analysis. NGS produces huge amounts of data, but it is error-prone and incomplete. Coping with this is important, but it is not treated in this book.
6. Stochastic models of insertions and deletions [statistical alignment] are one of the remaining frontiers, and again it is surprising that this is not discussed more by ZY. Using statistical models for this has, over the last decade, been shown to improve inference. And there is much to do in this area that would parallel the evolution of substitution models: how do you introduce heterogeneity among sites and combine it with annotation? How do you make models with longer insertions and deletions? All unsolved problems that must receive a lot of attention in the coming decade.
7. Events beyond substitution/mutation and insertions, such as duplication, transposition, inversion and more are not treated at all, which again is surprising since they are central to any comparative genomics analysis.
8. End-Point Conditioned Sampling is very useful and getting increasing attention. A Markov chain asserts that the next step depends only on the present state, which is realistic and computationally convenient. But what if you know where you end up? This is exactly the situation in molecular evolution, and much progress has been made in this field in the last decade.
The book has a very strong focus on the technical details of inference of substitution processes and phylogenies. Using the 150 pages devoted to purely statistical and computational techniques on the exciting discoveries of molecular evolution instead would have been worthwhile. It is, after all, these discoveries that drive the need for the underlying models.
The present book is clearly a second edition of "Computational Molecular Evolution" from 2006, although OUP has chosen to brand it as a new book. The main expansions are in the statistical methodology chapters.
In general, concepts and algorithms are explained well - especially where Yang himself has contributed (which is a lot) - but in the case of the parsimony algorithm (Fitch-Hartigan-Sankoff), Yang spends several pages explaining unclearly what could be explained simply in 1-2 paragraphs, and its great similarity to the likelihood algorithm (Felsenstein) is not made clear either.
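To back up that complaint: the core of the Fitch algorithm really does fit in a paragraph of code. A sketch of my own, for a rooted binary tree given as nested tuples with single-character leaf states:

```python
def fitch(tree):
    """Fitch small parsimony: returns (state_set, cost) for a rooted binary
    tree given as nested 2-tuples with one-character strings at the leaves."""
    if isinstance(tree, str):                  # leaf: its own state, cost 0
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:                                 # intersection: no new change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1  # union: one change
```

For example, fitch((("A", "C"), ("A", "A"))) returns ({"A"}, 1): one substitution suffices. The similarity to Felsenstein's pruning algorithm is exactly that both compute a quantity at each node from its two children in a single post-order pass - state sets and counts here, partial likelihood vectors there.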
What lies ahead for this field is a hard and interesting question. Personally, I feel that on the methodological side:
1) Models of insertion-deletions are still in their infancy.
2) Linking rates to exact biochemical events is still an open problem.
3) Models measuring selection are fundamentally selection-free, since they just measure accelerations and decelerations of certain events and are not based on statements of fitness. There are exceptions, but much remains to be done using fitness landscapes. It is necessarily complicated, and automatically introduces population size into the models and a potentially huge parameter space of possible fitnesses.
4) Models with correlations beyond neighbor correlations will be in increasing demand and have in the last 2-3 years been a major contributor to improvements in predicting protein structure.
5) Genomic Annotation is one of the great success stories of molecular evolution, and there is lots to do. Multiple levels of annotation and finding regulatory signals by comparative methods are two obvious examples.
The book succeeds in presenting the present field of substitution models and phylogenetics and its technical background.
The book has omissions, and it does a less good job of conveying what the field has accomplished and the feeling that phylogenetics is still an exciting field with open questions. To the latter, Yang would probably respond that this isn't the purpose of the book. The field of MOLECULAR EVOLUTION needs an analogue of Jobling, Hurles and Tyler-Smith (2003, 2013) Human Evolutionary Genetics: Origins, Peoples and Disease. I have only read the 2003 edition, which, although theoretically basic, covered its field brilliantly.
Despite the blunt criticism above, this book must be highly recommended to any student of comparative genomics, bioinformatics, evolution and phylogenetics. I ended up getting 6 copies that I gave to students and myself. And despite the shortcomings it was highly rewarding for us to read.
The book must be among the 5-10 most important books in computational/statistical/mathematical evolution.
Other central books are:
* Joseph Felsenstein (2003) "Inferring Phylogenies" - is excellent at explaining the concepts and history of the field.
* Durbin, Eddy, Krogh and Mitchison (1998) "Biological Sequence Analysis" - is tremendously popular, which is great since it explains concepts - especially HMMs and SCFGs - extremely well.
* Steel and Semple (2003) "Phylogenetics" - is a hard read for a biologist and is now a bit dated, but it did the field a major service in bringing the topic to mathematicians and especially combinatorialists. It lacks key topics such as recombination and phylogenetic alignment, and I wish it had more specific numeric examples; the authors kept the treatment quite abstract.
* Warren Ewens (2004) "Mathematical Population Genetics" - is an excellent digest of population genetics and of the topics covered in chapter 9 of Yang's book. It has many competitors, but in my view it is the best.
Most mornings for 3-4 weeks, Michael Golden, Patrick Gemmell, Søren Riis, Luke Kelly and I met and discussed this book page by page. I actually find this way of learning useful and I might try to make it a part of each day. One must continue to learn, but not let it fill more than 20% of the time, since then it reduces how much one ends up doing. And I think you are paid to do and not to learn.
Right now I feel like continuing with Weinberg (2013) "The Biology of Cancer", Berendsen (2007) "Simulating the Physical World: Hierarchical Modeling from Quantum Mechanics to Fluid Dynamics", Hamelryck et al. (eds., 2013) "Bayesian Methods in Structural Bioinformatics" and Daniel Gusfield (2014) "ReCombinatorics". If somebody wants to join, please email me - likewise if somebody has an excellent suggestion for a book.
When I read these books, I do it under the constraint that I use 1 hour to read 20 pages and 1 hour to discuss 20 pages. That is 2 hours a day and 10 hours a week, which is completely reasonable - although it is quite demanding.
I must have read 10-20 books this way with students and teachers, often outliers relative to my field: Paul Davies (1990) "Quantum Mechanics", Smale and Hirsch (1974) "Differential Equations, Dynamical Systems, and Linear Algebra". I started to read Comrie (1990 ed.) "The World's Major Languages" with a student, but I think he moved, and I read the first 300 pages (of 1000) on my own. The book was very interesting, but I am bad at the phonetic alphabet, so I lost a lot of the finer points when the authors discussed sound shifts between different languages, and eventually I stopped reading.
I first placed this review on my wall on Facebook, but somebody kindly pointed out it would reach a more relevant audience if placed as a review on Amazon. Despite my criticism, I feel the book deserves 5 stars.