The authors have declared that no competing interests exist.
In other to present a series of stochastic models from population dynamics capable of describing rudimentary aspects of genetic evolution, we studied two-allele Wright–Fisher and the Moran models for evolution of the relative frequencies of two alleles at a diploid locus under random genetic drift in a population of fixed size “simplest form, selection, and random mutation”. Principal results were presented in qualitative terms, illustrated by Monte Carlo simulations from R and http://www.radford.edu/~rsheehy/Gen_flash/popgen. Moran and the Wright-Fisher Models exhibited the same fixation probabilities, only that the Moran model runs twice as fast as the Wright-Fisher Model. A clue that can help us to understand this result is provided by the variance in reproductive success in the two models. Genetic changes due to drift were neither directional nor predictable in any deterministic way. Nonetheless, genetic drift led to evolutionary change in the absence of mutation (P=0.5), natural selection or gene flow. In general, alleles drift to fixation is significantly faster in smaller populations. Probability of fixation of an allele A was approximately equivalent to the initial frequency of that allele. With the inclusion of selection in our model, probability of fixation of a favoured allele due to natural selection increased with increase in fitness advantage and population size. The time taken to reach fixation is much slower, in case of no selective advantage, than a fixation under mutation with selective advantage.
This paper aim to presents a series of stochastic models from population dynamics capable of describing rudimentary aspects of genetic evolution
We consider a finite population of 2N genes (or alternatively –N diploid organisms) with each haploid possessing either allele A or allele a, which assumes random reproduction, and generations are not overlapping,
Let
S2N = {0,1,…,2N} ……(1)
Let the initial generations contain
P =
and the probability of choosing a non-allele A (failure) for each Bernoulli trial as:
P’ = 1 -
Where
Then the transition probabilities from xt to xt+1 is determined by sampling with replacement of 2N independent Bernoulli trials such that xt+1 = j is a binomial random variable from the genes of Generation t. For any integer i, j: X0, …, Xt – 1 in the state space, we have
P(xt+1 = j/Xt= i xt+1 = xt-1 = xt-1 ,…, = X0 = x0) = P(xt+1 = j/Xt= i) ……….(4)
This implies that given the present, the future is conditionally independent on the past. This expression which characterizes the Markov chain in general is the key to analyze the Wright-Fisher model, computed according to the binomial distribution as
P(xt+1 = j/Xt= i =Pij= (2N/j) (i/2N)j (1 –i/2N)2N-j
Pij=(2N/j) pjq2N-j ………(5)
We can use (5) to describe a “transition probability matrix” for the Wright-Fisher model, which gives the probability of going from any state i to any state j in one generation,
We represent the initial state of the system using a vector
ρ(0) = (ρ0(0) ρ1(0) ρ2(0) …)
ρ = (p(X(0) = 0) p(X(0) = 1) p(X(0) = 2)…) …..(6)
For example, if the population initially had two copies of the allele, then p(X(0) = 2)=1 and all other entries in this vector are zero.
And therefore the transition of the system is then given by the matrix equation;
Each column sums to one because a population that starts with i copies of the allele must have some number between 0 and N copies in the next generation
pij(n) = p(Xn=j/X0 = i …….(12)
By the Chapman Kolmogorov Equation
And by Extension, Markov Probability
The helpful part about writing (10) in matrix form is that it can be iterated using the rules of matrix multiplication
This model due to
Consider a Moran model for the evolution of a population of size N in which we track the number of individuals with a novel mutant allele (X) versus the number of individuals with the ancestral allele (N-X). Since the population size is a constant, this model has only one independent variable (X). Under the Moran model, evolution occurs when one individual is chosen to reproduce and, simultaneously, one individual is chosen to die.
let X be a random variable, representing the frequency of alleles of type A in the population, to replace individual X, we choose an individual at random from the population (including X itself) to be the parent of the new individual. Thus the model allows only one-step" transitions in which X either decreases or increases, but both transitions occur at the same rate, such that in population t + 1, the number of alleles A can be either (j = i - 1), (j = i + 1), or j = i.
The system can go from
Pi,i-1 =(2N-i/2N) .(i/2N)
= (1-p) …….(15)
Where p=i/2N
Similarly, if it is A that is chosen to die and a is chosen to reproduce, then the system can go from
Pi,i+1=(2N-i/2N) .(i/2N)
=p (1 - p) ……(16)
it takes either A to reproduce and die or a to reproduce and die, for the system to go from i to i, expressed as :
Pi,I= (2N-i/2N .(2N-i/2N)+ (i/2N .i/2N)
=p2+ (1-p)2 …….(17)
Where p = i/2N
And therefore the transition probability for the implied Markov chain for the Moran model is given by;
The probability of A to reach fixation is called the
Thus, for 0 < j < 2N,
The probability of extinction given that it started with i copies is;
And the probability of fixation given that it started with i copies is;
Note that with the martingale property (i.e. a random process without bias), the expectation at each time step is expected to be the same;
E(Xt)=E[E(Xt/Xt-1)]=E(Xt-1)=E(Xt-2)
= pA. 2N
= i/2N . 2N
= i …………..(22)
This shows that the expected allele frequency is constant,
Let,
u(i)= P (Xt= 2N/Xt = i) ……..(23)
Be the probability that A is eventually fixed in a population of size 2N that initially contains i copies of A.
then,
i = E[Xt/X0= i] = 2N.P (Xt=2N/X0=i) + 0.P (Xt= 0/X0= i)
i = E [Xt/X0 = i]=2N.P (Xt= 2N/X0 = i)
i = E [Xt/X0= i] = 2N.u(i)
∴i=2N.u(i)
u(i) = i/2N ……..(24)
In an identical manner, we can also express the probability that A eventually becomes lost in the population (extinction at 2N).
Let,
Then,
i = E [Xt/X0 =i] = 0.
P (Xt= 2N/X0 = i) + 2N. P (Xt= 0/X0 = i)
i = E [Xt/X0 = i] = 2N. P (Xt=0/X0 = i)
I = E[Xt/X0=i] = 2N. (1-u(i))
∴i=2N.(1-u(i)
i = 2N-2N. u(i)
∴u(i) = 1– i/2N ……(26)
A similarity between the Moran model and the WF-model is that both models have the same fixation probabilities. The only difference is that the Moran model runs twice as fast as the WF-model, a result we will show in the next chapter.
The Monte Carlo model
Population Size (To investigate the effect of population size and genetic drift)
The population size is allowed to change and to see a graphical display of the change in frequency of allele A overtime in generations, each different line is a separate locus (replicate). Fixation of allele A occurs when its frequency reaches 1.0, which implies extinction of allele a. Running unlinked loci simultaneously (collectively), each with initial gene frequencies of 0.5 is used to explore the effect of changing the population size by running the programme for different population sizes between 5 and 50. Each time we run the simulation, we;
Record the number of fixed loci for A and for a as well as the number of loci which remain polymorphic.
Approximate the number of generations until fixation and extinction for each population explored.
Simulations were repeated 20, 50 and 100 times for a total of 5 replicates for each population size.
To explore the number of generations it takes for one type to either fixes or go extinct, we will run unlinked loci simultaneously (collectively), each with an initial population size of 10, and simulations will be for 100 generations. To explore the fixation of an allele, by running this program for different population sizes, we will record the total number of loci fixed for allele A each time we run the simulation.
Markov Chain Monte Carlo (MCMC)
1. individual alleles A and a assumed to follow a binomial distribution with parameters 2N and P (where P is the initial proportion of each allele)
2. Values of N were varied as 5, 10, 20 and 50 to represent small and moderate sample of number of individuals in the population.
3. P is 0.5 if alleles are assumed to be contained in the same proportion at generation 0, and P=0.2 if alleles are assumed to be contained in the same proportion at generation 0
4. j will be generated as the number of copies of allele in the next generation.
For all experiments, iterations were made at 20, 50 and 100 times by the same series of random numbers.
The main qualitative difference, however, is the scale along the x axis. There are only 100 generations represented in
For our genetic model, we can also describe
Each rows sum to one because a population that starts with
The first and last rows are particularly simple because there is no mutation; if nobody is type
The matrix pij can be iterated using the rules of matrix multiplication. P2tells us the probability that there were
(The zeros in the middle of this matrix are not exactly zero, but they are less than 10_156).
The initial state of the system is represented using a vector, since our population initially had two copies of the allele, then
ρ(0) = (0 0 1 0 0 0)
Multiplying
The vector on the right indicated that there is approximately 50% chance that type
These results suggested that if we start with
Given that no clear conclusions emerge from a few replicate simulations, we must run many more replicate simulations to compare the Wright-Fisher and Moran models. Starting with
WRIGHT-FISHER MODEL | MORAN MODEL | |||||||
N=5 | N=10 | N=20 | N=50 | N=5 | N=10 | N=20 | N=50 | |
20 | 0.6(0.25) | 0.55(0.25) | 0.50(0.10) | 0.15(0.15) | 0.45(0.25) | 0.40(0.25) | 0.6(0.15) | 0.60(0.15) |
50 | 0.52(0.22) | 0.52(0.26) | 0.46(0.15) | 0.15(0.15) | 0.44(0.24) | 0.56(0.26) | 0.44(0.30) | 0.50(0.24) |
100 | 0.57(0.23) | 0.51(0.24) | 0.38(0.24) | 0.06(0.06) | 0.51(0.23) | 0.59(0.23) | 0.39(0.23) | 0.43(0.20) |
WRIGHT FISHER MODEL | MORAN MODEL | |
N=5 | 1.25 | 2.5 |
N=10 | 2.5 | 5.0 |
N=20 | 5.0 | 12.5 |
N=50 | 12.5 | 25.0 |
At different number of iterations and population sizes,
N=5 | N=10 | N=20 | N=50 | |||||
Average time until fixation | Average time until extinction | Average time until fixation | Averagetime until extinction | Average time until fixation | Average time until extinction | Average time until fixation | Averagetime until extinction | |
iterations | ||||||||
20 | 8.20(13.0) | 11.50 (11.1) | 24.80(14.22) | 21.11(46.25) | 47.70(40.33) | 42.75(56.38) | 64.5(140.92) | 74.00(136.50) |
50 | 13.92 (13.5) | 10.0(11.96) | 25.15(27) | 25.04(22.77) | 50.39(49.59) | 43.71(47.14) | 68.89(151.08) | 71.17(147.72) |
100 | 15.07(13.57) | 9.8(12.43) | 23.53(25.46) | 24.71(26.83) | 48.76(52.38) | 40.38(49.48) | 65.33(122.65) | 65.35(110.90) |
Average | 12.40(13.51) | 10.47(11.83) | 36.74(33.57) | 23.62(31.95) | 48.95(47.43) | 42.28(51.12) | 66.24(138.22) | 70.35(131.71) |
We can see from the table that the time to fixation or extinction of an allele is related to population size. The larger the population size, the longer it takes to achieve fixation, i.e. Probability of fixation is also influenced by population size for both Models
Our models allow the inclusion of other evolutionary forces: selection and mutation
So far, we have considered only neutral models of evolution, that is, those for which there is no preference for a particular allele. Despite being apparently a reasonable model for some aspects of genetic, ecological or linguistic behavior, geneticists in particular have been interested in the fate of alleles that are selected for or against
AA=1.0, Aa=0.75 and aa=0.50
Here, the allele A is more fit than the type a and began at a frequency of p (0) =0.2. The simulations were run for populations of sizes N = 5, 10, 20 and 50. In all cases, the alleles rose in frequency towards fixation within 100 generations. When the population size was small, the Wright-Fisher model exhibits more variability around the deterministic trajectory than when the population size was large. When N was only 5 and 10, we observed extinction of the beneficial allele in one of the five replicates, which made the probability of fixation to be 0.95 and 0.9 respectively, however, when population size became larger (N=20 and 50), none of the replicates went into extinction, and had the probability of fixation to be 1.0. Similar behavior occurred at a frequency of p (0) =0.5. Irrespective of the population size, there was not an extinction of the beneficial allele in any of the five replicates, which gave the probability of fixation to be 1.0. The probability of fixation of a favored allele due to natural selection increases with increased fitness advantage and with increased population size.
Also, in
Probability of fixation | Average time until fixation | |
N=5 | 1.0 (0.95) | 6.6(11.42) |
N=10 | 1.0 (0.90) | 14.0(13.94) |
N=20 | 1.0 (0.10) | 11.5(15.50) |
N=50 | 1.0 (0.10) | 15.35(19.00) |
These figures illustrated an important point: adding stochasticity to a model need not cause major changes to the results. In populations of small size (N = 50), we have seen allele frequency changed when there should have been none (the neutral case,
Over dominance = heterozygote most fit. Surprising things happen when the heterozygote is most fit.
In
Probability of fixation | Average time until fixation | |
N=5 | 0.95 (1.0) | 39.0 (25.60) |
N=10 | 0.6 (0.8) | 45.33 (38.25) |
N=20 | 0.0 (0.6) | - (44.33) |
N=50 | 0.0 (0.0) | - (-) |
At a population size of 5, all alleles experienced fixation (probability of fixation =1.0). At a population size of 20. However, not all alleles experienced fixation (probability of fixation = 0.6), but a population size of 50, no allele experienced fixation (probability of fixation =0). The larger the population size, the longer it takes to achieve fixation.
Example of heterozygote most fit is the case of the sickle-cell trait (in the presence of malaria)
A/A people die of malaria
S/S people die of sickle-cell anemia.
In conclusion,
Under dominance= heterozygote less fit
Here alleles are not contained in the same frequency in generation 0. At a population size of 5, probability of fixation = 0.25, at a population size of 20 however, probability of fixation reduced to 0.1, but a population size of 50, no allele experienced fixation (probability of fixation =0), all these indicated high probability of extinction (fixation of the allele a). Allele a will fix even though it does not maximizes population fitness. The population rolls to a small fitness peak, even though a larger one is possible. Population, which is fixed for allele a will, resists introduction of allele A. Here alleles are contained in the same frequency in generation 0, equilibrium is said to exit and the population is unstable. At a population size of 5, probability of fixation = 0.9, at a population size of 20, probability of fixation reduced to 0.9, but a population size of 50, all the alleles experienced fixation (probability of fixation =1.0), all this indicates high probability of extinction of the a allele.
When the heterozygote has the lowest fitness, the system is considered unstable, allele frequency will move until either A or a is fixed. Equilibrium occurs at 50% of each allele.
When P (A) is above equilibrium A will be fixed.
When P (A) is below equilibrium a will be fixed.
Example of the under dominance is the African butterfly pseudacraea eurytus, the orange and blue homozygotes each resemble a local toxic species, but the heterozygote resembles nothing in particular and is attractive to predators.
Although mutation is sometimes considered as the raw material of evolution, it is a very weak force in changing allele frequency. As shown in table 12 above, starting with an initial frequency of 0.2, it will take a hundred generations to change the frequency of the A allele to 0.1998 and a thousand generations to change the frequency to 0.198. Suppose alleles are contained in the same proportion with a frequency of 0.5, after a hundred generations, the frequency of the A allele will change 0.499 and a thousand generations to change the frequency to 0.495. But suppose only A alleles are contained in the population with a frequency of 1.0, after a hundred generations, the frequency of the A allele will change to 0.998 and a thousand generations to change the frequency to 0.990.
In
Probability of fixation | Average time until fixation | |
N=5 | 0.25 (0.9) | 10.8 (5.94) |
N=10 | 0.1 (0.95) | 9.5 (7.84) |
N=20 | 0.1 (0.9) | 17.0 (9.67) |
N=50 | - (1.0) | - (20.0) |
Numbers of generations | 100 | 200 | 500 | 1000 |
P=0.2 | 0.1998 | 0.1996 | 0.199 | 0.198 |
P=0.5 | 0.499 | 0.499 | 0.497 | 0.495 |
P=1.0 | 0.9990 | 0.998 | 0.995 | 0.990 |
Probability of fixation | Average time until fixation | |
N=5 | 0.15 (0.45) | 16 (7.67) |
N=10 | 0.2 (0.4) | 29.6 (17.7) |
N=20 | 0.25 (0.2) | 60.0 (43.7) |
N=50 | - (0.1) | - (64.75) |
At a population size of 5, the probability of fixation was 0.45 with an average time until fixation to be 7.67, when the population size became 10, the probability of fixation was 0.4 with an average time until fixation of 17.7, but when the population size became 20, the probability of fixation was 0.2 with an average time until fixation of 43.7. The larger the population size, the longer it takes to achieve fixation.
These results indicated that the probability of fixation with or without mutation rate was still approximately its initial frequency.
In
In
Probability of fixation | Average time until fixation | |
N=5 | 0.8 (1.0) | 9.38 (8.15) |
N=10 | 1.0 (1.0) | 13.65 (10.22) |
N=20 | 1.0 (1.0) | 15.95 (9.6) |
N=50 | 1.0 (1.0) | 19.50 (13.7) |
There is a behavioural difference between mutation with and without Selective Advantage. The time taken to reach fixation is much slower, in case of no selective advantage, than a fixation under mutation with selective advantage. To illustrate the difference between the two types of mutation, from the results obtained in
The procedure of stochastic modeling is not, of course, restricted to statistics. In particular, stochastic models have played a pivotal role in understanding the dynamics of evolutionary systems and as such one sees many similarities in the approaches and methods that have been used to those employed by Statistician. This paper has been a review of the ideas and formalism used to model stochastic processes in fields that statistical physicists are not typically acquainted with, specifically population genetics, ecology and linguistics. As a consequence, some parts of the discussion will seem familiar, other parts will not. We have tried, and we hope that we have succeeded, to explain the background ideas and motivation, since this will be the greatest obstacle to understanding among a readership of statistical physicists. On the other hand the degree of mathematical sophistication that has been assumed is greater than would be typical outside physics or mathematical biology.
In our discussions of the mathematical models, we have mostly used the language of population genetics, but the results obtained are more widely relevant. As the evolutionary paradigm becomes even more widely applied, there may be other areas in which analogies can be drawn. It is interesting to know how neutral processes turned out to have greater importance in all three areas we discussed. At the very least, neutral theories can be thought of as null models, against which data and other models can be compared. Most textbooks in population genetics begin their discussion of genetic drift with the Wright–Fisher model, although for physicists the use of non-overlapping generations and a ‘time’ measured in number of generations will not appear so natural. The Moran model, which has exactly the same limit when the number of genes becomes large, is far more familiar, resembling a birth/death process where a death is immediately followed by a birth. In addition, the continuous time limit may easily be taken, leading to a master equation of a kind well known in statistical physics.
In genetics, ecology or language, just as in physics, reality cannot be described by ideal models; there will be a multitude of ways in which real systems deviate from the ideal models created by scientists when they first enter a field. One of the methods that have been devised by population geneticists to deal with this will be very familiar to physicists. This is to characterize a non-ideal system by a few parameters, which will hopefully, if chosen correctly, capture the essence of the system. It may be that a simple model can then be utilized, but with these parameters built in. An example is the effective population size, N, discussed in the literature, which reflected how the non-ideal nature of the system changes the effective value of N: the effective size of the real population being the number of individuals in the ideal population which gave the same magnitude for the quantity of interest. The reason of our interest is not just the mathematical elegance of these models, but with the availability of massive amount of sequencing data. We actually can use these models (or advanced models incorporating variable population size, mutation effect etc., to solve and answer real questions in molecular biology. The stochastic evolution of a DNA segment that experiences recombination is a complex process; so many analyses are based on simulations. The aim of this paper is to give an account of useful analytical results in population genetics, together with their proofs.