Biological evolution is the change over time in the genetic
composition of a population (Th Dobzhansky 1951). Our population
is made up of a set of interbreeding individuals, the genetic
composition of which is made up of the genomes that each individual
carries. The genetic composition of the population alters due to the
death of individuals or the migration of individuals in or out of the
population. If our individuals vary in the number of children they have,
this also alters the genetic composition of the population in the next
generation. Every new individual born into the population subtly changes
the genetic composition of the population. Their genome is a unique
combination of their parents’ genomes, having been shuffled by
segregation and recombination during meioses, and possibly changed by
mutation. These individual events seem minor at the level of the
population, but it is the accumulation of small changes in aggregate
across individuals and generations that is the stuff of evolution. It is
the compounding of these small changes over tens, hundreds, and millions
of generations that drives the amazing diversity of life that has
emerged on this earth.
Population genetics is the study of the genetic composition of natural
populations and its evolutionary causes and consequences. Quantitative
genetics is the study of the genetic basis of phenotypic variation and
how phenotypic changes evolve over time. The two fields are closely
aligned conceptually, as we’ll see throughout these notes. Both seek to
describe how the genetic and phenotypic composition of populations can
be changed over time by the forces of mutation, recombination,
selection, migration, and genetic drift. To understand how these forces
interact, it is helpful to develop simple theoretical models to help our
intuition. In these notes we will work through these models and
summarize the major areas of population- and quantitative-genetic
theory.
While the models we will develop will seem naïve, and indeed they
are, they are nonetheless incredibly useful and powerful. Throughout the
course we will see that these simple models often yield accurate
predictions, such that much of our understanding of the process of
evolution is built on these models. We will also see how these models
are incredibly useful for understanding real patterns we see in the
evolution of phenotypes and genomes, such that much of our analysis of
evolution, in a range of areas from human medical genetics to
conservation, is based on these models. Therefore, population and
quantitative genetics are key to understanding various applied
questions, from how medical genetics identifies the genes involved in
disease to how we preserve species from extinction.
Population genetics emerged from early efforts to reconcile Mendelian
genetics with Darwinian thought. Part of the power of population
genetics comes from the fact that the basic rules of transmission
genetics are simple and nearly universal. One of the truly remarkable
things about population genetics is that many of the important ideas and
mathematical models emerged before the 1940s, long before the
mechanistic basis of inheritance (DNA) was discovered, and yet the
usefulness of these models has not diminished. This is a testament to
the fact that the models are established on a very solid foundation,
building from the basic rules of genetic transmission combined with
simple mathematical and statistical models.
Much of this early work traces to the ideas of R.A. Fisher, Sewall Wright, and J.B.S. Haldane, who, along with many others, described the early principles and mathematical models underlying our understanding of the evolution of populations. Building on this conceptual fusion of genetics and evolution, there followed a flourishing of evolutionary thought, the modern evolutionary synthesis, combining these ideas with those from the study of speciation, biodiversity, and paleontology. In total, this work showed that both short-term evolutionary change and the long-term evolution of biodiversity could be well understood through the gradual accumulation of evolutionary change within and among populations. This evolutionary synthesis continues to this day, combining new insights from genomics, phylogenetics, ecology, and developmental biology.
Population and quantitative genetics are a necessary but not sufficient description of evolution; it is only by combining the insights of many fields that a rich and comprehensive picture of evolution emerges. We certainly do not need to know the genes underlying the displays of the birds of paradise to study how the divergence of these displays, due to sexual selection, may drive speciation. Indeed, as we’ll see in our discussion of quantitative genetics, we can predict how populations respond to selection, including sexual selection and assortative mating, without any knowledge of the loci involved. Nor do we need to know the precise selection pressures and the ordering of genetic changes to study the emergence of the tetrapod body plan. We do not necessarily need to know all the genetic details to appreciate the beauty of these, and many other, evolutionary case studies. However, every student of biology gains from understanding the basics of population and quantitative genetics, allowing them to base their studies on a solid bedrock of understanding of the processes that underpin all evolutionary change.
The history of genetics and evolutionary biology is intertwined with the history of eugenics and scientific racism. Francis Galton, one of the first people to systematically study human inheritance, coined the term “eugenics” in 1883 to describe the idea of “human improvement” through controlled breeding of humans (Galton 1883). Historically, eugenics is much more than just the idea that selection through breeding would work in humans; it is the idea that particular people are “genetically inferior” and therefore “unfit” to reproduce (D. B. Paul 2014). Eugenicists’ obsession with human worth and genetic inferiority also meant that they often held that people from some races and ethnicities are genetically superior to others. Thus, ideas about eugenics also built on older racist fields of science that sought to classify humans into a discrete racial hierarchy, while in parallel scientists in these fields were forcing ideas from genetics and evolution into an essentialist view of race. These deeply flawed hierarchies have frequently been used by the powerful to justify subjugating and disenfranchising minorities and Indigenous people.
Although eugenics is often correctly associated with the Nazi party and the Holocaust, eugenic ideas and eugenic policies were also widespread in the US and UK during the 1920s and 1930s and were sometimes aligned with progressive causes of that era (D. Paul 1984; Kevles 1995). Eugenic ideas were also implemented as policy, with horrific consequences, in a number of countries. Immigration policies based explicitly on eugenic arguments were put in place in the US from the 1920s until their repeal in the 1950s and 60s. These policies strongly favoured immigration from Northern Europe and were a deliberate action to restrict or bar immigration from Asia and eastern and southern Europe based on xenophobic, racist, and anti-Semitic views (Okrent 2020). During the 20th century, many US states passed eugenic sterilization laws (Reilly 2015) that in practice were often targeted against Black, Latino, and Indigenous people (Hansen and King 2013). For example, the state of California from 1919 to 1972 used eugenic ideas to justify the sterilization of 20,000 people who had been labelled unfit and mentally defective, a disproportionate number of whom were Latino (Stern et al. 2017; Novak et al. 2018).
Many early geneticists during this time were proponents of eugenics and many supported racist views in their genetics research. One notable example is R.A. Fisher, whom we’ll encounter throughout this book. Fisher is arguably the father of much of evolutionary genetics and modern statistics, having made huge contributions to the foundations of both fields. He pursued these fields in part because of his eugenic interests and concerns about the “genetic inferiority” of the lower classes (Norton 1983; Mazumdar 2005). For example, he devoted a number of the later chapters in his classic evolutionary genetics book to eugenics (Fisher 1930). He was hardly alone in his views, with many prominent geneticists lending their voices to eugenic and racist arguments. Indeed, many famous genetics institutions grew from roots in eugenics. For example, the Cold Spring Harbor Laboratory hosted a large Eugenics Record Office, and prior to 1954, the journal Annals of Human Genetics was called Annals of Eugenics. Scientists and their institutions strongly shaped the eugenic views and policies of their time and at times bent science to lend support to their racist views. Given their lasting contributions to our field, we should not shy away from reading and discussing their work. But despite their scientific accomplishments, we should resist the urge to celebrate or idolize them. We should also guard against inheriting their thinking by continually questioning the frameworks and language they put in place.
Since the field’s inception, geneticists have also been central to movements against eugenics and scientific racism on scientific as well as moral grounds. For instance, Thomas Hunt Morgan and Lancelot Hogben were both prominent geneticists who argued that eugenicists failed to recognize the environmental and social causes of inequality (Hogben 1933; Tabery 2008; Allen 2011). These arguments thread into later debates, where geneticists pushed back on simplistic and erroneous claims about genetics, IQ and behavioural differences among human populations (Theodosius Dobzhansky 1961; Richard C. Lewontin 1970; D. B. Paul 1994). Population geneticists have also been central to the pushback against scientific racism, highlighting the close genetic relationships among all humans due to their recent common ancestry and the ephemeral nature of populations (UNESCO 1952; Richard C. Lewontin 1972; Provine 1986; Gannett 2013). Racists continue to advance a selective view of population-genetic results to further their ends. As scientists, it is too easy to claim that we are just interested in the facts and ignore others who seek to present a distorted view of the science to advance their own political and social agendas. It is our job as population geneticists to argue against misuse of our field. As human genomics and personal genomics rise in prominence, we also need to resist public adoption of genetic determinism and essentialist, racialized thinking. We must question the topics we choose to investigate, the assumptions we make, and the conversations we prioritize as a field. Through exploring our own biases and those embedded in the presentation and use of our field, we can help to combat the misrepresentations of genetics and evolution that continue to cause harm in our society.
In this chapter we will work through how the basics of Mendelian genetics play out at the population level in sexually reproducing organisms.
Loci and alleles are the basic currency of population genetics, and indeed of genetics. A locus may be an entire gene, or a single nucleotide base pair such as A-T. At each locus, there may be multiple genetic variants segregating in the population; these different genetic variants are known as alleles. If all individuals in the population carry the same allele, we say that the locus is monomorphic; at this locus there is no genetic variability in the population. If there are multiple alleles in the population at a locus, we say that this locus is polymorphic (this is sometimes referred to as a segregating site).
Table [Table:ADH] shows a small stretch of orthologous sequence for the ADH locus from samples from Drosophila melanogaster, D. simulans, and D. yakuba. D. melanogaster and D. simulans are sister species and D. yakuba is a close outgroup to the two. Each column represents a single haplotype from an individual (the individuals are diploid but were inbred so they’re homozygous for their haplotype). Only sites that differ among individuals of the three species are shown. Site \(834\) is an example of a polymorphism; some D. simulans individuals carry a \(C\) allele while others have a \(T\). Fixed differences are sites that differ between the species but are monomorphic within the species. Site \(781\) is an example of a fixed difference between D. melanogaster and the other two species.
We can also annotate the alleles and loci in various ways. For example, position \(781\) is a non-synonymous fixed difference. We call the less common allele at a polymorphism the minor allele and the more common allele the major allele, e.g. at site \(1068\) the \(T\) allele is the minor allele in D. melanogaster. We call the more evolutionarily recent of the two alleles the derived allele and the older of the two the ancestral allele. We infer that the \(T\) allele at site \(1068\) is the derived allele because the \(C\) is found in both other species, suggesting that the \(T\) allele arose via a \(C \rightarrow T\) mutation.
Question. A) How many segregating sites does the sample from D. simulans have in the ADH gene?
B) How many fixed differences are there between D. melanogaster and D. yakuba?
Allele frequencies are a central unit of population-genetic analysis, but from diploid individuals we only get to observe genotype counts. Our first task then is to calculate allele frequencies from genotype counts. Consider a diploid autosomal locus segregating for two alleles (\(A_1\) and \(A_2\)). We’ll use these arbitrary labels for our alleles, merely to keep this general. Let \(N_{11}\) and \(N_{12}\) be the numbers of \(A_1A_1\) homozygotes and \(A_1A_2\) heterozygotes, respectively. Moreover, let \(N\) be the total number of diploid individuals in the population. We can then define the relative frequencies of \(A_1A_1\) and \(A_1A_2\) genotypes as \(f_{11} = N_{11}/N\) and \(f_{12} = N_{12}/N\), respectively. The frequency of allele \(A_1\) in the population is then given by
\[p = \frac{2 N_{11} + N_{12}}{2N} = f_{11} + \frac{1}{2} f_{12}.\] Note that this follows directly from how we count alleles given individuals’ genotypes, and holds independently of Hardy–Weinberg proportions and equilibrium (discussed below). The frequency of the alternate allele (\(A_2\)) is then just \(q=1-p\).
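As a minimal sketch of this counting argument, the Python snippet below computes \(p\) from genotype counts and checks that both forms of the formula agree. The function name and the genotype counts are hypothetical, chosen only for illustration; they are not data from these notes.

```python
def allele_frequency(n11, n12, n22):
    """Frequency p of allele A1 from diploid genotype counts.

    n11, n12, n22 are the numbers of A1A1, A1A2, and A2A2 individuals.
    """
    n = n11 + n12 + n22  # total number of diploid individuals, N
    if n == 0:
        raise ValueError("need at least one individual")
    # Each A1A1 individual carries two A1 alleles, each A1A2 carries one,
    # and there are 2N alleles in total.
    return (2 * n11 + n12) / (2 * n)


# Hypothetical genotype counts (illustration only).
n11, n12, n22 = 30, 50, 20
n = n11 + n12 + n22

p = allele_frequency(n11, n12, n22)
q = 1 - p  # frequency of the alternate allele A2

# The same p written in terms of genotype frequencies: f11 + f12 / 2.
f11, f12 = n11 / n, n12 / n
p_from_frequencies = f11 + 0.5 * f12

print(f"p = {p:.3f}, q = {q:.3f}")
```

Note that nothing in this calculation assumes Hardy–Weinberg proportions; it is pure allele counting.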
One common measure of genetic diversity is the average number of single nucleotide differences between haplotypes chosen at random from a sample. This is called nucleotide diversity and is often denoted by \(\pi\).
For example, we can calculate \(\pi\) for our ADH locus from Table [Table:ADH] above: we have 6 sequences from D. simulans (a-f), there’s a total of 15 ways of pairing these sequences, and
\[\pi=\frac{1}{15} \big( (2 + 1 + 1 + 1 + 0 ) + (3 + 3 + 3 + 2 ) +(0 + 0 + 1) + (0 + 1) + (1) \big)=1.2\overline{6}\]
where the first bracketed term gives the pairwise differences between
a and b-f, the second bracketed term the differences between b and c-f
and so on.
Our \(\pi\) measure depends on the length of sequence it is calculated for. Therefore, \(\pi\) is usually normalized by the length of the sequence, to give a per-site (or per-base) measure. For example, our ADH sequence covers \(397\) bp of DNA and so \(\pi = 1.2\overline{6}/397=0.0032\) per site in D. simulans for this region. Note that we could also calculate \(\pi\) per synonymous (or non-synonymous) site. For synonymous-site \(\pi\), we would calculate the average number of synonymous differences between our pairs of sequences, and then divide by the total number of sites where a synonymous change could have occurred.
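The calculation of \(\pi\) can be sketched in a few lines of Python. The block below assumes aligned haplotypes of equal length stored as strings; the function names and the toy haplotypes are hypothetical, and only the list of pairwise differences and the \(397\) bp length are taken from the worked ADH example above.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Number of sites at which two aligned haplotypes differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("haplotypes must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def nucleotide_diversity(haplotypes, n_sites=None):
    """Average pairwise differences; per site if n_sites is supplied."""
    pairs = list(combinations(haplotypes, 2))  # all n(n-1)/2 pairs
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    pi = total / len(pairs)
    return pi if n_sites is None else pi / n_sites


# Hypothetical toy haplotypes (variable sites only), for illustration.
toy_haplotypes = ["CTA", "CTG", "TTA", "CCA"]
toy_pi = nucleotide_diversity(toy_haplotypes)

# Pairwise differences copied from the worked D. simulans example:
# a vs b-f, b vs c-f, c vs d-f, d vs e-f, e vs f.
adh_differences = [2, 1, 1, 1, 0, 3, 3, 3, 2, 0, 0, 1, 0, 1, 1]
adh_pi = sum(adh_differences) / len(adh_differences)  # 19 / 15
adh_pi_per_site = adh_pi / 397                        # 397 bp surveyed

print(f"toy pi = {toy_pi}")
print(f"ADH pi = {adh_pi:.4f}, per site = {adh_pi_per_site:.4f}")
```

Passing `n_sites` is a design choice that keeps the raw and per-site measures in one function; for synonymous-site \(\pi\) one would pass the number of synonymous sites and count only synonymous differences.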
Another measure of genetic variability is the total number of sites that are polymorphic (segregating) in our sample. One issue is that the number of segregating sites will grow as we sequence more individuals (unlike \(\pi\)). Later in the course, we’ll talk about how to standardize the number of segregating sites for the number of individuals sequenced (see eqn[watterson_theta]).
We also often want to compile information about the frequency of alleles across sites. We call alleles that are found once in a sample singletons, alleles that are found twice in a sample doubletons, and so on. We count up the number of loci where an allele is found \(i\) times out of \(n\), e.g. how many singletons are there in the sample, and this is called the frequency spectrum. We’ll want to do this in some consistent manner, such as calculating the frequency spectrum of the minor allele or the derived allele.
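As a rough sketch of how these summaries can be tallied, the snippet below counts segregating sites and builds a minor-allele frequency spectrum from aligned haplotypes. It assumes every polymorphic site is biallelic, and the function name and toy haplotypes are hypothetical illustrations rather than the ADH data.

```python
from collections import Counter


def minor_allele_spectrum(haplotypes):
    """Map minor-allele count i -> number of segregating sites with that count.

    Assumes aligned haplotypes and at most two alleles per site.
    """
    spectrum = Counter()
    for column in zip(*haplotypes):      # iterate over sites
        allele_counts = Counter(column)
        if len(allele_counts) < 2:
            continue                     # monomorphic site, skip it
        minor_count = min(allele_counts.values())
        spectrum[minor_count] += 1
    return spectrum


# Hypothetical sample of four haplotypes at four sites.
toy_haplotypes = ["CTAG", "CTGG", "TTGG", "CCAG"]

spectrum = minor_allele_spectrum(toy_haplotypes)
n_segregating = sum(spectrum.values())   # total number of polymorphic sites

print("segregating sites:", n_segregating)
print("singletons:", spectrum[1], "doubletons:", spectrum[2])
```

To build the derived-allele spectrum instead, one would need an outgroup to polarize each site and would count copies of the derived allele rather than the rarer allele.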
Question. How many minor-allele singletons are there in D. simulans in the ADH region? [Defining minor allele just within D. simulans.]
Two observations have puzzled population geneticists since the inception of molecular population genetics. The first is the relatively high level of genetic variation observed in most obligately sexual species. This first observation, in part, drove the development of the Neutral theory of molecular evolution, the idea that much of this molecular polymorphism may simply reflect a balance between genetic drift and mutation. The second observation is the relatively narrow range of polymorphism across species with vastly different census sizes. This observation represented a puzzle as the Neutral theory predicts that levels of genetic diversity should scale with population size. Much effort in theoretical and empirical population genetics has been devoted to trying to reconcile models with these various observations. We’ll return to discuss these ideas throughout our course.
The first observations of molecular genetic diversity within natural populations were made from surveys of allozyme data, but we can revisit these general patterns with modern data. For example, Leffler et al. (2012) compiled data on levels of within-population, autosomal nucleotide diversity (\(\pi\)) for 167 species across 14 phyla from non-coding and synonymous sites (Figure [fig:Leffer]). The species with the lowest level of \(\pi\) in their survey was Lynx, with \(\pi = 0.01\%\), i.e. only \(1/10000\) bases differed between two sequences. In contrast, some of the highest levels of diversity were found in the sea squirt Ciona savignyi, where a remarkable \(1/12\) bases differ between pairs of sequences. This \(800\)-fold range of diversity seems impressive, but census population sizes have a much larger range.
Imagine a population mating at random with respect to genotypes, i.e. no inbreeding, no assortative mating, no population structure, and no sex differences in allele frequencies. The frequency of allele \(A_1\) in the population at the time of reproduction is \(p\). An \(A_1A_1\) genotype is made by reaching out into our population and independently drawing two \(A_1\) allele gametes to form a zygote. Therefore, the probability that an individual is an \(A_1A_1\) homozygote is \(p^2\). This probability is also the expected frequency of the \(A_1A_1\) homozygote in the population. The expected frequencies of the three possible genotypes are
\(f_{11}\) | \(f_{12}\) | \(f_{22}\) |
---|---|---|
\(p^2\) | \(2pq\) | \(q^2\) |
i.e. their Hardy–Weinberg expectations (Hardy et al. 1908; Weinberg 1908). Note that we only need to assume random mating with respect to our focal allele in order for these expected frequencies to hold in the zygotes forming the next generation. Evolutionary forces, such as selection, change allele frequencies within generations, but do not change this expectation for new zygotes, as long as \(p\) is the frequency of the \(A_1\) allele in the population at the time when gametes fuse. The additional assumptions of no migration, selection, or mutation are needed only for these Hardy–Weinberg genotype expectations to represent a long-term equilibrium.
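The Hardy–Weinberg expectations are easy to compute directly. The sketch below is one way to do so in Python, assuming a single biallelic locus; the function names and the genotype counts are hypothetical and are used only to show the steps (estimate \(p\) by allele counting, then form \(p^2\), \(2pq\), and \(q^2\)).

```python
def hardy_weinberg(p):
    """Expected genotype frequencies (f11, f12, f22) given p, the A1 frequency."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def expected_genotype_counts(n11, n12, n22):
    """Expected counts under Hardy-Weinberg for a sample of genotype counts."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)  # allele frequency by counting alleles
    return tuple(n * f for f in hardy_weinberg(p))


# Hypothetical observed genotype counts (illustration only).
observed = (30, 50, 20)
expected = expected_genotype_counts(*observed)

for label, obs, exp in zip(("A1A1", "A1A2", "A2A2"), observed, expected):
    print(f"{label}: observed {obs}, expected {exp:.2f}")
```

Comparing observed with expected counts in this way is the usual first step before any formal test of departure from Hardy–Weinberg proportions.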
Question. On the coastal islands of British Columbia there is a subspecies of black bear (Ursus americanus kermodei, Kermode’s bear). Many members of this black bear subspecies are white; they’re sometimes called spirit bears. These bears aren’t hybrids with polar bears, nor are they albinos. They are homozygotes for a recessive change at the MC1R gene. Individuals who are \(GG\) at this SNP are white, while \(AA\) and \(AG\) individuals are black.
Below are the genotype counts for the MC1R polymorphism in a sample of bears from British Columbia’s island populations from Ritland, Newton, and Marshall (2001).
\(AA\) | \(AG\) | \(GG\) |
---|---|---|
42 | 24 | 21 |
What are the expected frequencies of the three genotypes under HW?
See Figure 2.1 for a nice empirical demonstration of Hardy–Weinberg proportions. The mean frequency of each genotype closely matches its HW expectation, and much of the scatter of the dots around the expected line is due to our small sample size (\(\sim 60\) individuals). While HW often seems like a silly model, it often holds remarkably well within populations. This is because, although individuals don’t mate at random, they do mate at random with respect to their genotype at most of the loci in the genome.
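To get a feel for how much scatter sampling alone can produce, here is a small simulation sketch. It assumes random mating at a single biallelic locus and draws each individual's two alleles independently; the function name, the allele frequency, and the random seed are hypothetical choices for illustration, not values from the figure.

```python
import random


def sample_genotype_frequencies(p, n_individuals, rng):
    """Sample diploid genotypes under random mating.

    Returns observed frequencies (f11, f12, f22), where p is the
    frequency of allele A1 among the gametes.
    """
    counts = [0, 0, 0]  # index = number of A1 copies carried (0, 1, or 2)
    for _ in range(n_individuals):
        copies = int(rng.random() < p) + int(rng.random() < p)
        counts[copies] += 1
    f22, f12, f11 = (c / n_individuals for c in counts)
    return f11, f12, f22


rng = random.Random(1)  # arbitrary fixed seed so the sketch is repeatable
p = 0.3                 # hypothetical allele frequency
expected_het = 2 * p * (1 - p)

# Compare a sample of about 60 individuals with a much larger one.
for n in (60, 6000):
    f11, f12, f22 = sample_genotype_frequencies(p, n, rng)
    print(f"n = {n}: observed f12 = {f12:.3f}, expected = {expected_het:.3f}")
```

With small samples the observed heterozygote frequency typically wanders noticeably around \(2pq\), while with larger samples it settles close to it.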
Question. You are investigating a locus with three alleles, A, B, and C, with allele frequencies \(p_A\), \(p_B\), and \(p_C\). What fraction of the population is expected to be homozygotes under Hardy–Weinberg?
Microsatellites are regions of the genome where individuals vary in the number of copies of some short DNA repeat that they carry. These regions are often highly variable across individuals, making them a suitable way to identify individuals from a DNA sample. This so-called DNA fingerprinting has a range of applications from establishing paternity and identifying human remains to matching individuals to DNA samples from a crime scene. The FBI makes use of the CODIS database, which contains the genotypes of over 13 million people, most of whom have been convicted of a crime. Most of the profiles record genotypes at 13 microsatellite loci that are tetranucleotide repeats (since 2017, 20 sites have been genotyped).
The allele counts for two loci (D16S539 and TH01) are shown in tables [table:CODIS_1] and [table:CODIS_2] for a sample of 155 people of European ancestry. You can assume these two loci are on different chromosomes.
Question. You extract a DNA sample from a crime scene. The
genotype is 100/80 at the D16S539 locus and 70/93 at TH01.
A) You have a suspect in custody. Assuming
this suspect is innocent and of European ancestry, what is the
probability that their genotype would match this profile by chance (a
false-match probability)?
B) The FBI uses \(\geq\) 13 markers. Why is this higher
number necessary to make the match statement convincing evidence in
court?
C) An early case that triggered debate
among forensic geneticists was a crime among the Abenaki, a Native
American community in Vermont (see R. C. Lewontin 1994, for
discussion). There was a DNA sample from the crime scene, and the
perpetrator was thought likely to be a member of the Abenaki community.
Given that allele frequencies vary among populations, why would people
be concerned about using data from a non-Abenaki population to compute a
false match probability?
One major violation of the assumptions of Hardy–Weinberg is non-random mating with respect to the genotype at a locus. One way that individuals can mate non-randomly is if individuals choose to mate based on a phenotype determined (in part) by the genotype at a locus. This non-random mating can be between: 1) individuals with similar phenotypes, so-called positive assortative mating, or 2) individuals with dissimilar phenotypes, called negative assortative mating or disassortative mating. Here we’ll briefly discuss a couple of real examples of assortative mating to make sure we’re all on the same page. We’ll encounter other forms of non-random mating, due to inbreeding and population structure, in the next few chapters.
Positive assortative mating on the basis of a phenotype can create an excess of homozygotes. Heliconius butterflies are famous for their mimicry, where pairs of distantly related poisonous species mimic each other’s bright colour patterns and so share the benefits of being avoided by visual predators (Müllerian mimics). H. melpomene rosina and H. cydno chioneus are closely related species that co-occur in central Panama, but each mimics a different co-occurring species (Figure 2.2). These differences in colour pattern are due to a few loci with large phenotypic effects. The two species can hybridize and produce viable F1 hybrids. These F1 hybrids are heterozygotes at the colour loci, and their intermediate appearance means that they’re poor mimics and so are quickly eaten by predators. However, these heterozygote (F1) hybrids are very rare in nature (\(< \frac{1}{1000}\)), as the parental species show strong positive assortative mating based on colour pattern, which reflects genetic differences in mate preference (Merrill et al. 2019).