Molecular Biology and Evolution, msz202, https://doi.org/10.1093/molbev/msz202
Mol. Biol. Evol. doi:10.1093/molbev/msaa015
Abstract: Approaches for studying the evolution of globular proteins are now well established yet are unsuitable for disordered sequences. Our understanding of the evolution of proteins containing disordered regions therefore lags behind that of globular proteins, limiting our capacity to estimate their evolutionary history, classify paralogs, and identify potential sequence–function relationships. Here, we overcome these limitations by using new analytical approaches that project representations of sequence space to dissect the evolution of proteins with both ordered and disordered regions, and the correlated changes between these. We use the fasciclin-like arabinogalactan proteins (FLAs) as a model family, since they contain a variable number of globular fasciclin domains as well as several distinct types of disordered regions: proline (Pro)-rich arabinogalactan (AG) regions and longer Pro-depleted regions. Sequence space projections of fasciclin domains from 2019 FLAs from 78 species identified distinct clusters corresponding to different types of fasciclin domains. Clusters can be similarly identified in the seemingly random Pro-rich AG and Pro-depleted disordered regions. Sequence features of the globular and disordered regions clearly correlate with one another, implying coevolution of these distinct regions, as well as with the N-linked and O-linked glycosylation motifs. We reconstruct the overall evolutionary history of the FLAs, annotated with the changing domain architectures, glycosylation motifs, number and length of AG regions, and disordered region sequence features. Mapping these features onto the functionally characterized FLAs therefore enables their sequence–function relationships to be interrogated. These findings will inform research on the abundant disordered regions in protein families from all kingdoms of life.
Abstract: For most sequenced flowering plants, multiple whole-genome duplications (WGDs) are found. Genes duplicated by a WGD often have different fates: they can quickly disappear again, be retained for long(er) periods, or subsequently undergo small-scale duplications. However, how different expression, epigenetic regulation, and functional constraints are associated with these different gene fates following a WGD still requires further investigation, because successive WGDs in angiosperms complicate the gene trajectories. In this study, we investigate lotus (Nelumbo nucifera), an angiosperm with a single WGD at the K–Pg boundary. Based on improved intraspecific-synteny identification by a chromosome-level assembly, transcriptome, and bisulfite sequencing, we explore not only the fundamental distinctions in genomic features, expression, and methylation patterns of genes with different fates after a WGD but also the factors that shape post-WGD expression divergence and expression bias between duplicates. We found that after a WGD genes that returned to single copies show the highest levels and breadth of expression, gene body methylation, and intron numbers, whereas the long-retained duplicates exhibit the highest degrees of protein–protein interactions and protein lengths and the lowest methylation in gene flanking regions. For those long-retained duplicate pairs, the degree of expression divergence correlates with their sequence divergence, degree in protein–protein interactions, and expression level, whereas their biases in expression level reflecting subgenome dominance are associated with the bias of subgenome fractionation. Overall, our study on the paleopolyploid nature of lotus highlights the impact of different functional constraints on gene fate and duplicate divergence following a single WGD in plants.
Abstract: Recombination increases the local GC-content in genomic regions through GC-biased gene conversion (gBGC). The recent discovery of a large genomic region with extreme GC-content in the fat sand rat Psammomys obesus provides a model to study the effects of gBGC on chromosome evolution. Here, we compare the GC-content and GC-to-AT substitution patterns across protein-coding genes of four gerbil species and two murine rodents (mouse and rat). We find that the known high-GC region is present in all the gerbils, and is characterized by high substitution rates for all mutational categories (AT-to-GC, GC-to-AT, and GC-conservative) both at synonymous and nonsynonymous sites. A higher AT-to-GC than GC-to-AT rate is consistent with the high GC-content. Additionally, we find more than 300 genes outside the known region with outlying values of AT-to-GC synonymous substitution rates in gerbils. Of these, over 30% are organized into at least 17 large clusters observable at the megabase-scale. The unusual GC-skewed substitution pattern suggests the evolution of genomic regions with very high recombination rates in the gerbil lineage, which can lead to a runaway increase in GC-content. Our results imply that rapid evolution of GC-content is possible in mammals, with gerbil species providing a powerful model to study the mechanisms of gBGC.
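The quantities named in the abstract above (GC-content and the three mutational categories) can be illustrated with a minimal Python sketch. The function names `gc_content` and `classify_substitution` are hypothetical and introduced only for this example; the sketch does not reproduce the authors' pipeline, which estimates substitution rates along a phylogeny.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return gc / acgt


def classify_substitution(ancestral: str, derived: str) -> str:
    """Assign a single-nucleotide change to one of the three categories.

    AT-to-GC raises GC-content, GC-to-AT lowers it, and GC-conservative
    changes (A<->T or G<->C) leave it unchanged.
    """
    weak, strong = {"A", "T"}, {"G", "C"}
    a, d = ancestral.upper(), derived.upper()
    if a == d or a not in weak | strong or d not in weak | strong:
        raise ValueError("expected two different bases from A, C, G, T")
    if a in weak and d in strong:
        return "AT-to-GC"
    if a in strong and d in weak:
        return "GC-to-AT"
    return "GC-conservative"
```

An excess of AT-to-GC over GC-to-AT changes along a lineage is the GC-skewed pattern that the abstract associates with gBGC.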
Abstract: Understanding the persistence of genetic variation within populations has long been a goal of evolutionary biology. One promising route toward achieving this goal is using population genetic approaches to describe how selection acts on the loci associated with trait variation. Gene expression provides a model trait for addressing the challenge of the maintenance of variation because it can be measured genome-wide without information about how gene expression affects traits. Previous work has shown that loci affecting the expression of nearby genes (local or cis-eQTLs) are under negative selection, but we lack a clear understanding of the selective forces acting on variants that affect the expression of genes in trans. Here, we identify loci that affect gene expression in trans using genomic and transcriptomic data from one population of the obligately outcrossing plant, Capsella grandiflora. The allele frequencies of trans-eQTLs are consistent with stronger negative selection acting on trans-eQTLs than cis-eQTLs, and stronger negative selection acting on trans-eQTLs associated with the expression of multiple genes. However, despite this general pattern, we still observe the presence of a trans-eQTL at intermediate frequency that affects the expression of a large number of genes in the same coexpression module. Overall, our work highlights the different selective pressures shaping variation in cis- and trans-regulation.
Abstract: Comparative genomics and molecular phylogenetics are foundational for understanding biological evolution. Although many studies have been conducted with the aim of understanding the genomic contents of early life, uncertainty remains. A study by Weiss et al. (Weiss MC, Sousa FL, Mrnjavac N, Neukirchen S, Roettger M, Nelson-Sathi S, Martin WF. 2016. The physiology and habitat of the last universal common ancestor. Nat Microbiol. 1(9):16116.) identified a number of protein families in the last universal common ancestor of archaea and bacteria (LUCA) which were not found in previous works. Here, we report new research that suggests the clustering approaches used in this previous study undersampled protein families, resulting in incomplete phylogenetic trees which do not reflect protein family evolution. Phylogenetic analysis of protein families which include more sequence homologs rejects a simple LUCA hypothesis based on phylogenetic separation of the bacterial and archaeal domains for a majority of the previously identified LUCA proteins (∼82%). To compensate for the limitations of phylogenetic inference derived from incompletely populated orthologous groups and to test the hypothesis of a period of rapid evolution preceding the separation of the domains, we compared phylogenetic distances both within and between domains, for thousands of orthologous groups. We find a substantial diversity of interdomain versus intradomain branch lengths, even among protein families which exhibit a single domain-separating branch and are thought to be associated with the LUCA. Additionally, phylogenetic trees with long interdomain branches relative to intradomain branches are enriched in information categories of protein families in comparison to those associated with metabolic functions. These results provide a new view of protein family evolution and temper claims about the phenotype and habitat of the LUCA.
Abstract: Evidence is accumulating that evolutionary changes are not only common during biological invasions but may also contribute directly to invasion success. The genomic basis of such changes is still largely unexplored. Yet, understanding the genomic response to invasion may help to predict the conditions under which invasiveness can be enhanced or suppressed. Here, we characterized the genome response of the spotted wing drosophila Drosophila suzukii during the worldwide invasion of this pest insect species, by conducting a genome-wide association study to identify genes involved in adaptive processes during invasion. Genomic data from 22 population samples were analyzed to detect genetic variants associated with the status (invasive versus native) of the sampled populations, based on a newly developed statistic, which we call C2, that contrasts allele frequencies corrected for population structure. We evaluated this new statistical framework using simulated data sets and implemented it in an upgraded version of the program BayPass. We identified a relatively small set of single-nucleotide polymorphisms that show a highly significant association with the invasive status of D. suzukii populations. In particular, two genes, RhoGEF64C and cpo, contained single-nucleotide polymorphisms significantly associated with the invasive status in the two separate main invasion routes of D. suzukii. Our methodological approaches can be applied to any other invasive species, and more generally to any evolutionary model for species characterized by nonequilibrium demographic conditions for which binary covariables of interest can be defined at the population level.
Abstract: Different evolutionary forces shape gene content and sequence evolution on autosomes versus sex chromosomes. Location on a sex chromosome can favor male-beneficial or female-beneficial mutations depending on the sex determination system and selective pressure on different sexual morphs. An X0 sex determination can lead to autosomal enrichment of male-biased genes, as observed in some hemipteran insect species. Aphids share X0 sex determination; however, models predict the opposite pattern, due to their unusual life cycles, which alternate between all-female asexual generations and a single sexual generation. Predictions include enrichment of female-biased genes on autosomes and of male-biased genes on the X, in contrast to expectations for obligately sexual species. Robust tests of these models require chromosome-level genome assemblies for aphids and related hemipterans with X0 sex determination and obligate sexual reproduction. In this study, we built the first chromosome-level assembly of a psyllid, an aphid relative with X0 sex determination and obligate sexuality, and compared it with recently resolved chromosome-level assemblies of aphid genomes. Aphid and psyllid X chromosomes differ strikingly. In aphids, female-biased genes are strongly enriched on autosomes and male-biased genes are enriched on the X. In psyllids, male-biased genes are enriched on autosomes. Furthermore, functionally important gene categories of aphids are enriched on autosomes. Aphid X-linked genes and male-biased genes are under relaxed purifying selection, but gene content and order on the X is highly conserved, possibly reflecting constraints imposed by unique chromosomal mechanisms associated with the unusual aphid life cycle.
Abstract: Satellite repeats are major sequence constituents of centromeres in many plant and animal species. Within a species, a single family of satellite sequences typically occupies centromeres of all chromosomes and is absent from other parts of the genome. Due to their common origin, sequence similarities exist among the centromere-specific satellites in related species. Here, we report a remarkably different pattern of centromere evolution in the plant tribe Fabeae, which includes genera Pisum, Lathyrus, Vicia, and Lens. By immunoprecipitation of centromeric chromatin with CENH3 antibodies, we identified and characterized a large and diverse set of 64 families of centromeric satellites in 14 species. These families differed in their nucleotide sequence, monomer length (33–2,979 bp), and abundance in individual species. Most families were species-specific, and most species possessed multiple (2–12) satellites in their centromeres. Some of the repeats that were shared by several species exhibited promiscuous patterns of centromere association, being located within CENH3 chromatin in some species, but apart from the centromeres in others. Moreover, FISH experiments revealed that the same family could assume centromeric and noncentromeric positions even within a single species. Taken together, these findings suggest that Fabeae centromeres are not shaped by the coevolution of a single centromeric satellite with its interacting CENH3 proteins, as proposed by the centromere drive model. This conclusion is also supported by the absence of pervasive adaptive evolution of CENH3 sequences retrieved from Fabeae species.
Abstract: The basidiomycete Schizophyllum commune has the highest level of genetic polymorphism known among living organisms. In a previous study, it was also found to have a moderately high per-generation mutation rate of 2×10^−8, likely contributing to its high polymorphism. However, this rate has been measured only in an experiment on Petri dishes, and it is unclear how it translates to natural populations. Here, we used an experimental design that measures the rate of accumulation of de novo mutations in a linearly growing mycelium. We show that S. commune accumulates mutations at a rate of 1.24×10^−7 substitutions per nucleotide per meter of growth, or ∼2.04×10^−11 per nucleotide per cell division. In contrast to what has been observed in a number of species with extensive vegetative growth, this rate does not decline in the course of propagation of a mycelium. As a result, even a moderate per-cell-division mutation rate in S. commune can translate into a very high per-generation mutation rate when the number of cell divisions between consecutive meioses is large.
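The scaling argument in the abstract above can be made concrete with a short arithmetic sketch in Python. The two rate constants are the values quoted in the abstract; the function names, the linear scaling with the number of divisions, and the example division count are assumptions made for illustration, not figures or methods reported by the authors.

```python
# Rates quoted in the abstract (substitutions per nucleotide).
RATE_PER_METER = 1.24e-7      # per meter of mycelial growth
RATE_PER_DIVISION = 2.04e-11  # per cell division


def implied_divisions_per_meter(rate_per_meter: float = RATE_PER_METER,
                                rate_per_division: float = RATE_PER_DIVISION) -> float:
    """Number of cell divisions per meter implied by the two quoted rates."""
    return rate_per_meter / rate_per_division


def per_generation_rate(divisions_between_meioses: float,
                        rate_per_division: float = RATE_PER_DIVISION) -> float:
    """Per-generation rate, assuming the per-division rate stays constant
    (as the abstract reports) and mutations accumulate linearly."""
    return rate_per_division * divisions_between_meioses


if __name__ == "__main__":
    print(f"implied divisions per meter: {implied_divisions_per_meter():.0f}")
    # Hypothetical example: 10,000 divisions between consecutive meioses.
    print(f"per-generation rate: {per_generation_rate(10_000):.2e}")
```

Under these assumptions the per-generation rate grows in direct proportion to the number of divisions separating two meioses, which is the point the final sentence of the abstract makes.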
Abstract: Euglena gracilis is a metabolically flexible, photosynthetic, and adaptable free-living protist of considerable environmental importance and biotechnological value. By label-free liquid chromatography tandem mass spectrometry, a total of 1,786 proteins were identified from the E. gracilis purified mitochondria, representing one of the largest mitochondrial proteomes so far described. Despite this apparent complexity, protein machinery responsible for the extensive RNA editing, splicing, and processing in the sister clades diplonemids and kinetoplastids is absent. This strongly suggests that the complex mechanisms of mitochondrial gene expression in diplonemids and kinetoplastids emerged late in euglenozoan evolution, arising independently. By contrast, the alternative oxidase pathway and numerous ribosomal subunits presumed to be specific for parasitic trypanosomes are present in E. gracilis. We investigated the evolution of unexplored protein families, including import complexes, cristae formation proteins, and translation termination factors, as well as canonical and unique metabolic pathways. We additionally compare this mitoproteome with the transcriptome of Eutreptiella gymnastica, illuminating conserved features of Euglenida mitochondria as well as those exclusive to E. gracilis. This is the first mitochondrial proteome of a free-living protist from the Excavata and one of few available for protists as a whole. This study alters our views of the evolution of the mitochondrion and indicates early emergence of complexity within euglenozoan mitochondria, independent of parasitism.
Abstract: Aerobic performance is tied to fitness as it influences an animal’s ability to find food, escape predators, or survive extreme conditions. At high altitude, where low O2 availability and persistent cold prevail, maximum metabolic heat production (thermogenesis) is an aerobic performance trait that is closely linked to survival. Understanding how thermogenesis evolves to enhance survival at high altitude will yield insight into the links between physiology, performance, and fitness. Recent work in deer mice (Peromyscus maniculatus) has shown that adult mice native to high altitude have higher thermogenic capacities under hypoxia compared with lowland conspecifics, but that developing high-altitude pups delay the onset of thermogenesis. This finding suggests that natural selection on thermogenic capacity varies across life stages. To determine the mechanistic cause of this ontogenetic delay, we analyzed the transcriptomes of thermoeffector organs (brown adipose tissue and skeletal muscle) in developing deer mice native to low and high altitude. We demonstrate that the developmental delay in thermogenesis is associated with adaptive shifts in the expression of genes involved in nervous system development, fuel/O2 supply, and oxidative metabolism pathways. Our results demonstrate that selection has modified the developmental trajectory of the thermoregulatory system at high altitude and has done so by acting on the regulatory systems that control the maturation of thermoeffector tissues. We suggest that the cold and hypoxic conditions of high altitude force a resource allocation tradeoff, whereby limited energy is allocated to developmental processes such as growth, versus active thermogenesis, during early development.
Abstract: Purifying (negative) natural selection is a hallmark of functional biological sequences, and can be detected in protein-coding genes using the ratio of nonsynonymous to synonymous substitutions per site (dN/dS). However, when two genes overlap the same nucleotide sites in different frames, synonymous changes in one gene may be nonsynonymous in the other, perturbing dN/dS. Thus, scalable methods are needed to estimate functional constraint specifically for overlapping genes (OLGs). We propose OLGenie, which implements a modification of the Wei–Zhang method. Assessment with simulations and controls from viral genomes (58 OLGs and 176 non-OLGs) demonstrates low false-positive rates and good discriminatory ability in differentiating true OLGs from non-OLGs. We also apply OLGenie to the unresolved case of HIV-1’s putative antisense protein gene, showing significant purifying selection. OLGenie can be used to study known OLGs and to predict new OLGs in genome annotation. Software and example data are freely available at https://github.com/chasewnelson/OLGenie (last accessed April 10, 2020).
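The problem described in the abstract above, that one nucleotide change can be synonymous in one reading frame and nonsynonymous in an overlapping frame, can be shown with a small Python sketch. This is not OLGenie and not the Wei–Zhang method; the function name `effect_in_frame` and the toy sequence are hypothetical, and only same-strand overlaps under the standard genetic code are considered.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def effect_in_frame(seq: str, pos: int, new_base: str, frame: int) -> str:
    """Classify a point mutation at 0-based `pos` for the reading frame
    that starts at offset `frame` (0, 1, or 2) on the given strand."""
    if pos < frame:
        raise ValueError("position lies before the start of this frame")
    start = frame + 3 * ((pos - frame) // 3)  # first base of affected codon
    codon = seq[start:start + 3].upper()
    if len(codon) < 3:
        raise ValueError("position is not inside a complete codon")
    offset = pos - start
    mutant = codon[:offset] + new_base.upper() + codon[offset + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return "synonymous" if same else "nonsynonymous"


if __name__ == "__main__":
    toy = "GGCTGGA"  # frame 0: GGC TGG; frame 1: GCT GGA
    for frame in (0, 1):
        print(frame, effect_in_frame(toy, pos=2, new_base="T", frame=frame))
```

In the toy sequence, the C-to-T change at position 2 turns GGC into GGT in frame 0 (glycine in both cases) but GCT into GTT in frame 1 (alanine to valine), so a standard dN/dS count made in one frame misreads the constraint acting through the other.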
Abstract: Fisher’s fundamental theorem of natural selection predicts no additive variance of fitness in a natural population. Consistently, studies in a variety of wild populations show virtually no narrow-sense heritability (h²) for traits important to fitness. However, counterexamples are occasionally reported, calling for a deeper understanding of the evolution of additive variance. In this study, we propose adaptive divergence followed by population admixture as a source of the additive genetic variance of evolutionarily important traits. We experimentally tested the hypothesis by examining a panel of ∼1,000 yeast segregants produced by a hybrid of two yeast strains that experienced adaptive divergence. We measured >400 yeast cell morphological traits and found a strong positive correlation between h² and evolutionary importance. Because adaptive divergence followed by population admixture could happen constantly, particularly in species with wide geographic distribution and strong migratory capacity (e.g., humans), the finding reconciles the observation of abundant additive variances in evolutionarily important traits with Fisher’s fundamental theorem of natural selection. Importantly, the revealed role of positive selection in promoting rather than depleting additive variance suggests a simple explanation for why additive genetic variance can be dominant in a population despite the ubiquitous between-gene epistasis observed in functional assays.
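For readers unfamiliar with the quantity in the abstract above, narrow-sense heritability is the share of phenotypic variance that is additive genetic variance, h² = V_A / V_P. A minimal Python sketch follows, using the textbook midparent-offspring regression, whose slope estimates h². This is an illustration only: the study itself worked with yeast segregants, and the function name and toy data are hypothetical.

```python
def h2_from_midparent_regression(midparent, offspring):
    """Estimate narrow-sense heritability as the least-squares slope of
    offspring phenotype on midparent phenotype."""
    if len(midparent) != len(offspring) or len(midparent) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(midparent)
    mean_x = sum(midparent) / n
    mean_y = sum(offspring) / n
    # Sums of cross-products and squares; the 1/(n-1) factors cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(midparent, offspring))
    sxx = sum((x - mean_x) ** 2 for x in midparent)
    if sxx == 0:
        raise ValueError("midparent values show no variation")
    return sxy / sxx


if __name__ == "__main__":
    # Toy data (made up): offspring deviate half as far from the mean as parents.
    midparents = [1.0, 2.0, 3.0, 4.0]
    offspring = [1.5, 2.0, 2.5, 3.0]
    print(h2_from_midparent_regression(midparents, offspring))
```

A slope near zero corresponds to the "virtually no h²" case the abstract attributes to fitness-related traits in wild populations, whereas a slope near one means most phenotypic variance is additive.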
Abstract: Gene duplication serves a critical role in evolutionary adaptation by providing genetic raw material to the genome. The evolution of duplicated genes may be influenced by epigenetic processes such as DNA methylation, which affects gene function in some taxa. However, the manner in which DNA methylation affects duplicated genes is not well understood. We studied duplicated genes in the honeybee Apis mellifera, an insect with a highly sophisticated social structure, to investigate whether DNA methylation was associated with gene duplication and genic evolution. We found that levels of gene body methylation were significantly lower in duplicate genes than in single-copy genes, implicating a possible role of DNA methylation in postduplication gene maintenance. Additionally, we discovered associations of gene body methylation with the location, length, and time since divergence of paralogous genes. We also found that divergence in DNA methylation was associated with divergence in gene expression in paralogs, although the relationship was not completely consistent with a direct link between DNA methylation and gene expression. Overall, our results provide further insight into genic methylation and how its association with duplicate genes might facilitate evolutionary processes and adaptation.
Abstract: Parasites are arguably among the strongest drivers of natural selection, constraining hosts to evolve resistance and tolerance mechanisms. Although the genetic basis of adaptation to parasite infection has been widely studied, little is known about how epigenetic changes contribute to parasite resistance and, eventually, adaptation. Here, we investigated the role of host DNA methylation modifications in the response to parasite infections. In a controlled infection experiment, we used the three-spined stickleback fish, a model species for host–parasite studies, and their nematode parasite Camallanus lacustris. We showed that the levels of DNA methylation are higher in infected fish. Results furthermore suggest correlations between DNA methylation and shifts in key fitness and immune traits between infected and control fish, including respiratory burst and functional trans-generational traits such as the concentration of motile sperm. We revealed that genes associated with metabolic, developmental, and regulatory processes (cell death and apoptosis) were differentially methylated between infected and control fish. Interestingly, genes such as the neuropeptide FF receptor 2 and the integrin alpha 1 as well as molecular pathways including the Th1 and Th2 cell differentiation were hypermethylated in infected fish, suggesting parasite-mediated repression mechanisms of immune responses. Altogether, we demonstrate that parasite infection contributes to genome-wide DNA methylation modifications. Our study brings novel insights into the evolution of vertebrate immunity and suggests that epigenetic mechanisms are complementary to genetic responses against parasite-mediated selection.
Abstract: Magnesium chelatase chlIDH and cobalt chelatase cobNST enzymes are required for biosynthesis of (bacterio)chlorophyll and cobalamin (vitamin B12), respectively. Each enzyme consists of large, medium, and small subunits. Structural and primary sequence similarities indicate a common evolutionary origin of the corresponding subunits. It has been reported earlier that some vitamin B12-synthesizing organisms utilize an unusual cobalt chelatase enzyme consisting of a large cobalt chelatase subunit (cobN) along with the medium (chlD) and small (chlI) subunits of magnesium chelatase. In an attempt to understand the nature of this phenomenon, we analyzed >1,200 diverse genomes of cobalamin and/or chlorophyll producing prokaryotes. We found that, surprisingly, genomes of many cobalamin producers contained cobN and chlD genes only; a small subunit gene was absent. Furthermore, we discovered a diverse group of chlD genes with functional programed ribosomal frameshifting signals. Given the high similarity between the small subunit and the N-terminal part of the medium subunit, we proposed that programed translational frameshifting may allow chlD mRNA to produce both subunits. Indeed, in genomes where genes for small subunits were absent, we observed statistically significant enrichment of programed frameshifting signals in chlD genes. Interestingly, the details of the frameshifting mechanisms producing small and medium subunits from a single chlD gene could be specific to particular prokaryotic taxa. Overall, this programed frameshifting phenomenon was observed to be highly conserved and present in both bacteria and archaea.
Abstract: Metabolic networks are complex cellular systems dependent on the interactions among, and regulation of, the enzymes in the network. Although there is great diversity in the types of enzymes that make up metabolic networks, the models meant to understand the possible evolutionary outcomes following duplication neglect specifics about the enzyme, pathway context, and cellular constraints. To illuminate the mechanisms that shape the evolution of biochemical pathways, I functionally characterize the consequences of gene duplication of an enzyme family that performs multiple subsequent enzymatic reactions (a multistep enzyme) in the corticosteroid pathway in primates. The products of the corticosteroid pathway (aldosterone and cortisol) are steroid hormones that regulate metabolism and stress response in tetrapods. These steroid hormones are synthesized by a multistep enzyme Cytochrome P450 11B (CYP11B) that performs subsequent steps on different carbon atoms of the steroid derivatives. Through ancestral state reconstruction and in vitro characterization, I find that the primate ancestor of the CYP11B1 and CYP11B2 paralogs had moderate ability to synthesize both cortisol and aldosterone. Following duplication in Old World primates, the CYP11B1 homolog specialized on the production of cortisol, whereas its paralog, CYP11B2, maintained its ability to perform multiple subsequent steps as in the ancestral pathway. Unlike CYP11B1, CYP11B2 could not specialize on the production of aldosterone because it is constrained to perform earlier steps in the corticosteroid synthesis pathway to achieve the final product aldosterone. These results suggest that enzyme function and pathway context, along with tissue-specific regulation, play a role in shaping potential outcomes of metabolic network elaboration.
Abstract: Satellite DNAs (satDNAs) are among the most dynamically evolving components of eukaryotic genomes and play important roles in genome regulation, genome evolution, and speciation. Despite their abundance and functional impact, we know little about the evolutionary dynamics and molecular mechanisms that shape satDNA distributions in genomes. Here, we use high-quality genome assemblies to study the evolutionary dynamics of two complex satDNAs, Rsp-like and 1.688 g/cm³, in Drosophila melanogaster and its three nearest relatives in the simulans clade. We show that large blocks of these repeats are highly dynamic in the heterochromatin, where their genomic location varies across species. We discovered that small blocks of satDNA that are abundant in X chromosome euchromatin are similarly dynamic, with repeats changing in abundance, location, and composition among species. We detail the proliferation of a rare satellite (Rsp-like) across the X chromosome in D. simulans and D. mauritiana. Rsp-like spread by inserting into existing clusters of the older, more abundant 1.688 satellite, in events likely facilitated by microhomology-mediated repair pathways. We show that Rsp-like is abundant on extrachromosomal circular DNA in D. simulans, which may have contributed to its dynamic evolution. Intralocus satDNA expansions via unequal exchange and the movement of higher order repeats also contribute to the fluidity of the repeat landscape. We find evidence that euchromatic satDNA repeats experience cycles of proliferation and diversification somewhat analogous to bursts of transposable element proliferation. Our study lays a foundation for mechanistic studies of satDNA proliferation and the functional and evolutionary consequences of satDNA movement.
Abstract: Convergent evolution is pervasive in nature, but it is poorly understood how various constraints and natural selection limit the diversity of evolvable phenotypes. Here, we analyze the transcriptome across fruiting body development to understand the independent evolution of complex multicellularity in the two largest clades of fungi: the Agarico- and Pezizomycotina. Despite >650 My of divergence between these clades, we find that very similar sets of genes have convergently been co-opted for complex multicellularity, followed by expansions of their gene families by duplications. Over 82% of shared multicellularity-related gene families were expanding in both clades, indicating a high prevalence of convergence also at the gene family level. This convergence is coupled with a rich inferred repertoire of multicellularity-related genes in the most recent common ancestor of the Agarico- and Pezizomycotina, consistent with the hypothesis that the coding capacity of ancestral fungal genomes might have promoted the repeated evolution of complex multicellularity. We interpret this repertoire as an indication of evolutionary predisposition of fungal ancestors for evolving complex multicellular fruiting bodies. Our work suggests that evolutionary convergence may happen not only when organisms are closely related or are under similar selection pressures, but also when ancestral genomic repertoires render certain evolutionary trajectories more likely than others, even across large phylogenetic distances.
Abstract: Understanding how organisms adapt to extreme environments is fundamental and can provide insightful case studies for both evolutionary biology and climate-change biology. Here, we take advantage of the vast diversity of lifestyles in ants to identify genomic signatures of adaptation to extreme habitats such as high altitude. We hypothesized two parallel patterns would occur in a genome adapting to an extreme habitat: 1) strong positive selection on genes related to adaptation and 2) a relaxation of previous purifying selection. We tested this hypothesis by sequencing the high-elevation specialist Tetramorium alpestre and four other phylogenetically related species. In support of our hypothesis, we recorded a strong shift of selective forces in T. alpestre, in particular a stronger magnitude of diversifying and relaxed selection when compared with all other ants. We further disentangled candidate molecular adaptations in both gene expression and protein-coding sequence that were identified by our genome-wide analyses. In particular, we demonstrate that T. alpestre has 1) a higher level of expression for stv and other heat-shock proteins in chill-shock tests and 2) enzymatic enhancement of Hex-T1, a rate-limiting regulatory enzyme that controls the entry of glucose into the glycolytic pathway. Together, our analyses highlight the adaptive molecular changes that support colonization of high-altitude environments.
Abstract: The dN/dS ratio provides evidence of adaptation or functional constraint in protein-coding genes by quantifying the relative excess or deficit of amino acid-replacing versus silent nucleotide variation. Inexpensive sequencing promises a better understanding of parameters, such as dN/dS, but analyzing very large data sets poses a major statistical challenge. Here, I introduce genomegaMap for estimating within-species genome-wide variation in dN/dS, and I apply it to 3,979 genes across 10,209 tuberculosis genomes to characterize the selection pressures shaping this global pathogen. GenomegaMap is a phylogeny-free method that addresses two major problems with existing approaches: 1) It is fast no matter how large the sample size and 2) it is robust to recombination, which causes phylogenetic methods to report artefactual signals of adaptation. GenomegaMap uses population genetics theory to approximate the distribution of allele frequencies under general, parent-dependent mutation models. Coalescent simulations show that substitution parameters are well estimated even when genomegaMap’s simplifying assumption of independence among sites is violated. I demonstrate the ability of genomegaMap to detect genuine signatures of selection at antimicrobial resistance-conferring substitutions in Mycobacterium tuberculosis and describe a novel signature of selection in the cold-shock DEAD-box protein A gene deaD/csdA. The genomegaMap approach helps accelerate the exploitation of big data for gaining new insights into evolution within species.
AbstractUnderstanding why some species accumulate more deleterious substitutions than others is an important question in evolutionary biology and conservation science. Previous studies conducted in terrestrial taxa suggest that life history traits correlate with the efficiency of purifying selection and the accumulation of deleterious mutations. Using a large genome data set of 76 species of teleostean fishes, we show that species with life history traits associated with vulnerability to fishing have an increased rate of deleterious mutation accumulation (measured via dN/dS, i.e., nonsynonymous over synonymous substitution rate). Our results, focusing on a large clade of aquatic species, generalize patterns previously found in only a few clades of terrestrial vertebrates. These results also show that species vulnerable to fishing inherently accumulate more deleterious substitutions than nonthreatened ones, which illustrates the potential links among population genetics, ecology, and fishing policies to prevent species extinction.
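As a rough illustration of the dN/dS measure referred to in the two abstracts above, the ratio can be sketched from raw counts. This is a minimal sketch and not the method of either study: the function name `dn_ds` and all counts are hypothetical, and it assumes simple per-site proportions with no correction for multiple substitutions at the same site.

```python
def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return (dN, dS, dN/dS) from raw counts.

    Assumption: uncorrected per-site proportions; real analyses usually
    apply a substitution model to correct for multiple hits.
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    dn = nonsyn_subs / nonsyn_sites  # nonsynonymous substitutions per nonsynonymous site
    ds = syn_subs / syn_sites        # synonymous substitutions per synonymous site
    if ds == 0:
        raise ValueError("dS is zero, so the ratio is undefined")
    return dn, ds, dn / ds


# Hypothetical counts for one gene, chosen only for illustration.
dn, ds, omega = dn_ds(nonsyn_subs=12, nonsyn_sites=600.0,
                      syn_subs=20, syn_sites=200.0)
print(f"dN={dn:.3f} dS={ds:.3f} dN/dS={omega:.3f}")
```

A ratio well below 1, as in this made-up example, is conventionally read as purifying selection; a higher ratio in one lineage than another is the kind of signal the fish study interprets as greater accumulation of deleterious substitutions.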
AbstractMost molecular evolutionary studies of natural selection maintain the decades-old assumption that synonymous substitution rate variation (SRV) across sites within genes occurs at levels that are either nonexistent or negligible. However, numerous studies challenge this assumption from a biological perspective and show that SRV is comparable in magnitude to that of nonsynonymous substitution rate variation. We evaluated the impact of this assumption on methods for inferring selection at the molecular level by incorporating SRV into an existing method (BUSTED) for detecting signatures of episodic diversifying selection in genes. Using simulated data we found that failing to account for even moderate levels of SRV in selection testing is likely to produce intolerably high false positive rates. To evaluate the effect of the SRV assumption on actual inferences we compared results of tests with and without the assumption in an empirical analysis of over 13,000 Euteleostomi (bony vertebrate) gene alignments from the Selectome database. This exercise reveals that close to 50% of positive results (i.e., evidence for selection) in empirical analyses disappear when SRV is modeled as part of the statistical analysis and are thus candidates for being false positives. The results from this work add to a growing literature establishing that tests of selection are much more sensitive to certain model assumptions than previously believed.
AbstractEstimating past population dynamics from molecular sequences that have been sampled longitudinally through time is an important problem in infectious disease epidemiology, molecular ecology, and macroevolution. Popular solutions, such as the skyline and skygrid methods, infer past effective population sizes from the coalescent event times of phylogenies reconstructed from sampled sequences but assume that sequence sampling times are uninformative about population size changes. Recent work has started to question this assumption by exploring how sampling time information can aid coalescent inference. Here, we develop, investigate, and implement a new skyline method, termed the epoch sampling skyline plot (ESP), to jointly estimate the dynamics of population size and sampling rate through time. The ESP is inspired by real-world data collection practices and comprises a flexible model in which the sequence sampling rate is proportional to the population size within an epoch but can change discontinuously between epochs. We show that the ESP is accurate under several realistic sampling protocols and we prove analytically that it can at least double the best precision achievable by standard approaches. We generalize the ESP to incorporate phylogenetic uncertainty in a new Bayesian package (BESP) in BEAST2. We re-examine two well-studied empirical data sets from virus epidemiology and molecular evolution and find that the BESP improves upon previous coalescent estimators and generates new, biologically useful insights into the sampling protocols underpinning these data sets. Sequence sampling times provide a rich source of information for coalescent inference that will become increasingly important as sequence collection intensifies and becomes more formalized.
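The sampling assumption described in the abstract above, a sampling rate proportional to population size within an epoch with a proportionality constant that may jump between epochs, can be sketched as follows. This is a simplified illustration and not the ESP or BESP implementation: it additionally assumes that population size is constant within each epoch, and the function name `expected_samples` and all values are hypothetical.

```python
def expected_samples(pop_sizes, sampling_intensities, durations):
    """Expected number of sampled sequences in each epoch.

    Assumption: within epoch j the sampling rate is beta_j * N_j, with both
    the population size N_j and the intensity beta_j constant in the epoch.
    """
    if not (len(pop_sizes) == len(sampling_intensities) == len(durations)):
        raise ValueError("all inputs must have one entry per epoch")
    return [beta * n * dt
            for n, beta, dt in zip(pop_sizes, sampling_intensities, durations)]


# Hypothetical two-epoch scenario: the population halves while the
# sampling intensity quadruples, so sampling effort changes discontinuously.
counts = expected_samples(pop_sizes=[1000, 500],
                          sampling_intensities=[0.01, 0.04],
                          durations=[2.0, 1.0])
print(counts)
```

Under this assumption the density of sampling times within an epoch carries information about population size, which is the intuition behind using sampling times in coalescent inference.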
AbstractEpigenetic variation might play an important role in generating adaptive phenotypes by underpinning within-generation developmental plasticity, persistent parental effects of the environment (e.g., transgenerational plasticity), or heritable epigenetically based polymorphism. These adaptive mechanisms should be most critical in organisms where genetic sources of variation are limited. Using a clonally reproducing freshwater snail (Potamopyrgus antipodarum), we examined the stability of an adaptive phenotype (shell shape) and of DNA methylation between generations. First, we raised three generations of snails adapted to river currents in the lab without current. We showed that habitat-specific adaptive shell shape was relatively stable across three generations but shifted slightly over generations two and three toward a no-current lake phenotype. We also showed that DNA methylation specific to high-current environments was stable across one generation. This study provides the first evidence of stability of DNA methylation patterns across one generation in an asexual animal. Together, our observations are consistent with the hypothesis that adaptive shell shape variation is at least in part determined by transgenerational plasticity, and that DNA methylation provides a potential mechanism for stability of shell shape across one generation.
AbstractGenome-wide association studies have uncovered thousands of genetic variants that are associated with a wide variety of human traits. Knowledge of how trait-associated variants are distributed within and between populations can provide insight into the genetic basis of group-specific phenotypic differences, particularly for health-related traits. We analyzed the genetic divergence levels for 1) individual trait-associated variants and 2) collections of variants that function together to encode polygenic traits, between two neighboring populations in Colombia that have distinct demographic profiles: Antioquia (Mestizo) and Chocó (Afro-Colombian). Genetic ancestry analysis showed 62% European, 32% Native American, and 6% African ancestry for Antioquia compared with 76% African, 10% European, and 14% Native American ancestry for Chocó, consistent with demography and previous results. Ancestry differences can confound cross-population comparison of polygenic risk scores (PRS); however, we did not find any systematic bias in PRS distributions for the two populations studied here, and population-specific differences in PRS were, for the most part, small and symmetrically distributed around zero. Both genetic differentiation at individual trait-associated single nucleotide polymorphisms and population-specific PRS differences between Antioquia and Chocó largely reflected anthropometric phenotypic differences that can be readily observed between the populations along with reported disease prevalence differences. Cases where population-specific differences in genetic risk did not align with observed trait (disease) prevalence point to the importance of environmental contributions to phenotypic variance, for both infectious and complex, common disease. The results reported here are distributed via a web-based platform for searching trait-associated variants and PRS divergence levels at http://map.chocogen.com (last accessed August 12, 2020).
AbstractRepeated emergence of similar adaptations is often explained by parallel evolution of underlying genes. However, evidence of parallel evolution at the amino acid level is limited. When the analyzed species are highly divergent, this can be due to epistatic interactions underlying the dynamic nature of the amino acid preferences: the same amino acid substitution may have different phenotypic effects on different genetic backgrounds. Distantly related species also often inhabit radically different environments, which makes the emergence of parallel adaptations less likely. Here, we hypothesize that parallel molecular adaptations are more prevalent between closely related species. We analyze the rate of parallel evolution in genome-size sets of orthologous genes in three groups of species with widely ranging levels of divergence: 46 species of the relatively recent Lake Baikal amphipod radiation, a species flock of very closely related cichlids, and a set of significantly more divergent vertebrates. Strikingly, in genes of amphipods, the rate of parallel substitutions at nonsynonymous sites exceeded that at synonymous sites, suggesting rampant selection driving parallel adaptation. At sites of parallel substitutions, the intraspecies polymorphism is low, suggesting that parallelism has been driven by positive selection and is therefore adaptive. By contrast, in cichlids, the rate of nonsynonymous parallel evolution was similar to that at synonymous sites, whereas in vertebrates, this rate was lower than that at synonymous sites, indicating that in these groups of species, parallel substitutions are mainly fixed by drift.