Long-term evolution experiments
All populations originated from two ancestral strains with different initial mutation-rates. Ancestral wild-type (WT) are descendants of PFM2, a prototrophic derivative of E. coli K-12 str. MG1655, while ancestral mismatch repair knockouts (MMR-) are descendants of PFM5 (PFM2, ∆mutL)24. The long-term cultures consist of three transfer-size treatments, differing by 1 mL (large transfer-size treatment, L1), 1uL (medium transfer-size treatment, M1) and 1nL (small transfer-size treatment, S1) daily transfer and maintained in 10 mL LB broth shaking at 175 rpm at 37 °C. In addition, the large transfer-size treatment was run with 10- (L10) and 100-day (L100) replenishment cycles to yield different starvation regimes. At each transfer condition, WT and MMR- were both applied with four parallel evolution experiments, generating 5 × 2 × 4 = 40 populations. To maintain and trace the historical record of populations, 1 mL of copies of cultures were collected at days 0, 90 (or 100 for 100-day populations), 200, and then every 100 days, frozen with liquid nitrogen, and stored at −80 °C for sequencing. In addition, we froze samples every 20, 30, or 40 days for 1-day and 10-day populations until day 400 to revive populations in the case of contamination or extinction. Specifically, to avoid a case of a whole population extinction in 100-day populations, we maintained 100-day populations in triplicate (1 main culture and 2 backup cultures) with only the main culture being used to seed new triplicate cultures every 100 days. One backup culture would replace the main culture if it extincted and the experiment would continue. In this study, whole-population extinction events only occurred during the first 200-days. During the experimental evolution, L10 and L100 populations did show smaller population sizes before the transfer than L1 populations, but the difference is small22.
Effective population sizes for L1, M1, and S1 populations
During the experimental evolution, we observed that the carrying capacities of the populations are always ∼1010 per tube. Theoretically, the effective population size is the harmonic mean of fluctuated census population sizes44,45. Therefore, for L1 populations, the dilution factor = 10−1, Ne is about
$$frac{5}{tfrac{1}{{10}^{9}}+tfrac{1}{2times {10}^{9}}+tfrac{1}{4times {10}^{9}}+tfrac{1}{8times {10}^{9}}+tfrac{1}{{10}^{10}}}, approx ,2.5times {10}^{9}$$
Similarly, for M1 populations, the dilution factor = 10−4, and Ne is about 7.5 × 106. For S1 populations, the dilution factor = 10−7, and Ne is ∼1.3 × 103.
Rates of genomic evolution
We extracted DNA of long-term evolved populations from cultures every 100 days until day 90022, and sequenced as paired-end reads to a target depth of 100× on an Illumina HiSeq 2500 or an Illumina NextSeq 500. Available sequencing profiles were identified using Breseq46 from trimmomatic47 trimmed reads. For each population, at each sample point, we quantified the genomic divergence by summation of derived allele frequencies of all detected SNM rates. Then, we performed linear regression by the function “lm” in R with formula “genomic divergence ~ guaranteed generations + 0” for estimating the rate of genomic evolution. The estimated generations per day are 3.3, 13.3, 23.3, 0.33, and 0.25 for L1, M1, S1, L10, and L100 populations, respectively22,25.
Mutation accumulation (MA) procedure
We labelled MMR- parallel populations as A or B, and WT populations as C, D, E or F. After 1000 days of evolution, each combination of transfer conditions and initial mutation rate was assessed in quadruplicate with two clones assessed from each of two representative populations, except for WT L10, where two additional populations were also investigated because they exhibited extremely high evolutionary rates and fluctuation tests suggested these populations may have evolved uniquely high mutation rates (Supplementary Fig. 1). In total, we focused on 44 experimental clones (2 long term evolution initial mutation rates × 5 transfer conditions × 2 populations × 2 clones + 4 extra WT L10 clones = 44) and 4 ancestral clones (2 WT and 2 MMR-). Each clone was cultured on LB (Miller) agar plates, and 50 distinct colonies were obtained to form parallel MA lines. These MA lines were streaked onto new plates from random single colonies daily for up to 60 days; this transfer method strongly bottlenecks the MA lines so that nearly all mutations are fixed though genetic drift, regardless of their fitness effects24. The offspring of three clones (L10-A1, L10-A2 and L100-C1) grew so slowly that transfers were done every other day and had to be terminated early (30, 31, and 38 days).
The number of generations (G) per passage were estimated by CFU counts of dissolved and diluted colonies of focal plates as follow:
$$G ,=frac{left({G}_{0}+{G}_{30}right)times a+left({G}_{30}+{G}_{N}right)times b}{N},, (a=15; {b}=N/2-15),$$
(1)
where G0, G30 and GN are number of generations respectively estimated at days 0, 30, and the last day of the entire MA experiment. Typically G ranges from 18 to 24. N is number of days of the entire MA experiment.
At the end of MA procedure, WGS was applied to these 48 evolved and ancestral MA progenitor clones and ∼1200 of their derived MA lines (∼25 MAs per clone) after propagation for an average of ∼1500 generations.
Genomic DNA isolation and high-throughput sequencing
Genomic DNA samples of MA ancestral clones and MA lines were extracted using the Wizard DNeasy UltraClean Microbial Kit (Qiagen), Wizard Genomic DNA Purification Kit (Promega), or the MagMax DNA Multisample Ultra 2.0 Kit (Applied Biosystems). Fresh DNA from MA lines or progenitor clones was submitted to the Beijing Genomics Institution (BGI-Hongkong) and sequenced on a DNBseq-G400 platform.
Variants calling
Using trimmomatic47, we removed residual adapters and trimmed low quality sequences. Cleaned reads were then mapped to the reference genome of E. coli K-12 str. MG1655 (NC_000913.3) with BWA48. Only paired reads were allowed to map. After the mapping, 15 MA lines were discarded from the following analysis because of their low depths (<50×) or low coverages (<30%), which are considered as contaminated lines during the MA experiments. In the end, on average, 99.6% of genomic positions were covered by the high quality reads (Q30), displaying an average depth of 149×.
We then removed duplicated reads from filtered alignments, and called candidate SNM and SIM using GATK449. Base quality scores were corrected by corresponding MA progenitor. The candidate variants with read depths <4 or phred scores <10 were discarded. Ancestral variants were also discarded; the ancestral variants were identified from the differences between the reference genome and the MA progenitors, which are expected to be supported by at least four reads with phred scores higher than ten. In addition, only the indels <= 4 bp (SIM) were retained. The variants that were consensus among MA lines were also discarded; consensus SNM calls (or SIM) were identified if present in more than two (or four) MA lines with the same MA ancestor. To detect structural variations (SVs), including deletions, duplications, inversions, and insertions, we used CNVnator50, Lumpy51 and BreakDancer52. The redundant predictions were discarded. Among the maximum 500 bp upstream and downstream extended regions of the predicted SVs, we first removed the redundant predictions of CNVnator from the Lumpy predictions. We then removed the redundant predictions of Breakdancer if they are present in the Lumpy/CNVnator predictions. For each MA experiment, we removed SVs that were either called in the MA ancestral clone or found in at least three parallel MA lines, allowing a maximum 500 bp extension or substraction at both ends.
Estimating mutation rates
SNM and SIM rates (per genomic site per generation) were calculated for each MA line by the equation:
$$u ,=, frac{m}{Gtimes n},$$
(2)
where u denotes the mutation rate, m represents the number of SNM or SIM observed, G is the total number of generations elapsed per lineage, and n is the number of ancestral sites covered by sequencing reads with high quality.
SVM rates (per genome per generation) were calculated for each MA line by the equation:
$$u ,=, frac{m}{G},$$
(3)
where m represents the observed SV numbers.
Excluding outliers with odd mutation rates
Before analyzing SNM rates, we removed several MA lines with odd rates using a cutoff of absolute Z-score > 2.5:
$$Z,=, frac{u-U}{S},$$
(4)
where u is the mutation rate of a MA line, U and S are the mean and standard deviation of mutation rates for all lines related to the same MA ancestor. We also adopted this cutoff for filtering odd SIM and SVM rates in analysis. All filtered mutations in MA experiment are listed in Supplementary Data 2–4.
Mutation trajectories
To find candidate mutations that associate with mutation-rate evolution, we investigated nonsynonymous substitutions and indels in MA ancestors, including variants called from ancestral clones and consensus calls among MA lines with the same MA ancestor (see above mentioned). Here, we focused on genes involving DNA replication and repair processes, including GO-terms of “0006298,” “0006281,” “0006260,” “0006285,” “0006284,” “0006307,” “0000731,” “0007059,” “0051103,” “0006310,” “0009432,” “0045004,” “0006261,” and “0006271” (http://geneontology.org/). The temporal trajectories of allele frequencies for these putative mutations were directly from the detected mutations of previous longitudinal sequencing data following previous methodology (Supplementary Fig. 4)22.
Polymerase chain reactions (PCR)
To validate mutational trajectories of candidate hypermutators in mutL or mutS, we designed primers targeting the mutated regions using Primer-BLAST (https://www.ncbi.nlm.nih.gov/tools/primer-blast/). The targeted PCR products are 900 and 700 bp for mutS and mutL, respectively. Below are the primer sequences used.
mutS-F: 5’-TCTGGCACGTCTGGCTTTAC-3’;
mutS-R: 5’-AGATGCGATCGATAGGTCCAA-3’;
mutL-F: 5’-TCAGGTCTTACCGCCACAAC-3’;
mutL-R: 5’-GCCAGCGCTTGTTCAAGAAA-3’.
To estimate the allele frequency of candidate hypermutators during the experimental evolution, we streaked frozen stocks for WT-L10-C, D, E, and F onto LB plates and cultured overnight at 37 °C. We used frozen samples collected at days 0, 90, 180, 200, 230, 300 and then every 100 days. To prepare samples for PCR, we picked 20 single colonies from each stock randomly and then transferred each of them to a 1.5 mL tube containing 0.6 mL of LB broth for overnight culture. Following overnight incubation, 1 uL of each revived colony was taken to perform PCR using Q5 High-Fidelity 2X Master Mix (NEB). The PCR conditions were starting with 1 min at 98 °C for denaturation, then subjected to 30 cycles of amplification (10 s at 98 °C for denaturation, 30 s for annealing, and 1 min at 72 °C for extension). The annealing temperatures are 66 °C for the mutS primers and 68 °C for the mutL primers. At the end, a final extension was performed 2 min at 72 °C. PCR products were then purified and collected by GeneJET PCR Purification Kit (Thermo Fisher Scientific). Purified DNAs were submitted to the KED Genomics Core at Arizona State University for Sanger sequencing.
By testing PCR products of 20 single colonies from each frozen stock, we found that several colonies from as early as day 300 of WT-L10-C and WT-L10-F, day 230 of WT-L10-D, and day 180 of WT-L10-E contain hypermutators. Thus, the occurrence times of hypermutator candidates were estimated to be between days 230–300, 200–230, 90–180, and 230–300 for WT-L10-C, D, E, and F, respectively.
Fluctuation test
Luria-Delbrück fluctuation tests can be used for the determination of the spontaneous mutation rate using a known molecular marker, such as antibiotic resistance53. To study whether the candidate hypermutators affect mutation rates, we used the fluctuation test with rifampicin resistance for clones isolated from L10-C, D, E and F at days 90, 200, 300, and 400, as well as the MA ancestor. Whether the clones with or without the candidate was examined by PCR as mentioned above. Three independent clones with the hypermutator and/or three independent clones without the hypermutator were used for each L10 population at each time-point.
During the fluctuation test, each targeted clone was first overnight cultured, and 44 replicates of the targeted clones were then transferred (1: 107 dilution) and grown for 24–48 h in a 96-well culture plate containing 1 mL of LB broth per well. Four were diluted and spread on LB agar for total cell counts; 40 were plated on LB + Rifampin agar (100 mg/L) for resistance selection. The number of resistant mutants in CFU/mL were converted to an estimated mutation rate by the “newton.LD” function in the R package “rSalvador”54.
Mutation patterns in populations
In addition to the six-class SNM spectra (C > A, C > G, C > T, T > A, T > C, and T > G), we also reported the mutation contexts with 96 classes constituted by six SNM classes, plus the flanking 5’ and 3’ A, G, C, and T bases. To further extract orthogonal mutation patterns from the evolved populations, we constructed mutation tables with 96-class mutation contexts and identified the “mutation patterns (MPs)” using the non-negative matrix factorization (NMF) function in the R package “MutationalPatterns”55. Following the recommendation in the manual, to determine the optimal number of MPs, we estimated cophenetic correlation coefficients with different MP numbers. We choose the smallest values for which cophenetic correlation coefficient starts decreasing as the best number of MPs. We finally obtained three MPs related to MMR-defect populations and one MP for WT populations.
MMR-defect mutation patterns in other species
We obtained mutations from several MMR deficiency related MA studies, including Bacillus subtilis NCIB 361056 (NZ_CM000488.1), Corynebacterium glutamicum ATCC 1303257 (NC_003450.3), Deinococcus radiodurans R158,59 (NC_001263.1, NC_001264.1), Mycobacterium smegmatis MC2 15560 (NC_018289.1), Pseudomonas fluorescens SBW2559 (NC_012660.1), Vibrio Cholerae 2740-8061 (NZ_CP016324.1, NZ_CP016325.1) and N1696162 (NC_002505.1, NC_002506.1), and Vibrio Fischeri ES11461 (NC_006840.2, NC_006841.2). All bacterial genomes were downloaded from NCBI and used to estimate 96-class spectra. The MMR-defect mutation spectrum of Caenorhabditis elegans was obtained from ref .63. The mutation data for Arabidopsis thaliana MMR- lines were obtained from ref. 64, and we downloaded the reference genome from The Arabidopsis Information Resource (https://www.arabidopsis.org/) to estimate 96-class spectra.
Cancer mutation patterns
A number of Single Base Substitution (SBS) patterns extracted from human cancer samples were obtained from COSMIC (https://cancer.sanger.ac.uk/cosmic/signatures), seven of which are associated with defective DNA mismatch repair: SBS6, SBS14, SBS15, SBS20, SBS21, SBS26, and SBS4428. These mutation patterns are involved in 67, 33, 220, 83, 13, 37, and 246 cancer genomes across 13, 4, 12, 7, 6, 11, and 11 cancer types, respectively.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.