Sampling the radiation
To understand the phylogenetic relationships between Alpine whitefish, we carried out whole-genome resequencing on 96 previously collected whitefish (with associated phenotypic measurements including standard length and gill-raker counts; collected in accordance with permits issued by the cantons of Zurich (ZH128/15), Bern (BE68/15), and Lucerne (LU04/14); these fish were previously caught within the context of an assessment of the Swiss fish fauna, Projet Lac64, and documentation of endemic whitefish species in Switzerland31), in addition to three previously sequenced whitefish (discussed below). Fish were selected from lakes Constance, Lucerne, Thun, Brienz, Biel, Neuchâtel, Zurich, and Walen which make up five separate lake systems (Constance, Lucerne, Thun/Brienz, Biel/Neuchâtel, and Walen/Zurich; Fig. 1a and Supplementary Data 1). Individuals from each whitefish species within each lake, representing the phenotypic diversity of Swiss Alpine whitefish, were sampled, including three species from Lake Constance, six from Lake Lucerne, six from Lake Thun, and four from Lake Brienz, two from Lake Biel, two from Lake Neuchâtel, three in Lake Zurich, and two in Lake Walen. In addition to these Swiss whitefish a number of outgroup individuals were also sampled, including two Coregonus albula (European cisco), and a number of members of the European C. lavaretus species complex including two Norwegian C. lavaretus, and four samples of North German whitefish thought to be the closest relatives of the Alpine whitefish radiation members: two German C. holsatus (from Lake Drewitz) and two German C. maraenoides (from Lake Schaal).
The whitefish species we sampled spanned a range of six different ecomorphs that differ in their morphology, including body length, depth, and feeding morphology, as well as spawning depth and time, and diet (sampled species in each lake and the ecomorphs to which these species belong were plotted according to their distribution; Fig. 1a, b). Species in this study were assigned to each ecomorph based on their phenotype by whitefish taxonomic experts and co-authors Oliver M. Selz and Ole Seehausen. The ‘Balchen’ whitefish ecomorph is characterised by large-bodied shallow spawning species which predominantly feed on benthic macroinvertebrates. Conversely, the ‘Albeli’ ecomorph is characterised by small species which spawn deeper (intermediate depth to very deep) and feed on zooplankton in the pelagic zone of lakes. The third ecomorph is the ‘Felchen’ type, which grow to larger sizes than the ‘Albeli’ ecomorph but not as large as the ‘Balchen’, feed on zooplankton, and feed and spawn from an intermediate depth to very deep. In addition to these three widespread ecomorphs, three less-widespread ecomorphs occur in three or fewer lake systems. These include two variations of profundal ecomorphs, a ‘benthic-profundal’ species, C. profundus from Lake Thun (an additional, now extinct, ‘benthic-profundal’ species C. gutturosus was also once present in Lake Constance), which have few gill-rakers but spawn at intermediate to great depth and a ‘pelagic-profundal’ species, C. nobilis in Lake Lucerne, which spawn deep but have a high number of gill-rakers. The final ecomorph we sampled were the ‘large-pelagic’, and included the species C. wartmanni from Lake Constance, C. acrinasus from Lake Thun, and C. suspensus from Lake Lucerne, which, although they are large-bodied, have a high gill-raker count and feed predominantly on zooplankton. C. wartmanni has a well-described pelagic spawning behaviour, while the other two ‘large-pelagic’ species are so far less well characterised in that respect. A full breakdown of the fish included in this study, their gill-raker counts, standard-length measurements, and the ecomorph assignment of each species can be seen in Supplementary Data 1.
DNA for each individual was extracted from either fin or muscle tissue from each fish that had been stored at −80 °C using Qiagen DNeasy extraction columns, quantified using a Qubit 2.0, and run on a 1% agarose gel to assess DNA quality. DNA was then sequenced on the Illumina NovaSeq 6000 with a 550 bp insert size (Next Generation Sequencing Platform, University of Bern). To this data, we added Illumina HiSeq 3000 data sequenced from one Coregonus sp. “Balchen” (ENA accession: GCA_902810595.1; now re-classified as C. steinmanni31) from Lake Thun (Switzerland) that was previously used to polish and validate the Alpine whitefish reference genome assembly47.
Genotyping and loci filtering
After sequencing, all fastq files were quality checked using FastQC65 before being mapped to the Coregonus sp. “Balchen” Alpine whitefish reference genome (ENA accession: GCA_902810595.1; ref. 47; with additional un-scaffolded contigs (https://datadryad.org/stash/dataset/doi:10.5061/dryad.xd2547ddf;66) to ensure accurate mapping) using bwa-mem v.0.7.1767 changing the ‘r’ setting to 1 to allow more accurate, albeit more time-consuming, alignment. Mosdepth v.0.2.868 was used to calculate mean sequencing coverage from the BAM files for each of the 97 individuals which ranged from 15.32x to 41.69x (an additional two individuals were added to this dataset after genotype calling discussed below). Picard-tools (Version 2.20.2; http://broadinstitute.github.io/picard/) was then used to mark duplicate reads (MarkDuplicates), fix mate information, (FixMateInformation) and replace read groups (AddOrReplaceReadGroups). Genotypes were then called across the 40 chromosome-scale scaffolds included in the Coregonus sp. “Balchen” Alpine whitefish assembly (ENA accession: GCA_902810595.1; ref. 47) using HaplotypeCaller in GATK v.188.8.131.52 using a minimum mapping quality filter of 30. The resulting VCF file was then filtered using vcftools v.0.1.1470 to remove indels (–remove-indels) and include biallelic loci (–min-alleles 2 –max-alleles 2) which have a minor allele count > 3 (–mac 3), no missing data (–max-missing 1), a minimum depth > 3 (–min-meanDP 3 –minDP 3), a maximum depth < 50 (–max-meanDP 50 –maxDP 50), and a minimum quality of 30 (–minQ 30), to leave 16,926,710 SNPs. Loci that fell within potentially collapsed regions of the genome assembly (as identified in47) were removed using BEDTools v.2.28.0 (ref. 71; bedtools subtract) and any loci with duplicate IDs which were identified with PLINK v.1.9072 were removed with VCFtools70 resulting in 15,841,979 SNPs. To increase our sampling of the species C. macrophthalmus from Lake Constance from one individual to three, we added sequencing data from an additional two individuals (previously sequenced by Frei et al.73; Supplementary Data 1). To avoid the downstream impacts of combining sequencing data from different runs (which can result from different biased nucleotide calls and introduce erroneous signals of genetic differentiation; as outlined in ref. 74), we mapped these two samples as above (resulting in a mean genome-wide coverage of 9.32× and 16.58×) and called genotypes again for all samples (including the two additional C. macrophthalmus individuals) at each of the original 15,841,979 SNP positions. Following this genotype calling, which resulted in 15,521,925 SNPs, SNP filtering was repeated as before, leaving 14,313,952 SNPs with no missing data across the dataset of 99 individuals.
PCA, phylogenetics and admixture analysis
PLINK v.1.9072 was used to produce a genomic PCA of all 91 Alpine whitefish genomes with the aim of understanding how each of the individuals, species, and lakes were differentiated from one another. All eight outgroup individuals were removed from the full dataset of leaving only Alpine whitefish from the five lake systems. Loci were then filtered based on linkage disequilibrium using PLINK v.1.90 (ref. 72; 50 kb windows with a step size of 10 bp and filtering for an R2 > 0.1). This resulted in 1,133,255 loci which were processed by PLINK to calculate eigenvector distances between individuals. PCAs were plotted using R75.
We took a phylogenetic approach to understand the relationships between each of the Alpine whitefish species we sampled. First, the full VCF file was thinned to include only SNPs which were 500 bp apart using VCFtools (ref. 70;–thin 500). The thinned SNP dataset containing 2,039,744 SNPs was then filtered using bcftools (part of SAMtools v.1.876; bcftools view -i ‘COUNT(GT = “RR”) > 0 & COUNT(GT = “AA”) > 0’) to leave only SNPs that were present at least once in our dataset as homozygous for the reference allele, and homozygous for the alternative allele, as required by RAxML. This reduced the dataset to 1,692,559 SNPs (specifically compiled for RAxML and not used in any other analysis). This filtered VCF file was then converted to a PHYLIP file using vcf2phylip v.277 before RAxML v.8.2.1278 was run with the ASC_GTRGAMMA substitution model (-m ASC_GTRGAMMA–asc-corr=lewis, -k -f a) with 100 bootstraps and specifying the C. albula samples as outgroups to produce the maximum likelihood tree. The phylogenetic tree, excluding the long node to C. albula, was then plotted using Figtree v.1.4.479.
The same linkage-pruned dataset of 1,133,255 SNPs that was used to produce the full genomic PCA was used to calculate admixture proportions. The.bed file from PLINK resulting from the PCA was analysed using admixture v.1.3.080 to estimate admixture for values of K between 2 and 14, specifying 20 cross-validations (–cv=20). As the CV error increased with the range of K that we tested (Supplementary Fig. 2, we selected the K which helped to resolve the lake systems and deep clade splits best, K = 7, and plotted admixture barplots in R (additional admixture plots for K = 2-K = 10 can be found in Supplementary Fig. 3).
To identify the degree of genetic parallelism between ‘Balchen’ and ‘Albeli’ whitefish species from across the radiation, we subsetted 24 individuals representing three ‘Balchen’ species and three ‘Albeli’ species from four of the lakes we sampled: Lake Brienz, Lake Lucerne, Lake Walen and Lake Neuchâtel out of our full 99 individual dataset. ‘Albeli’ species included C. candidus, C. albellus, C. muelleri and C. heglingus (for lakes Neuchâtel, Brienz, Lucerne, and Walen), and ‘Balchen’ species included C. palea, C. alpinus, C. litoralis, and C. duplex (for lakes Neuchâtel, Brienz, Lucerne, and Walen). To first confirm the independent evolution of each ‘Balchen’ and ‘Albeli’ species pair within each of these four lakes, as indicated by the phylogeny, F4 statistics were calculated across a four-taxon tree (as used in ref. 81), allowing us to estimate the degree of correlated allele frequencies between ‘Balchen’ and ‘Albeli’ individuals within and between lake systems. First, loci were pruned based on linkage disequilibrium using the script ldPruning.sh (https://github.com/joanam/scripts/raw/master/ldPruning.sh), resulting in 1,315,105 SNPs. Then the script plink2treemix.py (from https://speciationgenomics.github.io/Treemix/) was used to convert data into the treemix format before F4 calculations were implemented using f4.py (https://raw.githubusercontent.com/mmatschiner/F4/master/f4.py). We calculated F4 for two different topologies, placing ‘Balchen’ and ‘Albeli’ species from all pairwise combinations of the four lakes on a four-taxon tree. In the first four-taxon tree ((A,B),(C,D)) we placed sympatric ‘Balchen’ and ‘Albeli’ species from a first lake as A and B, and ‘Balchen’ and ‘Albeli’ species from a second lake as C and D. In this context, the resulting F4 (F41) represents the correlated allele frequency between A or B and C or D that would indicate introgression, or in our case, representative of a single evolution of ‘Balchen’ and ‘Albeli’ followed by sorting into lakes. We then calculated F4 where allopatric ‘Balchen’ species from two different lakes were placed as A and B and allopatric ‘Albeli’ species from the same two lakes as C and D (F42). F4 in this second arrangement represents the correlated allele frequencies of sympatric species, again between A or B and C or D. Where F41 < F42 there is stronger support for the scenario in which ‘Balchen’ and ‘Albeli’ are truly sympatric species pairs, and therefore independently originated across lakes rather than for a single origin of the two ecomorphs.
To explore whether ‘Balchen’ and ‘Albeli’ species of whitefish show a parallel genetic basis of evolution in different lakes, regardless of lake structure, we used the cluster separation score (CSS; introduced in ref. 42 and the therein reported formula corrected in ref. 43), a measure of genomic differentiation between individuals assigned to two groups. Here we assigned individuals from the four ‘Balchen’ species to one group and those from the four ‘Albeli’ species to another. When calculated in windows of the genome, the CSS score quantifies the genetic distance between these ecomorph groups relative to the overall genetic variance in this particular window42. We calculated CSS in 50 kb windows using a custom R script (https://github.com/marqueda/PopGenCode/blob/master/CSSm.R) where the 24 whitefish individuals were split into two groups according to ecomorph (i.e., ‘Balchen’ or ‘Albeli’). A stratified permutation test which reshuffles the assignment of individuals to each of the ecomorph groups within each lake to test the statistical significance of the CSS score for each window, whilst maintaining population structure, was then carried out 100,000 times using a custom R script (https://github.com/marqueda/PopGenCode/blob/master/CSSm_permutation.R). Windows with fewer than 24 SNPs were removed (in accordance with ref. 43) and outlier windows were identified based on a false discovery rate adjusted P-value cut-off of P < 0.01, using ‘fdr.level = 0.01’ in the R package ‘qvalue’ (ref. 82; similarly to ref. 43). The median CSS score across all 34,102 windows with ≥24 SNPs was 0.0083 and the median CSS score across all 1659 outlier windows was 0.0973. A PCA was then produced for all 91 Alpine whitefish (excluding our outgroup samples) using PLINK v.1.90 starting with only the 690,101 SNPs that fell within these 1659 CSS outlier windows. Filtering for linkage disequilibrium was carried out as above, resulting in 56,127 SNPs that were then used to determine the genomic variation between whitefish species within these genomic regions. Correlations between PC1, which separated out species, and traits (gill-raker count and standard length) were carried out using the linear model function (lm) in R.
To confirm that this pattern is not simply driven only by the inclusion of individuals used to define the outlier CSS windows, we produced a second PCA as above but excluding the original 24 individuals prior to the calculation of the PCA (Supplementary Fig. 7). In this instance, CSS PC1 was still significantly correlated with standard length (R2 = 0.2081, P = 1.183 × 10−4) and gill-raker count (R2 = 0.1135, P = 5.667 × 10−3 when including the outlier C. profundus; R2 = 0.2201, P = 1.05 × 10−4 when excluding the outlier C. profundus), albeit, and unsurprisingly, to a lesser extent.
We also identified genes that were annotated on chromosome-scale scaffolds in the whitefish reference genome47 which overlapped with the 1659 outlier CSS outlier windows by a minimum of 1 bp using ‘bedtools intersect’71. And then used the topGO package v.2.46.083 in R to identify significantly enriched gene ontology terms (P-values < 0.05 according to both the ‘weight’ and ‘elim’ algorithms) associated with these outlier windows (Supplementary Data 2). A non-parametric Mann–Whitney U test showed that the 1702 genes that overlapped with our 1659 CSS outlier windows were significantly longer than non-overlapping genes (40,993 of the full 42,695 gene set; P-value = 2.775 × 10−8), however, the difference was small in absolute terms (means between the groups varied from 13,648 bp in outlier genes to 11,638 bp in non-outlier genes; Supplementary Fig. 5).
We then calculated pairwise genome-wide relative divergence between sympatric ‘Balchen’ and ‘Albeli’ species for each lake separately. Weir and Cockerham FST was calculated between ‘Balchen’ and ‘Albeli’ species in each lake after filtering out loci which had a minor allele count <1 between the two using vcftools v.0.1.14 (ref. 70;–weir-fst –mac 1) specifying a window size of 50 kb. Windows with fewer than 10 SNPs were removed. The mean FST of all windows along the genome was then calculated for each species pair to determine the total extent of differentiation between sympatric ‘Balchen’ and ‘Albeli’ species. To identify regions of the genome which underpin the phenotypic contrast between ecomorphs we identified the top percentile of most differentiated windows in each lake and species pair using R and those outlier windows which were shared between two or more species pairs were noted. Since this 1% FST value cut-off was used to identify outlier windows in each of the four independent FST scans, by chance we would expect 20.7234 windows to be shared across two lakes (0.012 × 34,539 windows × 6 combinations of two lakes), 0.1381 across three lakes (0.013 × 34,539 windows × 4 combinations of three lakes), and 0.0003 across four lakes (0.014 × 34,539 windows × 1 combination of 4 lakes). As with CSS outlier windows, genes that overlapped with the top 1% outlier windows from each of the four species pairs were identified using ‘bedtools intersect’71. KEGG orthology was identified for 28,673 of the 46,397 annotated genes in the whitefish Coregonus sp. “Balchen” assembly using BlastKOALA (https://www.kegg.jp/blastkoala/; using the taxon id 861768 and selecting the genus_eukaryotes database) and as a result, the genes and KEGG orthology terms that overlapped with each of the FST outlier windows, and genes overlapping with these windows, for each of the four species-pair comparisons were identified. For each species pair, the KEGG gene pathways that were associated with KEGG orthology terms associated with lake-specific FST outlier windows were also identified using the KEGG orthology database (https://www.kegg.jp/kegg/ko.html). The genes, KEGG orthology terms and KEGG gene pathways that were associated with each species-pair-specific set of FST outlier windows were then compared to identify any features that were associated with ‘Balchen’-‘Albeli’ differentiation across all lake systems. Full protein sequences for genes associated with the shared KEGG orthology terms K07526 (augustus_masked-PGA_scaffold11__203_contigs__length_63881516-processed-gene-394.0 and maker-PGA_scaffold9__196_contigs__length_60468309-snap-gene-345.2) and K12959 (maker-PGA_scaffold11__203_contigs__length_63881516-snap-gene-396.10 and maker-PGA_scaffold9__196_contigs__length_60468309-snap-gene-342.13) were BLASTed using blastp (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) and the resulting best hits, those with the highest E-value and an annotated gene name in a salmonid species, were noted (Supplementary Data 3). As with CSS, a non-parametric Mann–Whitney U test showed that the 1130 genes that overlapped with FST outlier windows across the four comparisons were significantly longer than non-overlapping genes (41,565 of the full 42,695 gene set; P-value = 2.693 × 10−6). Again, the absolute difference between groups was small (means between the groups varied from 14,084 bp in outlier genes to 11,654 bp in non-outlier genes; Supplementary Fig. 5).
Genome-wide association mapping
To identify the genetic basis of gill-raker variation across the Alpine whitefish radiation, we used a mixed model approach implemented in EMMAX v.2012021084 (as in ref. 14). First, EMMAX was used to produce a Balding-Nichols kinship matrix between all 90 Alpine whitefish samples for which we had gill-raker counts using ‘emmax-kin’ using only the 9,120,498 SNPs that were polymorphic within the Alpine whitefish radiation. We then used EMMAX to calculate the association of each SNP marker with gill-raker count. Two significance thresholds were determined. A strict Bonferroni multiple-testing P-value threshold was calculated using the total number of SNPs tested: −log10(0.05/9120498) = 8.26, in addition to an LD-considerate threshold of −log10(0.05/4536915) = 7.96, which was calculated by removing linked markers (R2 > 0.95) in 50 kb sliding windows across the genome using PLINK72. One SNP on WFS23 had an association above the LD-considerate threshold and the allele frequencies within each of the six ecomorph groups was calculated for this SNP using vcftools –freq on each subset of ecomorphs separately (Fig. 2e; in addition to each ecomorph within each lake separately; Supplementary Fig. 9). The gene that overlapped with this SNP was identified with BEDTools71 and the full protein sequence from the gene that overlapped with the SNP (maker-PGA_scaffold22__199_contigs__length_52020451-snap-gene-302.9) was BLASTed using Ensembl TBLASTN against the Atlantic Salmon, Rainbow Trout, Brown Trout and Coho Salmon genomes, hitting with high confidence against the ectodysplasin-A receptor (edar) gene (E-value 1e-20; ID% 97.62 in Brown Trout fSalTru1.1; ENSSTUG00000036900 and E-value 7e-20; ID% 100 in Atlantic Salmon ICSASG_v2; ENSSSAG00000053655). The variance in gill-raker count across our samples explained by the most significantly associated SNP was calculated using the equation: PVE = ((2*(beta2)*MAF*(1-MAF))/(2*(beta2)*MAF*(1-MAF) + (se_beta2)*2*N*MAF*(1-MAF))) where N = the sample size (90), se_beta = the standard error of effect size of the SNP, beta = SNP effect size, and MAF = SNP minor allele frequency (from the Supplementary Information S1 associated with ref. 85).
This EMMAX association mapping was repeated using sex as a binary trait for 90 Alpine whitefish individuals. The most substantial associated peak was observed on WFS04. As above, genes that overlapped with these SNPs were identified with BEDTools71 and the protein sequence from the single gene that overlapped with this peak of SNPs on WFS04, maker-PGA_scaffold3__454_contigs__length_92224161-snap-gene-551.2, was BLASTed using Ensembl TBLASTN against the Atlantic Salmon, Rainbow Trout, Brown Trout and Coho Salmon genomes, however, no annotated genes were hit with high confidence using this approach.
To calculate excess allele-sharing across the dataset, and test whether species of the less-widespread ecomorphs with unique trait combinations (i.e., combinations of traits that contrast with the direction of correlation among combinations of traits seen in the widespread ecomorphs) have evolved as a result of gene flow between lake systems, we used the f-branch statistic fb(C) as calculated by the package Dsuite v.0.386 as in ref. 87. First, a simplified version of the full RAxML phylogenetic tree was prepared. To make use of the multiple samples per species in our dataset and get robust estimates of excess allele-sharing both within and between lake systems, collapsed nodes in the phylogenetic tree using the R package ‘ape’88 where possible. Individuals which looked like potential F1 hybrids as indicated by close to 50/50 splitting in the admixture analysis or were placed discordantly in our genome-wide PCA and phylogeny (including the C. alpinus 0129 and C. zuerichensis 099) and individuals which did not sit in the same clade as other individuals of the same species in the same lake system were kept separated so as to not skew species-wide estimates of excess allele-sharing from single, potentially recent introgression events, and thus not included in node collapsing. Nodes were then collapsed, and the individuals within that clade assigned as a single tree tip, if all individuals within the clade belonged to the same species or species of the same ecomorph from a single lake or, where possible, single lake system (excluding potential F1 individuals). All outgroup individuals in the tree were collapsed into a single outgroup tip. Dsuite86 was then run specifying Dtrios, DtriosCombine, and finally f-branch, each time specifying the collapsed tree. Dsuite was used to first calculate f4 admixture ratios f(A,B;C,O) across the dataset where combinations of taxa fit the necessary relationship ((A, B), C) in our phylogenetic tree, with the 8 non-Alpine whitefish set as the outgroup. The f-branch statistic fb(C) was then calculated from these f4 statistics using the phylogenetic tree to identify excess allele-sharing between any taxa into any other taxon or node in the phylogeny. fb(C) is particularly powerful for complex systems such as the Alpine whitefish radiation since, unlike Patterson’s D, it provides branch-specific estimates of excess allele-sharing, meaning that specific instances of gene flow do not skew excess allele-sharing estimates across multiple nodes or branches, providing a phylogenetically-guided and robust estimate of excess allele-sharing87. F-branch statistics plotted in Fig. 3 are provided in Supplementary Data 4 along with a version of the figure highlighting within-lake introgression in Supplementary Fig. 10. Significant instances of excess allele-sharing were identified by calculating a stringent Bonferroni multiple-testing significance threshold, which involved dividing the p-value threshold of P < 0.01 by the number of cells in the f-branch matrix for which fb(C) could be calculated (1910) and converting this to a Z-score using R. All cells with Z-scores higher than this threshold i.e., Z > 4.41 represented significant excess allele-sharing between taxa in the tree and were indicated as such.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
#Genomic #architecture #adaptive #radiation #hybridization #Alpine #whitefish #Nature #Communications