Data collection

Metagenomic project information was collected from the MGnify metagenomic database31. Currently (September 2021), microbiome data (sequence, taxonomic, and functional information, etc.) of 325,323 environmental samples can be found in this database. Often, microbes from similar ecological communities have been studied by different groups at different times and locations. Therefore, multiple projects were deposited for most of the microbial habitats. Here we considered the projects for which at least 5 samples are deposited (for reliable statistical tests) in this database and for any habitat type if we found multiple projects were deposited by the same group of investigators we considered one project. We start our analysis by considering 5 samples randomly from 90 metagenomic projects. The final classification of the selected samples was made considering sample information from various sources such as BioSamples database59, Sequence Read Archive (SRA) database60, and MGnify database31. Samples were classified broadly according to the ecological features of the sample material; if no such information is available from BioSamples59 or SRA database60, we considered the classification as suggested in MGnify database31. Details of the selected projects and associated sample information can be found at Zenodo (

Processing of metagenomic reads, assembly, and prediction of coding sequences

Raw sequence read files of each selected sample were retrieved from the SRA database60. These reads are then processed with a rigorous quality check pipeline (Fig. 1). Given the run id, this pipeline automatically retrieves all read files associated with a sample from the SRA database60 using fastq-dump (, processes the reads with different quality check algorithms, assembles the high-quality reads, and provides CDS sequences in the final step. Depending on the read file type, this pipeline processes the samples either in the single-end or in paired-end mode; any unpaired reads in paired-end samples are processed in single-end mode and are included in the assembly step. The steps involved in this pipeline are briefly described here. Details of the commands used for each step can be found in our supplementary software file. At the first step, sequences containing bar code, adopter, or bad-quality nucleotides are trimmed as well as low-quality sequences (quality score below 10) and short sequences (length below 50 nucleotides) are removed using the BBDuk algorithm of BBTools. BBTools is a suite of bioinformatics tools designed for the analysis of DNA and RNA next-generation sequence data developed by the Joint Genome Institute ( Filtered reads were subjected to quality check again using the Trimmomatic sequence quality filtering tool61. All samples are checked for possible host sequence contamination by aligning the reads against reference genomes using the Bowtie 2 aligner62. As a general step, reads from all samples are aligned against the human reference genome. Additionally, samples from other hosts such as mouse, cow, pig, etc. are aligned again against the reference genome of the respective organism. Bowtie2 index files of the most recent versions (as of October 2020) of all the reference host genomes (build NCBI) were downloaded from the Bowtie 2 website62. The final quality of the processed reads (single-end or paired-end) was accessed through the FastQC algorithm ( At the next step, the samples that passed the FastQC quality check were assembled using MEGAHIT63 with the version optimized for metagenomic samples. MEGAHIT is a de Bruijn graph-based assembler that does not need any reference genome and was considered to be one of the best algorithms for assembling metagenomic reads63. MEGAHIT was run either in the single-end or paired-end mode (depending on the read file type) by adjusting the “presets” option as “sensitive” (slower but more rigorous search) or “large” (recommended for large metagenomes) option depending on the size of fastq read files. The minimum length of the contigs to be reported was chosen to be 60 nucleotides. Potential protein-coding regions were predicted from the contigs using the default settings of MetaProdigal64, an updated version of the well-known gene prediction tool Prodigal65 specifically designed for the metagenomic reads. MetaProdigal was shown to identify genes in short, anonymous coding sequences with a high degree of accuracy65. Sample for which FastQC reported problems in the read quality or less than 10,000 CDS were predicted in the gene prediction step was discarded and an alternative sample was chosen from the same project and processed in a similar way. This process was repeated until we got 5 samples for each project after processing. For a few projects (13 projects) we couldn’t find 5 samples that passed the FastQC quality check step and for which at least 10,000 CDSes were predicted. For these projects, we considered less than 5 samples. For comparison of results, predicted CDS sequences of each sample were also collected from the MGnify metagenomic database31 and analyzed separately. This database used a generic pipeline (details in reference31) to process the raw reads of each sample and predicted CDS/protein sequences from the processed reads along with their functional and taxonomical analysis. The database provides predicted protein sequences of each sample but does not provide the predicted CDS sequences directly. To get the CDS sequences, we predicted CDS from the filtered reads (retrieved from the database) of each sample using the same algorithms by which the database predicted protein-coding sequences and then matched the translated protein products of the CDSes with the protein sequences provided in the database. In this way, we could retrieve protein-coding gene sequences for 407 of 422 test samples.

Calculation of codon/AA usage distances among metagenomic samples

To calculate similarity/dissimilarity among the metagenomic samples in terms of their codon usage we calculated the frequencies of 61 codons (excluding stop codons) considering all the genes in each sample. Codon frequencies were calculated following two approaches: (i) count of each codon in the sample was normalized by the total count of all the 61 codons in the sample (referred as absolute codon usage frequency or absCUFs), and (ii) count of each codon in the sample was normalized by the sum of all its synonymous codons in the sample (following Diament et al., we referred this as synonymous codon usage frequency or synCUFs66). The basic difference between these two approaches is that in the synCUFs approach normalization was done for each group of codons coding the same AA separately considering only the frequencies of synonymous codons rather than all other codons as in absCUFs. Being independent of AA usage and the abundance of synonymous codons, synCUFs can directly reflect synonymous codon usage bias. For each sample, AA frequencies were calculated by dividing the total count of each AA by the total length of all the proteins in the sample.

To estimate the similarity/dissimilarity in codon/AA usage among the selected samples we considered three well-known measures of distances (i) EU distance, (ii) ES distance66, and (iii) BC dissimilarity. The EU distance in codon usage between any two samples was calculated as follows

$$d_{pq} = sqrt {mathop {sum}limits_{i = 1}^{61} {left( {q_i – p_i} right)^2} }$$

Where qi and pi are codon frequency vectors (absCUFs or synCUFs) of the two samples, respectively.

ES distance method is based on Kullback–Leibler divergence for information gain, a measure widely used for comparing probability distributions in the field of information theory. Diament et al.66, first introduced this method to estimate codon usage distances between genes. Given the frequency vectors (absCUFs or synCUFs) of a pair of samples p and q, the ES distance between the samples was calculated as:

$$d_{_{KLleft( {p,q} right)}} = mathop {sum}limits_i^{61} {{{{mathrm{log}}}}frac{{p_i}}{{q_i}}p_i}$$

$${{{mathrm{m}}}} = frac{1}{2}left( {{{{mathrm{p}}}} + {{{mathrm{q}}}}} right)$$

$${{{mathrm{d}}}}_{{{{mathrm{ES}}}}}({{{mathrm{p}}}},{{{mathrm{q}}}}) = sqrt {d_{KL}left( {p,m} right) + d_{KL}left( {q,m} right)}$$

Where qi and pi are codon frequency vectors (absCUFs or synCUFs) of the two samples respectively and dKL is the Kullback–Leibler divergence.

BC dissimilarity is a well-known dissimilarity function, widely applied in ecological studies to compare species abundances67. BC dissimilarity in codon usage between any two samples was calculated as:

$${{{mathrm{d}}}}_{{{{mathrm{BC}}}}} = 1 – frac{{2Cpg}}{{Sp + Sq}}$$

Where Cpq is sum of minimum values of codon frequencies for each codon between the two samples and Sp and Sq are the sum of codon frequencies for all codons in the two samples respectively. BC dissimilarity ranges from 0 to 1, where higher values suggest more dissimilar codon usage.

Distances in AA usage frequencies among the samples were calculated using the same distance methods (EU, ES and BC dissimilarity).

Principal Component Analysis and ANOSIM test

Principal component analysis (PCA) helps visualization of complex multi-dimensional data into lower dimensions without losing much information68. We applied PCA to cluster samples based on their codon and AA usage and also on their functional and taxonomic abundances considering the respective frequency vectors. To access the statistical significance of the clustering, we performed ANOSIM test, a non-parametric test widely used in ecological studies to compare distances between two or more groups. As a measure of the extent of differences, ANOSIM provides “R” co-efficient (and associated P-value) which is the ratio of between groups distances to within groups distances. The R coefficient ranges from -1 to 1, where positive and higher R-values suggest greater separation between the test groups than within the test groups, while negative R-value suggests the reverse. P-value, in the test, is determined by permuting the grouping vectors considering empirical distributions of R under the null model. PCA was performed using the “prcomp” function and visualized using the “ggbiplot” library of statistical package R (version 3. 2. 1)69 and the ANOSIM test was executed using its vegan package considering the EU distance and BC dissimilarity method separately with 10,000 permutation values.

Permutation test for comparing codon/ AA usage distances

For each group of samples under different environmental categories, we calculated an index value we called as Clustering Index (CI) which is the ratio of average distance in codon or AA usage among the samples from the same environmental group to that of samples from all other environmental groups. Here we explain this method by considering an environmental group “digestive system” as an example. We first calculated the average pairwise distances in codon/ AA usage considering all possible combinations of samples where both the samples belong to the “digestive system” group. Next, we calculated average pairwise distances considering all possible combinations of samples where one sample is from the “digestive system” group and the other sample is from any other environmental group. The index is the ratio of these two average distances. CI value < 1 suggests that samples from the same environmental origin are close in their codon/AA usage relative to that of samples from different environmental origins and vice-versa.

To test the significance level, for each environmental group, we generated 100 random datasets by randomly assigning the samples in different environmental groups keeping the number of samples in each group unchanged. CI values were calculated for the random datasets following the same approach as the original dataset. P-value was defined as the number of times the CI values calculated from the random datasets were lower than that calculated from the corresponding real dataset divided by the number of random datasets (100).

Calculation of CUB indices

To estimate the extent of bias in different metagenomic samples, we considered three commonly used metrics of CUB namely CAI39, ENC40, and DCBS41,42. These indices try to capture differences in observed codon usage frequencies from uniform distribution (ENC, DCBS) or from distribution of reference set of highly expressed genes (CAI)14. However, they employ different scales to quantify the deviation. The scales of CAI and ENC values range from 0-1 and 20-61 respectively, while DCBS scores can be any positive value14. Presence of strong CUB reflected in higher CAI or DCBS values but lower ENC values. These indices were calculated for each sample separately considering all predicted CDS sequences for the sample as follows:

Calculation of CAI values

CAI values of each gene (g) of each sample was calculated using the equation as described in39.

$${{{mathrm{CAI}}}}_g = left( {mathop {prod}limits_{i = 1}^L {w_i} } right)^{!!{1/L}}$$

Where “L” is the length of the gene in the number of codons. For a codon “i”, wi represents its codon weightage which is calculated based on the observed frequency of that codon relative to the frequencies of all its synonymous codons from a reference set of highly expressed genes.

$$w_i = frac{{f_i}}{{maxleft( {f_i} right)}}$$

Where fi is the observed frequency of the codon and max(fi) is the maximum observed frequency among all its synonymous codons in the reference set.

The reference set of highly expressed genes was generated for each sample separately, measuring the transcript abundance of the predicted CDS sequences using Kallisto algorithm70. Kallisto is a sequence-alignment algorithm widely used for the quantification of RNA sequences in ribo-seq data analysis70. It aligns sequences against the reference contig or CDS sequences and calculates transcript abundance as the number of reads successfully mapped to the reference sequence normalized by their length and other parameters. We ran Kallisto using predicted CDS sequences of each sample as input and the corresponding contig files as reference sequences following default parameters. For the reference set of highly expressed genes, we considered top 5% CDS sequences of each sample sorted according to their abundance values. To get CAI values sample-wise, we calculated average CAI values of all genes in each sample.

Calculation of ENC values

ENC value of each sample was calculated considering all the genes in the sample together following the equation described in the reference40.

$$N_c = 2 + frac{9}{{F_2}} + frac{1}{{F_5}} + frac{5}{{F_4}} + frac{3}{{F_6}}$$

Where F is called homozygosity index and is calculated for a group of AAs having the same number of codons. For instance, there are 9 AAs with 2 codons. F value is calculated for all these 9 AAs together and represented as F2. Homozygosity index F for a group of AA is calculated as follows

$${{{mathrm{F}}}} = frac{{{sum} {n_{i{sum} {left( {p_i^2 – 1} right)} }} }}{{{sum} {n_i – 1} }}$$

For an AA with k number of synonymous codons, each with counts n1, n2,…, nk

$${{{mathrm{N}}}} = mathop {sum}limits_{i = 1}^k {n_i} ,{{{mathrm{and}}}},p_i = n_i/{{{mathrm{n}}}}{{{mathrm{.}}}}$$

Calculation of DCBS values

DCBS value of each gene (g) in each sample was calculated following41,42. Briefly, for each codon in a sample, we calculated a score dxyz using the following equation

$$d_{xyz} = frac{{fleft( {x,y,z} right) – f_1left( x right){cdot}f_2left( y right){cdot}f_3left( z right)}}{{f_1left( x right){cdot}f_2left( y right){cdot}f_3left( z right)}}$$

Where f(x,y,z) are the frequencies of the codon xyz, and f1(x), f2(y), and f3(z) are the observed frequencies of nucleotides x, y, and z at the first, second, and the third codon position, respectively. These frequencies were calculated for each sample separately considering all the predicted CDS sequences in the sample. DCBS value of each gene (g) in a sample was calculated as geometric mean of dxyz values over all its codons (except stop codons)

$${{{mathrm{DCBS}}}} = frac{{mathop {sum}limits_{i = 1}^L {d_{xyz}} }}{L}$$

Where ‘L’ is the length of the gene in the number of codons. Finally, the DCBS value of each sample was estimated as the average DCBS values of all genes in the sample.

Generation of random sequences and calculation of Z-scores

To test whether there is any kind of selection on the codon usage of the samples, we compared the codon usage bias of real sequences with that of their random sequences. To generate random sequences, coding sequences of a sample were randomly permuted 20 times using the SPARCS algorithm71. Thus, 20 sets of random sequences were generated from the real sequences of each sample. We specifically chose this algorithm to generate random sequences because this algorithm preserves the encoded protein sequence and the dinucleotide frequencies hence effects of other parameters such as mRNA secondary structure, and GC content, etc. are expected to be minimum71. Because of high computational cost, random sequences were generated for 417 samples (out of 422 samples in our dataset) excluding 5 samples with a very large number of predicted CDS sequences. We compared the codon usage scores (average CAI, DCBS, and ENC values) of real sequences of each test sample with that of their randomized version through the Z-score approach following previous studies55,56. For each sample, Z-score was calculated as:

$$Z_{score} = frac{{Average,codon,usage,scoreleft( {CAIorDCBSorENC} right)ofreal,sequences – Average,codo,nusage,score,of,20,randomizedgenomes}}{{standarddeviation,of,codonusagescore,of,the,20,randomizedgenomes}}$$

For the parameters like CAI and DCBS where higher scores indicate more bias, positive Z-scores denote that real sequences are more biased than their corresponding random sequences, while for ENC (lower value indicates higher bias) a negative Z-score denotes more bias in the real sequences.

Taxonomic analysis of the metagenomic samples

For the taxonomic identification, we considered a well-known taxonomic abundance estimation tool for metagenomic samples namely Kraken72. Kraken is an ultrafast algorithm specifically designed to predict taxonomic classifications based on the exact alignment of k-mers72. Thus Kraken was shown to provide more accurate results (higher sensitivity and precision) than other commonly used taxonomic annotators at a much faster rate and using much lesser memory72. Here we estimated abundance values of different taxonomic ranks in our test samples by employing all the predicted CDS sequences in each sample against the standard Kraken databases. CDS sequences of each sample were annotated at six taxonomic ranks: (i) phylum, (ii) class, (iii) order, (iv) family, (v) genus, and (vi) species levels. However, for the calculation of abundance values, we considered annotation up to the genus level (species-level annotations were discarded for low confidence identification). Further, we also discarded annotation at the class level due to the relatively lower number of annotated taxons under this rank. To account for the variation in the sample sizes we calculated their relative abundance in each sample. Relative abundance values were calculated by normalizing their count in each sample by the number of sequences with identified taxonomy in the sample. Distances in taxonomic abundance among the samples were calculated for each taxonomic rank separately using the EU and BC methods and considering the frequencies of taxons that were detected at least once in the majority (300 out of 422) of the test samples.

Functional analysis

For functional information, proteins predicted for each sample were subjected to InterProScan functional annotation tool (version-v83.0)73. InterProScan provides comprehensive information about protein families, domains, and functional sites by integrating signatures from several protein annotation servers including Pfam, CDD, PRINTS, PROSITE, SMART, ProDom, SUPERFAMILY, PANTHER, Gene3D, TIGRFAMs and HAMAP, etc.73. Functional annotations retrieved from these databases are highly redundant. Considering a large number of protein sequences in most of the samples, InterProScan was run against a subset of these databases namely Pfam, TIGRFAM, PROSITE, CDD, and GENE3D. Similar to taxonomic abundance, here we calculated the relative abundance of three GO categories: BP, CC, and MF in each sample. The relative abundance of each GO term in a sample was calculated by dividing the total count of the term in the sample by the total number of proteins in the sample for which GO annotations can be found. Altogether, 3702 different GO terms were found in all the test samples, most of which appeared only in few samples. Therefore, to calculate distances in GO term frequencies among the samples we considered abundance matrices of 500 GO BP, 500 GO MF and 100 GO CC terms respectively sorted according to the number of samples in which they were found. All of these terms were found at least once in more than 50% of test samples. Distances in GO term frequencies among the samples were calculated for each GO category (BP, MF, and CC) separately following the EU and BC dissimilarity methods.

Clusters of Orthologous Groups (COGs) are another important functional classification scheme for prokaryotic sequences74. The most recent update of the COG database includes 4877 different COGs definitions, which are categorized into 23 broad functional categories74. For our analysis, we retrieved the COG definition of the predicted protein sequences of each test sample by utilizing the standalone RPS-BLAST algorithm from NCBI CD-Search utilities75 against the pre-compiled Conserved Domain Database (CDD) database using P-value cutoff 10−2 and default settings for all other parameters. The results of the RPS-BLAST search were processed through the rpsbproc algorithm. RPS-BLAST is a variant of the PSI-BLAST algorithm that provides domain-level annotations imported from several external sources, including COG annotations75. To group sequences into the COG functional categories, we discarded the sequences that were annotated with multiple COGs from different functional categories and considered sequences annotated with COGs specific to that category only.

Analysis of codon/amino acid usage patterns of microorganisms collected from the fusionDB database

To study the codon and amino acid usage pattern of microorganisms at the species level we collected habitat information of 1,374 microorganisms from the fusionDB database28. The fusionDB database provides metadata such as their habitat information, preferred temperature, oxygen requirements, etc., for 1,374 taxonomically distinct bacteria. Protein coding CDS sequences and amino acid sequences of these microorganisms were retrieved from the NCBI GenBank database76 following their taxonomic identifiers and species names. Thus we could find CDS and amino acid sequences for 1270 out of 1374 bacteria in this database. For the environmental classifications of these bacteria, we broadly followed the habitat information as provided in this database. Briefly, when specific information regarding the habitat preference of a bacterium was available we consider the bacterium into that specific environmental category otherwise we broadly followed the habitat classification as mentioned in the fusionDB database. For reliable statistical tests, in this analysis, we mainly considered codon/amino acid usage frequencies of 925 bacteria from 7 environmental biomes namely: “Nasopharyngeal microflora” (24 bacteria), “Multi” (198 bacteria), “Fresh water” (110 bacteria), “Soil” (107 bacteria), “Intestinal microflora” (66 bacteria), “Marine” (85 bacteria), and “nonspecific host” (335 bacteria). There are at least 20 bacteria under each of these 7 biomes. Codon usage frequencies were calculated following absCUFs and synCUFs methods as described in our main text for the metagenomic dataset. Distances in codon/amino acid usage frequencies were calculated and compared within and between the groups following similar approaches as described in our main text for the metagenomic dataset.

Identification of ribosomal protein-coding genes in test samples

To identify potential ribosomal protein-coding genes we considered three approaches: (i) from their COG annotations: any sequence belonging to COGs defined as “ribosomal proteins” (Supplementary Table 20) were considered as potential ribosomal protein-coding genes, (ii) from InterPro73 annotation: for this, we considered a list 478 different InterPro identifiers (Supplementary Table 20) that were defined as ribosomal proteins. Any proteins with these annotations were considered as probable ribosomal proteins (iii) through blast: we collected known ribosomal protein sequences of 13 prokaryotic organisms from “Ribosomal protein Dataverse” (;jsessionid=e498052c86fa6e8eff1fc0d0ef23) and also known ribosomal protein sequences from Ribosomal Protein Gene Database77. Sequences showing significant (P < 10−4) blast hit with any of these sequences through the bi-directional blast approach (interchanging query and target) were considered as potential ribosomal protein-coding sequences. Finally, in each sample, we considered a non-redundant set of sequences that are identified as ribosomal proteins by any one of the three approaches.

Estimation of k-mer frequencies

k-mer frequencies in the CDS sequences of the test samples were estimated using the Jellyfish algorithm78. Jellyfish provides very fast and accurate k-mer counting of DNA sequences using an order of magnitude less memory as compared to other algorithms78. Estimation of k-mer frequencies for higher k is very computationally intensive47, specifically for large datasets like metagenomic samples. Therefore, here we considered mainly short k-mers for lengths of k ranging from k = 2 to k = 10 except k = 3 (which are equivalent to codons). In each sample, we estimated the frequencies of all possible k-mers for a given k by normalizing their count by the total occurrence of all the possible k-mers for that k in the sample. Divergence in the k-mer frequencies among the samples (within or between the environmental groups) was estimated using the distance matrices as described in the main text. For shorter k (k < = 5), we considered the frequencies of all possible k-mers in the calculation of distance matrices. However, for longer k (k > = 6), where the number of possible k-mers is very high, we first selected 2000 highly abundant (based on their frequencies in the test samples) k-mers for each k (k = 6,..,10). Next, we calculated distance matrices based on the frequencies of these selected k-mers in each sample.

Statistical analysis and visualizations

All statistical tests were performed using R (The R Project for Statistical Computing)69. For correlation analysis, we calculated the non-parametric Spearman’s Rank correlation coefficient ρ, where significant correlations were denoted by P-values. To compare the distribution of test variables (such as codon usage distances or CUB indices) among different groups of samples we considered the Mann-Whitney U test or the Kruskal-Wallis H test (an extended version of Mann-Whitney U tests) depending upon the number of groups to compare. P values were adjusted for multiple comparisons using the Bonferroni post hoc test when samples from more than two environmental groups were considered for the test. For regression analysis, we considered the multiple linear regression model where the dependent and explanatory variables are assumed to be related linearly. Plots are generated using the ‘ggplot2” library of R.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Leave a Reply

Your email address will not be published. Required fields are marked *