# Author names and e-mail addresses
Andrés Gustavo Jacquat (andresgjacquat@gmail.com or agjacquat@imbiv.unc.edu.ar) [1,2] *
Martín Gustavo Theumer (mgtheumer@unc.edu.ar) [3,4]
José Sebastián Dambolena (jdambolena@imbiv.unc.edu.ar) [1,2]

1 Facultad de Ciencias Exactas Físicas y Naturales (FCEFyN), Universidad Nacional de Córdoba (UNC), Avenida Vélez Sarsfield 299, Córdoba 5000, Argentina.
2 Instituto Multidisciplinario de Biología Vegetal (IMBIV), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Avenida Vélez Sarsfield 1611, Córdoba 5000, Argentina.
3 Departamento de Bioquímica Clínica, Facultad de Ciencias Químicas (FCQ), Universidad Nacional de Córdoba (UNC), Córdoba 5000, Argentina.
4 Centro de Investigaciones en Bioquímica Clínica e Inmunología (CIBICI), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Córdoba 5000, Argentina.

\* Correspondence: Mailing address: Avenida Vélez Sarsfield 1611, Córdoba 5000, Argentina

# Title of study
Selective and non-selective evolutionary signatures found in the simplest replicative biological entities.

# Study summary
In the present study, we aimed to quantitatively describe the genomes of mitoviruses (Mitoviridae) and to identify variables that could be used as classification criteria at the genus level, in addition to protein phylogeny. Such variables could be referred to as genomic signatures. Specifically, we analyzed the mononucleotide and dinucleotide composition, the synonymous codon usage bias, the amounts of purines and pyrimidines in the first (P1), second (P2), and third (P3) nucleotides of the codons, and the minimum free energy (MFE) values predicted from the optimized secondary structure. Furthermore, we discussed the interplay between neutral evolution and natural selection based on the patterns observed in these quantitative variables of genomic composition.
Our attempts to identify attributes that could serve as quantitative classification criteria, in addition to protein phylogeny, were unsuccessful. However, one of the most important findings of this descriptive study was the structural divergence observed in Kvaramitovirus. We hypothesize that a single evolutionary circularization event occurred in the last common ancestor of all members of the genus Kvaramitovirus. This event could have altered the evolutionary trajectory of the genus. It is possible that new structural constraints emerged, and that natural selection played a crucial role in preserving the reproductive fitness, stability, and genomic integrity of these newly emerged circular mitovirus populations. We conclude that both neutral evolution and natural selection influence genome composition, with natural selection likely being the most significant evolutionary force shaping the nucleotide sequence of mitovirus genomes.

# Responsible for collecting data
Andrés Gustavo Jacquat

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_nt_ORF_seq_all_Mitoviridae_data_file.txt

## Brief overall description
This file contains the open reading frame (ORF) sequences used throughout the study, from which all the quantitative genomic data were obtained. Each ORF, including the first in-frame AUG codon and the in-frame TAA or TAG stop codon, was predicted with the NCBI-ORFfinder software (https://www.ncbi.nlm.nih.gov/orffinder/; accessed in August 2023).
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
Genomic_quantitative_data.txt

## Brief overall description
This two-entry data table contains all the quantitative genomic variables derived from the nucleotide sequences and used to produce the results analyzed in the main text.

### Column names and meaning

#### Column number 1
Name: "Clade"
Unit: N/A
Meaning: Name of the main taxonomic group of members of Mitoviridae: Unuamitovirus (ICTV-accepted genus), Duamitovirus (ICTV-accepted genus), Triamitovirus (ICTV-accepted genus), Kvaramitovirus (ICTV-accepted genus), Kvinmitovirus (proposed genus-level clade of putative invertebrate-infecting mitoviruses), and Arkeomitovirinae (proposed subfamily-level clade of putative mitoviruses with unknown hosts).

#### Column number 2
Name: "NCBI accession"
Unit: N/A
Meaning: Accession code of the nucleotide sequence in the public digital repository of the National Center for Biotechnology Information (NCBI, U.S. National Library of Medicine, Bethesda MD, USA).

#### Column number 3
Name: "ORF name"
Unit: N/A
Meaning: The name of the open reading frame (ORF) was assigned deliberately, in accordance with the name of the virus registered under the NCBI accession code.

#### Column number 4-7
Names: "%A(3)", "%T(3)", "%G(3)" and "%C(3)"
Unit: percentage (%)
Meaning: Fractional mononucleotide content of each base (A, T=U, G and C, respectively). Values were determined with the EMBOSS-compseq program on the EMBOSSexplore website (https://www.bioinformatics.nl/emboss-explorer/; accessed in August 2023) from the ORF sequences detailed in column 15.
#### Column number 8
Name: "%GC(all)"
Unit: percentage (%)
Meaning: Sum of the fractional mononucleotide content of "C" and "G" in the first, second and third positions of the codons. (Sequences: ORF sequences detailed in column 15)

#### Column number 9-11
Names: "%GC(1)", "%GC(2)" and "%GC(3)"
Unit: percentage (%)
Meaning: Fractional mononucleotide content of "C" and "G" in the first, second and third positions of the codons, respectively. (Sequences: ORF sequences detailed in column 15)

#### Column number 12
Name: "%GC(12)"
Unit: percentage (%)
Meaning: Sum of the fractional mononucleotide content of "C" and "G" in the first and second positions of the codons. (Sequences: ORF sequences detailed in column 15)

#### Column number 13
Name: "MFE(Kcal/mol)"
Unit: kcal/mol
Meaning: Value of the minimum free energy (MFE) predicted for the two-dimensional conformation of the ORF sequences. Values were obtained with the RNAfold program from the ViennaRNA package, run with default settings on the http://rna.tbi.univie.ac.at/ web server (accessed in August 2023). (Sequences: ORF sequences detailed in column 15. ORFs < 1300 nt were excluded from the MFE calculation)

#### Column number 14
Name: "MFE(Kcal/mol/nt)"
Unit: kcal/mol/nt
Meaning: Value of the minimum free energy (MFE) divided by the ORF length. The MFE was predicted for the two-dimensional conformation of the ORF sequences. Values were obtained with the RNAfold program from the ViennaRNA package, run with default settings on the http://rna.tbi.univie.ac.at/ web server (accessed in August 2023). (Sequences: ORF sequences detailed in column 15. ORFs < 1300 nt were excluded from the MFE calculation)

#### Column number 15
Name: "ORF_length"
Unit: nt
Meaning: The open reading frame (ORF) length includes the first in-frame AUG codon and the in-frame TAA or TAG stop codon. ORFs were predicted with the NCBI-ORFfinder software (https://www.ncbi.nlm.nih.gov/orffinder/; accessed in August 2023).
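
The codon-position GC columns and the length-normalized MFE described above can be illustrated with a minimal Python sketch. This is not the EMBOSS or RNAfold workflow used in the study: the function names and the toy sequence are hypothetical, the ORF is assumed to be in frame with a length that is a multiple of 3, and "%GC(12)" is interpreted here as the GC percentage over the pooled first and second codon positions.

```python
from typing import Optional


def gc_by_codon_position(orf: str) -> dict:
    """Return %GC at codon positions 1, 2 and 3, plus %GC(12) and %GC(all).

    Assumption: the ORF is in frame and its length is a multiple of 3.
    """
    seq = orf.upper().replace("U", "T")
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("ORF length must be a non-zero multiple of 3")

    def percent_gc(bases: str) -> float:
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    result = {}
    for pos in range(3):
        # seq[pos::3] collects every nucleotide at this codon position
        result[f"%GC({pos + 1})"] = percent_gc(seq[pos::3])
    # Assumed interpretation: GC percentage over pooled positions 1 and 2
    result["%GC(12)"] = percent_gc(seq[0::3] + seq[1::3])
    result["%GC(all)"] = percent_gc(seq)
    return result


def mfe_per_nt(mfe_kcal_mol: float, orf_length: int,
               min_length: int = 1300) -> Optional[float]:
    """Divide an MFE value by the ORF length (kcal/mol/nt).

    ORFs shorter than min_length return None, mirroring the exclusion
    of ORFs < 1300 nt stated in the column descriptions.
    """
    if orf_length < min_length:
        return None
    return mfe_kcal_mol / orf_length


if __name__ == "__main__":
    toy_orf = "ATGGCCTAA"  # hypothetical toy sequence, not from the data set
    print(gc_by_codon_position(toy_orf))
    print(mfe_per_nt(-650.0, 2600))
```

The MFE value itself is taken as an input here; predicting it requires a folding program such as RNAfold.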
#### Column number 16
Name: "AT3-Bias"
Unit: percentage (%)
Meaning: Value calculated as ƒA3 / (ƒA3 + ƒT3), where ƒA3 represents the fractional mononucleotide content of "A" in the third positions of the codons, and ƒT3 represents the fractional mononucleotide content of "T=U" in the third positions of the codons. (Sequences: ORF sequences detailed in column 15)

#### Column number 17
Name: "GC3-Bias"
Unit: percentage (%)
Meaning: Value calculated as ƒG3 / (ƒG3 + ƒC3), where ƒG3 represents the fractional mononucleotide content of "G" in the third positions of the codons, and ƒC3 represents the fractional mononucleotide content of "C" in the third positions of the codons. (Sequences: ORF sequences detailed in column 15)

#### Column number 18
Name: "GC3-Bias-50"
Unit: percentage (%)
Meaning: Arithmetic difference between the GC3-Bias value (Column number 17) and 50%.

#### Column number 19
Name: "rSCUB"
Unit: percentage (%)
Meaning: Residual (r) synonymous codon usage bias (SCUB). Relative difference between the observed Nc index (Column number 20) and the Nc index theoretically expected, for the corresponding GC3 value (Column number 11), under complete absence of selection pressure.

#### Column number 20
Name: "Nc_index"
Unit: none
Meaning: Effective Number of Codons (Nc) index. The Nc index for each sequence was determined using the EMBOSS-chips v. 6.6.0 program on the EMBOSSexplore website (accessed in August 2023). (Sequences: ORF sequences detailed in column 15)

#### Column number 21-36
Names: "p(AA)" "p(AC)" "p(AG)" "p(AT)" "p(CA)" "p(CC)" "p(CG)" "p(CT)" "p(GA)" "p(GC)" "p(GG)" "p(GT)" "p(TA)" "p(TC)" "p(TG)" "p(TT)"
Unit: none
Meaning: Dinucleotide relative frequencies (ρXY), also referred to as the “odds” ratio, defined as ƒobs(XY) (Column number 37-52) divided by the product of the corresponding fractional mononucleotide contents, ƒobs(X) and ƒobs(Y) (Column number 4-7).

#### Column number 37-52
Names: "f(AA)" "f(AC)" "f(AG)" "f(AT)" "f(CA)" "f(CC)" "f(CG)" "f(CT)" "f(GA)" "f(GC)" "f(GG)" "f(GT)" "f(TA)" "f(TC)" "f(TG)" "f(TT)"
Unit: none
Meaning: Dinucleotide frequency (ƒXY) of every dimer (X – phosphodiester bond – Y). ƒXY values were determined using EMBOSS-compseq v. 6.6.0 (Rice et al., 2000) on the EMBOSSexplore website (https://www.bioinformatics.nl/emboss-explorer/; accessed in August 2023).

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_MSSA_3DPROMALS_output.txt

## Brief overall description
Multiple sequence and structure alignment (MSSA) in FASTA format. The MSSA was carried out on the PROMALS3D multiple sequence and structure alignment server (http://prodata.swmed.edu/promals3d/promals3d.php) with default program parameters. The alignment, generated from amino acid (AA) sequences, was used to estimate the phylogenetic tree.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_aa_seq_ML-Tree_IQTREE_output.txt

## Brief overall description
ModelFinder output + ML tree reconstruction output + ultra-fast bootstrap output. Analyses were performed with the IQ-TREE software through the Los Alamos National Laboratory web server: https://www.hiv.lanl.gov/content/sequence/IQTREE/iqtree.html.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_aa_seq_all_Mitoviridae_data_file.txt

## Brief overall description
Amino acid sequences of the RdRp encoded by the Mitoviridae members included in the phylogenetic study. The sequences are in FASTA format. The NCBI accession code and the abbreviated name of the virus are indicated in the header.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_aa_seq_Narnaviridae_data_file.txt

## Brief overall description
Amino acid sequences of the ICTV-accepted narnaviruses included as an outgroup in the phylogenetic analysis. The sequences are in FASTA format. The NCBI accession code and the abbreviated name of the virus are indicated in the header.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
QDA_R-script.txt

## Brief overall description
This file contains the script used to perform the quadratic discriminant analysis (QDA) in the R software environment.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
QDA_data_file.txt

## Brief overall description
This two-entry data table contains the quantitative genomic variables considered as possible genomic signatures and used for the QDA.

### Column names and meaning

#### Column number 1
Name: "Clade"
Unit: N/A
Meaning: Name of the main taxonomic group of members of Mitoviridae: Unuamitovirus (ICTV-accepted genus), Duamitovirus (ICTV-accepted genus), Triamitovirus (ICTV-accepted genus), Kvaramitovirus (ICTV-accepted genus), Kvinmitovirus (proposed genus-level clade of putative invertebrate-infecting mitoviruses), and Arkeomitovirinae (proposed subfamily-level clade of putative mitoviruses with unknown hosts).

#### Column number 2
Name: "NCBI accession"
Unit: N/A
Meaning: Accession code of the nucleotide sequence in the public digital repository of the National Center for Biotechnology Information (NCBI, U.S. National Library of Medicine, Bethesda MD, USA).
#### Column number 3
Name: "ORF name"
Unit: N/A
Meaning: The name of the open reading frame (ORF) was assigned deliberately, in accordance with the name of the virus registered under the NCBI accession code.

#### Column number 4
Name: "MFE(Kcal/mol)"
Unit: kcal/mol
Meaning: Value of the minimum free energy (MFE) predicted for the two-dimensional conformation of the ORF sequences. Values were obtained with the RNAfold program from the ViennaRNA package, run with default settings on the http://rna.tbi.univie.ac.at/ web server (accessed in August 2023). (Sequences: ORF sequences detailed in column 6. ORFs < 1300 nt were excluded from the MFE calculation)

#### Column number 5
Name: "MFE(Kcal/mol/nt)"
Unit: kcal/mol/nt
Meaning: Value of the minimum free energy (MFE) divided by the ORF length. The MFE was predicted for the two-dimensional conformation of the ORF sequences. Values were obtained with the RNAfold program from the ViennaRNA package, run with default settings on the http://rna.tbi.univie.ac.at/ web server (accessed in August 2023). (Sequences: ORF sequences detailed in column 6. ORFs < 1300 nt were excluded from the MFE calculation)

#### Column number 6
Name: "ORF_length"
Unit: nt
Meaning: The open reading frame (ORF) length includes the first in-frame AUG codon and the in-frame TAA or TAG stop codon. ORFs were predicted with the NCBI-ORFfinder software (https://www.ncbi.nlm.nih.gov/orffinder/; accessed in August 2023).

#### Column number 7-22
Names: "p(AA)" "p(AC)" "p(AG)" "p(AT)" "p(CA)" "p(CC)" "p(CG)" "p(CT)" "p(GA)" "p(GC)" "p(GG)" "p(GT)" "p(TA)" "p(TC)" "p(TG)" "p(TT)"
Unit: none
Meaning: Dinucleotide relative frequencies (ρXY), also referred to as the “odds” ratio, defined as ƒobs(XY) (Column number 23-38) divided by the product of the corresponding fractional mononucleotide contents, ƒobs(X) and ƒobs(Y).

#### Column number 23-38
Names: "f(AA)" "f(AC)" "f(AG)" "f(AT)" "f(CA)" "f(CC)" "f(CG)" "f(CT)" "f(GA)" "f(GC)" "f(GG)" "f(GT)" "f(TA)" "f(TC)" "f(TG)" "f(TT)"
Unit: none
Meaning: Dinucleotide frequency (ƒXY) of every dimer (X – phosphodiester bond – Y). ƒXY values were determined using EMBOSS-compseq v. 6.6.0 (Rice et al., 2000) on the EMBOSSexplore website (https://www.bioinformatics.nl/emboss-explorer/; accessed in August 2023).

#### Column number 39
Name: "AT3-Bias"
Unit: percentage (%)
Meaning: Value calculated as ƒA3 / (ƒA3 + ƒT3), where ƒA3 represents the fractional mononucleotide content of "A" in the third positions of the codons, and ƒT3 represents the fractional mononucleotide content of "T=U" in the third positions of the codons. (Sequences: ORF sequences detailed in column 6)

#### Column number 40
Name: "GC3-Bias"
Unit: percentage (%)
Meaning: Value calculated as ƒG3 / (ƒG3 + ƒC3), where ƒG3 represents the fractional mononucleotide content of "G" in the third positions of the codons, and ƒC3 represents the fractional mononucleotide content of "C" in the third positions of the codons. (Sequences: ORF sequences detailed in column 6)

#### Column number 41
Name: "rSCUB"
Unit: percentage (%)
Meaning: Residual (r) synonymous codon usage bias (SCUB). Relative difference between the observed Nc index (Column number 42) and the Nc index theoretically expected, for the corresponding GC3 value (the "%GC(3)" column of Genomic_quantitative_data.txt), under complete absence of selection pressure.

#### Column number 42
Name: "Nc_index"
Unit: none
Meaning: Effective Number of Codons (Nc) index. The Nc index for each sequence was determined using the EMBOSS-chips v. 6.6.0 program on the EMBOSSexplore website (accessed in August 2023). (Sequences: ORF sequences detailed in column 6)
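
As a rough illustration of how the AT3-Bias, GC3-Bias and dinucleotide odds-ratio columns relate to an ORF sequence, the following Python sketch applies the formulas given in the column descriptions. It is not the EMBOSS-compseq procedure used to produce the data: the function names and toy sequences are hypothetical, dinucleotides are assumed to be counted in overlapping windows, and the ORF is assumed to be in frame.

```python
from collections import Counter
from itertools import product


def third_position_biases(orf: str) -> dict:
    """AT3-Bias = fA3 / (fA3 + fT3) and GC3-Bias = fG3 / (fG3 + fC3), in %."""
    seq = orf.upper().replace("U", "T")
    third = Counter(seq[2::3])  # nucleotides at the third codon position
    at_total = third["A"] + third["T"]
    gc_total = third["G"] + third["C"]
    return {
        # None marks an undefined ratio (zero denominator)
        "AT3-Bias": 100.0 * third["A"] / at_total if at_total else None,
        "GC3-Bias": 100.0 * third["G"] / gc_total if gc_total else None,
    }


def dinucleotide_odds_ratios(orf: str) -> dict:
    """rho(XY) = f(XY) / (f(X) * f(Y)) for all 16 dinucleotides.

    Assumption: dinucleotides are counted in overlapping windows (step 1).
    """
    seq = orf.upper().replace("U", "T")
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    f_mono = {base: mono[base] / n for base in "ACGT"}
    ratios = {}
    for x, y in product("ACGT", repeat=2):
        f_xy = di[x + y] / (n - 1)
        expected = f_mono[x] * f_mono[y]
        ratios[f"p({x}{y})"] = f_xy / expected if expected else None
    return ratios


if __name__ == "__main__":
    toy_orf = "ATGGCCTAA"  # hypothetical toy sequence, not from the data set
    print(third_position_biases(toy_orf))
    print(dinucleotide_odds_ratios(toy_orf)["p(GC)"])
```

An odds ratio above 1 indicates a dinucleotide that is more frequent than expected from the mononucleotide composition alone, and a value below 1 indicates one that is less frequent.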