# Author names and e-mail address 
Andrés Gustavo Jacquat (andresgjacquat@gmail.com or agjacquat@imbiv.unc.edu.ar) [1,2] *
Martín Gustavo Theumer (mgtheumer@unc.edu.ar) [3,4]
José Sebastián Dambolena (jdambolena@imbiv.unc.edu.ar) [1,2]
1 Facultad de Ciencias Exactas Físicas y Naturales (FCEFyN), Universidad Nacional de Córdoba (UNC), Avenida Vélez Sarsfield 299, Córdoba 5000, Argentina  
2 Instituto Multidisciplinario de Biología Vegetal (IMBIV), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Avenida Vélez Sarsfield 1611, Córdoba 5000, Argentina.
3 Departamento de Bioquímica Clínica, Facultad de Ciencias Químicas (FCQ), Universidad Nacional de Córdoba (UNC), Córdoba 5000, Argentina
4 Centro de Investigaciones en Bioquímica Clínica e Inmunología (CIBICI), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Córdoba 5000, Argentina.
*  Correspondence: Mailing address: Avenida Vélez Sarsfield 1611, Córdoba 5000, Argentina

# Title of study
Selective and non-selective evolutionary signatures found in the simplest replicative biological entities.

# Study summary
In the present study, we aimed to quantitatively describe the genomes of mitoviruses (Mitoviridae) and identify variables that 
could be used as classification criteria at the genus level, in addition to protein phylogeny. These variables could be referred 
to as genomic signatures. Specifically, we analyzed the mononucleotide and dinucleotide composition, the synonymous codon usage 
bias, the amounts of purines and pyrimidines in the first (P1), second (P2), and third (P3) nucleotides of the codons, and the 
minimum free energy (MFE) values predicted from the optimized secondary structure. Furthermore, we discussed the interaction 
between neutral evolution and natural selection based on the various patterns observed in the aforementioned quantitative variables of 
genomic composition. Our attempts to identify attributes or characteristics that could serve as quantitative classification 
criteria, in addition to protein phylogeny, were unsuccessful. On the other hand, one of the most important discoveries of this 
descriptive study was the structural divergence evidenced in Kvaramitovirus. We hypothesize that a single evolutionary 
circularization event occurred in the last common ancestor of all members of the genus Kvaramitovirus. This event could have 
potentially altered the evolutionary trajectory. It is possible that new structural constraints emerged, and that natural selection 
played a crucial role in preserving the reproductive fitness, stability, and genomic integrity of these newly emerged circular 
mitovirus populations. We conclude that both neutral evolution and natural selection influence genome composition, with natural selection 
likely being the most significant evolutionary force in shaping the nucleotide sequence of mitovirus genomes.

# Responsible for collecting data
Andrés Gustavo Jacquat 

----------------------------------------------------------------------------------------------------------------------------------------------------------------------


# Data file Name
RdRp_nt_ORF_seq_all_Mitoviridae_data_file.txt

## Brief overall description
This file contains the open reading frame (ORF) sequences used throughout the study, from which all 
quantitative genomic data were obtained. Each ORF, spanning the first in-frame AUG codon through the in-frame stop codon (TAA or TAG), 
was predicted with the NCBI-ORFfinder software (https://www.ncbi.nlm.nih.gov/orffinder/; accessed in August 2023). 

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
Genomic_quantitative_data.txt

## Brief overall description
This two-entry data table contains all quantitative genomic variables derived from the nucleotide sequences and used to produce the results analyzed in the main text.

### Column names and meaning

#### Column number 1
Name: "Clade"
Unity: N/A
Meaning: Name of the main taxonomic group of members of Mitoviridae: Unuamitovirus (ICTV-accepted genus), 
Duamitovirus (ICTV-accepted genus), Triamitovirus (ICTV-accepted genus), Kvaramitovirus (ICTV-accepted 
genus), Kvinmitovirus (proposed genus-level clade of putative invertebrate-infecting mitoviruses), 
and Arkeomitovirinae (proposed subfamily-level clade of putative mitoviruses with unknown hosts)

#### Column number 2
Name: "NCBI accession"
Unity: N/A
Meaning: Accession code of the nucleotide sequence in the public repository of the National Center for Biotechnology Information (NCBI, U.S. National Library of Medicine, Bethesda MD, USA).

#### Column number 3
Name: "ORF name"
Unity: N/A
Meaning: The name of the open reading frame (ORF) was assigned deliberately and in accordance with the name of the virus registered under the NCBI accession code.

#### Column number 4-7
Names: "%A(3)", "%T(3)", "%G(3)" and "%C(3)"
Unity: percentage (%)
Meaning: Fractional mononucleotide content of each base (A, T=U, G and C, respectively). Values were determined with the EMBOSS-compseq program on the EMBOSSexplore website (https://www.bioinformatics.nl/emboss-explorer/;
accessed in August 2023) from the ORF sequences detailed in column 15.

#### Column number 8
Name: "%GC(all)"
Unity: percentage (%)
Meaning: Sum of the fractional mononucleotide content of "C" and "G" in the first, second and third positions of the codons (Sequences: ORF sequences detailed in column 15)

#### Column number 9-11
Names: "%GC(1)", "%GC(2)" and "%GC(3)"
Unity: percentage (%)
Meaning: Fractional mononucleotide content of "C" and "G" in the first, second and third positions of the codons, respectively (Sequences: ORF sequences detailed in column 15)

#### Column number 12
Name: "%GC(12)"
Unity: percentage (%)
Meaning: Sum of the fractional mononucleotide content of "C" and "G" in the first and second positions of the codons (Sequences: ORF sequences detailed in column 15)
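
The codon-position GC variables in columns 8 to 12 can be illustrated with a minimal Python sketch. The values in the data table were obtained with EMBOSS-compseq, so this is not the procedure used in the study; the function name `gc_by_codon_position` is hypothetical, and treating "%GC(12)" as the GC content pooled over first and second positions is an assumption.

```python
def gc_by_codon_position(orf: str) -> dict:
    """Return GC percentages per codon position for an in-frame ORF.

    Illustrative only: the stop codon is kept, as in the ORF definition
    of this README, and U is treated as T.
    """
    seq = orf.upper().replace("U", "T")
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("ORF length must be a non-zero multiple of 3")

    def gc_percent(bases: str) -> float:
        # Percentage of G or C among the given bases.
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    # Slice out the first, second and third position of every codon.
    first, second, third = seq[0::3], seq[1::3], seq[2::3]
    return {
        "%GC(1)": gc_percent(first),
        "%GC(2)": gc_percent(second),
        "%GC(3)": gc_percent(third),
        # Assumption: pooled GC content over positions 1 and 2.
        "%GC(12)": gc_percent(first + second),
        "%GC(all)": gc_percent(seq),
    }


# Toy ORF (not taken from the dataset): ATG GCC TAA
result = gc_by_codon_position("ATGGCCTAA")
```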

#### Column number 13
Name: "MFE(Kcal/mol)"
Unity: kcal/mol
Meaning: Minimum free energy (MFE) predicted for the secondary structure of each ORF sequence. Values were obtained with the RNAfold program of the ViennaRNA 
package, run with default settings on the http://rna.tbi.univie.ac.at/ web server (accessed in August 2023). (Sequences: ORF sequences detailed in column 15. ORFs shorter than 1300 nt were excluded from the MFE calculation) 

#### Column number 14
Name: "MFE(Kcal/mol/nt)"
Unity: kcal/mol/nt
Meaning: Minimum free energy (MFE) divided by ORF length. The MFE was predicted for the secondary structure of each ORF sequence. Values were obtained with the RNAfold program of the ViennaRNA 
package, run with default settings on the http://rna.tbi.univie.ac.at/ web server (accessed in August 2023). (Sequences: ORF sequences detailed in column 15. ORFs shorter than 1300 nt were excluded from the MFE calculation) 
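
As a small illustration of the length normalization and the 1300 nt cutoff described for columns 13 and 14, the sketch below divides an MFE value by the ORF length and returns no value for excluded ORFs. The function name `mfe_per_nt` and the example numbers are hypothetical; how excluded ORFs are encoded in the data table is not assumed here.

```python
from typing import Optional

# Threshold stated in the column description: shorter ORFs were excluded.
MIN_ORF_LENGTH_NT = 1300


def mfe_per_nt(mfe_kcal_mol: float, orf_length_nt: int) -> Optional[float]:
    """Return the MFE normalized by ORF length (kcal/mol/nt).

    Returns None for ORFs shorter than the cutoff, mirroring their
    exclusion from the MFE calculation.
    """
    if orf_length_nt < MIN_ORF_LENGTH_NT:
        return None
    return mfe_kcal_mol / orf_length_nt


# Hypothetical values, not taken from the dataset.
normalized = mfe_per_nt(-600.0, 2400)
excluded = mfe_per_nt(-100.0, 1200)
```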

#### Column number 15
Name: "ORF_length"
Unity: nt
Meaning: The open reading frame (ORF) length includes the first in-frame AUG codon and the in-frame TAA or TAG stop codons. ORFs were predicted by the NCBI-ORFfinder software (https://www.ncbi.nlm.nih.gov/orffinder/; 
accessed in August 2023). 

#### Column number 16
Name: "AT3-Bias"
Unity: percentage (%)
Meaning: Value is calculated as ƒA3 / (ƒA3 + ƒT3), where ƒA3 represents the fractional mononucleotide content of "A" in the third positions of the codons, and ƒT3 represents the fractional mononucleotide 
content of "T=U" in the third positions of the codons. (Sequences: ORF sequences detailed in column 15)

#### Column number 17
Name: "GC3-Bias"
Unity: percentage (%)
Meaning: Value is calculated as ƒG3 / (ƒG3 + ƒC3), where ƒG3 represents the fractional mononucleotide content of "G" in the third positions of the codons, and ƒC3 represents the fractional mononucleotide 
content of "C" in the third positions of the codons. (Sequences: ORF sequences detailed in column 15)

#### Column number 18
Name: "GC3-Bias-50"
Unity: percentage (%)
Meaning: Arithmetic difference between the GC3-Bias value (Column number 17) and 50%.
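
The three third-position bias variables (columns 16 to 18) follow directly from the formulas given above. The sketch below is one possible implementation, assuming the ratios are multiplied by 100 to match the stated percentage unit; the function name `third_position_biases` is hypothetical.

```python
def third_position_biases(orf: str) -> dict:
    """Compute AT3-Bias, GC3-Bias and GC3-Bias-50 (all in %) for an ORF."""
    seq = orf.upper().replace("U", "T")
    third = seq[2::3]  # third base of every codon

    a3, t3 = third.count("A"), third.count("T")
    g3, c3 = third.count("G"), third.count("C")
    if a3 + t3 == 0 or g3 + c3 == 0:
        raise ValueError("bias undefined: no A/T or no G/C at third positions")

    # Counts share the same denominator, so count ratios equal the
    # ratios of fractional contents fA3/(fA3+fT3) and fG3/(fG3+fC3).
    at3_bias = 100.0 * a3 / (a3 + t3)
    gc3_bias = 100.0 * g3 / (g3 + c3)
    return {
        "AT3-Bias": at3_bias,
        "GC3-Bias": gc3_bias,
        "GC3-Bias-50": gc3_bias - 50.0,
    }


# Toy ORF (not from the dataset): ATG GCA GCT GCG GCC TAA
biases = third_position_biases("ATGGCAGCTGCGGCCTAA")
```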

#### Column number 19
Name: "rSCUB"
Unity: percentage (%)
Meaning: Residual (r) synonymous codon usage bias (SCUB): the relative difference between the observed Nc index (Column number 20) and the Nc index theoretically expected in the complete absence of selection pressure for the corresponding GC3 value (Column number 11).
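
A hedged sketch of how a residual of this kind could be computed is given below. It assumes that the expected Nc follows Wright's (1990) null curve, Nc = 2 + s + 29 / (s² + (1 − s)²) with s the GC3 fraction, and that the relative difference is taken as (expected − observed) / expected × 100. Neither the choice of curve nor the sign convention is stated in this README, so both are assumptions, and the function names are hypothetical.

```python
def expected_nc(gc3_fraction: float) -> float:
    """Expected Nc under no selection (Wright's 1990 null curve).

    Assumption: this is the expectation meant by the column description.
    """
    s = gc3_fraction
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


def rscub_percent(nc_observed: float, gc3_percent: float) -> float:
    """Relative difference (%) between expected and observed Nc.

    Assumption: positive values mean stronger codon usage bias than
    expected from GC3 alone.
    """
    nc_exp = expected_nc(gc3_percent / 100.0)
    return 100.0 * (nc_exp - nc_observed) / nc_exp


# Hypothetical values, not taken from the dataset.
value = rscub_percent(48.4, 50.0)
```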

#### Column number 20
Name: "Nc_index"
Unity: none 
Meaning: Effective Number of Codons (Nc) index. The Nc index for each sequence was determined using the EMBOSS-chips v. 6.6.0 program on the EMBOSSexplore website (accessed in August 2023). (Sequences: ORF sequences detailed in column 15)

#### Column number 21-36
Names: "p(AA)" "p(AC)" "p(AG)" "p(AT)" "p(CA)" "p(CC)" "p(CG)" "p(CT)" "p(GA)" "p(GC)" "p(GG)" "p(GT)" "p(TA)" "p(TC)" "p(TG)" "p(TT)"
Unity: none
Meaning: Dinucleotide relative frequencies (ρXY), also referred to as the “odds” ratio, which is defined as the ƒobs(XY) (Column number 37-52) divided by the product of the corresponding fractional mononucleotide contents, ƒobs(X) and ƒobs(Y) (Column number 4-7)

#### Column number 37-52
Names: "f(AA)" "f(AC)" "f(AG)" "f(AT)" "f(CA)" "f(CC)" "f(CG)" "f(CT)" "f(GA)" "f(GC)" "f(GG)" "f(GT)" "f(TA)" "f(TC)" "f(TG)" "f(TT)"
Unity: none
Meaning: Dinucleotide frequency (ƒXY) of every dimer (X – phosphodiester bond – Y). ƒXY values were determined using EMBOSS-compseq v. 6.6.0 (Rice et al., 2000) on the EMBOSSexplore website (https://www.bioinformatics.nl/emboss-explorer/; accessed in August 2023). 
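
The relation between the dinucleotide frequencies (ƒXY) and the odds ratios (ρXY = ƒXY / (ƒX · ƒY)) can be sketched as follows. The values in the data table come from EMBOSS-compseq, so this is only an illustration; it assumes overlapping dinucleotides counted along a single strand and mononucleotide frequencies computed over the whole sequence, and the function name `dinucleotide_odds_ratios` is hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_odds_ratios(sequence: str):
    """Return (f_di, rho): dinucleotide frequencies and odds ratios."""
    seq = sequence.upper().replace("U", "T")
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")

    mono = Counter(seq)
    # Overlapping dinucleotides: positions (0,1), (1,2), ...
    di = Counter(seq[i:i + 2] for i in range(n - 1))

    f_mono = {b: mono[b] / n for b in "ACGT"}
    f_di, rho = {}, {}
    for x, y in product("ACGT", repeat=2):
        xy = x + y
        f_di[xy] = di[xy] / (n - 1)
        expected = f_mono[x] * f_mono[y]
        # The ratio is undefined when a base is absent from the sequence.
        rho[xy] = f_di[xy] / expected if expected > 0 else float("nan")
    return f_di, rho


# Toy sequence, not taken from the dataset.
f_di, rho = dinucleotide_odds_ratios("ACGTACGT")
```

Under these assumptions, ρXY values above 1 indicate over-represented dinucleotides and values below 1 indicate under-represented ones.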


----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_MSSA_3DPROMALS_output.txt

## Brief overall description
Multiple sequence and structure alignment (MSSA) in FASTA format. The MSSA was generated on the PROMALS3D multiple sequence 
and structure alignment server (http://prodata.swmed.edu/promals3d/promals3d.php) with default program parameters.
The alignment, built from the amino acid (AA) sequences, was used to estimate the phylogenetic tree. 

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_aa_seq_ML-Tree_IQTREE_output.txt

## Brief overall description
ModelFinder output, maximum likelihood (ML) tree reconstruction output, and ultrafast bootstrap output. Analyses were performed with the IQ-TREE software 
through the Los Alamos Lab web server: https://www.hiv.lanl.gov/content/sequence/IQTREE/iqtree.html.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_aa_seq_all_Mitoviridae_data_file.txt

## Brief overall description
Amino acid sequences of the RNA-dependent RNA polymerase (RdRp) encoded by the Mitoviridae members included in the phylogenetic study. The sequences are in FASTA format. The NCBI accession code and the abbreviated name of the virus are indicated in the header.
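
For readers who want to load the FASTA files without extra dependencies, a minimal parser could look like the sketch below. The function name `read_fasta` is hypothetical, and the example header is invented for illustration; the exact header layout of the data files is not assumed beyond the leading ">".

```python
def read_fasta(lines) -> dict:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary.

    `lines` can be any iterable of strings, for example an open file.
    """
    parts_by_header = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            parts_by_header[header] = []
        elif header is not None:
            parts_by_header[header].append(line)
    # Join multi-line sequences into a single string per record.
    return {h: "".join(p) for h, p in parts_by_header.items()}


# Invented record for illustration; not taken from the dataset.
example = [">XX000000 HypotheticalMV1\n", "MKLS\n", "TTPQ\n"]
records = read_fasta(example)
```

To read one of the data files, the same function can be called on an open file handle, for example `with open("RdRp_aa_seq_all_Mitoviridae_data_file.txt") as fh: records = read_fasta(fh)`.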

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
RdRp_aa_seq_Narnaviridae_data_file.txt

## Brief overall description
Amino acid sequences of the ICTV-accepted narnaviruses included as an outgroup in the phylogenetic analysis. The sequences are in FASTA format. The NCBI accession code and the abbreviated name of the virus are indicated in the header.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
QDA_R-script.txt

## Brief overall description
This file contains the script used to perform the quadratic discriminant analysis (QDA) in the R software environment.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Data file Name
QDA_data_file.txt

## Brief overall description
This two-entry data table contains the quantitative genomic variables considered as candidate genomic signatures and used in the QDA.

### Column names and meaning

#### Column number 1
Name: "Clade"
Unity: N/A
Meaning: Name of the main taxonomic group of members of Mitoviridae: Unuamitovirus (ICTV-accepted genus), 
Duamitovirus (ICTV-accepted genus), Triamitovirus (ICTV-accepted genus), Kvaramitovirus (ICTV-accepted 
genus), Kvinmitovirus (proposed genus-level clade of putative invertebrate-infecting mitoviruses), 
and Arkeomitovirinae (proposed subfamily-level clade of putative mitoviruses with unknown hosts)

#### Column number 2
Name: "NCBI accession"
Unity: N/A
Meaning: Accession code of the nucleotide sequence in the public repository of the National Center for Biotechnology Information (NCBI, U.S. National Library of Medicine, Bethesda MD, USA).

#### Column number 3
Name: "ORF name"
Unity: N/A
Meaning: The name of the open reading frame (ORF) was assigned deliberately and in accordance with the name of the virus registered under the NCBI accession code.

#### Column number 4
Name: "MFE(Kcal/mol)"
Unity: kcal/mol
Meaning: Minimum free energy (MFE) predicted for the secondary structure of each ORF sequence. Values were obtained with the RNAfold program of the ViennaRNA 
package, run with default settings on the http://rna.tbi.univie.ac.at/ web server (accessed in August 2023). (Sequences: ORF sequences detailed in column 6. ORFs shorter than 1300 nt were excluded from the MFE calculation) 

#### Column number 5
Name: "MFE(Kcal/mol/nt)"
Unity: kcal/mol/nt
Meaning: Minimum free energy (MFE) divided by ORF length. The MFE was predicted for the secondary structure of each ORF sequence. Values were obtained with the RNAfold program of the ViennaRNA 
package, run with default settings on the http://rna.tbi.univie.ac.at/ web server (accessed in August 2023). (Sequences: ORF sequences detailed in column 6. ORFs shorter than 1300 nt were excluded from the MFE calculation) 

#### Column number 6
Name: "ORF_length"
Unity: nt
Meaning: The open reading frame (ORF) length includes the first in-frame AUG codon and the in-frame TAA or TAG stop codons. ORFs were predicted by the NCBI-ORFfinder software (https://www.ncbi.nlm.nih.gov/orffinder/; 
accessed in August 2023). 

#### Column number 7-22
Names: "p(AA)" "p(AC)" "p(AG)" "p(AT)" "p(CA)" "p(CC)" "p(CG)" "p(CT)" "p(GA)" "p(GC)" "p(GG)" "p(GT)" "p(TA)" "p(TC)" "p(TG)" "p(TT)"
Unity: none
Meaning: Dinucleotide relative frequencies (ρXY), also referred to as the “odds” ratio, which is defined as the ƒobs(XY) (Column number 23-38) divided by the product of the corresponding fractional mononucleotide contents, ƒobs(X) and ƒobs(Y)

#### Column number 23-38
Names: "f(AA)" "f(AC)" "f(AG)" "f(AT)" "f(CA)" "f(CC)" "f(CG)" "f(CT)" "f(GA)" "f(GC)" "f(GG)" "f(GT)" "f(TA)" "f(TC)" "f(TG)" "f(TT)"
Unity: none
Meaning: Dinucleotide frequency (ƒXY) of every dimer (X – phosphodiester bond – Y). ƒXY values were determined using EMBOSS-compseq v. 6.6.0 (Rice et al., 2000) on the EMBOSSexplore website (https://www.bioinformatics.nl/emboss-explorer/; accessed in August 2023). 

#### Column number 39
Name: "AT3-Bias"
Unity: percentage (%)
Meaning: Value is calculated as ƒA3 / (ƒA3 + ƒT3), where ƒA3 represents the fractional mononucleotide content of "A" in the third positions of the codons, and ƒT3 represents the fractional mononucleotide 
content of "T=U" in the third positions of the codons. (Sequences: ORF sequences detailed in column 6)

#### Column number 40
Name: "GC3-Bias"
Unity: percentage (%)
Meaning: Value is calculated as ƒG3 / (ƒG3 + ƒC3), where ƒG3 represents the fractional mononucleotide content of "G" in the third positions of the codons, and ƒC3 represents the fractional mononucleotide 
content of "C" in the third positions of the codons. (Sequences: ORF sequences detailed in column 6)

#### Column number 41
Name: "rSCUB"
Unity: percentage (%)
Meaning: Residual (r) synonymous codon usage bias (SCUB): the relative difference between the observed Nc index (Column number 42) and the Nc index theoretically expected in the complete absence of selection pressure for the corresponding GC3 value ("%GC(3)", Column number 11 of Genomic_quantitative_data.txt).

#### Column number 42
Name: "Nc_index"
Unity: none 
Meaning: Effective Number of Codons (Nc) index. The Nc index for each sequence was determined using the EMBOSS-chips v. 6.6.0 program on the EMBOSSexplore website (accessed in August 2023). (Sequences: ORF sequences detailed in column 6)