The complexity landscape of viral genomes
Overview

Resources

Supplementary Material To The Article.

Model Selection

Selection of the best compression model.

Synthetic Sequences

Detection of inverted repetitions in synthetic sequences using NC and NBDM.

Genome Characterization

Complexity characterization and inverted repetition abundance of different viral genomes.

Viral Analysis

Viral analysis and inverted repeats detection.

Phylogenetic Tree

Phylogenetic tree of viral genomes and IR detection.

Research Pipeline

Description of the work presented

Model Selection

We used the state-of-the-art genomic compressor GeCo3 (Code, Article) to perform the complexity analysis. To determine which of its models to use, we compressed each viral sequence with 19 different models and computed the normalized compression (NC) of each result. The sum of the NC values for each model, over all reference genomes, is depicted below. Overall, model 16 achieved the best compression (lowest summed NC) and was therefore used in this analysis.
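The selection rule described above (sum the NC per model, keep the model with the lowest sum) can be sketched as follows. This is a minimal illustration, not the project's own script: the function name `select_best_model` and the NC values are hypothetical placeholders, not results from the study.

```python
def select_best_model(nc_by_model):
    """Return (best_model, sums), where best_model has the lowest summed NC.

    nc_by_model maps a model identifier to the list of NC values obtained
    when compressing every genome with that model.
    """
    sums = {model: sum(values) for model, values in nc_by_model.items()}
    best = min(sums, key=sums.get)
    return best, sums


# Hypothetical NC values for three models over three genomes (not real data).
example = {
    1: [0.98, 0.97, 0.99],
    2: [0.95, 0.96, 0.94],
    3: [0.96, 0.95, 0.97],
}

best_model, summed = select_best_model(example)
print(best_model, summed)
```

Summing rather than averaging gives the same ranking as long as every model is run on the same set of genomes, which is the setting described here.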

Synthetic Sequences Analysis

To verify that the program correctly identifies inverted repeats (IRs) in genomic sequences, we created a synthetic sequence of 10,000 nucleotides in which the last 5,000 nucleotides are an inverted repeat of the first 5,000. This sequence was mutated incrementally from 1% to 100%, and for each mutated sequence we measured the NC using GeCo3, without and with the IR detection subprogram (IR 0 and IR 1, respectively), as well as the Normalized Block Decomposition Method (NBDM). The results for the sequences with up to 10% mutation are shown below, together with the difference between IR 0 and IR 1. These results show that the program detects IRs even when the sequence is substantially mutated (up to 5% mutation), since up to that point it compresses the sequence better than the approaches that do not take IRs into account.
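A sequence of this kind could be generated roughly as in the sketch below. Several points are assumptions made for illustration only: the inverted repeat is taken to be the reverse complement of the first half, mutations are modelled as substitutions at randomly chosen positions, and the helper names (`make_ir_sequence`, `mutate`) and the seed are hypothetical. The text does not state which tool or mutation model was actually used.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def make_ir_sequence(half_length, rng):
    """Random first half followed by its reverse complement (the IR)."""
    first = "".join(rng.choice("ACGT") for _ in range(half_length))
    return first + reverse_complement(first)


def mutate(seq, rate, rng):
    """Substitute a fraction `rate` of positions with a different base."""
    n_mut = round(len(seq) * rate)
    bases = list(seq)
    for i in rng.sample(range(len(seq)), n_mut):
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    return "".join(bases)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = make_ir_sequence(5000, rng)

# One mutated copy per mutation level, from 1% to 100%.
mutated = {pct: mutate(original, pct / 100, rng) for pct in range(1, 101)}
```

Each entry of `mutated` would then be written to a file and compressed with and without IR detection to obtain the two NC curves.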

Viral Genome Characterization

For all genomes in the database, we computed the NC using GeCo3 with different configurations of the subprogram that handles inverted repeats: IR 0 disables IR detection; IR 1 uses the IR detection subprogram together with the regular context models; and IR 2 uses the IR detection subprogram without the regular context models.
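For reference, the NC of a sequence is commonly defined as the compressed size in bits divided by the sequence length times the base-2 logarithm of the alphabet size (2 bits per base for DNA). The page does not spell the formula out, so the sketch below assumes that usual definition; the function name and the example sizes are hypothetical, and the compressed size would in practice come from the compressor's output.

```python
import math


def normalized_compression(compressed_bytes, sequence_length, alphabet_size=4):
    """NC = compressed bits / (sequence length * log2(alphabet size)).

    Assumes the common definition of NC; alphabet_size=4 corresponds to DNA.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    compressed_bits = 8 * compressed_bytes
    return compressed_bits / (sequence_length * math.log2(alphabet_size))


# Hypothetical sizes: a 10,000-nucleotide sequence compressed to 1,250 bytes.
print(normalized_compression(1250, 10_000))  # 0.5
```

Under this definition an NC close to 1 means the sequence is nearly incompressible, while values close to 0 indicate high redundancy.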

Virus Groups Analysis

The same computation was done on different taxonomic groups of viruses. The results, as well as their closest phylogenetic tree, are shown below.

Adnaviria
Duplodnaviria
Monodnaviria
Riboviria
Ribozyviria
Varidnaviria

Bamfordvirae
Helvetiavirae
Heunggongvirae
Loebvirae
Orthornavirae
Pararnavirae
Sangervirae
Shotokuvirae
Trapavirae
Zilligvirae

Artverviricota
Cossaviricota
Cressdnaviricota
Dividoviricota
Duplornaviricota
Hofneiviricota
Kitrinoviricota
Lenarviricota
Negarnaviricota
Nucleocytoviricota
Peploviricota
Phixviricota
Pisuviricota
Preplasmiviricota
Saleviricota
Taleaviricota
Uroviricota

Allassoviricetes
Alsuviricetes
Amabiliviricetes
Arfiviricetes
Caudoviricetes
Chrymotiviricetes
Chunqiuviricetes
Duplopiviricetes
Ellioviricetes
Faserviricetes
Flasuviricetes
Herviviricetes
Howeltoviricetes
Huolimaviricetes
Insthoviricetes
Laserviricetes
Magsaviricetes
Malgrandaviricetes
Maveriviricetes
Megaviricetes
Miaviricetes
Milneviricetes
Monjiviricetes
Mouviricetes
Naldaviricetes
Papovaviricetes
Pisoniviricetes
Pokkesviricetes
Quintoviricetes
Repensiviricetes
Resentoviricetes
Revtraviricetes
Stelpaviricetes
Tectiliviricetes
Tokiviricetes
Tolucaviricetes
Vidaverviricetes
Yunchangviricetes

Algavirales
Amarillovirales
Articulavirales
Asfuvirales
Baphyvirales
Belfryvirales
Blubervirales
Bunyavirales
Caudovirales
Chitovirales
Cirlivirales
Cremevirales
Cryppavirales
Durnavirales
Geplafuvirales
Ghabrivirales
Goujianvirales
Halopanivirales
Haloruvirales
Hepelivirales
Herpesvirales
Imitervirales
Jingchuvirales
Kalamavirales
Lefavirales
Levivirales
Ligamenvirales
Martellivirales
Mindivirales
Mononegavirales
Mulpavirales
Muvirales
Nidovirales
Nodamuvirales
Ortervirales
Ourlivirales
Patatavirales
Petitvirales
Piccovirales
Picornavirales
Pimascovirales
Polivirales
Priklausovirales
Primavirales
Reovirales
Rowavirales
Sepolyvirales
Serpentovirales
Sobelivirales
Stellavirales
Tolivirales
Tubulavirales
Tymovirales
Vinavirales
Wolframvirales
Zurhausenvirales

Ackermannviridae
Adenoviridae
Alloherpesviridae
Alphaflexiviridae
Alphasatellitidae
Alphatetraviridae
Alvernaviridae
Amalgaviridae
Amnoonviridae
Ampullaviridae
Anelloviridae
Arenaviridae
Arteriviridae
Artoviridae
Ascoviridae
Asfarviridae
Aspiviridae
Astroviridae
Autographiviridae
Avsunviroidae
Bacilladnaviridae
Baculoviridae
Barnaviridae
Benyviridae
Betaflexiviridae
Bicaudaviridae
Bidnaviridae
Birnaviridae
Bornaviridae
Botourmiaviridae
Bromoviridae
Caliciviridae
Carmotetraviridae
Caulimoviridae
Chaseviridae
Chrysoviridae
Chuviridae
Circoviridae
Closteroviridae
Coronaviridae
Corticoviridae
Cruliviridae
Cystoviridae
Deltaflexiviridae
Demerecviridae
Dicistroviridae
Drexlerviridae
Endornaviridae
Euroniviridae
Filoviridae
Fimoviridae
Finnlakeviridae
Flaviviridae
Fuselloviridae
Gammaflexiviridae
Geminiviridae
Genomoviridae
Globuloviridae
Gresnaviridae
Guttaviridae
Halspiviridae
Hantaviridae
Hepadnaviridae
Hepeviridae
Herelleviridae
Herpesviridae
Hypoviridae
Hytrosaviridae
Iflaviridae
Inoviridae
Iridoviridae
Kitaviridae
Kolmioviridae
Lavidaviridae
Leviviridae
Lipothrixviridae
Lispiviridae
Luteoviridae
Malacoherpesviridae
Marnaviridae
Marseilleviridae
Matonaviridae
Mayoviridae
Megabirnaviridae
Mesoniviridae
Metaviridae
Microviridae
Mimiviridae
Mitoviridae
Mononiviridae
Mymonaviridae
Myoviridae
Mypoviridae
Nairoviridae
Nanghoshaviridae
Nanhypoviridae
Nanoviridae
Narnaviridae
Nodaviridae
Nudiviridae
Nyamiviridae
Olifoviridae
Orthomyxoviridae
Ovaliviridae
Papillomaviridae
Paramyxoviridae
Partitiviridae
Parvoviridae
Peribunyaviridae
Permutotetraviridae
Phasmaviridae
Phenuiviridae
Phycodnaviridae
Picobirnaviridae
Picornaviridae
Plasmaviridae
Plectroviridae
Pleolipoviridae
Pneumoviridae
Podoviridae
Polycipiviridae
Polydnaviridae
Polyomaviridae
Portogloboviridae
Pospiviroidae
Potyviridae
Poxviridae
Qinviridae
Quadriviridae
Reoviridae
Retroviridae
Rhabdoviridae
Roniviridae
Rudiviridae
Secoviridae
Sinhaliviridae
Siphoviridae
Smacoviridae
Solemoviridae
Solinviviridae
Sphaerolipoviridae
Spiraviridae
Sunviridae
Tectiviridae
Tobaniviridae
Togaviridae
Tolecusatellitidae
Tombusviridae
Tospoviridae
Totiviridae
Tristromaviridae
Turriviridae
Tymoviridae
Virgaviridae
Wupedeviridae
Xinmoviridae
Yueviridae

Viral Complexity Tree

Phylogenetic tree showing the average NC of each viral group (top) and the NC obtained using the inverted repetition detection program (bottom). Red depicts the highest complexity and blue the lowest.

Note that low NC values indicate a more compressible (less complex) genome. Conversely, high values on the bottom tree indicate that more inverted repeats are present in the viral genome.
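As a rough illustration of how per-group values like those on the trees could be aggregated, the sketch below averages the NC of each group and the NC reduction obtained when IR detection is enabled. The exact quantity plotted on the bottom tree is not specified here; treating it as the mean difference between IR 0 and IR 1 is an assumption, and the function name and the NC values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_groups(records):
    """Aggregate (group, nc_ir0, nc_ir1) records per taxonomic group.

    Returns {group: (mean NC with IR 0, mean of NC_IR0 - NC_IR1)}.
    A larger difference suggests that IR detection helps more, i.e. that
    more inverted repeats are present (assumed interpretation).
    """
    grouped = defaultdict(list)
    for group, nc_ir0, nc_ir1 in records:
        grouped[group].append((nc_ir0, nc_ir1))
    return {
        group: (mean(a for a, _ in values), mean(a - b for a, b in values))
        for group, values in grouped.items()
    }


# Hypothetical NC values (not results from the study).
records = [
    ("Herpesvirales", 0.95, 0.90),
    ("Herpesvirales", 0.93, 0.90),
    ("Picornavirales", 0.97, 0.97),
]
summary = summarize_groups(records)
print(summary)
```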

Research Team

Researchers involved in this project.

Jorge Miguel Silva

PhD Student

Diogo Pratas

Researcher

Tânia Caetano

Researcher

Sérgio Matos

Assistant Professor

Code of The Project

Funding

This work was funded by National Funds through the FCT in the context of the project UID/CEC/00127/2019 and the research grant SFRH/BD/141851/2018. D.P. is funded by national funds through FCT - Fundação para a Ciência e a Tecnologia, I.P., under the Scientific Employment Stimulus - Institutional Call - CI-CTTI-94-ARH/2019. T.C. is funded by national funds (OE), through FCT – Fundação para a Ciência e a Tecnologia, I.P., in the scope of the framework contract foreseen in the numbers 4, 5 and 6 of the article 23, of the Decree-Law 57/2016, of August 29, changed by Law 57/2017, of July (CEECIND/01463/2017). Thanks are due to FCT/MCTES for the financial support to CESAM (UIDP/50017/2020+UIDB/50017/2020), through national funds.

ALL RIGHTS RESERVED © 2021 UA.PT