DECODAGE – Communauté d’Annotation des Génomes

Architecture

TriAnnot pipeline architecture

To be modified and updated!

TriAnnot, the automatic structural and functional annotation pipeline, is organized into five main panels. In practice, and for ease of description, the pipeline is further divided into several steps. Depending on the step, the analysis runs on either the initial or the masked sequence, and several parameters can be set.

The step numbers and names below are those used in the ‘step.xml’ file (or template) that the TriAnnot pipeline requires. This file lists the programs, databanks and parameters of the default analysis, which is currently available for wheat, barley, rice, maize and oak. A short explanation is given for each step. The different steps.xml used for each species are available here.

A given step may be identical for all species; where it is not, we describe the specific settings for each species. For more details on the software and databanks used, see Software and Databanks respectively. In the latter, we give for each databank the corresponding feature name used with GenomeView for manual curation.

Panel I

  • Step1 - ncRNAs
    • TriAnnot allows for the identification of other sequence features based on specific bioinformatics programs such as tRNAscan (Lowe and Eddy, 1997 Nucleic Acids Res. 25, 955-964)

*** The analysis of Step2 is conducted on a sequence masked for ncRNAs from Step1 ***
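
The masking mentioned here (and again in Step2, with Ns or lower case) can be illustrated with a minimal sketch. It is not TriAnnot's own code: the function name `mask_intervals`, the 0-based end-exclusive coordinate convention and the example hit are all assumptions made for illustration.

```python
def mask_intervals(sequence, intervals, hard=True):
    """Mask 0-based, end-exclusive intervals of a sequence.

    hard=True replaces bases with 'N'; hard=False lower-cases them (soft masking).
    """
    chars = list(sequence)
    for start, end in intervals:
        # Clamp coordinates so that out-of-range hits cannot raise an IndexError.
        start = max(0, start)
        end = min(len(chars), end)
        for i in range(start, end):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


seq = "ACGTACGTACGT"
ncrna_hits = [(2, 5)]  # hypothetical coordinates of one ncRNA hit
hard_masked = mask_intervals(seq, ncrna_hits)              # "ACNNNCGTACGT"
soft_masked = mask_intervals(seq, ncrna_hits, hard=False)  # "ACgtaCGTACGT"
```

Hard masking hides a region from every downstream program, while soft masking keeps the bases readable for tools that treat lower case as a hint only.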

Panel II

  • Step2 - RepeatMasker
    • Transposable Elements, UniVec and E. coli annotation & masking (Ns/lower case) process using RepeatMasker against:
      • Different databanks for the 5 species
        • Wheat : Poaceae MIPS-repeats; NCBI UniVec and E. coli databanks
        • Barley: Poaceae MIPS-repeats; NCBI UniVec and E. coli databanks
        • Rice: Poaceae MIPS-repeats; NCBI UniVec and E. coli databanks
        • Maize: Poaceae MIPS-repeats; NCBI UniVec and E. coli databanks
        • Oak: Eudicots MIPS-repeats; NCBI UniVec and E. coli databanks
  • Step3 - BLASTx on initial sequence
    • Annotation of Transposable Elements using BLASTx against TREPprot (proteins)
      • BLASTx only for:
        • Wheat, Barley, Rice, Maize
  • Step4 - Gtallymer - MDRindex
    • k-mer composition is used to identify repeated regions, based on an index of 17-mer frequencies (called MDR, for Mathematically Defined Repeats). The index was computed with Tallymer (Kurtz et al., 2008 BMC Genomics 9, 517) from an Illumina read sample of sorted wheat chromosome var. Chinese Spring genome representing 1x coverage (Kumar et al. 2011 Current Genetics 100,455)
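
The idea behind a k-mer frequency index can be sketched in a few lines. This is only an illustration of the principle, not Tallymer: a plain `collections.Counter` stands in for the index, the toy example uses k=3 instead of the 17-mers quoted above, reverse complements are ignored, and the `min_count` cutoff is a hypothetical parameter.

```python
from collections import Counter


def build_kmer_index(reads, k):
    """Count every k-mer observed in a sample of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers that contain ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_profile(sequence, index, k):
    """Return, for each position, the index frequency of the k-mer starting there."""
    sequence = sequence.upper()
    return [index.get(sequence[i:i + k], 0) for i in range(len(sequence) - k + 1)]


def repeated_positions(profile, min_count):
    """Positions whose k-mer frequency reaches the (hypothetical) cutoff."""
    return [i for i, count in enumerate(profile) if count >= min_count]


reads = ["ACGACGACG", "TTTACG"]            # toy read sample
index = build_kmer_index(reads, k=3)
profile = kmer_profile("GGACGT", index, k=3)
flagged = repeated_positions(profile, min_count=2)
```

Positions whose k-mers are frequent in the read sample are likely to lie in repeated regions, which is what makes the approach independent of any repeat library.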

*** The analyses from Step2 to Step8 are conducted on a sequence masked for ncRNAs & TEs from Step1 & Step2 respectively ***

Panel III

  • Step5 - ab initio prediction
    • GeneID (wheat matrix) - (Guigo et al., 1992 J. Mol. Biol. 226, 141-157)
      • Only for
        • Wheat & Barley
    • Augustus (maize, wheat matrix) - (Stanke and Waack, 2003 Bioinformatics 19 Suppl 2, ii215-225)
      • Different matrix for:
        • Wheat & Barley with the wheat matrix
        • Rice & Maize with the maize matrix
    • FGeneSH (dicot and monocot matrix) – SoftBerry
      • Different matrix for:
        • Wheat, Barley, Rice & Maize with the monocot matrix
        • Oak with the Dicot matrix
  • Step6 - BLASTn (BLAST legacy) followed by Exonerate (Slater and Birney, 2005 BMC Bioinformatics 6, 31) spliced alignments, for similarity-based analysis of gene structure against (see Databanks):
    • RNAseq assemblies
      • BLASTn (Evalue=1e-5 / 80% identity / 95% coverage) / Exonerate (score 500 / 95% coverage / percent=90)
        • Wheat - a comprehensive ensemble de novo transcriptome assembly of Illumina short RNA-seq reads sampled from five different tissues, made by MIPS in December 2012 for the IWGSC sequence survey project
        • Barley - barley RNA-seq contig set representing 23,797 genes (91%) of the IBSC high-confidence gene set published by Mayer et al. Nature 2012, made by MIPS
        • Oak - Singletons and Assemblies of 454/Sanger/Illumina reads from Quercus robur and Quercus petraea - provided by Isabelle Lesur
      • no RNAseq resources for rice and maize
    • EMBL release databanks
      • Full length cDNA - BLASTn (Evalue=1e-5 / 70% identity / 80% coverage) / Exonerate (score 300 / 80% coverage / percent=80)
        • Wheat: Triticeae + Riken & Poaceae FL-cDNA
        • Barley: Hordeum & Poaceae FL-cDNA
        • Rice: Oryza & Poaceae FL-cDNA
        • Maize: Zea & Poaceae FL-cDNA
        • Oak: Arabidopsis, Prunus, Populus, and Rosids FL-cDNA
      • ESTs - BLASTn (Evalue=1e-5 / 80% identity / 95% coverage) / Exonerate (score 500 / 95% coverage / percent=90)
        • Wheat: Triticeae ESTs
        • Barley: Hordeum ESTs
        • Rice: Oryza ESTs
        • Maize: Zea ESTs
        • Oak: Quercus ESTs
      • CDS derived from genome model annotations - BLASTn (Evalue=1e-5 / 70% identity / 80% coverage) / Exonerate (score 300 / 80% coverage / percent=80)
        • Wheat, Barley, Rice and Maize use:
          1. B. distachyon from Phytozome
          2. H. vulgare (MIPS, The Genome Paper in Nature: doi:10.1038/nature11543 – High Confidence and Low Confidence predictions)
          3. O. sativa (Nipponbare) from IRGSP
          4. S. bicolor from Phytozome
          5. Z. mays from Phytozome
        • Rice & Maize use in addition:
          1. O. sativa (Nipponbare) from Phytozome
        • Oak uses:
          1. A. thaliana from TAIR
          2. P. trichocarpa from Phytozome
          3. P. persica from Phytozome
          4. V. vinifera from Phytozome
      • NCBI unigenes - BLASTn (Evalue=1e-5 / 70% identity / 80% coverage) / Exonerate (score 300 / 80% coverage / percent=80)
        • Wheat: Triticum aestivum
        • Barley: Hordeum vulgare
        • Rice: Oryza sativa
        • Maize: Zea mays
        • Oak: Quercus robur
    • Previous Annotation (only for wheat)
      • Wheat gene annotation on the IWGSC survey sequence from MIPS, Germany, in December 2012
      • Genes manually validated by S. Theil at GDEC for the 3BSEQ project assembly 4_1, Annotation 4_2 (Automatic annotation made with TriAnnot 3.5 modified)
  • Step7 - BLASTx (BLAST legacy) followed by Exonerate spliced alignments, for similarity-based analysis of gene structure against (see Databanks):
    • EMBL species-specific proteome
      • BLASTx (Evalue=1e-5 / 40% identity / 80% coverage) / Exonerate (score 300 / 80% coverage / percent=70)
        • Wheat: Triticeae proteome
        • Barley: Hordeum proteome
        • Rice: Oryza proteome
        • Maize: Zea proteome
        • Oak: Rosids proteome
      • BLASTx (Evalue=1e-5 / 40% identity / 40% coverage) / Exonerate (score 300 / 80% coverage / percent=70)
        • for Wheat, Barley, Rice, Maize & Oak against SIMprot (see Databanks)
    • Peptides derived from genome model annotations (CDS)
      • BLASTx (Evalue=1e-5 / 40% identity / 80% coverage) / Exonerate (score 300 / 80% coverage / percent=70)
        • Wheat & Barley use:
          1. B. distachyon from Phytozome
          2. H. vulgare (MIPS, The Genome Paper in Nature: doi:10.1038/nature11543 – High Confidence and Low Confidence predictions)
          3. Aegilops tauschii from Jia et al (2013) Nature 496:91-95
          4. Triticum urartu from Ling et al (2013) Nature 496:87-90
          5. O. sativa (Nipponbare) from IRGSP
          6. O. sativa (Nipponbare) from Phytozome
          7. S. bicolor from Phytozome
          8. Z. mays from Phytozome
        • Rice & Maize use:
          1. B. distachyon from Phytozome
          2. H. vulgare (MIPS, The Genome Paper in Nature: doi:10.1038/nature11543 – High Confidence predictions)
          3. O. sativa (Nipponbare) from IRGSP
          4. O. sativa (Nipponbare) from Phytozome
          5. S. bicolor from Phytozome
          6. Z. mays from Phytozome
        • Oak
          1. A. thaliana from TAIR
          2. P. trichocarpa from Phytozome
          3. P. persica from Phytozome
          4. V. vinifera from Phytozome
  • Step8 - Gene Modeling - The structural annotation is based on three complementary approaches followed by a selector filter or chooser. Five categories are proposed based on biological evidence:
    • The first approach is a DNA similarity based approach using only SIMsearch (developed by NIAS). Three annotation processes are launched:
      • SIMsearch - category 1 genes (CAT1) against FL-cDNA
        • Wheat: Triticeae + Riken FL-cDNA
        • Barley: Hordeum FL-cDNA
        • Rice: Oryza FL-cDNA
        • Maize: Zea FL-cDNA
        • Oak: Rosids FL-cDNA
      • SIMsearch - category 2 genes (CAT2) against predicted CDS from model plant genomes
        • Wheat, Barley, Rice, Maize use SIMnucWheat databank (see Databanks) [remark: SIMnucWheat doesn't mean in this case that SIMnuc is restricted to wheat]
        • Oak uses SIMnucOak databank (see Databanks)
      • SIMsearch - category 3 genes (CAT3) against predicted mRNA from EMBL Magnoliophyta (flowering plants) databank
        • SIMsearch follows three successive analyses:
          1. BLASTn (≥ 80% nucleotide identity and ≥ 90% nucleotide coverage)
          2. BLASTn hit sequences are then retrieved and a spliced alignment against the sequence is produced with Exonerate (Slater and Birney, 2005 BMC Bioinformatics 6, 31)
          3. The Exonerate products are then used as queries in a BLASTx search against the SIMprot databank (see Databanks), which is composed of public protein databanks and proteins derived from the annotation of model genomes. The best hit is used by SIMsearch to define an Open Reading Frame (ORF). If start and/or stop codons cannot be found within the aligned region, the ORF is extended in both 5’ and 3’ directions as described by Amano et al. (2010 DNA Res. 17, 271-279). If no protein hit is found, SIMsearch can use a relevant ab initio prediction to define the ORF. Homologous hits without an initiation and/or termination codon, or for which no ab initio prediction can be found, are discarded
    • The second approach uses the gene combiner EuGene (Schiex et al., 2001 in Computational Biology. Gascuel, O., Sagot, M-F)
      • Wheat combines previous biological evidences obtained from:
        1. Augustus, GeneID and FGeneSH ab initio prediction results
        2. Wheat MIPS-RNAseq results
        3. Triticeae ESTs results
        4. H. vulgare CDS and protein-derived-CDS (MIPS, The Genome Paper in Nature: doi:10.1038/nature11543 – only High Confidence predictions)
        5. Proteins derived from CDS annotation of B. distachyon, T. urartu & Ae. tauschii (Nature paper) and Oryza sativa (IRGSP)
      • Barley combines previous biological evidences obtained from:
        1. Augustus, GeneID and FGeneSH ab initio prediction results
        2. Barley MIPS-RNAseq results
        3. Hordeum ESTs results
        4. H. vulgare CDS and protein-derived-CDS (MIPS, The Genome Paper in Nature: doi:10.1038/nature11543 – only High Confidence predictions)
        5. Proteins derived from CDS annotation of B. distachyon and Oryza sativa (IRGSP)
      • Rice combines previous biological evidences obtained from:
        1. Augustus and FGeneSH ab initio prediction results
        2. Oryza ESTs results
        3. Oryza sativa CDS and protein-derived-CDS obtained from IRGSP and Phytozome
        4. Proteins derived from CDS annotation of B. distachyon
      • Maize combines previous biological evidences obtained from:
        1. Augustus and FGeneSH ab initio prediction results
        2. Zea ESTs results
        3. Oryza sativa CDS and protein-derived-CDS obtained from IRGSP
        4. Sorghum bicolor, Zea mays CDS and protein-derived-CDS obtained from Phytozome
      • Oak combines previous biological evidences obtained from:
        1. FGeneSH ab initio prediction results
        2. Oak RNAseq results
        3. Quercus ESTs results
        4. Proteins derived from CDS annotation of Arabidopsis thaliana, Prunus persica, Populus trichocarpa, Vitis vinifera
    • The third approach is based only on ab initio prediction
      • Wheat and Barley are based on Augustus using a wheat matrix
      • Rice and Maize are based on Augustus using a maize matrix
      • Oak is based on FGeneSH using a dicot matrix
  • Step9
    • Hidden
  • Step10- Selector or chooser
    • At a given locus, the best protein-coding gene model is selected by calculating a score, which enables:
      • the automated validation of gene structure with the definition of a confidence level:
        • HC - High Confidence
          • clear biological evidence for start & stop codons and for intron-exon junctions
          • a protein hit coverage > 70%
        • LC - Low Confidence
          • no clear biological evidence for at least one of these: start codon, stop codon or intron-exon junction
          • a protein hit coverage > 70%
      • identification of pseudogenes
        • pseudogenes could be HC or LC but this is not shown
          • protein hit coverage < 70%
      • the exclusion of ab-initio predictions without any similarity with plant proteins or with similarity only with TE-encoded protein
    • To provide users with a representation of the quality index for the gene prediction, TriAnnot displays a new color coded system in which each of the above mentioned categories is symbolized with a specific color:
      • Red for High Confidence gene models
      • Blue for Low Confidence gene models
      • Gray for pseudogenes
    • The gene models are masked with lower-case for further analysis in Panel IV
Color code system for gene models
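
The confidence rules and color code of Step10 can be summarized as a small decision function. This is a sketch under stated assumptions, not the TriAnnot scoring code: the `GeneModel` fields are hypothetical names, a coverage of exactly 70% (which the text leaves unspecified) is treated as passing, and every model without a plant protein hit is excluded, whereas the text applies that exclusion to ab initio predictions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneModel:
    # All field names are hypothetical; they mirror the criteria listed in Step10.
    start_supported: bool
    stop_supported: bool
    junctions_supported: bool
    protein_hit_coverage: Optional[float]  # percent; None when no plant protein hit
    only_te_hits: bool = False             # similarity found only with TE-encoded proteins


COLORS = {"HC": "red", "LC": "blue", "pseudogene": "gray"}


def classify(model):
    """Return 'HC', 'LC', 'pseudogene', or None when the model is excluded."""
    if model.protein_hit_coverage is None or model.only_te_hits:
        return None  # simplification: no plant protein support at all
    if model.protein_hit_coverage < 70:
        return "pseudogene"
    if model.start_supported and model.stop_supported and model.junctions_supported:
        return "HC"
    return "LC"


label = classify(GeneModel(True, True, True, 85.0))
color = COLORS[label]
```

Keeping the label and the display color separate means the classification can be reused in exports where color has no meaning.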

  • Step11- Functional Annotation
    • Putative functions for the gene models are assigned by combining similarity searches (BLASTP) against several protein databanks with searches against the Pfam (Sammut et al., 2008 Brief Bioinform. 9, 210-219; Finn et al., 2010 Nucleic Acids Res. 38, D211-222) protein domain collection using HMMER 3.0. TriAnnot follows a nomenclature based on the guideline established in 2006 by the IWGSC annotation working group. Functional annotation is therefore a step-by-step process that stops as soon as a step returns a result. Six levels of annotation are defined as follows:
      • “known function”: when >80% identity over >80% of the protein length is found with a known protein in UniProtKB/Swiss-Prot. This category reflects the highest quality for functional annotation
      • “putative function”: when >45% similarity over >50% of the protein length is found with a known protein in the EMBL Magnoliophyta proteome
      • “domain containing protein”: when there is no significant BLASTP hit with a known or putative function in the previous steps, but one or more Pfam domains (Sammut et al., 2008; Finn et al., 2010) are identified
      • “expressed sequence”: based on TBLASTN against species-specific plant EST databanks (EMBL) with >45% identity and >50% coverage
        • Wheat: Triticeae ESTs
        • Barley: Hordeum ESTs
        • Rice: Oryza ESTs
        • Maize: Zea ESTs
        • Oak: Quercus ESTs
      • “conserved unknown function”: when no expressed sequence is found, and when >45% similarity over >50% of the protein length is found only with a protein of unknown function (i.e. a protein annotated as “putative”, “hypothetical” or “predicted”) in the EMBL Magnoliophyta proteome
      • “hypothetical protein”: when no similarity is found in UniProtKB/Swiss-Prot, Magnoliophyta, Pfam domains or ESTs
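
The six annotation levels form a cascade that stops at the first level whose criterion is met. The sketch below shows that logic only. The function and argument names are hypothetical, each hit is assumed to be summarized as a `(percent, coverage)` pair, and the Pfam accession in the example is purely illustrative.

```python
def passes(hit, min_percent, min_coverage):
    """True when a (percent, coverage) hit strictly exceeds both thresholds."""
    return hit is not None and hit[0] > min_percent and hit[1] > min_coverage


def functional_category(swissprot=None, magnoliophyta_known=None,
                        pfam_domains=(), est=None, magnoliophyta_unknown=None):
    """Walk the six levels in order and return the first one that applies."""
    if passes(swissprot, 80, 80):
        return "known function"
    if passes(magnoliophyta_known, 45, 50):
        return "putative function"
    if pfam_domains:
        return "domain containing protein"
    if passes(est, 45, 50):
        return "expressed sequence"
    if passes(magnoliophyta_unknown, 45, 50):
        return "conserved unknown function"
    return "hypothetical protein"


# A weak Swiss-Prot hit plus one Pfam domain falls through to the third level.
category = functional_category(swissprot=(60.0, 95.0), pfam_domains=("PF00069",))
```

Because the order of the tests encodes the priority of the evidence, a gene model always receives the most informative label it qualifies for.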

*** The analysis of Step12 is conducted on protein derived from gene prediction in Step10 ***

  • Step12 - Best hits searches
    • Comparative sequence analysis of genomic regions from related species can greatly support gene identification in the annotation process. For all proteins derived from TriAnnot predicted gene models, TriAnnot searches for the best BLASTP hit (Evalue=1e-5 / 40% identity / 70% coverage) followed by Exonerate (Slater and Birney, 2005 BMC Bioinformatics 6, 31) spliced alignments (score 300 / 70% coverage / percent = 60). When available, the TriAnnot pipeline uses the proteome derived from genome annotation. Otherwise, TriAnnot uses the proteome of a genus obtained by querying the EMBL databanks. The same approach is applied to every species (wheat, barley, rice, maize and oak).
      • Peptides derived from genome model annotations
        • Aegilops tauschii from Jia et al (2013) Nature 496:91-95
        • Arabidopsis thaliana from TAIR
        • Brachypodium distachyon from Phytozome
        • Brassica rapa from Phytozome
        • Glycine max from Phytozome
        • Hordeum vulgare (MIPS, The Genome Paper in Nature: doi:10.1038/nature11543 – High Confidence and Low Confidence predictions)
        • Medicago truncatula from Phytozome
        • Oryza sativa (Nipponbare) from IRGSP
        • Oryza sativa (Nipponbare) from Phytozome
        • Prunus persica from Phytozome
        • Populus trichocarpa from Phytozome
        • Solanum lycopersicum from Phytozome
        • Sorghum bicolor from Phytozome
        • Theobroma cacao from Phytozome
        • Triticum urartu from Ling et al (2013) Nature 496:87-90
        • Setaria italica from Phytozome
        • Vitis vinifera from Phytozome
        • Zea mays from Phytozome
      • EMBL plant proteomes
        • EMBL Hordeum proteome
        • EMBL Saccharum proteome
        • EMBL Triticum proteome
      • UniProtKB
      • Other
  • Step13 - Domains & Gene Ontology
    • TriAnnot provides Gene Ontology (GO) terms for each gene model and protein domain predictions based on InterProScan (Zdobnov and Apweiler, 2001 Bioinformatics 17, 847-848). Searches are done against Pfam (Sammut et al., 2008 Brief Bioinform. 9, 210-219; Finn et al., 2010 Nucleic Acids Res. 38, D211-222), Prosite (Sigrist et al., 2010 Nucleic Acids Res. 38, D161-166) and SMART (Letunic et al., 2009 Nucleic Acids Res. 37, D229-232)
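
The BLAST cutoffs quoted in Step6, Step7 and Step12 differ by databank type, so it can help to gather them in one lookup table. The sketch below does that and applies them to a hit. The dictionary keys and the function name are hypothetical labels, and treating the thresholds as inclusive is an assumption, since the text does not say whether they are strict.

```python
# (minimum % identity, minimum % coverage) per databank type, as quoted in the text.
THRESHOLDS = {
    "RNAseq": (80, 95),     # Step6
    "FL-cDNA": (70, 80),    # Step6
    "ESTs": (80, 95),       # Step6
    "CDS": (70, 80),        # Step6
    "unigenes": (70, 80),   # Step6
    "proteome": (40, 80),   # Step7
    "SIMprot": (40, 40),    # Step7
    "peptides": (40, 80),   # Step7
    "best_hit": (40, 70),   # Step12
}
EVALUE_MAX = 1e-5  # the same E-value cutoff is quoted for all of these searches


def keep_hit(databank, evalue, identity, coverage):
    """Return True when a BLAST hit meets the cutoffs of its databank type."""
    min_identity, min_coverage = THRESHOLDS[databank]
    return (evalue <= EVALUE_MAX
            and identity >= min_identity
            and coverage >= min_coverage)


kept = keep_hit("ESTs", evalue=1e-20, identity=85.0, coverage=96.0)
rejected = keep_hit("ESTs", evalue=1e-20, identity=85.0, coverage=90.0)
```

The Exonerate score, coverage and percent values that follow each BLAST search could be stored in a second table of the same shape.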

Panel IV

TriAnnot also searches for other sequence features through comparative genomics, using BLASTN similarity searches against major plant genomes. The search is performed on the un-annotated portions of the query sequence (ncRNAs, TEs and gene models masked with Ns). This module also reports hits against plastid and mitochondrial genomes, to identify fragments of such sequences integrated into the nuclear genome (or contaminations).

  • Step14
    • Organelles annotation
      • BLASTn (Evalue=1e-1 / 40% identity / 60% coverage)
        • NCBI RefSeq plant mitochondrial genomes
        • NCBI RefSeq plant plastid genomes

*** The analyses from Step14 (except organelles annotation) to Step15 are conducted on a sequence masked for ncRNAs, TEs and predicted genes from Step1, Step2 & Step10 respectively ***

    • Conserved non-coding sequences (CNSs)
      • BLASTn (Evalue=1e-3)
        • Wheat, Barley, Rice and Maize
          • Aegilops tauschii genome from Jia et al (2013) Nature 496:91-95
          • Aegilops tauschii genome from TGAC
          • Aegilops sharonensis genome from TGAC
          • Brachypodium distachyon genome from Phytozome
          • Hordeum vulgare genome (MIPS, The Genome Paper in Nature: doi:10.1038/nature11543)
          • Setaria italica genome from Phytozome
          • Sorghum bicolor genome from Phytozome
          • Triticum durum genome from TGAC
          • Triticum monococcum genome from TGAC
          • Triticum speltoides genome from TGAC
          • Triticum urartu genome from TGAC
          • Triticum durum var. Strongfield genome (Clarke et al. 2005 Can. J. Plant Sci. 85:651-654) from TGAC
          • Triticum urartu genome from Ling et al (2013) Nature 496:87-90
          • Zea mays genome from Phytozome
        • Oak

Panel V

  • Step15 - SSRs or microsatellites
    • Simple Sequence Repeats (SSRs), or microsatellites, have been extensively used for molecular marker design in plants (Paux and Sourdille, 2009 in Genetics and Genomics of the Triticeae, eds. C. Feuillet & J.G. Muehlbauer. (New York: Springer), 255-284). In wheat, their density was estimated at one SSR every 13.1 kb (Choulet et al., 2010 Plant Cell 22, 1686-1701). TriAnnot uses the TRF program (Tandem Repeats Finder) (Benson, 1999 Nucleic Acids Res. 27, 573-580) with specific parameters to enhance the detection of such repeats.
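
To make concrete what a perfect SSR looks like, here is a minimal regular-expression sketch. It is deliberately not TRF, which also handles imperfect repeats and uses a statistical model. The function name and the default parameters (unit length 1 to 6, at least 4 copies) are assumptions chosen for illustration, not TriAnnot settings.

```python
import re


def find_ssrs(sequence, min_unit=1, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats; returns (start, end, unit, copies) tuples.

    Coordinates are 0-based and end-exclusive. The lazy quantifier makes the
    regex prefer the shortest repeat unit at each position.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


ssrs = find_ssrs("TTGACACACACACGGA")  # contains (AC) repeated 5 times
```

On real sequences the TRF output should be preferred; a sketch like this is mainly useful for quick sanity checks on small regions.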