Triticum aestivum Assembly and Gene Annotation

The bread wheat genome in Ensembl Plants is the Chromosome Survey Sequence (CSS) for Triticum aestivum cv. Chinese Spring, combined with the reference sequence of chromosome 3B, both generated by the International Wheat Genome Sequencing Consortium (IWGSC). The CSS assemblies have been further refined into chromosomal pseudomolecules using POPSEQ data generated by Chapman et al. The gene models for 3B were annotated by the GDEC group at INRA. For all other chromosomes, gene models were generated by PGSB (version 2.2). The data presented here have been prepared as part of the Triticeae Genomics for Sustainable Agriculture project, funded by the U.K. Biotechnology and Biological Sciences Research Council.

See also the wheat homepage at URGI.

Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the Fertile Crescent around 7,000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gbp, composed of three closely related and independently maintained genomes that are the result of a series of naturally occurring hybridization events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). This first hybridization event produced tetraploid emmer wheat (AABB, T. dicoccoides), which hybridized again with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.


Ordered pseudomolecules from the IWGSC Chromosome Survey Sequence (CSS) 1.0

The bread wheat genome assembly presented here was produced by the International Wheat Genome Sequencing Consortium (IWGSC) [1], ordered into chromosomal pseudomolecules using population sequencing (POPSEQ) data generated by Chapman et al. [2].

The IWGSC Chromosome Survey Sequence was generated using the Illumina platform from flow-sorted chromosome arms. The resulting assemblies are fragmented, and resolution of repetitive regions is still limited. Nonetheless, assembly of gene-containing regions is reasonably good (N50 of 2.5 kb), and the predicted gene models are close in length and exon count to those previously predicted for closely related species.

For all chromosomes except 3B, 1,290,751 scaffolds from the CSS were anchored into chromosomal pseudomolecules, for a total length of 4,237,502,413 bp. Ensembl Plants also incorporates a set of unanchored scaffolds: a scaffold was included if it contains a gene model, a sequence variant, or an alignment, or if it is longer than 3 kb. This gives a total of 261,251 unanchored scaffolds, with a cumulative length of 1,307,508,887 bp.
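The inclusion rule for unanchored scaffolds can be illustrated with a minimal Python sketch. The `Scaffold` class, its field names, and the example records are hypothetical and exist only for illustration; only the criteria themselves (gene model, sequence variant, alignment, or length above 3 kb) come from the text above.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 3_000  # "longer than 3 kb" threshold stated in the text


@dataclass
class Scaffold:
    # Hypothetical record; field names are illustrative assumptions.
    name: str
    length_bp: int
    has_gene_model: bool = False
    has_variant: bool = False
    has_alignment: bool = False


def keep_unanchored(s: Scaffold) -> bool:
    """Return True if an unanchored scaffold meets any inclusion criterion."""
    carries_feature = s.has_gene_model or s.has_variant or s.has_alignment
    return carries_feature or s.length_bp > MIN_LENGTH_BP


# Made-up example scaffolds, not real CSS identifiers.
scaffolds = [
    Scaffold("scaf_a", 1_200, has_gene_model=True),  # short, but carries a gene
    Scaffold("scaf_b", 5_400),                       # no features, but > 3 kb
    Scaffold("scaf_c", 900),                         # short and featureless
]
kept = [s.name for s in scaffolds if keep_unanchored(s)]
```

Here `scaf_a` and `scaf_b` are retained, while `scaf_c` is dropped because it satisfies none of the criteria.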

The complete set of survey sequences may be downloaded from The Genome Analysis Centre (TGAC), and may be searched using the TGAC BLAST server. The data are also available in the archives of the International Nucleotide Sequence Database Consortium, under the PRJEB3955 project.

In addition to sequence assemblies and gene models, a number of additional data sets have been aligned to the survey sequence, including the complete genomes of Brachypodium distachyon, rice (Oryza sativa), and barley (Hordeum vulgare), as well as wheat UniGene clusters from NCBI, and wheat RNA-seq data deposited in the INSDC archives.

Chromosome 3B [3]

The reference sequence of the 1-gigabase chromosome 3B of hexaploid bread wheat was produced by sequencing 8,452 bacterial artificial chromosomes in pools and assembling the reads into a sequence of 774 megabases.

In addition to the reference sequence of chromosome 3B, 1,450 unanchored scaffolds are present.

This reference sequence from the IWGSC replaces the CSS-derived assembly of 3B.

Chloroplast and mitochondrial genome components

The chloroplast and mitochondrial genome components and their gene annotation were imported from their respective ENA entries, KC912694 and AP008982.


IWGSC gene predictions on the Chromosome Survey Sequence (CSS), PGSB/MIPS version 2.2 [1]

Gene models were derived from the spliced alignment of publicly available wheat fl-cDNAs and the protein sequences of related grass species: barley, Brachypodium, rice and Sorghum. A large RNA-Seq dataset, covering different tissues and different developmental stages, was used to identify wheat-specific genes and additional splice variants. Redundant transcript structures from these different sources were merged.

A total of 99,386 protein-coding genes were predicted, with 193,667 transcripts and splice variants. To simplify display in the genome browser, splice variants are shown on a separate track, which is off by default.

IWGSC gene predictions on chromosome 3B, GDEC/INRA version 1.0 [3]

A total of 5,326 protein-coding genes and 1,938 pseudogenes were annotated on chromosome 3B by the GDEC group at INRA, together with the transposable elements, which make up about 85% of the sequence. An additional 251 gene models and 188 pseudogenes were annotated on unanchored 3B scaffolds.

The GDEC gene set replaces the PGSB/MIPS gene set for the purpose of functional annotation and comparative genomics within Ensembl Plants. However, the PGSB/MIPS genes have been projected from the CSS onto the chromosome 3B assembly, and are shown on a separate track.

Chloroplast and mitochondrial genes

The chloroplast and mitochondrial gene annotations were imported from their respective ENA entries, KC912694 and AP008982.

Triticeae-CAP predicted transcripts set - Krasileva et al. [4]

Predicted transcripts have been inferred from Exonerate alignments of wheat coding sequences (CDS) from two sets of transcripts: Triticum turgidum assembled RNA-Seq data (Krasileva et al., Genome Biology 2013, 14:R66, Supplemental dataset 7) and a collection of publicly available wheat transcripts filtered to exclude pseudogenes, sequences shorter than 90 bp, and ORFs similar to those present in the T. turgidum set. The program findorf was used to predict the CDS within these transcripts as described in Krasileva et al. [4]. See the Triticeae-CAP project page for more information.

Repeat feature and non-coding RNA annotation

Repbase repeats as well as Triticeae repeats from TREP were aligned to the T. aestivum genome using RepeatMasker as part of our standard repeat feature annotation pipeline.

Non-coding RNA genes have been annotated using tRNAscan-SE (Lowe and Eddy 1997), Rfam (Griffiths-Jones et al. 2005), and RNAmmer (Lagesen et al. 2007) as part of our standard non-coding RNA annotation pipeline.

Sequence alignments

Transcriptome mappings

Wheat RNA-Seq, EST, and UniGene datasets have been aligned to the Triticum aestivum genome.

Analysis of the bread wheat genome using comparative whole genome shotgun sequencing - Brenchley et al. [6]

The wheat genome assemblies previously generated by Brenchley et al. (PMID:23192148) have also been aligned to the survey sequence, Brachypodium, barley and the wild wheat progenitors (Triticum urartu and Aegilops tauschii). Homoeologous variants inferred between the three wheat genomes (A, B, and D) are displayed in the context of the gene models of these five genomes.

Sequences of diploid progenitor and ancestral species permitted homoeologous variants to be classified into two groups: (1) SNPs that differ between the A and D genomes (where the B genome is unknown), and (2) SNPs that are the same between the A and D genomes, but differ in B.
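The two-group classification can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each site is represented by one allele per subgenome, an unknown B allele is modelled as `None`, and the function name and group labels are hypothetical rather than taken from Brenchley et al.

```python
from typing import Optional


def classify_homoeologous_snp(a: str, d: str, b: Optional[str]) -> Optional[str]:
    """Assign a site to group 1 or group 2, or None if it fits neither.

    a, d: alleles in the A and D genomes.
    b: allele in the B genome, or None when it is unknown (assumed encoding).
    """
    if b is None and a != d:
        return "group1_A_vs_D"       # A and D differ, B unknown
    if b is not None and a == d and b != a:
        return "group2_B_differs"    # A and D agree, B differs
    return None                      # not covered by the two groups


examples = {
    "site1": classify_homoeologous_snp("A", "G", None),
    "site2": classify_homoeologous_snp("C", "C", "T"),
    "site3": classify_homoeologous_snp("C", "C", "C"),
}
```

Sites that match neither description (for example, identical alleles in all three genomes) are left unclassified in this sketch.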

The wheat gene alignments and the projected wheat SNPs are available on the Location view of the Triticum aestivum, Brachypodium distachyon and Hordeum vulgare genomes, as additional tracks under the "Wheat SNPs and alignments" section of the "Configure this page" menu.

Transcriptome assembly in diploid einkorn wheat Triticum monococcum - Fox et al. [9]

Genome-wide transcriptomes of two Triticum monococcum subspecies were constructed, the wild winter wheat T. monococcum ssp. aegilopoides (accession G3116) and the domesticated spring wheat T. monococcum ssp. monococcum (accession DV92) by generating de novo assemblies of RNA-Seq data derived from both etiolated and green seedlings. Assembled data is available from the Jaiswal lab and raw reads are available from INSDC projects PRJNA203221 and PRJNA195398.

The de novo transcriptome assemblies of DV92 and G3116 represent 120,911 and 117,969 transcripts, respectively. They were mapped to the bread wheat, barley and Triticum urartu genomes using STAR.


Data from CerealsDB [10]

About 900,000 SNP markers provided by CerealsDB, from the University of Bristol, were mapped to the IWGSC Chromosome Survey Sequence using Exonerate with the ungapped model. Only alignments with 100% coverage and 100% identity were kept, and marker sequences mapping to more than 3 loci were discarded. As a result, ~600,000 SNP markers were successfully mapped to the survey sequence, giving ~725,000 non-redundant SNP loci.
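The filtering criteria described above can be sketched in Python. This is not the actual mapping pipeline: the `Hit` record, its fields, and the order of the two filters (perfect matches first, then the locus count) are assumptions made for illustration, and the sketch presumes Exonerate output has already been parsed into such records.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

MAX_LOCI = 3  # markers mapping to more than 3 loci are discarded


class Hit(NamedTuple):
    # Hypothetical parsed alignment record.
    marker: str
    locus: str        # e.g. "scaffold_name:position"
    coverage: float   # percent of the marker sequence aligned
    identity: float   # percent identity of the alignment


def filter_hits(hits: Iterable[Hit]) -> Dict[str, List[str]]:
    """Keep perfect hits only, then drop markers with too many distinct loci."""
    perfect = defaultdict(set)
    for h in hits:
        if h.coverage == 100.0 and h.identity == 100.0:
            perfect[h.marker].add(h.locus)
    return {m: sorted(loci) for m, loci in perfect.items() if len(loci) <= MAX_LOCI}


# Made-up hits for three markers.
hits = [
    Hit("snp1", "s1:10", 100.0, 100.0),
    Hit("snp1", "s2:55", 100.0, 100.0),
    Hit("snp2", "s3:7", 100.0, 98.5),   # imperfect identity, rejected
] + [Hit("snp3", f"s{i}:1", 100.0, 100.0) for i in range(4, 8)]  # 4 loci, rejected
mapped = filter_hits(hits)
```

Because one marker can map to up to three loci, the number of loci can exceed the number of mapped markers, which is consistent with the ~600,000 markers and ~725,000 loci reported above.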

These SNPs can be part of the following platforms:

  • The Axiom 820K SNP Array, which contains ~820,000 SNP markers, of which ~547,000 have been mapped.
  • The iSelect 80K Array [11], which contains ~81,000 SNP markers, of which ~43,500 have been mapped.
  • The KASP probeset [12], which contains ~9,000 markers, of which ~3,284 have been mapped.

Note that a SNP marker can be part of more than one platform.

Data from the Wheat HapMap project [13]

The data were generated by re-sequencing 62 diverse wheat lines using whole exome capture (WEC) and genotyping-by-sequencing (GBS) approaches. 1.57 million SNPs and 161,719 small indels, distributed across all 21 chromosomes, were identified.

Inter-homoeologous variants

Whenever there is a 1-to-1 homoeology relationship between genes on the different bread wheat component genomes, we report sequence differences between them as a distinct variation dataset. These variant entities are called inter-homoeologous variants. They have been computed by first using the Compara pipelines to detect protein orthology, and then parsing the supporting whole genome alignments to find sequence differences at the nucleotide level.

Over 10 million variant features (insertions, deletions and substitutions) have been generated through this approach.
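The last step of this approach, finding nucleotide-level differences between aligned homoeologous genes, might look like the following Python sketch. It assumes a pairwise alignment is already available as two gapped strings of equal length; the function name and tuple layout are hypothetical, and this is not the Compara pipeline itself.

```python
from typing import List, Tuple

Diff = Tuple[int, str, str, str]  # (ref position, kind, ref allele, alt allele)


def aligned_differences(ref: str, alt: str) -> List[Diff]:
    """List per-column differences between two gapped, aligned sequences.

    Positions are 0-based offsets in the ungapped reference sequence.
    For an insertion, the position is that of the next reference base.
    Adjacent gap columns are reported one by one, not merged into one indel.
    """
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have equal length")
    diffs: List[Diff] = []
    pos = 0
    for r, a in zip(ref, alt):
        if r == "-" and a == "-":
            continue  # column is a gap in both sequences, nothing to compare
        if r == "-":
            diffs.append((pos, "insertion", "-", a))
        elif a == "-":
            diffs.append((pos, "deletion", r, "-"))
        elif r != a:
            diffs.append((pos, "substitution", r, a))
        if r != "-":
            pos += 1  # advance only when a reference base was consumed
    return diffs


# Toy alignment of two homoeologous gene fragments (made-up sequences).
variants = aligned_differences("ACG-TA", "ATGCT-")
```

In this toy example the sketch reports one substitution, one insertion and one deletion, matching the three variant classes named above.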

SIFT scores

SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. SIFT can be applied to naturally occurring nonsynonymous polymorphisms and laboratory-induced missense mutations.

SIFT scores and predictions (whether a variant is 'tolerated' or 'deleterious') have been calculated for all missense variants across all bread wheat variation datasets. See the SIFT predictions for missense variants present in the psbO gene as an example.

We used all protein sequences available from UniRef90 (release 2015_04) as the protein database.
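As a minimal illustration of how a SIFT score relates to its qualitative prediction, the Python sketch below applies the commonly used cutoff of 0.05 on the 0 to 1 score scale. The cutoff value, the strict comparison at the boundary, and the function name are assumptions for illustration; this page does not state the threshold used for the wheat datasets.

```python
def sift_prediction(score: float, cutoff: float = 0.05) -> str:
    """Map a SIFT score (0 to 1) to a qualitative call.

    Lower scores indicate substitutions that are less likely to be tolerated.
    The default cutoff of 0.05 is an assumed, commonly used value.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores lie between 0 and 1")
    return "deleterious" if score < cutoff else "tolerated"


# Made-up scores for two hypothetical missense variants.
calls = [sift_prediction(0.01), sift_prediction(0.40)]
```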

Wheat sequence search v2.0 online

Full sequence-based searching of the wheat genome is now available within the standard Ensembl Genomes sequence search facilities (ENA search and BLAST). The previous custom wheat-only search has now been discontinued.



  1. A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome.
    International Wheat Genome Sequencing Consortium (IWGSC). 2014. Science. 345:1251788.
  2. A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome.
    Chapman JA, Mascher M, Buluç A, Barry K, Georganas E, Session A, Strnadova V, Jenkins J, Sehgal S, Oliker L et al. 2015. Genome Biol. 16:26.
  3. Structural and functional partitioning of bread wheat chromosome 3B.
    Choulet F, Alberti A, Theil S, Glover N, Barbe V, Daron J, Pingault L, Sourdille P, Couloux A, Paux E et al. 2014. Science. 345:1249721.
  4. Separating homeologs by phasing in the tetraploid wheat transcriptome.
    Krasileva KV, Buffalo V, Bailey P, Pearce S, Ayling S, Tabbita F, Soria M, Wang S, International Wheat Genome Sequencing Consortium, Akhunov E et al. 2013. Genome Biol. 14:R66.
  5. Homoeolog-specific transcriptional bias in allopolyploid wheat.
    Akhunova AR, Matniyazov RT, Liang H, Akhunov ED. 2010. BMC Genomics. 11:505.
  6. Analysis of the bread wheat genome using whole-genome shotgun sequencing.
    Brenchley R, Spannagl M, Pfeifer M, Barker GL, D'Amore R, Allen AM, McKenzie N, Kramer M, Kerhornou A, Bolser D et al. 2012. Nature. 491:705-710.
  7. Genome interplay in the grain transcriptome of hexaploid bread wheat.
    Pfeifer M, Kugler KG, Sandve SR, Zhan B, Rudi H, Hvidsten TR, International Wheat Genome Sequencing Consortium, Mayer KF, Olsen OA. 2014. Science. 345:1250091.
  8. TriFLDB: a database of clustered full-length coding sequences from Triticeae with applications to comparative grass genomics.
    Mochida K, Yoshida T, Sakurai T, Ogihara Y, Shinozaki K. 2009. Plant Physiol. 150:1135-1146.
  9. De Novo Transcriptome Assembly and Analyses of Gene Expression during Photomorphogenesis in Diploid Wheat Triticum monococcum.
    Fox SE, Geniza M, Hanumappa M, Naithani S, Sullivan C, Preece J, Tiwari VK, Elser J, Leonard JM, Sage A et al. 2014. PLoS ONE. 9:e96855.
  10. CerealsDB 2.0: an integrated resource for plant breeders and scientists.
    Wilkinson PA, Winfield MO, Barker GL, Allen AM, Burridge A, Coghill JA, Edwards KJ. 2012. BMC Bioinformatics. 13:219.
  11. Characterization of polyploid wheat genomic diversity using a high-density 90000 single nucleotide polymorphism array.
    Wang S, Wong D, Forrest K, Allen A, Chao S, Huang BE, Maccaferri M, Salvi S, Milner SG, Cattivelli L et al. 2014. Plant Biotechnol. J.
  12. Transcript-specific, single-nucleotide polymorphism discovery and linkage analysis in hexaploid bread wheat (Triticum aestivum L.).
    Allen AM, Barker GL, Berry ST, Coghill JA, Gwilliam R, Kirby S, Robinson P, Brenchley RC, D'Amore R, McKenzie N et al. 2011. Plant Biotechnol. J. 9:1086-1099.
  13. A haplotype map of allohexaploid wheat reveals distinct patterns of selection on homoeologous genomes.
    Jordan KW, Wang S, Lun Y, Gardiner LJ, MacLachlan R, Hucl P, Wiebe K, Wong D, Forrest KL et al. 2015. Genome Biol. 16:48.

More information

General information about this species can be found in Wikipedia.



Assembly: IWGSC1+popseq, Nov 2014
Database version: 84.2
Base Pairs: 6,591,259,146
Golden Path Length: 6,483,288,884
Data source: IWGSC
Genebuild version: 2.2
Genebuild method: Imported from IWGSC

Gene counts

Coding genes

Genes and/or transcripts that contain an open reading frame (ORF).

Non coding genes: 9,993
    Small non coding genes

Small non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as small non coding genes: miRNA, miscRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic form of these biotypes. The majority of the small non coding genes in Ensembl are annotated automatically by our ncRNA pipeline. Please note that tRNAs are annotated separately using tRNAscan. tRNAs are included as 'simple features', not genes, because they are not annotated using aligned sequence evidence.

    Long non coding genes

Long non coding genes are usually greater than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as long non coding genes: 3prime_overlapping_ncrna, ambiguous_orf, antisense, antisense_RNA, lincRNA, ncrna_host, non_coding, non_stop_decay, processed_transcript, retained_intron, sense_intronic, sense_overlapping. The majority of the long non coding genes in Ensembl are annotated manually by HAVANA.

    Misc non coding genes: 8

A pseudogene shares an evolutionary history with a functional protein-coding gene but it has been mutated through evolution to contain frameshift and/or stop codon(s) that disrupt the open reading frame.

Gene transcripts: 112,496

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.

Feature counts

T. aestivum RNA-Seq alignments: 39,237
T. turgidum RNA-Seq alignments: 83,160
Short Variants: 9,433,868

Coordinate Systems

23 sequences
scaffold: 1,610,065 sequences