This month marks a major event in human genome research: the release of a new assembly of the genome, GRCh38. It has been over four years since the last major release (GRCh37 in March 2009), and we will explore several aspects of the new assembly in a series of blog posts over the coming weeks. In this initial post, we give an overview of the data flow so that you understand how NCBI received the data, where the data reside at NCBI, and what genome annotations you can expect from NCBI in the near future.
GRC: the data source
If you’re interested in the human genome and are not already familiar with the Genome Reference Consortium (GRC), it’s worth your time to visit their site, read their blog and become familiar with the genome assemblies they provide. Since the release of GRCh37 in 2009, the GRC has been the primary data source for the human genome assembly, which now includes not only sequences for the 24 chromosomes but also alternate sequences for several chromosomal regions that are too variable to be represented by a single sequence. When an assembly is ready, the GRC submits the unannotated sequences to GenBank.
The genome in GenBank: the raw sequences
The GRC recently submitted the data for GRCh38 to GenBank, and the assembly is available with accession GCA_000001405.15. These data are also available by FTP at ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38. As with any other GenBank sequence, only the submitter (the GRC in this case) can update these records, so the GenBank records will remain unannotated. Any group can download these sequences and annotate them as they wish.
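An assembly accession such as GCA_000001405.15 can be read as a prefix, a numeric identifier and a version, and splitting it into those parts makes it easy to tell two versions of the same assembly apart. The following is a minimal sketch rather than an NCBI-provided tool: the function name `parse_assembly_accession`, the `AssemblyAccession` type and the regular expression are assumptions made for this example, based on the format of the accession quoted above.

```python
import re
from typing import NamedTuple


class AssemblyAccession(NamedTuple):
    """Parts of an assembly accession (hypothetical helper type)."""
    prefix: str   # "GCA" for GenBank assemblies, "GCF" for RefSeq assemblies
    number: str   # zero-padded numeric identifier, kept as text
    version: int  # integer after the dot


# Assumed pattern: prefix, underscore, nine digits, dot, version number.
_ACCESSION_RE = re.compile(r"(GC[AF])_(\d{9})\.(\d+)")


def parse_assembly_accession(accession: str) -> AssemblyAccession:
    """Split an accession like 'GCA_000001405.15' into its parts."""
    match = _ACCESSION_RE.fullmatch(accession.strip())
    if match is None:
        raise ValueError(f"not a recognized assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(prefix, number, int(version))


if __name__ == "__main__":
    grch38 = parse_assembly_accession("GCA_000001405.15")
    print(grch38.prefix, grch38.number, grch38.version)
```

Keeping the numeric identifier as text preserves its leading zeros, while converting the version to an integer allows versions of the same assembly to be compared directly.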
The genome in RefSeq: NCBI’s annotation
Now that the GRC sequences are in GenBank, NCBI will run them through our eukaryotic annotation pipeline, which will produce a set of Reference Sequences (RefSeqs) that contain the resulting annotations. The chromosome sequences will continue to have accessions NC_000001-NC_000024, but their version numbers will increase because GRCh38 changes the sequence of every chromosome. The annotation process generally takes about two weeks; once it is complete, we will incorporate these sequences into various analysis and display tools, such as genomic BLAST and genome viewers. At the end of this process, each chromosome will be represented by both an unannotated sequence in GenBank (the original GRC data) and an annotated sequence in the RefSeq collection.
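Because the RefSeq chromosome accessions stay fixed while only the version changes, a script that tracks chromosomes across assemblies needs to separate the two parts. The sketch below illustrates one way to do that, under stated assumptions: the helper names are hypothetical, the mapping of X and Y to NC_000023 and NC_000024 is assumed from the NC_000001-NC_000024 range given above, and the version strings in the usage lines are illustrative placeholders only.

```python
def refseq_chromosome_accessions() -> dict:
    """Map chromosome names to unversioned RefSeq accessions.

    Assumes chromosomes 1-22 map to NC_000001..NC_000022,
    X to NC_000023 and Y to NC_000024.
    """
    names = [str(n) for n in range(1, 23)] + ["X", "Y"]
    return {name: f"NC_{i:06d}" for i, name in enumerate(names, start=1)}


def split_version(accession_version: str) -> tuple:
    """Split 'NC_000001.11' into ('NC_000001', 11)."""
    accession, _, version = accession_version.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {accession_version!r}")
    return accession, int(version)


def is_newer_version(old: str, new: str) -> bool:
    """Return True if `new` is a later version of the same accession as `old`."""
    old_acc, old_ver = split_version(old)
    new_acc, new_ver = split_version(new)
    if old_acc != new_acc:
        raise ValueError(f"different accessions: {old_acc} vs {new_acc}")
    return new_ver > old_ver


if __name__ == "__main__":
    accessions = refseq_chromosome_accessions()
    print(accessions["1"], accessions["X"], accessions["Y"])
    # Illustrative version strings, not taken from this post.
    print(is_newer_version("NC_000001.10", "NC_000001.11"))
```

Comparing versions as integers rather than strings avoids mistakes such as treating version 9 as later than version 10.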
Please check back frequently for updates on our news and social media sites (NCBI Twitter Channel, NCBI Facebook Page, NCBI Announce RSS Feed, NCBI Announce Email ListServ) as this process unfolds. In future posts, we’ll cover additional topics such as remapping existing annotations from GRCh37 to GRCh38 and particular loci that have changed significantly in the new assembly.
For more information