Category: What’s New

Major update for the NCBI RefSeq mouse GRCm38.p6 annotation

We have updated our annotation for the mouse reference genome, GRCm38.p6. It includes:

  • Markup for RefSeq Select, which identifies one representative transcript and protein for every protein-coding gene. Find features with the ‘tag=RefSeq Select’ attribute in GFF3 for those analyses where you need just a single transcript or protein for each coding gene. You can also find these RefSeqs in Entrez using the query ‘refseq_select[filter]’.
  • Annotation updates made in the last year for over 2000 genes, including over 4000 new or revised curated transcripts. This includes targeted curation to ensure we are representing well-expressed and conserved transcripts for inclusion in RefSeq Select.
  • Annotation of over 2300 regulatory and other functional element features from over 900 biological regions. These are now identified with the source “RefSeqFE” in GFF3 column 2 for easy parsing.
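
The two GFF3 markers described above (the ‘tag=RefSeq Select’ attribute and the “RefSeqFE” source in column 2) can be picked out with a few lines of Python. This is a minimal sketch, not an NCBI tool: the function names and the sample records are invented for illustration, and it assumes standard nine-column, tab-separated GFF3 with `key=value` attributes separated by semicolons.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse GFF3 column 9 into a dict mapping each key to a list of values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" not in field:
            continue
        key, _, value = field.partition("=")
        # GFF3 allows comma-separated multiple values and percent-encoding.
        attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs


def classify_gff3_lines(lines):
    """Return (IDs tagged RefSeq Select, IDs whose source is RefSeqFE)."""
    select_ids, functional_ids = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID", ["?"])[0]
        if "RefSeq Select" in attrs.get("tag", []):
            select_ids.append(feature_id)
        if cols[1] == "RefSeqFE":
            functional_ids.append(feature_id)
    return select_ids, functional_ids


# Invented sample records; real files use NCBI accessions and identifiers.
SAMPLE = [
    "##gff-version 3",
    "chrDemo\tBestRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna-demo1;Parent=gene-demo;tag=RefSeq Select",
    "chrDemo\tBestRefSeq\tmRNA\t100\t800\t.\t+\t.\tID=rna-demo2;Parent=gene-demo",
    "chrDemo\tRefSeqFE\tenhancer\t2000\t2500\t.\t.\t.\tID=id-demo3",
]

select_ids, functional_ids = classify_gff3_lines(SAMPLE)
print(select_ids)      # features tagged RefSeq Select
print(functional_ids)  # features with source RefSeqFE
```

To run this on a downloaded annotation file, pass an open file handle instead of `SAMPLE`; the function only needs an iterable of lines.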

When citing, please refer to this annotation as NCBI Mus musculus Annotation Release 108.20200622. You can find the data in:

This is our last update before upgrading to the new major assembly version just released by the Genome Reference Consortium, GRCm39. We expect to be cranking up our compute farm in the next few weeks to produce a full annotation based on our latest curation and extensive short (Illumina) and long (PacBio IsoSeq and nanopore) RNA-seq data, which should be released later this summer. Stay tuned!

Updated protein family models used by PGAP available for download

Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
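
As a rough illustration of such a search, the sketch below assembles an `hmmsearch` command line (`hmmsearch` is the HMMER program that searches profile HMMs against a protein sequence file). The file names are hypothetical placeholders, and the sketch assumes the downloaded models have been combined into a single HMM file; it only builds and prints the command rather than running it.

```python
import shlex


def build_hmmsearch_command(hmm_file, protein_fasta, table_out):
    """Build an hmmsearch command that writes a per-target hit table.

    --tblout is HMMER's option for a tabular summary of hits.
    """
    return ["hmmsearch", "--tblout", table_out, hmm_file, protein_fasta]


# Hypothetical file names, chosen only for this example.
cmd = build_hmmsearch_command("pgap_models.hmm", "my_proteins.faa", "hits.tbl")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where HMMER is installed.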

The 3.0 release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. You can access a table of these product names from the release directory.

Figure 1. The evidence for name assignment for type III secretion system (T3SS) translocon subunit SctB (NF038055) showing the protein matches. Species-specific names for this highly variable component of T3SS include YopD, EspB, IpaC, SipC, etc. Instead, we used the standard moniker for core genes of T3SS, Sct, Secretion and cellular translocation (PMID 26520801, PMID 9618447), providing a unified nomenclature for this secretion system.

GenBank release 238 is available

GenBank release 238.0 (6/19/2020) is now available on the NCBI FTP site. This release has 8.93 trillion bases and 2 billion records.

The current release has 217,122,233 traditional records containing 427,823,258,901 base pairs of sequence data. There are also 1,302,852,615 WGS records containing 8,114,046,262,158 base pairs of sequence data, 409,725,050 bulk-oriented TSA records containing 359,947,709,062 base pairs of sequence data, and 75,063,181 bulk-oriented TLS records containing 27,500,635,128 base pairs of sequence data.
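
The headline totals follow directly from the per-division figures. As a quick check, this short sketch sums the four divisions quoted above; the dictionary layout is just a convenient arrangement for the example.

```python
# (records, base pairs) per division, copied from the release figures above.
divisions = {
    "traditional": (217_122_233, 427_823_258_901),
    "WGS": (1_302_852_615, 8_114_046_262_158),
    "TSA": (409_725_050, 359_947_709_062),
    "TLS": (75_063_181, 27_500_635_128),
}

total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

print(f"{total_records:,} records")  # 2,004,763,079 records (about 2 billion)
print(f"{total_bases:,} bases")      # 8,929,317,865,249 bases (about 8.93 trillion)
```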


Non-human variation data from EVA now available in the Genome Data Viewer

You can now view SNP variation data for many commonly studied animals and plants (including mouse, cow, Drosophila, Arabidopsis, maize, cabbage, and many more) in the Genome Data Viewer (GDV) and other graphical sequence viewers. This data is streamed from the European Variation Archive (EVA) at the European Bioinformatics Institute (EBI).

On any NCBI graphical sequence view, you can use the Configure Tracks menu and the Track Configuration Panel to add the track for the EVA RefSNP data. This track is available through the left-hand tab for Remote Variation Data (Figure 1). The EVA RefSNP track displayed on the pig (Sus scrofa) chromosome 12 graphical view is shown in Figure 2.

Figure 1. The Track Configuration panel showing the Remote Variation Data tab and the EVA RefSNP Release 1 track. Select the track checkbox and click Configure to load the track.

Figure 2. The graphical sequence viewer showing the region of the growth hormone gene on pig chromosome 12 (NC_010454.4) with the EVA RefSNP Release 1 track at the bottom. The track header has an (R) and a green highlight to indicate that it is remote data streamed from an external website. NCBI is not responsible for the content or availability of these data.

The EVA SNP FTP site has more information about the EVA SNP data release.

Please contact us using the Feedback link on the graphical view to let us know what you think and how we can further improve your experience with the NCBI genome browsers and graphical sequence viewers.

RefSeq release 201 is public

RefSeq release 201 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of July 6, 2020, and contains 246,016,651 records, including 178,304,046 proteins, 32,462,009 RNAs, and sequences from 103,293 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
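
For E-utilities access, a query can be expressed as an ESearch URL. The sketch below only constructs the URL and does not send a request. The helper name, the choice of the `nuccore` database, and the organism restriction are assumptions made for illustration; the ‘refseq_select[filter]’ term is the Entrez query mentioned in the mouse annotation post above.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for the given Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example: RefSeq Select records for mouse in the nucleotide database.
url = build_esearch_url("nuccore", "refseq_select[filter] AND Mus musculus[Organism]")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) returns matching record identifiers; keep request rates within NCBI's usage guidelines.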

Updated human genome Annotation Release 109.20200522
Updated Annotation Release 109.20200522 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report for 109.20200522 is available here. The annotation products are available in the sequence databases and on the FTP site.

Updated mouse genome Annotation Release 108.20200622
Updated Annotation Release 108.20200622 is an update of NCBI Mus musculus Annotation Release 108. The annotation report for 108.20200622 is available here. The annotation products are available in the sequence databases and on the FTP site.

This update precedes the expected release of a full assembly update for the mouse GRCm38.p6 reference assembly by the GRC in 2020. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly later this year, for either RefSeq FTP Release 202 or 203.

NIH’s COVID-focused Sequence Read Archive (SRA) datasets are now open access on AWS!

While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.

Researchers can now access more than 13K SRA runs that include Coronaviridae (CoV) content identified by a kmer-based approach to organismal content identification using the SRA Taxonomy Analysis Tool.


dbVar clinical and common structural variants track hub now available

dbVar, NCBI’s database of large-scale genetic variants, has a new track hub for viewing and downloading structural variation (SV) data in popular genome browsers. Initial tracks include Clinical and Common SV datasets. dbVar’s new track hub can be viewed using NCBI’s Genome Data Viewer through the “User Data and Track Hubs” feature (Figure 1) and other genome browsers by selecting “dbVar Hub” from the list of public tracks or by specifying the following URL.

https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/dbvarhub/hub.txt

Figure 1. Loading the dbVar track hub in the Genome Data Viewer. The Track Hubs feature in the left-hand column of the browser allows you to add the track by searching for it or by entering the direct URL. You can select the specific tracks to load, for example “NCBI curated common SVs: All populations”, from the Configure Track Hubs dialog.

The BLAST Docker and databases are now ready to use on Google and Amazon clouds

As announced in a previous post, we offer a Docker version of NCBI BLAST that you can use locally or on the Google Cloud where we have pre-loaded BLAST databases.  We are happy to announce that the same functionality is also available on the Amazon Cloud.  In addition, we now offer 23 different BLAST databases at each cloud platform.


We want to hear from you about changes to NIH’s Sequence Read Archive data format and storage

NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next-generation sequencing data from human, non-human and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.

Figure 1. SRA data has grown exponentially over the last decade.

NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. Considerations include a model wherein normalized working files without Base Quality Scores (BQS) are readily available through cloud platforms and NCBI FTP sites, and larger source files and normalized files with base quality scores will be distributed on cloud platforms based on prevalent use cases and usage demands. Further details regarding data formats are available here.

It is critical that you, as an SRA user, participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments impact your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.


Improved access to SARS-CoV-2 data

NCBI Datasets has a simple, new way to get Coronaviridae data, including from SARS-CoV-2 (Figure 1). The data package includes genomic, protein and CDS sequences, annotation and a comprehensive data report for all complete genomes. You can also target your search to major taxonomic ranks within Coronaviridae.

Figure 1 – SARS-CoV-2 page within NCBI Datasets showing statistics as of June 16, 2020.

Interested in a specific protein? The SARS-CoV-2 protein page allows you to choose a protein and download the corresponding sequences, annotation and representative structures from all annotated genomes (Figure 2).

Figure 2 – SARS-CoV-2 protein page within NCBI Datasets showing annotations on the SARS-CoV-2 reference genome.

Looking for programmatic access? NCBI Datasets offers the same Coronaviridae genomic data and SARS-CoV-2 protein data through a command-line tool and a RESTful API. These tools support additional filtering, including the ability to download only those genomes released after a date you specify.

We appreciate your feedback. Try NCBI Datasets and let us know what you think!