We have updated our annotation for the mouse reference genome, GRCm38.p6. It includes:
Markup for RefSeq Select, which identifies one representative transcript and protein for every protein-coding gene. Find features with the ‘tag=RefSeq Select’ attribute in GFF3 for those analyses where you need just a single transcript or protein for each coding gene. You can also find these RefSeqs in Entrez using the query ‘refseq_select[filter]’.
Annotation updates made in the last year for over 2000 genes, including over 4000 new or revised curated transcripts. This includes targeted curation to ensure we are representing well-expressed and conserved transcripts for inclusion in RefSeq Select.
Annotation of over 2300 regulatory and other functional element features from over 900 biological regions. These are now identified with the source “RefSeqFE” in GFF3 column 2 for easy parsing.
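As a rough illustration of the two GFF3 conventions described above, here is a minimal Python sketch that picks out features carrying the ‘tag=RefSeq Select’ attribute and features whose column 2 source is “RefSeqFE”. The function names are hypothetical, and the sketch assumes standard tab-delimited GFF3 with nine columns and comma-separated multi-valued attributes; it is not NCBI's own tooling.

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict of key -> list of values."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        # GFF3 allows several comma-separated values for one key.
        attrs[key] = value.split(",")
    return attrs


def classify_gff3_line(line):
    """Return (is_refseq_select, is_refseqfe) for one GFF3 line."""
    if line.startswith("#") or not line.strip():
        return (False, False)  # directive, comment, or blank line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return (False, False)  # not a well-formed feature line
    attrs = parse_attributes(cols[8])
    is_select = "RefSeq Select" in attrs.get("tag", [])
    is_functional_element = cols[1] == "RefSeqFE"  # column 2 is the source
    return (is_select, is_functional_element)


def filter_gff3(lines):
    """Collect RefSeq Select features and RefSeqFE features separately."""
    select, functional = [], []
    for line in lines:
        is_select, is_fe = classify_gff3_line(line)
        if is_select:
            select.append(line)
        if is_fe:
            functional.append(line)
    return select, functional
```

To use it on a downloaded annotation file, pass an open file handle, for example `filter_gff3(open("annotation.gff"))`, where the file name is a placeholder.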
When citing, please refer to this annotation as NCBI Mus musculus Annotation Release 108.20200622. You can find the data in:
This is our last update before upgrading to the new major assembly version just released by the Genome Reference Consortium, GRCm39. We expect to be cranking up our compute farm in the next few weeks to produce a full annotation based on our latest curation and extensive short (Illumina) and long (PacBio IsoSeq and nanopore) RNA-seq data, which should be released later this summer. Stay tuned!
Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
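A search like this could be scripted along the following lines. This is only a sketch: the file names are placeholders, the helper names are hypothetical, and it assumes HMMER's `hmmsearch` is installed and that you want its tabular per-sequence output (`--tblout`). The parser reads only the first few fixed columns of that table.

```python
import subprocess


def build_hmmsearch_command(hmm_path, fasta_path, tblout_path):
    """Build an hmmsearch command line: models first, then the protein FASTA."""
    return ["hmmsearch", "--tblout", tblout_path, hmm_path, fasta_path]


def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of hit dictionaries."""
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and footer comments
        # 18 whitespace-delimited columns, then a free-text description.
        fields = line.split(None, 18)
        if len(fields) < 18:
            continue
        hits.append({
            "protein": fields[0],          # target sequence name
            "model_name": fields[2],       # query HMM name
            "model_accession": fields[3],  # query HMM accession
            "evalue": float(fields[4]),    # full-sequence E-value
            "score": float(fields[5]),     # full-sequence bit score
        })
    return hits


def run_search(hmm_path, fasta_path, tblout_path):
    """Run hmmsearch and return the parsed hits (requires HMMER on PATH)."""
    command = build_hmmsearch_command(hmm_path, fasta_path, tblout_path)
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    with open(tblout_path) as handle:
        return parse_tblout(handle)
```

Sorting the returned hits by `evalue` and keeping the best model per protein is one simple way to pick a candidate function for each sequence; PGAP itself applies its own evidence rules.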
The 3.0 release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. You can access a table of these product names from the release directory.
Figure 1. The evidence for name assignment for type III secretion system (T3SS) translocon subunit SctB (NF038055) showing the protein matches. Species-specific names for this highly variable component of T3SS include YopD, EspB, IpaC, SipC, etc. Instead, we used the standard moniker for core genes of T3SS, Sct (Secretion and cellular translocation; PMID 26520801, PMID 9618447), providing a unified nomenclature for this secretion system.
GenBank release 238.0 (6/19/2020) is now available on the NCBI FTP site. This release has 8.93 trillion bases and 2 billion records.
The current release has 217,122,233 traditional records containing 427,823,258,901 base pairs of sequence data. There are also 1,302,852,615 WGS records containing 8,114,046,262,158 base pairs of sequence data, 409,725,050 bulk-oriented TSA records containing 359,947,709,062 base pairs of sequence data, and 75,063,181 bulk-oriented TLS records containing 27,500,635,128 base pairs of sequence data.
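The headline totals follow directly from the per-division figures above. The short Python check below, with variable names of our own choosing, sums the four divisions and rounds them the same way.

```python
# Per-division counts quoted for GenBank release 238.0: (records, base pairs)
divisions = {
    "traditional": (217_122_233, 427_823_258_901),
    "WGS": (1_302_852_615, 8_114_046_262_158),
    "TSA": (409_725_050, 359_947_709_062),
    "TLS": (75_063_181, 27_500_635_128),
}

total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

# Express the sums in the units used in the announcement.
trillion_bases = round(total_bases / 1e12, 2)
billion_records = round(total_records / 1e9, 2)

print(f"{total_bases:,} bases (about {trillion_bases} trillion)")
print(f"{total_records:,} records (about {billion_records} billion)")
```

The sums come to 8,929,317,865,249 bases and 2,004,763,079 records, consistent with the rounded 8.93 trillion bases and 2 billion records.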
You can now view SNP variation data for many commonly studied animals and plants, including mouse, cow, Drosophila, Arabidopsis, maize, cabbage, and many more, in the Genome Data Viewer (GDV) and other graphical sequence viewers. These data are streamed from the European Variation Archive (EVA) at the European Bioinformatics Institute (EBI).
On any NCBI graphical sequence view, you can use the Configure Tracks menu and the Track Configuration Panel to add the track for the EVA RefSNP data. This track is available through the left-hand tab for Remote Variation Data (Figure 1). The EVA RefSNP track displayed on the pig (Sus scrofa) chromosome 12 graphical view is shown in Figure 2.
Figure 1. The Track Configuration panel showing the Remote Variation Data tab and the EVA RefSNP Release 1 track. Select the track checkbox and click Configure to load the track.
Figure 2. The graphical sequence viewer showing the region of the growth hormone gene on pig chromosome 12 (NC_010454.4) with the EVA RefSNP Release 1 track at the bottom. The track header has an (R) and a green highlight to indicate that it is remote data streamed from an external website. NCBI is not responsible for the content or availability of these data.
Please contact us using the Feedback link on the graphical view to let us know what you think and how we can further improve your experience with the NCBI genome browsers and graphical sequence viewers.
RefSeq release 201 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of July 6, 2020, and contains 246,016,651 records, including 178,304,046 proteins, 32,462,009 RNAs, and sequences from 103,293 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
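For programmatic access, an E-utilities search request can be assembled as in the Python sketch below. The function name, the choice of the `nuccore` database, and the example search term are illustrative assumptions; the code only builds the ESearch URL and does not contact NCBI.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=20):
    """Return an ESearch URL for the given Entrez query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-encodes the brackets and spaces in the query.
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example term (an assumption): mouse records carrying the RefSeq Select filter.
url = build_esearch_url("Mus musculus[Organism] AND refseq_select[filter]")
print(url)
```

Fetching the resulting URL returns an XML list of matching record identifiers, which can then be passed to other E-utilities calls such as EFetch.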
Updated human genome Annotation Release 109.20200522
Updated Annotation Release 109.20200522 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report for 109.20200522 is available here. The annotation products are available in the sequence databases and on the FTP site.
Updated mouse genome Annotation Release 108.20200622
Updated Annotation Release 108.20200622 is an update of NCBI Mus musculus Annotation Release 108. The annotation report for 108.20200622 is available here. The annotation products are available in the sequence databases and on the FTP site.
This update precedes the expected release of a full assembly update for the mouse GRCm38.p6 reference assembly by the GRC in 2020. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly later this year, for either RefSeq FTP Release 202 or 203.
While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.
Researchers can now access more than 13,000 SRA runs that include Coronaviridae (CoV) content, identified by a k-mer-based approach to organismal content identification using the SRA Taxonomy Analysis Tool.
dbVar, NCBI’s database of large-scale genetic variants, has a new track hub for viewing and downloading structural variation (SV) data in popular genome browsers. Initial tracks include Clinical and Common SV datasets. dbVar’s new track hub can be viewed using NCBI’s Genome Data Viewer through the “User Data and Track Hubs” feature (Figure 1) and other genome browsers by selecting “dbVar Hub” from the list of public tracks or by specifying the following URL.
As announced in a previous post, we offer a Docker version of NCBI BLAST that you can use locally or on the Google Cloud where we have pre-loaded BLAST databases. We are happy to announce that the same functionality is also available on the Amazon Cloud. In addition, we now offer 23 different BLAST databases at each cloud platform.