We are happy to announce an updated bacterial and archaeal representative genomes collection. The current collection contains a total of 15,507 assemblies selected from 236,000 prokaryotic RefSeq assemblies to represent their respective species. The collection has grown by five percent since August 2021. A total of 685 species are represented for the first time. In addition, 370 species are represented by a better assembly, and 84 species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment.
If you’re curious about genome annotation beyond the genes, then read on! We previously blogged about our RefSeq Functional Elements resource, which provides annotation of experimentally validated, non-genic functional elements in human and mouse. Now, to kick off 2022, we’re delighted to announce a new publication in the January issue of Genome Research:
Farrell CM, Goldfarb T, Rangwala SH, Astashyn A, Ermolaeva OD, Hem V, Katz KS, Kodali VK, Ludwig F, Wallin CL, Pruitt KD, Murphy TD. RefSeq Functional Elements as experimentally assayed nongenic reference standards and functional interactions in human and mouse. Genome Res. 2022 Jan;32(1):175-188. doi: 10.1101/gr.275819.121. Epub 2021 Dec 7. PMID: 34876495.
Figure 1. Workflow for production of the RefSeq Functional Elements dataset. Full cylinders represent databases, the half-cylinder represents the indicated data source, and rectangles represent actions. Further details can be found in the publication.
RefSeq Release 210 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of January 3, 2022, and contains 302,482,881 records, including 220,595,192 proteins, 42,453,222 transcripts, and sequences from 115,929 organisms. The release is provided in several directories as a complete dataset and also divided into logical groupings.
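For readers who want to reach RefSeq records programmatically, here is a minimal sketch of building an E-utilities ESearch request URL. It only constructs the URL and does not contact NCBI. The helper name `build_esearch_url`, the choice of the `protein` database, and the example search term are illustrative assumptions, not part of the release announcement.

```python
from urllib.parse import urlencode

# Public base URL for the Entrez programming utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for the given Entrez database and query term.

    Hypothetical helper for illustration; it builds the URL only.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Assumed example query: restrict a protein search to RefSeq records
# for one organism. Adjust the term to your own organism of interest.
url = build_esearch_url(
    "protein", "srcdb_refseq[PROP] AND Balaenoptera musculus[ORGN]"
)
print(url)
```

Fetching the URL with any HTTP client should return a JSON list of matching record identifiers. NCBI asks heavy users to supply an API key and to respect its request-rate limits.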
Introducing the NIH Comparative Genomics Resource (CGR)
NCBI is looking forward to seeing you in person at the International Plant and Animal Genome Conference (PAG XXIX), January 8-12, 2022, in San Diego, California. We’re especially excited to introduce our newest endeavor, the NLM initiative known as the NIH Comparative Genomics Resource (CGR), a platform we are developing to support comparative analyses of sequenced eukaryotic research organisms. Understanding and supporting the needs of researchers is fundamental to the development of CGR and critical to its future success in supporting a large and diverse collection.
Please join NCBI for the following events to learn more about CGR and how you can inform its development:
RefSeq Release 209 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of November 1, 2021, and contains 296,293,486 records, including 215,655,378 proteins, 41,751,205 RNAs, and sequences from 114,396 organisms. The release is provided in several directories as a complete dataset and also divided into logical groupings.
NCBI Gene has added Ensembl Rapid Releases to the calculation of matching annotations between NCBI RefSeq and Ensembl. This has resulted in the inclusion of over 60 additional assemblies for a total of 241 organisms represented in the set. Matches are made based on transcript and CDS comparisons, and Ensembl gene, transcript, and protein identifiers for annotations similar to the NCBI RefSeq annotations are reported in NCBI Gene and in the gene2ensembl file on the Gene FTP site. The Ensembl annotation is also available in the graphical view and in NCBI’s Genome Data Viewer to give you a side-by-side view of how the annotations compare. Check out blue whale E2F1 for an example.
Figure 1. Balaenoptera musculus E2F transcription factor 1 in Genome Data Viewer
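As a rough illustration of working with the gene2ensembl file mentioned above, the sketch below groups Ensembl gene identifiers by NCBI GeneID. It assumes the file is tab-delimited with a header line beginning `#tax_id` and containing columns named `GeneID` and `Ensembl_gene_identifier`; check the header of the copy you download. The function name and the sample rows are made-up placeholders, not real identifiers.

```python
import csv
from collections import defaultdict


def load_gene2ensembl(lines, tax_id=None):
    """Map each NCBI GeneID to the set of matching Ensembl gene identifiers.

    `lines` is any iterable of text lines (an open file or a list).
    Column names are read from the header line, assumed to start with '#'.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # "#tax_id" -> "tax_id"
    matches = defaultdict(set)
    for row in reader:
        record = dict(zip(header, row))
        if tax_id is not None and record["tax_id"] != str(tax_id):
            continue  # keep only the requested organism
        matches[record["GeneID"]].add(record["Ensembl_gene_identifier"])
    return dict(matches)


# Placeholder rows for illustration only; these are not real identifiers.
SAMPLE = [
    "#tax_id\tGeneID\tEnsembl_gene_identifier\tRNA_nucleotide_accession.version"
    "\tEnsembl_rna_identifier\tprotein_accession.version\tEnsembl_protein_identifier",
    "1\t100\tENSG_PLACEHOLDER_A\tXM_1.1\tENST_A1\tXP_1.1\tENSP_A1",
    "1\t100\tENSG_PLACEHOLDER_A\tXM_2.1\tENST_A2\tXP_2.1\tENSP_A2",
    "2\t200\tENSG_PLACEHOLDER_B\tXM_3.1\tENST_B1\tXP_3.1\tENSP_B1",
]

print(load_gene2ensembl(SAMPLE, tax_id=1))
```

With a real download, an open file handle can be passed in place of `SAMPLE`, for example `gzip.open(path, "rt")` if the file is gzip-compressed.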
The bacterial and archaeal representative genome collection has been updated! We selected a total of 14,912 of the 224,000 prokaryotic RefSeq assemblies to represent their respective species. The collection has grown by 8% since April 2021 and now includes Candidatus and endosymbiont species (Figure 1), which account for 303 and 140, respectively, of the 1,077 newly added species. In addition, 719 species are represented by a better assembly, and 70 species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment.
Figure 1. Graphical view of a portion of the RefSeq Representative assembly for the bedbug endosymbiont Candidatus Wolbachia massiliensis isolate PL13.
The National Center for Biotechnology Information (NCBI) has several speakers at the upcoming Biodiversity Genomics Conference from September 27 to October 1, 2021.
Valerie Schneider, head of NCBI’s SeqPlus Program and Deputy Director for Sequence Offerings, will present a poster discussing how NCBI’s new comparative genome research focus will enable researchers to explore all eukaryotic research organisms, find related organisms, and support additional organism-specific resources that a specific community may have or wish to develop.
Nuala O’Leary, Product Owner, NCBI Datasets, will present the latest developments for Datasets, a beta resource that supports intuitive and flexible access to genome data for a broad range of taxa via a redesigned website and command-line tools.
Adelaide Rhodes, Cloud Subject Matter Expert in Education, will present two case studies that emphasize the ease of navigating the new Datasets website as well as the use of command-line tools to speed up data discovery for genes and genomes of interest.
Terence Murphy, Product Owner, NCBI RefSeq, will present a new tool for genome providers to identify contamination in newly assembled sequences with high sensitivity, specificity, and performance.
The Biodiversity Genomics Conference brings together a global audience to celebrate achievements in genome sequencing across the eukaryotic tree of life, explore current challenges and solutions, and develop strategies for sequencing and data sharing in the upcoming decade of biodiversity genomics. NCBI has several programs that support the needs of this research community.