RefSeq release 214 is now available online, from the FTP site, and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of September 12, 2022, and contains 328,588,569 records, including 239,609,016 proteins, 47,387,931 RNAs, and sequences from 123,394 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
Foreign contamination screening
Introducing the new Foreign Contamination Screen (FCS) tool! If you produce assembled genomes, check out FCS, a tool you can run yourself to improve your genome assemblies and facilitate high-quality data submissions to GenBank. FCS is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms. See our previous blog post to learn how FCS enhances contaminant detection sensitivity.
Learn about the NIH Comparative Genomics Resource (CGR) Project
The Biodiversity Genomics conference will take place virtually, October 2-7, 2022. This event is hosted by the Earth BioGenome Project and is open and free for all to attend.
NCBI staff will present a variety of recorded talks and posters highlighting various elements of the NIH Comparative Genomics Resource (CGR), including NCBI Datasets and the Comparative Genome Viewer (CGV). CGR is a multi-year National Library of Medicine (NLM) project to maximize the impact of eukaryotic research organisms and their genomic data resources to biomedical research. NCBI is charged with leading CGR development and engaging genomics communities. The CGR project will facilitate reliable comparative genomics analyses for all eukaryotic organisms in collaboration with the genomics community.
Release 10.0 of the NCBI Hidden Markov models (HMM) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
The 10.0 release contains 15,360 models maintained by NCBI, including 228 that are new since 9.0, 99 that were modified significantly, and 205 that were assigned better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications. You can search and view the details for these in the Protein Family Model collection, which also includes conserved domain architectures and BlastRules, and find all RefSeq proteins they name.
GO terms associated with HMMs are now propagated to CDSs and proteins annotated with PGAP. In case you missed it, see our previous blog post on this topic.
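As a rough illustration of the HMMER search mentioned above, the sketch below builds an `hmmsearch` command line and parses the tabular output written by its `--tblout` option. The file names, helper function names, and the sample hit line are made up for the example; only the `hmmsearch` program, the `--tblout` option, and the column order of the tabular output come from HMMER itself.

```python
def hmmsearch_command(hmm_file, protein_fasta, tblout):
    """Build an hmmsearch command line (HMM file first, then the sequence file)."""
    return ["hmmsearch", "--tblout", tblout, hmm_file, protein_fasta]


def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of hit dictionaries.

    For hmmsearch, the 'target' columns describe the protein sequence and
    the 'query' columns describe the HMM.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 18 whitespace-separated columns, then a free-text description.
        fields = line.split(maxsplit=18)
        hits.append({
            "protein": fields[0],
            "model": fields[2],
            "model_accession": fields[3],
            "evalue": float(fields[4]),   # full-sequence E-value
            "score": float(fields[5]),    # full-sequence bit score
        })
    return hits


# Hypothetical file names, for illustration only.
command = hmmsearch_command("models.hmm", "proteins.faa", "hits.tbl")
print(" ".join(command))

# A made-up tblout line (not real data) to exercise the parser.
sample = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "prot1 - ExampleModel EX00001.1 1.2e-50 170.3 0.1 1.5e-50 170.0 0.1 "
    "1.0 1 0 0 1 1 1 1 hypothetical protein",
]
hits = parse_tblout(sample)
print(hits[0]["protein"], hits[0]["model"], hits[0]["evalue"])
```

In practice you would run the printed command yourself (for example with `subprocess.run`) and pass the lines of the resulting table file to `parse_tblout`.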
PubMed will be moving to an updated version of the E-utilities API on November 15, 2022. As previously announced, this updated version of E-utilities will use the same technology as the web version of PubMed released in 2020. So, search results returned by the updated ESearch E-utility will now match those of the PubMed.gov website.
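For readers who call ESearch from scripts, here is a minimal sketch of how a PubMed ESearch request URL can be assembled. The helper name and the example query are assumptions made for illustration; the endpoint and the `db`, `term`, `retmode`, and `retmax` parameters are standard E-utilities parameters. The snippet only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the ESearch E-utility.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an ESearch URL for the query (hypothetical helper, not an NCBI API)."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example query chosen for illustration only.
url = build_esearch_url("comparative genomics[Title]")
print(url)
```

Fetching this URL (for example with `urllib.request.urlopen`) returns the matching PubMed IDs, which you can compare against the same search on the PubMed.gov website.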
Maps clinically significant variants by gene and position!
ClinVar is a freely accessible, public archive at NLM/NCBI of reports of the relationships between human variations and phenotypes, with supporting evidence. To help you access your variants of interest quickly, ClinVar is introducing an all-new visualization tool in the search results. This graphical display provides an overview of variants when you search by gene or genomic region (Figures 1 and 2). The graphical display appears only for gene or genomic region searches. For other types of searches, you will see the table only.
Gene search display
The display for a gene search highlights small variants within the gene. Large structural variants are also marked as a single dot in the middle of the variation. The interactive display shows the placement of variants on the gene and their clinical significance. It also allows you to zoom in, pan left or right, and limit results to variants in a chosen gene. Figure 1 shows the graphical display as it appears at the top of the search results for the desmoglein 2 (DSG2) gene and how to filter and navigate to variants of interest (Search ClinVar: DSG2[gene]).
B. You can zoom in by mousing over the 8th exon in the gene diagram, which activates a pop-up menu that allows you to re-display only this region by following the link (red box).
C. Refreshed result for the 8th exon of DSG2 showing a number of variants including pathogenic, benign, and ones with conflicting interpretations of pathogenicity. You can select the filters on the left-hand side of the ClinVar result to limit to variants with characteristics of interest, for example Conflicting Interpretations of pathogenicity.
D. Variants in exon 8 of DSG2 filtered for conflicting interpretations of pathogenicity. You can retrieve individual variants by mousing over the graphic to activate the pop-up menu and following the link (red box).
We launched the NCBI Virtual Outreach Event series in the fall of 2021 to expand our online outreach to a worldwide audience of people who use NCBI resources for biological/biomedical research, science education, and clinical applications. Our virtual outreach events include interactive workshops, webinars, and codeathons. In the past year, we have hosted 34 virtual events and served over 1,600 participants (Figure 1).
ElasticBLAST is a new way to BLAST large numbers of queries, faster and on the cloud. Here are the top three reasons you should use ElasticBLAST:
1. ElasticBLAST can handle much LARGER queries!
ElasticBLAST can search query sets containing hundreds to millions of sequences against BLAST databases of all sizes.
2. ElasticBLAST is FASTER
ElasticBLAST distributes your searches across multiple cloud instances to process them simultaneously. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+.
3. ElasticBLAST is EASY to run on the cloud
ElasticBLAST is easy to set up using our step-by-step instructions (Amazon Web Services (AWS), Google Cloud Platform (GCP)) and allows you to leverage the power of the cloud. Once configured, it manages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done.
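To make the idea of partitioning a BLAST workload more concrete, the sketch below splits a set of FASTA queries into batches of bounded total length, so that each batch could be handed to a separate worker. This is only an illustration of the general technique under simple assumptions; it is not ElasticBLAST's own code, and the function names, batch size, and sample sequences are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def make_batches(records, max_residues):
    """Group records into batches whose total length is at most max_residues.

    A single record longer than max_residues is placed in a batch of its own.
    """
    batches, current, size = [], [], 0
    for header, seq in records:
        if current and size + len(seq) > max_residues:
            batches.append(current)
            current, size = [], 0
        current.append((header, seq))
        size += len(seq)
    if current:
        batches.append(current)
    return batches


# Made-up query sequences for illustration.
fasta = [">q1", "MKTA", ">q2", "GGLV", ">q3", "MSTNPK"]
batches = make_batches(read_fasta(fasta), max_residues=8)
for number, batch in enumerate(batches, start=1):
    print(number, [header for header, _ in batch])
```

Bounding batches by total sequence length rather than by sequence count keeps the work per batch more even when query lengths vary widely.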
GenBank release 251.0 (8/15/2022) is now available on the NCBI FTP site. This release has 19.55 trillion bases and 2.94 billion records. The current release has 239,915,786 traditional records containing 1,492,800,704,497 base pairs of sequence data. There are also 2,024,099,677 WGS records containing 17,511,809,676,629 base pairs of sequence data, 560,196,830 bulk-oriented TSA records containing 497,501,380,386 base pairs of sequence data, and 115,103,527 bulk-oriented TLS records containing 43,852,280,645 base pairs of sequence data.
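As a quick check on how the headline totals relate to the per-division figures, the following sketch sums the record and base-pair counts quoted above. The dictionary layout and variable names are just for illustration.

```python
# (records, base pairs) per division, as quoted for GenBank release 251.0.
divisions = {
    "traditional": (239_915_786, 1_492_800_704_497),
    "WGS": (2_024_099_677, 17_511_809_676_629),
    "TSA": (560_196_830, 497_501_380_386),
    "TLS": (115_103_527, 43_852_280_645),
}

total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

print(f"{total_records:,} records ({total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} bases ({total_bases / 1e12:.2f} trillion)")
```

The sums come to 2,939,315,820 records and 19,545,964,042,157 bases, which round to the 2.94 billion records and 19.55 trillion bases stated for the release.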
Conserved Domain Search (CD Search) results now show domain architecture information and other annotations that further characterize predicted domain and protein function. These include links to PubMed, Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, and the SPARCLE Domain Architecture Viewer. You can use these links on the results to find literature (PubMed), assign biological roles and protein function (GO and EC), and find proteins with the same domain architecture (Domain Architecture Viewer). These annotations are currently available for a limited number of architectures, but we will continue to add them as part of our curation effort.