We are excited to announce the NCBI Allele Frequency Aggregator (ALFA) Release 2 (version 20201027095038) as one of the largest and most comprehensive aggregated variant datasets with allele frequency available as open-access. This release contains 79 dbGaP studies that included 192 thousand subjects and 5.8 trillion combined genotypes that generated allele frequency for 904 million variants with 316 million novel ones, previously unknown in dbSNP (Build 154).
Author: NCBI Staff
It’s time we do another roundup of what’s been happening on YouTube!
First up, the NCBI YouTube channel has merged with the NLM YouTube channel. You’ll now be able to find diverse content all on one channel, from tips on using resources to fascinating moments in the history of medicine and more!
This full release incorporates genomic, transcript, and protein data available as of January 4, 2021, and contains 262,714,372 records, including 191,411,721 proteins, 35,353,412 RNAs, and sequences from 106,581 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
Updated human genome Annotation Release 109.20201120
Updated Annotation Release 109.20201120 is an update of NCBI Homo sapiens Annotation Release 109.
The annotation report for 109.20201120 is available here. The annotation products are available in the sequence databases and on the FTP site. Continue reading “RefSeq release 204 is now available”
We have updated the bacterial and archaeal representative genome collection! The current collection contains over 13,000 assemblies selected from the 203,000 prokaryotic RefSeq assemblies to represent their respective species. The collection has increased by 11% since August 2020. We’ve included about 1,400 species for the first time, have used better assemblies for 1,177 species, and have removed 65 species because of changes in NCBI Taxonomy or uncertainty in their species assignment.
We have also updated the Representative Genomes Database on the Microbial Nucleotide BLAST page as well as the RefSeq Representative Genome Database on basic nucleotide BLAST, to reflect these changes. Continue reading “Prokaryotic representative genomes updated — now over 13 thousand assemblies!”
The current release has 221,467,827 traditional records containing 723,003,822,007 base pairs of sequence data. There are also 1,517,995,689 WGS records containing 11,830,842,428,018 base pairs of sequence data, 446,397,378 bulk-oriented TSA records containing 392,206,975,386 base pairs of sequence data, and 88,039,152 bulk-oriented TLS records containing 33,036,509,446 base pairs of sequence data. Continue reading “GenBank release 241.0”
You can now retrieve genome data using the NCBI Datasets command-line tool and API by simply providing a BioProject accession. You can go directly from a BioProject accession to genome data even when the BioProject accession is the parent of multiple BioProjects (Figure 1).
Figure 1. Command-lines using BioProject accessions with the datasets command-line tool and sample metadata. Top panel: command-line for downloading genome metadata for the Sanger 25 Genomes Project (PRJEB33226). Middle panel: a portion of the metadata in JSON format for the 25 Genomes Project. Bottom panel: command-line for downloading sequence data and annotation metadata for a component BioProject for the king scallop (PRJEB35331). Continue reading “Retrieve genome data by BioProject using the Datasets command-line tool”
Do you login to NCBI to use MyNCBI, SciENcv, or MyBibliography? Do you submit data to NCBI? If so, you’ll want to read further to get a first glimpse at some important changes to NCBI accounts that will be coming in 2021.
In brief, NCBI will be transitioning to federated account credentials. NCBI-managed credentials are the username and password you set at NCBI — these will be going away. Federated account credentials are those set through eRA Commons, Google, or a university or institutional point of access.
Why is this happening?
NIH, NLM, and NCBI take your privacy and security very seriously. As part of our normal reviews we have determined that making this change will increase the security of your accounts to a level that we feel is necessary.
When is this happening?
After June 1, 2021, you will no longer be able to use NCBI-managed credentials to login to NCBI.
One important way the National Library of Medicine (NLM) is responding to the ongoing public health emergency is through the COVID-19 Initiative. This public-private cooperation between NLM and more than 50 scholarly publishers and societies allows you to access over 100,000 articles on COVID-19, SARS-CoV-2 and other coronaviruses through PubMed Central (PMC). This collection includes recently published discoveries, a history of coronavirus reports for comparison, international (globally comprehensive) content, and captures the breadth of research, analysis, and commentary. We make these articles available in human- and machine-readable formats to support public accessibility and analysis by researchers.
You can search this public health emergency collection in PMC or download the collection through the PMC Open Access Subset. The collection spans:
- More than half a century of research, including articles from the 1960s through the present (more than 60% of the articles included thus far were published in 2020 (Figure 1, top panel);
- Several languages, including content in English (~95%), German, French, and Spanish;
- Many publication types, more than half of them research or review articles (Figure 1, bottom panel).
Figure 1. The Public Health Emergency Collection articles by decade of publication (top panel) and by publication type (bottom panel).
People have viewed or downloaded articles in this PMC collection more than 80 million times since March reflecting the great demand for such an open and centralized collection. Artificial intelligence organizations, such as the Allen Institute for AI — builders of the COVID-19 Research Dataset (CORD-19), have also used the collection to develop new text and data mining techniques that can help answer high-priority scientific questions related to COVID-19.
Release 4.0 of the NCBI hidden Markov models (HMM) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
This release contains 17,443 models, including 94 new models since the last release. We have also updated names and added EC numbers and gene symbols to over 100 models. You can search and view the details of these HMMs in the newly deployed Protein Family Model collection that also includes conserved domain architectures and BlastRules and allows you to find all RefSeq proteins named by these profiles. See our recent post for more details.
The new Protein Family Model resource (Figure 1) provides a way for you to search across the evidence used by the NCBI annotation pipelines to name and classify proteins. You can find protein families by gene symbol, protein function, and many other terms. You have access to related proteins in the family and publications describing members. Protein Family Models includes protein profile hidden Markov models (HMMs) and BlastRules for prokaryotes, and conserved domain architectures for prokaryotes and eukaryotes. The HMMs in the collection include Pfam models, TIGRFAMs as well as models developed at NCBI either de novo, or from NCBI protein clusters. Each of the BlastRules (PMCID: 5753331) consists of one or more model proteins of known biological function with BLAST identity and coverage cutoffs. The conserved domain architectures are based on BLAST-compatible Position Specific Score Matrices (PSSMs) that constitute the NCBI Conserved Domain database.Figure 1. Protein Family Model resource pages. Top panel. Home page. Middle panel, selected results summaries from a fielded search for the DnaK gene product (DnaK[Gene Symbol]). Bottom panel, a portion of an HMM record for DnaK derived from NCBI Protein Clusters (NF009946). The record also includes PubMed citations and HMMER analyses showing the RefSeq proteins named by this method.