NCBI introduces Datasets, a new resource that lets you easily gather data from across NCBI databases. Our first release allows you to find and download genomic sequence and annotation data for all eukaryotic organisms through our user-friendly web interface.
Our web interface also provides an interactive taxonomy tree that lets you browse for your favorite organism. We are currently testing the web interface in the NCBI labs environment. To try it out, enter a taxonomic name or assembly accession and click on the ‘Get Data’ button in the search results panel.
Our new, responsive PubMed site replaces PubMed Mobile. You now have the full PubMed experience on any size screen, including the ability to save and email citations, use the Clipboard, and send citations to My NCBI Collections on your mobile device.
Also, the new, responsive PubMed will replace the legacy desktop site for PubMed in late spring 2020. NLM will continue adding features and improving the user experience, ensuring that PubMed remains a trusted and accessible source of biomedical literature today and in the future.
For more information about the development of the new PubMed, please see the NLM Technical Bulletin.
NIH’s data sharing policy now allows unrestricted access to genomic summary results for data from NCBI’s Database of Genotypes and Phenotypes (dbGaP). Pooled allele frequency data from dbSNP and the dbGaP summary results are available as the new Allele Frequency Aggregator (ALFA) dataset. The ALFA dataset includes aggregated and harmonized array chip genotyping, exome, and genome sequencing data. The ALFA data are open access and freely available for you to incorporate into your workflows and applications from the dbSNP web pages (Figure 1), through FTP,and the Variation Services API. dbGaP currently has data for more than 2 million study subjects, approximately 1 million of whom have genotype data that is suitable for input into the ALFA dataset. The first release of ALFA contains data on about 100,000 subjects, and we hope to complete processing of data on the other 925,000 subjects within the next year. This volume and variety of data promises unprecedented opportunities to identify genetic factors that influence health and disease. Register to attend our April 22 webinar and read on to learn more.
Figure 1. ALFA allele frequencies for a variant (rs4988235) in the promotor of the lactase gene showing frequency differences across populations.
Are you interested in mining literature about COVID-19 and the novel SARS-Cov-2 virus? You may want to check out the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a collection of more than 13,000 full text articles that focus on COVID-19 and coronaviruses and that were assembled from PMC, the WHO, bioRxiv, and medRxiv. To produce this dataset, the National Library of Medicine partnered with colleagues from the Allen Institute for AI, the Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Kaggle, Microsoft, and the White House Office of Science and Technology Policy (OSTP).
CORD-19 is available from the Allen Institute and will be updated weekly as new articles become available. The article data are formatted in JSON, making the collection ideal for computational methods such as data mining, machine learning, and natural language processing. We hope this collection serves as a call to action for the community to improve our understanding of coronaviruses and the human diseases they cause. Have a look and let us know what you think!
On Wednesday, April 8, 2019 at 12 PM, NCBI staff will show you how to leverage the cloud to speed up your research and discovery. You’ll be introduced to new and existing tools and data including BigQuery, SRA Toolkit, and more. You’ll hear about real workflows in the cloud featuring an example of the work NCBI was able to accomplish in the cloud using SRA data and a case study from an SRA cloud customer
By the end of this webinar, you will know where to look for new cloud products from NCBI, access help information to get you started, and will see how to run your analyses efficiently in the cloud.
Date and time: Wed, Apr 8, 2020 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
A new release of the NCBI protein families profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of Hidden Markov models (HMM) against your favorite prokaryotic proteins to identify their function using hmmer.
The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes and are also one of the sources for the names assigned to PGAP-annotated proteins presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (See for example, WP_004152100.1).
The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.
85% of models were assigned a product name that can be transferred to proteins hit by the model.
7702 models have gene symbols.
14508 are supported by a least one publication.
6266 are assigned an Enzyme Commission number.
617 represent anti-microbial resistance proteins.
Product names added to 4,686 PFAM HMMs owned by EBI-EMBL and used for functional annotation by PGAP are also included.
A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs, and can be identified with the Entrez query “meta Evidence-For-Name-Assignment”[Properties] AND “Evidence Category=HMM”[Text Word]. See an example and more information on web displays of HMMs in a previous post.
RefSeq release 99 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of March 2, 2020, and contains 231,402,293 records, including 167,278,920 proteins, 29,869,155 RNAs, and sequences from 99,842 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
NLM staff will participate in the next American Chemical Society webinar for the chemical information and cheminformatics community: An Overview of NLM’s Post-TOXNET Resources. TOXNET (the TOXicology Data NETwork) was retired in December 2019 as part of the reorganization associated with the NLM Strategic Plan. Most of TOXNET’s databases have been incorporated into other NLM resources such as PubChem and Bookshelf, or continue to be available elsewhere. This webinar will show you where to go now for TOXNET information.