About NCBI Staff

The National Center for Biotechnology Information (NCBI), a division of the U.S. National Library of Medicine, provides access to scientific and biomedical databases and software tools for analyzing molecular data, and performs research in computational biology.

Announcing NCBI Datasets – try it out!

NCBI introduces Datasets, a new resource that lets you easily gather data from across NCBI databases. Our first release allows you to find and download genomic sequence and annotation data for all eukaryotic organisms through our user-friendly web interface.

Our web interface also provides an interactive taxonomy tree that lets you browse for your favorite organism. We are currently testing the web interface in the NCBI labs environment. To try it out, enter a taxonomic name or assembly accession and click on the ‘Get Data’ button in the search results panel.

Here’s what it looks like when you search ‘apes’:


Try the new PubMed on your mobile device

Our new, responsive PubMed site replaces PubMed Mobile. You now have the full PubMed experience on any size screen, including the ability to save and email citations, use the Clipboard, and send citations to My NCBI Collections on your mobile device.


Figure 1. The new PubMed on mobile.

Also, the new, responsive PubMed will replace the legacy desktop site for PubMed in late spring 2020. NLM will continue adding features and improving the user experience, ensuring that PubMed remains a trusted and accessible source of biomedical literature today and in the future.

For more information about the development of the new PubMed, please see the NLM Technical Bulletin.

 

The ALFA dataset: New aggregated allele frequency from dbGaP and dbSNP now available

NIH’s data sharing policy now allows unrestricted access to genomic summary results for data from NCBI’s Database of Genotypes and Phenotypes (dbGaP). Pooled allele frequency data from dbSNP and the dbGaP summary results are available as the new Allele Frequency Aggregator (ALFA) dataset. The ALFA dataset includes aggregated and harmonized array chip genotyping, exome, and genome sequencing data. The ALFA data are open access and freely available for you to incorporate into your workflows and applications from the dbSNP web pages (Figure 1), through FTP, and through the Variation Services API. dbGaP currently has data for more than 2 million study subjects, approximately 1 million of whom have genotype data suitable for input into the ALFA dataset. The first release of ALFA contains data on about 100,000 subjects, and we hope to complete processing of data on the other 925,000 subjects within the next year. This volume and variety of data promises unprecedented opportunities to identify genetic factors that influence health and disease. Register to attend our April 22 webinar and read on to learn more.

Figure 1. ALFA allele frequencies for a variant (rs4988235) in the promoter of the lactase gene showing frequency differences across populations.


CORD-19: A New Machine Readable COVID-19 Literature Dataset

Are you interested in mining literature about COVID-19 and the novel SARS-CoV-2 virus? You may want to check out the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a collection of more than 13,000 full-text articles that focus on COVID-19 and coronaviruses and that were assembled from PMC, the WHO, bioRxiv, and medRxiv. To produce this dataset, the National Library of Medicine partnered with colleagues from the Allen Institute for AI, the Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Kaggle, Microsoft, and the White House Office of Science and Technology Policy (OSTP).

CORD-19 is available from the Allen Institute and will be updated weekly as new articles become available. The article data are formatted in JSON, making the collection ideal for computational methods such as data mining, machine learning, and natural language processing. We hope this collection serves as a call to action for the community to improve our understanding of coronaviruses and the human diseases they cause. Have a look and let us know what you think!
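Because the articles are distributed as JSON, pulling out plain text for mining takes only a few lines. The sketch below is illustrative only: the field names (paper_id, metadata.title, body_text[].text) are assumptions about the article layout and are not stated in this post, and the sample record is invented placeholder text, so check the dataset's own schema description before relying on it.

```python
import json

# Invented placeholder record. The field names are ASSUMED for illustration
# and may not match the actual CORD-19 schema.
SAMPLE_ARTICLE = json.dumps({
    "paper_id": "hypothetical-id-0001",
    "metadata": {"title": "Placeholder title"},
    "body_text": [
        {"section": "Introduction", "text": "First placeholder paragraph."},
        {"section": "Methods", "text": "Second placeholder paragraph."},
    ],
})


def extract_text(raw_json):
    """Return (title, full_text) from one article given as a JSON string."""
    article = json.loads(raw_json)
    title = article.get("metadata", {}).get("title", "")
    paragraphs = [p.get("text", "") for p in article.get("body_text", [])]
    return title, "\n".join(paragraphs)


title, full_text = extract_text(SAMPLE_ARTICLE)
print(title)
print(full_text)
```

Using .get() with defaults keeps the function from failing on records that lack a title or body text, which is a common situation in aggregated literature collections.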

April 8 Webinar: Accelerate genomics discovery with SRA in the cloud

On Wednesday, April 8, 2020 at 12 PM, NCBI staff will show you how to leverage the cloud to speed up your research and discovery. You’ll be introduced to new and existing tools and data including BigQuery, SRA Toolkit, and more. You’ll hear about real workflows in the cloud featuring an example of the work NCBI was able to accomplish in the cloud using SRA data and a case study from an SRA cloud customer.

By the end of this webinar, you will know where to look for new cloud products from NCBI, how to access help information to get you started, and how to run your analyses efficiently in the cloud.

  • Date and time: Wed, Apr 8, 2020 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Protein family models used by PGAP are now available for download

A new release of the NCBI protein family profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using HMMER.
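As a rough illustration of what you might do with the results of such a search, the sketch below parses the per-sequence table that HMMER's hmmsearch writes when run with --tblout and keeps the best-scoring model for each protein. The sample lines, protein names, and model names are invented for illustration, and choosing the hit with the lowest E-value is an assumption here, not necessarily how PGAP assigns names.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    protein: str    # target name (the protein sequence that was searched)
    model: str      # query name (the HMM)
    accession: str  # query accession
    evalue: float   # full-sequence E-value
    score: float    # full-sequence bit score


# Invented sample in the whitespace-delimited --tblout layout.
SAMPLE_TBLOUT = """\
# target name accession query name accession E-value score bias ...
prot_1 - model_A ACC_A 1.2e-50 170.3 0.1 1.5e-50 170.0 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein
prot_1 - model_B ACC_B 3.0e-10 40.2 0.0 4.0e-10 39.8 0.0 1.1 1 0 0 1 1 1 1 hypothetical protein
prot_2 - model_B ACC_B 2.0e-30 105.0 0.2 2.5e-30 104.7 0.2 1.0 1 0 0 1 1 1 1 -
"""


def parse_tblout(lines: Iterable[str]) -> List[Hit]:
    """Parse hmmsearch --tblout lines, skipping comments and blank lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        hits.append(Hit(f[0], f[2], f[3], float(f[4]), float(f[5])))
    return hits


def best_hit_per_protein(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep, for each protein, the hit with the lowest full-sequence E-value."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.protein)
        if current is None or hit.evalue < current.evalue:
            best[hit.protein] = hit
    return best


hits = parse_tblout(SAMPLE_TBLOUT.splitlines())
best = best_hit_per_protein(hits)
for protein, hit in best.items():
    print(protein, hit.model, hit.evalue)
```

The model name or accession of the best hit could then be looked up in the tab-delimited attributes file described below to find the curated product name.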

The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes and are also one of the sources for the names assigned to PGAP-annotated proteins presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (see, for example, WP_004152100.1).

The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.

  • 85% of models were assigned a product name that can be transferred to proteins hit by the model.
  • 7,702 models have gene symbols.
  • 14,508 are supported by at least one publication.
  • 6,266 are assigned an Enzyme Commission number.
  • 617 represent antimicrobial resistance proteins.
  • Product names added to 4,686 Pfam HMMs owned by EMBL-EBI and used for functional annotation by PGAP are also included.

A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs, and can be identified with the Entrez query “meta Evidence-For-Name-Assignment”[Properties] AND “Evidence Category=HMM”[Text Word]. See an example and more information on web displays of HMMs in a previous post.
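If you would rather run this query programmatically, one option is the E-utilities esearch endpoint. The sketch below only builds the request URL and does not contact NCBI. Searching the protein database and the retmax value are assumptions made for illustration, and the query is written with straight quotes, which Entrez expects.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The Entrez query from the text above, with straight quotes.
HMM_NAMED_QUERY = (
    '"meta Evidence-For-Name-Assignment"[Properties] '
    'AND "Evidence Category=HMM"[Text Word]'
)


def build_esearch_url(term, db="protein", retmax=20):
    """Return an esearch URL for `term`; db and retmax are assumed defaults."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


url = build_esearch_url(HMM_NAMED_QUERY)
print(url)
```

Passing the parameters through urlencode takes care of escaping the quotes, brackets, and spaces in the query.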

Fifteen new NCBI annotations in RefSeq: flies, harbor seal and more

In January and February, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Aythya fuligula (tufted duck)
  • Camelus ferus (Wild Bactrian camel)
  • Corvus moneduloides (New Caledonian crow)
  • Coturnix japonica (Japanese quail)
  • Drosophila ananassae (fly)
  • Drosophila virilis (fly)
  • Etheostoma spectabile (orangethroat darter)
  • Hylobates moloch (silvery gibbon)
  • Mustela erminea (ermine)
  • Nematostella vectensis (starlet sea anemone)
  • Nomia melanderi (Alkali bee)
  • Phoca vitulina (harbor seal)
  • Sapajus apella (Tufted capuchin)
  • Thamnophis elegans (Western terrestrial garter snake)
  • Xiphophorus hellerii (green swordtail)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

RefSeq Release 99 is public

RefSeq release 99 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of March 2, 2020, and contains 231,402,293 records, including 167,278,920 proteins, 29,869,155 RNAs, and sequences from 99,842 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.


Webinar on current access to TOXNET resources

NLM staff will participate in the next American Chemical Society webinar for the chemical information and cheminformatics community: An Overview of NLM’s Post-TOXNET Resources. TOXNET (the TOXicology Data NETwork) was retired in December 2019 as part of the reorganization associated with the NLM Strategic Plan. Most of TOXNET’s databases have been incorporated into other NLM resources such as PubChem and Bookshelf, or continue to be available elsewhere. This webinar will show you where to go now for TOXNET information.

  • Date and Time: Tuesday, March 17 at 1:00pm EDT.
  • Register 

A live Q&A session will follow the webinar.

GenBank release 236 is available

GenBank release 236.0 (2/20/2020) is now available on the NCBI FTP site. This release has over 7.72 trillion bases and 1.84 billion records.

The release has 216,214,215 traditional records containing 399,376,854,872 base pairs of sequence data. There are also 1,206,720,688 WGS records containing 6,968,991,265,752 base pairs of sequence data, 386,644,871 bulk-oriented TSA records containing 340,994,289,065 base pairs of sequence data, and 34,037,371 bulk-oriented TLS records containing 13,669,678,196 base pairs of sequence data.

During the 70 days between the close dates for GenBank Releases 235.0 and 236.0, the ‘traditional’ portion of GenBank grew by 10,959,596,863 base pairs and by 881,195 sequence records. During that same period, 62,552 records were updated. An average of 13,482 ‘traditional’ records were added and/or updated per day.
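The headline totals and the daily average follow directly from the per-component figures quoted above. A short check, using only numbers that appear in this post:

```python
# (records, base pairs) for each component of GenBank release 236.0.
components = {
    "traditional": (216_214_215, 399_376_854_872),
    "WGS": (1_206_720_688, 6_968_991_265_752),
    "TSA": (386_644_871, 340_994_289_065),
    "TLS": (34_037_371, 13_669_678_196),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

# Traditional records added plus records updated over the 70-day window.
added, updated, days = 881_195, 62_552, 70
per_day = (added + updated) / days

print(f"{total_records:,} records")   # 1,843,617,145 (about 1.84 billion)
print(f"{total_bases:,} bases")       # 7,723,032,087,885 (about 7.72 trillion)
print(f"{per_day:,.0f} per day")      # 13,482
```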

Between releases 235.0 and 236.0, the WGS component of GenBank grew by 691,440,065,062 base pairs and by 79,696,818 sequence records. The TSA component of GenBank grew by 15,561,272,936 base pairs and by 19,451,027 sequence records. The TLS component of GenBank grew by 2,389,081,582 base pairs and by 5,810,191 sequence records. The VRT component of GenBank decreased due to the suppression of 40 chromosomal records for the Coregonus sp. ‘balchen’ genome, with 2.1 Gbp of sequence data. This organism is already represented by underlying sequence contigs plus chromosomal CON-division/scaffold records built from those contigs. The 40 suppressed records are redundant with those scaffolds, and their suppression resulted in fewer VRT-division files.

The total number of sequence data files increased by 48 with this release. The divisions are as follows:

  • BCT: 17 new files, now a total of 418
  • CON: 4 new files, now a total of 216
  • ENV: 1 new file, now a total of 59
  • MAM: 10 new files, now a total of 49
  • PAT: 2 new files, now a total of 204
  • PLN: 18 new files, now a total of 204
  • VRL: 1 new file, now a total of 36
  • VRT: 5 fewer files, now a total of 161
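The net increase of 48 files is the sum of the per-division changes listed above, counting the VRT reduction as negative:

```python
# Change in the number of sequence data files per division in release 236.0.
file_changes = {
    "BCT": 17,
    "CON": 4,
    "ENV": 1,
    "MAM": 10,
    "PAT": 2,
    "PLN": 18,
    "VRL": 1,
    "VRT": -5,  # fewer files after the suppression of redundant records
}

net_change = sum(file_changes.values())
print(net_change)  # 48
```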

For downloading purposes, the uncompressed GenBank release 236.0 flat files require roughly 1,117 GB, including the sequence files and the *.txt files.

More information about GenBank release 236.0 is available in the Release Notes, as well as in the README files in the GenBank and ASN.1 (ncbi-asn1) directories on FTP.