The National Library of Medicine (NLM) is pleased to announce that all controlled-access and publicly available data in SRA is now available through Google Cloud Platform (GCP) and Amazon Web Services (AWS). To access the data, please visit our SRA in the Cloud webpage, where you will find links to our new SRA Toolkit and other access methods.
The SRA data available in the two clouds currently totals more than 14 petabytes and consists of all data in the SRA format as well as some data in its original submission format. Since May 2019, NCBI has been putting all submitted SRA data on the GCP and AWS clouds in both the submitted format and our converted SRA format. We have also been moving previously submitted original format data to the clouds and expect to complete that process in 2021.
We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section on the nucleotide BLAST page for these targeted sequences that makes it convenient to quickly identify source organisms (Figure 1).
Next week, NCBI staff will attend AGBT in Marco Island, Florida. On Tuesday, February 25, 2020, three posters from NCBI staff will be on display from 4:40 p.m. – 6:10 p.m. during the Poster Session and Wine Reception in the Banyan and Calusa Ballroom Foyers, Levels 1 and 3. Read on to learn a little bit about what we’ll be presenting.
A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) with several important features is now available on GitHub.
In response to several requests, we have added the option of running PGAP with Singularity, Podman, or any other Docker-compatible executable you wish to use.
We have also lifted the requirement for internet access in case you have privacy concerns. To run the pipeline without internet access, set the flag
Are you unsure about the identity of the organism you sequenced? We’ve added the Taxonomy-Check module to help you. This module will confirm the organism name or suggest a new taxonomic assignment through average nucleotide identity comparison with type material assemblies from GenBank. The check is currently an optional validation step prior to PGAP.
Try these new features and let us know what you think! Or submit your PGAP-annotated assembly to GenBank. And remember that if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors flag to get a preliminary annotation.
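For readers who script their annotation runs, here is a minimal sketch of assembling a PGAP command line that includes the --ignore-all-errors flag mentioned above. Only that flag comes from this announcement. The script name pgap.py, the -o output option, the positional input YAML file, and the helper name build_pgap_command are assumptions for illustration, so check the PGAP documentation for the exact invocation.

```python
from typing import List


def build_pgap_command(
    input_yaml: str,
    output_dir: str,
    ignore_all_errors: bool = False,
) -> List[str]:
    """Build an argument list for a PGAP run (hypothetical helper).

    Assumptions: the pipeline is launched through a script named
    pgap.py, takes -o for the output directory, and takes the input
    YAML as a positional argument. Only --ignore-all-errors is taken
    from the announcement text.
    """
    cmd = ["./pgap.py", "-o", output_dir]
    if ignore_all_errors:
        # Ask for a preliminary annotation even when the assembly
        # fails the pre-annotation validation.
        cmd.append("--ignore-all-errors")
    cmd.append(input_yaml)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is
    # safe to try without PGAP installed.
    print(" ".join(build_pgap_command("input.yaml", "draft_results", True)))
```

Building the command as a list rather than a single string keeps it ready to pass to subprocess.run without shell quoting problems.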
We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.
We will reduce the set of Reference assemblies to the 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose Reference status.
We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.
The National Library of Medicine (NLM) is pleased to announce the Data and Technology Advancement (DATA) National Service Scholar Program, a new opportunity for experienced data and computer scientists and engineers to tackle biomedical data challenges in partnership with NIH Office of Data Science Strategy (ODSS).
The one- to two-year positions will be based in NIH offices located in Bethesda. DATA Scholars will lead transformative NIH projects for architecting search across petabyte-scale genomic Sequence Read Archive (SRA) data. Scholars will:
pioneer sequence search strategies against the entire corpus of NIH’s SRA data to stimulate novel approaches to advance data analysis and accelerate biological discoveries.
develop methods to execute sequence-based searches, including those that involve machine-learning or other AI approaches.
directly communicate technical and project-related information with NIH senior leadership.
collaborate with other DATA Scholars and the NIH data science community across broad disciplinary boundaries.
engage with policymakers, top researchers, and industry partners.
Applicants should possess technical skills in areas such as artificial intelligence, cloud computing, data engineering, data science, database management, project management, software design, supercomputing, and/or bioinformatics. Industry experience is desired. Applicants should have an M.D., Ph.D. or equivalent doctoral degree and have advanced experience in data science or related fields.
Applications are due April 30, 2020. For more information and details on how to apply, visit our full job announcement.
DHHS and NIH are Equal Opportunity Employers. Applications from women, minorities, and persons with disabilities are strongly encouraged.
NIH is pleased to announce a computational medicine-focused codeathon. To apply, please complete the application form by February 25, 2020. We will also be offering a free workshop, AWS Technical Essentials, the day before the codeathon. Read on for more information about the event.
We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.
Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.
These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
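If you maintain scripts that still point at the old genus_species directories, the move described above can be sketched as a simple path rewrite. This is an illustrative sketch only: it assumes each directory keeps its name and internal layout when it is moved under genomes/archive/old_refseq, which the announcement does not spell out, and the helper name archived_path is hypothetical.

```python
import posixpath

# Destination named in the announcement for the old genus_species directories.
ARCHIVE_ROOT = "genomes/archive/old_refseq"


def archived_path(old_path: str) -> str:
    """Map an old genus_species FTP path to its assumed archive location.

    Assumption: the directory name (e.g. Xenopus_tropicalis) and
    everything beneath it are preserved unchanged after the move.
    """
    parts = [p for p in old_path.strip("/").split("/") if p]
    if len(parts) < 2 or parts[0] != "genomes":
        raise ValueError(f"not a genomes FTP path: {old_path!r}")
    if parts[1] in ("refseq", "archive"):
        raise ValueError(f"not a genus_species directory: {old_path!r}")
    return posixpath.join(ARCHIVE_ROOT, *parts[1:])


if __name__ == "__main__":
    print(archived_path("genomes/Xenopus_tropicalis"))
    # prints genomes/archive/old_refseq/Xenopus_tropicalis
```

Current annotation results live under genomes/refseq, so a rewrite like this is only useful for reaching the archived copies of older data.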
“Database resources of the National Center for Biotechnology Information”
by Eric W Sayers, Jeff Beck, J Rodney Brister, Evan E Bolton, Kathi Canese et al. (PMID: 31602479)
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. This article provides a brief overview of the NCBI Entrez system of databases, followed by a summary of resources that were either introduced or significantly updated in the past year, including PubMed, PMC, Bookshelf, BLAST databases and more!