Tag: Sequence Read Archive (SRA)

Introducing SARS-CoV-2 Variants Overview, NLM’s latest tool in the fight against COVID-19 

The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) has released a new resource, called the SARS-CoV-2 Variants Overview, that aggregates data related to SARS-CoV-2 variants from sequences available in NCBI’s GenBank and Sequence Read Archive (SRA) databases.

SARS-CoV-2 Variants Overview, a freely available online dashboard, was developed with guidance from the TRACE Working Group as part of NLM’s participation in the National Institutes of Health (NIH) Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) initiative, a public-private partnership for a coordinated research strategy to support and speed up the development of COVID-19 treatments and vaccines.

One impetus for development of the dashboard is that unassembled SRA data cannot be processed through Pango tools, and many SARS-CoV-2 samples are only represented in SRA. The Pango nomenclature is being used by researchers and public health agencies worldwide to track the transmission and spread of SARS-CoV-2, including variants of concern. Thus, we developed a uniform approach to making variant calls from SRA records and assigning Pangolin lineages on the basis of these results. This means that submission groups do not have to go through the effort of creating assemblies.

BLAST+ 2.13.0 now available with SRA BLAST, ARM Linux executables, and database metadata

BLAST+ 2.13.0 adds several important features, including SRA BLAST programs, ARM Linux executables, and the ability to produce database metadata, along with some important improvements and a few bug fixes. You can download the new BLAST release from the FTP site.

New features

SRA / WGS BLAST (blastn_vdb, tblastn_vdb)

Beginning with this release, the BLAST distribution includes the SRA BLAST programs blastn_vdb and tblastn_vdb, which can search SRA and WGS projects directly without the need to build a BLAST database. See the BLAST documentation on how to use these programs with WGS projects.
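
As a rough illustration of what a search without a prebuilt database might look like, the sketch below only assembles a command line and does not run anything. It assumes blastn_vdb accepts the usual BLAST+ options -db, -query, and -out, with accessions passed as one space-separated string; the accessions and file names are hypothetical placeholders, so check the BLAST documentation before relying on it.

```python
import shlex


def build_blastn_vdb_command(accessions, query_path, out_path):
    """Return an argument list for a blastn_vdb search (nothing is executed).

    Assumption: blastn_vdb takes the standard BLAST+ options -db, -query, -out.
    """
    if not accessions:
        raise ValueError("at least one SRA or WGS accession is required")
    return [
        "blastn_vdb",
        "-db", " ".join(accessions),  # accessions joined into one string
        "-query", query_path,
        "-out", out_path,
    ]


# Hypothetical accessions and file names, for illustration only.
cmd = build_blastn_vdb_command(
    ["SRR0000001", "SRR0000002"], "query.fasta", "hits.txt"
)
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run once the executable is on your PATH; printing it first makes it easy to check the command before running it.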

ARM Linux executables

This release also includes executables compiled under ARM Linux for the first time. Please let us know if you find any issues with ARM Linux programs.

Database metadata in JSON format

Starting with BLAST+ 2.13.0, the makeblastdb program generates an additional file with the file extension .njs for nucleotide databases or .pjs for protein databases. These files contain BLAST database metadata in JSON format. See the BLAST database metadata section in the BLAST User Manual for an example. The metadata file can be read easily by many tools and makes the BLAST database more compliant with FAIR principles.
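
Because the metadata is plain JSON, a few lines of Python are enough to read it. The sketch below is a minimal example that makes no assumption about which keys the file contains; the function name and the demo file contents are made up for illustration, and the BLAST User Manual remains the reference for the real fields.

```python
import json
import tempfile
from pathlib import Path

# Extensions described in the post: .njs (nucleotide) and .pjs (protein).
METADATA_SUFFIXES = {".njs": "nucleotide", ".pjs": "protein"}


def load_blast_db_metadata(path):
    """Return (molecule type, parsed JSON) for a BLAST metadata file."""
    path = Path(path)
    molecule = METADATA_SUFFIXES.get(path.suffix)
    if molecule is None:
        raise ValueError(f"expected a .njs or .pjs file, got {path.name!r}")
    with path.open(encoding="utf-8") as handle:
        return molecule, json.load(handle)


# Demo with a stand-in file; the key below is hypothetical, not real output.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.njs"
    demo_file.write_text('{"example-key": "example value"}', encoding="utf-8")
    molecule, metadata = load_blast_db_metadata(demo_file)
    print(molecule, sorted(metadata))
```

Listing the top-level keys first is a safe way to explore a real metadata file before depending on any particular field.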

See the release notes for more details on improvements and bug fixes for the release.

Important reminder about usage reporting

As we announced previously, BLAST can report limited usage information back to NCBI. This information shows us whether BLAST+ is being used by the community and is therefore worth maintaining and developing. It also allows us to focus our development efforts on the most used aspects of BLAST+. Please help us improve BLAST by allowing BLAST to share information about your search. The BLAST privacy statement provides details on the information collected, how it is used, and how to opt out of reporting if you don’t want to participate.

NCBI Trace database to be retired in June 2022. Data available in SRA.

The Trace Archive at NCBI will be retired as of June 17, 2022. After retirement, you can continue to retrieve Trace Archive content by searching the Sequence Read Archive (SRA) using TI number, organism, or center name.

Petabyte-Scale Sequence Search: Metagenomics Benchmarking Codeathon Highlights

The National Institutes of Health (NIH) Office of Data Science Strategy (ODSS), the National Library of Medicine’s (NLM’s) National Center for Biotechnology Information (NCBI), and the Department of Energy’s (DOE’s) Office of Biological and Environmental Research (BER) hosted scientists from around the world for a virtual Petabyte-Scale Sequence Search: Metagenomics Benchmarking Codeathon. The codeathon, held September 27-October 1, 2021, attracted experts from national laboratories including the Los Alamos National Laboratory, research institutions including the Joint Genome Institute, and students from universities across the world to develop benchmarking approaches to address challenges in conducting large-scale analyses of metagenomic data.

NCBI on YouTube: Customize MSA Viewer, SciENcv, plants and RNA-Seq data, Datasets and PubMed

Missed a few videos on YouTube? Here’s the latest from our channel.

Customize the MSA Viewer to Make Your Analysis Easier

We’re constantly improving the Multiple Sequence Alignment (MSA) Viewer. This video demonstrates several new and popular features, including the ability to change data columns, hide selected rows, analyze polymorphisms, and more.

View GEO, SRA, or dbGaP data tracks in NCBI’s Genome Data Viewer

Did you know that you can see epigenomic or other experimental data in NCBI’s Genome Data Viewer (GDV)?

You can easily add aligned study results from GEO, SRA, and dbGaP as data tracks to the GDV browser view. Just go to the Tracks button on the toolbar and select the menu option to Configure Tracks. Navigate to the ‘Find Tracks’ tab on the pop-up Configure panel (Figure 1).

Figure 1. Go to the ‘Tracks’ menu on the browser toolbar and select the ‘Configure Tracks’ option. This will launch a panel where you can add, configure, remove, and search for data tracks. Go to the ‘Find Tracks’ tab to search for tracks to add to your browser view. Note: spaces act as AND operators in the search, and wildcards are accepted.
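
To make the search note concrete, here is a small sketch of how "spaces act as AND" with wildcards could behave. It is not GDV's implementation: the matching rules (case-insensitive, each term compared against whole words in the track name, shell-style `*` wildcards) and the function name are assumptions chosen for illustration.

```python
from fnmatch import fnmatchcase


def track_matches(query, track_name):
    """Return True if every space-separated query term matches some word.

    Assumed semantics: terms are ANDed together, matching ignores case,
    and '*' in a term matches any run of characters within a word.
    """
    words = track_name.lower().split()
    return all(
        any(fnmatchcase(word, term) for word in words)
        for term in query.lower().split()
    )


# Hypothetical track name, for illustration only.
print(track_matches("rna* liver", "RNA-seq liver tissue"))
print(track_matches("rna* kidney", "RNA-seq liver tissue"))
```

Under these assumptions the first query matches because both terms are satisfied, while the second fails because no word matches "kidney".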

The Sequence Read Archive slims down your data with SRA Lite

In response to your requests for compact and faster-to-deliver data, NIH’s Sequence Read Archive (SRA) now offers a new data format, SRA Lite (Figure 1). SRA Lite supports reliable and faster data transfer, downloads, and analysis using current tools. SRA Lite replaces the submitted base quality score (BQS) with a simplified read quality score, reducing the average read size by ~60% for more efficient analysis and storage of large datasets. This format was designed to reflect improvements in next-generation sequencing that include increases in average read length and sequence coverage. Indeed, the data has improved enough that removing some quality scores increases genotype accuracy (PMCID: PMC4439189).

Figure 1. FASTQ dumped from SRA Lite format and the SRA configuration dialog. The FASTQ has the quality score for each base set to 30 (‘?’ in the ASCII encoding). Select “Prefer SRA Lite files with simplified base Quality scores” in the SRA configuration dialog to use SRA Lite.
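
The correspondence between a score of 30 and ‘?’ follows from the common FASTQ convention of storing each quality score as the ASCII character with code score + 33. The sketch below assumes that offset, which is consistent with the caption above; the function names are illustrative only.

```python
PHRED_OFFSET = 33  # assumed: the widely used "Phred+33" FASTQ encoding


def phred_to_char(score):
    """Encode one quality score as its FASTQ character."""
    return chr(score + PHRED_OFFSET)


def simplified_quality_string(read_length, score=30):
    """Build a quality line in which every base carries the same score."""
    return phred_to_char(score) * read_length


print(phred_to_char(30))             # ?
print(simplified_quality_string(8))  # ????????
```

A constant quality line like this compresses far better than per-base scores, which is in keeping with the size reduction described above.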

NIH’s Cloud Data Delivery Service: SRA Delivers Even More Big Data to your Cloud Bucket

The Sequence Read Archive (SRA) is the National Institutes of Health’s (NIH) primary repository for raw, high-throughput sequencing data, containing both controlled- and open-access datasets that continue to grow exponentially. SRA is managed by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and the data are available from NCBI’s servers as well as through cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (GCP). Cloud access was made possible by support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

Due to SRA’s exponential growth in size, data in the cloud environment are currently partitioned in hot and cold storage to keep SRA sustainable and accessible. Per industry standards, data in hot storage are immediately accessible; because hot storage is more expensive to host, we make efforts to align this distribution method with our most frequently requested datasets. The less frequently requested datasets are available in cold storage, which may not be immediately accessible. Fret not! SRA is constantly evolving to meet our users’ needs. NCBI’s Cloud Data Delivery Service (CDDS) now allows you to get public and controlled-access data delivered from cold and hot storage directly to your chosen cloud bucket in just a few hours. The minor cost is currently handled by NCBI but certain limits apply; within a 30-day request cycle, users are able to request up to 5TB from cold storage and 20TB from hot storage to their cloud bucket.
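
The request limits above amount to simple bookkeeping per storage tier. The following sketch shows one way to check a planned request against them; it assumes the limits are inclusive ("up to") and tracked separately for cold and hot storage within a cycle, and the function name is hypothetical rather than part of any NCBI tool.

```python
# Limits per 30-day request cycle, in terabytes, as stated in the post.
LIMITS_TB = {"cold": 5, "hot": 20}


def can_request(tier, size_tb, already_requested_tb=0.0):
    """Return True if a new request still fits within the tier's limit.

    Assumption: the limit is inclusive and each tier is counted on its own.
    """
    if tier not in LIMITS_TB:
        raise ValueError(f"unknown storage tier: {tier!r}")
    return already_requested_tb + size_tb <= LIMITS_TB[tier]


print(can_request("cold", 2, already_requested_tb=3))    # True
print(can_request("cold", 2.5, already_requested_tb=3))  # False
print(can_request("hot", 20))                            # True
```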

Learn the best way to find data in NIH’s Sequence Read Archive (SRA) on the cloud

NCBI will present a workshop at the American Society of Human Genetics (ASHG) as part of their conference activities in 2021. The workshop is scheduled for Wednesday, September 15, 2021.

Register now!

Adelaide Rhodes, Ph.D., from the Customer Experience team and Adam Stine, SRA Curator, will co-lead the workshop, which will introduce attendees to powerful metadata searches with BigQuery on Google Cloud Platform (GCP) and Athena on Amazon Web Services (AWS) to speed up analytic workflows using NIH’s Sequence Read Archive (SRA).

Cloud-based query services with expanded metadata options for SRA help researchers to find the target data more quickly than ever before. The workshop will be a mix of training in Structured Query Language (SQL), demos on the cloud console and hands-on exercises in Jupyter notebooks with examples to help researchers understand how to build searches in SQL. Researchers who attend this workshop will learn how to extract specific data sets as well as how to conduct exploratory analysis of the entirety of the SRA data available in the cloud.

Both BigQuery and Athena are queried with SQL, but no prior SQL experience is required. By the end of this workshop, you will know how to run cloud metadata queries using SQL to find SRA data based on parameters that are of interest to you.
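
For a feel of what such a metadata query looks like, the sketch below runs a small SQL filter locally with Python's built-in sqlite3 module instead of a cloud service. The table name, column names, and rows are invented stand-ins, not the actual SRA metadata schema, and placeholder syntax differs between SQLite, BigQuery, and Athena.

```python
import sqlite3

# Toy in-memory table standing in for a cloud metadata table (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata "
    "(acc TEXT, organism TEXT, assay_type TEXT, mbases INTEGER)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        ("RUN0001", "Homo sapiens", "RNA-Seq", 1200),
        ("RUN0002", "Homo sapiens", "WGS", 90000),
        ("RUN0003", "Mus musculus", "RNA-Seq", 800),
        ("RUN0004", "Homo sapiens", "RNA-Seq", 3100),
    ],
)

# Find runs for one organism and assay type, largest first.
query = """
    SELECT acc, mbases
    FROM metadata
    WHERE organism = ? AND assay_type = ?
    ORDER BY mbases DESC
"""
rows = conn.execute(query, ("Homo sapiens", "RNA-Seq")).fetchall()
print(rows)  # [('RUN0004', 3100), ('RUN0001', 1200)]
conn.close()
```

The same SELECT, WHERE, and ORDER BY pattern carries over to the cloud consoles; only the table reference and the way parameters are supplied change.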

Adam Stine, Ph.D., SRA Curator
Adelaide Rhodes, Ph.D., Customer Experience

Tackling Petabyte Scale Sequence Search Challenges

The volume of biological data being generated by the scientific community is growing exponentially, reflecting technological advances and research activities. This increase in available data has great promise for pushing scientific discovery but also introduces new challenges that scientific communities need to address. The National Institutes of Health’s (NIH) Sequence Read Archive (SRA), which is maintained by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), is a rapidly growing public database that researchers use to improve scientific discovery across all domains of life. As part of the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, over 36 petabytes of “next generation” (raw and SRA-formatted) sequencing data is accessible to anybody via two cloud service providers.

To help address the challenges of conducting large-scale analysis of -omic data in the SRA and similar databases, the Department of Energy (DOE) Office of Biological and Environmental Research (BER), the NIH Office of Data Science Strategy (ODSS), and NCBI held a virtual workshop on June 8, 2021, on Emerging Solutions in Petabyte Scale Sequence Search. The workshop brought together experts from DOE national labs, research institutions, and universities across the world.

SRA data growth over time. Databases like the NIH Sequence Read Archive are growing rapidly and are used extensively by scientific communities. As these databases grow, so does their potential scientific value, but work must be done to ensure ease of access.
