NCBI staff will present talks and a poster on accessing SARS-CoV-2 data at NCBI and in the cloud at the American Society for Virology 2021 virtual conference, July 19-23, 2021.
Tag: Sequence Read Archive (SRA)
We’re bringing exciting developments to our user community at the 2021 Galaxy Community Conference (GCC 2021), which is virtual this year!
We start by hosting NCBI’s first-ever GCC training week tutorial, co-written by Jon Trow, Ph.D., Subject Matter Expert for the Sequence Read Archive (SRA), and Adelaide Rhodes, Ph.D., Subject Matter Expert for Cloud. This tutorial will become a permanent addition to the Galaxy Training Network. The tutorial, “SRA Aligned Read Format (SARF) to Speed Up SARS-CoV-2 Data Analysis”, has detailed instructions and a video demonstration on how to search SRA metadata for SARFs and download them into Galaxy workflows. We will be available via Slack during Office Hours for live virtual interactions.
The NIH NCBI Sequence Read Archive (SRA) on AWS, containing all public SRA data, is now live! This data is hosted on Amazon Web Services (AWS) under the Open Data Sponsorship Program (ODP) with support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative.
We’ve just released a new version (1.6.0) of Magic-BLAST, the BLAST-based next-gen alignment tool, with these improvements:
- Usage reporting: you can help improve Magic-BLAST by sharing limited information about your search. The BLAST User Manual has details on the information collected, how it is used, and how to opt out.
- Magic-BLAST can access NCBI SRA next-gen reads from the cloud when you use the -sra_batch option. See the Magic-BLAST cookbook for more details.
- NCBI taxonomy IDs are reported in SAM output if they are present in the target BLAST database.
- You can get unaligned reads reported separately from the aligned ones by using the -out_unaligned <file name> option. You can also select the format (SAM, tabular, or FASTA) with the -unaligned_fmt option. The default format is the same as the one used for the main report.
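The options in the list above can be combined on one command line. The following is a minimal Python sketch that only assembles the argument list and does not run Magic-BLAST. The function name, the file names, and the database name are hypothetical. The lowercase format values and the use of -db and -out alongside the new options are assumptions to check against `magicblast -help` for your installed version.

```python
import shlex
from typing import List, Optional

# Assumed spellings of the SAM, tabular, and FASTA format values.
UNALIGNED_FORMATS = {"sam", "tabular", "fasta"}


def build_magicblast_command(
    accession_file: str,
    database: str,
    aligned_out: str,
    unaligned_out: Optional[str] = None,
    unaligned_fmt: Optional[str] = None,
) -> List[str]:
    """Assemble a magicblast argument list; nothing is executed here."""
    cmd = [
        "magicblast",
        "-sra_batch", accession_file,  # assumed: text file listing SRA accessions
        "-db", database,               # assumed: name of an existing BLAST database
        "-out", aligned_out,           # main report with the aligned reads
    ]
    if unaligned_out is not None:
        # Write unaligned reads to their own file.
        cmd += ["-out_unaligned", unaligned_out]
    if unaligned_fmt is not None:
        if unaligned_fmt not in UNALIGNED_FORMATS:
            raise ValueError(f"unsupported unaligned format: {unaligned_fmt}")
        cmd += ["-unaligned_fmt", unaligned_fmt]
    return cmd


if __name__ == "__main__":
    command = build_magicblast_command(
        "accessions.txt",        # hypothetical input file
        "my_reference_db",       # hypothetical database name
        "aligned.sam",
        unaligned_out="unaligned.fa",
        unaligned_fmt="fasta",
    )
    # Print a shell-quoted command line for inspection before running it.
    print(shlex.join(command))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting problems.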
Join us on May 19, 2021 at 12 PM Eastern Time to learn how to use the new RAPT pilot service to assemble and annotate public or private Illumina genomic reads sequenced from bacterial or archaeal isolates at the click of a button. RAPT consists of two major components, the genome assembler SKESA and the Prokaryotic Genome Annotation Pipeline (PGAP), and produces an annotated genome of quality comparable to RefSeq in a couple of hours.
- Date and time: Wed, May 19, 2021 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Webinars and Courses page.
The National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) and Amazon Web Services (AWS) are happy to announce that the controlled- and public-access Sequence Read Archive (SRA), one of the world’s largest repositories of raw next generation sequencing data, will be freely accessible from Amazon S3 via the Open Data Sponsorship Program (ODP) as of January 2021. The SRA is currently hosted by NLM at the National Institutes of Health (NIH).
While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.
Researchers can now access more than 13K SRA runs that include Coronaviridae (CoV) content identified by a kmer-based approach to organismal content identification using the SRA Taxonomy Analysis Tool.
NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next generation sequencing data from human, non-human and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.
Figure 1. SRA data has grown exponentially over the last decade.
NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. Considerations include a model wherein normalized working files without Base Quality Scores (BQS) are readily available through cloud platforms and NCBI FTP sites, and larger source files and normalized files with base quality scores will be distributed on cloud platforms based on prevalent use cases and usage demands. Further details regarding data formats are available here.
It is critical that as an SRA user, you participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments impact your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.
Join us on May 20th to learn how to use Google’s BigQuery to quickly search the data from the Sequence Read Archive (SRA) in the cloud to speed up your bioinformatic research and discovery projects. BigQuery is a tool for exploring cloud-based data tables with SQL-like queries. In this webinar, we’ll introduce you to using BigQuery to mine SRA submitter-supplied metadata and the results of taxonomic analysis for SRA runs. You’ll see real-world case studies that demonstrate how to find key information about SRA runs and identify data sets for your own analysis pipelines.
- Date and time: Wed, May 20, 2020 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
Now that the Sequence Read Archive (SRA) is publicly available on the cloud, you can harness the power of high-performance cloud computing to analyze all the data you wish without having to download a single byte. To help you programmatically find datasets of interest to you, we’ve loaded BigQuery with the SRA Metadata Table, which contains the descriptive information supplied at the time of sequence submission. Searches of the SRA Metadata Table depend on the quality and consistency of the metadata as submitted, so it can sometimes be a challenge to identify a complete and relevant set of suitable sequences. However, the Taxonomy Analysis Table can be a useful tool to overcome this challenge. Here’s why.
NCBI indexes SRA runs with one or more taxonomy terms when species-specific sequence k-mers are matched in the submitted sequences. The Taxonomy Analysis Table (tax_analysis) thus becomes a catalog of all taxonomic IDs detected in every run, based on the specificity and accuracy characteristics of these unique hashes sampled from reference genomes. We have now added the Taxonomy Analysis Table to BigQuery so you can filter hundreds of thousands of runs by this calculated taxonomic content to gather target datasets. Use this in conjunction with the BigQuery Taxonomy Table (which connects scientific names to taxonomic IDs) and link back to the BigQuery Metadata Table.
Explore/link to these four new tables in BigQuery:
- tax_analysis_info: a summary table for the results of the STAT tool
- tax_analysis: use the taxonomy analysis table to locate any number of runs based on kmer hits to a particular organism or branch in a taxonomic tree.
- taxonomy: NCBI Taxonomy database where you can locate the taxid based on organism names.
- kmer: contains kmers mapped to a particular organism and allows you to continue exploring organismal content further. You can leverage kmer tables in your downstream analysis by building custom kmer libraries.
Figure 1. SRA runs found using the taxonomy tables and BigQuery for taxid:694002, Betacoronavirus.
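A search like the one in the figure can be sketched as a SQL string built in Python. This is an illustration under stated assumptions, not NCBI's documented query: the fully qualified table names and the column names (acc, organism, tax_id, total_count) are assumptions to confirm against the table schemas in the BigQuery console, and the function name is hypothetical.

```python
# Assumed fully qualified table names; confirm them in the BigQuery console.
TAX_ANALYSIS_TABLE = "nih-sra-datastore.sra_tax_analysis_tool.tax_analysis"
METADATA_TABLE = "nih-sra-datastore.sra.metadata"


def build_taxid_query(tax_id: int, min_kmer_hits: int = 1, limit: int = 100) -> str:
    """Return SQL selecting runs whose k-mer hits include the given taxid."""
    if tax_id <= 0 or min_kmer_hits < 0 or limit <= 0:
        raise ValueError("tax_id and limit must be positive; min_kmer_hits >= 0")
    # Values are coerced to int so only numbers are interpolated into the SQL.
    return f"""
SELECT m.acc, m.organism, t.total_count
FROM `{TAX_ANALYSIS_TABLE}` AS t
JOIN `{METADATA_TABLE}` AS m ON m.acc = t.acc
WHERE t.tax_id = {int(tax_id)}
  AND t.total_count >= {int(min_kmer_hits)}
ORDER BY t.total_count DESC
LIMIT {int(limit)}
""".strip()


if __name__ == "__main__":
    # 694002 is the Betacoronavirus taxid used in the figure caption.
    print(build_taxid_query(694002, min_kmer_hits=10, limit=20))
```

The printed query can be pasted into the BigQuery console or passed to a client library. Joining on the run accession links the calculated taxonomic content back to the submitter-supplied metadata.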
Check out our summary page for additional information on taxonomic analysis.
We are actively working on new tools and ways to help you use the cloud to access and compute on SRA data. We are piloting this new feature in BigQuery, and plan to add this information to Amazon Web Services (AWS) Athena soon.
Contact us at email@example.com to let us know what you think!
If you need help getting started, refer to our tutorials and how-to video playlist on YouTube!