Tag: RNA-Seq

View GEO, SRA, or dbGaP data tracks in NCBI’s Genome Data Viewer

Did you know that you can see epigenomic or other experimental data in NCBI’s Genome Data Viewer (GDV)?

You can easily add aligned study results from GEO, SRA, and dbGaP as data tracks to the GDV browser view. Just go to the Tracks button on the toolbar and select the Configure Tracks menu option. Then navigate to the ‘Find Tracks’ tab on the pop-up Configure panel (Figure 1).

Figure 1. Go to the ‘Tracks’ menu on the browser toolbar and select ‘Configure Tracks’ option. This will launch a panel where you can add, configure, remove, and search for data tracks. Go to the ‘Find Tracks’ tab to search for tracks to add to your browser view. Note: spaces act as AND operators in the search, and wildcards are accepted.

Continue reading “View GEO, SRA, or dbGaP data tracks in NCBI’s Genome Data Viewer”

The Sequence Read Archive slims down your data with SRA Lite

In response to your requests for compact and faster-to-deliver data, NIH’s Sequence Read Archive (SRA) now offers a new data format: SRA Lite (Figure 1). SRA Lite supports reliable and faster data transfer, downloads, and analysis using current tools. SRA Lite replaces the submitted base quality score (BQS) with a simplified read quality score, reducing the average read size by ~60% for more efficient analysis and storage of large datasets. This format was designed to reflect improvements in next-generation sequencing that include increases in average read length and sequence coverage. Indeed, the data has improved enough that removing some quality scores increases genotype accuracy (PMCID: PMC4439189).

Figure 1. FASTQ dumped from SRA Lite format and the SRA configuration dialog. The FASTQ has the quality score for each base set to 30 (‘?’ in the ASCII encoding). Select “Prefer SRA Lite files with simplified base Quality scores” in the SRA configuration dialog to use SRA Lite.

Continue reading “The Sequence Read Archive slims down your data with SRA Lite”
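
The Figure 1 caption notes that a quality score of 30 appears as ‘?’ in the FASTQ. The minimal Python sketch below shows why, assuming the common Phred+33 FASTQ encoding (the offset of 33 is an assumption, since the post only says "the ASCII encoding"). The function names are hypothetical and are not part of any SRA tool.

```python
# Assumption: FASTQ qualities use the Phred+33 ASCII encoding.
PHRED_OFFSET = 33


def phred_to_char(quality: int) -> str:
    """Encode a Phred quality score as a single FASTQ quality character."""
    return chr(quality + PHRED_OFFSET)


def char_to_phred(char: str) -> int:
    """Decode a single FASTQ quality character back to a Phred score."""
    return ord(char) - PHRED_OFFSET


def error_probability(quality: int) -> float:
    """Convert a Phred score to the estimated probability of a wrong base call."""
    return 10 ** (-quality / 10)


def uniform_quality_string(read_length: int, quality: int = 30) -> str:
    """Build the quality line for a read whose bases all share one score."""
    return phred_to_char(quality) * read_length


if __name__ == "__main__":
    print(phred_to_char(30))          # the character used for quality 30
    print(uniform_quality_string(8))  # quality line for an 8-base read
    print(error_probability(30))      # roughly 1 error in 1,000 calls
```

Because every base carries the same score, the quality line is the same character repeated, which helps explain why the simplified format is so much smaller.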

Aug 18 Webinar: Finding Data for your Research Organism: Plants and RNA-Seq data

Join us on August 18, 2021 at 12 PM Eastern time for the second webinar on finding data for your non-model research organism. In this webinar, you will learn how to use NCBI’s web resources to get data for a plant species, the black cottonwood. You will see how to find, access, and analyze gene and sequence data from Datasets and other NCBI web resources, as well as sample metadata and gene expression RNA-Seq data from SRA and the SRA Run Selector. You will also see an example workflow, set up in a Jupyter notebook, that uses the NCBI next-gen aligner Magic-BLAST to get relative gene expression levels across samples.

  • Date and time: Wed, August 18, 2021 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Magic-BLAST version 1.6.0 is here!

We’ve just released a new version (1.6.0) of Magic-BLAST, the BLAST-based next-gen alignment tool, with these improvements:

  • Usage reporting — you can help improve Magic-BLAST by sharing limited information about your search. The BLAST User Manual has details on the information collected, how it is used, and how to opt-out.
  • Magic-BLAST can access NCBI SRA next-gen reads from the cloud when you use the -sra or -sra_batch options. See the Magic-BLAST cookbook for more details.
  • NCBI taxonomy IDs are reported in SAM output if they are present in the target BLAST database.
  • You can get unaligned reads reported separately from the aligned ones by using the -out_unaligned <file name> option. You can also select the format (SAM, tabular, or FASTA) with the -unaligned_fmt option. The default format is the same as the one for the main report.

The version 1.6.0 executables are available from the NCBI FTP site. See the release notes, the NCBI GitHub site, and the Magic-BLAST publication for more information.
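
As a rough illustration of how the options above fit together, the Python sketch below assembles a Magic-BLAST command line without running it. The -sra, -out_unaligned, and -unaligned_fmt options come from the list above. The -db and -out options, the lowercase format values, the helper name, the accession, and the file names are assumptions or placeholders, so check them against the Magic-BLAST documentation before use.

```python
import shlex

# Assumption: the format names are passed in lowercase on the command line.
UNALIGNED_FORMATS = {"sam", "tabular", "fasta"}


def build_magicblast_command(sra_accession, database, out_path,
                             unaligned_path, unaligned_fmt="fasta"):
    """Return the argument list for a Magic-BLAST run (hypothetical helper).

    The command reads an SRA run directly and writes unaligned reads
    to a separate file in the requested format.
    """
    if unaligned_fmt not in UNALIGNED_FORMATS:
        raise ValueError(f"unsupported unaligned format: {unaligned_fmt}")
    return [
        "magicblast",
        "-sra", sra_accession,             # read the run straight from SRA
        "-db", database,                   # assumed option: target BLAST database
        "-out", out_path,                  # assumed option: main report file
        "-out_unaligned", unaligned_path,  # unaligned reads go here
        "-unaligned_fmt", unaligned_fmt,   # format for the unaligned reads
    ]


if __name__ == "__main__":
    # Placeholder accession and file names, for illustration only.
    cmd = build_magicblast_command(
        "SRR0000000", "my_reference_db", "aligned.sam", "unaligned.fa"
    )
    print(shlex.join(cmd))  # inspect the command before running it yourself
```

Building the argument list first makes it easy to review the command, and the same list can be passed to subprocess.run once Magic-BLAST is installed.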

View intron feature evidence in the Genome Data Viewer and Sequence Viewer

Are you a researcher who works on gene biology and is interested in alternative splice patterns in your gene or genes of interest? If so, be sure to explore the intron feature evidence available in graphics views of genome assemblies annotated by NCBI. You can view the NCBI evidence used for calling splice variants for genes, add other intron feature evidence tracks, and use new display and filter options that make it easier to interpret the data.

Figure 1. Graphical view of the monoamine oxidase gene (MAOA, MAOB) region on the human X chromosome showing intron feature tracks (‘RNA-seq intron features, aggregate’ and ‘Intropolis RNA-Seq intron features’). Mousing over an intron feature activates a tooltip that shows details such as the number of reads with the splice site, the location on the chromosome, the length of the intron, and the donor and acceptor bases at the splice site. The Intropolis track was added through the search feature of the Configure Tracks menu and configured (bottom menu) so that the features are sorted by strand and filtered so that only features with more than 500 reads appear.

Continue reading “View intron feature evidence in the Genome Data Viewer and Sequence Viewer”

We want to hear from you about changes to NIH’s Sequence Read Archive data format and storage

NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next generation sequencing data from human, non-human and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.

Figure 1. SRA data has grown exponentially over the last decade.

NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. Considerations include a model wherein normalized working files without Base Quality Scores (BQS) are readily available through cloud platforms and NCBI FTP sites, and larger source files and normalized files with base quality scores will be distributed on cloud platforms based on prevalent use cases and usage demands. Further details regarding data formats are available here.

It is critical that as an SRA user, you participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments impact your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.

Continue reading “We want to hear from you about changes to NIH’s Sequence Read Archive data format and storage”

May 20 webinar: Exploring SRA metadata in the cloud with BigQuery

Join us on May 20th to learn how to use Google’s BigQuery to quickly search the data from the Sequence Read Archive (SRA) in the cloud to speed up your bioinformatic research and discovery projects. BigQuery is a tool for exploring cloud-based data tables with SQL-like queries. In this webinar, we’ll introduce you to using BigQuery to mine SRA submitter-supplied metadata and the results of taxonomic analysis for SRA runs. You’ll see real-world case studies that demonstrate how to find key information about SRA runs and identify data sets for your own analysis pipelines.

  • Date and time: Wed, May 20, 2020 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

April 8 Webinar: Accelerate genomics discovery with SRA in the cloud

On Wednesday, April 8, 2020 at 12 PM, NCBI staff will show you how to leverage the cloud to speed up your research and discovery. You’ll be introduced to new and existing tools and data including BigQuery, SRA Toolkit, and more. You’ll hear about real workflows in the cloud, featuring an example of the work NCBI was able to accomplish in the cloud using SRA data and a case study from an SRA cloud customer.

By the end of this webinar, you will know where to look for new cloud products from NCBI, access help information to get you started, and will see how to run your analyses efficiently in the cloud.

  • Date and time: Wed, Apr 8, 2020 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

The entire corpus of the Sequence Read Archive (SRA) now live on two cloud platforms!

The National Library of Medicine (NLM) is pleased to announce that all controlled-access and publicly available data in SRA is now available through Google Cloud Platform (GCP) and Amazon Web Services (AWS). To access the data please visit our SRA in the Cloud webpage where you will find links to our new SRA Toolkit and other access methods.

The SRA data available in the two clouds currently totals more than 14 petabytes and consists of all data in the SRA format as well as some data in its original submission format. Since May 2019, NCBI has been putting all submitted SRA data on the GCP and AWS clouds in both the submitted format and our converted SRA format. We have also been moving previously submitted original format data to the clouds and expect to complete that process in 2021.

Continue reading “The entire corpus of the Sequence Read Archive (SRA) now live on two cloud platforms!”

Computational Medicine Codeathon and AWS workshop at Chapel Hill in March

NIH is pleased to announce a computational medicine-focused codeathon. To apply, please complete the application form by February 25, 2020. We will also be offering a free workshop, AWS Technical Essentials, the day before the codeathon. Read on for more information about the event.

Continue reading “Computational Medicine Codeathon and AWS workshop at Chapel Hill in March”