Now that the Sequence Read Archive (SRA) is publicly available on the cloud, you can harness the power of high-performance cloud computing to analyze all the data you wish without having to download a single byte. To help you programmatically find datasets of interest to you, we’ve loaded BigQuery with the SRA Metadata Table, which contains the descriptive information supplied at the time of sequence submission. Searches of the SRA Metadata Table depend on the quality and consistency of the metadata as submitted, which means it can sometimes be a challenge to identify a complete and relevant set of suitable sequences. The Taxonomy Analysis Table can help overcome this challenge. Here’s why.
NCBI indexes SRA runs with one or more taxonomy terms when species-specific sequence k-mers are matched in the submitted sequences. The Taxonomy Analysis Table (tax_analysis) thus becomes a catalog of all taxonomic IDs detected in every run, based on the specificity and accuracy characteristics of these unique hashes sampled from reference genomes. We have now added the Taxonomy Analysis Table to BigQuery so you can filter hundreds of thousands of runs by this calculated taxonomic content to gather target datasets. Use this in conjunction with the BigQuery Taxonomy Table (which connects scientific names to taxonomic IDs) and link back to the BigQuery Metadata Table.
Explore/link to these four new tables in BigQuery:
tax_analysis_info: a summary table for the results of the STAT tool
tax_analysis: use the taxonomy analysis table to locate any number of runs based on kmer hits to a particular organism or branch in a taxonomic tree.
taxonomy: NCBI Taxonomy database where you can locate the taxid based on organism names.
kmer: contains kmers mapped to a particular organism and allows you to continue exploring organismal content further. You can leverage kmer tables in your downstream analysis by building custom kmer libraries.
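As a rough illustration of how these tables might be combined, the sketch below builds a BigQuery SQL string that selects runs with k-mer hits to a given taxid and joins them back to the metadata table. The fully qualified table names and the column names (acc, tax_id, total_count, organism) are assumptions made for this example, not confirmed schema, and build_runs_query is a hypothetical helper; check the table schemas in the BigQuery console before relying on it.

```python
def build_runs_query(
    tax_id: int,
    min_total_count: int = 1000,
    # Assumed table paths; adjust to the names shown in your BigQuery console.
    tax_table: str = "nih-sra-datastore.sra_tax_analysis_tool.tax_analysis",
    meta_table: str = "nih-sra-datastore.sra.metadata",
    limit: int = 100,
) -> str:
    """Return SQL that lists runs with hits to tax_id, joined to run metadata."""
    # Coerce to int so that only numeric values are interpolated into the SQL text.
    tax_id, min_total_count, limit = int(tax_id), int(min_total_count), int(limit)
    return (
        # Column names below are assumptions about the table schemas.
        "SELECT m.acc, m.organism, t.total_count\n"
        f"FROM `{tax_table}` AS t\n"
        f"JOIN `{meta_table}` AS m ON m.acc = t.acc\n"
        f"WHERE t.tax_id = {tax_id} AND t.total_count >= {min_total_count}\n"
        "ORDER BY t.total_count DESC\n"
        f"LIMIT {limit}"
    )


# Example: Betacoronavirus, taxid 694002.
sql = build_runs_query(694002)
print(sql)
```

The resulting string can be pasted into the BigQuery console, or submitted from Python with the google-cloud-bigquery client (for example, bigquery.Client().query(sql).result()), which requires a GCP project and credentials. The min_total_count threshold is an arbitrary illustrative cutoff for filtering out runs with only a handful of matching k-mers.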
Figure 1. SRA runs found using the taxonomy tables and BigQuery for taxid:694002, Betacoronavirus.
We are actively working on new tools and ways to help you use the cloud to access and compute on SRA data. We are piloting this new feature in BigQuery, and plan to add this information to Amazon Cloud’s (AWS) Athena soon.
We’ve added a new feature (Max 3′ match), shown in Figure 1, to Primer-BLAST that limits the length of 3′ exon matches when designing exon-exon spanning primers. This makes it less likely that primers specifically designed to amplify transcripts will also amplify genomic DNA contamination in expression assays.
Figure 1. The new “Max 3′ match” option that limits the size of the 3′ match for exon-exon junction primers. This option helps avoid primers that may also produce product from genomic DNA.
Are you interested in comparative genomics or other studies using Drosophila genomics?
Then don’t miss our online poster #568A at TAGC 2020 Online (no meeting registration required). Also, tune in to the online Q&A session on Monday, April 27 at 12:00 – 12:30 pm EDT.
What’s happening? In coordination with FlyBase, we are transitioning almost all of the RefSeq Drosophila assemblies to annotation produced primarily by NCBI’s eukaryotic genome annotation pipeline. We’ll continue to use the FlyBase annotation for Drosophila melanogaster (soon to be updated to Release 6.32), but we’ll annotate the other species using available RNA-seq datasets and our latest software. This will allow us to provide consistent, high-quality annotations across the full spectrum of Drosophila species, and also rapidly provide annotations as new high-quality assemblies become available. Another benefit is that these annotations will be available in the full suite of NCBI resources, including nucleotide, protein, BLAST, Gene, Genome Data Viewer, Genomes, Assembly, and more. You can download these annotation data from the NCBI genomes FTP site or you can try the new NCBI Datasets tool. By special request, we’re making orthology data relative to D. melanogaster available on the Gene FTP site, and plan to expose that data in our public pages in the future.
We have updated the collection of representative and reference assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We chose the 11,478 representative assemblies in the new collection from the 180,000+ prokaryotic assemblies in RefSeq today. We have selected one representative or reference assembly for every species based on several criteria, including contiguity, completeness, and whether the assembly is from type material. We have also updated the reference and representative microbial BLAST database to reflect these changes. This reference and representative set will be updated three times a year to reflect changes in RefSeq. In addition, as we announced on Feb 14, we have reduced the number of reference genome assemblies (the subset of representative assemblies with annotation provided by outside experts) to 15. See the list in our previous post. We have re-annotated the 104 assemblies that are no longer reference with our Prokaryotic Genome Annotation Pipeline (PGAP).
Figure 1. The SARS-CoV-2 submission landing page, where you can submit to GenBank or SRA. You can also view other resources related to SARS-CoV-2.
Quickly and easily add your SARS-CoV-2 sequence data to the growing public archive with new, special features and support from NCBI. Our new SARS-CoV-2 sequence submission landing page will help you get started. GenBank submissions are accessioned and released in approximately 1-2 working days, and Sequence Read Archive (SRA) submissions are typically processed and released within hours. Submission is simple!
On Wednesday, April 22, 2020 at 12 PM, join NCBI staff to learn how results from the Allele Frequency Aggregator (ALFA) project will help you interpret the biological impact of common and rare sequence variants. ALFA’s initial release includes analysis of genotype data from ~100K unrestricted dbGaP subjects and provides high-quality allele frequency data now displayed on relevant dbSNP records. In this webinar, you will learn about the data in the recent ALFA release, see how to access the data from the web and FTP, and see how to programmatically retrieve data by positions, genes, and other attributes using E-utilities and the Variation Services API in Python.
Date and time: Wed, Apr 22, 2020 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
NCBI’s Reference Sequence (RefSeq) FTP release numbers will increment to 200 for the next release and skip over the numbers 100-199. The current release, from March 2020, is release 99. The next bi-monthly release in May 2020 will be release 200. This change avoids overlap with the release numbers of the completely independent RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109, for example Mus musculus Annotation Release 108.
We recently announced that we made all of the Sequence Read Archive (SRA) publicly available on two cloud platforms. This archive of genetic sequences is a treasure trove of information and the cloud environments provide high-performance computing capabilities via a GCP or AWS account – right from your own device. High-throughput sequencing has made generating data extremely fast and inexpensive, which has fueled the rapid growth of SRA. Putting it on the cloud makes it possible to analyze “the high-throughput, unassembled sequence data, across all such sequences”.
So, what are some of the potential discoveries that await? To investigate some of the possibilities, we have held a series of codeathons to see if known and unknown viruses could be found lurking within SRA cloud datasets. Spoiler alert – they are! And just recently, a team from Stanford reported that they were able to identify a 2019-nCoV-like Coronavirus in pangolins by examining data sets identified via a meta-metagenomic search of SRA and downloaded using the SRA Toolkit. One challenge this team faced was downloading the datasets: 2.5TB corresponding to approximately 10^13 bases took over 48 hours to gather. How might cloud-based SRA tools have made this task easier and faster? Here’s how:
BigQuery: allows native, programmatic access to SRA metadata in the cloud, and searches based on that metadata.
SRA Toolkit: enables retrieval and reading of sequencing files from the SRA datasets in the cloud, and writing of files in the same format.
Coming soon to the cloud are tools for large-scale BLAST processing and a Read Alignment and Annotation Pipeline Tool (RAPT). These tools allow the data to be analyzed directly in the cloud, eliminating the need to download to local storage for analysis.
Also in the works is a mechanism to provide better access to taxonomic content of SRA runs as calculated by NCBI tools.
We are continually adding new functionality to better support your cloud workflows and are happy to help! Contact us at email@example.com if you have questions or need help getting started. If you need assistance setting up GCP or AWS, please follow the steps in our how-to videos on YouTube.