NCBI introduces Datasets, a new resource that lets you easily gather data from across NCBI databases. Our first release allows you to find and download genomic sequence and annotation data for all eukaryotic organisms through our user-friendly web interface.
Our web interface also provides an interactive taxonomy tree that lets you browse for your favorite organism. We are currently testing the web interface in the NCBI Labs environment. To try it out, enter a taxonomic name or assembly accession and click on the ‘Get Data’ button in the search results panel.
We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.
We will reduce the number of Reference assemblies to 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose reference status.
We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.
We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results to the genomes/refseq directory on the genomes FTP area for the more than 580 species that we annotate. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.
Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.
These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
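For readers who script against the FTP area, here is a minimal sketch of how an assembly accession such as GCF_000004195.4 can be split into its parts. The function names are hypothetical, and the directory layout it builds (three-digit groups under genomes/all) is an assumption that this post does not describe, so check it against the FTP site before relying on it.

```python
import re

# Pattern for assembly accessions such as GCF_000004195.4:
# a GCA/GCF prefix, nine digits, and a version number.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Split an assembly accession into (prefix, digits, version)."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, digits, version = match.groups()
    return prefix, digits, int(version)


def assumed_ftp_parent_dir(accession):
    """Build the assumed parent directory for an assembly.

    Assumption (not stated in the post): the nine digits are split
    into three groups of three under genomes/all/<prefix>/.
    """
    prefix, digits, _version = parse_assembly_accession(accession)
    chunks = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join(["genomes", "all", prefix, *chunks])


print(assumed_ftp_parent_dir("GCF_000004195.4"))
```

The parser rejects anything that is not a versioned GCA or GCF accession, which catches typos before a script tries to fetch a path that cannot exist.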
We’re constantly making improvements to the NCBI genome Assembly resource. This post points out some recent advances, highlighted in Figure 1 and described in more detail below.
Figure 1. New improvements to the Assembly web pages. The results page showing the surveillance project filter (lower left), which excludes 28,220 Klebsiella pneumoniae assemblies from the Pathogen Detection Project, and the Download Assemblies button with a link to the File type description (circled in red, upper right). For other improvements in the Download Assemblies menu see our recent post.
In late May, we introduced a new type of search experience in NCBI Labs that uses natural language queries to make common tasks easier. The experience at NCBI Labs – where we experiment with potential new features and tools – proved successful. We’re pleased to announce that we added this simplified search capability to NCBI’s global search page. Some natural language queries now work in the “All Databases” search from the NCBI home page!
As of March 2018, there were 141,000 prokaryotic genomes in the Assembly database. As this database grows, misassigned prokaryotic genomes become a serious problem. Taxonomy misassignment can occur through simple submission error or can accumulate as new information adds greater specificity to the taxonomic tree.
A paper in the International Journal of Systematic and Evolutionary Microbiology presents the method NCBI scientists used to verify taxonomic identities in prokaryotic genomes. The authors used an Average Nucleotide Identity method with optimum threshold ranges for prokaryotic taxa to review all prokaryotic genome assemblies in GenBank. This method relies on Type strain information and is one outcome of a 2015 workshop involving several important parties in the bacteriology community.
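To make the idea concrete, here is a rough sketch of an ANI-style check. It is not the method from the paper: the length-weighted average, the function names, the fragment values and the 96.0 threshold are all illustrative assumptions, and a real threshold would depend on the taxon.

```python
def weighted_ani(fragments):
    """Length-weighted mean percent identity.

    fragments: iterable of (aligned_length, percent_identity) pairs
    from aligning a query genome to a type-strain assembly.
    """
    total_length = 0
    weighted_sum = 0.0
    for aligned_length, percent_identity in fragments:
        total_length += aligned_length
        weighted_sum += aligned_length * percent_identity
    if total_length == 0:
        raise ValueError("no aligned bases")
    return weighted_sum / total_length


def supports_declared_species(ani, threshold):
    """True when the ANI to the declared species' type strain meets the threshold."""
    return ani >= threshold


# Hypothetical alignment fragments; the values are made up for illustration.
fragments = [(1000, 98.0), (3000, 97.0), (1000, 95.0)]
ani = weighted_ani(fragments)

# 96.0 is a placeholder threshold, not a value taken from the paper.
print(round(ani, 2), supports_declared_species(ani, threshold=96.0))
```

A genome whose ANI to the type strain of its declared species falls below the threshold would be flagged for review rather than reassigned automatically.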
We know it’s not always easy to find the sequence data you’re after at NCBI. Maybe it’s because you’re no expert at constructing queries, and you end up with no results or too many results. Or maybe you’re an Entrez wizard, but creating a query full of Booleans and filters seems like overkill when you could just write a short natural language query, like you’re used to doing in Google. The next time you search for a gene, transcript or genome assembly for a given organism, try the new search experience we’re piloting in NCBI Labs.
In NCBI Labs, you can now search for sequences using natural language and get the best results.
Figure 1. The new interface for specified transcript search.
The improved search experience now available in NCBI Labs addresses 3 types of queries that commonly fail in searches at NCBI: organism-gene (e.g. human BRCA1), organism-transcript (e.g. Mouse p53 transcripts) and organism-assembly (e.g. dog reference genome). For each of these query types in NCBI Labs, we now return NCBI’s highest quality sequence sets or reference and representative assemblies in an easy-to-view panel.
Example queries are shown below to get you started.
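The three query types described above can be illustrated with a toy keyword classifier. This is only a sketch for intuition: the word lists and the function name are hypothetical, and it says nothing about how the NCBI Labs search actually interprets queries.

```python
# Hypothetical keyword lists used only for this illustration.
ASSEMBLY_WORDS = {"genome", "genomes", "assembly", "assemblies"}
TRANSCRIPT_WORDS = {"transcript", "transcripts"}


def classify_query(query):
    """Guess which of the three query types a short query resembles."""
    words = set(query.lower().split())
    if not words:
        raise ValueError("empty query")
    if words & ASSEMBLY_WORDS:
        return "organism-assembly"
    if words & TRANSCRIPT_WORDS:
        return "organism-transcript"
    # Fall back to organism-gene when no other keyword is present.
    return "organism-gene"


for example in ("human BRCA1", "Mouse p53 transcripts", "dog reference genome"):
    print(example, "->", classify_query(example))
```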
On Wednesday, November 1, 2017, we will present a webinar on GDV, NCBI’s full-featured genome browser. In this webinar, you’ll learn how to explore and analyze sequences and annotations for eukaryotic RefSeq genome assemblies. We’ll show you how to:
Search across the entire assembly for genes, products and other markers, or jump to a specific position or range.
Display any of seven preselected track sets highlighting various aspects of the assembly, or create and load your own custom track sets from your NCBI account.
Load and display submitted alignment data from NCBI’s GEO or SRA.
Upload your own annotation and variant data.
Display BLAST or Primer-BLAST results on the assembly in the browser.
Date and time: Wednesday, November 1, 2017 12:00-12:30PM EDT
A new “Download assemblies” button is now available in the Assembly database. This makes it easy to download data for multiple genomes without having to write scripts.
For example, you can run a search in Assembly and use check boxes (see left side of screenshot below) to refine the set of genome assemblies of interest. Then, just open the “Download assemblies” menu, choose the source database (GenBank or RefSeq), choose the file type, and start the download. An archive file will be saved to your computer that can be expanded into a folder containing your selected genome data files.