Join us on June 30, 2021 at 12 PM Eastern Time to learn how to use the new NCBI Datasets resource to find and download gene, genome, and SARS-CoV-2 sequence and annotation data. You will learn how to access these datasets through either the web interface or the new command-line tools, which let you incorporate the data into your bioinformatics workflows.
Date and time: Wed, June 30, 2021 12:00 PM – 12:45 PM EDT
NCBI Datasets, the new set of services for downloading genome assembly and annotation data (previous Datasets posts), has redesigned and reorganized web pages to make it easier to find and access the services and documentation you need.
You can now get gene ortholog data with the NCBI Datasets command-line tool by specifying a gene ID, gene symbol, or RefSeq nucleotide or protein accession. Data are available for vertebrates and insects. The vertebrate orthologs include a specialized set for fish. (See our recent post for more information on the orthologs for fish and insects.)
You can retrieve metadata for gene orthologs in JSON format, or you can download a compressed (zip) archive containing both metadata and sequences (Figure 1).
Figure 1. Command lines that use a gene symbol (BRCA1) to retrieve mammalian ortholog metadata (top, JSON metadata shown in part in the image) and sequences (bottom).
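The figure itself is not reproduced here, so the following is only a rough sketch of what such command lines might look like when driven from Python. It assumes a recent release of the datasets command-line tool in which ortholog sets are requested with an --ortholog option on the gene subcommands; the syntax has changed between releases, so check datasets --help for your version. The function names and the default zip file name are hypothetical.

```python
import subprocess


def ortholog_commands(symbol, ortholog_set, zip_name="orthologs.zip"):
    """Build argument lists for an ortholog metadata query and a download.

    Assumption: the installed datasets CLI accepts `--ortholog` on the
    `summary gene` and `download gene` subcommands. Older releases used
    different syntax.
    """
    summary_cmd = [
        "datasets", "summary", "gene", "symbol", symbol,
        "--ortholog", ortholog_set,
    ]
    download_cmd = [
        "datasets", "download", "gene", "symbol", symbol,
        "--ortholog", ortholog_set,
        "--filename", zip_name,
    ]
    return summary_cmd, download_cmd


def run_summary(symbol, ortholog_set):
    """Run the metadata query and return the tool's raw stdout (JSON text)."""
    summary_cmd, _ = ortholog_commands(symbol, ortholog_set)
    result = subprocess.run(
        summary_cmd, check=True, capture_output=True, text=True
    )
    return result.stdout
```

Calling ortholog_commands("BRCA1", "mammals") returns the two argument lists without running anything, which makes it easy to inspect or log the commands before handing them to subprocess.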
In March, we announced NCBI Datasets, a new resource that lets you easily retrieve and download data from across NCBI databases. Did you know you can now fetch NCBI Gene data programmatically using the NCBI Datasets API or command-line tool? You can quickly retrieve both metadata and gene sequence data for multiple Gene records, including transcripts and proteins, in one shell command or API request. The API documentation is a good way to get started with programmatic access (Figure 1).
Figure 1. The Datasets API documentation showing a demonstration retrieving Gene metadata using RefSeq mRNA accessions. The API returns a readily processed JSON object.
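As a minimal sketch of the kind of request the caption describes, the Python below builds a Gene metadata URL from RefSeq accessions and parses the JSON response. The base URL, the API version, and the gene/accession path are assumptions based on the public Datasets REST API and may not match the version shown in the figure; the function names are hypothetical. Check the API documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumption: v2 of the Datasets REST API; earlier posts may refer to other versions.
API_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def gene_accession_url(accessions):
    """Return the request URL for a list of RefSeq accessions."""
    joined = ",".join(accessions)
    # Keep commas literal so several accessions fit in one path segment.
    return f"{API_BASE}/gene/accession/{urllib.parse.quote(joined, safe=',')}"


def fetch_gene_metadata(accessions, timeout=30):
    """Send one GET request and return the decoded JSON object."""
    request = urllib.request.Request(
        gene_accession_url(accessions),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Because the layout of the returned object depends on the API version, the sketch returns it unchanged rather than assuming particular field names.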
NCBI Datasets now offers Gene tables: customizable tables of the genes you specify, with key gene information, and the ability to easily download a dataset of genomic, transcript and protein sequences.
Drag and drop a list of Gene IDs or gene symbols, and the data table shows your genes with up to 15 columns of metadata, including genomic coordinates, RefSeq transcript and protein accessions, Ensembl IDs and UniProt accessions, and other gene information. You can browse and select items in your table on the web, or download everything to your computer for later analysis (Figure 1).
Do you need to download a lot of genomic data? Maybe you need all primate reference genomes, or maybe you need just a few really big genomes? Before NCBI Datasets, downloading that much data could be a frustrating and time-consuming experience involving failed downloads and custom scripts.
NCBI Datasets makes large genome downloads simpler, faster, and more reliable. You don’t have to write a script. You can be sure you get all the data requested. And sharing the data is easier than ever. Figure 1 shows an example data download process using Datasets.
Figure 1. Downloading and processing genomic data using NCBI Datasets. The example shows downloading the set of RefSeq primate assemblies through the Datasets web interface. Since the downloaded files would exceed 15GB, the file comes as a “dehydrated bag” — a small, easily downloaded, zipped file with metadata and links to download the data. You can “rehydrate” the unzipped dehydrated files — fill them with the corresponding data — using the datasets command-line tool.
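The caption describes the web interface, but the same dehydrate, unzip, and rehydrate sequence can be sketched with the command-line tool. The Python below is an illustration under assumptions: it assumes the datasets tool is on your PATH and supports the --dehydrated and --filename options and the rehydrate --directory command. The function names, file names, and directory layout are hypothetical, and any RefSeq-only filtering is left out because those options vary by release.

```python
import subprocess
import zipfile
from pathlib import Path


def dehydrated_download_cmd(taxon, zip_path):
    """Command to fetch a small "dehydrated" bag for a taxon."""
    return [
        "datasets", "download", "genome", "taxon", taxon,
        "--dehydrated", "--filename", str(zip_path),
    ]


def rehydrate_cmd(directory):
    """Command to fill an unzipped bag with the actual data files."""
    return ["datasets", "rehydrate", "--directory", str(directory)]


def unzip_bag(zip_path, directory):
    """Extract the dehydrated zip archive into `directory`."""
    with zipfile.ZipFile(zip_path) as bag:
        bag.extractall(directory)
    return Path(directory)


def download_and_rehydrate(taxon, workdir):
    """Run all three steps; raises CalledProcessError if the tool fails."""
    workdir = Path(workdir)
    zip_path = workdir / "bag.zip"   # hypothetical file name
    bag_dir = workdir / "bag"        # hypothetical directory name
    subprocess.run(dehydrated_download_cmd(taxon, zip_path), check=True)
    unzip_bag(zip_path, bag_dir)
    subprocess.run(rehydrate_cmd(bag_dir), check=True)
    return bag_dir
```

For example, download_and_rehydrate("primates", "downloads") would run the full sequence, provided the downloads directory already exists.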
We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.
Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.
These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
You can now download new file types for species recently annotated by the NCBI Eukaryotic Genome Annotation Pipeline from the Assembly web pages and from the genomes/refseq FTP area. The new file types include alignments of annotated transcripts to the assembly in BAM format, all models predicted by Gnomon, and (for species that have been annotated multiple times) files characterizing the feature-by-feature differences between the current and the previous annotation.
If you download data from the SRA (Sequence Read Archive) FTP site, we encourage you to try the SRA Toolkit. This is especially important if you use the SRA Fuse/FTP site at ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant, which the SRA team will decommission on December 1, 2019.
The SRA Toolkit offers several advantages for downloading SRA data, including greater flexibility in specifying the data you need as well as access to public SRA data in the cloud. If you’re new to the Toolkit, you may want to start with these instructions.
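As a rough illustration of a Toolkit-based download, the Python sketch below chains two SRA Toolkit programs, prefetch and fasterq-dump, for a single run accession. It assumes both programs are installed and on your PATH; the function names and the output directory are hypothetical, and the linked instructions remain the authoritative guide.

```python
import subprocess


def sra_commands(accession, outdir="fastq"):
    """Build the two Toolkit commands for one run accession.

    prefetch downloads the run; fasterq-dump converts it to FASTQ files
    written under `outdir`.
    """
    prefetch_cmd = ["prefetch", accession]
    dump_cmd = ["fasterq-dump", accession, "--outdir", outdir]
    return prefetch_cmd, dump_cmd


def download_run(accession, outdir="fastq"):
    """Run both steps in order, stopping at the first failure."""
    for cmd in sra_commands(accession, outdir):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution lets you print or review the commands for a list of accessions before starting a long download.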
If you have any questions or concerns about downloading SRA data, please contact firstname.lastname@example.org. We’d love to hear from you!