Easily download large amounts of genomic data with NCBI Datasets

Do you need to download a lot of genomic data? Maybe you need all primate reference genomes, or maybe you need just a few really big genomes? Before NCBI Datasets, downloading that much data could be a frustrating and time-consuming experience involving failed downloads and custom scripts.

NCBI Datasets makes large genome downloads simpler, faster, and more reliable. You don’t have to write a script, you can be sure you get all the data you requested, and sharing the data is easier than ever. Figure 1 shows an example download process using Datasets.

Figure 1. Downloading and processing genomic data using NCBI Datasets. The example shows downloading the set of RefSeq primate assemblies through the Datasets web interface. Because the download would exceed 15 GB, it arrives as a “dehydrated bag”: a small, easily downloaded zip file containing metadata and links to the data. After unzipping it, you can “rehydrate” the dehydrated files (fill them in with the corresponding data) using the datasets command-line tool.

Introducing download dehydration / rehydration

If your download exceeds 15 GB, you will receive a compact “dehydrated” file: a compressed (.zip) file with genome metadata and links to your sequence and annotation data. You can then use our datasets command-line tool to “rehydrate” your unzipped download and retrieve the data files when it’s convenient for you. You can download a dehydrated file through our website, our command-line tool, our Jupyter notebooks, or our API. For more information on downloading and rehydrating, see our help documentation.
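As a rough illustration of the dehydrate, unzip, and rehydrate steps described above, here is a minimal Python sketch that drives the datasets command-line tool. It assumes the tool is installed and on your PATH, and that it accepts the `download genome taxon ... --dehydrated --filename` and `rehydrate --directory` forms; option names can differ between versions of the tool, so check `datasets --help` before relying on them. The function names, the `dehydrated.zip` file name, and the directory layout are hypothetical choices made for this example.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path
from typing import List


def download_cmd(taxon: str, zip_path: str) -> List[str]:
    """Build the command that requests a dehydrated genome package for a taxon.

    Assumption: the installed datasets tool supports these options.
    """
    return ["datasets", "download", "genome", "taxon", taxon,
            "--dehydrated", "--filename", zip_path]


def rehydrate_cmd(directory: str) -> List[str]:
    """Build the command that fills an unzipped dehydrated package with data."""
    return ["datasets", "rehydrate", "--directory", directory]


def fetch_genomes(taxon: str, workdir: str) -> None:
    """Download a dehydrated package, unzip it, then rehydrate it in place."""
    if shutil.which("datasets") is None:
        raise RuntimeError("the datasets command-line tool was not found on PATH")

    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    zip_path = out / "dehydrated.zip"  # hypothetical file name

    # Step 1: fetch the small dehydrated zip (metadata plus links to the data).
    subprocess.run(download_cmd(taxon, str(zip_path)), check=True)

    # Step 2: unzip it; rehydration works on the unzipped directory.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)

    # Step 3: rehydrate, which retrieves the actual data files from NCBI.
    subprocess.run(rehydrate_cmd(str(out)), check=True)
```

Calling `fetch_genomes("primates", "primates_genomes")` would run all three steps in order. Because the second and third steps only need the zip file, a colleague who receives the dehydrated file by email could skip the first step and start from the unzip.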

If you need to share data with a colleague, just email them the dehydrated file. When they’re ready to get the data files, they can rehydrate and get the data from NCBI.

Please try it out and let us know what you think!
