You can now retrieve genome data using the NCBI Datasets command-line tool and API by simply providing a BioProject accession. You can go directly from a BioProject accession to genome data even when the BioProject accession is the parent of multiple BioProjects (Figure 1).
Figure 1. Command-lines using BioProject accessions with the datasets command-line tool and sample metadata. Top panel: command-line for downloading genome metadata for the Sanger 25 Genomes Project (PRJEB33226). Middle panel: a portion of the metadata in JSON format for the 25 Genomes Project. Bottom panel: command-line for downloading sequence data and annotation metadata for a component BioProject for the king scallop (PRJEB35331).
For example, you can use the Sanger 25 Genomes Project BioProject accession, PRJEB33226, to get genome metadata for all assemblies in the project, including NCBI annotation information, using the following datasets command-line:
datasets summary genome accession PRJEB33226 > sanger-25-metadata.json
The metadata in JSON format (Figure 1, middle panel) includes genome assembly statistics such as genome size, scaffold N50, and contig N50, as well as submission date, chromosome information, taxonomic information, and annotation metadata.
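If you want a quick look at what the saved metadata file contains before digging into individual fields, a short Python sketch like the one below may help. It makes no assumptions about the field names in the report, since those can differ between versions of the tool; it only lists whatever top-level keys are present. The function names `load_report` and `describe_top_level` are hypothetical helpers written for this illustration and are not part of the datasets tool.

```python
import json
from typing import Any


def load_report(path: str) -> Any:
    """Parse a JSON file such as sanger-25-metadata.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def describe_top_level(report: Any) -> dict[str, str]:
    """Map each top-level key to a short description of its value.

    No field names are assumed; the function reports whatever it finds.
    """
    if not isinstance(report, dict):
        # The document root is not a JSON object, so just report its type.
        return {"<root>": type(report).__name__}
    summary = {}
    for key, value in report.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            summary[key] = repr(value)
    return summary
```

For example, `describe_top_level(load_report("sanger-25-metadata.json"))` returns a dictionary you can print to see which sections the report contains and how many entries each one holds.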
You can also use a BioProject accession to download genome data, including transcript and protein sequences, annotation, and metadata. The following command retrieves data for both the GenBank and RefSeq genome assemblies for the king scallop, Pecten maximus (BioProject: PRJEB35331), one of the organisms included in the 25 Genomes Project.
datasets download genome accession PRJEB35331 --filename kingscallop.zip
The data downloads as a zip archive that contains genome, transcript, and protein sequences in FASTA format, and genome annotation data in GFF3 format. Metadata reports, in JSON Lines format, include a sequence report for each genome assembly (GCA_902652895.1, GCA_902652985.1, and GCF_902652985.1) listing the component genome sequences, and a combined assembly data report with detailed metadata on the three genomes included in your download.
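To check what arrived in the archive without unpacking it, something along these lines should work using only the Python standard library. This is a sketch, not an official workflow: `group_archive_members` and `read_jsonl_member` are hypothetical helper names, and the code does not assume any particular directory layout or file names inside the zip. You pass in the member name yourself after looking at the listing.

```python
import json
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_archive_members(archive) -> dict[str, list[str]]:
    """Group the file names in a zip archive by file extension."""
    groups = defaultdict(list)
    with zipfile.ZipFile(archive) as bundle:
        for name in bundle.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            suffix = PurePosixPath(name).suffix or "(no extension)"
            groups[suffix].append(name)
    return {suffix: sorted(names) for suffix, names in groups.items()}


def read_jsonl_member(archive, member: str) -> list:
    """Parse one JSON Lines file inside the archive, one JSON value per line."""
    with zipfile.ZipFile(archive) as bundle, bundle.open(member) as handle:
        # Blank lines are skipped; every other line must be valid JSON.
        return [json.loads(line) for line in handle if line.strip()]
```

Calling `group_archive_members("kingscallop.zip")` gives a listing grouped by extension, so you can spot the sequence, annotation, and report files; you can then pass the path of any JSON Lines report from that listing to `read_jsonl_member` to load its records.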
Please try out Datasets command-line access by BioProject accession and let us know what you think!