We’ve been making improvements to the contents of NCBI’s genomes FTP site. Highlights include:
- addition of new file types, including a feature_count.txt file with counts of gene, RNA, and CDS features of specific types and a translated_cds.faa file with conceptual translations of each CDS feature on the genome
- improvements to the Sequence Ontology feature types used in GFF3, including identification of pseudogene gene features as “pseudogene” instead of “gene” in column 3
- improvements to the gene_biotype calculation to categorize transcribed pseudogenes as transcribed_pseudogene instead of misc_RNA
- addition of the #!annotation-source unofficial pragma to GFF3 files with the annotation name, for assemblies where that information is available
- addition of an FTP directory for GenBank viral genomes that includes International Committee on Taxonomy of Viruses (ICTV) species exemplar virus genomes and a growing number of NCBI viral neighbor genomes
- expansion of the UCSC sequence name mapping in the assembly report files, which now maps between GenBank or RefSeq sequence accessions, chromosome or scaffold names, and the UCSC sequence name for most of the recent assemblies in the UCSC Genome Browser
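As a rough illustration of how the expanded mapping might be used, here is a minimal Python sketch that builds an accession-to-UCSC-name lookup from an assembly report. It assumes the report is tab-delimited, that the last `#` comment line before the data holds the column names, and that those names include `GenBank-Accn`, `RefSeq-Accn`, and `UCSC-style-name`, with `na` marking missing values. The function names and the sample rows are made up for the example and are not taken from a real assembly.

```python
def parse_assembly_report(text):
    """Return one dict per sequence row, keyed by column name.

    Assumes the last '#' comment line before the data rows is the
    tab-delimited column header.
    """
    header = None
    rows = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Keep overwriting: the final comment line wins as the header.
            header = line.lstrip("# ").rstrip().split("\t")
            continue
        if not line.strip() or header is None:
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def ucsc_name_map(rows, key="RefSeq-Accn"):
    """Map accession -> UCSC name, skipping rows where either is 'na'."""
    mapping = {}
    for row in rows:
        accession = row.get(key, "na")
        ucsc_name = row.get("UCSC-style-name", "na")
        if accession != "na" and ucsc_name != "na":
            mapping[accession] = ucsc_name
    return mapping


# Hypothetical sample data; accessions and names are placeholders.
SAMPLE_REPORT = (
    "# Assembly name:  ExampleAsm1\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "chr1\tassembled-molecule\t1\tChromosome\tXX000001.1\t=\t"
    "NC_999991.1\tPrimary Assembly\t1000\tchr1\n"
    "scaf7\tunplaced-scaffold\tna\tna\tXX000002.1\t=\t"
    "na\tPrimary Assembly\t500\tchrUn_XX000002v1\n"
)

rows = parse_assembly_report(SAMPLE_REPORT)
refseq_to_ucsc = ucsc_name_map(rows, key="RefSeq-Accn")
genbank_to_ucsc = ucsc_name_map(rows, key="GenBank-Accn")
```

Reading the column names from the header line, rather than hard-coding column positions, keeps the sketch working if columns are added or reordered.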
The FTP files for all latest assemblies reported in the /genomes/genbank/, /genomes/refseq/ and /genomes/all/ FTP directories have been generated with these improvements.
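The GFF3 changes listed above can be checked with a few lines of Python. The sketch below reads the `#!annotation-source` pragma, if present, and tallies the feature types in column 3, so that pseudogene features show up separately from gene features. It assumes an uncompressed GFF3 file with nine tab-separated columns per feature line; the function name, the pragma value, and the sample records are invented for illustration.

```python
from collections import Counter

PRAGMA = "#!annotation-source"


def summarize_gff3(lines):
    """Return (annotation_source, Counter of column-3 feature types)."""
    annotation_source = None
    type_counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(PRAGMA):
            # Everything after the pragma keyword is the annotation name.
            annotation_source = line[len(PRAGMA):].strip()
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and other directives
        cols = line.split("\t")
        if len(cols) == 9:
            type_counts[cols[2]] += 1
    return annotation_source, type_counts


# Hypothetical sample records; IDs and the annotation name are placeholders.
SAMPLE_GFF3 = [
    "##gff-version 3\n",
    "#!annotation-source Example Annotation Release 100\n",
    "seq1\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene0;gene_biotype=protein_coding\n",
    "seq1\tRefSeq\tCDS\t150\t850\t.\t+\t0\tID=cds0;Parent=gene0\n",
    "seq1\tRefSeq\tpseudogene\t1200\t1800\t.\t-\t.\tID=gene1;gene_biotype=pseudogene\n",
]

source, counts = summarize_gff3(SAMPLE_GFF3)
```

To run this against a real file, pass an open file handle instead of the sample list, for example `summarize_gff3(open(path))`, or a handle from `gzip.open(path, "rt")` for a compressed file.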
The contents provided for eukaryotes annotated by NCBI under ftp://ftp.ncbi.nih.gov/genomes/<genus_species> will be incorporated into the /genomes/refseq/ path. Stay tuned for more announcements on this change.
Subscribe to the genomes-announce mailing list to be informed of other changes to the NCBI genomes FTP site.