New download files and FTP directories for genome assemblies


You can now download new file types for species recently annotated by the NCBI Eukaryotic Genome Annotation Pipeline from the Assembly web pages and from the genomes/refseq FTP area. The new file types include alignments of annotated transcripts to the assembly in BAM format, all models predicted by Gnomon, and, for species that have been annotated multiple times, files characterizing the feature-by-feature differences between the current and the previous annotation.

Continue reading

December 11 Webinar: Running the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) on your own data


On Wednesday, December 11, 2019 at 12 PM, NCBI staff will present a webinar that will show you how to use NCBI’s PGAP (https://github.com/ncbi/pgap) on your own data to predict genes on bacterial and archaeal genomes using the same inputs and applications used inside NCBI. You can run PGAP on your own machine, on a compute farm, or in the Cloud. Plus, you can now submit genome sequences annotated by your copy of PGAP to GenBank. Attend the webinar to learn more!

  • Date and time: Wed, Dec 11, 2019 12:00 PM – 12:45 PM EST
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

New release of the Prokaryotic Genome Annotation Pipeline with updated tRNAscan and protein models


A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is now available on GitHub. This release uses a new and improved version of tRNAscan (tRNAscan-SE:2.0.4) and includes our most up-to-date Hidden Markov Model and BlastRule collections for naming proteins.

Remember that you can submit the results of PGAP to GenBank. Or, if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors mode to get a preliminary annotation.

See our previous post and our documentation for details on how to set up and run PGAP yourself.

Try PGAP and let us know how you like it!

GenBank submitters, is your genome assembly within the expected size range?


Validation issues can delay the processing of your submissions to GenBank. To avoid one type of delay, use the new “expected genome size” API to check the length of your genome assembly before submission.

The API compares the size of submitted genome assemblies to the expected genome size range for the species to identify outliers that can result from errors such as:

  • incorrect organism assignment
  • metagenome submitted as an organism genome
  • targeted sub-genome assembly not flagged as partial genome representation
  • gross contamination with other sequences

You can check in advance for these possible problems using the API. It takes two inputs: the taxid for the species (taxid = Taxonomy ID; see our Taxonomy quick start guide on how to find the taxid for a given species) and the length of your assembly, excluding gaps and runs of Ns. It returns XML with the expected length, the acceptable range, and a status that tells you whether your assembly is too large, too small, or within the acceptable range. Look for <length_status>within_range</length_status>, which confirms that your sequence passes the test!

Try the following examples:

https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=1773&length=4.41M
https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=562&length=7221235
https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=5476&length=5.72M
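If you would rather run this check from a script than from a browser, the same requests can be sketched in Python. This is only a minimal sketch using the standard library. The endpoint, the species_taxid and length parameters, and the length_status element come from the examples and description above. The function names are hypothetical, and the code assumes the XML response uses no XML namespace.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Endpoint taken from the example URLs in this post.
BASE_URL = "https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size"


def build_url(species_taxid, length):
    """Build a request URL; length may be an integer or a string such as '4.41M'."""
    query = urlencode({"species_taxid": species_taxid, "length": length})
    return BASE_URL + "?" + query


def extract_length_status(xml_text):
    """Return the text of the first <length_status> element, or None if absent.

    Assumption: the response carries no XML namespace, so the bare tag name matches.
    """
    root = ET.fromstring(xml_text)
    node = next(root.iter("length_status"), None)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def check_genome_size(species_taxid, length):
    """Query the API over the network and return the reported length status."""
    with urlopen(build_url(species_taxid, length), timeout=30) as response:
        return extract_length_status(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Same inputs as the first example URL above (requires network access).
    print(check_genome_size(1773, "4.41M"))
```

A status other than within_range means the assembly length is worth reviewing before you submit.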

For more information, see the Genome Size Check documentation.

New release of the Prokaryotic Genome Annotation Pipeline now available


We have released a new version of the Prokaryotic Genome Annotation Pipeline (PGAP), available on GitHub. The new release includes the ability to ignore pre-annotation validation errors (--ignore-all-errors). This new feature allows you to produce a preliminary annotation for a draft version of the genome, even one that contains vector and adapter sequences or that is outside of the size range for the species. This draft annotation should be helpful with your ongoing work on the genome assembly. Please keep in mind that these pre-annotations and assemblies with contaminants or other errors are not suitable for submission to GenBank.

Another new feature allows you to provide the name of the consortium that generated the assembly and annotation so that this information appears in the final GenBank records. For more details, consult our guidelines on input files.

See our previous post and our documentation for details on how to obtain and run PGAP yourself.

Next on our to-do list is a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned!

Genome context graphic now in virus search results


We have a new and improved search experience for viral genes from select human pathogens. When you search for a virus such as HIV-1 (more examples below), you now get an interactive graphical representation of the viral genome where you can see all the annotated viral proteins in context. Clicking on the gene / protein objects allows you to access sequences, publications, and analysis tools for the selected protein. This new feature is designed to help you quickly find information relevant to your research on clinically important viruses.

Figure 1. Top: The virus genome graphic result for a search with HIV-1 with access to analysis tools, downloads, and relevant results in the Genome and Virus resources. Bottom: The result obtained by clicking the env gene graphic, which provides links to protein and nucleotide sequences, the literature, analysis tools, and downloads.

Try it out using the following example searches and let us know what you think!

Prokaryotic Genome Annotation Pipeline (PGAP) now produces results suitable for submission to GenBank


We are happy to announce that you can now submit your genome sequences annotated by your own local copy of the standalone Prokaryotic Genome Annotation Pipeline (PGAP) to GenBank.

How does it work? Download PGAP from GitHub, provide some basic information and the FASTA sequences for your genome, and run the pipeline on your own machine, a compute farm, or the cloud. PGAP will produce annotation consistent with NCBI’s internal PGAP. Submit the resulting annotated genome to GenBank through the genome submission portal, and get an accession back.

As with any other submitted assembly, PGAP-annotated genomes will be screened for foreign contaminants and vector sequences at submission. Any annotated assemblies that don’t pass may need to be modified. We are developing an automated process to handle these edits!

We are also working on other improvements to standalone PGAP, such as a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned for new developments!

Proposed changes to AGP files for genome assemblies


If you are a consumer or producer of AGP (A Golden Path) files for genome assemblies, please read on. We’d like your feedback on the proposed changes described here.

As you know, AGP files are used to describe the structure of certain genome assemblies. The AGP file format has not kept up with changes in sequencing technology or International Sequence Database Collaboration (INSDC) feature usage. NCBI is therefore proposing to extend the current AGP v2.0 specification to add new linkage evidence types and a gap type of “contamination” as detailed below and described in the AGP v2.1 proposed specification.

Continue reading

Important improvements on the genome Assembly pages


We’ve been making improvements to the NCBI genome Assembly resource. Highlights include:

  • Links added between members of a pair of genome assemblies derived from the same diploid individual
  • Additional filters now shown on the left-hand side bar
    • Annotation status
    • Assembly type, including the new types “Unresolved diploid” and “Alternate pseudohaplotype”
  • vhost filters on the Advanced page Search Builder that allow selection of virus assemblies with a particular host (e.g. “vhost human”)
  • Searching by assembly names with the version unspecified
  • Total ungapped length reported in the “Global statistics” table, replacing the less useful total gap length
  • Improved N50 & L50 statistics presentation for complex genome assemblies

Continue reading

First annotation of Pacific white shrimp


NCBI announces Annotation Release 100 of the Pacific white shrimp (Penaeus vannamei) genome in RefSeq, based on the assembly (GCF_003789085.1) submitted by the Institute of Oceanology, Chinese Academy of Sciences. The Pacific white shrimp is one of the most important shrimp species in fisheries and aquaculture and is the first decapod to have its genome annotated by NCBI. We predicted 24,987 protein-coding genes with evidence from alignment of six billion RNA-Seq reads and homology with invertebrate proteins. This annotation will enable genomic research in this commercially important species.

You can download the annotated assembly or browse and search it in the Genome Data Viewer.

Please visit our Eukaryotic RefSeq Genome Annotation Status page to see more annotations in progress.