We’re constantly making improvements to the NCBI genome Assembly resource. This post points out some recent advances, highlighted in Figure 1 and described in more detail below.
Figure 1. New improvements to the Assembly web pages. The results page showing the surveillance project filter (lower left), which excludes 28,220 Klebsiella pneumoniae assemblies from the Pathogen Detection Project, and the Download Assemblies button with a link to the File type description (circled in red, upper right). For other improvements in the Download Assemblies menu see our recent post.
You can now download new file types for species recently annotated by the NCBI Eukaryotic Genome Annotation Pipeline from the Assembly web pages and from the genomes/refseq FTP area. The new file types include alignments of annotated transcripts to the assembly in BAM format, all models predicted by Gnomon, and, for species that have been annotated multiple times, files characterizing the feature-by-feature differences between the current and the previous annotation.
We have released a new version of the Prokaryotic Genome Annotation Pipeline (PGAP), available on GitHub. The new release includes the ability to ignore pre-annotation validation errors (--ignore-all-errors). This new feature allows you to produce a preliminary annotation for a draft version of the genome, even one that contains vector and adapter sequences or that is outside of the size range for the species. This draft annotation should be helpful with your ongoing work on the genome assembly. Please keep in mind that these pre-annotations and assemblies with contaminants or other errors are not suitable for submission to GenBank.
Another new feature allows you to provide the name of the consortium that generated the assembly and annotation so that this information appears in the final GenBank records. For more details, consult our guidelines on input files.
Next on our to-do list is a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned!
There’s a new RefSeq annotation available for the human genome, and it’s quite an update!
About the release
Annotation release 109.20190607 is the first release of our new bimonthly annotation schedule as announced in a previous post. The annotated sequences are the latest sequences for the GRCh38, patch 13 assembly, GRCh38.p13 (GCF_000001405.39). The chromosome backbone sequences remain the same, but we’ve added 45 patch sequences representing novel and improved sequences that the Genome Reference Consortium will incorporate into the primary assembly in the future. The new annotation places the latest curated RefSeq transcripts and functional elements on the genome but keeps the same model dataset as in annotation release 109, except where models have been replaced by curated RefSeqs or other review. We are also flagging MANE and other RefSeq Select transcripts. Read on for more details on these improvements. You can download the updated annotation here!
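If you handle assembly accessions such as GCF_000001405.39 in your own scripts, it can help to split them into their parts. The following is a minimal sketch, assuming the usual accession layout of a GCF (RefSeq) or GCA (GenBank) prefix, nine digits, and an optional version suffix. The function and class names are hypothetical and are not part of any NCBI tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: GCA_ or GCF_, nine digits, optional ".version".
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})(?:\.(\d+))?$")


class AssemblyAccession(NamedTuple):
    prefix: str             # "GCF" (RefSeq) or "GCA" (GenBank)
    number: str             # nine-digit identifier, kept as text to preserve zeros
    version: Optional[int]  # None when the version is unspecified


def parse_assembly_accession(text: str) -> AssemblyAccession:
    """Split an assembly accession into prefix, number, and version."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(prefix, number, int(version) if version else None)


if __name__ == "__main__":
    print(parse_assembly_accession("GCF_000001405.39"))
```

Keeping the version separate makes it easy to check whether two records point to the same assembly in different versions.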
We have a new and improved search experience for viral genes from select human pathogens. When you search for a virus such as HIV-1 (more examples below), you now get an interactive graphical representation of the viral genome where you can see all the annotated viral proteins in context. Clicking on the gene / protein objects allows you to access sequences, publications, and analysis tools for the selected protein. This new feature is designed to help you quickly find information relevant to your research on clinically important viruses.
Figure 1. Top: The virus genome graphic result for a search with HIV-1 with access to analysis tools, downloads, and relevant results in the Genome and Virus resources. Bottom: The result obtained by clicking the env gene graphic, which provides links to protein and nucleotide sequences, the literature, analysis tools, and downloads.
Try it out using the following example searches and let us know what you think!
If you are a consumer or producer of AGP (A Golden Path) files for genome assemblies, please read on. We’d like your feedback on the proposed changes described here.
As you know, AGP files are used to describe the structure of certain genome assemblies. The AGP file format has not kept up with changes in sequencing technology or International Sequence Database Collaboration (INSDC) feature usage. NCBI is therefore proposing to extend the current AGP v2.0 specification to add new linkage evidence types and a gap type of “contamination” as detailed below and described in the AGP v2.1 proposed specification.
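For readers who consume AGP files programmatically, here is a minimal sketch of how the nine tab-separated columns of an AGP v2.0 line can be read, and how gap types can be tallied. It assumes the v2.0 layout, where component types N and U mark gap lines and all other types mark sequence components. The function names and the example records are hypothetical. A new gap type such as the proposed “contamination” would simply show up as another key in the tally.

```python
from collections import Counter
from typing import Dict, Iterable, List

# In AGP v2.0, component types N and U describe gaps; other types are sequences.
GAP_COMPONENT_TYPES = {"N", "U"}

# Hypothetical example records, not taken from a real assembly.
EXAMPLE = [
    "##agp-version\t2.0",
    "chr1\t1\t1000\t1\tW\tcontigA.1\t1\t1000\t+",
    "chr1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "chr1\t1101\t2100\t3\tW\tcontigB.1\t1\t1000\t-",
]


def parse_agp(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse AGP lines into dictionaries, skipping comments and blank lines."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
        record = {
            "object": cols[0],
            "object_beg": cols[1],
            "object_end": cols[2],
            "part_number": cols[3],
            "component_type": cols[4],
        }
        if cols[4] in GAP_COMPONENT_TYPES:
            # Gap line: columns 6-9 describe the gap.
            record.update(gap_length=cols[5], gap_type=cols[6],
                          linkage=cols[7], linkage_evidence=cols[8])
        else:
            # Component line: columns 6-9 describe the placed sequence.
            record.update(component_id=cols[5], component_beg=cols[6],
                          component_end=cols[7], orientation=cols[8])
        records.append(record)
    return records


def gap_type_counts(records: List[Dict[str, str]]) -> Counter:
    """Count how often each gap type occurs among the gap lines."""
    return Counter(r["gap_type"] for r in records if "gap_type" in r)


if __name__ == "__main__":
    print(gap_type_counts(parse_agp(EXAMPLE)))
```

The sketch keeps every value as text and does not validate gap types or linkage evidence against a fixed list, so it does not need to change if the specification adds new values.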
We’ve been making improvements to the NCBI genome Assembly resource. Highlights include:
- Links added between members of a pair of genome assemblies derived from the same diploid individual
- Additional filters now shown on the left-hand side bar
  - Annotation status
  - Assembly type, including the new types “Unresolved diploid” and “Alternate pseudohaplotype”
- vhost filters on the Advanced page Search Builder that allow selection of virus assemblies with a particular host (e.g. “vhost human”)
- Searching by assembly names with the version unspecified
- Total ungapped length reported in the “Global statistics” table, replacing the less useful total gap length
- Improved N50 & L50 statistics presentation for complex genome assemblies
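If you would like to reproduce statistics of this kind for your own sequences, the following is a minimal sketch using the commonly used definitions: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length, and L50 is the number of sequences in that set. The ungapped length here is simply the count of non-N bases. These definitions and function names are assumptions for illustration, and the Assembly resource may calculate its values differently, particularly for complex assemblies.

```python
from typing import List, Tuple


def n50_l50(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of sequence lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)  # longest first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50; 'count' sequences were needed, so it is the L50.
            return length, count
    raise AssertionError("unreachable: the running total always reaches half")


def ungapped_length(sequence: str) -> int:
    """Count bases that are not N, treating runs of N as gaps (an assumption)."""
    return sum(1 for base in sequence.upper() if base != "N")


if __name__ == "__main__":
    contig_lengths = [80, 70, 50, 40, 30, 20]
    print(n50_l50(contig_lengths))
    print(ungapped_length("ACGTNNNNacgt"))
```

Reporting the ungapped length alongside N50 and L50 shows how much of an assembly is actual sequence rather than gap filler.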
NCBI announces Annotation Release 100 of the Pacific white shrimp (Penaeus vannamei) genome in RefSeq, based on the assembly (GCF_003789085.1) submitted by the Institute of Oceanology, Chinese Academy of Sciences. The Pacific white shrimp is one of the most important shrimp species in fisheries and aquaculture and represents the first decapod to have its genome annotated by NCBI. We predicted 24,987 protein coding genes with evidence from alignment of six billion RNA-Seq reads and homology with invertebrate proteins. This annotation will enable genomic research in this commercially important species.
Please visit our Eukaryotic RefSeq Genome Annotation Status page to see more annotations in progress.
If you’ve been searching in Gene, Nucleotide, Protein, Genome or Assembly databases, you’ve probably noticed the new search experience we introduced in September to interpret several common language searches and offer improved results. We’re excited to announce we’ve added as-you-type suggestions to the search bar in these databases.
Here’s a peek at the new menu in the NCBI Gene database.