Check out the latest videos on YouTube to learn how to best use NCBI graphical viewers, SRA, PGAP, and other resources.
Genome Data Viewer: Analyzing Remote BAM Alignment Files and Other Tips
This video shows you how to upload remote BAM files, and succinctly demonstrates handy viewer settings, such as Pileup display options, and highlights the very helpful tooltips in the Genome Data Viewer (GDV). There’s also a brief blog post on the same topic.
The complete annotated genome sequence of the novel coronavirus associated with the outbreak of pneumonia in Wuhan, China is now available from GenBank for free and easy access by the global biomedical community. Figure 1 shows the relationship of the Wuhan virus to selected coronaviruses.
Figure 1. Phylogenetic tree showing the relationship of Wuhan-Hu-1 (circled in red) to selected coronaviruses. Nucleotide alignment was done with MUSCLE 3.8. The phylogenetic tree was estimated with MrBayes 3.2.6 with parameters for GTR+g+i. The scale bar indicates estimated substitutions per site, and all branch support values are 99.3% or higher.
We’re constantly making improvements to the NCBI genome Assembly resource. This post points out some recent advances, highlighted in Figure 1 and described in more detail below.Figure 1. New improvements to the Assembly web pages. The results page showing the surveillance project filter (lower left), which excludes 28,220 Klebsiella pneumoniae assemblies from the Pathogen Detection Project, and the Download Assemblies button with a link to the File type description (circled in red, upper right). For other improvements in the Download Assemblies menu see our recent post.
If you’re interested in visualizing and analyzing genomic data, then you’ll want to check out a new way to run Genome Workbench: in the cloud! Genome Workbench is a desktop application (both Windows and Mac) that lets you analyze genomic data in one place. You can run tools such as BLAST and create views such as multiple sequence alignments, and much more. You can run Genome Workbench on a cloud environment from your local desktop computer. This manual will show you how.
There are many advantages to using Genome Workbench in the cloud:
You can easily compare your data to the complete GenBank and RefSeq datasets without needing to download them
You can run BLAST searches against standard databases or any custom databases you’ve assembled in the cloud
All of the data (e.g. FASTA, BAM, GFF files) remain in the cloud with no need for local copies
You can now download new file types for species recently annotated by the NCBI Eukaryotic Genome Annotation Pipeline from the Assembly web pages and from the genomes/refseq FTP area. The new files types include alignments of annotated transcripts to the assembly in BAM format, all models predicted by Gnomon, and — for species that have been annotated multiple times — files characterizing the feature-by-feature differences between the current and the previous annotation.
On Wednesday, December 11, 2019 at 12 PM, NCBI staff will present a webinar that will show you how to use NCBI’s PGAP (https://github.com/ncbi/pgap) on your own data to predict genes on bacterial and archaeal genomes using the same inputs and applications used inside NCBI. You can run PGAP your own machine, a compute farm, or in the Cloud. Plus, you can now submit genome sequences annotated by your copy of PGAP to GenBank. Attend the webinar to learn more!
Date and time: Wed, Dec 11, 2019 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is now available on GitHub. This release uses a new and improved version of tRNAscan (tRNAscan-SE:2.0.4) and includes our most up-to-date Hidden Markov Model and BlastRule collections for naming proteins.
Remember that you can submit the results of PGAP to GenBank. Or, if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the –ignore-all-errors mode to get a preliminary annotation.
Validation issues can delay the processing of your submissions to GenBank. To avoid one type of delay, use the new “expected genome size” API to check the length of your genome assembly before submission.
The API compares the size of submitted genome assemblies to the expected genome size range for the species to identify outliers that can result from errors such as:
incorrect organism assignment
metagenome submitted as an organism genome
targeted sub-genome assembly not flagged as partial genome representation
gross contamination with other sequences
You can check in advance for these possible problems using the API. The API accepts the taxid for the species (taxid = Taxonomy ID – see our Taxonomy quick start guide on how to find the taxid for a given species) and the length of your assembly (excluding gaps and runs of Ns) as input and returns XML with the expected length, the acceptable range, and a status that tells you whether your assembly is too large, too small, or within the acceptable range. Look for <length_status>within_range</length_status> which confirms that your sequence passes the test!
We have released a new version of the Prokaryotic Genome Annotation Pipeline (PGAP), available on GitHub. The new release includes the ability to ignore pre-annotation validation errors (–ignore-all-errors). This new feature allows you to produce a preliminary annotation for a draft version of the genome, even one that contains vector and adapter sequences or that is outside of the size range for the species. This draft annotation should be helpful with your ongoing work on the genome assembly. Please keep in mind that these pre-annotations and assemblies with contaminants or other errors are not suitable for submission to GenBank.
Another new feature allows you to provide the name of the consortium that generated the assembly and annotation so that this information appears in the final GenBank records. For more details, consult our guidelines on input files.
See our previous post and our documentation for details on how to obtain and run PGAP yourself.
Next on our to-do list is a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned!
We have a new and improved search experience for viral genes from select human pathogens. When you search for a virus such as HIV-1 (more examples below), you now get an interactive graphical representation of the viral genome where you can see all the annotated viral proteins in context. Clicking on the gene / protein objects allows you to access sequences, publications, and analysis tools for the selected protein. This new feature is designed to help you quickly find information relevant to your research on clinically important viruses.Figure 1. Top: The virus genome graphic result for a search with HIV-1 with access to analysis tools, downloads, and relevant results in the Genome and Virus resources. Bottom: The result obtained by clicking the env gene graphic, which provides links to protein and nucleotide sequences, the literature, analysis tools, and downloads.
Try it out using the following example searches and let us know what you think!