Check out the latest videos on YouTube to learn how to best use NCBI graphical viewers, SRA, PGAP, and other resources.
Genome Data Viewer: Analyzing Remote BAM Alignment Files and Other Tips
This video shows you how to upload remote BAM files, and succinctly demonstrates handy viewer settings, such as Pileup display options, and highlights the very helpful tooltips in the Genome Data Viewer (GDV). There’s also a brief blog post on the same topic.
A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is now available on GitHub. This release uses a new and improved version of tRNAscan (tRNAscan-SE:2.0.4) and includes our most up-to-date Hidden Markov Model and BlastRule collections for naming proteins.
Remember that you can submit the results of PGAP to GenBank. Or, if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors mode to get a preliminary annotation.
How does it work? Download PGAP from GitHub, provide some basic information and the FASTA sequences for your genome, and run the pipeline on your own machine, a compute farm, or the cloud. PGAP will produce annotation consistent with NCBI’s internal PGAP. Submit the resulting annotated genome to GenBank through the genome submission portal, and get an accession back.
As with any other submitted assembly, PGAP-annotated genomes will be screened for foreign contaminants and vector sequences at submission. Any annotated assemblies that don’t pass may need to be modified. We are developing an automated process to handle these edits!
We are also working on other improvements to stand-alone PGAP such as a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned for new developments!
You can now download PGAP from GitHub and run it on your own machine, a compute farm, or the cloud, on any public or privately owned genome. PGAP predicts genes on bacterial and archaeal genomes using the same inputs and applications used inside NCBI. This is a great opportunity to try it and send us comments (please use GitHub issues).
NCBI has been asked to take over the ownership and maintenance of the TIGRFAM collection of Hidden Markov Models (HMMs), which is widely used for the annotation of prokaryotic genomes. The TIGRFAMs are a collection of curated protein families started in 1998 at The Institute for Genomic Research (TIGR), precursor to the J. Craig Venter Institute (JCVI). This collection is publicly available under a Creative Commons license and downloadable from NCBI. We have already made hundreds of improvements to TIGRFAM names and descriptions, and we will continue to make regular updates.
We’ve completed the RefSeq reannotation of over 1,000 Streptomyces genomes! The genomes were reannotated using the Prokaryotic Genome Annotation Pipeline (PGAP). Using a set of over 30 hidden Markov Models (HMMs) built by RefSeq biocurators, PGAP detected nearly 100% of the genes from known families that encode ribosomally synthesized and post-translationally modified peptide (RiPP) natural products, despite their small size. Over 70% (251 of 354) of the lasso peptides now present in Streptomyces RefSeq genomes were annotated for the first time.
If you are aware of any class of RiPP precursor in Streptomyces that was not found in our recent re-annotation, please contact us through the NCBI Help Desk, and we will add new HMMs to the rules we use to find and annotate RiPP precursor genes.
Do you ever want to see the flanking genes of a protein match from a BLAST search? On June 20th, we’ll show you how to see the genomic context of bacterial proteins using the identical protein report and the graphical sequence viewer. You will also learn how to use these reports in detail and how to get genomic contexts in batch for a set of protein matches using the identical protein report and EDirect.
Date and time: Wed, June 20, 2018 12:00 PM – 12:30 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
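As a rough illustration of the batch lookup described in the webinar announcement above, here is a minimal Python sketch that requests identical protein reports through the E-utilities EFetch endpoint (the web service that EDirect wraps) and reads the tab-delimited result. This is not the workflow presented in the webinar. The function names are hypothetical, and the code assumes that the first line of the response is a header row, so it reads column names from the response rather than hard-coding them.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_ipg_url(accessions):
    """Build an EFetch URL requesting identical protein reports (rettype=ipg)."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "ipg",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def parse_ipg_table(text):
    """Parse tab-delimited report text into a list of dicts.

    Assumption: the first line is a header row; its column names become
    the dictionary keys.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def fetch_ipg(accessions):
    """Download and parse the report for a batch of protein accessions.

    Performs a network request; keep batches small and follow NCBI's
    usage guidelines for E-utilities.
    """
    with urlopen(build_ipg_url(accessions)) as response:
        return parse_ipg_table(response.read().decode("utf-8"))
```

Calling `fetch_ipg` with a list of protein accessions from your BLAST hits should return one dictionary per report row. The nucleotide accession and coordinates in each row can then be used to open the corresponding region in the graphical sequence viewer.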
The 2018 Nucleic Acids Research database issue features several papers from NCBI staff that cover the status and future of databases including CCDS, ClinVar, GenBank and RefSeq. These papers are also available on PubMed. To read an article, click on the PMID number listed below.
RefSeq release 82 is accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of May 8, 2017 and contains 127,098,289 records, including 84,756,971 proteins, 18,901,573 RNAs, and sequences from 69,035 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.