Tag: Genome annotation

RefSeq Release 205 is available!

RefSeq release 205 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of March 1, 2021, and contains 269,975,565 records, including 197,232,209 proteins, 36,514,168 RNAs, and sequences from 108,257 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
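For readers who want to reach RefSeq records through E-utilities, the following is a minimal sketch of building an ESearch request URL in Python. The `build_esearch_url` helper and the example query are illustrative assumptions rather than part of the release; the code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public base address of the Entrez E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for an Entrez database and query term.

    This helper is hypothetical; it only assembles the query string.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example query (an assumption): RefSeq proteins from Salmonella enterica.
url = build_esearch_url(
    "protein", "srcdb_refseq[PROP] AND Salmonella enterica[ORGN]"
)
print(url)
```

Fetching the URL with any HTTP client returns a JSON list of matching record identifiers, which can then be passed to EFetch.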


New release of the Read Assembly and Annotation Pipeline Tool (RAPT), now 2X faster!

There is a new release of the Read Assembly and Annotation Pipeline Tool (RAPT) available from our GitHub site. RAPT is a one-step application for the genome assembly and gene annotation of archaeal and bacterial isolates that can run on your local computer or the Google Cloud Platform (GCP). With this new release, jobs will run twice as fast as with the December release. For example, we have assembled and annotated a Salmonella enterica genome in under an hour on a 16-CPU machine with the new release.
We have also added several new features based on your feedback, including:

The --stop-on-errors flag, which stops the process if the average nucleotide identity check finds evidence of a sample mix-up or contamination by other bacteria.
The ability to accept forward and reverse reads of paired-end runs in separate files. These can be compressed (gzip) files.
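Before submitting separate forward and reverse files to any assembly pipeline, it can help to confirm that the two files contain the same number of reads. The sketch below is an independent pre-flight check, not part of RAPT and not a description of how RAPT validates its input. It assumes the common four-lines-per-record FASTQ layout, and all function names are hypothetical.

```python
import gzip


def open_maybe_gzip(path):
    """Open a FASTQ file as text, detecting gzip by its two magic bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def count_fastq_records(path):
    """Count records, assuming exactly four lines per FASTQ record."""
    with open_maybe_gzip(path) as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4


def check_pair(forward_path, reverse_path):
    """Return the shared read count, or raise if the two files disagree."""
    forward = count_fastq_records(forward_path)
    reverse = count_fastq_records(reverse_path)
    if forward != reverse:
        raise ValueError(
            f"read count mismatch: {forward} forward vs {reverse} reverse"
        )
    return forward
```

Calling `check_pair("reads_1.fastq.gz", "reads_2.fastq.gz")` (file names are placeholders) returns the read count when the files agree and raises `ValueError` otherwise.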

Finally, thanks to all who came to our webinar in December and provided their comments! For those who couldn’t join us, you can now view the recording on our YouTube channel.

Contact us at prokaryote-tools@ncbi.nlm.nih.gov with any questions, or to let us know if you would like to become a beta-tester for RAPT.

Announcing the RefSeq annotation of rat mRatBN7.2!

NCBI RefSeq has finished its initial annotation of the new rat reference assembly, mRatBN7.2, recently released by the Darwin Tree of Life Project at the Wellcome Sanger Institute. This is the first coordinate-changing update to the rat reference since the 2014 release of Rnor_6.0 from the Rat Genome Sequencing Consortium and brings the rat assembly into the modern age with a nearly 300x increase in contig N50 and 9x increase in scaffold N50 lengths. It’s a major improvement!
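For readers less familiar with the N50 statistic, here is a minimal Python sketch of how it is commonly computed: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The contig lengths below are made-up toy values, not rat assembly data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assemblies with the same total length (300) but different contiguity.
fragmented = [80, 70, 50, 40, 30, 20, 10]
contiguous = [200, 60, 40]

print(n50(fragmented), n50(contiguous))
```

Both toy assemblies have the same total length, so the higher N50 of the second reflects only that its sequence sits in fewer, longer pieces.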


View intron feature evidence in the Genome Data Viewer and Sequence Viewer

Do you work on gene biology and want to examine alternative splice patterns in your gene or genes of interest? If so, be sure to explore the intron feature evidence available in graphical views of genome assemblies annotated by NCBI. You can view the NCBI evidence used for calling splice variants for genes, add other intron feature evidence tracks, and use new display and filter options that make it easier to interpret the data.

Figure 1. Graphical view of the monoamine oxidase gene region (MAOA, MAOB) on the human X chromosome showing intron feature tracks (‘RNA-seq intron features, aggregate’ and ‘Intropolis RNA-Seq intron features’). Mousing over an intron feature activates a tooltip that shows details such as the number of reads with the splice site, the location on the chromosome, the length of the intron, and the donor and acceptor bases at the splice site. The Intropolis track was added through the search feature of the Configure Tracks menu and configured (bottom menu) so that features are sorted by strand and only those with more than 500 reads appear.
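The same sort-and-filter logic can be applied to intron features exported from a track. The Python sketch below is only an illustration of that logic: the `IntronFeature` record, its field names, and the coordinates are hypothetical, and it does not reproduce the viewer's internal implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntronFeature:
    start: int    # intron start on the chromosome
    stop: int     # intron end on the chromosome
    strand: str   # "+" or "-"
    reads: int    # number of reads supporting the splice junction


def filter_and_sort(features, min_reads=500):
    """Keep features with more than min_reads reads, ordered by strand then start."""
    kept = [f for f in features if f.reads > min_reads]
    return sorted(kept, key=lambda f: (f.strand, f.start))


# Made-up features for illustration only.
features = [
    IntronFeature(100, 200, "-", 900),
    IntronFeature(50, 150, "+", 500),    # exactly 500 reads: filtered out
    IntronFeature(300, 400, "+", 1200),
    IntronFeature(10, 90, "-", 20),
]

for feature in filter_and_sort(features):
    print(feature)
```

Using a strict "greater than" comparison mirrors the figure's description, so a feature with exactly 500 reads is excluded.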


Human GRCh37 (hg19) RefSeq annotation update

The NCBI RefSeq group has been in overdrive, making improvements to our human genome annotation and reference transcript and protein sets, with 8,000 new and 15,000 updated transcripts in the last year alone! That’s about 30% of our curated transcript dataset (the transcripts with NM_ and NR_ accessions), with a big focus on transcripts that are well-expressed, have conserved exons, or are transcribed from new promoters.

With all these improvements, we’ve been updating the RefSeq annotation of GRCh38.p13 every quarter. But what about GRCh37 (hg19), which many of you still use?


Recent enhancements in Genome Workbench version 3.4.1

New Features

Version 3.4.1 of Genome Workbench, NCBI’s sequence annotation and analysis platform, includes new features for the Multiple Sequence Alignment View, the Graphical Sequence View and the Sequence Editing and Submission Package as well as a number of other improvements and bug fixes.

In the Multiple Sequence Alignment View, you can now export publication-quality graphics (Save As PDF/SVG …, Figure 1). In the Graphical Sequence View, you can now search by locus tag, use improved search capabilities for genes by locus, and better display the selected location in the feature editing dialog when annotating a sequence.

Figure 1. A multiple alignment view in Genome Workbench highlighting the new ability to save presentation-quality image files (Save As PDF and SVG formats).

In the Sequence Editing and Submission Package, we rearranged the controls in the Table Reader dialog to fit onto smaller screens and improved the import of feature tables that contain mat_peptide (mature peptide) features.

Bug Fixes and Improvements

We have made a number of other fixes and improvements. For macOS users, we fixed blurry text in some dialogs, fixed the copy-to-clipboard problem, and improved support for the latest Catalina version. We also fixed a crash in the Active Object Inspector interface. You should also see improvements in loading SNP data and better recovery after power outages or other events that corrupt local files.

In the Sequence Editing and Submission Package, we fixed a bug that occurred when applying miscellaneous descriptors and structured comment fields using the Table Reader and an issue with using a PubMed ID to look up a publication.

Please see the extensive help documentation including FAQs, videos, and tutorials linked to the Genome Workbench homepage for more information and examples on how to use Genome Workbench in your research.

RefSeq release 201 is public

RefSeq release 201 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of July 6, 2020, and contains 246,016,651 records, including 178,304,046 proteins, 32,462,009 RNAs, and sequences from 103,293 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
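To put the release sizes in context, the short Python sketch below compares the counts quoted on this page for releases 201 and 205. The numbers are copied from the two announcements; the dictionary keys and the `growth` helper are hypothetical names chosen for illustration.

```python
# Counts as stated in the release 201 and release 205 announcements.
release_201 = {
    "records": 246_016_651,
    "proteins": 178_304_046,
    "rnas": 32_462_009,
    "organisms": 103_293,
}
release_205 = {
    "records": 269_975_565,
    "proteins": 197_232_209,
    "rnas": 36_514_168,
    "organisms": 108_257,
}


def growth(old, new):
    """Return {key: (absolute increase, percent increase)} for each count."""
    return {
        key: (new[key] - old[key], 100 * (new[key] - old[key]) / old[key])
        for key in old
    }


for key, (added, percent) in growth(release_201, release_205).items():
    print(f"{key:>9}: +{added:,} ({percent:.1f}%)")
```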

Updated human genome Annotation Release 109.20200522

Updated Annotation Release 109.20200522 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report for 109.20200522 is available here. The annotation products are available in the sequence databases and on the FTP site.

Updated mouse genome Annotation Release 108.20200622

Updated Annotation Release 108.20200622 is an update of NCBI Mus musculus Annotation Release 108. The annotation report for 108.20200622 is available here. The annotation products are available in the sequence databases and on the FTP site.

This update precedes the expected release of a full assembly update for the mouse GRCm38.p6 reference assembly by the GRC in 2020. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly later this year, for either RefSeq FTP Release 202 or 203.

Improved access to SARS-CoV-2 data

NCBI Datasets has a simple, new way to get Coronaviridae data, including SARS-CoV-2 data (Figure 1). The data package includes genomic, protein and CDS sequences, annotation and a comprehensive data report for all complete genomes. You can also target your search to major taxonomic ranks within Coronaviridae.

Figure 1. SARS-CoV-2 page within NCBI Datasets showing statistics as of June 16, 2020.

Interested in a specific protein? The SARS-CoV-2 protein page allows you to choose a protein and download the corresponding sequences, annotation and representative structures from all annotated genomes (Figure 2).

Figure 2. SARS-CoV-2 protein page within NCBI Datasets showing annotations on the SARS-CoV-2 reference genome.

Looking for programmatic access? NCBI Datasets offers the same Coronaviridae genomic data and SARS-CoV-2 protein data through a command-line tool and a RESTful API. These tools support additional filtering, including the ability to download only those genomes released after a date you specify.

We appreciate your feedback. Try NCBI Datasets and let us know what you think!

Orthologs Are A-Swimming and A-Buzzing in RefSeq!

Previously we wrote about improvements to Drosophila annotations in RefSeq. We’re excited to report that we’re also improving how we compute and report orthology data for fish and insects to help you find evolutionarily related genes across species. Currently when we annotate a vertebrate genome using our in-house eukaryotic genome annotation pipeline, we have a robust process that identifies 1:1 orthologs vs human using a combination of BLAST comparisons and local synteny. These results are available in NCBI Gene and our new Ortholog pages, and also on Gene’s FTP site. We also use the data to apply human gene and protein names to orthologs in other species, providing a very rich annotation for hundreds of vertebrates.

Fish

For fish, we’re now using a two-layer process. First, most of the fish now have 1:1 orthologs identified vs zebrafish, which typically results in identifying 50% more orthologs. Second, if we’ve identified a human ortholog for the zebrafish gene, then we also report the human gene. We’ve also switched primarily to applying gene symbols and names from zebrafish instead of human, mostly provided by the Zebrafish Information Network (ZFIN), to other fish orthologs. The end result is more ortholog connections and better nomenclature. For example, many fish have two related homeobox genes meis2a and meis2b, compared to the single MEIS2 gene in human. Our updated process has allowed us to identify and properly name meis2a and meis2b in 73 and 40 fish species, respectively.
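The two-layer lookup described above can be sketched in Python as two chained mappings: fish gene to zebrafish ortholog, then zebrafish gene to human ortholog where one exists. This is a simplified illustration, not the annotation pipeline's code. The gene identifiers and both mapping tables are invented, including the arbitrary choice of which zebrafish paralog is paired with human MEIS2.

```python
def annotate_fish_gene(fish_gene, fish_to_zebrafish, zebrafish_to_human):
    """Two-layer lookup: fish gene -> zebrafish ortholog -> human ortholog."""
    zebrafish = fish_to_zebrafish.get(fish_gene)
    human = zebrafish_to_human.get(zebrafish) if zebrafish is not None else None
    return {
        "gene": fish_gene,
        "zebrafish_ortholog": zebrafish,
        "human_ortholog": human,
        # The symbol is taken from zebrafish when an ortholog was found.
        "symbol": zebrafish,
    }


# Hypothetical layer 1: 1:1 orthologs of one fish species vs zebrafish.
fish_to_zebrafish = {"LOC_0001": "meis2a", "LOC_0002": "meis2b"}

# Hypothetical layer 2: zebrafish genes with an identified human ortholog.
# Pairing meis2b (rather than meis2a) with MEIS2 is arbitrary here.
zebrafish_to_human = {"meis2b": "MEIS2"}

for gene in ("LOC_0001", "LOC_0002", "LOC_0003"):
    print(annotate_fish_gene(gene, fish_to_zebrafish, zebrafish_to_human))
```

In this toy data, both fish genes receive a zebrafish-derived symbol even though only one of them also reports a human gene, which is the gain in ortholog connections and naming that the two-layer approach is meant to provide.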
