The new reference assembly for sheep is now annotated! Assembly ARS-UI_Ramb_v2.0 is made of 142 scaffolds, a drop from 2,640 in the 2017 assembly Oar_rambouillet_v1.0. With a contig N50 of 43 Mb, ARS-UI_Ramb_v2.0 is 15 times more contiguous than the first assembly of the Rambouillet breed.
Annotation Release 104 (AR 104) of ARS-UI_Ramb_v2.0 reflects these improvements. Nearly 200 more coding genes have a 1:1 ortholog in the human genome than in the annotation of Oar_rambouillet_v1.0 (AR 103). The number of coding models annotated as partial is down 35% from 165 to 107, and the number of coding models labeled low quality due to suspected indels or base substitutions in the underlying genomic sequence decreased by 51% (1,646 to 796). Based on BUSCO analysis, 99.1% of the models (cetartiodactyla_odb10) are complete in AR 104 versus 98.8% in AR 103. Details of this annotation, including statistics on the annotation products, the input data used in the pipeline and intermediate alignment results, can be found here.
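To check the percentages quoted above, here is a small sketch that recomputes the relative decrease from the counts in the post. The function name percent_decrease is ours, not part of any NCBI tool.

```python
def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, expressed in percent."""
    return (before - after) / before * 100


# Counts quoted in the post (AR 103 -> AR 104).
partial = percent_decrease(165, 107)       # coding models annotated as partial
low_quality = percent_decrease(1646, 796)  # coding models labeled low quality

print(f"partial models: {partial:.1f}% fewer")
print(f"low-quality models: {low_quality:.1f}% fewer")
```

Both values agree with the rounded figures in the text (about 35% and a little over 51%).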
We have re-annotated all RefSeq genomes for Escherichia coli, Mycobacterium tuberculosis, Bacillus subtilis, Acinetobacter pittii, and Campylobacter jejuni using the most recent release of PGAP. You will find that more genes now have gene symbols (e.g. recA). Your feedback indicated that the lack of symbols was an impediment to comparative analysis, so we hope that this improvement will help.
The number of re-annotated genomes is 25,619 for E. coli, 470 for B. subtilis, 6,828 for M. tuberculosis, 316 for A. pittii, and 1,829 for C. jejuni. On average, the increase in gene symbols is 30% in E. coli, 110% in B. subtilis, 57% in M. tuberculosis, 94% in A. pittii and 62% in C. jejuni (see Figure 1). After re-annotation, on average, 73% of PGAP-annotated E. coli genes and 79% of B. subtilis genes have symbols (35% for M. tuberculosis, 40% for A. pittii and 46% for C. jejuni). We assigned symbols to the annotated genes by calculating the orthologs between the genome of interest and the reference assembly for the species, and transferring the symbols from the reference genes to their orthologs in the annotated genomes.
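The transfer step described above can be pictured with a minimal sketch. It assumes the ortholog pairs have already been calculated and shows only how symbols would be copied from reference genes to their orthologs. The locus tags and the transfer_symbols helper are hypothetical and are not PGAP code.

```python
def transfer_symbols(orthologs, reference_symbols):
    """Copy gene symbols from reference genes to their orthologs.

    orthologs: maps a locus tag in the annotated genome to the locus tag
        of its ortholog in the reference assembly.
    reference_symbols: maps reference locus tags to gene symbols.
    Genes whose reference ortholog has no symbol are left out.
    """
    return {
        target: reference_symbols[ref]
        for target, ref in orthologs.items()
        if ref in reference_symbols
    }


# Hypothetical locus tags, for illustration only.
reference_symbols = {"REF_0001": "recA", "REF_0002": "dnaA"}
orthologs = {"GENOME_0010": "REF_0001", "GENOME_0011": "REF_0003"}

symbols = transfer_symbols(orthologs, reference_symbols)
print(symbols)
```

In this toy example only GENOME_0010 receives a symbol, because the reference ortholog of GENOME_0011 has none.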
Figure 1: Average and standard deviation of the number of genes annotated with symbols per genome, in the previous (blue) and the current annotation (orange).
RefSeq Release 206 is now available. This release includes the following:
Updated human genome Annotation Release 109.20210514
Updated Annotation Release 109.20210514 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report is available here. The annotation products are available in the sequence databases and on the FTP site.
NCBI Datasets, the new set of services for downloading genome assembly and annotation data (previous Datasets posts), has redesigned and reorganized web pages to make it easier to find and access the services and documentation you need.
RefSeq release 205 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of March 1, 2021, and contains 269,975,565 records, including 197,232,209 proteins, 36,514,168 RNAs, and sequences from 108,257 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
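For readers who want to reach RefSeq records programmatically, the sketch below builds an E-utilities ESearch request URL. The search term is an assumption chosen for illustration (the srcdb_refseq[PROP] filter restricted to one organism), and build_esearch_url is our own helper name. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="protein", retmax=20):
    """Return an ESearch URL for `term` in the Entrez database `db`."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query: RefSeq protein records for sheep.
url = build_esearch_url("srcdb_refseq[PROP] AND Ovis aries[ORGN]")
print(url)
```

Fetching the URL with any HTTP client returns a JSON list of matching record identifiers. Please follow NCBI's usage guidelines for request rates.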
There is a new release of the Read assembly and Annotation Pipeline Tool (RAPT) available from our GitHub site. RAPT is a one-step application for the genome assembly and gene annotation of archaeal and bacterial isolates that can run on your local computer or the Google Cloud Platform (GCP). With this new release, jobs will run twice as fast as with the December release. For example, we have assembled and annotated a Salmonella enterica genome in under an hour on a 16-CPU machine with the new release.
We have also added several new features based on your feedback including:
The --stop-on-errors flag, which stops the process if the average nucleotide identity check finds evidence of a sample mix-up or contamination by other bacteria.
The ability to accept forward and reverse reads of paired-end runs in separate files. These can be compressed (gzip) files.
Finally, thanks to all who came to our webinar in December and provided their comments! For those who couldn’t join us, you can now view the recording on our YouTube channel.
NCBI RefSeq has finished its initial annotation of the new rat reference assembly, mRatBN7.2, recently released by the Darwin Tree of Life Project at the Wellcome Sanger Institute. This is the first coordinate-changing update to the rat reference since the 2014 release of Rnor_6.0 from the Rat Genome Sequencing Consortium and brings the rat assembly into the modern age with a nearly 300x increase in contig N50 and 9x increase in scaffold N50 lengths. It’s a major improvement!
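If N50 is new to you, it is the length such that sequences of that length or longer contain at least half of the assembled bases. The sketch below computes it for a toy list of lengths. The numbers are made up and are not taken from any rat assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that pushes the running sum to half of the
        # total assembly size is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Made-up contig lengths, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

A higher contig N50 means the same genome is covered by fewer, longer gap-free pieces, which is why it is a common measure of contiguity.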
Are you a researcher who works on gene biology and is interested in alternative splice patterns in your gene or genes of interest? If so, be sure to explore the intron feature evidence available in graphics views of genome assemblies annotated by NCBI. You can view the NCBI evidence used for calling splice variants for genes, add other intron feature evidence tracks, and use new display and filter options that make it easier to interpret the data.
Figure 1. Graphical view of the monoamine oxidase gene (MAOA, MAOB) region on the human X chromosome showing intron feature tracks (‘RNA-seq intron features, aggregate’ and ‘Intropolis RNA-Seq intron features’). Mousing over an intron feature activates a tooltip that shows details such as the number of reads with the splice site, the location on the chromosome, the length of the intron and the donor and acceptor bases at the splice site. The Intropolis track was added through the search feature of the Configure Tracks menu and configured (bottom menu) so that the features were sorted by strand and filtered so that only features with more than 500 reads appear.
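The sorting and filtering applied to the Intropolis track can be mimicked on your own data. The sketch below assumes intron features have been exported into simple records. The IntronFeature class, its fields and the coordinates are hypothetical and do not reflect a viewer file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntronFeature:
    start: int   # first base of the intron on the chromosome
    stop: int    # last base of the intron
    strand: str  # "+" or "-"
    reads: int   # number of reads supporting the splice junction


def filter_and_sort(features, min_reads=500):
    """Keep features with more than `min_reads` reads, sorted by strand."""
    kept = [f for f in features if f.reads > min_reads]
    return sorted(kept, key=lambda f: (f.strand, f.start))


# Made-up features, for illustration only.
features = [
    IntronFeature(1200, 1850, "-", 730),
    IntronFeature(300, 910, "+", 120),
    IntronFeature(400, 990, "+", 2048),
    IntronFeature(2000, 2600, "-", 500),
]

for feature in filter_and_sort(features):
    print(feature)
```

With these toy values only two features remain: the one on the plus strand with 2048 reads, followed by the one on the minus strand with 730 reads.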