This release includes new annotations for human, zebra finch, golden eagle, sea urchin, snowfinch, Arctic fox, clawed frog, great white shark, and more:
This full release, RefSeq release 207, incorporates genomic, transcript, and protein data available as of July 12, 2021, and contains 285,425,070 records, including 209,035,492 proteins, 39,039,901 RNAs, and sequences from 112,462 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.
We have re-annotated all RefSeq genomes for Escherichia coli, Mycobacterium tuberculosis, Bacillus subtilis, Acinetobacter pittii, and Campylobacter jejuni using the most recent release of PGAP. You will find that more genes now have gene symbols (e.g. recA). Your feedback indicated that the lack of symbols was an impediment to comparative analysis, so we hope that this improvement will help.
The number of re-annotated genomes is 25,619 for E. coli, 470 for B. subtilis, 6,828 for M. tuberculosis, 316 for A. pittii, and 1,829 for C. jejuni. On average, the increase in gene symbols is 30% in E. coli, 110% in B. subtilis, 57% in M. tuberculosis, 94% in A. pittii, and 62% in C. jejuni (see Figure 1). After re-annotation, on average, 73% of PGAP-annotated E. coli genes and 79% of B. subtilis genes have symbols (35% for M. tuberculosis, 40% for A. pittii, and 46% for C. jejuni). We assigned symbols to the annotated genes by calculating the orthologs between the genome of interest and the reference assembly for the species, and transferring the symbols from the reference genes to their orthologs in the annotated genomes.
Figure 1: Average and standard deviation of the number of genes annotated with symbols per genome, in the previous (blue) and the current annotation (orange).
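The ortholog-based symbol transfer described above can be illustrated with a minimal Python sketch. This is not PGAP's implementation: the function name, the input dictionaries, and the gene identifiers are all hypothetical, and the ortholog pairs are assumed to have been computed already by some other step.

```python
from typing import Dict, Optional


def transfer_symbols(
    orthologs: Dict[str, str],
    reference_symbols: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Copy gene symbols from reference genes to their orthologs.

    orthologs maps a gene ID in the annotated genome to the ID of its
    ortholog in the reference assembly (hypothetical input format).
    reference_symbols maps a reference gene ID to its symbol, or to None
    when the reference gene has no symbol.
    """
    transferred = {}
    for gene_id, ref_id in orthologs.items():
        symbol = reference_symbols.get(ref_id)
        if symbol:  # skip reference genes that are unknown or lack a symbol
            transferred[gene_id] = symbol
    return transferred


# Toy data with made-up identifiers, for illustration only.
reference_symbols = {"ref_0001": "recA", "ref_0002": "dnaA", "ref_0003": None}
orthologs = {"g1": "ref_0001", "g2": "ref_0003", "g3": "ref_0002"}

symbols = transfer_symbols(orthologs, reference_symbols)
print(symbols)
```

In this toy case, genes `g1` and `g3` receive symbols, while `g2` stays without one because its reference ortholog has no symbol to transfer.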
RefSeq Release 206 is now available. This release includes the following:
Updated human genome Annotation Release 109.20210514
Updated Annotation Release 109.20210514 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report is available here. The annotation products are available in the sequence databases and on the FTP site.
Other new eukaryotic genome annotations
This release includes new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 45 additional species.
NCBI staff will be presenting virtual posters at the Cold Spring Harbor Laboratory Biology of Genomes Meeting, May 11-14, 2021. The posters will cover the following topics: 1) a cloud-ready suite of tools (PGAP, RAPT, and SKESA) for assembling and annotating prokaryotic genomes, 2) Datasets, a new set of services for downloading genome assemblies and annotations, and 3) updates on NCBI RefSeq eukaryotic genome annotation and the Genome Data Viewer (GDV). Read more below for the full abstracts.
The virtual poster gallery opens Tuesday, May 11 at 9:00 a.m., with dedicated time for poster viewing and discussion through Slack from 1:00 to 2:00 p.m. each day. The poster gallery will be open for the entire conference and remain available for six weeks afterwards.
In March and April, the NCBI Eukaryotic Genome Annotation Pipeline released thirty-two new annotations in RefSeq, including Siamese fighting fish, common toad, swan, platypus, and more.
- Benincasa hispida (wax gourd)
- Canis lupus familiaris (dog)
- Corvus cornix cornix (hooded crow)
- Crotalus tigris (tiger rattlesnake)
- Culex pipiens pallens (northern house mosquito)
- Dioscorea cayenensis subsp. rotundata (Guinea yam)
- Drosophila santomea (fly)
- Drosophila simulans (fly)
- Drosophila yakuba (fly)
- Eucalyptus grandis (rose gum)
- Hibiscus syriacus (Rose-of-Sharon)
- Hyaena hyaena (striped hyena)
- Maniola hyperantus (ringlet)
- Mauremys reevesii (Reeves’s turtle)
- Nilaparvata lugens (brown planthopper)
This full release incorporates genomic, transcript, and protein data available as of March 1, 2021, and contains 269,975,565 records, including 197,232,209 proteins, 36,514,168 RNAs, and sequences from 108,257 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.
You can now get gene ortholog data with the NCBI Datasets command-line tool by providing a gene ID, gene symbol, or RefSeq nucleotide or protein accession. Data are available for vertebrates and insects. The vertebrate orthologs include a specialized set for fish. (See our recent post for more information on the orthologs for fish and insects.)
You can retrieve metadata for gene orthologs in JSON format, or you can download a compressed (zip) archive containing both metadata and sequences (Figure 1).
Figure 1. Command lines that use a gene symbol (BRCA1) to retrieve mammalian ortholog metadata (top, JSON metadata shown in part in the image) and sequences (bottom).
NCBI RefSeq has finished its initial annotation of the new rat reference assembly, mRatBN7.2, recently released by the Darwin Tree of Life Project at the Wellcome Sanger Institute. This is the first coordinate-changing update to the rat reference since the 2014 release of Rnor_6.0 from the Rat Genome Sequencing Consortium and brings the rat assembly into the modern age with a nearly 300x increase in contig N50 and 9x increase in scaffold N50 lengths. It’s a major improvement!
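For readers unfamiliar with the metric, N50 is the sequence length at which the contigs (or scaffolds) of that length or longer together cover at least half of the total assembly length. A minimal Python sketch of the calculation follows; the contig lengths are made-up toy values, not the actual Rnor_6.0 or mRatBN7.2 statistics, and the function name is our own.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; N50 is the length of
    the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for valid input; kept for completeness


# Toy contig lengths (arbitrary units), for illustration only.
old_contigs = [10, 10, 10, 10]
new_contigs = [80, 70, 50, 40, 30, 20]

old_n50 = n50(old_contigs)
new_n50 = n50(new_contigs)
fold_increase = new_n50 / old_n50
print(old_n50, new_n50, fold_increase)
```

A fold increase such as the ones quoted above is the ratio of the new assembly's N50 to the old assembly's N50, computed separately for contigs and for scaffolds.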