- Benincasa hispida (wax gourd)
- Canis lupus familiaris (dog)
- Corvus cornix cornix (hooded crow)
- Crotalus tigris (tiger rattlesnake)
- Culex pipiens pallens (northern house mosquito)
- Dioscorea cayenensis subsp. rotundata (Guinea yam)
- Drosophila santomea (fly)
- Drosophila simulans (fly)
- Drosophila yakuba (fly)
- Eucalyptus grandis (rose gum)
- Hibiscus syriacus (Rose-of-Sharon)
- Hyaena hyaena (striped hyena)
- Maniola hyperantus (ringlet)
- Mauremys reevesii (Reeves’s turtle)
- Nilaparvata lugens (brown planthopper)
This full release incorporates genomic, transcript, and protein data available as of March 1, 2021, and contains 269,975,565 records, including 197,232,209 proteins, 36,514,168 RNAs, and sequences from 108,257 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
You can now get gene ortholog data with the NCBI Datasets command-line tool by supplying a gene ID, gene symbol, or RefSeq nucleotide or protein accession. Data are available for vertebrates and insects. The vertebrate orthologs include a specialized set for fish. (See our recent post for more information on the orthologs for fish and insects.)
You can retrieve metadata for gene orthologs in JSON format, or you can download a compressed (zip) archive containing both metadata and sequences (Figure 1).
Figure 1. Command-lines that use a gene symbol (BRCA1) to retrieve mammalian ortholog metadata (top, JSON metadata shown in part in the image) and sequences (bottom).
NCBI RefSeq has finished its initial annotation of the new rat reference assembly, mRatBN7.2, recently released by the Darwin Tree of Life Project at the Wellcome Sanger Institute. This is the first coordinate-changing update to the rat reference since the 2014 release of Rnor_6.0 from the Rat Genome Sequencing Consortium and brings the rat assembly into the modern age with a nearly 300x increase in contig N50 and 9x increase in scaffold N50 lengths. It’s a major improvement!
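For readers less familiar with the N50 statistic quoted above, here is a minimal sketch of how it is commonly computed: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative assumptions, not values from the rat assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (made-up lengths, not real assembly data).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Fewer, longer contigs push this number up, which is why a large jump in contig N50 indicates a much more contiguous assembly.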
Since October, the NCBI Eukaryotic Genome Annotation Pipeline has released new annotations in RefSeq for a large number of organisms. We’ve separated them by group; click on “details” to see the full list for each.
- Artibeus jamaicensis (Jamaican fruit-eating bat)
- Arvicola amphibius (Eurasian water vole)
- Balaenoptera musculus (blue whale)
- Cebus imitator (Panamanian white-faced capuchin)
- Chlorocebus sabaeus (green monkey)
- Homo sapiens (human)
- Manis javanica (Malayan pangolin)
- Manis pentadactyla (Chinese pangolin)
- Ochotona princeps (American pika)
- Peromyscus leucopus (white-footed mouse)
- Pipistrellus kuhlii (Kuhl’s pipistrelle)
- Sturnira hondurensis (bat)
- Talpa occidentalis (Iberian mole)
- Trichosurus vulpecula (common brushtail)
This full release incorporates genomic, transcript, and protein data available as of January 4, 2021, and contains 262,714,372 records, including 191,411,721 proteins, 35,353,412 RNAs, and sequences from 106,581 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
Updated human genome Annotation Release 109.20201120
Updated Annotation Release 109.20201120 is an update of NCBI Homo sapiens Annotation Release 109.
The annotation report for 109.20201120 is available here. The annotation products are available in the sequence databases and on the FTP site.
Release 4.0 of the NCBI hidden Markov models (HMM) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
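As a rough sketch of the workflow described above, you could run HMMER's `hmmsearch` with its `--tblout` option (for example `hmmsearch --tblout hits.tbl models.hmm proteins.faa`, where all three file names are placeholders) and then pick the best-scoring model for each protein. The Python below parses the whitespace-delimited table that `--tblout` writes. The `Hit` type, the helper names, and the sample rows are assumptions made for illustration; they are not part of HMMER or of the NCBI release.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    protein: str          # target sequence name (column 1)
    model: str            # query HMM name (column 3)
    model_accession: str  # query HMM accession (column 4)
    evalue: float         # full-sequence E-value (column 5)
    score: float          # full-sequence bit score (column 6)


def parse_tblout(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per data row of an hmmsearch --tblout table."""
    for line in lines:
        # Comment and header rows start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        # 18 fixed columns, then a free-text description.
        fields = line.split(maxsplit=18)
        yield Hit(fields[0], fields[2], fields[3],
                  float(fields[4]), float(fields[5]))


def best_hit_per_protein(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep the highest-scoring model for each protein."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.protein not in best or hit.score > best[hit.protein].score:
            best[hit.protein] = hit
    return best


# Made-up rows in the tblout column layout (not real search results).
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias ...
prot1 - modelA ACC0001 1.2e-50 170.3 0.1 2.0e-50 169.6 0.1 1.0 1 0 0 1 1 1 1 made-up protein one
prot1 - modelB ACC0002 3.0e-10 40.5 0.0 4.1e-10 40.1 0.0 1.1 1 0 0 1 1 1 1 made-up protein one
prot2 - modelB ACC0002 5.5e-80 265.0 0.3 6.0e-80 264.8 0.3 1.0 1 0 0 1 1 1 1 made-up protein two
"""

best = best_hit_per_protein(parse_tblout(SAMPLE.splitlines()))
for protein, hit in best.items():
    print(protein, hit.model, hit.evalue)
```

Taking the top bit score is only one simple way to choose a model; model-specific score cutoffs, where available, would be a stricter criterion.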
This release contains 17,443 models, including 94 new models since the last release. We have also updated names and added EC numbers and gene symbols to over 100 models. You can search and view the details of these HMMs in the newly deployed Protein Family Model collection that also includes conserved domain architectures and BlastRules and allows you to find all RefSeq proteins named by these profiles. See our recent post for more details.
The new Protein Family Model resource (Figure 1) provides a way for you to search across the evidence used by the NCBI annotation pipelines to name and classify proteins. You can find protein families by gene symbol, protein function, and many other terms. You have access to related proteins in the family and publications describing members. Protein Family Models includes protein profile hidden Markov models (HMMs) and BlastRules for prokaryotes, and conserved domain architectures for prokaryotes and eukaryotes. The HMMs in the collection include Pfam models, TIGRFAMs, and models developed at NCBI, either de novo or from NCBI protein clusters. Each of the BlastRules (PMCID: 5753331) consists of one or more model proteins of known biological function with BLAST identity and coverage cutoffs. The conserved domain architectures are based on BLAST-compatible Position Specific Score Matrices (PSSMs) that constitute the NCBI Conserved Domain database.
Figure 1. Protein Family Model resource pages. Top panel, home page. Middle panel, selected results summaries from a fielded search for the DnaK gene product (DnaK[Gene Symbol]). Bottom panel, a portion of an HMM record for DnaK derived from NCBI Protein Clusters (NF009946). The record also includes PubMed citations and HMMER analyses showing the RefSeq proteins named by this method.
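To make the idea of a BlastRule more concrete, the following is a hypothetical sketch of how identity and coverage cutoffs might be checked against a single BLAST alignment to a model protein. The class names, field names, threshold values, and the choice to measure coverage over the model protein are all assumptions for illustration; the actual rule format used by the annotation pipeline is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRule:
    name: str
    min_identity: float  # percent identity cutoff (assumed field)
    min_coverage: float  # percent of the model protein aligned (assumed field)


@dataclass(frozen=True)
class Alignment:
    identity: float      # percent identical residues in the alignment
    aligned_length: int  # residues of the model protein covered
    model_length: int    # full length of the model protein


def coverage(aln: Alignment) -> float:
    """Percent of the model protein covered by the alignment."""
    return 100.0 * aln.aligned_length / aln.model_length


def rule_applies(rule: BlastRule, aln: Alignment) -> bool:
    """True only if the alignment meets both cutoffs."""
    return (aln.identity >= rule.min_identity
            and coverage(aln) >= rule.min_coverage)


# Made-up rule and alignments, purely illustrative.
rule = BlastRule(name="example rule", min_identity=90.0, min_coverage=80.0)
print(rule_applies(rule, Alignment(95.0, 450, 500)))
print(rule_applies(rule, Alignment(95.0, 300, 500)))
```

Requiring both cutoffs means a short, highly identical fragment would not inherit the name of a full-length model protein.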
This full release incorporates genomic, transcript, and protein data available as of November 2, 2020, and contains 256,340,911 records, including 186,482,096 proteins, 34,176,314 RNAs, and sequences from 105,349 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
RefSeq annotation of mouse GRCm39
RefSeq has finished its initial annotation of the new mouse reference assembly, GRCm39, recently released by the Genome Reference Consortium. This is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38, resolving over 400 issues, almost doubling the scaffold N50, closing almost half the gaps, and adding 1.9 Mb of sequence.
The annotation report for annotation release 109 is available here.
The annotation products are available in the sequence databases and on the FTP site.
New eukaryotic genome annotations
In addition to mouse (GRCm39), this release contains new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 27 species, including:
- Pallas’s mastiff bat annotation release 100, based on the assembly mMolMol1.p (GCF_014108415.1)
- Myotis myotis bat annotation release 100, based on the assembly mMyoMyo1.p (GCF_014108235.1)
- southern grasshopper mouse annotation release 100, based on the new assembly mOncTor1.1 (GCF_903995425.1)
- American pika (pictured above) annotation release 102 based on new assembly OchPri4.0 (GCF_014633375.1)
- pharaoh ant annotation release 102 based on new assembly ASM1337386v2 (GCF_013373865.1)
- olive fruit fly annotation release 101, based on the assembly MU_Boleae_v2 (GCF_001188975.3)
Updated human genome Annotation Release 105.20201022 (GRCh37.p13)
Annotation Release 105.20201022 is an annotation update for the previous human reference assembly, GRCh37.p13 (hg19). This update is not part of the RefSeq FTP release, but the annotation products are available in the sequence databases and on the genomes FTP site.
COVID-19 related human gene annotation now in NCBI RefSeq and Gene
The RefSeq group has compiled a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.
Matched Annotation by NCBI and EMBL-EBI (MANE) version 0.92
NCBI RefSeq and Ensembl/GENCODE announced MANE v0.92, which covers 16,865 genes or ~88% of known human protein-coding genes.
NCBI Datasets now provides downloads of gene data for more than 30 thousand organisms.
The NCBI RefSeq group has been in overdrive, making improvements to our human genome annotation and reference transcript and protein sets, with 8,000 new and 15,000 updated transcripts in the last year alone! That’s about 30% of our curated transcript dataset (the transcripts with NM_ and NR_ accessions), with a big focus on transcripts that are well-expressed, have conserved exons, or are transcribed from new promoters.
With all these improvements, we’ve been updating the RefSeq annotation of GRCh38.p13 every quarter. But what about GRCh37 (hg19), which many of you still use?