Tag: RefSeq

RefSeq release 201 is public

RefSeq release 201 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of July 6, 2020, and contains 246,016,651 records, including 178,304,046 proteins, 32,462,009 RNAs, and sequences from 103,293 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
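As a quick worked example, the growth since release 200 can be computed from the counts quoted in the two release announcements on this page. This is only arithmetic on those published totals; the dictionary keys and the `growth` helper are names made up for the illustration.

```python
# Totals quoted in the release 200 and release 201 announcements on this page.
RELEASE_200 = {"records": 237_381_664, "proteins": 171_643_729,
               "rnas": 31_244_247, "organisms": 100_605}
RELEASE_201 = {"records": 246_016_651, "proteins": 178_304_046,
               "rnas": 32_462_009, "organisms": 103_293}


def growth(old, new):
    """Return {category: (absolute increase, percent increase)}."""
    result = {}
    for key, old_count in old.items():
        delta = new[key] - old_count
        result[key] = (delta, round(100 * delta / old_count, 2))
    return result


for category, (delta, percent) in growth(RELEASE_200, RELEASE_201).items():
    print(f"{category}: +{delta:,} ({percent}%)")
```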

Updated human genome Annotation Release 109.20200522
Updated Annotation Release 109.20200522 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report for 109.20200522 is available here. The annotation products are available in the sequence databases and on the FTP site.

Updated mouse genome Annotation Release 108.20200622
Updated Annotation Release 108.20200622 is an update of NCBI Mus musculus Annotation Release 108. The annotation report for 108.20200622 is available here. The annotation products are available in the sequence databases and on the FTP site.

This update precedes the expected release of a full assembly update for the mouse GRCm38.p6 reference assembly by the GRC in 2020. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly later this year, for either RefSeq FTP Release 202 or 203.

New annotations in RefSeq: budgerigar, bony fish, fly and more


In May, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Acipenser ruthenus (sterlet)
  • Arvicanthis niloticus (African grass rat)
  • Cannabis sativa (eudicot)
  • Crassostrea gigas (Pacific oyster)
  • Cyclopterus lumpus (lumpfish)
  • Drosophila albomicans (fly)
  • Drosophila guanche (fly)
  • Drosophila innubila (fly)
  • Esox lucius (northern pike)
  • Gymnodraco acuticeps (bony fish)
  • Hippoglossus hippoglossus (Atlantic halibut)
  • Marmota flaviventris (yellow-bellied marmot)
  • Melopsittacus undulatus (budgerigar)
  • Osmia lignaria (orchard mason bee)
  • Pangasianodon hypophthalmus (striped catfish)
  • Pantherophis guttatus (snake)
  • Periophthalmus magnuspinnatus (bony fish)
  • Prunus dulcis (almond)
  • Pseudochaenichthys georgianus (South Georgia icefish)
  • Setaria viridis (monocot)
  • Thalassophryne amazonica (bony fish)
  • Thrips palmi (thrips)
  • Trematomus bernacchii (emerald rockcod)
  • Zea mays (maize)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

Orthologs Are A-Swimming and A-Buzzing in RefSeq!

Previously we wrote about improvements to Drosophila annotations in RefSeq. We’re excited to report that we’re also improving how we compute and report orthology data for fish and insects to help you find evolutionarily related genes across species. Currently when we annotate a vertebrate genome using our in-house eukaryotic genome annotation pipeline, we have a robust process that identifies 1:1 orthologs vs human using a combination of BLAST comparisons and local synteny. These results are available in NCBI Gene and our new Ortholog pages, and also on Gene’s FTP site. We also use the data to apply human gene and protein names to orthologs in other species, providing a very rich annotation for hundreds of vertebrates.

Fish

For fish, we’re now using a two-layer process. First, most of the fish now have 1:1 orthologs identified vs zebrafish, which typically results in identifying 50% more orthologs. Second, if we’ve identified a human ortholog for the zebrafish gene, then we also report the human gene. We’ve also switched primarily to applying gene symbols and names from zebrafish instead of human, mostly provided by the Zebrafish Information Network (ZFIN), to other fish orthologs. The end result is more ortholog connections and better nomenclature. For example, many fish have two related homeobox genes meis2a and meis2b, compared to the single MEIS2 gene in human. Our updated process has allowed us to identify and properly name meis2a and meis2b in 73 and 40 fish species, respectively.
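The two-layer lookup described above can be sketched roughly as follows. This is a toy illustration, not the pipeline's actual code: the function name, the locus identifiers, and the lookup tables are all hypothetical, and the real process derives its 1:1 calls from BLAST comparisons and local synteny rather than from ready-made dictionaries.

```python
from typing import Dict, NamedTuple, Optional


class OrthologCall(NamedTuple):
    zebrafish_symbol: Optional[str]  # layer 1: 1:1 ortholog versus zebrafish
    human_symbol: Optional[str]      # layer 2: human ortholog of that zebrafish gene


def two_layer_orthologs(fish_gene: str,
                        fish_to_zebrafish: Dict[str, str],
                        zebrafish_to_human: Dict[str, str]) -> OrthologCall:
    """Report the zebrafish ortholog and, if one is known, the human gene too."""
    zebrafish = fish_to_zebrafish.get(fish_gene)
    human = zebrafish_to_human.get(zebrafish) if zebrafish is not None else None
    return OrthologCall(zebrafish, human)


# Made-up lookup tables for illustration only.
FISH_TO_ZEBRAFISH = {"locus_0001": "meis2a", "locus_0002": "meis2b"}
ZEBRAFISH_TO_HUMAN = {"meis2a": "MEIS2"}  # assumed entry; meis2b left out on purpose

for locus in ("locus_0001", "locus_0002", "locus_0003"):
    print(locus, two_layer_orthologs(locus, FISH_TO_ZEBRAFISH, ZEBRAFISH_TO_HUMAN))
```

In this sketch a fish gene takes its symbol from the zebrafish layer, and the human gene is reported only when the second lookup succeeds.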


RefSeq release 200 is public

RefSeq release 200 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of May 4, 2020, and contains 237,381,664 records, including 171,643,729 proteins, 31,244,247 RNAs, and sequences from 100,605 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Other announcements:

The number of organisms in RefSeq crosses 100,000!
The current RefSeq release contains 100,605 distinct species or taxa, with a net increase of 763 species since Release 99. This milestone coincides with the 100th release, although the current release number is 200 (see below). Note that there is a decrease in the number of species for prokaryotes (bacteria and archaea) due to a clean-up that mainly removed unclassified bacteria and assemblies from Metagenome-Assembled Genomes (MAGs).

The FTP release number has skipped to 200
As previously announced, NCBI’s Reference Sequence (RefSeq) FTP release number has incremented to 200 for this release, and skipped over the numbers 100-199. The previous release, in March 2020, was release 99. This change is to avoid overlapping with the release numbers of the independently numbered RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109, for example Mus musculus Annotation Release 108.
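For scripts that track releases by count, the numbering gap can be handled with a small mapping like the one below. It is a sketch under one assumption: that FTP release numbers keep incrementing by one after 200 (which matches release 201 following release 200). Both function names are hypothetical.

```python
def ftp_release_number(ordinal: int) -> int:
    """Map the n-th RefSeq FTP release (1-based) to its published release number.

    Releases 1-99 keep their own number; the 100th release is numbered 200.
    Assumes numbering continues 201, 202, ... after that.
    """
    if ordinal < 1:
        raise ValueError("ordinal must be a positive integer")
    return ordinal if ordinal <= 99 else ordinal + 100


def is_ftp_release_number(number: int) -> bool:
    """True if `number` can be an FTP release number, i.e. it is not in 100-199."""
    return 1 <= number <= 99 or number >= 200


print(ftp_release_number(99), ftp_release_number(100), ftp_release_number(101))
```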

NCBI Protein Families
A new release of the NCBI protein families profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using HMMER.

Recalculation of Prokaryotic Reference and Representative Genome Assemblies
We have updated the collection of reference and representative assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We have selected one reference or representative assembly for every species based on several criteria including contiguity, completeness, and whether the assembly is from type material.

Future change: Mouse Reference Assembly Update
A full assembly update for the mouse GRCm38.p6 reference assembly is expected to be released in 2020 by the GRC. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly this summer, for either RefSeq FTP Release 201 or 202.


Flies Are A-buzzing in RefSeq!

Are you interested in comparative genomics or other studies using Drosophila genomics?

Then don’t miss our online poster #568A at TAGC 2020 Online (no meeting registration required). Also, tune in to the online Q&A session on Monday, April 27 at 12:00 – 12:30 pm EDT.

What’s happening? In coordination with FlyBase, we are transitioning almost all of the RefSeq Drosophila assemblies to annotation produced primarily by NCBI’s eukaryotic genome annotation pipeline. We’ll continue to use the FlyBase annotation for Drosophila melanogaster (soon to be updated to Release 6.32), but we’ll annotate the other species using available RNA-seq datasets and our latest software. This will allow us to provide consistent, high-quality annotations across the full spectrum of Drosophila species, and also rapidly provide annotations as new high-quality assemblies become available. Another benefit is that these annotations will be available in the full suite of NCBI resources, including nucleotide, protein, BLAST, Gene, Genome Data Viewer, Genomes, Assembly, and more. You can download these annotation data from the NCBI genomes FTP site or you can try the new NCBI Datasets tool. By special request, we’re making orthology data relative to D. melanogaster available on the Gene FTP site, and plan to expose that data in our public pages in the future.


Recalculation of prokaryotic reference and representative genome assemblies

We have updated the collection of representative and reference assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We chose the 11,478 representative assemblies in the new collection from the 180,000+ prokaryotic assemblies in RefSeq today. We have selected one representative or reference assembly for every species based on several criteria, including contiguity, completeness, and whether the assembly is from type material. We have also updated the reference and representative microbial BLAST database to reflect these changes. This reference and representative set will be updated three times a year to reflect changes in RefSeq. In addition, as we announced on Feb 14, we have reduced the number of reference genome assemblies (the subset of representative assemblies with annotation provided by outside experts) to 15. See the list in our previous post. We have re-annotated the 104 assemblies that are no longer reference with our Prokaryotic Genome Annotation Pipeline (PGAP).

Recent RefSeq annotations: barn owl, monarch butterfly and more

In February and March, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Amblyraja radiata (thorny skate)
  • Catharus ustulatus (Swainson’s thrush)
  • Chelonoidis abingdonii (Abingdon island giant tortoise)
  • Chiroxiphia lanceolata (lance-tailed manakin)
  • Danaus plexippus plexippus (monarch butterfly)
  • Daphnia magna (crustacean)
  • Drosophila grimshawi (fly)
  • Drosophila mojavensis (fly)
  • Drosophila sechellia (fly)
  • Homo sapiens (human)
  • Hylobates moloch (silvery gibbon)
  • Lontra canadensis (North American river otter)
  • Lynx canadensis (Canada lynx)
  • Nasonia vitripennis (jewel wasp)
  • Odontomachus brunneus (ant)
  • Petromyzon marinus (sea lamprey)
  • Phocoena sinus (vaquita)
  • Rattus rattus (black rat)
  • Rhinolophus ferrumequinum (greater horseshoe bat)
  • Strigops habroptila (kakapo)
  • Taeniopygia guttata (zebra finch)
  • Tyto alba (barn owl)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

The next RefSeq FTP release number will skip to 200

NCBI’s Reference Sequence (RefSeq) FTP release numbers will increment to 200 for the next release and skip over the numbers 100-199. The current release, from March 2020, is release 99. The next bi-monthly release in May 2020 will be release 200. This change is to avoid overlapping with the release numbers of the completely independent RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109, for example Mus musculus Annotation Release 108.

Protein family models used by PGAP are now available for download

A new release of the NCBI protein families profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using HMMER.

The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes and are also one of the sources for the names assigned to PGAP-annotated proteins presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (See for example, WP_004152100.1).
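One way to run such a search is with HMMER's `hmmsearch` program, writing a per-sequence table with `--tblout` and then keeping the best-scoring model for each protein. The sketch below only builds the command line and parses a table. The file names are placeholders, the sample table is invented, and picking the single top-scoring model is a simplification for illustration, not how PGAP assigns names.

```python
from typing import Dict, List, Tuple


def hmmsearch_command(hmm_file: str, protein_fasta: str, tblout: str) -> List[str]:
    """Argument list for: hmmsearch --tblout <tblout> <hmm_file> <protein_fasta>."""
    return ["hmmsearch", "--tblout", tblout, hmm_file, protein_fasta]


def best_hits(tblout_text: str) -> Dict[str, Tuple[str, float, float]]:
    """Map each protein to (model name, E-value, bit score) of its top-scoring hit.

    In hmmsearch --tblout output, column 1 is the target (protein) name,
    column 3 the query (HMM) name, column 5 the full-sequence E-value and
    column 6 the full-sequence bit score. Lines starting with '#' are comments.
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        protein, model = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if protein not in best or score > best[protein][2]:
            best[protein] = (model, evalue, score)
    return best


# Invented table rows, shaped like --tblout output, for illustration only.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  ...
prot_A - model_X - 1e-20 75.0 0.1 2e-20 74.0 0.1 1.0 1 0 0 1 1 1 1 made-up protein A
prot_A - model_Y - 1e-50 170.2 0.0 2e-50 169.0 0.0 1.0 1 0 0 1 1 1 1 made-up protein A
prot_B - model_X - 3e-05 25.3 0.2 4e-05 24.9 0.2 1.0 1 0 0 1 1 1 1 made-up protein B
"""

print(" ".join(hmmsearch_command("models.hmm", "proteins.faa", "hits.tbl")))
print(best_hits(SAMPLE_TBLOUT))
```

The command list can be passed to `subprocess.run` once HMMER is installed and the model and protein files are in place.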

The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.

  • 85% of models were assigned a product name that can be transferred to proteins hit by the model.
  • 7,702 models have gene symbols.
  • 14,508 are supported by at least one publication.
  • 6,266 are assigned an Enzyme Commission number.
  • 617 represent antimicrobial resistance proteins.
  • Product names added to 4,686 PFAM HMMs owned by EBI-EMBL and used for functional annotation by PGAP are also included.

A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs, and can be identified with the Entrez query "meta Evidence-For-Name-Assignment"[Properties] AND "Evidence Category=HMM"[Text Word]. See an example and more information on web displays of HMMs in a previous post.
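The same query can be sent programmatically through E-utilities. The snippet below is a minimal sketch that only builds an ESearch URL for the protein database and makes no network request. The `esearch_url` helper and the `retmax` default are choices made for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The Entrez query quoted in the post, written with plain double quotes.
HMM_NAMED_QUERY = ('"meta Evidence-For-Name-Assignment"[Properties] '
                   'AND "Evidence Category=HMM"[Text Word]')


def esearch_url(term: str, db: str = "protein", retmax: int = 20) -> str:
    """Build an ESearch URL; urlencode percent-encodes quotes, brackets and spaces."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


url = esearch_url(HMM_NAMED_QUERY)
print(url)

# Round-trip check: decoding the URL gives back the original query text.
assert parse_qs(urlsplit(url).query)["term"][0] == HMM_NAMED_QUERY
```

Fetching the URL (for example with `urllib.request.urlopen`) should return a JSON document with the hit count and a first page of protein identifiers. Heavy use is subject to NCBI's E-utilities usage guidelines.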