Tag: RefSeq

RefSeq Release 202 is public

RefSeq release 202 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of September 8, 2020, and contains 255,571,455 records, including 186,755,483 proteins, 33,077,068 RNAs, and sequences from 104,969 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Updated human genome Annotation Release 109.20200815
Updated Annotation Release 109.20200815 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report is available here.

The annotation products are available in the sequence databases and on the FTP site.

This update includes around 15,000 updated RefSeq transcripts revised to use CAGE and polyA data to define 5′ and 3′ ends, and match the reference GRCh38 sequence.

Coronavirus host gene regulatory elements now annotated by RefSeq Functional Elements
The RefSeq Functional Elements project at NCBI has prioritized curation of experimentally validated regulatory elements for human host genes associated with SARS-CoV-2 entry into cells. The annotations include several enhancers, promoters, cis-regulatory elements and protein binding sites, among other feature types. We annotated 236 regulatory features for 27 distinct biological regions, including regulatory elements for the ABO, ACE2, ANPEP, CD209, CLEC4G, CLEC4M, CTSL, DPP4, and TMPRSS2 genes. More information can be found here.

New eukaryotic genome annotations
This release includes new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 27 species, including:

  • maize annotation release 103, based on the new assembly Zm-B73-REFERENCE-NAM-5.0 (GCF_902167145.1)
  • marmoset annotation release 105, based on the new assembly Callithrix_jacchus_cj1700_1.1 (GCF_009663435.1)
  • Chinese hamster annotation release 104, based on the assembly CriGri_1.0 (GCF_000223135.1) and the new assembly CriGri-PICRH-1.0 (GCF_003668045.3)
  • Asian giant hornet annotation release 100, based on the new assembly V.mandarinia_Nanaimo_p1.0 (GCF_014083535.2)
  • Florida lancelet annotation release 100, based on the new assembly Bfl_VNyyK (GCF_000003815.2)
  • Anopheles stephensi annotation release 100, based on the new assembly UCI_ANSTEP_V1.0 (GCF_013141755.1)

Updated and improved collection of RefSeq representative genome assemblies now available
The collection of representative genome assemblies for Bacteria and Archaea contains 11,727 prokaryotic assemblies to represent their respective species. More information can be found here.

Updated protein family models used by PGAP available for download
Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available.

This release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. More information can be found here.

Future change: Mouse Reference Assembly Update
RefSeq annotation of the new mouse GRCm39 assembly is in progress, and is expected to be included in the next release.

Updated and improved collection of RefSeq representative genome assemblies now available

We have updated the collection of representative genome assemblies for Bacteria and Archaea. As announced in April, this set is now recalculated three times a year. We selected a total of 11,727 prokaryotic assemblies to represent their respective species among the 192,000 assemblies in RefSeq. Six hundred and thirty-five species were included in the collection for the first time, while 395 organisms from undefined species (such as Bacillus bacterium) were removed. We were able to choose a higher-quality representative than in the previous set for 18% of Bacterial and Archaeal species due to improvements in the logic of the selection that is now based on the assembly length, number of pseudo CDSs called in the PGAP annotation, number of scaffolds, whether Gene IDs are available in the Gene database for the assembly that is currently representative, and type strain status. You can see the exact criteria in order of importance on the Prokaryotic RefSeq Genomes page. Now that the new selection process is in place, we expect future updates to have fewer changes. We will replace a representative only if the assembly has changed RefSeq status or if a substantially better assembly becomes available.

We have updated the database on the Microbial Nucleotide BLAST page as well as the basic nucleotide BLAST RefSeq Representative Genome Database, to reflect these changes.

You can download the reference and representative set from the Assembly resource. If you are interested in the annotation on these genomes, you can limit searches to proteins annotated on representative genomes by adding “refseq_select[filter]” to any query in the Protein database. For example, you can find all proteins annotated on representative genomes in the genus Klebsiella by using the query: “Klebsiella[organism] AND refseq_select[filter]”. A BLAST database of proteins annotated on representative genomes will be coming soon. Stay tuned!
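The same query can also be sent programmatically through E-utilities. The following is a minimal sketch, assuming the public ESearch endpoint and its `db`, `term`, `retmax`, and `retmode` parameters; the function name `build_protein_search_url` is made up for this illustration. It only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_protein_search_url(organism: str, retmax: int = 20) -> str:
    """Build an ESearch URL for proteins annotated on representative genomes.

    The helper name and defaults are illustrative assumptions, not an NCBI API.
    """
    # Same query syntax as in the Protein database search box.
    term = f"{organism}[organism] AND refseq_select[filter]"
    params = {"db": "protein", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_protein_search_url("Klebsiella")
print(url)
```

Fetching the URL with any HTTP client should return the matching record identifiers. For larger jobs, follow NCBI's usage guidelines on request rates.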

New annotations in RefSeq: white-tufted-ear marmoset, ruddy duck, and more

In June and July, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Acipenser ruthenus (sterlet)
  • Anguilla anguilla (European eel)
  • Aphantopus hyperantus (ringlet)
  • Callithrix jacchus (white-tufted-ear marmoset)
  • Chelonus insularis (wasp)
  • Cricetulus griseus (Chinese hamster)
  • Cygnus atratus (black swan)
  • Drosophila subobscura (fly)
  • Electrophorus electricus (electric eel)
  • Etheostoma cragini (Arkansas darter)
  • Hippoglossus stenolepis (Pacific halibut)
  • Mirounga leonina (Southern elephant seal)
  • Morone saxatilis (striped sea-bass)
  • Mus musculus (house mouse)
  • Oxyura jamaicensis (ruddy duck)
  • Pan paniscus (pygmy chimpanzee)
  • Populus alba (eudicot)
  • Scophthalmus maximus (turbot)
  • Spodoptera frugiperda (fall armyworm)
  • Stegodyphus dumicola (spider)
  • Vitis riparia (eudicot)
  • Zootoca vivipara (common lizard)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

New interaction data, downloads and track hub available for RefSeq Functional Elements 

We’ve added several new enhancements to the RefSeq Functional Elements dataset, which provides genome annotation and richly annotated RefSeq and Gene records for experimentally validated non-genic functional regions in human and mouse. Read on to see what we’ve done!

Major update for the NCBI RefSeq mouse GRCm38.p6 annotation

We have updated our annotation for the mouse reference genome, GRCm38.p6. It includes:

  • Markup for RefSeq Select, which identifies one representative transcript and protein for every protein-coding gene. Find features with the ‘tag=RefSeq Select’ attribute in GFF3 for those analyses where you need just a single transcript or protein for each coding gene. You can also find these RefSeqs in Entrez using the query ‘refseq_select[filter]’.
  • Annotation updates made in the last year for over 2000 genes, including over 4000 new or revised curated transcripts. This includes targeted curation to ensure we are representing well-expressed and conserved transcripts for inclusion in RefSeq Select.
  • Annotation of over 2300 regulatory and other functional element features from over 900 biological regions. These are now identified with the source “RefSeqFE” in GFF3 column 2 for easy parsing.
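As a rough illustration of the two GFF3 markers mentioned above, the sketch below checks each feature line for the ‘tag=RefSeq Select’ attribute (column 9) and the “RefSeqFE” source (column 2). The helper names and the three sample lines are hypothetical and are not taken from the actual annotation files. Only the standard nine-column, tab-separated GFF3 layout is assumed.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Split GFF3 column 9 into {key: [values]}; values may be comma-separated."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs


def classify(line: str):
    """Return flags for one GFF3 feature line, or None for comments/blank lines."""
    if line.startswith("#") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    attrs = parse_attributes(cols[8])
    return {
        "type": cols[2],
        "is_refseq_select": "RefSeq Select" in attrs.get("tag", []),
        "is_functional_element": cols[1] == "RefSeqFE",
    }


# Hypothetical sample lines, for illustration only.
sample = [
    "##gff-version 3",
    "chrTest\tBestRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna-example1;tag=RefSeq Select",
    "chrTest\tGnomon\tmRNA\t100\t950\t.\t+\t.\tID=rna-example2",
    "chrTest\tRefSeqFE\tenhancer\t2000\t2500\t.\t.\t.\tID=id-example3",
]
results = [r for r in (classify(line) for line in sample) if r is not None]
for r in results:
    print(r)
```

Splitting the `tag` value on commas lets a feature carry more than one tag without breaking the RefSeq Select check.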

When citing, please refer to this annotation as NCBI Mus musculus Annotation Release 108.20200622. You can find the data in:

This is our last update before upgrading to the new major assembly version just released by the Genome Reference Consortium, GRCm39. We expect to be cranking up our compute farm in the next few weeks to produce a full annotation based on our latest curation and extensive short (Illumina) and long (PacBio IsoSeq and nanopore) RNA-seq data, which should be released later this summer. Stay tuned!

Updated protein family models used by PGAP available for download

Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
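A search like this could be run with HMMER's `hmmsearch` program. The sketch below only assembles the command line, using the `--cpu` and `--tblout` options. The file names (`pgap_models.hmm`, `my_proteins.faa`, `hits.tbl`) are placeholders and do not reflect the actual layout of the FTP release.

```python
import shlex


def build_hmmsearch_command(hmm_file, protein_fasta, table_out, cpus=2):
    """Assemble an hmmsearch command line: profiles first, then the sequence file."""
    return [
        "hmmsearch",
        "--cpu", str(cpus),        # number of worker threads
        "--tblout", table_out,     # per-target tabular summary of hits
        hmm_file,
        protein_fasta,
    ]


# Placeholder file names, for illustration only.
cmd = build_hmmsearch_command("pgap_models.hmm", "my_proteins.faa", "hits.tbl")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where HMMER is installed and the input files exist.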

The 3.0 release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. You can access a table of these product names from the release directory.

Figure 1. The evidence for name assignment for type III secretion system (T3SS) translocon subunit SctB (NF038055) showing the protein matches. Species-specific names for this highly variable component of T3SS include YopD, EspB, IpaC, SipC, etc. Instead, we used the standard moniker for core genes of T3SS, Sct, Secretion and cellular translocation (PMID 26520801, PMID 9618447) providing a unified nomenclature for this secretion system.

RefSeq release 201 is public

RefSeq release 201 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of July 6, 2020, and contains 246,016,651 records, including 178,304,046 proteins, 32,462,009 RNAs, and sequences from 103,293 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Updated human genome Annotation Release 109.20200522
Updated Annotation Release 109.20200522 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report for 109.20200522 is available here. The annotation products are available in the sequence databases and on the FTP site.

Updated mouse genome Annotation Release 108.20200622
Updated Annotation Release 108.20200622 is an update of NCBI Mus musculus Annotation Release 108. The annotation report for 108.20200622 is available here. The annotation products are available in the sequence databases and on the FTP site.

This update precedes the expected release of a full assembly update for the mouse GRCm38.p6 reference assembly by the GRC in 2020. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly later this year, for either RefSeq FTP Release 202 or 203.

New annotations in RefSeq: budgerigar, bony fish, fly and more

In May, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Acipenser ruthenus (sterlet)
  • Arvicanthis niloticus (African grass rat)
  • Cannabis sativa (eudicot)
  • Crassostrea gigas (Pacific oyster)
  • Cyclopterus lumpus (lumpfish)
  • Drosophila albomicans (fly)
  • Drosophila guanche (fly)
  • Drosophila innubila (fly)
  • Esox lucius (northern pike)
  • Gymnodraco acuticeps (bony fish)
  • Hippoglossus hippoglossus (Atlantic halibut)
  • Marmota flaviventris (yellow-bellied marmot)
  • Melopsittacus undulatus (budgerigar)
  • Osmia lignaria (orchard mason bee)
  • Pangasianodon hypophthalmus (striped catfish)
  • Pantherophis guttatus (snake)
  • Periophthalmus magnuspinnatus (bony fish)
  • Prunus dulcis (almond)
  • Pseudochaenichthys georgianus (South Georgia icefish)
  • Setaria viridis (monocot)
  • Thalassophryne amazonica (bony fish)
  • Thrips palmi (thrip)
  • Trematomus bernacchii (emerald rockcod)
  • Zea mays (maize)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

Orthologs Are A-Swimming and A-Buzzing in RefSeq!

Previously we wrote about improvements to Drosophila annotations in RefSeq. We’re excited to report that we’re also improving how we compute and report orthology data for fish and insects to help you find evolutionarily related genes across species. Currently when we annotate a vertebrate genome using our in-house eukaryotic genome annotation pipeline, we have a robust process that identifies 1:1 orthologs vs human using a combination of BLAST comparisons and local synteny. These results are available in NCBI Gene and our new Ortholog pages, and also on Gene’s FTP site. We also use the data to apply human gene and protein names to orthologs in other species, providing a very rich annotation for hundreds of vertebrates.

Fish

For fish, we’re now using a two-layer process. First, most of the fish now have 1:1 orthologs identified vs zebrafish, which typically results in identifying 50% more orthologs. Second, if we’ve identified a human ortholog for the zebrafish gene, then we also report the human gene. We’ve also switched primarily to applying gene symbols and names from zebrafish instead of human, mostly provided by the Zebrafish Information Network (ZFIN), to other fish orthologs. The end result is more ortholog connections and better nomenclature. For example, many fish have two related homeobox genes meis2a and meis2b, compared to the single MEIS2 gene in human. Our updated process has allowed us to identify and properly name meis2a and meis2b in 73 and 40 fish species, respectively.
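The two-layer lookup described above can be pictured as two chained mappings. This is only a toy sketch, not the pipeline's actual implementation. The fish gene identifiers, `zf_only_gene`, and the helper name are invented for illustration; the meis2a/meis2b and MEIS2 symbols come from the example in the text.

```python
# Layer 1: fish gene -> zebrafish 1:1 ortholog (identifiers are hypothetical).
fish_to_zebrafish = {
    "fishX_gene1": "meis2a",
    "fishX_gene2": "meis2b",
    "fishX_gene3": "zf_only_gene",
}

# Layer 2: zebrafish gene -> human ortholog, where one has been identified.
zebrafish_to_human = {
    "meis2a": "MEIS2",
    "meis2b": "MEIS2",
}


def report_orthologs(fish_gene: str) -> dict:
    """Return the zebrafish ortholog, the human ortholog (if any), and the name to apply."""
    zebrafish = fish_to_zebrafish.get(fish_gene)
    if zebrafish is None:
        return {"gene": fish_gene, "symbol": None, "zebrafish": None, "human": None}
    return {
        "gene": fish_gene,
        "symbol": zebrafish,                         # nomenclature taken from zebrafish
        "zebrafish": zebrafish,
        "human": zebrafish_to_human.get(zebrafish),  # may be None
    }


for gene in ("fishX_gene1", "fishX_gene2", "fishX_gene3", "fishX_gene4"):
    print(report_orthologs(gene))
```

In this toy data both meis2a and meis2b lead to the single human MEIS2 gene, which is why naming from zebrafish keeps the two paralogs distinct.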
