Tag: RefSeq

RefSeq Release 203 now available

RefSeq release 203 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 2, 2020, and contains 256,340,911 records, including 186,482,096 proteins, 34,176,314 RNAs, and sequences from 105,349 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.

Other announcements: 

RefSeq annotation of mouse GRCm39
RefSeq has finished its initial annotation of the new mouse reference assembly, GRCm39, recently released by the Genome Reference Consortium. This is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38, resolving over 400 issues, almost doubling the scaffold N50, closing almost half the gaps, and adding 1.9 Mb of sequence.

The annotation report for annotation release 109 is available here.

The annotation products are available in the sequence databases and on the FTP site.

New eukaryotic genome annotations
In addition to mouse (GRCm39), this release contains new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 27 species, including:

  • Pallas’s mastiff bat annotation release 100, based on the assembly mMolMol1.p (GCF_014108415.1)
  • Myotis myotis bat annotation release 100, based on the assembly mMyoMyo1.p (GCF_014108235.1)
  • southern grasshopper mouse annotation release 100, based on the new assembly mOncTor1.1 (GCF_903995425.1)
  • American pika (pictured above) annotation release 102, based on the new assembly OchPri4.0 (GCF_014633375.1)
  • pharaoh ant annotation release 102, based on the new assembly ASM1337386v2 (GCF_013373865.1)
  • olive fruit fly annotation release 101, based on the assembly MU_Boleae_v2 (GCF_001188975.3)
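
Each annotation in the list above is tied to an assembly accession such as GCF_014108415.1. If you track these in your own scripts, it can help to split an accession into its prefix, numeric part, and version. The following is a minimal sketch, assuming the usual layout of a GCF_ (RefSeq) or GCA_ (GenBank) prefix, nine digits, and a version number; the names `AssemblyAccession` and `parse_assembly_accession` are hypothetical helpers, not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class AssemblyAccession(NamedTuple):
    """Parts of an assembly accession (hypothetical helper type)."""
    prefix: str   # "GCF" for RefSeq, "GCA" for GenBank
    number: str   # nine-digit identifier, kept as text to preserve leading zeros
    version: int  # increments when the assembly is updated


# Assumed layout: GCF_ or GCA_, nine digits, a dot, then the version.
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(text: str) -> AssemblyAccession:
    """Split an accession like 'GCF_014108415.1' into its parts."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(prefix, number, int(version))


if __name__ == "__main__":
    # Accessions copied from the list in this post.
    for raw in ["GCF_014108415.1", "GCF_903995425.1", "GCF_001188975.3"]:
        acc = parse_assembly_accession(raw)
        print(raw, "-> RefSeq" if acc.prefix == "GCF" else "-> GenBank",
              "version", acc.version)
```

Keeping the numeric part as a string avoids silently dropping leading zeros, which matters when you rebuild the accession later.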

Updated human genome Annotation Release 105.20201022 (GRCh37.p13)
Annotation Release 105.20201022 is an annotation update for the previous human reference assembly, GRCh37.p13 (hg19). This update is not part of the RefSeq FTP release, but the annotation products are available in the sequence databases and on the genomes FTP site.

COVID-19 related human gene annotation now in NCBI RefSeq and Gene
The RefSeq group has compiled a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.

Matched Annotation by NCBI and EMBL-EBI (MANE) version 0.92
NCBI RefSeq and Ensembl/GENCODE announced MANE v0.92, which covers 16,865 genes or ~88% of known human protein-coding genes.

NCBI Datasets

NCBI Datasets now provides downloads of gene data for more than 30 thousand organisms.

Human GRCh37 (hg19) RefSeq annotation update 

The NCBI RefSeq group has been in overdrive, making improvements to our human genome annotation and reference transcript and protein sets, with 8,000 new and 15,000 updated transcripts in the last year alone! That’s about 30% of our curated transcript dataset (the transcripts with NM_ and NR_ accessions), with a big focus on transcripts that are well-expressed, have conserved exons, or are transcribed from new promoters.

With all these improvements, we’ve been updating the RefSeq annotation of GRCh38.p13 every quarter. But what about GRCh37 (hg19), which many of you still use?

New RefSeq annotations for mouse, maize, sunflower and more!

In August and September, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Amphiprion ocellaris (clown anemonefish)
  • Anopheles stephensi (Asian malaria mosquito)
  • Aplysia californica (California sea hare)
  • Bactrocera oleae (olive fruit fly)
  • Branchiostoma floridae (Florida lancelet)
  • Egretta garzetta (little egret)
  • Folsomia candida (springtail)
  • Fundulus heteroclitus (mummichog)
  • Halichoerus grypus (gray seal)
  • Helianthus annuus (common sunflower)
  • Homo sapiens (human)
  • Lynx canadensis (Canada lynx)
  • Molossus molossus (Pallas’s mastiff bat)
  • Monomorium pharaonis (pharaoh ant)
  • Mus musculus (house mouse)
  • Myotis myotis (bat)
  • Neolamprologus brichardi (lyretail cichlid)
  • Oncorhynchus keta (chum salmon)
  • Onychomys torridus (southern grasshopper mouse)
  • Oryzias melastigma (Indian medaka)
  • Phyllostomus discolor (pale spear-nosed bat)
  • Rousettus aegyptiacus (Egyptian rousette)
  • Sander lucioperca (pike-perch)
  • Zea mays (maize)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

Learn more about the annotation of the new mouse reference assembly, GRCm39, here. This is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38.

Announcing the RefSeq annotation of mouse GRCm39!

NCBI RefSeq has finished its initial annotation of the new mouse reference assembly, GRCm39, recently released by the Genome Reference Consortium. This is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38, resolving over 400 issues, almost doubling the scaffold N50, closing almost half the gaps, and adding 1.9 Mb of sequence. It’s a big deal!

Figure 1. The Genome Data Viewer showing the annotation for the mouse pseudoautosomal region, including four genes that were previously missing: Sts, Nlgn4l, Akap17a, and 2510022D24Rik.

NCBI Datasets now provides downloads of gene data for more than 30 thousand organisms

NCBI Datasets now offers Gene tables: customizable tables of the genes you specify, with key gene information, and the ability to easily download a dataset of genomic, transcript and protein sequences.

Drag and drop a list of Gene IDs or gene symbols, and the data table shows your genes with up to 15 columns of metadata, including genomic coordinates, RefSeq transcript and protein accessions, Ensembl IDs and UniProt accessions, and other gene information. You can browse and select items in your table on the web, or download everything to your computer for later analysis (Figure 1).

Figure 1. The Data tables web download. Top panel. Enter or upload a list of gene identifiers or symbols. Bottom panel. The resulting table display allows you to browse results, download the table or the sequence data for the genes (genomic, transcripts, proteins).

The latest in COVID-19 related human gene annotation now in NCBI RefSeq and Gene

Interested in human genes involved in COVID-19 biology? NCBI’s RefSeq group has been hard at work compiling a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.

Figure 1. Top section of the human ACE2 record in the Gene database. COVID-19 information can be found in the Summary and Annotation information sections.

RefSeq Release 202 is public

RefSeq release 202 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of September 8, 2020, and contains 255,571,455 records, including 186,755,483 proteins, 33,077,068 RNAs, and sequences from 104,969 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.

Updated human genome Annotation Release 109.20200815
Updated Annotation Release 109.20200815 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report is available here.

The annotation products are available in the sequence databases and on the FTP site.

This update includes around 15,000 RefSeq transcripts that were revised using CAGE and polyA data to define their 5′ and 3′ ends and to match the GRCh38 reference sequence.

Coronavirus host gene regulatory elements now annotated by RefSeq Functional Elements
The RefSeq Functional Elements project at NCBI has prioritized curation of experimentally validated regulatory elements for human host genes associated with SARS-CoV-2 entry into cells. The annotations include several enhancers, promoters, cis-regulatory elements and protein binding sites, among other feature types. We annotated 236 regulatory features for 27 distinct biological regions, including regulatory elements for the ABO, ACE2, ANPEP, CD209, CLEC4G, CLEC4M, CTSL, DPP4, and TMPRSS2 genes. More information can be found here.

New eukaryotic genome annotations
This release includes new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 27 species, including:

  • maize annotation release 103, based on the new assembly Zm-B73-REFERENCE-NAM-5.0 (GCF_902167145.1)
  • marmoset annotation release 105, based on the new assembly Callithrix_jacchus_cj1700_1.1 (GCF_009663435.1)
  • Chinese hamster annotation release 104, based on the assembly CriGri_1.0 (GCF_000223135.1) and the new assembly CriGri-PICRH-1.0 (GCF_003668045.3)
  • Asian giant hornet annotation release 100, based on the new assembly V.mandarinia_Nanaimo_p1.0 (GCF_014083535.2)
  • Florida lancelet annotation release 100, based on the new assembly Bfl_VNyyK (GCF_000003815.2)
  • Anopheles stephensi annotation release 100, based on the new assembly UCI_ANSTEP_V1.0 (GCF_013141755.1)

Updated and improved collection of RefSeq representative genome assemblies now available
The collection of representative genome assemblies for Bacteria and Archaea contains 11,727 prokaryotic assemblies to represent their respective species. More information can be found here.

Updated protein family models used by PGAP available for download
Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available.

This release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. More information can be found here.

Future change: Mouse Reference Assembly Update
RefSeq annotation of the new mouse GRCm39 assembly is in progress, and is expected to be included in the next release.

Updated and improved collection of RefSeq representative genome assemblies now available

We have updated the collection of representative genome assemblies for Bacteria and Archaea. As announced in April, this set is now recalculated three times a year. We selected a total of 11,727 prokaryotic assemblies to represent their respective species among the 192,000 assemblies in RefSeq. Six hundred and thirty-five species were included in the collection for the first time, while 395 organisms from undefined species (such as Bacillus bacterium) were removed.

For 18% of bacterial and archaeal species, we were able to choose a higher-quality representative than in the previous set, thanks to improvements in the selection logic. The selection is now based on assembly length, the number of pseudo CDSs called in the PGAP annotation, the number of scaffolds, whether Gene IDs are available in the Gene database for the assembly that is currently representative, and type strain status. You can see the exact criteria in order of importance on the Prokaryotic RefSeq Genomes page. Now that the new selection process is in place, we expect future updates to have fewer changes. We will replace a representative only if the assembly has changed RefSeq status or if a substantially better assembly becomes available.

To reflect these changes, we have updated the database on the Microbial Nucleotide BLAST page as well as the basic nucleotide BLAST RefSeq Representative Genome Database.

You can download the reference and representative set from the Assembly resource. If you are interested in the annotation on these genomes, you can limit searches to proteins annotated on representative genomes by adding “refseq_select[filter]” to any query in the Protein database. For example, you can find all proteins annotated on representative genomes in the genus Klebsiella by using the query: “Klebsiella[organism] AND refseq_select[filter]”. A BLAST database of proteins annotated on representative genomes will be coming soon. Stay tuned!
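
If you prefer to run the Klebsiella query from a script rather than the web page, one option is the E-utilities ESearch endpoint. The sketch below only builds the request URL for the query quoted above; `build_esearch_url` and `count_hits` are hypothetical helper names, the `retmax` default is an arbitrary choice, and `count_hits` assumes the JSON response carries the hit count under `esearchresult`. It is defined but not called here, so nothing contacts NCBI unless you invoke it yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities ESearch endpoint.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "protein", retmax: int = 20) -> str:
    """Return an ESearch URL for `term` in database `db` (hypothetical helper)."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_BASE}?{urlencode(params)}"


def count_hits(url: str) -> int:
    """Fetch the URL and return the reported hit count.

    Assumes the JSON layout {"esearchresult": {"count": "..."}}.
    Makes a network request, so it is not called in this sketch.
    """
    with urlopen(url) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])


if __name__ == "__main__":
    # The query is copied from the post; only the URL is printed.
    query = "Klebsiella[organism] AND refseq_select[filter]"
    print(build_esearch_url(query))
```

Swapping the genus name in the query string is enough to run the same search for another group of organisms.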

New annotations in RefSeq: white-tufted-ear marmoset, ruddy duck, and more

In June and July, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Acipenser ruthenus (sterlet)
  • Anguilla anguilla (European eel)
  • Aphantopus hyperantus (ringlet)
  • Callithrix jacchus (white-tufted-ear marmoset)
  • Chelonus insularis (wasp)
  • Cricetulus griseus (Chinese hamster)
  • Cygnus atratus (black swan)
  • Drosophila subobscura (fly)
  • Electrophorus electricus (electric eel)
  • Etheostoma cragini (Arkansas darter)
  • Hippoglossus stenolepis (Pacific halibut)
  • Mirounga leonina (Southern elephant seal)
  • Morone saxatilis (striped sea-bass)
  • Mus musculus (house mouse)
  • Oxyura jamaicensis (ruddy duck)
  • Pan paniscus (pygmy chimpanzee)
  • Populus alba (eudicot)
  • Scophthalmus maximus (turbot)
  • Spodoptera frugiperda (fall armyworm)
  • Stegodyphus dumicola (spider)
  • Vitis riparia (eudicot)
  • Zootoca vivipara (common lizard)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

New interaction data, downloads and track hub available for RefSeq Functional Elements 

We’ve added several new enhancements to the RefSeq Functional Elements dataset, which provides genome annotation and richly annotated RefSeq and Gene records for experimentally validated non-genic functional regions in human and mouse. Read on to see what we’ve done!
