Using the NIH Comparative Genomics Resource (CGR) to gain knowledge about less-researched organisms
The scientific community relies heavily on model organism research to gain knowledge and make discoveries. However, focusing solely on these species misses valuable variation. Comparative genomics allows us to use knowledge from a model species, such as Saccharomyces cerevisiae, to understand traits in other, related organisms, such as Saccharomyces pastorianus or Saccharomyces eubayanus. Applying this information may provide valuable insight for other less-researched organisms. The National Institutes of Health (NIH) Comparative Genomics Resource (CGR) offers a cutting-edge NCBI toolkit of high-quality genomics data and tools to help you do just that. Continue reading “Comparing Yeast Species Used in Beer Brewing and Bread Making”
RefSeq release 220 is now available online and from the FTP site. You can access RefSeq data through NCBI Datasets.
What’s included in this release?
As of September 5, 2023, this full release incorporates genomic, transcript, and protein data containing:
- 391,350,361 records
- 289,333,423 proteins
- 56,423,426 RNAs
- sequences from 141,099 organisms
Continue reading “RefSeq Release 220”
An updated bacterial and archaeal reference genome collection is available! This collection of 18,343 genomes was built by selecting exactly one genome assembly for each species among the 312,000+ prokaryotic genomes in RefSeq, except for E. coli, for which two assemblies were selected as references.
The criteria for selecting the reference assembly for a given species include assembly contiguity, assembly completeness, and the quality of the RefSeq annotation.
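The selection idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Assembly` fields, the metric names (`contig_n50`, `completeness`, `annotation_score`), the order in which the criteria are compared, and the placeholder accessions are hypothetical and do not reflect how NCBI actually scores or ranks assemblies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assembly:
    accession: str           # placeholder identifier, not a real RefSeq accession
    species: str
    contig_n50: int          # hypothetical contiguity metric (larger is better)
    completeness: float      # hypothetical completeness fraction in [0, 1]
    annotation_score: float  # hypothetical annotation-quality score in [0, 1]


def pick_references(assemblies, per_species_quota=None):
    """Pick the best-ranked assemblies for each species.

    per_species_quota maps a species name to how many references to keep;
    species not listed get exactly one.
    """
    quota = per_species_quota or {}
    by_species = {}
    for asm in assemblies:
        by_species.setdefault(asm.species, []).append(asm)

    references = {}
    for species, candidates in by_species.items():
        # Assumed ranking: completeness first, then contiguity, then annotation.
        ranked = sorted(
            candidates,
            key=lambda a: (a.completeness, a.contig_n50, a.annotation_score),
            reverse=True,
        )
        references[species] = ranked[: quota.get(species, 1)]
    return references


# Made-up example data.
assemblies = [
    Assembly("ASM_A", "Species one", 5_000_000, 0.99, 0.95),
    Assembly("ASM_B", "Species one", 200_000, 0.97, 0.90),
    Assembly("ASM_C", "Escherichia coli", 4_600_000, 0.99, 0.98),
    Assembly("ASM_D", "Escherichia coli", 5_400_000, 0.99, 0.97),
    Assembly("ASM_E", "Escherichia coli", 100_000, 0.90, 0.80),
]

refs = pick_references(assemblies, per_species_quota={"Escherichia coli": 2})
for species, chosen in refs.items():
    print(species, [a.accession for a in chosen])
```

The per-species quota is only a convenient way to model the E. coli exception; any real selection would depend on the actual metrics and thresholds used by RefSeq.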
- 790 species were added to the collection
- 199 species are represented by a better assembly (compared to the April 2023 release)
- 70 species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment
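As a quick consistency check on these numbers, the sketch below assumes that the 17,623-genome collection announced in the earlier post further down this page is the immediately preceding (April 2023) release. Under that assumption, the additions and removals account exactly for the new total. Species that merely switched to a better assembly do not change the count.

```python
# Figures quoted in the two collection announcements on this page.
previous_total = 17_623   # earlier collection (assumed to be the April 2023 release)
added = 790               # species added
removed = 70              # species removed
reported_total = 18_343   # current collection

expected_total = previous_total + added - removed
print(expected_total)                    # 18343
print(expected_total == reported_total)  # True

# One assembly per species, except E. coli, which has two.
species_represented = reported_total - 1
print(species_represented)               # 18342
```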
Continue reading “Now Available! Updated Bacterial and Archaeal Reference Genomes Collection”
In April, May, and June, the NCBI Eukaryotic Genome Annotation Pipeline released eighty-two new annotations in RefSeq!
- Homo sapiens (human) T2T-CHM13v2.0 now includes many more alternative splice variants
- Homo sapiens (human) GRCh38.p14 includes all transcripts from MANE v1.2 and over 78,000 new RefSeq Functional Element (RefSeqFE) features added since our last annotation in 2022
- Mus musculus (house mouse) GRCm39 integrates curation for over 3,000 genes and 14,000 transcripts since September 2020
- Rattus norvegicus (Norway rat) mRatBN7.2 includes curation of over 5,000 genes since our last annotation in 2021
New annotations: Continue reading “New Annotations in RefSeq!”
RefSeq release 219 is now available online and from the FTP site. You can access RefSeq data through NCBI Datasets.
What’s included in this release?
As of July 18, 2023, this full release incorporates genomic, transcript, and protein data containing:
- 371,291,248 records
- 3,752,372,037,103 nucleotide bases
- 106,842,615,422 amino acids
- sequences from 138,491 organisms
The release is provided in several directories as a complete dataset and divided by logical groupings.
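For a rough sense of growth between releases, the sketch below compares the figures quoted on this page for release 219 (July 18, 2023) and release 220 (September 5, 2023). Only the record and organism counts are listed for both releases here, so the comparison is limited to those two; the variable names are illustrative.

```python
# Counts quoted in the release 219 and release 220 announcements.
release_219 = {"records": 371_291_248, "organisms": 138_491}
release_220 = {"records": 391_350_361, "organisms": 141_099}

growth = {}
for key in release_219:
    delta = release_220[key] - release_219[key]
    percent = 100 * delta / release_219[key]
    growth[key] = (delta, percent)
    print(f"{key}: +{delta:,} ({percent:.2f}%)")
```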
Updates & announcements
Continue reading “RefSeq Release 219”
Do you need to work with variant data mapped to historical human RefSeq transcript versions? To make it easier to map your data to the current GRCh38 reference genome and MANE transcripts, we’re now providing a collection of RefSeq transcript alignments including both the latest versions in the GCF_000001405.40-RS_2023_03 annotation release, and older transcripts going back to 1999. The data are available for download from the FTP site.
As shown in the example below (Image 1), you can view these alignments in the Genome Data Viewer by loading the remote BAM track (GCF_00001405-RS_2023_03_knownrefseqs_aln.bam) from the FTP site. Continue reading “Now Available! Access to Historical Human Transcript Alignments”
Do you work with or study prokaryotic proteins? As previously announced, we’ve been adding Gene Ontology (GO) terms to RefSeq prokaryotic protein sequence records (example below) to standardize the language used to describe the functions of genes and their products. Over 100 million RefSeq proteins from prokaryotes now have at least one GO term, a 55% increase since we started propagating GO terms from Conserved Domain Database (CDD) architectures in March. Continue reading “Gene Ontology (GO) Terms on 100M+ RefSeq Prokaryotic Protein Sequence Records”
As previously announced, we are continuously curating a better Prokaryotic Reference Genomes Collection. An updated bacterial and archaeal reference genome collection is now available! This collection of 17,623 genomes was built by selecting exactly one genome assembly for each species among the 283,000+ prokaryotic genomes in RefSeq, except for E. coli, for which two assemblies were selected as references.
- 480 species were added to this collection
- 178 species are represented by a better assembly
- 17 species were removed due to changes in NCBI Taxonomy or uncertainty in their species assignment
Continue reading “New Release! Updated Bacterial and Archaeal Reference Genomes Collection Now Available”
In February and March, the NCBI Eukaryotic Genome Annotation Pipeline released forty-two new annotations in RefSeq for the organisms listed below. Additionally, interim builds for over sixty species were run during that time period to fix some issues with gene symbol assignment.