Do you need to compare and combine data based on NCBI RefSeq and UniProt datasets, and aren’t sure which proteins are comparable? For many years, NCBI Gene has provided information about the relationships between RefSeq and UniProt accessions courtesy of data imported from UniProt, but the tremendous growth of both datasets has led to large gaps in the data. We have developed a new process to compare the two datasets, first looking for 100% identical proteins and then checking the remaining sequences for similar matches in related taxa. The result is mapping information now covering over 170 million RefSeq proteins across the tree of life.
You can find links to related UniProt accessions on individual NCBI Gene records. The entire dataset is available on our FTP site. Continue reading “Now Available! Compare NCBI RefSeq and UniProt Datasets”
Effective June 2023, the HomoloGene records will redirect to the Datasets Gene Table
Do you use HomoloGene to view and download data? You can now access updated homology data from NCBI Datasets through the Datasets Gene Table with connections to NCBI Orthologs. Go directly from a HomoloGene record to the Datasets Gene Table, which gives you access to up-to-date sequence data and metadata. NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
The Datasets Gene Table links to the NCBI Ortholog interface (Figure 1), which offers the following data:
- Orthology data based on an updated algorithm that identifies orthologs spanning more than 500 vertebrate species
- Similar gene data based on protein architectures that spans all eukaryotes
Continue reading “New Way to View and Download Related Genes”
Join us October 25-29 in Los Angeles, CA
We are looking forward to seeing you in person at the American Society of Human Genetics (ASHG) annual meeting, October 25-29, 2022, in Los Angeles, California.
We will present a variety of talks and posters featuring our clinical and human genetic resources, as well as genome products and tools. We are excited to introduce the NIH Comparative Genomics Resource (CGR), a multi-year National Library of Medicine (NLM) project to maximize the impact of eukaryotic research organisms and their genomic data resources on biomedical research. If you’re interested in providing feedback that will be used to help drive CGR forward, consider joining our round table discussion.
Check out NCBI’s schedule of activities and events:
Continue reading “Connect with NCBI at ASHG 2022”
NCBI Gene now has descriptive information about genes from the Alliance of Genome Resources for organisms including Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Homo sapiens, Mus musculus, Rattus norvegicus, and Saccharomyces cerevisiae.
Figure 1. The gene summary section of the Drosophila melanogaster slmb Gene Full Report showing the link to the corresponding record at the Alliance of Genome Resources.
The Summary section of the Gene Full Report page has links to gene pages at the Alliance of Genome Resources (Figure 1). These links also appear in the right-hand sidebar of the Links to other resources section. For genes that don’t have a RefSeq summary, we use the textual gene descriptions from the Alliance of Genome Resources.
The Drosophila slmb gene record shows the enhancements provided by the Alliance of Genome Resources. The gene_info.gz files on the Gene FTP site also include AllianceGenome references in the dbXrefs column.
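As a rough illustration of how the AllianceGenome references could be pulled out of a gene_info file, here is a minimal Python sketch. It assumes the file is tab-delimited with a header line naming `GeneID` and `dbXrefs` columns, that cross-references are pipe-separated `database:identifier` entries, and that `-` marks an empty field. The function name `alliance_xrefs` and the sample rows (including the gene IDs and identifiers) are hypothetical and only there to make the example runnable.

```python
import csv
import io


def alliance_xrefs(gene_info_text):
    """Return {GeneID: [AllianceGenome identifiers]} from gene_info-style text.

    Assumes a tab-delimited table whose header names 'GeneID' and 'dbXrefs'
    columns, with dbXrefs written as 'db:id|db:id|...' and '-' when empty.
    """
    reader = csv.DictReader(
        io.StringIO(gene_info_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    result = {}
    for row in reader:
        xrefs = row["dbXrefs"]
        if xrefs == "-":
            continue  # no cross-references recorded for this gene
        ids = []
        for entry in xrefs.split("|"):
            # Split on the first colon only; the identifier may contain colons.
            db, _, ident = entry.partition(":")
            if db == "AllianceGenome":
                ids.append(ident)
        if ids:
            result[row["GeneID"]] = ids
    return result


# Hypothetical sample rows, reduced to four columns for brevity.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tdbXrefs\n"
    "7227\t1001\tgeneA\tFLYBASE:FBgn0000001|AllianceGenome:FB:FBgn0000001\n"
    "7227\t1002\tgeneB\t-\n"
)

print(alliance_xrefs(SAMPLE))
```

For a real download, the same function could be fed the decompressed text, for example by reading the file with `gzip.open(path, "rt")`; the full files are large, so streaming line by line may be preferable to loading everything at once.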
Announcing a new feature in NCBI Datasets: the gene table.
To access it, start from the human species page (Figure 1) and click View all genes to view a table of all human genes.
Figure 1: Human species page. Click “View all genes” to view a table of human genes.
Continue reading “Try out the new gene table from NCBI Datasets!”
NCBI Gene has added Ensembl Rapid Releases to the calculation of matching annotations between NCBI RefSeq and Ensembl. This has resulted in the inclusion of over 60 additional assemblies for a total of 241 organisms represented in the set. Matches are made based on transcript and CDS comparisons, and Ensembl gene, transcript, and protein identifiers for annotations similar to the NCBI RefSeq annotations are reported in NCBI Gene and in the gene2ensembl file on the Gene FTP site. The Ensembl annotation is also available in the graphical view and in NCBI’s Genome Data Viewer to give you a side-by-side view of how the annotations compare. Check out blue whale E2F1 for an example.
Figure 1. Balaenoptera musculus E2F transcription factor 1 in Genome Data Viewer
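For readers who want to work with the gene2ensembl file directly, the following Python sketch collects the Ensembl gene identifiers matched to each NCBI Gene ID for one organism. It assumes a tab-delimited file with a `#`-prefixed header line and with the taxonomy ID, Gene ID, and Ensembl gene identifier in the first three columns; check the README on the Gene FTP site before relying on that layout. The function name `ensembl_matches` and every identifier in the sample rows are hypothetical placeholders.

```python
import csv
import io
from collections import defaultdict


def ensembl_matches(gene2ensembl_text, tax_id):
    """Return {GeneID: sorted Ensembl gene IDs} for one taxonomy ID.

    Assumed column order (not verified here): tax_id, GeneID,
    Ensembl gene identifier, followed by transcript and protein columns.
    """
    reader = csv.reader(
        io.StringIO(gene2ensembl_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    matches = defaultdict(set)
    for fields in reader:
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the header
        if fields[0] == tax_id:
            # One gene usually has several rows (one per matched transcript),
            # so a set keeps each Ensembl gene ID only once.
            matches[fields[1]].add(fields[2])
    return {gene: sorted(ids) for gene, ids in matches.items()}


# Hypothetical rows; accessions and identifiers are placeholders.
SAMPLE = (
    "#tax_id\tGeneID\tEnsembl_gene\tRNA_acc\tEnsembl_rna\tprotein_acc\tEnsembl_protein\n"
    "9606\t111\tENSG_PLACEHOLDER_1\tNM_1.1\tENST_PLACEHOLDER_1\tNP_1.1\tENSP_PLACEHOLDER_1\n"
    "9606\t111\tENSG_PLACEHOLDER_1\tNM_2.1\tENST_PLACEHOLDER_2\tNP_2.1\tENSP_PLACEHOLDER_2\n"
    "10090\t222\tENSG_PLACEHOLDER_2\tNM_3.1\tENST_PLACEHOLDER_3\tNP_3.1\tENSP_PLACEHOLDER_3\n"
)

print(ensembl_matches(SAMPLE, "9606"))
```

Grouping by Gene ID rather than by transcript keeps the output small; the transcript and protein columns could be collected the same way if a finer-grained comparison is needed.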
NCBI Datasets, the new set of services for downloading genome assembly and annotation data (previous Datasets posts), has redesigned and reorganized web pages to make it easier to find and access the services and documentation you need.
NCBI Datasets has a fresh new homepage (Figure 1) highlighting the types of data available through our tools. Available data include genome assemblies, genes, and SARS-CoV-2 genomic and protein data. You can easily access these from the new page or learn more with our new documentation pages.
Figure 1. Features of the new Datasets homepage with quick access to help documentation including the Quickstart and How-to guides as well as access to Genome, Gene, and Coronavirus Data, and the Datasets and Dataformat command-line tools. Continue reading “New NCBI Datasets home and documentation pages provide easier access”
Important Note: Please see our latest documentation on how to download gene ortholog data. The commands below have been deprecated in the latest version of the NCBI Datasets command-line tools.
You can now get gene ortholog data with the NCBI Datasets command-line tool by supplying a gene ID, gene symbol, or RefSeq nucleotide or protein accession. Data are available for vertebrates and insects. The vertebrate orthologs include a specialized set for fish. (See our recent post for more information on the orthologs for fish and insects.)
You can retrieve metadata for gene orthologs in JSON format, or you can download a compressed (zip) archive containing both metadata and sequences (Figure 1).
Figure 1. Command lines that use a gene symbol (BRCA1) to retrieve mammalian ortholog metadata (top, JSON metadata shown in part in the image) and sequences (bottom).
Continue reading “The Datasets command-line tool now provides ortholog data”
NCBI RefSeq has finished its initial annotation of the new rat reference assembly, mRatBN7.2, recently released by the Darwin Tree of Life Project at the Wellcome Sanger Institute. This is the first coordinate-changing update to the rat reference since the 2014 release of Rnor_6.0 from the Rat Genome Sequencing Consortium and brings the rat assembly into the modern age with a nearly 300x increase in contig N50 and 9x increase in scaffold N50 lengths. It’s a major improvement!
Continue reading “Announcing the RefSeq annotation of rat mRatBN7.2!”
In March, we announced NCBI Datasets, a new resource that lets you easily retrieve and download data from across NCBI databases. Did you know you can now fetch NCBI Gene data programmatically using the NCBI Datasets API or command-line tool? Quickly retrieve both metadata and gene sequence data for multiple Gene records including transcripts and proteins in one shell command or API request. The API documentation is a good way to get started with programmatic access (Figure 1).
Figure 1. The Datasets API documentation showing a demonstration retrieving Gene metadata using RefSeq mRNA accessions. The API returns a readily processed JSON object.
Continue reading “Programmatic access to Gene data using Datasets command-line and API”