GenBank release 219.0 (4/14/2017) has 200,877,884 traditional records containing 231,824,951,552 base pairs of sequence data. In addition, there are 451,840,147 WGS records containing 2,035,032,639,807 base pairs of sequence data, 165,068,542 TSA records containing 149,038,907,599 base pairs of sequence data, as well as 1,438,349 TLS records containing 636,923,295 base pairs of sequence data.
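For readers who want the combined size of the release, the per-division figures above can be totaled with a few lines of Python. This is only a sketch: the numbers are copied from the paragraph above, and the names `RELEASE_219`, `release_totals`, and `mean_record_length` are made up for illustration and are not part of any NCBI tool.

```python
# Figures copied from the GenBank release 219.0 summary above:
# division name -> (number of records, number of base pairs).
RELEASE_219 = {
    "traditional": (200_877_884, 231_824_951_552),
    "WGS": (451_840_147, 2_035_032_639_807),
    "TSA": (165_068_542, 149_038_907_599),
    "TLS": (1_438_349, 636_923_295),
}


def release_totals(divisions):
    """Return (total_records, total_base_pairs) summed over all divisions."""
    total_records = sum(records for records, _ in divisions.values())
    total_bases = sum(bases for _, bases in divisions.values())
    return total_records, total_bases


def mean_record_length(divisions):
    """Return the average number of base pairs per record for each division."""
    return {name: bases / records for name, (records, bases) in divisions.items()}


if __name__ == "__main__":
    records, bases = release_totals(RELEASE_219)
    print(f"{records:,} records, {bases:,} base pairs")
    for name, mean_len in mean_record_length(RELEASE_219).items():
        print(f"{name}: about {mean_len:,.0f} bp per record")
```

Keeping the counts as a dictionary of tuples makes it easy to swap in the figures from a later release without touching the functions.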
At the March 2017 NCBI Genomics Hackathon, participants developed six functional software prototypes, several of which are still under active development. Software is available from the NCBI-Hackathons GitHub site.
- Squidstream provides naming consistency by converting sequence feature IDs in entire files (bed, gff3, wig, etc.) to the desired ID format using a single command.
- ga4gh-ncbi-api is a method that links NCBI’s API and the GA4GH (Global Alliance for Genomics and Health) API, and generates a searchable list of genome datasets from NCBI.
- Graph_Extraction provides code to implement a simple graph genome browser.
- Sidearm searches the SRA database for viruses using the NCBI magicBLAST tool.
- Scan2CNV is a command-line tool that generates copy number variation (CNV) calls from raw SNP array data.
- Single Cell Reproducible Epigenomics Workflow (SCREW) is a single-cell whole-genome bisulfite sequencing (SC-WGBS) pipeline and Docker image for performing standard single-cell DNA methylation analyses.
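To give a feel for the kind of ID conversion described for Squidstream above, here is a minimal Python sketch that renames the sequence ID in the first column of tab-delimited annotation lines (as in BED or GFF3). It is not Squidstream's code or interface: the function `convert_seqids` and the small `ID_MAP` table are assumptions made for illustration, and formats such as wig that carry the ID elsewhere are not handled.

```python
# Illustrative mapping only: two GRCh38 RefSeq chromosome accessions and the
# UCSC-style names a user might want instead. A real table would be larger.
ID_MAP = {
    "NC_000001.11": "chr1",
    "NC_000002.12": "chr2",
}


def convert_seqids(lines, id_map):
    """Yield lines with the first tab-separated column renamed via id_map.

    Blank lines and lines starting with '#', 'track', or 'browser' are passed
    through unchanged, as are sequence IDs that are missing from id_map.
    Trailing newlines are stripped from every yielded line.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith(("#", "track", "browser")):
            yield stripped
            continue
        seqid, sep, rest = stripped.partition("\t")
        yield id_map.get(seqid, seqid) + sep + rest


if __name__ == "__main__":
    example = [
        "track name=example",
        "NC_000001.11\t100\t200\tfeature1",
        "NC_000002.12\t300\t400\tfeature2",
    ]
    for converted in convert_seqids(example, ID_MAP):
        print(converted)
```

Leaving unknown IDs untouched rather than raising an error is a design choice for this sketch; a stricter tool might prefer to report them.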
In the past month, the NCBI Eukaryotic Genome Annotation Pipeline has released new annotations in RefSeq for the following organisms:
- Zea mays (maize)
- Labrus bergylta (ballan wrasse)
- Monopterus albus (swamp eel)
- Corvus cornix cornix (hooded crow)
- Prunus persica (peach)
- Rhincodon typus (whale shark)
- Oncorhynchus kisutch (coho salmon)
- Pseudomyrmex gracilis (ant)
See more details on the Eukaryotic RefSeq Genome Annotation Status page.
Next week, NCBI staff will show you how to quickly find and download human genome annotations from both the web and the command line for incorporation into your workflows. We will also show you how to convert the accessions in these files to those used in other bioinformatics databases, as well as how to visualize these annotations on our Genome Data Viewer.
Date and time: Wednesday, May 10, 2017 12:00 PM – 12:30 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar.
After the live presentation, the webinar will be uploaded to the NCBI YouTube channel. Any related materials will be accessible from the Webinars and Courses page; you can also learn about future webinars there.
NCBI is pleased to offer a direct entry point to the NCBI Genome Data Viewer (GDV) that supports the exploration, visualization and analysis of eukaryotic RefSeq genome assemblies.
The new GDV homepage includes an interactive interface for a quick overview of supported organisms, specific genome searches, and links to Assembly and RefSeq annotation resources. About 100 genome assemblies are now ready for GDV exploration, with more on the way. Stay tuned!
New icons are starting to appear in PubMed that take you directly to free full-text publications uploaded to an institutional repository (IR).
The icons appear only when there is no free full text available from the journal or PMC (PubMed Central). So far, only four IRs with eligible publications are participating; you can see which ones they are here. They already expand access to around 25,000 publications.
The NCBI program that enables this is LinkOut. You can read more about it in the NLM Technical Bulletin. IRs can apply by email to join LinkOut. And if you are an author at an institution with a repository, support your IR and enable more people to read your work.
NCBI’s RefSeq project provides comprehensive annotation of the human and other eukaryotic genomes through a combination of curation and an evidence-based eukaryotic genome annotation pipeline. Our curated records, ‘Known RefSeqs’, can be identified by the accession prefix (NM_, NR_, NG_, NP_). Model RefSeq records (XM_, XR_, and XP_ accession prefixes) are predicted based on transcript evidence (RNA-Seq and more) and protein support from Known RefSeqs, Swiss-Prot, and select INSDC records.
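The prefix rule above is easy to apply in a script. The following Python sketch sorts accession strings into Known and Model RefSeq categories using only the prefixes listed in this post. The function name `refseq_category` and the sample accessions are made up for illustration, and any other prefix is simply reported as "other".

```python
# Prefixes taken from the paragraph above.
KNOWN_PREFIXES = ("NM_", "NR_", "NG_", "NP_")  # curated "Known RefSeqs"
MODEL_PREFIXES = ("XM_", "XR_", "XP_")         # predicted "Model RefSeqs"


def refseq_category(accession):
    """Return 'known', 'model', or 'other' based on the accession prefix."""
    acc = accession.strip().upper()
    if acc.startswith(KNOWN_PREFIXES):
        return "known"
    if acc.startswith(MODEL_PREFIXES):
        return "model"
    return "other"


if __name__ == "__main__":
    # Sample strings for illustration; they are not meant to refer to
    # particular records.
    for acc in ["NM_000001.1", "XP_000002.1", "NC_000003.1"]:
        print(acc, refseq_category(acc))
```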
We recognize that many scientists access genome annotation data from one of three sources: NCBI, Ensembl, or UCSC. NCBI provides access to the human (and other) genome annotation results in the Genome Data Viewer, by BLAST and FTP, and per gene in NCBI’s Gene resource. Ensembl provides RefSeq annotation information based directly on the FTP content that NCBI releases. In the past, UCSC has provided a partial dataset of RefSeq human genome annotation content by aligning Known RefSeq transcripts to the genome using BLAT. Using this approach, additional model RefSeq transcript variants, non-transcribed pseudogenes, and immunoglobulin and T-cell receptor regions were not available through UCSC services. In rare cases, the independent alignment method resulted in small differences in exon structure compared to NCBI’s placement details, as well as some ambiguous placements for transcripts originating from very similar paralogs that are uniquely placed within the NCBI dataset.
This blog post is directed toward all authors who have articles in PubMed.
Have you ever discovered that your name isn’t spelled correctly in the citation on a PubMed record, or that there are mistakes in your affiliation, the title of the abstract, or other citation data?
We have good news: recently, NLM released the PubMed Data Management System (PMDM), which allows publishers to correct PubMed citation data directly. If you’re an author who has found citation mistakes in PubMed, you should contact the publisher of the journal, and they will make the changes. Changes made in PMDM should appear in PubMed within 1 to 2 days.
Authors who report citation errors to NLM will be asked to contact the publisher directly. However, NLM will continue to investigate and address error reports that relate to our value-added data, such as MeSH Headings.
We hope this new system will shorten and simplify the correction of citation errors. You can read more about PMDM in the NLM Technical Bulletin. Please let us know if you have questions or comments. We’re looking forward to more error-free citations!
Annotation Release 101 for the bottlenose dolphin (Tursiops truncatus) is out in RefSeq! This annotation was based on the NIST Tur_tru v1 assembly, which has a four-fold increase in contiguity over the assembly used in the previous annotation. Over four billion RNA-Seq reads from skin and blood tissue were used for gene prediction. As a result of these improvements, the percentage of partially represented protein-coding genes went down from 24% to 4%. More than 2,500 genes that were fragmented in the previous assembly were merged into complete genes. A total of 24,026 genes were annotated, of which 17,096 are protein-coding. A full report on the annotation can be found here.
Subscribe to the NCBI YouTube channel to receive alerts about new videos ranging from quick tips to full webinar presentations.