We are pleased to announce the second installment of the Virus Hunting Codeathon that will take place from November 4-6, 2019 at the University of Maryland in College Park.
The NCBI will help run this bioinformatics codeathon, hosted by the UMIACS and CBCB at the University of Maryland. The purpose of this event is to continue developing techniques, code, and pipelines to identify known, taxonomically definable, and novel viruses from metagenomic datasets on cloud infrastructure.
This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for virological analyses from high-throughput experiments. We especially encourage people who have experience in Computational Virus Hunting or related fields to participate. The event is open to anyone selected for the codeathon and willing to travel to College Park (see below).
- Fast, federated indexing
- Metadata features
- Genome graphs for viruses
- Approximate taxonomic analysis
- Domain/HMM Boundary and Taxonomic Refinement
- Bringing together approximate taxonomy and domain models
- Sequence data quality metrics
- Phage-host interactions
We will provide the final list of projects before the codeathon starts.
Have you ever searched for a variant in ClinVar with a gene symbol and a c., and wondered why you got no result? Is the variant not in ClinVar, or was something wrong with your search?
Wonder no more – we’ve improved searching in ClinVar so you get results for a gene symbol and c. more often!
While a gene symbol and c. make an ambiguous query and a full HGVS expression is always the best search term, this new service will help you find the variant when gene symbol and c. are all the information that you have.
Validation issues can delay the processing of your submissions to GenBank. To avoid one type of delay, use the new “expected genome size” API to check the length of your genome assembly before submission.
The API compares the size of submitted genome assemblies to the expected genome size range for the species to identify outliers that can result from errors such as:
- incorrect organism assignment
- metagenome submitted as an organism genome
- targeted sub-genome assembly not flagged as partial genome representation
- gross contamination with other sequences
You can check in advance for these possible problems using the API. The API takes two inputs: the taxid for the species (taxid = Taxonomy ID; see our Taxonomy quick start guide on how to find the taxid for a given species) and the length of your assembly (excluding gaps and runs of Ns). It returns XML with the expected length, the acceptable range, and a status that tells you whether your assembly is too large, too small, or within the acceptable range. Look for <length_status>within_range</length_status>, which confirms that your sequence passes the test!
Try the following examples:
For more information, see the Genome Size Check documentation.
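If you want to script the check, the status test described above might look something like the following minimal sketch. It only parses a response that has already been retrieved and does not call the API itself. Only the <length_status> element and the value within_range come from the description above; the sample response, its other element names, its numbers, and the function names are hypothetical and used purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response for illustration only. The <length_status> element and
# the "within_range" value come from the post; every other element name and
# all numbers below are assumptions, not the documented response format.
SAMPLE_RESPONSE = """\
<genome_size_response>
  <expected_length>5000000</expected_length>
  <minimum_length>4000000</minimum_length>
  <maximum_length>6000000</maximum_length>
  <length_status>within_range</length_status>
</genome_size_response>
"""


def length_status(xml_text):
    """Return the text of the first <length_status> element, or None if absent."""
    root = ET.fromstring(xml_text)
    node = next(root.iter("length_status"), None)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def passes_size_check(xml_text):
    """True only when the response reports the assembly as within range."""
    return length_status(xml_text) == "within_range"


if __name__ == "__main__":
    print(passes_size_check(SAMPLE_RESPONSE))
```

Treating a missing status as a failure, rather than as a pass, keeps a malformed response from being mistaken for a valid assembly.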
As part of our ongoing effort to improve your search experience, we’ve made it easier for you to find the sequence of your favorite organelle genome plus all the information and data associated with it. To find organelle genomes, search for an organism name combined with an organelle description, for example human mitochondrion, tomato chloroplast or Toxoplasma gondii RH apicoplast.
A new results panel will appear with links to the organelle genome sequence, annotated genes, and related phylogenetic and population studies. The panel appears with these searches in an All Databases search or within any of NCBI’s sequence databases, including Gene, Nucleotide, Protein, Genome, and Assembly. For the human mitochondrial genome, a graphical schematic of the genome allows you to navigate to individual mitochondrially encoded genes (Figure 1).
Figure 1. The organelle genome results for a search with human mitochondrion. The panel provides access to analysis tools, downloads, and other relevant results. Clicking any of the gene objects on the genome graphic leads to the relevant Gene record, for example Gene ID: 4512 in the case of COX1.
Try it out using the following example searches and let us know what you think!
On Wednesday, September 25, 2019 at 12 PM, NCBI staff will present a webinar on the new My Bibliography, a central place to save and share your citations. You can add PubMed citations, create them manually, or upload them from citation managers. In this webinar you will learn how to navigate the new interface, receive a few helpful tips to make your experience easier, get a sneak peek of features under development, and learn how you can help us improve My Bibliography by providing feedback.
- Date and time: Wed, Sep 25, 2019 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
You can now access RefSeq release 96 online, from the FTP site, and through NCBI’s Entrez programming utilities (E-utilities).
This full release incorporates genomic, transcript, and protein data available as of September 9, 2019, and contains 213,863,503 records, including 152,910,397 proteins, 28,017,380 RNAs, and sequences from 94,946 organisms.
The release is provided as a complete dataset and also in several directories divided by logical groupings.
1. New Mus musculus (house mouse) Annotation Release 108
The latest annotation run for Mus musculus, 108, is a complete re-annotation of the mouse GRCm38.p6 assembly that incorporates ongoing curation work and new computed models based on extensive long-read transcriptome data.
See the annotation report for details. You can access these annotation products through the sequence databases and on the FTP site.
2. Updated Homo sapiens Annotation Release 109.20190905
Annotation Release 109.20190905 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report has details. You can access the annotation products from the sequence databases or download the data from the FTP site. We will continue to update the human genome annotation frequently so that we can incorporate ongoing curation work including the MANE project and other curation activities. See our post on the increased frequency of annotation for more information on the new schedule.
3. dbSNP Human Build 153
The short variations (SNPs) annotated on human RefSeq transcripts and RefSeqGene records now incorporate data from dbSNP build 153.
The most popular filters are included on the new PubMed sidebar by default. You can now access many more filters using the additional filters link. Try it today and let us know what you think!
Figure 1. Click the “Additional filters” button to see many more filters.
We have released a new version of the Prokaryotic Genome Annotation Pipeline (PGAP), available on GitHub. The new release includes the ability to ignore pre-annotation validation errors (--ignore-all-errors). This new feature allows you to produce a preliminary annotation for a draft version of the genome, even one that contains vector and adapter sequences or that is outside of the size range for the species. This draft annotation should be helpful with your ongoing work on the genome assembly. Please keep in mind that these pre-annotations and assemblies with contaminants or other errors are not suitable for submission to GenBank.
Another new feature allows you to provide the name of the consortium that generated the assembly and annotation so that this information appears in the final GenBank records. For more details, consult our guidelines on input files.
See our previous post and our documentation for details on how to obtain and run PGAP yourself.
Next on our to-do list is a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned!
NCBI is pleased to announce a Structural Variant Hackathon at the Baylor College of Medicine in Houston, Texas, on October 11-13, 2019, immediately before ASHG.
We’re specifically looking for folks who have experience working with structural variants, complex disease, precision medicine, and similar genomic analysis. If this describes you, please apply! This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for large-scale genomic analyses from high-throughput experiments (please note that the event itself will focus on open access public human data).
Potential topics include:
- Mapping structural variants to public databases
- Calculating the heritability of different types of structural variants
- CNV effect on isoform expression
- Assembly accuracy for metagenomics
- Quality assessment in large cohorts
The hackathon runs from 9 am to 6 pm each day, with the potential to extend into the evening hours. There will also be optional social events at the end of each day. Participants with various backgrounds and expertise will be organized into five to eight teams of five to six individuals, each with an experienced leader. These teams will build pipelines and tools to analyze large datasets within a cloud infrastructure. Each day, we will come together to discuss progress on each of the topics, bioinformatics best practices, coding styles, etc.
There will be no registration fee associated with attending this event.
Note: Participants will need to bring their own laptop to this program. No financial support for travel, lodging, or meals is available for this event.
Now it’s easier than ever to access all data in ClinVar for a variant or set of variants across all reported diseases. ClinVar’s new XML is organized by variant only (Variation ID), instead of the variant-disease pair. This reduces redundancy, for example in cases where a variant is related to several disease concepts, and makes the XML consistent with the ClinVar web pages. You can get ClinVarVariationRelease XML from the /xml/clinvar_variation/ directory on the ClinVar FTP site. New features in ClinVarVariationRelease XML shown in Figure 1 include:
- Explicit elements to distinguish between variants that were directly interpreted and “included” variants, those that were interpreted only as part of a Haplotype or Genotype. The clinical significance for included variants is indicated as “no interpretation for the single variant”.
- Explicit elements to distinguish records for simple alleles, haplotypes, and genotypes
- The Replaces element that provides a history and indicates accessions that were merged into the current accession.
- A section that maps the submitted name or identifier for the interpreted condition to the corresponding name used in ClinVar and the MedGen Concept Identifier (CUI)
Figure 1. ClinVar variant-centric XML showing a variant record for a haplotype (VCV000236230) that comprises two included variations (SimpleAlleles) that are marked as “no interpretation for the single variant”. The record includes all the condition records (RCVList) with names and identifiers from MedGen, OMIM and other sources.
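To make the record types listed above concrete, here is a minimal sketch of how a script might tell a haplotype record from a simple allele record and count its included variations. The element names SimpleAllele, Haplotype, Genotype, and RCVList appear in this post; the wrapper element, the attribute, the nesting, and the function names are assumptions made for illustration, and the real schema is described in the ClinVar documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record. SimpleAllele, Haplotype, Genotype,
# and RCVList are named in the post; the <Record> wrapper, the Accession
# attribute, and the nesting shown here are assumptions for illustration.
SAMPLE_RECORD = """\
<Record Accession="VCV000236230">
  <Haplotype>
    <SimpleAllele/>
    <SimpleAllele/>
  </Haplotype>
  <RCVList/>
</Record>
"""

# Checked from most to least complex, assuming a Genotype may contain
# Haplotypes and a Haplotype contains SimpleAlleles.
RECORD_TYPES = ("Genotype", "Haplotype", "SimpleAllele")


def record_type(record):
    """Return the most complex variant element found in the record, or None."""
    for name in RECORD_TYPES:
        if record.find(f".//{name}") is not None:
            return name
    return None


def included_allele_count(record):
    """Count SimpleAllele elements nested inside a Haplotype or Genotype."""
    kind = record_type(record)
    if kind is None or kind == "SimpleAllele":
        return 0
    container = record.find(f".//{kind}")
    return sum(1 for _ in container.iter("SimpleAllele"))


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_RECORD)
    print(record_type(root), included_allele_count(root))
```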
To learn more about how to use this data, read our documentation.
Tell us how ClinVar has helped you by writing to us at email@example.com.