If you are a consumer or producer of AGP (A Golden Path) files for genome assemblies, please read on. We’d like your feedback on the proposed changes described here.
As you know, AGP files are used to describe the structure of certain genome assemblies. The AGP file format has not kept up with changes in sequencing technology or International Sequence Database Collaboration (INSDC) feature usage. NCBI is therefore proposing to extend the current AGP v2.0 specification to add new linkage evidence types and a gap type of “contamination” as detailed below and described in the AGP v2.1 proposed specification.
Proposed changes from AGP v2.0 to AGP v2.1:
Add ‘proximity-ligation’ and ‘pcr’ to the set of accepted linkage evidence values
Drop ‘strobe’ from the set of accepted linkage evidence values
Expand the definition of ‘paired-end’ linkage evidence to include ‘mate-pairs’ and molecular-barcode techniques
Add a gap type of ‘contamination’
definition: a gap inserted in place of foreign sequence to maintain the coordinates of the original scaffold
usage: treated as linked, to preserve the original scaffold, but with linkage evidence ‘unspecified’
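For producers and consumers who validate AGP files, the proposed changes could be sketched roughly as follows. This is a minimal illustration, not NCBI's validator: the v2.0 value set and the exact token spellings are assumptions made for the example (the AGP specification is authoritative), and `check_gap_line` is a hypothetical helper name. It assumes the usual nine tab-separated columns of an AGP gap line, with the gap type, linkage, and linkage evidence in columns 7 to 9.

```python
# Sketch only: derive a v2.1 linkage-evidence set from a v2.0 set by applying
# the changes listed in the proposal, then check AGP gap lines against it.

# ASSUMPTION: illustrative v2.0 values; check the AGP specification for the
# authoritative list and spellings.
V2_0_LINKAGE_EVIDENCE = {
    "na", "paired-ends", "align_genus", "align_xgenus", "align_trnscpt",
    "within_clone", "clone_contig", "map", "strobe", "unspecified",
}

# Changes named in the proposal (spelled here as in this post).
ADDED = {"proximity-ligation", "pcr"}
DROPPED = {"strobe"}
V2_1_LINKAGE_EVIDENCE = (V2_0_LINKAGE_EVIDENCE | ADDED) - DROPPED


def check_gap_line(line):
    """Return a list of problems found in one tab-separated AGP gap line."""
    cols = line.rstrip("\n").split("\t")
    # Gap lines have component type N or U in column 5; skip everything else.
    if len(cols) != 9 or cols[4] not in ("N", "U"):
        return []
    gap_type, linkage, evidence = cols[6], cols[7], cols[8]
    problems = []
    # Multiple evidence values may be listed, separated by semicolons.
    for value in evidence.split(";"):
        if value not in V2_1_LINKAGE_EVIDENCE:
            problems.append(f"unknown linkage evidence: {value}")
    # Proposed usage: a contamination gap is linked, with evidence 'unspecified'.
    if gap_type == "contamination" and (linkage != "yes" or evidence != "unspecified"):
        problems.append("contamination gap should be linked with evidence 'unspecified'")
    return problems
```

Under these assumptions, a gap line that still uses ‘strobe’ would be reported, while a linked ‘contamination’ gap with evidence ‘unspecified’ would pass.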
Proposed timeline:
April 16 – May 7: Comment period
May 8 – May 10: AGP v2.1 proposal finalized
May 12 – May 16: AGP v2.1 approved at the annual INSDC meeting
Summer 2019: NCBI begins accepting the new linkage evidence types and using the contamination gap type
Note: NCBI would continue to accept genome submissions in AGP v2.0 format.
We are seeking your input on these proposed changes. Please comment on this post or write to email@example.com with any comments or suggestions.
NCBI announces Annotation Release 100 of the Pacific white shrimp (Penaeus vannamei) genome in RefSeq, based on the assembly (GCF_003789085.1) submitted by the Institute of Oceanology, Chinese Academy of Sciences. The Pacific white shrimp is one of the most important shrimp species in fisheries and aquaculture, and it is the first decapod to have its genome annotated by NCBI. We predicted 24,987 protein-coding genes with evidence from alignment of six billion RNA-Seq reads and homology with invertebrate proteins. This annotation will enable genomic research in this commercially important species.
If you’ve been searching in the Gene, Nucleotide, Protein, Genome, or Assembly databases, you’ve probably noticed the new search experience we introduced in September, which interprets several common plain-language searches and offers improved results. We’re excited to announce that we’ve added as-you-type suggestions to the search bar in these databases.
Here’s a peek at the new menu in the NCBI Gene database.
Figure 1. Typing into the search box brings up automatic suggestions of the most popular queries.
Earlier this year, we announced the release of a new and improved search feature that interprets plain language to give better results for common searches. This feature, originally developed in NCBI Labs and later released on the NCBI All Databases search, is now available across several NCBI resources: Nucleotide, Protein, Gene, Genome, and Assembly. Whether you are searching for a specific gene or for a whole genome, you will now retrieve NCBI’s best results regardless of the database you search.
The image below shows the results for a search for human INS in the Nucleotide database. Even though this is a Nucleotide search, the results include relevant information from Gene, Protein, Taxonomy, plus links to the NCBI reference sequences (RefSeq) as well as access to BLAST and the insulin gene region in NCBI’s genome browser, the Genome Data Viewer.
Figure 1. The new natural language search result in the Nucleotide database from a search for human INS.
Try out this new search capability and let us know what you think. And keep visiting the NCBI Labs search page to try our latest experiments, which we’ll also announce here on NCBI Insights.
addition of new file types, including a feature_count.txt file with counts of gene, RNA, and CDS features of specific types and a translated_cds.faa file with conceptual translations of each CDS feature on the genome
improvements to the Sequence Ontology feature types used in GFF3, including identification of pseudogene gene features as “pseudogene” instead of “gene” in column 3
improvements to the gene_biotype calculation to categorize transcribed pseudogenes as transcribed_pseudogene instead of misc_RNA
addition of the #!annotation-source unofficial pragma to GFF3 files with the annotation name, for assemblies where that information is available
expansion of the UCSC sequence name mapping provided in the assembly report files to provide mappings between GenBank or RefSeq sequence accessions, chromosome or scaffold names, and the UCSC sequence name for most of the recent assemblies in the UCSC Genome Browser
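Two of the GFF3 changes above are easy to check from a script: the feature type in column 3 (where pseudogenes now appear as “pseudogene”) and the #!annotation-source pragma. The following is a minimal sketch, assuming standard nine-column, tab-separated GFF3 feature lines; `summarize_gff3` is a hypothetical helper name, and it treats everything after the pragma name as the annotation name.

```python
from collections import Counter


def summarize_gff3(lines):
    """Return (annotation_source, Counter of column-3 feature types)."""
    annotation_source = None
    type_counts = Counter()
    pragma = "#!annotation-source"
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(pragma):
            # Keep whatever follows the pragma name as the annotation name.
            annotation_source = line[len(pragma):].strip()
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 9:
                # Column 3 holds the Sequence Ontology feature type.
                type_counts[cols[2]] += 1
    return annotation_source, type_counts
```

Passing an open file handle (or any iterable of lines) returns the annotation name, or None when the pragma is absent, together with per-type feature counts that can be compared against the counts NCBI reports in feature_count.txt.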
Nearly complete set of translation-related genes lends support to hypothesis that giant viruses evolved from smaller viruses
An international team of researchers, including NCBI’s Eugene Koonin and Natalya Yutin, has discovered a novel group of giant viruses (dubbed “Klosneuviruses”) with a more complete set of translation machinery genes than any virus that has been described to date. “This discovery significantly expands our understanding of viral evolution,” said Koonin. “These are the most ‘cell-like’ viruses ever identified. However, the computational analysis of the virus genomes shows that these viruses have not evolved from cells by reductive evolution but rather have evolved from smaller viruses, gradually acquiring genes from their hosts at different stages of their evolution.”
In an earlier blog post, we discussed how sequence updates in GRCh38, the most recent version of the human reference genome, filled in a gap in human chromosome 17 near position 21,300K and expanded the region by 500K (500,000 base pairs). In this post, we will again consider this same region, but with an emphasis now on how GRCh38 also improved the gene annotation.
Figure 1. Annotation of a region of chromosome 17 near the KCNJ12 and KCNJ18 genes. Top panel: Annotation release 105 on GRCh37.p13 represented by a configured graphic display of sequence record NC_000017.10. Bottom panel: Annotation release 106 on assembly GRCh38 represented by a configured graphic display of sequence record NC_000017.11. New gene models are circled.
In a previous blog post, we explained several important concepts about the human reference genome. We presented a region of human chromosome 17 as an example of a location where the genome sequence was not fully assembled. In this post, we are going to revisit the same gapped region to see how the Genome Reference Consortium (GRC) changed this part of the genome in GRCh38, the updated human reference assembly released in December 2013. This region represents just one of the more than 1,000 changes and improvements that the GRC introduced in GRCh38.