Author: NCBI Staff

Scrubbing human sequence contamination from Sequence Read Archive (SRA) submissions

Do you work with human-derived sequence data? Do you often struggle with the need to determine if your data is free of human sequence and therefore suitable for public distribution? We encourage submitters to screen for and remove contaminating human reads from data files prior to submission to SRA. To support investigators in this effort, we offer a tool to remove human sequence contamination from your SRA submissions!

Human Read Removal Tool (HRRT)

The Human Read Removal Tool (HRRT; also known as the Human Scrubber) is available on GitHub and DockerHub. The HRRT is based on the SRA Taxonomy Analysis Tool (STAT). It takes a fastq file as input and produces a fastq.clean file as output, in which all reads identified as potentially of human origin are masked with ‘N’.
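To make the masking step concrete, here is a minimal Python sketch. It is not the HRRT implementation: the function name `mask_flagged_reads` and the caller-supplied set of flagged read IDs are assumptions made for illustration, and the sketch does not show how STAT identifies human reads. It only shows how a read that has already been flagged could have its bases replaced with ‘N’ while the rest of the FASTQ record is left untouched.

```python
from typing import Iterable, Iterator, Set


def mask_flagged_reads(fastq_lines: Iterable[str], flagged_ids: Set[str]) -> Iterator[str]:
    """Yield FASTQ lines, replacing the bases of flagged reads with 'N'.

    Assumes well-formed four-line FASTQ records. `flagged_ids` is a
    hypothetical input: the IDs of reads already identified as human.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:  # header, sequence, '+', qualities
            header, seq, plus, qual = record
            # Read ID: text after '@' up to the first whitespace.
            read_id = header[1:].split()[0]
            if read_id in flagged_ids:
                seq = "N" * len(seq)  # mask every base, keep the length
            yield from (header, seq, plus, qual)
            record = []
```

Keeping the masked sequence the same length as the original means the quality string still lines up with it, so the record stays valid FASTQ.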

ClinVar to offer improved support for somatic data

We need your input! 

ClinVar is NCBI’s archive of reports of the relationships among human genetic variations and diseases, with supporting evidence. To make ClinVar data more accurate and useful, we are introducing an enhanced data model to better accept and support classifications of somatic variants. 

How you can help 

Do you have somatic variant classifications to submit to ClinVar? We want to hear from you! We are now testing ClinVar’s enhanced data model and support for classifications of somatic variants.

New wizard for submitting mRNA sequences to GenBank

Do you submit eukaryotic nuclear mRNA sequences to GenBank? A new mRNA submission wizard is available! Built on the modern Submission Portal framework, this new wizard offers an enhanced experience, including:

    • Guided submission experience specific to mRNA sequences
    • Automated trimming of vector sequence and removal of short sequences
    • Easier input for source metadata
    • New feature annotation web forms for coding regions (CDS) and untranslated regions (5’ UTR, 3’ UTR)
    • Extensive feature previews (Figure 1)
    • Faster sequence processing and accession assignment
    • Access to an error-fixing workflow prior to accession assignment

Watch a short video (4 min) to see how to annotate CDS features in this new wizard!

NCBI-NIAID Beyond Phylogenies Codeathon was a success!

SARS-CoV-2 genomic data is critical for monitoring the spread and evolution of the virus during the COVID-19 pandemic, identifying newly emerging variants, and developing and evaluating countermeasures. As of September 2022, over 13 million SARS-CoV-2 genomes have been sequenced across the world, making it the most sequenced pathogen ever. A cornerstone of genomic analysis is building a phylogeny, which demonstrates the relatedness of individual isolates to the rest of the sequenced genomes. However, the volume of SARS-CoV-2 genomes presents novel opportunities beyond phylogenies, as well as computational challenges to traditional methods of genomic analysis and visualization.

Full-scale access to microbial Pathogen Detection data in the Cloud!

NCBI’s Pathogen Detection resource now provides selected data on the Google Cloud Platform (GCP), giving you better access to data from over 1 million bacterial isolates.

Data on GCP include:

  1. The tables from the MicroBIGG-E database of antimicrobial resistance (AMR), stress response, and virulence genes and genomic elements, and from the Pathogen Isolates Browser, both accessible through Google BigQuery.
  2. The MicroBIGG-E sequences in FASTA format, available from Google Cloud Storage.

Features & Benefits

Pathogen Detection data on GCP gives you larger-scale access than is currently available through the web or from FTP. Notably, there is no FTP access to MicroBIGG-E; the web interface is limited to 100K rows and sequence downloads are restricted. There are no such restrictions on GCP. MicroBIGG-E in BigQuery also allows you to download all AMRFinderPlus results. Currently there are more than 20 million rows of antimicrobial resistance, virulence, and stress response genes, and point mutations, identified in more than 1 million pathogen isolates.

Here are two examples where researchers have used MicroBIGG-E and AMRFinderPlus data to advance research on antimicrobial resistance:

    • Identifying conserved functional regions in erythromycin resistance methyltransferases (PMID: 34795028).
    • Assessing the health risks of antibiotic resistance genes (PMCID: PMC8346589).

RefSeq Release 216

RefSeq release 216 is now available online, from the FTP site, and through NCBI’s new resource, Datasets.

This full release incorporates genomic, transcript, and protein data available as of January 9, 2023, and contains 342,395,932 records, including 249,868,639 proteins, 49,869,497 RNAs, and sequences from 128,299 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.

Introducing a new and improved SciENcv experience!

Want to submit federal grant applications quickly and easily? Check out our new and improved SciENcv experience! Science Experts Network Curriculum Vitae (SciENcv) is an electronic system that helps you assemble professional information needed to apply for federal grant support.  

SciENcv helps you gather and compile information on expertise, employment, education, and professional accomplishments. You can use SciENcv to create and maintain financial documents and biographical sketches that are submitted as part of grant application packages.

New feature in the Comparative Genome Viewer!

Easily distinguish reverse orientation alignments

We are excited to announce an update to NCBI’s Comparative Genome Viewer (CGV) that allows you to quickly determine the relative orientation of aligned segments. CGV displays whole genome alignments between two different eukaryotic assemblies (Figure 1). 

In the viewer, individual alignment regions are connected by colored bands between two chromosomes. These alignments are now colored differently depending on whether the aligned sequences on the two assemblies are in the same orientation (forward) or reverse orientation relative to one another. Forward orientation alignments are connected by green bands, whereas reverse alignments are connected by purple bands. Reverse alignments represent local genome inversions or inverted translocations and may point to areas of significant biological difference between the two assemblies.
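The coloring rule can be expressed as a short Python sketch. This is an illustration only, not CGV code: the function name `band_color` and the use of "+" and "-" to encode strand are assumptions, while the green and purple assignments come from the description above.

```python
def band_color(strand_a: str, strand_b: str) -> str:
    """Return the band color for one alignment between two assemblies.

    Strands are given as "+" or "-" (an assumed encoding). Matching strands
    mean forward orientation; differing strands mean reverse orientation.
    """
    for strand in (strand_a, strand_b):
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand value: {strand!r}")
    return "green" if strand_a == strand_b else "purple"
```

Note that orientation is relative: two segments that both lie on the minus strand are still in the same orientation with respect to each other, so they would be drawn in green under this rule.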

Announcing the GenBank and SRA Data Processing Webpage

Interested in understanding how sequence data are submitted, processed, and made publicly available in GenBank and the Sequence Read Archive (SRA)? Announcing the GenBank and SRA Data Processing webpage!

Here you can learn about the procedures that the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), uses to process submitted data and post it publicly, as well as key definitions of data status.

Next Phase of the NIH Preprint Pilot Launching Soon

Phase 2 expands the scope of the preprints included in PubMed and PMC

Last month, the National Library of Medicine (NLM) announced plans to extend its NIH Preprint Pilot in PubMed Central (PMC) and PubMed beyond COVID-19 to encompass all preprints reporting on NIH-funded research. The second phase of the pilot, launching later this month, will include preprints supported by an NIH award, contract, or intramural program and posted to an eligible preprint server on or after January 1, 2023.