Tag: GenBank

Foreign Contamination Screen (FCS) tool for GenBank submissions

We are excited to introduce a Foreign Contamination Screen (FCS) tool that you can now run yourself, with enhanced contaminant detection sensitivity to improve your genome assemblies and facilitate high-quality data submissions to GenBank. If you submit genome assembly data to GenBank, the FCS tool is for you!

What is the FCS tool?

FCS, a quality assurance process used to make data suitable for submission, consists of two parts: FCS-adaptor and FCS-GX. FCS-adaptor searches for short sequences that are used as part of the lab preparation process and sometimes wind up in the final assembly by mistake. FCS-GX searches for sequences from a wide range of organisms including bacteria, fungi, protists, viruses, and others to identify sequences that don’t look like they are from the intended organism. In each case, you receive a report of the coordinates and identities of potential contaminants to be reviewed and removed (see Figure 1 for a sample report of the FCS-GX summary output). Both tools are designed to screen both eukaryote and prokaryote genomes.

Figure 1. FCS-GX report showing the summary of contamination identified in a tomato genome. The output indicates there are 83 sequences, adding up to 381 kb total length, to be removed from a mix of insect, fungal, and bacterial sources.
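
Once you have reviewed a list of reported coordinates, acting on it is mostly bookkeeping. The sketch below is only an illustration of that step: it assumes a hypothetical in-memory layout of (sequence ID, start, end) tuples with 1-based inclusive coordinates, which is not the actual FCS report format, and it masks the ranges with N rather than trimming them. The function name `mask_contaminants` and the toy data are made up for the example.

```python
from collections import defaultdict


def mask_contaminants(sequences, hits, mask_char="N"):
    """Return a copy of `sequences` with each reported range masked.

    sequences: dict mapping sequence ID -> sequence string
    hits: iterable of (seq_id, start, end), 1-based inclusive.
          This layout is an assumption, not the real FCS report format.
    """
    ranges_by_seq = defaultdict(list)
    for seq_id, start, end in hits:
        ranges_by_seq[seq_id].append((start, end))

    cleaned = {}
    for seq_id, seq in sequences.items():
        bases = list(seq)
        for start, end in ranges_by_seq.get(seq_id, []):
            # Convert 1-based inclusive coordinates to 0-based indices,
            # clamping ranges that run past the end of the sequence.
            for i in range(max(start, 1) - 1, min(end, len(bases))):
                bases[i] = mask_char
        cleaned[seq_id] = "".join(bases)
    return cleaned


# Toy assembly and toy contaminant ranges (both invented for illustration).
assembly = {"contig_1": "ACGTACGTAC", "contig_2": "TTTTGGGGCC"}
report = [("contig_1", 3, 6), ("contig_2", 9, 12)]
masked = mask_contaminants(assembly, report)
print(masked)
```

Masking keeps downstream coordinates stable, which is why the sketch uses it; whether to mask, trim, or drop a whole sequence is a decision to make while reviewing the report.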

How do I use FCS?

FCS is available from GitHub. Simply download the two programs (FCS-adaptor and FCS-GX), and follow a few steps as outlined in the Quickstart. Both tools are also easy and inexpensive to run on commercial clouds such as Amazon Web Services (AWS) or Google Cloud Platform (GCP), and can screen genomes in a fraction of the time of other approaches. 

Why is FCS important?

High-quality data is essential for reaching accurate research conclusions. By rapidly detecting contaminants from foreign organisms in assembled genomes, FCS helps ensure that the data submitted, and later reused, is of high value. We’ve already used FCS-GX to remove over one hundred megabases of contaminants and thousands of erroneous genes and proteins from previously submitted eukaryote genomes to make the data more useful for all.

We want to hear from you!

We will update the FCS tool based on your feedback, so try it out and let us know what you think. Please contact us with comments and suggestions.

FCS is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms.

Join our mailing list to keep up to date with FCS and other CGR news.

NLM’s all-new NCBI Datasets genome table is now available

We are excited to introduce new and useful updates to the Datasets genome table that let you quickly find and download a genome dataset including genome, transcript and protein sequence, annotation, and a data report.

The new genome table includes many new features and benefits (see Figure 1). With the new genome table you can:

  • Find all current genomes, including metagenomes
  • View multiple taxa such as birds and bees, or polyphyletic groups like fish
  • Easily find genomes with NCBI RefSeq annotations
  • Get more accurate genome counts, since each row now represents a single genome with GenBank and RefSeq accessions for that genome in the same row
  • Customize your downloads to include either GenBank or RefSeq files, or both
  • Download tables or data packages

GenBank Release 250.0 is available!

GenBank release 250.0 (6/17/2022) is now available on the NCBI FTP site. This release has 18.63 trillion bases and 2.69 billion records. 

The current release has 239,017,893 traditional records containing 1,395,628,631,187 base pairs of sequence data. There are also 1,796,349,114 WGS records containing 16,710,373,006,600 base pairs of sequence data, 546,991,572 bulk-oriented TSA records containing 485,056,129,761 base pairs of sequence data, and 111,142,107 bulk-oriented TLS records containing 41,999,358,847 base pairs of sequence data.
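
As a quick sanity check, the per-division figures above can be summed to confirm the headline totals. This is just arithmetic on the numbers quoted in this post; the dictionary layout and variable names are our own choices for the sketch.

```python
# (records, base pairs) per division, copied from the figures quoted above
divisions = {
    "traditional": (239_017_893, 1_395_628_631_187),
    "WGS": (1_796_349_114, 16_710_373_006_600),
    "TSA": (546_991_572, 485_056_129_761),
    "TLS": (111_142_107, 41_999_358_847),
}

total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

print(f"{total_records:,} records = {total_records / 1e9:.2f} billion")
print(f"{total_bases:,} bases = {total_bases / 1e12:.2f} trillion")
```

The sums come to 2,693,500,686 records and 18,633,057,126,395 bases, matching the rounded 2.69 billion records and 18.63 trillion bases.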

ASM Microbe 2022 was a success!

NCBI had the pleasure of attending and participating in this year’s American Society for Microbiology (ASM) Microbe conference, June 9-13 in Washington, D.C. NCBI staff participated in activities and events throughout the three-day conference. Over 4,500 attendees gathered in the exhibit hall and joined a variety of poster presentations and talks!

Reflections from a few of our NCBI experts

“It was a great honor for me to receive the ASM Elizabeth O. King Lecturer Award. Thank you to my colleagues, without whom so much of my work would not have been possible, and to all of those who attended my presentation on Making Genomics Accessible to Aid Public Health and Research.”

~Michael Feldgarden, Ph.D.

Come see NCBI at the ASM Microbe Conference 2022

The American Society for Microbiology (ASM) Microbe conference is back, and scheduled to take place in-person, June 9th-13th in Washington, D.C.

NCBI staff member Dr. Michael Feldgarden will be recognized by ASM with an award for his research. Other NCBI staff will present posters on NCBI resources and will also be available at our booth (#1128) to address your questions. Drop by to see what’s new and provide your feedback. We hope to see you there! Check out NCBI’s schedule of activities:  Continue reading “Come see NCBI at the ASM Microbe Conference 2022”

Average Nucleotide Identity (ANI) for assembly validation

Validating genome assemblies submitted to GenBank using an ANI-based workflow

Average Nucleotide Identity (ANI) analysis is a useful tool to verify taxonomic identities in prokaryotic genomes. As part of the NCBI bacterial genome submission process, GenBank performs ANI analyses to compare submitted prokaryotic genome assemblies against reference data generated from type strains. You can learn more about the relevant workflow and about type strain curation in our publications (PMC6978984 and PMC4383940).

We use genomes obtained from type strains (type assemblies) in computational comparisons, for example using ANI to reclassify or modify existing taxonomy with reasonable confidence. The taxonomy check status for all 1.3 million bacterial genome assemblies is summarized in the ANI_report_prokaryotes.txt file available from the ASSEMBLY_REPORTS FTP directory. The README file describes the contents of the report in detail. You can run ANI on your genome on its own or in the context of annotation. Find more information here.
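
To make the idea concrete, here is a minimal sketch of one common way to define ANI: the length-weighted mean identity of the fragments that align between a query genome and a type assembly. It is not NCBI's implementation. The function names, the toy alignment fragments, and both cutoffs (`ani_cutoff`, `coverage_cutoff`) are assumptions chosen purely for illustration, and real tools differ in how they fragment, align, and filter.

```python
def average_nucleotide_identity(fragments):
    """Length-weighted mean percent identity over aligned fragments.

    fragments: list of (percent_identity, aligned_length) pairs
    """
    total_aligned = sum(length for _, length in fragments)
    if total_aligned == 0:
        raise ValueError("no aligned fragments")
    weighted = sum(identity * length for identity, length in fragments)
    return weighted / total_aligned


def check_against_type(fragments, query_length,
                       ani_cutoff=96.0, coverage_cutoff=0.8):
    """Return (ani, coverage, passes) for a query vs. one type assembly.

    Both cutoffs are illustrative assumptions, not NCBI's thresholds.
    """
    ani = average_nucleotide_identity(fragments)
    coverage = sum(length for _, length in fragments) / query_length
    return ani, coverage, ani >= ani_cutoff and coverage >= coverage_cutoff


# Toy alignment fragments: (percent identity, aligned length in bp)
hits = [(99.0, 1000), (97.0, 3000), (95.0, 1000)]
ani, coverage, passes = check_against_type(hits, query_length=6000)
print(f"ANI={ani:.2f}%  coverage={coverage:.2f}  passes={passes}")
```

Coverage is checked alongside identity in this sketch because a high ANI computed over a small aligned fraction says little about the genome as a whole.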

Monkeypox virus: Complete genome from the current outbreak now available in GenBank

The first complete genome sequence of the current monkeypox virus (MPXV) outbreak (isolate name MPXV_USA_2022_MA001) is now available with accession ON563414 in GenBank, a public database of DNA sequences hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM).

Several cases of monkeypox have been identified in geographically widespread countries. Monkeypox is classified as a zoonotic disease: transmission of the virus is usually due to animal-human contact. Genetically, monkeypox viruses cluster into two groups: the Congo Basin clade and the West African clade. The virus behind this outbreak has been identified as belonging to the West African clade, which is often associated with milder disease; in this case, human-to-human spread is suspected.

New in RAPT: Better taxonomic assignment and GO annotation

We are excited to announce two improvements to the Read assembly and Annotation Pipeline Tool (RAPT), which allows you to assemble genomic reads for bacterial or archaeal isolates and annotate their genes at the click of a button.

Improved taxonomic assignment

Now RAPT verifies the scientific name you provide with the reads, and corrects it as needed with the Average Nucleotide Identity (ANI) tool, which compares your genome to type strain assemblies in GenBank to place it in the taxonomic tree. So, even if you only have a rough idea of the species you have sequenced, input datasets tailored to your genome will be used for the annotation and you will get the best possible gene set from RAPT.

Announcing GenBank Release 249.0

GenBank release 249.0 (4/19/2022) is now available on the NCBI FTP site. This release has 17.85 trillion bases and 2.66 billion records.

The current release has 237,520,318 traditional records containing 1,266,154,890,918 base pairs of sequence data. There are also 1,781,374,217 WGS records containing 16,071,520,702,170 base pairs of sequence data, 534,770,586 bulk-oriented TSA records containing 474,421,076,448 base pairs of sequence data, and 109,820,387 bulk-oriented TLS records containing 41,324,192,343 base pairs of sequence data.
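
For a sense of how fast each division is growing, the base-pair counts for release 249.0 can be compared with those quoted for release 250.0 earlier on this page. The sketch below only does subtraction on those published figures; the variable names and dictionary layout are our own.

```python
# Base pairs per division, copied from the two release announcements
release_249 = {
    "traditional": 1_266_154_890_918,
    "WGS": 16_071_520_702_170,
    "TSA": 474_421_076_448,
    "TLS": 41_324_192_343,
}
release_250 = {
    "traditional": 1_395_628_631_187,
    "WGS": 16_710_373_006_600,
    "TSA": 485_056_129_761,
    "TLS": 41_999_358_847,
}

# Absolute growth in base pairs for each division
growth = {name: release_250[name] - release_249[name] for name in release_249}
total_growth = sum(growth.values())

for name, added in growth.items():
    percent = 100 * added / release_249[name]
    print(f"{name:<12} +{added:>19,} bp ({percent:.2f}%)")
print(f"{'total':<12} +{total_growth:>19,} bp")
```

Overall, release 250.0 adds 779,636,264,516 base pairs, most of them in WGS records.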

Introducing SARS-CoV-2 Variants Overview, NLM’s latest tool in the fight against COVID-19 

The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) has released a new resource, called the SARS-CoV-2 Variants Overview, that aggregates data related to SARS-CoV-2 variants from sequences available in NCBI’s GenBank and Sequence Read Archive (SRA) databases.

SARS-CoV-2 Variants Overview, a freely available online dashboard, was developed with guidance from the TRACE Working Group as part of NLM’s participation in the National Institutes of Health (NIH) Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) initiative, a public-private partnership for a coordinated research strategy to support and speed up the development of COVID-19 treatments and vaccines.

One impetus for development of the dashboard is that unassembled SRA data cannot be processed through Pango tools, and many SARS-CoV-2 samples are only represented in SRA. The Pango nomenclature is being used by researchers and public health agencies worldwide to track the transmission and spread of SARS-CoV-2, including variants of concern. Thus, we developed a uniform approach to making variant calls from SRA records and assigning Pangolin lineages on the basis of these results. This means that submission groups do not have to go through the effort of creating assemblies.