Want to submit high-quality data quickly and easily to GenBank? Check out our Foreign Contamination Screen (FCS) tool, a quality assurance process that you can run yourself. FCS offers enhanced contaminant detection sensitivity to improve your genome assemblies and facilitate high-quality data submissions to GenBank. We recently made several improvements to make the tool even easier to use!
Now quicker and easier to run!
Decontaminate your genome with just one extra step.
Save the removed sequences in a separate file, if desired.
Find more contaminants with improved coverage of prokaryotes, protists, and more.
Do you submit or access Sequence Read Archive (SRA) data? In an ongoing effort to enhance your experience, NCBI is making several improvements to our widely used SRA database. SRA is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all organisms as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enable reproducibility and facilitate new discoveries through data analysis.
What improvements is NCBI making?
More transparent: We recently launched the GenBank and SRA Data Processing page to help you better understand how sequence data are submitted, processed, and made publicly available.
More efficient: Faster data transfers, downloads, and analyses! We will be incrementally streamlining how you access SRA data as SRA Lite becomes the standard SRA file format. This simplified format reduces the average file size for more efficient analysis and storage of large datasets.
More reliable: A trusted source! SRA is a trustworthy database, and we are continuously improving our processes to ensure system reliability.
GenBank release 254.0 (2/19/2023) is now available on the NCBI FTP site. This release has 22.52 trillion bases and 3.37 billion records. The current release has 241,830,635 traditional records containing 1,731,302,248,418 base pairs of sequence data. There are also 2,337,838,461 WGS records containing 20,116,642,176,263 base pairs of sequence data, 672,261,981 bulk-oriented TSA records containing 630,615,054,587 base pairs of sequence data, and 121,067,644 bulk-oriented TLS records containing 46,465,508,548 base pairs of sequence data.
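The headline totals can be cross-checked against the per-division counts. The following is a minimal Python sketch that sums the figures quoted above; the dictionary layout and the function name `release_totals` are illustrative choices, not part of any NCBI tool.

```python
# Record and base-pair counts per division, copied from the
# release 254.0 announcement above: (records, base pairs).
DIVISIONS = {
    "traditional": (241_830_635, 1_731_302_248_418),
    "WGS": (2_337_838_461, 20_116_642_176_263),
    "TSA": (672_261_981, 630_615_054_587),
    "TLS": (121_067_644, 46_465_508_548),
}


def release_totals(divisions):
    """Return (total_records, total_bases) summed over all divisions."""
    total_records = sum(records for records, _ in divisions.values())
    total_bases = sum(bases for _, bases in divisions.values())
    return total_records, total_bases


total_records, total_bases = release_totals(DIVISIONS)
print(f"total records: {total_records:,}")
print(f"total bases:   {total_bases:,}")
```

The sums come to roughly 3.37 billion records and 22.5 trillion bases, in line with the headline figures.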
Do you submit eukaryotic nuclear mRNA sequences to GenBank? A new mRNA submission wizard is available! Built on the modern Submission Portal framework, this new wizard will bring you an enhanced experience, including:
Guided submission experience specific for mRNA sequences
Automated trimming of vector and removal of short sequences
Easier input for source metadata
New feature annotation web forms for coding region (CDS) and untranslated region (5’ UTR, 3’ UTR)
Extensive feature previews (Figure 1)
Faster sequence processing and accession assignment
Access to an error-fixing workflow prior to accession assignment
Interested in understanding how sequence data are submitted, processed, and made publicly available in GenBank and the Sequence Read Archive (SRA)? Announcing the GenBank and SRA Data Processing webpage!
NCBI is looking forward to seeing you in person at the International Plant and Animal Genome Conference (PAG 30), January 13-18, 2023 in San Diego, California.
We’re especially excited to share our recent efforts on the NIH Comparative Genomics Resource (CGR), a multi-year National Library of Medicine (NLM) project to maximize the impact of eukaryotic research organisms and their genomic data resources on biomedical research.
We also want to hear from you! If you’re interested in sharing your feedback on your needs and experiences involving comparative genomics tools to inform CGR, consider joining our Feedback Session.
Check out NCBI’s schedule of activities and events:
GenBank release 252.0 (10/17/2022) is now available on the NCBI FTP site. This release has 20.35 trillion bases and 3.10 billion records. The current release has 240,539,282 traditional records containing 1,562,963,366,851 base pairs of sequence data. There are also 2,167,900,306 WGS records containing 18,231,960,808,828 base pairs of sequence data, 574,020,080 bulk-oriented TSA records containing 511,476,787,957 base pairs of sequence data, and 115,123,306 bulk-oriented TLS records containing 43,860,512,749 base pairs of sequence data.
GenBank release 251.0 (8/15/2022) is now available on the NCBI FTP site. This release has 19.55 trillion bases and 2.94 billion records. The current release has 239,915,786 traditional records containing 1,492,800,704,497 base pairs of sequence data. There are also 2,024,099,677 WGS records containing 17,511,809,676,629 base pairs of sequence data, 560,196,830 bulk-oriented TSA records containing 497,501,380,386 base pairs of sequence data, and 115,103,527 bulk-oriented TLS records containing 43,852,280,645 base pairs of sequence data.
We are excited to introduce a Foreign Contamination Screen (FCS) tool that you can now run yourself, with enhanced contaminant detection sensitivity to improve your genome assemblies and facilitate high-quality data submissions to GenBank. If you submit genome assembly data to GenBank, the FCS tool is for you!
What is the FCS tool?
FCS, a quality assurance process used to make data suitable for submission, consists of two parts: FCS-adaptor and FCS-GX. FCS-adaptor searches for short sequences that are used during lab preparation and sometimes end up in the final assembly by mistake. FCS-GX searches for sequences from a wide range of organisms, including bacteria, fungi, protists, viruses, and others, to identify sequences that do not appear to come from the intended organism. In each case, you receive a report of the coordinates and identities of potential contaminants to be reviewed and removed (see Figure 1 for a sample report of the FCS-GX summary output). The two tools are designed to screen both eukaryote and prokaryote genomes.
Figure 1. FCS-GX report showing the summary of contamination identified in a tomato genome. The output indicates there are 83 sequences, adding up to 381 kb total length, to be removed from a mix of insect, fungal, and bacterial sources.
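To illustrate what acting on a report of contaminant coordinates can look like, here is a minimal Python sketch. It does not read the actual FCS report format and does not call FCS itself. The `FlaggedSpan` tuple, the function name `split_contaminants`, and the choice to simply drop flagged spans and join the flanking sequence are all assumptions made for illustration. The real tools may trim, mask, or split sequences differently.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for one report row:
# (sequence ID, 1-based start, 1-based inclusive end) of a flagged span.
FlaggedSpan = Tuple[str, int, int]


def split_contaminants(
    assembly: Dict[str, str], flagged: List[FlaggedSpan]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (cleaned, removed): the assembly without the flagged spans,
    and the flagged spans themselves keyed by 'id:start-end'."""
    spans_by_seq: Dict[str, List[Tuple[int, int]]] = {}
    for seq_id, start, end in flagged:
        spans_by_seq.setdefault(seq_id, []).append((start, end))

    cleaned: Dict[str, str] = {}
    removed: Dict[str, str] = {}
    for seq_id, seq in assembly.items():
        kept = []
        cursor = 0  # 0-based index of the next base not yet handled
        for start, end in sorted(spans_by_seq.get(seq_id, [])):
            lo = max(start - 1, cursor)  # skip any overlap with a prior span
            hi = min(end, len(seq))      # 0-based exclusive end
            if hi <= lo:
                continue
            kept.append(seq[cursor:lo])
            removed[f"{seq_id}:{lo + 1}-{hi}"] = seq[lo:hi]
            cursor = hi
        kept.append(seq[cursor:])
        remainder = "".join(kept)
        if remainder:  # a fully flagged sequence is dropped entirely
            cleaned[seq_id] = remainder
    return cleaned, removed


# Toy example: one partly flagged sequence and one fully flagged contig.
assembly = {"chr1": "AAAACCCCGGGG", "ctg2": "TTTT"}
flagged = [("chr1", 5, 8), ("ctg2", 1, 4)]
cleaned, removed = split_contaminants(assembly, flagged)
print(cleaned)
print(removed)
```

Returning the removed spans separately mirrors the option of saving removed sequences to their own file for later review.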
How do I use FCS?
FCS is available from GitHub. Simply download the two programs (FCS-adaptor and FCS-GX), and follow a few steps as outlined in the Quickstart. Both tools are also easy and inexpensive to run on commercial clouds such as Amazon Web Services (AWS) or Google Cloud Platform (GCP), and can screen genomes in a fraction of the time of other approaches.
Why is FCS important?
Accurate research conclusions depend on high-quality data. With FCS, rapid detection of contaminants from foreign organisms in assembled genomes helps ensure that submitted data are of high value and suitable for reuse. We’ve already used FCS-GX to remove over one hundred megabases of contaminants and thousands of erroneous genes and proteins from previously submitted eukaryote genomes to make the data more useful for all.
We want to hear from you!
We will update the FCS tool based on your feedback, so try it out and let us know what you think. Please contact us with comments and suggestions.