Petabyte-Scale Sequence Search: Metagenomics Benchmarking Codeathon Highlights

The National Institutes of Health (NIH) Office of Data Science Strategy (ODSS), the National Library of Medicine’s (NLM’s) National Center for Biotechnology Information (NCBI), and the Department of Energy’s (DOE’s) Office of Biological and Environmental Research (BER) hosted scientists from around the world for a virtual Petabyte-Scale Sequence Search: Metagenomics Benchmarking Codeathon. The codeathon, held September 27-October 1, 2021, attracted experts from national laboratories, including Los Alamos National Laboratory, and research institutions, including the Joint Genome Institute, as well as students from universities across the world, to develop benchmarking approaches that address the challenges of conducting large-scale analyses of metagenomic data.

The pace at which scientists generate data is accelerating due to advances in research approaches and technology. Rapidly growing public databases like NIH’s Sequence Read Archive (SRA) provide new opportunities and challenges for scientific discovery. Today, more than 16 million unique “next-generation” SRA sequencing records are publicly available from the Google Cloud Platform and the Amazon Web Services Open Data Sponsorship Program (AWS ODP), representing more than 17 petabytes (PB) of data in the normalized SRA format on each cloud provider.

To take advantage of this growing collection of biomedical data, there is a need for efficient methods to search the archive using nucleotide sequences. Just as the introduction of tools like the Basic Local Alignment Search Tool (BLAST) provided a key to unlock the potential of the GenBank archive, similar approaches are needed for SRA. To this end, we have developed an interagency Emerging Solutions in Petabyte Scale Sequence Search (ESPSSS) initiative, which hosted its first workshop in June. Because metagenomic samples comprise more than 30% of the sequence records in SRA, ESPSSS is initially focusing on metagenomic benchmarking. In the spirit of developing community-driven solutions, ESPSSS hosted the virtual Petabyte-Scale Sequence Search: Metagenomics Benchmarking Codeathon in September to bring together students, researchers, and computing professionals to collaborate on developing sequence search benchmarking approaches.

Codeathon participants, who were split into four teams, worked collaboratively to generate the following proof-of-concept or early-stage solutions:

  1. A pipeline used for the identification of metagenomic samples with user-provided long sequence queries,
  2. A gold-standard dataset and pipeline to benchmark contig containments,
  3. A benchmark harness for read/contig tools, and
  4. A pipeline to combine an experimental SRA sequence index with BLAST.

You can learn more about these projects by visiting their GitHub repositories. Work is underway to publish the solutions developed via this codeathon and to plan for a second codeathon in the summer of 2022 to further refine our approaches, so stay tuned!

If you have any questions about NCBI codeathons or interest in participating in future events, please reach out to the NCBI codeathon team at . For any queries related to the petabyte scale sequence search initiative, please contact

  1. This looks like an amazing event, will definitely keep an eye out for the 2022 event. One thing, the GitHub link isn’t working, could this be fixed to look at the code generated from this event?

