
Beyond Phylogenies: Enriching Analyses and Visualizations of Genomic Variants Codeathon
Codeathon Outcomes:
This codeathon concluded on July 29th, 2022. To see what teams accomplished, check out their GitHub repositories.
Codeathon Description:
Large-scale genomic variant datasets are an important tool for basic science and public health. However, the massive size of modern datasets, such as those generated in response to the COVID-19 pandemic, presents computational challenges to traditional methods of analysis and visualization. In particular, the common workflow of determining consensus sequences, building a phylogenetic tree, and then visually inspecting the tree does not scale well to datasets with millions of samples. Large phylogenies are costly to generate, do not show very large numbers of samples in an interpretable way, and are poorly suited to simultaneously displaying multilayered metadata. This approach also discards information about within-sample variation when the sequenced samples are taken from populations of microbes or viruses. To foster the development of new tools for analyzing and visualizing large datasets of genomic variants, we are hosting a virtual codeathon, Beyond Phylogenies: Enriching Analyses and Visualizations of Genomic Variants, with the National Institute of Allergy and Infectious Diseases (NIAID).
The codeathon will focus on developing software to address four related problems:
- Building phylogenies from Variant Call Format (VCF) files accounting for within-sample variants. Typically, phylogenetic trees are inferred from alignments of consensus sequences, which do not capture the within-sample diversity of the population samples that are common for viruses and microbes. In contrast to consensus sequences, VCF files can store the number of sequencing reads supporting each allele at a variant site. Thus, building trees directly from VCFs, without intermediate consensus sequences, could open the door to richer analyses of complex viral and microbial datasets.
- Creating rich visualizations of phylogenies that display multilayered metadata. Simple static tree diagrams can display only a few metadata features at the same time without becoming hopelessly cluttered. However, modern datasets can contain a great number of features, including clinical data, vaccine status, collection location, experimental conditions, epidemiological features, and taxonomic nomenclature. Enriching phylogenies with a large variety of metadata simultaneously would facilitate data exploration in biological, clinical, and epidemiological contexts.
- Optimizing analytical approaches and visualizations of relatedness to work with millions of samples. Genomic variant datasets, including those for SARS-CoV-2, are growing rapidly and reaching the million-sample scale. Building phylogenies from such large datasets with traditional methods is computationally prohibitive. Furthermore, a phylogeny with millions of leaves would be impossible to display usefully as a standard tree diagram. We’re looking for creative ways to represent and analyze patterns of relatedness in large datasets that are tractable and interpretable while preserving as much information as possible.
- Automated inference from phylogenies and variant datasets. For very large datasets with many metadata features, visual inspection of phylogenies is challenging and likely to miss useful information. We’re interested in using methods from machine learning, statistics, and population genetics to automatically identify clades or variants of particular biological or clinical interest.
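As a rough illustration of the first problem, the sketch below reads per-allele read counts from a VCF and turns them into a pairwise distance between samples, without calling a consensus sequence. It is a minimal sketch under stated assumptions, not a method proposed by the organizers: the toy VCF text, the sample names S1 to S3, and the function names are hypothetical; sites are assumed to be biallelic with no missing values; and read counts are taken from the commonly used `AD` FORMAT field. The distance (mean absolute difference in alternate-allele frequency) is one simple choice among many.

```python
from itertools import combinations

# Hypothetical toy VCF (tab-separated). Real files have "##" header lines,
# multi-allelic sites, and missing values that this sketch does not handle.
TOY_VCF = "\n".join([
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT:AD\t0/1:30,10\t0/1:20,20\t1/1:0,40",
    "chr1\t250\t.\tC\tT\t.\tPASS\t.\tGT:AD\t0/0:40,0\t0/1:30,10\t0/1:10,30",
])


def alt_frequencies(vcf_text):
    """Return {sample: [alternate-allele read fraction per site]} from AD counts."""
    samples, freqs = [], {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            freqs = {s: [] for s in samples}
            continue
        ad_index = fields[8].split(":").index("AD")  # position of AD in FORMAT
        for sample, entry in zip(samples, fields[9:]):
            counts = entry.split(":")[ad_index].split(",")
            ref_reads, alt_reads = int(counts[0]), int(counts[1])
            depth = ref_reads + alt_reads
            # Assumption: a site with no reads contributes a frequency of 0.
            freqs[sample].append(alt_reads / depth if depth else 0.0)
    return freqs


def pairwise_distances(freqs):
    """Mean absolute difference in alternate-allele frequency per sample pair."""
    dists = {}
    for a, b in combinations(sorted(freqs), 2):
        diffs = [abs(x - y) for x, y in zip(freqs[a], freqs[b])]
        dists[(a, b)] = sum(diffs) / len(diffs)
    return dists


freqs = alt_frequencies(TOY_VCF)
dists = pairwise_distances(freqs)
for pair, d in dists.items():
    print(pair, d)
```

A distance matrix like this could, in principle, be passed to a distance-based tree-building method. Note that a consensus-based approach would discard the difference between a site supported by 25% of reads and one supported by 50%, which is exactly the within-sample information this problem aims to keep.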
We seek to bring together a diverse group of collaborators with expertise in fields including phylogenetics, bioinformatics, and data visualization. We encourage both programmers and non-programming subject matter experts to apply. We will assign applicants to codeathon teams of 5-10 people based on their interests and skills.
During the week-long event, teams will collaborate virtually to design visualizations and write software to address one of the problems above. The codeathon will be cooperative rather than competitive, and teams will share ideas and technical expertise. At the end of the week, teams will present their work to each other and to representatives from NIH.
After the event, we will make the team products publicly available through the NCBI Codeathons GitHub Organization. Participants are encouraged to co-author a joint manuscript. We also encourage participants to share their work online and at conferences.
If you are interested in participating, pitching an idea for a team project, and/or serving as a team leader, apply at the “Registration” link or email us at codeathons@ncbi.nlm.nih.gov. Please note that participation may be capped because of technical limitations and the total number of accepted projects.