Do you need a smaller dataset for your analyses of virus data? In response to your feedback, NCBI Virus now allows you to download a randomized subset of your results for nucleotide, protein, or RefSeq genome sequences from any supported virus (Figure 1). This option is useful for viruses such as SARS-CoV-2 or Influenza A that have very large numbers of records, where the entire dataset may be too large to work with easily. In such cases, a smaller random sample is more manageable for your analysis. You can also reduce sampling bias in a dataset by getting a limited number of records for each country or host (Figure 2).
Figure 1: Virus Download Results menu with the option to “Download a randomized subset of all records (up to 2,000)”
The “Download a randomized subset” option provides two types of random subsets:
- Up to 2,000 records selected randomly from the entire dataset
- Up to 20 random records from each country or each host in your dataset (Figure 2)
Important Note: The randomized subsets option is only available for the data type (Nucleotide, Protein, or RefSeq Genome) selected above the results table on the webpage. Before downloading, you can still apply filters to narrow down the sequences in your dataset.
Figure 2: Download menu showing options to get up to 2,000 randomized records or up to 20 random records stratified by Country or Host.
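To illustrate the difference between the two subset types, here is a minimal Python sketch of simple random sampling versus per-group (stratified) sampling. It is not NCBI Virus's implementation, only an illustration of the idea. The function names, the `country` field, and the `HYP...` accessions are hypothetical, and the toy records are made up for the example.

```python
import random
from collections import Counter, defaultdict


def random_subset(records, max_records=2000, seed=None):
    """Return up to max_records records chosen uniformly at random."""
    rng = random.Random(seed)
    if len(records) <= max_records:
        return list(records)
    return rng.sample(records, max_records)


def stratified_subset(records, key, per_group=20, seed=None):
    """Return up to per_group random records for each distinct value of key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    subset = []
    for members in groups.values():
        # Small groups contribute all of their records.
        k = min(per_group, len(members))
        subset.extend(rng.sample(members, k))
    return subset


# Toy metadata (hypothetical): one country dominates the dataset.
records = (
    [{"accession": f"HYP{i:05d}", "country": "USA"} for i in range(4000)]
    + [{"accession": f"HYP{i:05d}", "country": "India"} for i in range(4000, 4900)]
    + [{"accession": f"HYP{i:05d}", "country": "Kenya"} for i in range(4900, 4910)]
)

simple = random_subset(records, seed=1)
stratified = stratified_subset(records, key="country", seed=1)

# The simple sample mirrors the skew of the full dataset; the stratified
# sample caps every country at 20 records.
print(Counter(rec["country"] for rec in simple))
print(Counter(rec["country"] for rec in stratified))
```

In this toy example, the simple random sample is still dominated by the most heavily sequenced country, while the stratified sample holds at most 20 records per country. That is the sense in which a stratified subset can reduce bias.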
You can download data in all supported formats, including:
- Results table showing metadata
- List of accessions
- FASTA sequences using default or custom definition lines
Stay up to date
Follow us on Twitter @NCBI and join our mailing list to keep up to date with NCBI Virus and other NCBI news.
We want to hear from you! Try it out and let us know what you think. If you have questions or would like to provide feedback, please use the yellow Feedback button on our website or reach out to us at firstname.lastname@example.org.