Have you ever wondered how your genetic makeup is different from your neighbor’s? The National Human Genome Research Institute (NHGRI)-funded Human Pangenome Research Consortium (HPRC) has built an initial version of a pangenome reference – a collection of new human reference genome sequences representing 47 individuals from across the globe. Pangenome graphs relate the sequences from the different genomes to one another. The pangenome allows researchers to compare these DNA sequences and get a more detailed view of the range of human genetic variation. This is the first step toward the HPRC’s goal of building a pangenome reference composed of the genomes of 350 individuals from diverse genetic backgrounds. Continue reading “Now Available! Access Data from the Human Pangenome Research Consortium (HPRC) at NCBI”
Tag: BioProject
Important Update! Changes to ASSEMBLY_REPORTS and GENOME_REPORTS on FTP
Do you currently access genome assembly data through the FTP site? We are consolidating information provided in the ASSEMBLY_REPORTS and GENOME_REPORTS directories on the genomes FTP site to simplify access and ensure that you have the most accurate, up-to-date, and consistently reported data.
The assembly_summary files in the ASSEMBLY_REPORTS directory are gaining information in newly added columns 24-38, including statistics about the assembly (size, GC content, genome size, and number of sequences) as well as details about the provided annotation (number of genes, annotation name and date). See example below (Table 1). Check out the README for more details about the contents of the summary files. Continue reading “Important Update! Changes to ASSEMBLY_REPORTS and GENOME_REPORTS on FTP”
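As a rough illustration of working with a tab-delimited assembly_summary file, the sketch below reads the column names from the file's own header line instead of hard-coding them, so columns added later are picked up automatically. It assumes that comment lines start with `#` and that the last such line before the data is the tab-separated header. The column names and values in the sample text are placeholders; the README remains the authority on the real names.

```python
import io


def read_assembly_summary(handle):
    """Yield one dict per assembly row, keyed by the names in the header line."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumption: the last '#' line before the data rows is the header.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before the data rows")
        yield dict(zip(header, line.split("\t")))


# Hypothetical sample text; real files have many more columns.
SAMPLE = (
    "#   See the README for a description of the columns\n"
    "#assembly_accession\tbioproject\tgc_percent\n"
    "GCF_000000000.1\tPRJNA000000\t41.0\n"
)

if __name__ == "__main__":
    for row in read_assembly_summary(io.StringIO(SAMPLE)):
        print(row["assembly_accession"], row["gc_percent"])
```

Looking values up by name rather than by position keeps a script working when new columns are appended, as long as the existing names stay the same.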
Scrubbing human sequence contamination from Sequence Read Archive (SRA) submissions
Do you work with human-derived sequence data? Do you often struggle to determine whether your data is free of human sequence and therefore suitable for public distribution? We encourage submitters to screen for and remove contaminating human reads from data files prior to submission to SRA. To support investigators in this effort, we offer a tool to remove human sequence contamination from your SRA submissions!
Human Read Removal Tool (HRRT)
The Human Read Removal Tool (HRRT; also known as the Human Scrubber) is available on GitHub and DockerHub. The HRRT is based on the SRA Taxonomy Analysis Tool (STAT). It takes a fastq file as input and produces a fastq.clean file as output, in which all reads identified as potentially of human origin are masked with ‘N’. Continue reading “Scrubbing human sequence contamination from Sequence Read Archive (SRA) submissions”
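To make the masking step concrete, here is a minimal sketch of what "masked with N" could look like for a fastq file. This is not the HRRT implementation. It assumes four-line fastq records, and the set of flagged read IDs is a hypothetical input standing in for whatever the tool's classification step produces.

```python
def mask_flagged_reads(fastq_lines, flagged_ids):
    """Return fastq lines with the bases of each flagged read replaced by N.

    Assumes well-formed four-line records: header, sequence, '+', qualities.
    """
    masked = []
    for i in range(0, len(fastq_lines), 4):
        header, seq, plus, qual = fastq_lines[i:i + 4]
        # The read ID is the first whitespace-delimited token after '@'.
        read_id = header[1:].split()[0]
        if read_id in flagged_ids:
            seq = "N" * len(seq)  # keep the read length, hide the bases
        masked.extend([header, seq, plus, qual])
    return masked


if __name__ == "__main__":
    records = ["@r1", "ACGT", "+", "IIII", "@r2 extra", "GGCC", "+", "IIII"]
    # 'r2' plays the role of a read flagged as potentially human.
    print("\n".join(mask_flagged_reads(records, {"r2"})))
```

Masking rather than deleting keeps the read count and read lengths unchanged, so paired files stay in step.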
Gapless Telomere to Telomere human genome (T2T-CHM13) now available
On April 1, 2022, Science published the first complete sequence of a human genome, known as T2T-CHM13. This notable scientific achievement comes two decades after the first human genome release from the Human Genome Project and offers a first look at biologically important regions, such as centromeres, telomeres, and segmental duplications, that were previously unassembled. Read on to learn more about how you can access this assembly and related resources at NCBI, or to access any one of the more than 1,000 human genome assemblies now in GenBank. Continue reading “Gapless Telomere to Telomere human genome (T2T-CHM13) now available”
Retrieve genome data by BioProject using the Datasets command-line tool
You can now retrieve genome data using the NCBI Datasets command-line tool and API by simply providing a BioProject accession. You can go directly from a BioProject accession to genome data even when the BioProject accession is the parent of multiple BioProjects (Figure 1).
Figure 1. Command lines using BioProject accessions with the datasets command-line tool and sample metadata. Top panel: command line for downloading genome metadata for the Sanger 25 Genomes Project (PRJEB33226). Middle panel: a portion of the metadata in JSON format for the 25 Genomes Project. Bottom panel: command line for downloading sequence data and annotation metadata for a component BioProject for the king scallop (PRJEB35331). Continue reading “Retrieve genome data by BioProject using the Datasets command-line tool”
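The figure is not reproduced here, so the sketch below is an assumption about what such command lines look like. It follows the `datasets summary genome accession` and `datasets download genome accession` subcommand layout of recent releases of the tool, which may differ from the version shown in the original figure. The helper function names are hypothetical. The code only builds the argument lists and prints them; it does not run anything.

```python
import shlex


def datasets_summary_cmd(bioproject):
    """Argument list for requesting genome metadata by BioProject accession."""
    return ["datasets", "summary", "genome", "accession", bioproject]


def datasets_download_cmd(bioproject, filename="ncbi_dataset.zip"):
    """Argument list for downloading a genome data package by BioProject accession."""
    return [
        "datasets", "download", "genome", "accession", bioproject,
        "--filename", filename,
    ]


if __name__ == "__main__":
    # Parent BioProject: Sanger 25 Genomes Project.
    print(shlex.join(datasets_summary_cmd("PRJEB33226")))
    # Component BioProject: king scallop.
    print(shlex.join(datasets_download_cmd("PRJEB35331")))
```

With the datasets tool installed, either list can be passed to `subprocess.run(cmd, check=True)`. Check `datasets --help` for the subcommands your installed version supports.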
Researchers: Now it’s easier to find the data you want in BioProject
We’ve improved BioProject to give you a better way to find all data from a specific project. We think you’ll love the new interface that lets you quickly choose the right BioProject with links to the data you want in other NCBI databases.
The updated BioProject browser makes it easier than ever to filter the data by a variety of attributes so you can quickly pick BioProjects that interest you.

Continue reading “Researchers: Now it’s easier to find the data you want in BioProject”
RefSeq Functional Elements now public
NCBI is pleased to announce the initial data release of RefSeq Functional Elements, a resource that provides RefSeq and Gene records for experimentally validated human and mouse non-genic functional elements. Data can be accessed via Gene, Nucleotide, BLAST, BioProject, Graphical Displays and FTP.
Accessing the Hidden Kingdom: Fungal ITS Reference Sequences
This post is geared toward fungi researchers as well as RefSeq and BLAST users.
Fungi have unique characteristics that can make it difficult to identify and classify species based on morphology. To address these issues, Conrad Schoch, NCBI’s fungi taxonomist, and Barbara Robbertse, NCBI’s fungi RefSeq curator, in collaboration with outside mycology experts, are curating a set of fungal sequences from internal transcribed spacer (ITS) regions of the nuclear ribosomal RNA genes. This set of standard DNA sequences for fungal taxa not only addresses these difficulties in identifying and classifying fungal species by morphology, but is also essential for analyzing environmental (metagenomics) sequencing studies. The curated ITS sequences, described in a recent article in Database (PMC Free Article), all have associated specimen data and, when possible, are taken from sequences from type materials, ensuring correct species identification and tracking of name changes. This article will show you how to access these ITS sequences and search them using the specialized Targeted Loci BLAST service.
The fungal ITS sequences are a RefSeq Targeted Loci BioProject (PRJNA177353). As you may know, a BioProject is a collection of biological data related to a single initiative; in this case, the goal is to collect and curate fungal sequences from targeted loci – specific molecular markers such as protein coding or ribosomal RNA genes used for phylogenetic analysis.
Continue reading “Accessing the Hidden Kingdom: Fungal ITS Reference Sequences”
The Tasmanian Devil 2: The tumor and Tasmanian devil mitochondrial genomes
The Tasmanian devil (Sarcophilus harrisii), the last remaining large marsupial carnivore, now faces extinction because of a strange and deadly infection, a transmissible cancer known as Devil Facial Tumor Disease (DFTD). In a previous NCBI Insights post, we discussed gene expression data from the tumors that established their neural origin and showed the tumors were likely derived from Schwann cells. In this post, we’ll consider some of the genome sequencing projects in the NCBI databases and explore evidence that the tumor originated in a different individual than the affected animal, supporting the idea that the tumor cells themselves are infectious agents. Continue reading “The Tasmanian Devil 2: The tumor and Tasmanian devil mitochondrial genomes”