Re-evaluating the BLAST Nucleotide Database (nt)

The ongoing sequencing revolution has resulted in exponential growth of the NCBI BLAST databases. The default BLAST nucleotide database (nt), the most popular Web BLAST database, currently contains 903 billion letters and continues to grow rapidly, having doubled in size in the last year. This growth will cause longer search times, reduced capacity, and more delays in updating the database. In the not-too-distant future, searching the entire nt database on the web will no longer be possible unless we modify the database scope and composition.

Because of the above concerns, we want to make the default Web BLAST nucleotide database smaller and more efficient. Some options are to:

    • Change its composition to improve the quality of sequence entries included
    • Take steps to slow its growth rate
    • Divide it into several databases by biological or functional categories

RefSeq Release 215

RefSeq release 215 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 7, 2022, and contains 335,372,031 records, including 244,583,657 proteins and sequences from 125,116 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.
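
For readers who want to retrieve individual RefSeq records through E-utilities, the request can be sketched as a URL built against the EFetch endpoint. This is a minimal sketch, not an official client: the helper name `build_efetch_url` is hypothetical, the accession is only an illustrative RefSeq transcript, and the code builds the URL without sending any request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accession, db="nuccore", rettype="fasta", retmode="text"):
    """Build an EFetch URL for one sequence record (hypothetical helper).

    No network request is made here; the caller decides how to fetch the URL.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Illustrative accession only; substitute the record you need.
example_url = build_efetch_url("NM_000546")
print(example_url)
```

The resulting URL can be opened with any HTTP client. For bulk access to the whole release, the FTP site remains the more appropriate route.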

Prokaryotic phylum name changes coming soon!

Beginning in the first week of January 2023, NCBI Taxonomy will initiate changes to prokaryote phylum names in accordance with the recent inclusion of rank ‘phylum’ in the International Code of Nomenclature for Prokaryotes (ICNP). We first announced this update that involves changes to 42 NCBI taxa about a year ago. We will change several names that have long been in use (e.g., Firmicutes, Proteobacteria) to newly formalized names (e.g., Bacillota, Pseudomonadota) that may be unfamiliar to some.

You will still see the previous names on records and can search using them, but they will not be displayed as prominently as before. The organism names on Entrez records will not change (e.g., Bacillus subtilis). However, we will update the phylum names on the displayed lineages for ~276 million records (see an example in Figure 1 below).
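
If you keep local tables or scripts keyed on the older phylum names, a small lookup can bring stored lineages in line with the new ones. The sketch below covers only the two renames mentioned above; the names `RENAMED_PHYLA` and `update_lineage` are hypothetical, and the lineage shown is abbreviated for illustration rather than copied from a Taxonomy record.

```python
# Old phylum name -> newly formalized name (only the two examples given above;
# the full update affects 42 NCBI taxa).
RENAMED_PHYLA = {
    "Firmicutes": "Bacillota",
    "Proteobacteria": "Pseudomonadota",
}


def update_lineage(lineage):
    """Return a copy of the lineage with any renamed phylum replaced."""
    return [RENAMED_PHYLA.get(name, name) for name in lineage]


# Abbreviated, illustrative lineage for Bacillus subtilis.
old_lineage = ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Bacillaceae", "Bacillus"]
new_lineage = update_lineage(old_lineage)
print("; ".join(new_lineage))
```

Names that are not in the lookup pass through unchanged, so the organism name itself is never altered, which matches the behavior described above.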

New and improved SciENcv experience starting January 2023!

Science Experts Network Curriculum Vitae (SciENcv) is an electronic system that helps you assemble the professional information needed to apply for federal grants. Starting January 2023, we will be introducing a new and improved SciENcv experience!

SciENcv helps you gather and compile information on expertise, employment, education, and professional accomplishments. You can use SciENcv to create and maintain financial documents and biosketches that are submitted with grant applications.

Why should I use SciENcv?

  • Eliminates the need to repeatedly enter biosketch and financial document information
  • Reduces the administrative burden associated with federal grant submission and reporting requirements
  • Allows you to describe your scientific contributions in your own words

Submit your data to dbGaP in 3 easy steps!

Do you have human genetic data from a large-scale study? Submit your data to NCBI’s Database of Genotypes and Phenotypes (dbGaP) to contribute to meaningful discoveries about health. dbGaP contains data from more than 2.8 million study participants who have provided over 3.3 million molecular samples.

How do I submit data to dbGaP?

Step 1: Register your study

Step 2: Submit your data and get your study accession (phs#)

Step 3: Release your data

CCDS Release 24

An updated dataset of human protein-coding regions from the Consensus Coding Sequence (CCDS) collaboration

Are you interested in a set of high-quality human coding regions (CDS) with equivalent annotation in NCBI’s RefSeq and EMBL-EBI’s (European Molecular Biology Laboratories-European Bioinformatics Institute) Ensembl annotations? Check out the new CCDS Release 24! This CCDS set was generated by comparing RefSeq Annotation Release 110 and Ensembl Release 108.

This update adds 2,746 new CCDS IDs and 237 new genes compared to the last human CCDS build (Release 22, 2018). CCDS Release 24 includes a total of 35,608 CCDS IDs that correspond to 19,107 genes, with 48,062 protein sequences from RefSeq and 47,762 from Ensembl.

The new CCDS release is available on FTP for bulk download and on the CCDS webpage in case you are looking for data on individual genes.

New annotations in RefSeq!

In August and September, the NCBI Eukaryotic Genome Annotation Pipeline released thirty-eight new annotations in RefSeq for the following organisms:

  • Adelges cooleyi (spruce gall adelgid)
  • Aethina tumida (small hive beetle)
  • Anopheles aquasalis (mosquito)
  • Anopheles maculipalpis (mosquito)
  • Anthonomus grandis grandis (boll weevil)
  • Aphis gossypii (cotton aphid)
  • Bactrocera neohumeralis (fly)
  • Bombus affinis (bee)
  • Bombus huntii (bee)
  • Cataglyphis hispanica (ant)
  • Cygnus atratus (black swan) (pictured)

dbGaP: Data and analyses from millions of study participants, samples, and trillions of genotypes!

Are you familiar with the well-known Framingham Heart Study, a multi-generation study of residents of Framingham, Massachusetts, begun in 1948? Much of what is now known about the impact of genetics, lifestyle, and diet on cardiovascular health and disease has come from this research study. (See PMC4159698 for a historical perspective.) Did you know that data from this study and over 2,000 other studies that demonstrate the relationship between genetic and medical outcomes and other phenotypes are available from NCBI’s Database of Genotypes and Phenotypes (dbGaP)?

dbGaP was established in 2007 as a repository of human data from large-scale studies. You can access data from more than 2.8 million study participants who have provided over 3.3 million molecular samples. You can retrieve patient-level phenotypic (e.g., demographic, clinical, exposure) data, molecular (e.g., called genotypes, omics, sequence) data, and the results of association analyses from genome-scale case-control and longitudinal studies of heritable diseases.

What types of studies and data are available in dbGaP?

dbGaP contains a wide range of studies and types of data, all relating to human genetic and phenotypic measurements. Most dbGaP data are from NIH-funded research, but recently we have expanded to include non-NIH funded studies. An easy way to find dbGaP Studies, Phenotype and Molecular Datasets, Variables, Analyses and Documents is through the dbGaP Advanced Search (Figure 1). The interface allows you to filter results by different characteristics depending on the tab you choose.

Figure 1. The dbGaP Advanced Search interface. Tabs that appear at the top of the web interface allow you to select the studies, datasets, analyses, etc. of interest. Filters (facets) appear on the left (see inset). Click on filters to select values. Links on the study summary pages provide direct access to data. Top panel: Studies tab and the corresponding filter categories. Bottom panel: Molecular data tab results with Study (Framingham SHARe) and Markerset Source (Affymetrix) filters applied.

Announcing GenBank release 252.0

Now over 3 billion records!

GenBank release 252.0 (10/17/2022) is now available on the NCBI FTP site. This release has 20.35 trillion bases and 3.10 billion records. The current release has 240,539,282 traditional records containing 1,562,963,366,851 base pairs of sequence data. There are also 2,167,900,306 WGS records containing 18,231,960,808,828 base pairs of sequence data, 574,020,080 bulk-oriented TSA records containing 511,476,787,957 base pairs of sequence data, and 115,123,306 bulk-oriented TLS records containing 43,860,512,749 base pairs of sequence data. 
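
The headline totals can be checked by summing the four categories listed above. The following is a small sketch using only the figures quoted in this announcement; the names `RELEASE_252` and `totals` are introduced here for illustration.

```python
# (records, base pairs) for each category, as quoted in the announcement.
RELEASE_252 = {
    "traditional": (240_539_282, 1_562_963_366_851),
    "WGS": (2_167_900_306, 18_231_960_808_828),
    "TSA": (574_020_080, 511_476_787_957),
    "TLS": (115_123_306, 43_860_512_749),
}


def totals(categories):
    """Sum record counts and base pairs across all categories."""
    records = sum(rec for rec, _ in categories.values())
    bases = sum(bp for _, bp in categories.values())
    return records, bases


total_records, total_bases = totals(RELEASE_252)
print(f"{total_records:,} records ({total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} bases ({total_bases / 1e12:.2f} trillion)")
```

The sums come to 3,097,582,974 records and 20,350,261,476,385 bases, which round to the 3.10 billion records and 20.35 trillion bases stated above. WGS records account for the large majority of both.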

New version of PGAP now available!

We are happy to announce a new version of the stand-alone Prokaryotic Genome Annotation Pipeline (PGAP). This version helps you interpret your results by providing an estimate of the completeness and contamination of your PGAP-annotated genome assembly using CheckM.

CheckM bases its estimates on the presence of a set of lineage-specific genes for the species provided or the species returned by the taxonomy check (--taxcheck, --auto-correct-tax). The higher the completeness and the lower the contamination, the better the assembly is! If contamination is a concern, please try FCS-GX, a highly sensitive tool for detecting foreign contaminants in prokaryotic and eukaryotic genome assemblies.

This new release also contains code changes that improve prediction of some long genes, especially in low complexity regions. And, as with every release, PGAP incorporates incremental improvements from expert curators of the Protein Family Model collection that increase the precision of PGAP’s structural and functional annotation.

Please try this new version and share your experience with us!