Category: What’s New

Primer-BLAST now designs primers for a group of related sequences

Primer-BLAST now has a “Primers common for a group of sequences” submission tab that allows you to design primers for a group of highly similar sequences. For example, you may want to test for expression of any transcript of a gene rather than a specific splice variant, so you want to design primers that cover all transcript variants. Or you may want to design primers that will amplify the same gene in closely related bacterial strains. To find primers for a group of related sequences, Primer-BLAST aligns the longest sequence to the rest to find common regions and uses these regions to limit the locations of primers. The longest sequence is also used as the representative template sequence in the results. Figure 1 shows an example search for primers that will amplify all 15 splice variants of the human TP53 gene.

Figure 1. Primer-BLAST submission page and results for primers designed for the human TP53 transcripts. Top panel: The submission form with the “Primers common for a group of sequences” selected and the 15 RefSeq transcript accessions for TP53. Middle panel: The graphical results showing the longest sequence (NM_001126114.3) as the representative template, the locations of the primer pairs, and the alignment of the other template sequences. Bottom panel: An individual primer pair showing the locations on each of the template sequences.
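The idea of restricting primers to regions shared by every sequence can be illustrated with a toy sketch. This is not Primer-BLAST's method: the tool aligns the longest sequence to the others, whereas the sketch below substitutes exact k-mer matching as a simple stand-in. The function name common_regions and the parameter k are hypothetical, chosen only for illustration.

```python
def common_regions(sequences, k=12):
    """Pick the longest sequence as the template and return half-open
    (start, end) intervals on it whose k-mers occur in every other sequence.

    Toy stand-in for an alignment step: exact k-mer matching only.
    """
    # Index of the longest sequence, used as the representative template.
    t_idx = max(range(len(sequences)), key=lambda i: len(sequences[i]))
    template = sequences[t_idx]
    others = [s for i, s in enumerate(sequences) if i != t_idx]

    # All k-mers present in each of the other sequences.
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in others]

    # For each k-mer start on the template, is that k-mer found everywhere?
    shared = [
        all(template[i:i + k] in kmers for kmers in kmer_sets)
        for i in range(len(template) - k + 1)
    ]

    # Merge consecutive shared k-mer starts into intervals.
    regions = []
    start = None
    for i, ok in enumerate(shared):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            regions.append((start, i - 1 + k))
            start = None
    if start is not None:
        regions.append((start, len(template)))
    return template, regions


if __name__ == "__main__":
    template, regions = common_regions(["AAACCCGGGTTT", "CCCGGG"], k=3)
    for start, end in regions:
        print(start, end, template[start:end])
```

Candidate primer sites would then be limited to the returned intervals on the template. A real alignment tolerates mismatches and gaps, which exact k-mer matching does not.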

Please try out this new feature and let us know what you think!

Improved chromosome searching in Genome Browsers

Are you interested in searching for a chromosomal region in a genome, but don’t know how to write the correct query?  The good news is that the NCBI Genome Data Viewer (GDV) now supports a much wider array of search options. Some examples are listed below:

  • chr1:1,500,000-2,000,000
  • chr2: 1.5M – 2M
  • chr2: 1.5M-2,540.2K
  • 2:1,500,000-2,000,000
  • 3: 21.33M – 22.01M
  • 3: 21.335M..21.337M
  • chr1:1,500,000 / 200
  • chr1:101,500,200
  • 1:101,500,200
  • 1:1,500K/0.5K
  • chr5
  • 10

You can use any of these queries or the ones described below for assembly aliases either on the GDV landing page or in the GDV search box (Figure 1).

Figure 1. The search boxes on the GDV landing page (left) and within the GDV graphical interface (right) showing queries with chromosome aliases for the domestic cat.
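To make the coordinate shorthand concrete, here is a minimal Python sketch that turns some of the example queries into a chromosome name plus start and end positions in bases. It is not GDV's parser. It assumes that K and M mean thousands and millions of bases, and that a hyphen, an en dash, or two dots separate the ends of a range. The slash forms are deliberately left out because their meaning is not spelled out above. The names to_bases and parse_location are hypothetical.

```python
import re
from decimal import Decimal

# Assumed multipliers for the K and M suffixes.
UNIT = {"": 1, "K": 1_000, "M": 1_000_000}


def to_bases(token):
    """Convert a coordinate such as '1,500,000', '1.5M' or '2,540.2K' to an int."""
    text = token.strip().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KkMm]?)", text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {token!r}")
    number, suffix = match.groups()
    # Decimal avoids floating-point rounding surprises for values like 2540.2K.
    return int(Decimal(number) * UNIT[suffix.upper()])


def parse_location(query):
    """Return (chromosome, start, end); start and end are None for a bare chromosome."""
    chrom, colon, rest = query.partition(":")
    chrom = chrom.strip()
    if not colon:
        return chrom, None, None
    if "/" in rest:
        raise ValueError("slash forms are not handled in this sketch")
    # Range separators: '..', hyphen, or en dash (U+2013).
    parts = re.split(r"\.\.|-|\u2013", rest)
    if len(parts) == 1:
        position = to_bases(parts[0])
        return chrom, position, position
    if len(parts) == 2:
        return chrom, to_bases(parts[0]), to_bases(parts[1])
    raise ValueError(f"unrecognized range: {query!r}")


if __name__ == "__main__":
    for q in ["chr1:1,500,000-2,000,000", "chr2: 1.5M-2,540.2K", "3: 21.335M..21.337M", "chr5"]:
        print(q, "->", parse_location(q))
```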

RefSeq Release 202 is public

RefSeq release 202 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.
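As a brief illustration of the E-utilities route, the sketch below builds an EFetch request for one RefSeq record and, when run as a script, downloads it as FASTA. The accession NM_001126114.3 is the TP53 transcript mentioned in the Primer-BLAST item above. The helper names efetch_url and fetch_fasta are hypothetical, and the sketch omits things a production script would want, such as an API key, rate limiting, and error handling.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that returns one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def fetch_fasta(accession):
    """Download the record (requires network access)."""
    with urlopen(efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(efetch_url("NM_001126114.3"))
    # Print only the FASTA header line of the downloaded record.
    print(fetch_fasta("NM_001126114.3").splitlines()[0])
```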

This full release incorporates genomic, transcript, and protein data available as of September 8, 2020, and contains 255,571,455 records, including 186,755,483 proteins, 33,077,068 RNAs, and sequences from 104,969  organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Updated human genome Annotation Release 109.20200815
Updated Annotation Release 109.20200815 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report is available here.

The annotation products are available in the sequence databases and on the FTP site.

This update includes around 15,000 updated RefSeq transcripts, revised to use CAGE and polyA data to define 5′ and 3′ ends and to match the reference GRCh38 sequence.

Coronavirus host gene regulatory elements now annotated by RefSeq Functional Elements
The RefSeq Functional Elements project at NCBI has prioritized curation of experimentally validated regulatory elements for human host genes associated with SARS-CoV-2 entry into cells. The annotations include several enhancers, promoters, cis-regulatory elements and protein binding sites, among other feature types. We annotated 236 regulatory features for 27 distinct biological regions, including regulatory elements for the ABO, ACE2, ANPEP, CD209, CLEC4G, CLEC4M, CTSL, DPP4, and TMPRSS2 genes. More information can be found here.

New eukaryotic genome annotations
This release includes new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 27 species, including:

  • maize annotation release 103, based on the new assembly Zm-B73-REFERENCE-NAM-5.0 (GCF_902167145.1)
  • marmoset annotation release 105, based on the new assembly Callithrix_jacchus_cj1700_1.1 (GCF_009663435.1)
  • Chinese hamster annotation release 104, based on the assembly CriGri_1.0 (GCF_000223135.1) and the new assembly CriGri-PICRH-1.0 (GCF_003668045.3)
  • Asian giant hornet annotation release 100, based on the new assembly V.mandarinia_Nanaimo_p1.0 (GCF_014083535.2)
  • Florida lancelet annotation release 100, based on the new assembly Bfl_VNyyK (GCF_000003815.2)
  • Anopheles stephensi annotation release 100, based on the new assembly UCI_ANSTEP_V1.0 (GCF_013141755.1)

Updated and improved collection of RefSeq representative genome assemblies now available
The collection of representative genome assemblies for Bacteria and Archaea contains 11,727 prokaryotic assemblies to represent their respective species. More information can be found here.

Updated protein family models used by PGAP available for download
Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available.

This release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. More information can be found here.

Future change: Mouse Reference Assembly Update
RefSeq annotation of the new mouse GRCm39 assembly is in progress, and is expected to be included in the next release.

Easily download large amounts of genomic data with NCBI Datasets

Do you need to download a lot of genomic data? Maybe you need all primate reference genomes, or maybe you need just a few really big genomes? Prior to the advent of NCBI Datasets, downloading such a large amount of data could be a frustrating and time-consuming experience involving failed downloads and custom scripts.

NCBI Datasets makes large genome downloads simpler, faster, and more reliable. You don’t have to write a script. You can be sure you get all the data requested. And sharing the data is easier than ever. Figure 1 shows an example data download process using Datasets.

Figure 1. Downloading and processing genomic data using NCBI Datasets. The example shows downloading the set of RefSeq primate assemblies through the Datasets web interface. Since the downloaded files would exceed 15 GB, the file comes as a “dehydrated bag”: a small, easily downloaded, zipped file with metadata and links to download the data. You can “rehydrate” the unzipped dehydrated files (fill them with the corresponding data) using the datasets command-line tool.


Hiding sequences in an alignment now available in the MSA Viewer!

Do you ever wish there was a quick way to hide partial or poor quality sequences from a multiple alignment view? NCBI’s Multiple Sequence Alignment Viewer (MSAV) now allows you to do just that with an easy hide/show rows feature! Hidden rows won’t be shown in the PDF/SVG download, and these sequences will not be included in the FASTA alignment download file.

You can easily manage which rows are shown through the menu available by right-clicking on a row or through the Rows dialog (Figure 1).

Figure 1. The MSA Viewer showing the options for hiding or showing rows. Right-clicking any sequence row provides options for hiding single or selected rows or restoring hidden rows. You can also manage rows through the edit row dialog activated by clicking the “Rows” button next to the gear icon at the upper-right. Check or uncheck sequences to add or remove them from the display. The “Rows shown” status message at the lower-right of the MSAV indicates the total number of rows in the sequence alignment and the number displayed.

Keep in mind that hiding rows does not re-calculate the alignment, so it’s important to know if any rows have been hidden from your current view. The “Rows shown” message at the lower-right indicates whether you are displaying all rows.

You can find more tips on using the MSA Viewer, including information about anchors, consensus, and coloring settings, in our user guide. If you have any questions or suggestions, please get in touch using the Feedback link on the page or by writing to the NCBI Help Desk.


GenBank release 239 is available

GenBank release 239.0 (8/18/2020) is now available on the NCBI FTP site. This release has 9.89 trillion bases and 2.12 billion records.

The current release has 218,642,238 traditional records containing 654,057,069,549 base pairs of sequence data. There are also 1,408,122,887 WGS records containing 8,841,649,410,652 base pairs of sequence data, 417,524,567 bulk-oriented TSA records containing 366,968,951,160 base pairs of sequence data, and 75,682,157 bulk-oriented TLS records containing 27,825,059,498 base pairs of sequence data.
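The headline totals can be checked against the four components listed above. The short Python sketch below only sums the figures quoted in this post; the variable names are illustrative.

```python
# (records, base pairs) for each component, as quoted above.
components = {
    "traditional": (218_642_238, 654_057_069_549),
    "WGS": (1_408_122_887, 8_841_649_410_652),
    "TSA": (417_524_567, 366_968_951_160),
    "TLS": (75_682_157, 27_825_059_498),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

print(f"{total_records:,} records ({total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} bases ({total_bases / 1e12:.2f} trillion)")
```

The sums come to 2,119,971,849 records and 9,890,500,490,859 bases, which round to the 2.12 billion records and 9.89 trillion bases stated for the release.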

Growth between releases

During the 60 days between the close dates for GenBank Releases 238.0 and 239.0, the ‘traditional’ portion of GenBank grew by 226,233,810,648 basepairs and by 1,520,005 sequence records. During that same period, 80,474 records were updated. An average of 26,675 ‘traditional’ records were added and/or updated per day.
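The daily average follows from the numbers in the paragraph above: records added plus records updated, divided by the 60 days between close dates. A quick check in Python, using only those quoted figures:

```python
added = 1_520_005   # new 'traditional' sequence records
updated = 80_474    # records updated in the same period
days = 60           # days between the close dates of the two releases

per_day = (added + updated) / days
print(f"{per_day:,.2f} records added and/or updated per day")
print(f"rounded: {round(per_day):,}")
```

This gives about 26,674.65 records per day, which rounds to the 26,675 reported.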

Between releases 238.0 and 239.0, the WGS component of GenBank grew by 727,603,148,494 basepairs and by 105,270,272 sequence records. The TSA component of GenBank grew by 7,021,242,098 basepairs and by 7,799,517 sequence records. The TLS component of GenBank grew by 324,424,370 basepairs and by 618,976 sequence records.

The total number of sequence data files increased by 425 with this release. The divisions are as follows:

  • BCT: 37 new files, now a total of 490
  • ENV: 2 new files, now a total of 62
  • INV: 9 new files, now a total of 95
  • MAM: 5 new files, now a total of 76
  • PAT: 7 new files, now a total of 212
  • PLN: 321 new files, now a total of 547
  • PRI: 1 new file, now a total of 35
  • ROD: 7 new files, now a total of 41
  • VRL: 2 new files, now a total of 38
  • VRT: 35 new files, now a total of 182

Note: The unusually large increase in the number of PLN-division files is due to an influx of multiple sets of near-gigabase-scale chromosomal records for wheat (Triticum aestivum) and barley (Hordeum vulgare subsp. vulgare).

For downloading purposes, please keep in mind that the uncompressed GenBank Release 239.0 sequence data flatfiles require roughly 1,461 GB. The ASN.1 data files require approximately 938 GB.

More information about GenBank release 239.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP.

Coronavirus host gene regulatory elements now annotated by RefSeq Functional Elements

The COVID-19 pandemic has drawn attention to the human host genes associated with SARS-CoV-2 entry and to the elements that regulate expression of these genes. At NCBI, we have prioritized curation of experimentally validated regulatory elements for these genes in the RefSeq Functional Elements project. Our annotations include several enhancers, promoters, cis-regulatory elements and protein binding sites, among other feature types. We have annotated 236 regulatory features for 27 distinct biological regions in the latest human Annotation Release (109.20200522), including regulatory elements for the ABO, ACE2, ANPEP, CD209, CLEC4G, CLEC4M, CTSL, DPP4, and TMPRSS2 genes.

You can view our regulatory element to target gene linkages in the regulatory interactions track using our new track hub that we recently announced.  You can also see the biological regions and features tracks. These have functional and descriptive metadata, including biological region summaries, experimental evidence types, publication support and more.

The example in Figure 1 shows RefSeq Functional Element feature annotation in NCBI’s Genome Data Viewer (GDV) for the ABO gene region (GRCh38, NW_009646201.1: 73,864-103,789), the determiner of the human ABO blood group. A genome-wide association study recently identified non-coding ABO variants associated with COVID-19 disease severity (PMID:32558485), which map to some of the RefSeq Functional Elements in this region.

Figure 1. The human ABO gene region in the NCBI GDV displaying the RefSeq Functional Element features. The biological regions aggregate track shows underlying feature annotation for an ABO upstream enhancer (LOC112637023), promoter region (LOC112679202), +5.8 intron 1 enhancer (LOC112679198), a 3′ regulatory region (LOC112639999), and a +36.0 downstream enhancer (LOC112637025). Functional Element features include numerous enhancers, promoters, cis-regulatory elements and protein / transcription factor binding sites.

We have more information about RefSeq Functional Elements on our website, including data download and extraction options. Stay tuned to NCBI Insights and other NCBI social media for future announcements about RefSeq Functional Elements!

NIH Genetic Testing Registry (GTR) now accepting microbe tests including SARS-CoV-2 / COVID-19 tests

The profound impact of the COVID-19 pandemic prompted the NIH Genetic Testing Registry (GTR) to expand its scope to include microbe tests. We will focus initially on molecular tests to detect the SARS-CoV-2 virus and serology tests to detect viral antigens and antibodies to the virus. This project contributes to efforts to flatten the curve of the pandemic by sharing test data, bringing transparency on the validity of available tests, and making it easy to identify orderable tests at point of care.

We invite all labs that offer molecular tests for SARS-CoV-2 to diagnose COVID-19 and serologic tests for the antibodies to the virus or viral antigens to determine previous exposure to share their test data in GTR. Click here for instructions on how to submit your test.


NCBI Ending Support for TLS 1.0 and 1.1

On September 1, 2020, NCBI will no longer support connections from web browsers using TLS 1.0 or 1.1. If your browser does not support TLS 1.2 or higher, you will not be able to access NCBI web pages after September 1.

What is TLS?

TLS (Transport Layer Security) is an internet standard designed to minimize the risk of sensitive data being intercepted and used for malicious purposes while it’s being transmitted over insecure networks. TLS provides a secure connection between your computer and NCBI so that your data are protected. TLS 1.2 remedies several vulnerabilities in the older TLS 1.0 and 1.1 versions.

How can I tell if this will affect my browser?

You can check your web browser using tools such as How’s My SSL and Qualys SSL Checker. You may also want to review this list of browsers that support TLS 1.2.
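The tools above are aimed at browsers. For scripted clients, a comparable check can be sketched with Python’s standard ssl module. This is an illustration and not an NCBI-provided tool: the function names are hypothetical, and the host name in the example is an assumption. The context refuses anything older than TLS 1.2, so a successful handshake reports the protocol version actually negotiated.

```python
import socket
import ssl


def tls12_only_context():
    """Client context that refuses TLS 1.0 and 1.1."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_version(host, port=443, timeout=10):
    """Connect to host and return the negotiated protocol, e.g. 'TLSv1.2' or 'TLSv1.3'.

    Requires network access. Raises ssl.SSLError if no common version
    at TLS 1.2 or higher can be agreed with the server.
    """
    context = tls12_only_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()


if __name__ == "__main__":
    # True when the local OpenSSL build supports TLS 1.2.
    print("TLS 1.2 available locally:", ssl.HAS_TLSv1_2)
    # Host name assumed for illustration.
    print(negotiated_version("www.ncbi.nlm.nih.gov"))
```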

What errors may I see if my browser uses TLS 1.0 or 1.1?

You may see a variety of errors after September 1, but the following are common:

  • “This site can’t provide a secure connection”
  • “Secure Connection Failed”
  • “This page can’t be displayed”

What can I do if my browser uses TLS 1.0 or 1.1?

The easiest solution is to upgrade your web browser to its most recent version. Another possibility is to configure your existing browser to TLS 1.2. You may need to consult the IT staff in your organization for assistance with these options.

New Automated Validation in ClinVar Submission

You, as a submitter, are the beating heart of ClinVar. Your contributions help thousands of genetic counselors and clinicians, as well as their patients and patients’ family members. We have added validation to the online file submissions portal, so that you have more control over how to deal with errors in your submitted files.

You now have two options when submitting data. You can submit any data that passes validation and receive a report of the data that failed. The failed data can be reviewed and resubmitted when it’s convenient for you.
