We know it’s not always easy to find the sequence data you’re after at NCBI. Maybe it’s because you’re no expert at constructing queries, and you end up with no results or too many results. Or maybe you’re an Entrez wizard, but creating a query full of Booleans and filters seems like overkill when you could just write a short natural language query, like you’re used to doing in Google. The next time you search for a gene, transcript or genome assembly for a given organism, try the new search experience we’re piloting in NCBI Labs.
In NCBI Labs, you can now search for sequences using natural language and get the best results.
Figure 1. The new interface for specified transcript search.
The improved search experience now available in NCBI Labs addresses 3 types of queries that commonly fail in searches at NCBI: organism-gene (e.g. human BRCA1), organism-transcript (e.g. Mouse p53 transcripts) and organism-assembly (e.g. dog reference genome). For each of these query types in NCBI Labs, we now return NCBI’s highest quality sequence sets or reference and representative assemblies in an easy-to-view panel.
Example queries are shown below to get you started.
The 2018 Nucleic Acids Research database issue features several papers from NCBI staff that cover the status and future of databases including CCDS, ClinVar, GenBank and RefSeq. These papers are also available on PubMed. To read an article, click on the PMID number listed below.
GenBank release 223.0 (12/15/2017) has 206,293,625 traditional records (including non-bulk-oriented TSA) containing 249,722,163,594 base pairs of sequence data. In addition, there are 551,063,065 WGS records containing 2,466,098,053,327 base pairs of sequence data, 201,559,502 TSA records containing 181,394,660,188 base pairs of sequence data, and 12,695,198 TLS records containing 4,458,042,616 base pairs of sequence data.
GenBank release 221.0 (8/13/2017) has 203,180,606 traditional records containing 240,343,378,258 base pairs of sequence data. In addition, there are 499,965,722 WGS records containing 2,242,294,609,510 base pairs of sequence data, 186,777,106 TSA records containing 167,045,663,417 base pairs of sequence data, and 1,628,475 TLS records containing 824,191,338 base pairs of sequence data.
Entrez Direct is a UNIX/LINUX command-line interface to E-utilities, the API to the NCBI Entrez system. One of Entrez Direct’s most useful features is its ability to parse and reformat complex XML data returns from EFetch. In this post, we will explore how to use these features to parse, reformat and process specific data from PubMed records downloaded in XML using EFetch. Though this post focuses on PubMed, the technique is universal and applies to any XML returned by E-utilities from any database. The example explored here is also presented briefly in the Entrez Direct documentation; here we’ll dive in a bit depeer to see how it works. Let’s get started!
NCBI, in collaboration with NLM and the National Network of Libraries of Medicine NLM Training Center (NTC) at the University of Utah, recently presented the second offering of A Librarian’s Guide to NCBI. Health Sciences Librarians from 17 universities and two federal agencies attended the five-day intensive course on the NIH campus. This second offering of the training continues to prepare health science librarians for supporting NCBI molecular databases and tools, and training patrons in the use of NCBI resources at their own institutions.
Participants and instructors from the 2014 “A Librarian’s Guide to NCBI” outside of the National Library of Medicine.
As before, all the course materials are available online. Feel free to learn from them, adapt them for your own teaching, and share them with others. You can use the links below to access the updated 2014 course materials. These include the slide sets with demonstrations and practice problems.