Update: NCBI is now in the process of merging EST and GSS records into the Nucleotide database, and we expect to complete this process in early 2019. Accession.version and GI identifiers will not change during this process.
As of December 1, 2018, all records from the databases for Expressed Sequence Tags (EST) and Genome Survey Sequences (GSS) will reside in NCBI’s Nucleotide database. This change will provide a single point of access for all GenBank sequence data with a common look and feel.
Read more to learn about how this change affects these resources:
- Websites (Entrez)
- APIs (E-utilities)
- FTP sites
- Submission procedures
- TSA (have a look if you’re not familiar!)
Why are we doing this?
Sequencing technologies have moved away from generating ESTs for evaluating gene expression and GSSs for evaluating clone libraries in favor of next-generation data, which is deposited in resources such as the Sequence Read Archive (SRA). Consolidating the EST and GSS datasets thus helps us to align our services to the current needs of the bioinformatics community.
Changes to websites
The most notable change is that the EST (nucest) and GSS (nucgss) Entrez databases will be retired, along with the default EST and GSS record formats. All EST and GSS records will be moved to the Nucleotide (nuccore) database and will have the default “GenBank” view shared by all current Nucleotide records (see Figure 1). New filters will be added to the Nucleotide database to make it easy for users to remove or include EST and GSS sequences in search results.
Changes to APIs
Similar to the changes on the web, there will no longer be separate EST and GSS databases (db=nucest and db=nucgss) in the E-utilities API. Requests containing these values will be redirected to db=nuccore. The same redirection applies to dbfrom in ELink requests. However, values of linkname containing nucest or nucgss will be ignored after December 1, so those requests will not be restricted by linkname and will return all possible links between dbfrom and db.
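For scripts that still send the retired database names, the redirection described above can be made explicit on the client side. The following is only a minimal sketch: `migrate_params` is a hypothetical helper, not part of E-utilities, and dropping a legacy linkname merely mirrors the "ignored" behavior described above. In practice you would probably want to substitute an appropriate nuccore linkname instead.

```python
# Retired Entrez database names that are redirected to nuccore.
LEGACY_DBS = {"nucest", "nucgss"}


def migrate_params(params):
    """Return a copy of E-utilities request parameters with retired
    EST/GSS database names replaced by nuccore.

    Hypothetical helper for illustration only.
    """
    out = dict(params)
    # db and dbfrom are both redirected to nuccore.
    for key in ("db", "dbfrom"):
        if out.get(key) in LEGACY_DBS:
            out[key] = "nuccore"
    # A linkname that mentions a retired database would be ignored by the
    # server, so remove it here to make the unrestricted result explicit.
    linkname = out.get("linkname", "")
    if any(name in linkname for name in LEGACY_DBS):
        del out["linkname"]
    return out
```

For example, `migrate_params({"db": "nucest", "term": "human"})` yields the same parameters with `db` set to `nuccore`.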
Users accessing the Nucleotide database (db=nuccore, db=nucleotide) with esearch should note that after December 1, search results will include any EST and GSS records that match the provided query (term). This may markedly increase the number of records returned. To remove EST and GSS sequences from esearch results, add the following terms to your query:
NOT gbdiv_est[prop] NOT gbdiv_gss[prop]
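As a rough illustration, the exclusion terms can be appended when building an esearch request. This sketch only constructs the URL and does not send it. The helper name `build_esearch_url` and the `retmax` default are assumptions for the example, and the filter terms are written in the underscore form used by Entrez property filters (gbdiv_est[prop], gbdiv_gss[prop]).

```python
from urllib.parse import urlencode

# Public esearch endpoint of the E-utilities API.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Terms that remove EST and GSS records from Nucleotide search results.
EXCLUDE_EST_GSS = "NOT gbdiv_est[prop] NOT gbdiv_gss[prop]"


def build_esearch_url(term, exclude_est_gss=True, retmax=20):
    """Build an esearch URL against db=nuccore (hypothetical helper).

    The user's query is parenthesized so that the NOT clauses apply to
    the whole query rather than only to its last term.
    """
    if exclude_est_gss:
        term = f"({term}) {EXCLUDE_EST_GSS}"
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Example: a gene search that leaves out EST and GSS records.
url = build_esearch_url("BRCA1[gene]")
```

Passing `exclude_est_gss=False` leaves the query untouched, which makes it easy to compare record counts with and without EST and GSS sequences.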
Changes to FTP sites
EST and GSS data have been part of the regular GenBank release set for many years, and will continue to be available at ftp.ncbi.nlm.nih.gov/genbank/. These data will be in the standard GenBank format. After December 1, 2018, the current specialized (default) EST and GSS formats will no longer be available by FTP at ftp.ncbi.nlm.nih.gov/repository/dbEST and ftp.ncbi.nlm.nih.gov/repository/dbGSS/.
Changes to submission procedures
We will continue to accept submissions of EST and GSS sequences; however, there will no longer be special processes for submitting these sequence types. We recommend that submitters of EST and GSS data begin using the tool tbl2asn now. This tool will be required after December 1, 2018. Please write to firstname.lastname@example.org for more information.
Changes to BLAST
For many years BLAST has supported distinct databases for EST and GSS data, and these are available from the database pulldown on nucleotide BLAST web pages. We will continue to support these databases beyond December 1, 2018, for both web and standalone BLAST, so there is no need to alter any process that depends on these databases.
TSA – Have a look!
Finally, we encourage interested users to consider Transcriptome Shotgun Assembly (TSA) data as a rich source of information about expressed sequences. TSA data are computational assemblies of sequence reads, and as such make attractive BLAST databases for identifying putative transcripts (choose “Transcriptome Shotgun Assembly (TSA)” from the nucleotide BLAST database menu).
We thank all past and present submitters of EST and GSS data for the invaluable benefit these data have provided to numerous genomic sequencing projects over the years. Please let us know if you have any questions or concerns about these changes!