Tag: Genome assemblies

RefSeq Release 208 is available!

RefSeq release 208 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of September 7, 2021, and contains 288,903,207 records, including 210,703,648 proteins, 40,213,945 RNAs, and sequences from 113,002 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
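
As a rough illustration of E-utilities access, the sketch below only builds an ESearch request URL and does not send it. The search term, the choice of the nuccore database, and the helper name build_esearch_url are assumptions made for this example, not part of the release announcement.

```python
from urllib.parse import urlencode

# Base address of the E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="nuccore", retmax=5):
    """Build an ESearch URL (hypothetical helper; no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example term (an assumption): RefSeq nucleotide records for one organism.
url = build_esearch_url("Ovis aries[Organism] AND srcdb_refseq[PROP]")
print(url)
```

Fetching the URL returns a JSON list of matching record identifiers, which can then be passed to other E-utilities calls.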

Sept 22 Webinar: Using NCBI Datasets command-line tools to access data and metadata for genomes

Join us on September 22, 2021 at 12 PM Eastern time to learn how to use the Datasets command-line tools (datasets and dataformat) to access, filter, download, and format data and metadata for genomes. Through examples from eukaryotes and the SARS-CoV-2 coronavirus, you will see how to use metadata to filter for genome sequences with desired properties, such as genomes with high contig N50 values.

  • Date and time: Wed, September 22, 2021 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Announcing the RefSeq annotation of sheep ARS-UI_Ramb_v2.0!

The new reference assembly for sheep is now annotated! Assembly ARS-UI_Ramb_v2.0 is made of 142 scaffolds, a drop from 2,640 in the 2017 assembly Oar_rambouillet_v1.0. With a contig N50 of 43 Mb, ARS-UI_Ramb_v2.0 is 15 times more contiguous than the first assembly of the Rambouillet breed.
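
For readers unfamiliar with the metric, here is a minimal sketch of how a contig N50 can be computed from a list of contig lengths. The contig lengths are made-up toy values, not data from either sheep assembly, and the function name contig_n50 is chosen for this example.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assemblies with the same total length (120) but different contiguity.
toy_fragmented = [40, 30, 20, 10, 10, 5, 5]
toy_contiguous = [80, 40]

print(contig_n50(toy_fragmented))
print(contig_n50(toy_contiguous))
```

A higher N50 for the same total length means the sequence is held in fewer, longer contigs, which is the sense in which one assembly is called more contiguous than another.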

Annotation Release 104 (AR 104) of ARS-UI_Ramb_v2.0 reflects these improvements. Nearly 200 more coding genes have a 1:1 ortholog in the human genome than in the annotation of Oar_rambouillet_v1.0 (AR 103). The number of coding models annotated as partial is down 35% from 165 to 107, and the number of coding models labeled low quality due to suspected indels or base substitutions in the underlying genomic sequence decreased by 51% (1646 to 796). Based on BUSCO analysis, 99.1% of the models (cetartiodactyla_odb10) are complete in AR 104 versus 98.8% in AR 103. Details of this annotation, including statistics on the annotation products, the input data used in the pipeline and intermediate alignment results, can be found here.
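
The percentage drops quoted above can be checked from the raw counts. The short sketch below recomputes them; the helper name percent_decrease is chosen for this example. The results, about 35.2% and 51.6%, match the quoted 35% and 51% when truncated to whole percents.

```python
def percent_decrease(before, after):
    """Percentage drop going from `before` to `after`."""
    return (before - after) / before * 100


# Counts quoted in the post for AR 103 versus AR 104.
partial_models = percent_decrease(165, 107)
low_quality_models = percent_decrease(1646, 796)

print(f"partial models: {partial_models:.1f}% fewer")
print(f"low-quality models: {low_quality_models:.1f}% fewer")
```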

Aug 18 Webinar: Finding Data for your Research Organism: Plants and RNA-Seq data

Join us on August 18, 2021 at 12PM eastern time for the second webinar on finding data for your non-model research organism. In this webinar, you will learn how to use NCBI’s web resources to get data for a plant species, the black cottonwood. You will see how to find, access, and analyze gene and sequence data from Datasets and other NCBI web resources, as well as sample metadata and gene expression RNA-Seq data from SRA and the SRA Run Selector. You will also see an example that highlights how to use and analyze these data in a typical workflow set up in a Jupyter notebook that uses the NCBI next-gen aligner Magic-BLAST to get relative gene expression levels across samples.

  • Date and time: Wed, August 18, 2021 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Introducing the new NCBI Datasets Genomes page

The updated NCBI Datasets Genomes page now has genome data for all domains of life, including bacterial and viral genomes.

The genomes table (Figure 1) now offers filters for:

  • Reference genomes — switch it on to only show reference or representative genomes
  • Annotated — switch it on to only show annotated genomes
  • Assembly level — use the assembly level slider to select higher-quality genomes
  • Year released — use the slider to limit your results to recent genomes

In addition, the new Actions column connects you to NCBI’s Genome Data Viewer, BLAST, and Assembly. The Text filter box lets you search by the name of the assembly, species/infraspecies, or submitter.

Figure 1. The new Datasets Genomes page with primate assemblies showing the STATUS switches (reference genomes, annotated); expanded filters section with ASSEMBLY LEVEL and YEAR RELEASED sliding selectors; and the Actions column menu with access to Assembly details, BLAST, the Genome Data Viewer, and Download options.

Announcing the re-annotation of RefSeq genome assemblies for E. coli and four other species!

We have re-annotated all RefSeq genomes for Escherichia coli, Mycobacterium tuberculosis, Bacillus subtilis, Acinetobacter pittii, and Campylobacter jejuni using the most recent release of PGAP. You will find that more genes now have gene symbols (e.g. recA). Your feedback indicated that the lack of symbols was an impediment to comparative analysis, so we hope that this improvement will help.

The number of re-annotated genomes is 25,619 for E. coli, 470 for B. subtilis, 6,828 for M. tuberculosis, 316 for A. pittii, and 1,829 for C. jejuni. On average, the increase in gene symbols is 30% in E. coli, 110% in B. subtilis, 57% in M. tuberculosis, 94% in A. pittii, and 62% in C. jejuni (see Figure 1). After re-annotation, on average, 73% of PGAP-annotated E. coli genes and 79% of B. subtilis genes have symbols (35% for M. tuberculosis, 40% for A. pittii, and 46% for C. jejuni). We assigned symbols to the annotated genes by calculating the orthologs between the genome of interest and the reference assembly for the species, and then transferring the symbols from the reference genes to their orthologs in the annotated genomes.
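
The transfer step described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the ortholog calculation itself is not shown, the gene identifiers and the ortholog table are invented toy data, and the function name transfer_symbols is hypothetical rather than part of PGAP.

```python
def transfer_symbols(reference_symbols, orthologs):
    """Copy gene symbols from reference genes to their orthologs.

    reference_symbols: reference gene ID -> symbol (None if it has no symbol)
    orthologs: annotated-genome gene ID -> reference gene ID
    Returns: annotated-genome gene ID -> transferred symbol
    """
    transferred = {}
    for target_gene, ref_gene in orthologs.items():
        symbol = reference_symbols.get(ref_gene)
        if symbol:  # skip reference genes that carry no symbol
            transferred[target_gene] = symbol
    return transferred


# Toy data (hypothetical IDs): three reference genes, four annotated genes.
reference_symbols = {"ref_0001": "recA", "ref_0002": "gyrB", "ref_0003": None}
orthologs = {"tgt_A": "ref_0001", "tgt_B": "ref_0003", "tgt_C": "ref_0002"}
annotated_genes = ["tgt_A", "tgt_B", "tgt_C", "tgt_D"]  # tgt_D has no ortholog

symbols = transfer_symbols(reference_symbols, orthologs)
fraction_with_symbols = len(symbols) / len(annotated_genes)
print(symbols)
print(f"{fraction_with_symbols:.0%} of genes have symbols")
```

In this toy case, genes without an ortholog, or whose ortholog has no symbol in the reference, stay unnamed, which is one way the share of genes with symbols could differ between species.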

Figure 1: Average and standard deviation of the number of genes annotated with symbols per genome, in the previous (blue) and the current annotation (orange). 

Assembly database passes 1 million genome assemblies!

The NCBI Assembly database now provides sequence and metadata for more than 1 million genome assemblies from over 85,000 different species.

Assembly crossed the 1 million genome assemblies milestone on Sunday, April 18, 2021 (Figure 1).

Figure 1. Assembly status and growth. More than 1 million assemblies are now searchable through the NCBI web site (top panel). The number of genome assemblies at NCBI has grown rapidly in the past decade.

New NCBI Datasets home and documentation pages provide easier access

NCBI Datasets, the new set of services for downloading genome assembly and annotation data (previous Datasets posts), has redesigned and reorganized web pages to make it easier to find and access the services and documentation you need.

NCBI Datasets has a fresh new homepage (Figure 1) highlighting the types of data available through our tools. Available data include genome assemblies, genes, and SARS-CoV-2 genomic and protein data.  You can easily access these from the new page or learn more with our new documentation pages.

Figure 1. Features of the new Datasets homepage with quick access to help documentation including the Quickstart and How-to guides as well as access to Genome, Gene, and Coronavirus Data, and the Datasets and Dataformat command-line tools.

RefSeq Release 205 is available!

RefSeq release 205 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of March 1, 2021, and contains 269,975,565 records, including 197,232,209 proteins, 36,514,168 RNAs, and sequences from 108,257 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

New release of the Read Assembly and Annotation Pipeline Tool (RAPT), now 2X faster!

There is a new release of the Read Assembly and Annotation Pipeline Tool (RAPT) available from our GitHub site. RAPT is a one-step application for the genome assembly and gene annotation of archaeal and bacterial isolates that can run on your local computer or the Google Cloud Platform (GCP). With this new release, jobs will run twice as fast as with the December release. For example, we have assembled and annotated a Salmonella enterica genome in under an hour on a 16-CPU machine with the new release.

We have also added several new features based on your feedback, including:

  1. The --stop-on-errors flag, which stops the process if the average nucleotide identity check finds evidence of a sample mix-up or contamination by other bacteria.
  2. The ability to accept forward and reverse reads of paired-end runs in separate files. These can be compressed (gzip) files.

Finally, thanks to all who came to our webinar in December and provided their comments! For those who couldn’t join us, you can now view the recording on our YouTube channel.

Contact us at prokaryote-tools@ncbi.nlm.nih.gov with any questions or to let us know if you would like to become a beta-tester for RAPT.