GGRaSP

An R-package for selecting representative genomes using Gaussian mixture models

GGRaSP (Gaussian Genome Representative Selector with Prioritization) is an R-package that reduces a set of genomes to a smaller representative subset, prioritizing the retention of genomes of interest to the user while minimizing the loss of genetic variation. GGRaSP also supports unsupervised clustering: it models the genomic relationships with a Gaussian mixture model to select an appropriate cluster threshold, which makes it suitable for both generalizable high-throughput use and more dataset-specific analyses.
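
The idea of choosing a cluster threshold from a Gaussian mixture can be sketched in a few lines. The example below is written in Python for illustration only; it is not the GGRaSP API, and the cut-off rule it uses (the point between the two component means where their weighted densities are equal) is an assumption rather than a description of what the package does internally. The function names and the toy distances are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(values, iterations=200):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Start from the lower and upper halves of the sorted distances.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    spread = (ordered[-1] - ordered[0]) / 4.0 or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each distance.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
            total = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean, and standard deviation.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk == 0.0:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
            weights[k] = nk / len(values)
    return weights, means, sds

def crossing_threshold(weights, means, sds, steps=60):
    """Bisect between the two means for the equal-weighted-density point."""
    low, high = sorted((0, 1), key=lambda k: means[k])
    lo, hi = means[low], means[high]
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        near = weights[low] * normal_pdf(mid, means[low], sds[low])
        far = weights[high] * normal_pdf(mid, means[high], sds[high])
        if near > far:
            lo = mid   # still dominated by the small-distance component
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical pairwise distances: small within groups, large between them.
within = [0.01, 0.02, 0.015, 0.03, 0.025, 0.012, 0.018, 0.022]
between = [0.45, 0.50, 0.55, 0.48, 0.52, 0.47, 0.53, 0.50]
distances = within + between

weights, means, sds = fit_two_gaussians(distances)
threshold = crossing_threshold(weights, means, sds)
print(f"suggested cut-off: {threshold:.3f}")
```

On this toy input the cut-off falls between the within-group and between-group distances, so cutting a hierarchical clustering at that height would separate the groups without the user having to supply a threshold.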

Key Features

  • Rapidly simplify large datasets containing up to several thousand genomes.
  • Optionally run without any a priori knowledge of the shape of the data.
  • Generate images, tables, and annotation files that enable detailed analysis of the phylogeny and the GGRaSP clusters.

Sample Output

The capabilities of GGRaSP are demonstrated by reducing a genomic dataset of 4,600 Escherichia coli genomes to a list of 315 representative genomes, prioritizing selection by type strain and by genome completeness. The panels show the original 4,600-genome set (A), the set clustered using the cut-off (B), and the set reduced to 315 representative genomes (C).
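
Prioritized selection of one representative per cluster can be pictured with a small Python sketch. It is an illustration under stated assumptions, not GGRaSP's implementation: each genome is assumed to carry a numeric rank (lower is better, for example 0 for a type strain), and ties within the best rank are assumed to be broken by choosing the medoid, the genome with the smallest total distance to the rest of its cluster. All names, ranks, and distances below are hypothetical.

```python
def total_distance(genome, members, dist):
    """Sum of distances from one genome to every other cluster member."""
    return sum(dist[frozenset((genome, m))] for m in members if m != genome)

def pick_representative(members, dist, rank):
    """Best-ranked member of a cluster; ties go to the medoid."""
    best_rank = min(rank[m] for m in members)
    candidates = [m for m in members if rank[m] == best_rank]
    return min(candidates, key=lambda c: total_distance(c, members, dist))

# Hypothetical clusters, pairwise distances, and priority ranks.
clusters = {
    "cluster1": ["A", "B", "C"],
    "cluster2": ["D", "E", "F"],
}
dist = {
    frozenset(("A", "B")): 0.02,
    frozenset(("A", "C")): 0.01,
    frozenset(("B", "C")): 0.03,
    frozenset(("D", "E")): 0.01,
    frozenset(("D", "F")): 0.04,
    frozenset(("E", "F")): 0.02,
}
# B is treated as a type strain (rank 0); every other genome has rank 1.
rank = {"A": 1, "B": 0, "C": 1, "D": 1, "E": 1, "F": 1}

representatives = {
    name: pick_representative(members, dist, rank)
    for name, members in clusters.items()
}
print(representatives)
```

In the first cluster the prioritized genome B is kept even though A is the medoid. In the second cluster no genome is prioritized, so the medoid E is chosen.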

Publications

GGRaSP: A R-package for selecting representative genomes using Gaussian mixture models.
Bioinformatics (Oxford, England). 2018-04-14;
PMID: 29668840

Funding

This project has been funded in whole or part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under Award Number U19AI110819.