JCVI: Tolerating Some Redundancy Significantly Speeds Up Clustering of Large Protein Databases.
 
 
Publications

Citation

Li W, Jaroszewski L, Godzik A

Tolerating Some Redundancy Significantly Speeds Up Clustering of Large Protein Databases.

Bioinformatics (Oxford, England). 2002 Jan 01; 18: 77-82.

Abstract

Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases such as the NCBI Non-Redundant database (NR) with even the best currently available clustering algorithms is very time-consuming and practical only at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in approximately 1 h and at 75% identity in approximately 1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower attainable thresholds.
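
The idea of replacing groups of similar sequences with single representatives can be illustrated with a minimal Python sketch. It is not the CD-HI implementation: the function names `identity` and `cluster` are hypothetical, the identity measure is a crude ungapped position-by-position comparison standing in for a real alignment, and processing the longest sequences first is an assumption made for the example.

```python
def identity(a: str, b: str) -> float:
    """Crude stand-in for alignment identity (assumption, not a real aligner):
    fraction of matching positions over the shorter sequence, without gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def cluster(seqs: list[str], threshold: float) -> dict[str, list[str]]:
    """Greedy clustering: visit sequences from longest to shortest; each one
    joins the first representative it matches at >= threshold, otherwise it
    becomes the representative of a new cluster."""
    clusters: dict[str, list[str]] = {}
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(rep, s) >= threshold:
                clusters[rep].append(s)
                break
        else:
            # No existing representative is similar enough.
            clusters[s] = [s]
    return clusters


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG", "MKTAYI"]
    for rep, members in cluster(toy, threshold=0.9).items():
        print(rep, members)
```

Every sequence is compared against every existing representative in the worst case, so the cost of this naive version grows quickly with database size, which is the kind of expense the abstract describes.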

This publication is listed for reference purposes only. It may be included to present a more complete view of a JCVI employee's body of work, or as a reference to a JCVI sponsored project.