Description:Protein clustering on the Grid with CD-HIT
Abstract:CD-HIT clusters a protein or genome sequence database: it removes sequences that are redundant at a given sequence similarity level and generates a new database containing only the cluster representatives. Because protein and genome databases grow daily, clustering datasets of interest on a single machine is not feasible due to memory constraints. A Grid environment allows the database to be distributed adaptively in order to optimize the overall analysis. This activity was proposed by CNIO (Spanish National Cancer Research Centre) and started in the context of the BioGridNet Program.