UMR CNRS 7253

This repository aims to share data sets for the analysis of co-clustering algorithms. It currently contains 72 artificial real-valued data tables generated from latent block models. The main benefits of using these data sets are that:

1. the true generative model is known;
2. several plausible ground truths are provided for labeling;
3. the ultimate classification error is carefully controlled;
4. the magnitudes of the row and column classification errors are similar.

Item 1 ensures that the co-clustering structure exists; items 2 and 3 enable accurate absolute and relative assessment of the benchmarked co-clustering algorithms; item 4 ensures that the learning problem genuinely belongs to the category of co-clustering problems, whose analysis cannot be conducted with one-way clustering tools. A more comprehensive description of the data sets is given here, and supporting evidence for the claims above is detailed there.
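
To make the notion of a table generated from a latent block model concrete, here is a minimal sketch in Python. It is not the procedure used to build the 72 tables: the Gaussian noise, the function name `sample_lbm`, and all parameter values are assumptions chosen for illustration. Each row and each column receives a hidden cluster label, and every cell is drawn around the mean of the block defined by its row and column labels.

```python
import numpy as np

def sample_lbm(n_rows, n_cols, row_props, col_props, means, sigma, seed=0):
    """Draw one table from a Gaussian latent block model (illustrative only)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    # Hidden row and column cluster labels, drawn independently.
    z = rng.choice(len(row_props), size=n_rows, p=row_props)
    w = rng.choice(len(col_props), size=n_cols, p=col_props)
    # Cell (i, j) is centered on the mean of block (z[i], w[j]).
    x = means[np.ix_(z, w)] + sigma * rng.standard_normal((n_rows, n_cols))
    return x, z, w

# Hypothetical setting: 3 row clusters, 2 column clusters.
block_means = [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]]
x, z, w = sample_lbm(50, 40, [0.5, 0.3, 0.2], [0.6, 0.4], block_means, sigma=1.0)
```

Because `z` and `w` are returned alongside `x`, the partition produced by a co-clustering algorithm can be compared with the labels that actually generated the table, which is what a known generative model makes possible.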

For information about citing data sets in publications, please see here.

If you have comments or suggestions, wish to donate a series of data sets, or have any other question, feel free to contact the repository maintainer.
