Data Sets

Each data set consists of four components

All data sets have been generated from the Latent Block Model with normally distributed entries. They differ in the type of model used:

Then, the setups differ with respect to:

The following figure shows the original, unsorted data table on the left. The left and top sidebars represent the row and column class labels, respectively. In the center, the table is displayed sorted according to one of the possible row and column labelings. The summary on the right shows the block averages of the sorted data table, that is, the average of the entries that share the same row and column assignments.

[Figure: Unsorted Table (left), Sorted Table (center), Summary Table (right)]
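
The sorted and summary views described above can be reproduced from a table and one row/column labeling. The following is a minimal sketch, assuming the table is a NumPy array and the labels are plain integer codes; the function name `block_means` is hypothetical and not part of the data set.

```python
import numpy as np


def block_means(table, row_labels, col_labels):
    """Return (sorted_table, summary) for one row/column labeling.

    Assumption: labels are arbitrary hashable codes (e.g. integers),
    one per row and one per column of the table.
    """
    table = np.asarray(table, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)

    # A stable sort keeps the original order inside each class.
    row_order = np.argsort(row_labels, kind="stable")
    col_order = np.argsort(col_labels, kind="stable")
    sorted_table = table[np.ix_(row_order, col_order)]

    # One summary cell per (row class, column class) pair: the mean of
    # all entries whose row and column fall in that pair of classes.
    row_classes = np.unique(row_labels)
    col_classes = np.unique(col_labels)
    summary = np.empty((row_classes.size, col_classes.size))
    for a, k in enumerate(row_classes):
        for b, l in enumerate(col_classes):
            block = table[np.ix_(row_labels == k, col_labels == l)]
            summary[a, b] = block.mean()
    return sorted_table, summary
```

The summary has one row per row class and one column per column class, in increasing order of the class codes.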

As stated in the list above, we provide several (2000) possible row and column labels, sampled from the distribution of labels given the table (see details in paper). These labels do not cover the distribution of labels exhaustively, but they make it possible to estimate the expected performance with respect to the distribution of true labels.
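
As an illustration of how the sampled labels can be used, the sketch below averages an agreement score between a candidate labeling and each sampled labeling, which gives a Monte Carlo estimate of the expected score. The Rand index is used here only as an example of a permutation-invariant score; it is an assumption, not necessarily the criterion used in the paper, and the function names are hypothetical.

```python
import numpy as np


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree
    (both put the pair together, or both put it apart)."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    # Only look at each unordered pair once (strict upper triangle).
    iu = np.triu_indices(a.size, k=1)
    return float(np.mean(same_a[iu] == same_b[iu]))


def expected_rand_index(candidate, sampled_labels):
    """Average the score over sampled labelings (one labeling per row),
    e.g. the 2000 lines of a label file."""
    sampled = np.asarray(sampled_labels)
    return float(np.mean([rand_index(candidate, s) for s in sampled]))
```

The same averaging applies to any other score: only `rand_index` needs to be swapped out.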

Finally, the parameters that fully specify the generative model employed are provided for information as they may be useful for interpretation purposes.

Data Archives

Each .zip archive corresponds to a type of data set defined by the error, the size and the number of clusters. It contains 20 folders corresponding to 20 different data sets. Each folder is named AAA_err_BBB_n_CCC_g_DDD_tab_EEE, where:

Each folder contains four files in ASCII format (entries are tab-separated and lines end with line feeds):

  1. data_table.txt lists the entries of the data table: each line stands for a row of the data table;
  2. col_class.txt lists 2000 possible column labels: each line corresponds to a full labeling of the columns of the table;
  3. row_class.txt lists 2000 possible row labels: each line corresponds to a full labeling of the rows of the table;
  4. readme.txt specifies the parameters of the latent block model used to generate the data, together with a precise assessment of the difficulty of the task, according to the total/row/column conditional Bayes' error (see paper).
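
The files listed above can be read with a few lines of NumPy. This is a minimal sketch, assuming that all entries are numeric and that class labels are integer codes; the helper name `load_data_set` is hypothetical. The content of readme.txt is returned as raw text, since its exact layout is not described here.

```python
from pathlib import Path

import numpy as np


def load_data_set(folder):
    """Load one data set folder.

    Returns (table, row_labels, col_labels, readme), where each row of
    row_labels / col_labels is one full labeling of the rows / columns.
    """
    folder = Path(folder)
    table = np.loadtxt(folder / "data_table.txt", delimiter="\t", ndmin=2)
    # Assumption: labels are integer codes; they are read as floats
    # first so that values written as "1.0" are also accepted.
    row_labels = np.loadtxt(folder / "row_class.txt", delimiter="\t", ndmin=2).astype(int)
    col_labels = np.loadtxt(folder / "col_class.txt", delimiter="\t", ndmin=2).astype(int)

    # Each labeling must assign exactly one class to every row / column.
    if row_labels.shape[1] != table.shape[0]:
        raise ValueError("row_class.txt does not match the number of rows")
    if col_labels.shape[1] != table.shape[1]:
        raise ValueError("col_class.txt does not match the number of columns")

    readme = (folder / "readme.txt").read_text()
    return table, row_labels, col_labels, readme
```

The shape checks follow directly from the file descriptions: a line of row_class.txt has as many entries as the table has rows, and a line of col_class.txt has as many entries as the table has columns.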