Using Standard Clustering to Cluster Data

Rulex can cluster data with a k-means algorithm, by dividing a given dataset into k clusters. The statistical average of all the data items in the same cluster is defined as the cluster centroid.

 A k-means (or k-medians, or k-medoids, according to the option specified by the user) clustering algorithm is employed to aggregate representative records with similar profiles. The centroid of each cluster provides the values of the profile attributes to be used in a subsequent Apply Model task when a new pattern is assigned to that cluster.

The input dataset contains the following attribute roles:

  • profile attributes: attributes to be employed to measure similarities in an unsupervised learning problem. To preserve generality a profile attribute can also be a label attribute. If nominal profile attributes are used, a combination of k-means and k-modes is adopted to deal with them.

  • cluster id: optional nominal attribute providing the initial cluster assignment for each pattern.

  • weight: optional variable used to provide a measure of relevance for each example in the dataset, thus affecting the position of the cluster centroid.
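The effect of the weight role on the centroid position can be pictured with a minimal sketch. It assumes the centroid is a plain weighted mean of the profile attributes; the function name `weighted_centroid` and the sample values are hypothetical and are not part of Rulex.

```python
import numpy as np

def weighted_centroid(points, weights=None):
    """Mean of the rows of `points`, optionally weighted per row (assumed weighting scheme)."""
    points = np.asarray(points, dtype=float)
    if weights is None:
        return points.mean(axis=0)
    return np.average(points, axis=0, weights=np.asarray(weights, dtype=float))

# Made-up records with two profile attributes each
points = [[0.0, 0.0], [2.0, 0.0], [4.0, 6.0]]
unweighted = weighted_centroid(points)
# Doubling the weight of the last record pulls the centroid toward it
weighted = weighted_centroid(points, weights=[1, 1, 2])
```

With these values the unweighted centroid is (2, 2), while the weighted one moves to (2.5, 3).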


Prerequisites

Additional tabs

Along with the Options tab, where the task can be configured, the following additional tabs are provided:

  • Documentation tab where you can document your task,

  • Parametric options tab where you can configure process variables instead of fixed values. The parametric equivalent of each option is listed in the PO column of the options table on this page.

  • Clusters & Results tabs where you can see the output of the task computation. See the Results section below.


Procedure

  1. Drag and drop the Standard Clustering task onto the stage.

  2. Connect a Split Data task, which contains the attributes you want to cluster, to the new task.

  3. Double click the Standard Clustering task.

  4. Configure the task options as described in the table below.

  5. Save and compute the task. 

Standard K-Means Clustering options

Parameter Name

PO

Description

Attributes to consider for clustering

profilenames

Drag and drop the attributes that will be used as profile attributes in the clustering computation.

Clustering type

centroidtype

Three different approaches for computing cluster centroids are available:

  • k-means, where the mean is used to compute the cluster centroid

  • k-medians, where the median is used to compute the cluster centroid

  • k-medoids, where the point of the dataset closest to the mean is used as the cluster centroid.

Clustering algorithm

kmeanstype

Three different clustering algorithms are available:

  • Standard, where cluster centroids are recomputed only after all the points have been reassigned;

  • Incremental, where cluster centroids are recomputed after each point is moved;

  • Error-based, where the decision to move a point is made by minimizing the error, rather than the distance from the cluster centroid.

Distance method for clustering

distmethod

The method employed for computing distances between examples.

Possible methods are: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized), Pearson.

Details on these methods are provided in the Distance parameter of the Managing Attribute Properties page.

Distance method for evaluation

evaldistmethod

Select the distance method to be used for evaluation. Possible values are: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized), Pearson.

For details on these methods see the Managing Attribute Properties page.

Normalization for ordered variables

normtype

Type of normalization adopted when treating ordered (discrete or continuous) variables.

Every attribute can have its own value for this option, which can be set in the Data Manager. Details on these options are provided in the Distance parameter of the Managing Attribute Properties page.

These choices are preserved if Attribute is selected in the present menu; every other value (e.g. Normal) supersedes previous selections for the current task.

Initial assignment for clusters

assigntype

Procedure adopted for the initial assignment of points to clusters; it may be one of the following:

  • Random: very fast, but less accurate; with this choice the algorithm can be executed several times (see the Number of executions option), starting from different random initializations, to retrieve a better result;

  • Smart: it can be slow, but tries to produce initial clusters having maximum distance from each other;

  • Weight-based: clusters are initialized by taking into account weights (if present); in particular, points with high weight are placed into different clusters.

(Optional) attribute for initial cluster assignment

clusteridname

Optionally select a specific attribute from the drop-down list, which will be used as an initial cluster assignment.

(Optional) attribute for weights

weightname

Optionally select an attribute from the drop-down list, which will be used as a weight in the clustering process.

Number of clusters to be generated

nclustot

The required number of clusters. The number of clusters cannot exceed the number of different examples in the training set.

Number of executions

ntimes

Number of subsequent executions of the clustering process (to be used in conjunction with Random as the Initial assignment for clusters option); the best result among them is retained.

Maximum number of iterations

nkmiter

Maximum number of k-means iterations within each execution of the clustering process.

Minimum decrease in error value

mindecrease

The error value corresponds to the average distance of each point from the respective centroid.

This value, measured at each iteration, should gradually decrease. When the error decrease value (i.e. the difference in error between the current and previous iteration) falls below the threshold specified here, the clustering process stops immediately since it is supposed that no further significant changes in error will occur.

Initialize random generator with seed

initrandom

If checked, the positive integer shown in the box is used as an initial seed for the random generator; with this choice two executions with the same options produce identical results.

Keep attribute roles after clustering

keeproles

If selected, roles defined in the clustering task (such as profile, labels, weight and cluster id) will be maintained in subsequent tasks in the process. 

Aggregate data before processing

aggregate

If checked, identical patterns in the training set are considered as a single point in the clustering process.

Append results

append

Additional attributes produced by previous tasks are maintained at the end of the present one, rather than being overwritten.
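How several of the options above interact (clustering type, maximum number of iterations, minimum decrease in error value and random seed) can be pictured with a minimal sketch of the Standard algorithm. This is not the Rulex implementation: all function and parameter names are hypothetical, only the Euclidean distance is used, and the random initialization picks points as starting centroids, which is a simplification of the initial assignment procedures described above.

```python
import numpy as np

def compute_centroid(members, centroid_type="means"):
    """Centroid of one cluster according to the chosen clustering type."""
    if centroid_type == "means":
        return members.mean(axis=0)
    if centroid_type == "medians":
        return np.median(members, axis=0)
    if centroid_type == "medoids":
        # Dataset point closest to the mean, as described in the options table
        mean = members.mean(axis=0)
        return members[np.argmin(np.linalg.norm(members - mean, axis=1))]
    raise ValueError(f"unknown centroid type: {centroid_type}")

def standard_kmeans(points, n_clusters, centroid_type="means",
                    max_iter=100, min_decrease=1e-6, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)  # fixed seed gives reproducible runs
    # Simplified random start: use n_clusters distinct points as centroids
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)]
    prev_error = np.inf
    for _ in range(max_iter):
        # Reassign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Error: average distance of each point from its centroid
        error = dists[np.arange(len(points)), labels].mean()
        if prev_error - error < min_decrease:
            break  # no significant improvement since the previous iteration
        prev_error = error
        # "Standard" variant: recompute centroids only after all reassignments
        centroids = np.array([
            compute_centroid(points[labels == k], centroid_type)
            if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
    return labels, centroids, error

# Made-up data: two well separated groups of three points
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids, error = standard_kmeans(points, n_clusters=2, seed=0)
```

Running the sketch several times with different seeds and keeping the result with the lowest error mirrors the idea behind the Number of executions option.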

Results

The results of the task can be viewed in the following tabs:

  • The Clusters tab displays a spreadsheet with the values of the profile attributes for the centroids of the created clusters, together with the number of elements and the dispersion coefficient (given by the normalized average distance of cluster members from the centroid) of each cluster. In particular, the columns clustnum, nelem and normsdev contain the index of the cluster, the number of elements and the dispersion coefficient, respectively. The last row, characterized by a null index in the column clustnum, reports the values pertaining to the default cluster, obtained by including all the elements of the training set in a single group.

  • The Results tab displays a summary of the computation performed, including:

    • the execution time,

    • the number of valid training samples,

    • the average weight of training samples,

    • the number of clusters built,

    • the average dispersion of clusters,

    • the dispersion coefficient of the default cluster,

    • the minimum and the maximum number of points in clusters,

    • the number of singleton clusters, i.e. clusters containing a single point of the training set.
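Some of the quantities listed above can be sketched as follows. The normalization applied by Rulex to the dispersion coefficient is not specified on this page, so this sketch uses the plain (unnormalized) average Euclidean distance; the function name `cluster_summary`, the dictionary keys and the sample data are all hypothetical.

```python
import numpy as np

def cluster_summary(points, labels, centroids):
    """Summary statistics for a clustering, using unnormalized average distances."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    centroids = np.asarray(centroids, dtype=float)
    n_clusters = len(centroids)
    sizes = np.bincount(labels, minlength=n_clusters)
    # Distance of every point from the centroid of its own cluster
    dist = np.linalg.norm(points - centroids[labels], axis=1)
    per_cluster = np.bincount(labels, weights=dist, minlength=n_clusters) / np.maximum(sizes, 1)
    # Default cluster: all points in one group around the overall mean
    default_disp = np.linalg.norm(points - points.mean(axis=0), axis=1).mean()
    return {
        "n_clusters": n_clusters,
        "dispersion": per_cluster,
        "average_dispersion": per_cluster[sizes > 0].mean(),
        "default_dispersion": default_disp,
        "min_points": int(sizes.min()),
        "max_points": int(sizes.max()),
        "singletons": int((sizes == 1).sum()),
    }

# Made-up example: one cluster of two points and one singleton cluster
summary = cluster_summary(
    points=[[0, 0], [2, 0], [10, 10]],
    labels=[0, 0, 1],
    centroids=[[1, 0], [10, 10]],
)
```

A default dispersion much larger than the average dispersion of the clusters suggests that the subdivision groups the data more tightly than a single cluster would.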

Example

The following example is based on the Adult dataset.

Scenario data can be found in the Datasets folder in your Rulex installation.

The scenario aims to divide the dataset into a specific number of defined clusters.

The following steps were performed:

  1. Import the Adult dataset with an Import from Text File task.

  2. Ignore the income attribute in a Data Manager task.

  3. Split the dataset into a test and training set with a Split Data task.

  4. Generate the required clusters in the Standard Clustering (K-means) task.

  5. Apply the clustering model to the dataset with an Apply Model task.

  6. Use the Take a look functionality to check the results of the forecast.

Procedure

After having imported the Adult dataset with an Import from Text File task, add a Data Manager task to the process.

In the Data Manager we can see that the attributes in the dataset are as follows:

  • 2 continuous attributes

  • 4 integer attributes

  • 8 nominal attributes.

Select the income output attribute, which provides the correct assignment for each pattern, and check Ignore in the Attributes tab.

Then add a Split Data task and split the dataset into a training set (70%) and test set (30%).

Add a Standard Clustering (K-means) task to the process and configure it as follows:

  • Drag and drop the age, education-num, capital-gain and hours-per-week attributes onto the Attributes to consider for clustering list

  • Select Normal in the Normalization for ordered variables drop down list. In this way it is possible to retrieve a correct grouping even when attributes span very different ranges of values.

  • Enter 2 in the Number of clusters to be generated edit box (this is the number of classes in the original classification problem).
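Why normalization matters when attributes span very different ranges can be illustrated with a short sketch. It assumes that the Normal option rescales each ordered attribute to zero mean and unit standard deviation; this is an assumption, and the exact definition is given in the Managing Attribute Properties page. The function name `standardize` is hypothetical and the rows are made-up values, not records of the Adult dataset.

```python
import numpy as np

def standardize(columns):
    """Rescale each column to zero mean and unit standard deviation (assumed 'Normal' behavior)."""
    x = np.asarray(columns, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged instead of dividing by zero
    return (x - x.mean(axis=0)) / std

# Made-up rows: (age, capital-gain)
rows = np.array([[25, 0], [60, 0], [30, 5000], [55, 5000]], dtype=float)

# Raw distances are dominated by capital-gain, whose values are far larger
raw_same_gain = np.linalg.norm(rows[0] - rows[1])   # ages differ by 35
raw_same_age = np.linalg.norm(rows[0] - rows[2])    # ages differ by 5

z = standardize(rows)
z_same_gain = np.linalg.norm(z[0] - z[1])
z_same_age = np.linalg.norm(z[0] - z[2])
```

Before rescaling, the two records with a 35-year age difference look much closer than the two records that differ mainly in capital-gain; after rescaling, both attributes contribute on a comparable scale.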

After clicking Compute process to start the analysis, the properties of the generated clusters can be viewed in the Monitor tab of the Standard Clustering task.

At the end of the process the dispersion coefficients of the clusters are displayed. A similar histogram can be viewed for the number of elements, by opening the corresponding #Elements tab, as shown in the screenshot.

Note that you can stop the process at any point by clicking the Stop computation button in the main toolbar. In this case, the last cluster subdivision is maintained and considered hereinafter.

After the execution we obtain two clusters whose characteristics are displayed in the Clusters panel of the task.

In each row of the spreadsheet the first columns contain the centroids for the clusters. The cluster column contains the progressive index of the cluster, whereas the columns nelem and disp give the number of elements and the dispersion coefficient, respectively.

The last (third) row reports the values characterizing the default cluster, obtained by including in a single group all the elements of the training set.

Clicking on the Results tab displays a summary of the computation performed, with: 

  • the task name and identifier and execution time,

  • some input data quantities,

  • some results of the computation, such as the number of clusters generated and their properties.

Add an Apply Model task to the process to create the index of the cluster to which each pattern in the training and in the test set belongs. This is obtained by finding the nearest centroid (according to the Distance and Normalization options selected in the Options tab of the Standard Clustering task).

Compute the task leaving its default settings. To view the results, right-click the Apply Model task and select Take a look.
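The nearest-centroid assignment described above can be sketched as follows. The sketch assumes a plain Euclidean distance and zero-based cluster indices; the function name `assign_to_nearest` and the sample values are hypothetical, and the actual distance depends on the options selected in the Standard Clustering task.

```python
import numpy as np

def assign_to_nearest(patterns, centroids):
    """Index of the nearest centroid (Euclidean distance) for each pattern."""
    patterns = np.asarray(patterns, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise distances: one row per pattern, one column per centroid
    dists = np.linalg.norm(patterns[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Made-up centroids and new patterns
centroids = [[1.0, 1.0], [10.0, 10.0]]
assigned = assign_to_nearest([[0.0, 0.0], [9.0, 9.0], [2.0, 3.0]], centroids)
```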

32 additional result variables have been added to the dataset as can be seen in the final Data Manager task (dataman2).

The first three result variables concern the cluster associated with the current pattern:

  • The index of the cluster: pred(Output).

  • The confidence of the association between cluster and pattern: conf(Output), given by 1 − 0.5 · d1/d2, where d1 and d2 are the distances from the nearest and the second nearest centroid, respectively. Since d1 ≤ d2, the confidence always lies in the interval [0.5, 1].

  • The row of the Clusters tab in the Standard Clustering task containing the associated cluster: clust(Output).
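The confidence formula above can be checked with a minimal sketch. It assumes Euclidean distances and at least two centroids; the function name `confidence` and the handling of the degenerate case d2 = 0 are hypothetical choices, not documented Rulex behavior.

```python
import numpy as np

def confidence(pattern, centroids):
    """Confidence 1 - 0.5 * d1 / d2 from the two nearest centroid distances."""
    diffs = np.asarray(centroids, dtype=float) - np.asarray(pattern, dtype=float)
    d = np.sort(np.linalg.norm(diffs, axis=1))
    d1, d2 = d[0], d[1]
    if d2 == 0:
        # Pattern coincides with two centroids: treated as a tie (assumption)
        return 0.5
    return 1.0 - 0.5 * d1 / d2

centroids = [[0.0, 0.0], [4.0, 0.0]]
on_centroid = confidence([0.0, 0.0], centroids)   # d1 = 0
halfway = confidence([2.0, 0.0], centroids)       # d1 = d2
in_between = confidence([1.0, 0.0], centroids)    # d1 = 1, d2 = 3
```

A pattern lying exactly on a centroid gets confidence 1, a pattern equidistant from the two nearest centroids gets 0.5, and the intermediate case above gives 1 − 0.5 · 1/3, roughly 0.83.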

The subsequent 14 result variables report the values of the profile attributes for the centroid of the associated cluster: pred(age), pred(workclass), etc.

The remaining 15 result variables concern the error incurred when these values are employed as a forecast for the actual profile attributes of the pattern. In particular, the first of these result variables (error) provides the total error, whereas the others (err(age), err(workclass), etc.) give the error for each attribute.

Corresponding values for the patterns of the test set can be displayed by selecting Test set from the menu on the left.