Discretizing Data

Discretization transforms continuous data by defining a set of cutoffs that subdivide a continuous domain into a finite set of homogeneous intervals.

The points in each interval should have a high probability of belonging to the same class. These intervals increase the effectiveness of data in the creation of predictive models.


Prerequisites

Additional tabs

The following additional tabs are provided:


Procedure

  1. Drag and drop the Discretize task onto the stage.

  2. Connect a task that contains the attributes you want to transform to the Discretize task.

  3. Double-click the Discretize task. The left-hand side of the pane lists all the attributes available in the dataset, which can be ordered and searched as required.

  4. Configure the options, as described in the table below.

  5. Save and compute the task.

Discretize options

Name

PO

Description

Use previous cutoffs to discretize data

useprevious

If selected, the cutoffs defined in an upstream Discretize task will be used to discretize the new data, instead of defining new cutoffs. This is useful when you want data to be discretized in the same way at various points of the workflow.
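The idea of reusing cutoffs can be illustrated outside Rulex with a minimal Python sketch. The function name `apply_cutoffs`, the cutoff values and the rule that a value equal to a cutoff falls into the upper interval are all assumptions made for illustration, not a description of how the Discretize task is implemented.

```python
from bisect import bisect_right


def apply_cutoffs(values, cutoffs):
    """Return, for each value, the index of the interval it falls in.

    `cutoffs` must be sorted in ascending order; k cutoffs define k + 1
    intervals. A value equal to a cutoff is placed in the upper interval
    (an assumption made for this sketch).
    """
    return [bisect_right(cutoffs, v) for v in values]


# Hypothetical cutoffs, computed once on the training data.
cutoffs = [25.5, 40.5, 60.5]

# The same cutoffs are applied to the training data and to new data,
# so both are discretized in exactly the same way.
train_bins = apply_cutoffs([18, 30, 45, 70], cutoffs)
new_bins = apply_cutoffs([22, 41, 61], cutoffs)
```

Because the cutoffs are computed once and only applied afterwards, a given value always maps to the same interval, whichever dataset it comes from.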

Method for discretization

inpdisctype

Select the method you want to use from the Method for discretization drop-down list. Possible values are:

  • Attribute Driven Incremental Discretization (0) (default choice): it is a top-down method that recursively adds separation points (cutoffs) for each discrete or continuous attribute. The method is designed to obtain a complete separation of the points of the training set, i.e. the discretization process must not generate ambiguities. This method is supervised and requires an output attribute.

  • Entropy (1): this top-down method recursively adds cutoffs according to a measure, based on entropy, of the information gain achieved by splitting an interval in two. This method is supervised and requires an output attribute.

  • ChiMerge (2): a bottom-up, chi-square-based technique that iteratively merges adjacent intervals according to a statistical measure of their similarity. This method is supervised and requires an output attribute.

  • Equal width (3): creates intervals of the same amplitude regardless of the output value. This method is unsupervised, and does not require an output value.

  • Equal frequency (4): creates intervals containing the same number of patterns regardless of the output value. This method is unsupervised, and does not require an output value.

  • Roc Curve (5): uses the ROC curve to find the best cutoff. This method is supervised and requires an output attribute.
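The two unsupervised methods in the list above are simple enough to sketch in a few lines of Python. This is an illustration of the general techniques only: the function names are hypothetical, and details such as placing a cutoff at the midpoint between two neighbouring values, or the handling of ties, are assumptions rather than documented Rulex behavior.

```python
def equal_width_cutoffs(values, n_intervals):
    """Cutoffs that split [min, max] into intervals of the same amplitude."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + i * width for i in range(1, n_intervals)]


def equal_frequency_cutoffs(values, n_intervals):
    """Cutoffs that split the sorted values into groups of (nearly) equal size.

    Each cutoff is placed halfway between the last value of one group and
    the first value of the next (an assumption). Repeated values are not
    treated specially, and len(values) must be at least n_intervals.
    """
    ordered = sorted(values)
    n = len(ordered)
    cutoffs = []
    for i in range(1, n_intervals):
        idx = i * n // n_intervals
        cutoffs.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cutoffs


# Synthetic, skewed values chosen to show how the two methods differ.
values = [1, 2, 3, 4, 10, 20, 30, 100]
width_cutoffs = equal_width_cutoffs(values, 4)
freq_cutoffs = equal_frequency_cutoffs(values, 4)
```

On skewed data such as this, equal width places most values in the first interval, whereas equal frequency keeps the groups balanced at the price of intervals with very different amplitudes.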

The Attribute Driven Incremental Discretization method usually achieves the best performance, but may be quite time-consuming on large training sets. The Entropy method is usually faster, but may generate some ambiguities and thus compromise the accuracy of any subsequent analysis.
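To make the entropy-based criterion more concrete, the following Python sketch evaluates the information gain of a single candidate cutoff and selects the best one. It covers one split only: the recursion and the stopping rule used by the Entropy method are not described here and are not reproduced. All function names are hypothetical, and taking the midpoints between consecutive distinct values as candidates is an assumption.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, cutoff):
    """Entropy reduction obtained by splitting the data at `cutoff`."""
    left = [y for x, y in zip(values, labels) if x <= cutoff]
    right = [y for x, y in zip(values, labels) if x > cutoff]
    if not left or not right:
        return 0.0  # the cutoff does not separate anything
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted


def best_cutoff(values, labels):
    """Candidate cutoff with the highest information gain.

    Requires at least two distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda c: information_gain(values, labels, c))


# Synthetic data: two classes, perfectly separable between 3 and 10.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
cutoff = best_cutoff(values, labels)
gain = information_gain(values, labels, cutoff)
```

A recursive method would then apply the same search to each of the two resulting intervals until a stopping condition is met.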

Minimum distance between different classes

minwidth

Specifies the minimum distance that must be kept between two patterns of different classes, expressed as a percentage of the total number of attributes. This distance is computed as the number of attributes whose values are different in the two patterns. The minimum and default distance is one. If you select 100%, all the attributes of each pair of patterns belonging to different classes must differ. This is not always possible, since many attributes can have the same value in the starting data; in this case the method uses the available separations.
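The distance described above, i.e. the number of attributes whose values differ between two patterns, can be sketched as follows. The function names are hypothetical, and the way the percentage is converted into a number of attributes (rounding up, with a floor of one) is an assumption made for illustration, not documented behavior.

```python
from math import ceil


def pattern_distance(a, b):
    """Number of attributes whose values differ between two patterns."""
    return sum(1 for x, y in zip(a, b) if x != y)


def required_distance(n_attributes, percentage):
    """Convert a percentage of the attributes into a number of attributes.

    Rounding up and enforcing a minimum of one are assumptions.
    """
    return max(1, ceil(n_attributes * percentage / 100))


# Two discretized patterns of different classes (synthetic values).
p1 = (1, "x", 3)
p2 = (1, "y", 4)
distance = pattern_distance(p1, p2)                      # differs on two attributes
separated = distance >= required_distance(len(p1), 100)  # 100% needs all three
```

In this example the two patterns differ on two attributes out of three, so they would satisfy a requirement of one or two differing attributes, but not the 100% setting.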

Number of patterns used for discretization

numdisc

Specifies how many patterns will be used. This option allows you to use only a randomly selected subset of the training set, which is particularly useful when there is a large amount of data, as a high number of patterns considerably slows down the discretization process. The default value of -1 means that all patterns will be used.

Number of values for ordered variables

ninpval

Specifies the number of cutoffs to be inserted for each variable, which must not exceed the number of values available in the training set. The number of cutoffs must at least ensure that the minimum distance between different classes can be guaranteed.

Preselect best cutoffs

preselcut

If selected, the most promising cutoffs will be selected and employed in the subsequent phase. This consequently reduces the number of possible cutoffs to be analyzed. This works particularly well coupled with the Attribute Driven Incremental Discretization method.

Aggregate data before processing

aggregate

If selected, identical patterns will be aggregated and considered as a single pattern during the discretization phase.
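Aggregation of identical patterns can be pictured with a short Python sketch. The function name is hypothetical, and keeping the number of occurrences next to each distinct pattern is an assumption added for illustration; the description above only states that identical patterns are considered as a single pattern.

```python
from collections import Counter


def aggregate_patterns(rows):
    """Collapse identical rows into (pattern, occurrences) pairs.

    The order of first appearance is preserved.
    """
    counts = Counter(tuple(row) for row in rows)
    return list(counts.items())


# Synthetic rows: the first and third patterns are identical.
rows = [(39, "Bachelors"), (50, "HS-grad"), (39, "Bachelors")]
aggregated = aggregate_patterns(rows)
```

Working on the distinct patterns only reduces the amount of data the discretization phase has to examine whenever the dataset contains many repeated rows.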

Output attribute

outdiscname

Select the output attribute to be used for discretization from the drop-down list. Output attributes are mandatory for supervised methods.

Discretize output

outdisc

If selected, the output attribute will be discretized. This option is available if you selected a discrete (e.g. integer) or continuous output attribute. You can then specify:

  • the discretization method, in the Discretization method for output option. It can be either Equal Frequency, to create intervals that contain the same number of patterns (up to border effects), or Equal Width, to create intervals of the same amplitude.

  • the number of cutoffs to be created for the output, in the Number of cutoffs for output option. The default is 10, whereas 0 means that all possible cutoffs are inserted.

Discretization method for output

outdisctype

Select the method you want to use to discretize the output. This option is available only if you have selected the Discretize output option.

Possible methods are:

  • Equal Frequency (1), to create intervals that contain the same number of patterns (up to border effects).

  • Equal Width (0), to create intervals of the same amplitude.

Number of cutoffs for output

noutval

Select the number of cutoffs to be created when discretizing output values. The default is 10, whereas 0 means that all possible cutoffs are inserted.

This option is available only if you have selected the Discretize output option.

Attributes to discretize

discnames

Drag and drop the ordered attributes you want to transform from the Available attributes list.

Results

The results of the Discretize task can be viewed in two separate tabs.

  • Monitor: this tab displays the distribution of the number of generated cutoffs in the form of histograms during the execution of the Discretize operation. These plots are also available at the end of the computation.

  • Results: this tab displays summary information on the performed computation, such as the execution time, number of cutoffs etc.


Example

The following examples are based on the Adult dataset.

Scenario data can be found in the Datasets folder in your Rulex installation.

In the example process, discretization is performed on data deriving from the source as follows:

The following steps were performed:

  1. Import the dataset.

  2. Use the Take a look functionality to visualize the original dataset.

  3. Add a Discretize task to define the cutoffs.

  4. Use the Take a look functionality to visualize the data after discretization.

Procedure


After importing the adult.set dataset with an Import from Text File task, right-click the task and select Take a look to visualize the imported data.

The original dataset is made up of 32561 records, and the age attribute includes almost all the possible integer values between 17 and 90.

We want to group all these possible values into 5 groups of equal frequency.

Add a Discretize task to the process and specify the following:

  • Method for discretization: Equal frequency

  • Number of values for ordered variables: 5 (in order to create 5 separate groups)

  • Attributes to discretize: age

After running the discretization, the Monitor tab displays a histogram that reports the distribution of the number of cutoffs for each variable.

The Results tab summarizes information on the computation just performed.

Right-click the Discretize task and select Take a look to check the results.

You can see straight away that the age values have been grouped, as there are fewer possible values.

To examine in more detail how the Discretize task has grouped the values, drag and drop the age attribute onto the Var1 column of the Statistic Manager, and select Values, frequencies and quantiles as the type of statistics.

Here you can see the five groups that have been created, with their assigned average values, and the number of rows belonging to each group.