Cleaning Datasets

The Fill/Clean task improves the quality of datasets by managing missing data in the Fill tab, and automatically removing unnecessary attributes in the Clean tab.

The task allows you to perform operations quickly on a data table, for example by dragging and dropping or filtering attributes to delete. This is particularly helpful when working with large amounts of data. These operations can also be performed in the Data Manager task, attribute by attribute, but it is more time-consuming.

Typical operations performed in this task are:

  • filling in missing values using:

    • fixed values,

    • min/max values, or

    • averages.

  • ignoring attributes:

    • which have too many missing values

    • where the mode ratio is too high (i.e. where a high percentage of values are identical)

    • which have too many different values, for example a different id for every row.

Additional tabs

The following additional tabs are provided:


Prerequisites


Procedure

  1. Drag and drop the Fill/Clean task onto the stage.

  2. Connect the task which contains the data you want to clean to the Fill/Clean task.

  3. Double click the Fill/Clean task. The left-hand pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required. 

  4. Select the attributes that you want to keep in the dataset in the left hand list (keepnames); unselected tasks will be removed from the dataset after computation. Attributes can be checked/unchecked one by one or in groups by pressing the Ctrl or SHIFT key. Right-clicking this panel gives access to some additional options, such as checking/unchecking all attributes, inverting your selection, hiding ignored attributes, etc.

  5. Configure the fill and/or clean options as described in the table below.

  6. Save and compute the task.

Fill/Clean options

Option name

PO

Description

Fill options

Attributes subject to filling operations

missnames

Drag and drop the attributes that can be modified by the selected filling operations.

Attributes to define calculation groups

keynames

Drag and drop here the attributes that define sub-groups which will be used to calculate missing values.

Impute missing values for NOMINAL attributes with:

nomimpute, nomimptype, nomvalue

If selected, you can define how missing nominal attributes will be filled, either by specifying a fixed value (nomimputetype = 0), and manually entering it (nomvalue), or by selecting how the value can be calculated, by calculating the mode of each attribute (nomimputetype = 1) according to the specified calculation groups.

Impute missing values for ORDERED attributes with:

ordimpute, ordimptype, ordvalue

If selected, you can define how missing ordered attributes will be filled, either by specifying a fixed value (ordimptype = 0), and manually entering it (ordvalue), or by selecting how the value can be calculated, by selecting one of the available calculation types (nomimputetype) according to the specified calculation groups.

Possible methods are mean (1), median (2), mode (3), maximum (4) or minimum (5).

Impute missing values for TEMPORAL attributes with:

timeimpute, timeimptype, timevalue

If selected, you can define how missing temporal attributes will be filled, either by specifying a fixed value (timeimptype = 0), and manually entering it (timevalue), or by selecting how the value can be calculated, by selecting one of the available calculation types (timeimputetype) according to the specified calculation groups. 

Possible methods are mean (1), median (2), mode (3), maximum (4) or minimum (5).

Consider the whole row to compute MODE

wholerow

If selected, all attributes, and not just those added to the Key attributes list, will be used to calculate the mode for missing attributes. This applies only to missing value calculations that use the mode method, as selected above in the Impute missing attributes options.

Clean options



Attributes subject to cleaning operations

cleannames

Drag and drop the attributes that can be modified by the selected cleaning operations.

Attributes to be deleted

delnames

Drag and drop attributes you want to delete here.

If you change you mind, just select the attribute and click delete and it will be removed from the edit box and go back to the Available attributes list.

Remove attributes with missing ratio above (%)

missfiltermissperc 

If selected, removes attributes whose percentage of missing values exceeds the percentage specified for this option.

If you want to ignore, but not delete these attributes, select the Ignore attributes instead of deleting them option (selected by default).

Remove attributes with number of mode values above (%)

modefilter, modeperc 

If selected, attributes where the mode ratio exceeds the specified value are removed. This may be useful, for example, where there is a high percentage of values are identical).

If you want to ignore, but not delete these attributes, select the Ignore attributes instead of deleting them option (selected by default).

Remove nominal attributes with number of values above:

nomfilter, maxnomval

If selected, attributes who have a higher number of values than the specified limit are removed.

This limit removes attributes which have, for example, a different id for every row.

If you want to ignore, but not delete these attributes, select the Ignore attributes instead of deleting them option.

Ignore attributes instead of deleting them

setignore

If selected, attributes you chose to remove in the following options are ignored, and consequently not used to create the model, but will remain in the dataset (grayed out):

  • Remove attributes with missing ratio above

  • Remove attributes with number of mode values above

  • Remove nominal attributes with number of values above

Standardize ordered attributes

normalize

If selected, ordered attributes will be normalized using the formula (x-mean)/stdev.

For further information see https://en.wikipedia.org/wiki/Normalization_(statistics) → Standard Score.

Example

The following examples is based on the Northwind dataset.

We have removed some data from the northwind_customers.set data set to create a scenario where we need to clean up an incomplete dataset.

Scenario data can be found in the Datasets folder in your Rulex installation.

The following steps were performed:

  1. First we import the northwind_customers_missing_data.set dataset.

  2. Use the Take a Look functionality to visualize the training set prior to cleaning. 

  3. Fill/Clean task is added to clean up the dataset.

  4. Use the Take a Look functionality to visualize the training set after cleaning.

Procedure

Screenshot

After importing the northwind_customers_missing_data.set dataset with an Import from Text File task, right-click the task and select Take a look to visualize the imported data.

The original dataset is made up of 91 records and 12 attributes concerning customer information. 

In the first Customer No. column we can see straight away that there are missing customer numbers.

There are also many missing Region codes and by hovering over the header we can see that there are also 5 missing values for the Country attribute.

We'll now add a Fill/Clean task to the process to improve this scenario.

Double click on the Fill/Clean task and configure the Fill tab as follows:

  • Attributes subject to filling operationsCountry

  • Attributes to define calculation groupsCity

  • Impute missing values for NOMINAL attributes with: Mode

This means that missing values for the Country attribute will be calculated using the mode of the City attribute values.


Then add the Country No. attribute to the Attributes subject to filling operations edit box to allow the Fill/Clean task to make changes to this attribute too. 

Then specify in the Impute missing values for ORDERED attributes that any missing values should be replaced with the fixed value 0.0.

It is possible to specify a different fill behavior for nominal, ordered and temporal attributes.


In the Clean tab we want to ignore attributes that are not sufficiently represented in the dataset, as they may not provide reliable results.

For example, in our scenario we could define that if less than 50% of the patterns contain a value for the Region attribute we will ignore the attribute in subsequent data modeling tasks.

Select Ignore attributes instead of deleting them to ensure that no attributes are actually removed from the dataset.

Now save and compute the task.

Now right-click the task and select Take a look to check the resulting data.

As you can see the missing values in the Customer No. attribute column have been filled with the number 0, as we specified, to indicate there is no assigned number.

The Region column is grayed-out, indicating that its values will be ignored by subsequent calculations, and the Country attribute column no longer has missing values.