Finding and Replacing Values in Datasets

Sometimes it may be necessary to find and replace values that create a specific undesired outcome.

For example, imagine you have a dataset with information on parcels that need to be posted, and any parcels that weigh more than 4kg must be sent by courier. If this rule is not respected, an "overweight, cannot be sent" outcome is produced. When applied to a new dataset, the Rule Based Control task corrects these errors in the dataset, consequently modifying the way the parcel will be sent.

The task performs this operation according to the weights specified for each outcome class, and modifying the attributes that have been specified as modifiable according to selected correction methods, such as by calculating the overall mean or median or the mean or median for specified key attributes.

This task was previously called Find/Replace.


Prerequisites

Procedure

  1. Drag and drop the Rule Based Control task onto the process.

  2. Connect a task, which contains the ruleset and data you want to modify, to the new task.

  3. Double click the Rule Based Control task. The left-hand pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required.

  4. Configure the options described in the table below.

  5. Save and compute the task.

Rule Based Control options

Parameter Name

Description

Key attributes

Drag and drop here the attributes for which calculations will always be performed. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

Attributes that cannot be modified

Drag and drop here any attributes that must not be changed. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

The attributes specified here cannot be modified, even if they are added to the Attributes to be corrected list.

Any attributes included in the ruleset can be modified, unless added here, even if they are added to the Attributes to be corrected list.

These attributes include those NOT covered in the ruleset.

Attributes to be corrected

Drag and drop here any attributes that can be modified. 

Any attributes included in the ruleset can be modified, unless included in the cannot be modified list. These attributes include those NOT covered in the ruleset.

Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

Undesired output weight

Specify a weight for each outcome class. Assign a higher value to the outcome you want most to avoid.

Correct values for ORDERED attributes with

Select how you want to correct ordered values that have caused an undesired outcome class. Calculations can be performed on all values, or values for specific attributes, if these attributes have been dragged and dropped into the Key attributes list.

Possible values are: Mean, Median, Mode, Minimum, Maximum, Minimum change (ordered attributes only).

Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

Correct values for TEMPORAL attributes with

Select how you want to correct temporal values that have caused an undesired outcome class. Calculations can be performed on all values, or values for specific attributes, if these attributes have been dragged and dropped into the Key attributes list.

Possible values are: Mean, Median, Mode, Minimum, Maximum, Minimum change (ordered attributes only).

Instead of manually dragging and dropping attributes, they can be defined via a filtered list.

Append results

If selected, the results of this computation will be added to the dataset. Otherwise they replace the results of previous computations.

Example

In this very basic example shipping data is corrected to ensure that parcels heavier than 4kg, that should be sent by courier, are not mistakenly shipped by mail, incurring an excess weight charge.

The shipping excel file contains the following data:

Scenario data can be found in the Datasets folder in your Rulex installation.

We have created the following Rulex process with:

  1. An Import from Excel File (excel1) task with the excel data file above.

  2. A Data Manager (dataman1) task to define the roles of the attributes, create new attributes and ignore irrelevant attributes.

  3. A LLM (llm1) task to generate rules.

  4. A Rule Manager (ruleman1) task to analyze the generated rules.

  5. A Rule Based Control (fr1) task to substitute the attributes that lead to the undesired output class.

  6. A final Data Manager (dataman2) task to check the results of the correction.

Description

Screenshot

Import the shipping.xlsx file via an Import from Text File task.

Add a Data Manager task to the process, and configure it as follows:

  • The Parcel ID, Destination and Order Date attributes are ignored, as they should not be used to create the model.

  • A new Overweight attribute is created, to clearly show when a parcel exceeds 4 kg, with the following formula and its type is then set to NOMINAL:

   $"Overweight" = $"Weight in kg">4
  • A new Excess Weight attribute is created, to establish whether the shipping mode used was correct or incorrect for the parcel, using the following formula:

   $"Excess Weight" = ifelse($"Overweight"=="True" 
   &$"Shipping mode"=="post", "incorrect", "correct")

Add a Logic Learning Machine task, and define the Excess Weight attribute as the output by dragging and dropping it onto Output attributes.


Add a Rule Manager task to the process to view the generated rules.

Three rules have been generated.

Add a Rule Based Control task to the process, and configure it as follows:

  • The weight in kg attribute is added to the Attributes that cannot be modified column, as although the shipping mode can be changed, the weight of the parcel presumably cannot.

  • The weight of the output is modified, to 2 for incorrect values, as these are the most important output values.

To view the changes made by the Rule Based Control task, right-click the task in the stage and select Take a look.

The incorrect shipping mode for a 7kg parcel has correctly been changed from post to courier, using Rule 3.