Using Hierarchical Basket Analysis to Solve Association Problems

Hierarchical basket analysis generates association rules from frequent itemsets identified by the Frequent Itemsets Mining task.

Prerequisites

Additional tabs

The following additional tabs are provided:

  • Documentation tab, where you can document your task.

  • Parametric options tab, where you can configure process variables instead of fixed values. Parametric equivalents are expressed in italics on this page (PO).

  • Association Rules and Results tabs, where you can see the output of the task computation. See Results table below.


Procedure

  1. Drag and drop the Hierarchical Basket Analysis task onto the stage.

  2. Connect a Frequent Itemsets Mining task, which contains the frequent itemsets from which you want to extract the associations, to the new task.

  3. Double click the Hierarchical Basket Analysis task. The left-hand pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required.

  4. Click on the Basic tab to configure the basic options, as described in the table below.

  5. Click on the Advanced tab to configure the advanced options, as described in the table below.

  6. Click on the Output tab to configure the output options, as described in the table below.

  7. Save and compute the task. 

Hierarchical Basket Analysis Basic options

Name

PO

Description

Minimum item support (# samples)

supth

All items which appear in orders fewer times than this threshold are discarded.

This option is enabled only if Auto (specify #items) option is not selected.

Auto (specify #items)

mbaspecnitem

If selected, the minimum support for items is computed automatically from the number of items specified in the #Items to consider option.

#Items to consider

mbanitemsup

The number of items to take into account (most frequent first).

This option is enabled only if the Auto (specify #items) option is selected.

Minimum association rule support (# samples)

assupth

All association rules which are verified fewer times than this threshold are discarded.

This option is enabled only if Auto (above average) option is not selected.

Auto (above average)

abavassupth

If selected, the minimum association rule support is set to the average support of rules with the same dimension (i.e. with the same total number of premises and consequences).

Minimum confidence, minimum lift

confth, liftth

The minimum confidence and lift values for association rules.

Minimum Kulczynski index, maximum p-value

liftth

Define values for:

  • The minimum Kulczynski index. This index is defined as the average of two ratios: the support of the considered rule divided by the support of its premise(s), and the support of the considered rule divided by the support of its consequence(s). The Kulczynski index is a null-invariant measure, which means that its value is not affected by the number of transactions that include neither the premise(s) nor the consequence(s) of the considered rule.

  • The maximum p-value for association rules. Each association rule corresponds to a 2x2 contingency table whose rows and columns are defined by the premises and the consequences (each of which can be either true or false, hence the 2x2 table). The p-value is the probability of observing an association at least as strong as the measured one under the null hypothesis that premises and consequences are uncorrelated, so low values indicate a significant rule.

No maximum # of premises/consequences

coherth, mbaboundpval

If selected, no maximum number of premises or consequences can be specified.

Maximum # of premises/consequences

maxdimpremise_opt, maxdimconsequence_opt

The maximum number of premises and consequences of the association rules.

This option is enabled only if the No maximum # of premises/consequences option is not selected.
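
The Kulczynski index described above can be illustrated with a minimal Python sketch. The function names and the counts are hypothetical and are not part of Rulex. The formula follows the definition given in the table, while the lift formula is the standard textbook one, shown only to contrast a measure that does depend on the total number of transactions.

```python
def kulczynski(supp_rule, supp_premise, supp_consequence):
    """Average of supp(rule)/supp(premise) and supp(rule)/supp(consequence)."""
    return 0.5 * (supp_rule / supp_premise + supp_rule / supp_consequence)


def lift(supp_rule, supp_premise, supp_consequence, n_transactions):
    """Standard lift: observed co-occurrence relative to independence."""
    return n_transactions * supp_rule / (supp_premise * supp_consequence)


# Hypothetical counts: the premise appears in 200 orders, the consequence
# in 400 orders, and both appear together in 100 orders.
print(kulczynski(100, 200, 400))    # 0.375

# Adding transactions that contain neither side changes lift,
# but the Kulczynski index does not use the total at all.
print(lift(100, 200, 400, 1000))    # 1.25
print(lift(100, 200, 400, 10000))   # 12.5
```

Because the total number of transactions never enters the Kulczynski computation, the index stays at 0.375 however many unrelated orders are added to the dataset.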

Hierarchical Basket Analysis Advanced options

Attribute to filter to select rows including relevant data


Drag and drop attributes to this edit box from the Available attributes list to specify a filtering criterion.

Items satisfying this criterion are not discarded, regardless of their support. Instead of dragging and dropping attributes manually, you can define them via a filtered list.

Attribute to filter to discard rows including irrelevant data


Drag and drop attributes to this edit box from the Available attributes list to specify a filtering criterion.

Items satisfying this criterion are discarded, regardless of their support. If both the selecting and the discarding filters are specified, the discarding filter prevails. Instead of dragging and dropping attributes manually, you can define them via a filtered list.

Hierarchical Basket Analysis Output options

No maximum # of premises/consequences

fitnomaxdim

If selected, no limit is imposed on the number of premises/consequences in the generated rules.

Minimum number of different attributes involved in each role

mbadiffattr

Specify the minimum number of different attributes that must be included in each role.

Negative Rules (NOT A implies B, A implies NOT B)

mbagennegrules

If selected, negative rules are also generated. Negative rules are rules for which premise(s) or consequence(s) appear in negative form. For instance: A implies NOT B or NOT A implies B.

Maximum Kulczynski value which triggers the check for negative rules

mbaepsilon_opt

A high Kulczynski index identifies a strong and robust correlation between the premises and consequences of a rule. The same index can also be used, from another perspective, to guide the mining of negative rules.

Consequently, if the Kulczynski index is low (up to the specified maximum value), the task evaluates whether the considered rule becomes strong when expressed in negative form (for instance, when negating the premise).

Maximum # of premises/consequences, negative rules

maxnegdimconsequence, 
maxnegdimpremise

The maximum number of premises and consequences of the negative association rules.

This option is enabled only if the Negative Rules (NOT A implies B, A implies NOT B) option is selected.
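
The check described for the Maximum Kulczynski value option can be sketched as follows. This is only an illustration of the idea, not the Rulex implementation: the function names, the parameter epsilon, and the counts are hypothetical, and the sketch assumes that the supports of the negated forms are derived from the total number of orders.

```python
def kulczynski(supp_rule, supp_premise, supp_consequence):
    """Average of supp(rule)/supp(premise) and supp(rule)/supp(consequence)."""
    return 0.5 * (supp_rule / supp_premise + supp_rule / supp_consequence)


def negative_rule_scores(n, supp_a, supp_b, supp_ab, epsilon):
    """Score the negative forms of A -> B only when the positive rule is weak.

    n is the total number of orders; epsilon is the maximum Kulczynski
    value that triggers the check (hypothetical parameter name).
    """
    if kulczynski(supp_ab, supp_a, supp_b) > epsilon:
        return {}  # the positive rule is strong enough, no negative check

    scores = {}
    supp_not_a = n - supp_a            # orders without A
    supp_not_b = n - supp_b            # orders without B
    supp_not_a_b = supp_b - supp_ab    # orders with B but without A
    supp_a_not_b = supp_a - supp_ab    # orders with A but without B
    if supp_not_a > 0 and supp_b > 0:
        scores["NOT A implies B"] = kulczynski(supp_not_a_b, supp_not_a, supp_b)
    if supp_a > 0 and supp_not_b > 0:
        scores["A implies NOT B"] = kulczynski(supp_a_not_b, supp_a, supp_not_b)
    return scores


# Hypothetical counts: A and B rarely appear together.
for rule, score in negative_rule_scores(1000, 300, 400, 20, epsilon=0.1).items():
    print(rule, round(score, 3))
```

With these counts the positive rule scores about 0.058, below the threshold, while both negative forms score 0.7 or higher, so they would be worth reporting.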

Results

The results of the task can be viewed in two separate tabs:

  • The Association rules tab displays the generated association rules, where:

    • Rule ID is the association rule ID.

    • Positive/negative premise(s) distinguishes between positive and negative premises. Negative premises only appear if the Negative Rules option is selected. If negative premises are listed in the current row, NOT is printed in this column; otherwise nothing is printed.

    • Premise Item ID contains the Item ID(s) of premises.

    • Positive/negative consequence(s) distinguishes between positive and negative consequences. Negative consequences only appear if the Negative Rules option is selected. If negative consequences are listed in the current row, NOT is printed in this column; otherwise nothing is printed.

    • Consequence Item ID contains the Item ID(s) of consequences.

    • Support premise(s) is the percentage of orders in which premise(s) appear in the dataset. 

    • Support # premise(s) is the number of times in which premise(s) appear in the dataset.

    • Support consequence(s) is the number of times consequence(s) appear in the dataset.

    • Support shows the relevance of the considered rule, i.e. it counts how many transactions include both all premises and all consequences, and is expressed as a percentage of the total number of orders.

    • Support # shows the relevance of the considered rule, i.e. it counts how many transactions include both all premises and all consequences, and is expressed in absolute terms.

    • Confidence measures the reliability of the considered association rule. More specifically, it measures how often all the items in the consequence are bought when all the items in the premise are bought. Confidence values range between 0 and 1.

    • Lift represents a relative measure of interdependence between premises and consequences. If consequences are independent from premises, lift is equal to 1. Consequently, if the lift is greater than 1 there is a direct correlation between item purchases, while a lift lower than 1 is an indicator of inverse correlation.

    • Cosine is a normalized interdependence measure, ranging between 0 and 1. The greater the cosine score, the stronger the interdependence between premise(s) and consequence(s).

    • Conviction represents a specificity measure, proportional to confidence and inversely proportional to support. The conviction value increases for reliable and rare associations and tends to infinity if confidence is maximum (i.e. equal to 1).

    • Leverage represents an absolute measure of interdependence between premise(s) and consequence(s). If consequence(s) are independent from premise(s), leverage is equal to 0.

    • Chi-square reports the value of the Chi-square test. If missing, the contingency table associated with the rule does not allow a reliable p-value estimate through the Chi-square test. In these cases, Fisher’s exact test is preferred and its p-value estimate is upper-bounded as shown in [2].

    • p-value refers to the null hypothesis associated with the rule (i.e. no relationship between premise and consequence): it is the probability of observing an association at least as strong as the measured one if the null hypothesis were true.

  • The Results tab, where details on the execution of the analysis are displayed:

    • Task identifier is the ID code for the task, internally used by the Rulex engine.

    • Task name is the name of the task.

    • Elapsed time is the time required for the latest computation (in seconds).

    • Minimum # support threshold for items is the minimum threshold for items applied during latest computation, in absolute terms.

    • Minimum support threshold for items (percentage) is the minimum threshold for items applied during latest computation as a percentage.

    • Number of different items in input is the number of distinct items which were fed to the task during latest computation.

    • Number of different orders in input is the number of distinct orders which were fed to the task during latest computation.

    • Number of generated association rules is the number of association rules displayed in the Association Rules tab.
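
The measures listed for the Association rules tab can be reproduced by hand from four counts. The sketch below uses the standard textbook definitions, which are assumed here and may differ in detail from what Rulex computes; the function name and the counts are hypothetical. The Chi-square p-value uses the closed form for one degree of freedom and assumes that all marginal counts are non-zero.

```python
import math


def rule_measures(n, supp_a, supp_b, supp_ab):
    """Textbook measures for the rule A -> B from absolute counts.

    n: total orders, supp_a / supp_b: orders containing the premise /
    the consequence, supp_ab: orders containing both.
    """
    p_a, p_b, p_ab = supp_a / n, supp_b / n, supp_ab / n
    confidence = p_ab / p_a
    measures = {
        "support": p_ab,
        "confidence": confidence,
        "lift": p_ab / (p_a * p_b),
        "cosine": p_ab / math.sqrt(p_a * p_b),
        "leverage": p_ab - p_a * p_b,
        # Conviction tends to infinity when confidence reaches 1.
        "conviction": math.inf if confidence == 1 else (1 - p_b) / (1 - confidence),
    }

    # 2x2 contingency table: rows = premise true/false,
    # columns = consequence true/false.
    observed = [[supp_ab, supp_a - supp_ab],
                [supp_b - supp_ab, n - supp_a - supp_b + supp_ab]]
    rows = [supp_a, n - supp_a]
    cols = [supp_b, n - supp_b]
    chi2 = sum((observed[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    measures["chi_square"] = chi2
    # Survival function of the Chi-square distribution with 1 degree of freedom.
    measures["p_value"] = math.erfc(math.sqrt(chi2 / 2))
    return measures


# Hypothetical counts: 1000 orders, premise in 200, consequence in 400, both in 100.
for name, value in rule_measures(1000, 200, 400, 100).items():
    print(f"{name}: {value:.4f}")
```

For these counts, lift is 1.25 and leverage is 0.02, both indicating a positive dependence, and the p-value is well below 0.01.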

Example

The Groceries scenario follows on from the Frequent Itemsets Mining example.

In the example process, frequent itemsets are extracted from an imported dataset, from which association rules are generated and analyzed. The Groceries dataset, used in the scenario, contains 9835 supermarket transactions in separate rows.

Scenario data can be found in the Datasets folder in your Rulex installation.

The following steps were performed:

  1. First we import the Groceries dataset.

  2. The data is prepared in the Data Manager task.

  3. The dataset is restructured in the Reshape To Long task.

  4. Frequent itemsets are extracted with the Frequent Itemsets Mining task.

  5. The Hierarchical Basket Analysis task is connected to the Frequent Itemsets Mining task to extract association rules.

  6. The association rules are imported via the Import From Task task.

  7. The association rules are analyzed in the Data Manager.

Procedure



We'll pick up the example here from when we add the Hierarchical Basket Analysis task to the process.

Set the following options:

  • #Items to consider to 50

  • Minimum confidence to 0.2

  • Maximum # of premises to 2

Save and execute the task.



Association rules are stored in the Association Rules tab. Each association rule will be characterized by premise(s) and consequence(s). If, for instance, a rule includes tropical fruit as a premise and citrus fruit as a consequence, it means that if a transaction includes a tropical fruit, it is also likely to include a citrus fruit.

Different indicators qualify and quantify the strength of this cross-selling relationship. To view which rules have the highest confidence, right-click on the Confidence column in the Association Rules tab and select Sort Descending.

We can now perform a few further steps in order to analyze the extracted rules in further detail, and perform filtering and statistical operations on the rules.

As the most reliable rules may have low support, you could try repeating the analysis after setting the Minimum association rule support to 20, and then check which rule has the highest lift.


To work on the rules themselves rather than on the input dataset, import only the rules by adding an Import From Task task to the process.

Double-click the task and select:

  • Process: the name of the process you are working in

  • Task: the name of the Hierarchical Basket Analysis task, Hba1 in our scenario.

  • Import dataset from: Association rules

  • Structures to be imported from target task: Association rules

Save and compute the task.

Add a Data Manager task to the Import From Task task to analyze the rules by filtering them in the Query Manager pane.

Alternatively, you can right-click the Import From Task task and select Take a Look to see the same information in read-only mode.

For example, you could filter the rules whose Lift attribute is higher than 1.5.

              

Alternatively, you could compute min/max or average values in the Statistic Manager, or, for example, apply the Variance option from the univariate statistics to the Confidence attribute.
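
If the imported rules are exported for analysis outside Rulex, the same filtering and statistics can be reproduced with a few lines of Python. The sketch below is only an illustration: the rule values are made up rather than taken from the Groceries results, and the sample variance is assumed, which may not match the estimator used by the Statistic Manager.

```python
import statistics

# Hypothetical rules, NOT actual output of the Groceries scenario.
rules = [
    {"premise": "tropical fruit", "consequence": "citrus fruit",
     "confidence": 0.30, "lift": 2.1},
    {"premise": "yogurt", "consequence": "whole milk",
     "confidence": 0.40, "lift": 1.6},
    {"premise": "rolls/buns", "consequence": "soda",
     "confidence": 0.20, "lift": 1.1},
]

# Keep only rules with lift above 1.5, highest confidence first.
strong = sorted((r for r in rules if r["lift"] > 1.5),
                key=lambda r: r["confidence"], reverse=True)
for r in strong:
    print(f'{r["premise"]} -> {r["consequence"]} '
          f'(confidence {r["confidence"]}, lift {r["lift"]})')

# Univariate statistics on the Confidence attribute.
confidences = [r["confidence"] for r in rules]
print("min:", min(confidences), "max:", max(confidences))
print("mean:", statistics.mean(confidences))
print("sample variance:", statistics.variance(confidences))
```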