An Introduction to Association Rule Learning

What is Association Rule Learning? It is a rule-based machine learning technique used to find patterns in data.

Cem Bıkmaz
Python in Plain English


If there are significant relationships between items that appear together frequently, the association rules method expresses those relationships as rules in the clearest possible way. In other words, it is a rule-based machine learning technique used to find patterns in data.

E-commerce sites and social media channels face a very big problem: they store hundreds of thousands of pieces of content in their databases, and we cannot show all of it to the user. We need content filtering methods. When we watch or like a video, we enter a certain flow; the system extracts the essence of that flow and personalizes the experience for us. Our purpose in these systems is to filter the content.

Apriori Algorithm

Apriori is a market basket analysis method used to reveal product associations.

There are three simple formulas behind the method.

Support(X, Y) = Freq(X, Y) / N

The first is the Support value. It expresses the probability that X and Y occur together: the frequency of X and Y appearing together, divided by the total number of transactions N.

Confidence(X, Y) = Freq(X, Y) / Freq(X)

Confidence expresses the probability of purchasing product Y when product X is purchased: the frequency at which X and Y appear together, divided by the frequency at which X appears.

Lift = Support(X, Y) / (Support(X) * Support(Y))

When X is purchased, the probability of buying Y increases by a factor of lift. It is the probability of X and Y appearing together, divided by the product of the probabilities of X and Y appearing separately. In other words, it tells us how many times more likely one product becomes when we buy the other.
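To make these three formulas concrete, here is a minimal sketch on a toy basket list (my own illustrative data, not the article's dataset):

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
N = len(baskets)

freq_bread = sum("bread" in b for b in baskets)           # 4
freq_milk = sum("milk" in b for b in baskets)             # 4
freq_both = sum({"bread", "milk"} <= b for b in baskets)  # 3

support = freq_both / N                                   # 3/5 = 0.60
confidence = freq_both / freq_bread                       # 3/4 = 0.75
lift = support / ((freq_bread / N) * (freq_milk / N))     # 0.60 / 0.64 ≈ 0.94

Here the lift lands just below 1, which means that in this toy data buying bread makes milk very slightly less likely, not more.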

We aim to suggest products to users during the purchasing process by applying association analysis to the Online Retail II dataset.

Data Pre-processing

!pip install mlxtend
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
# Keep wide DataFrames on a single line instead of wrapping.
pd.set_option('display.expand_frame_repr', False)

df_ = pd.read_excel("datasets/online_retail_II.xlsx", sheet_name = "Year 2010-2011")
df = df_.copy()
df.info()
# Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 Invoice 541910 non-null object
# 1 StockCode 541910 non-null object
# 2 Description 540456 non-null object
# 3 Quantity 541910 non-null int64
# 4 InvoiceDate 541910 non-null datetime64[ns]
# 5 Price 541910 non-null float64
# 6 Customer ID 406830 non-null float64
# 7 Country 541910 non-null object
df.head()
# Invoice StockCode Description Quantity InvoiceDate Price Customer ID Country
# 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
# 1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
# 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
# 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
# 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom

This function determines outlier thresholds for a variable, using the 1st and 99th percentiles rather than the usual quartiles.

def outlier_thresholds(dataframe, variable):
    # 1st and 99th percentiles keep the trimming gentle.
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit
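A hypothetical call, just to show the return shape (not in the original article):

# Inspect the computed limits before capping anything.
low, up = outlier_thresholds(df, "Quantity")
print(low, up)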

This function caps outliers at those thresholds: values below the lower limit are replaced with the lower limit, and values above the upper limit with the upper limit.

def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

In this function, we first drop missing values, then remove invoices whose number contains 'C', which marks cancelled (returned) orders. For a total price to make sense, the Quantity and Price variables must be greater than zero. We finish by capping outliers in Quantity and Price with the two functions above.

def retail_data_prep(dataframe):
    dataframe.dropna(inplace=True)
    # Invoices containing "C" are cancellations (returns); drop them.
    dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]
    dataframe = dataframe[dataframe["Quantity"] > 0]
    dataframe = dataframe[dataframe["Price"] > 0]
    replace_with_thresholds(dataframe, "Quantity")
    replace_with_thresholds(dataframe, "Price")
    return dataframe

df = retail_data_prep(df)
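As a quick sanity check (my addition, not in the original flow), we can verify that the cleaning did what we intended:

# Expect False: no cancelled ("C") invoices should remain.
print(df["Invoice"].astype(str).str.contains("C").any())
# Expect True True: all quantities and prices are positive.
print((df["Quantity"] > 0).all(), (df["Price"] > 0).all())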

Preparing the ARL Data Structure (Invoice-Product Matrix)

First, we put the invoices on the rows, because each invoice will be our basket.

df_gr = df[df['Country'] == 'Germany']
df_gr.head()
# Invoice StockCode Description Quantity InvoiceDate Price Customer ID Country
# 1109 536527 22809 SET OF 6 T-LIGHTS SANTA 6.0 2010-12-01 13:04:00 2.95 12662.0 Germany
# 1110 536527 84347 ROTATING SILVER ANGELS T-LIGHT HLDR 6.0 2010-12-01 13:04:00 2.55 12662.0 Germany
# 1111 536527 84945 MULTI COLOUR SILVER T-LIGHT HOLDER 12.0 2010-12-01 13:04:00 0.85 12662.0 Germany
# 1112 536527 22242 5 HOOK HANGER MAGIC TOADSTOOL 12.0 2010-12-01 13:04:00 1.65 12662.0 Germany
# 1113 536527 22244 3 HOOK HANGER MAGIC GARDEN 12.0 2010-12-01 13:04:00 1.95 12662.0 Germany

We group by Invoice and Description and sum the Quantity column. This tells us how many units of each product appear on each invoice.

df_gr.groupby(['Invoice', 'Description']).agg({"Quantity": "sum"}).head(20)
# Invoice Description
# 536527 3 HOOK HANGER MAGIC GARDEN 12.0
# 5 HOOK HANGER MAGIC TOADSTOOL 12.0
# 5 HOOK HANGER RED MAGIC TOADSTOOL 12.0
# ASSORTED COLOUR LIZARD SUCTION HOOK 24.0
# CHILDREN'S CIRCUS PARADE MUG 12.0
# HOMEMADE JAM SCENTED CANDLES 12.0
# HOT WATER BOTTLE BABUSHKA 4.0
# JUMBO BAG OWLS 10.0
# JUMBO BAG WOODLAND ANIMALS 10.0
# MULTI COLOUR SILVER T-LIGHT HOLDER 12.0
# PACK 3 FIRE ENGINE/CAR PATCHES 12.0
# PICTURE DOMINOES 12.0
# POSTAGE 1.0
# ROTATING SILVER ANGELS T-LIGHT HLDR 6.0
# SET OF 6 T-LIGHTS SANTA 6.0
# 536840 6 RIBBONS RUSTIC CHARM 12.0
# 60 CAKE CASES VINTAGE CHRISTMAS 24.0
# 60 TEATIME FAIRY CAKE CASES 24.0
# CAKE STAND WHITE TWO TIER LACE 2.0
# JAM JAR WITH GREEN LID 12.0

We use unstack to pivot the multi-index: the Descriptions move to the columns and each row becomes an invoice. We use iloc to show the first 5 rows and columns. If a product is on an invoice, its cell holds the quantity; if a product is not in that basket (invoice), the cell is NaN.

df_gr.groupby(['Invoice', 'Description']).agg({"Quantity": "sum"}).unstack().iloc[0:5, 0:5]
# Description  50'S CHRISTMAS GIFT BAG LARGE  DOLLY GIRL BEAKER  I LOVE LONDON MINI BACKPACK  RED SPOT GIFT BAG LARGE  SET 2 TEA TOWELS I LOVE LONDON
# Invoice
# 536527 NaN NaN NaN NaN NaN
# 536840 NaN NaN NaN NaN NaN
# 536861 NaN NaN NaN NaN NaN
# 536967 NaN NaN NaN NaN NaN
# 536983 NaN NaN NaN NaN NaN

We need a one-hot-encoded version, so we write 0 wherever the value is NaN.

df_gr.groupby(['Invoice', 'Description']).agg({"Quantity": "sum"}).unstack().fillna(0).iloc[0:5, 0:5]
# Description  50'S CHRISTMAS GIFT BAG LARGE  DOLLY GIRL BEAKER  I LOVE LONDON MINI BACKPACK  RED SPOT GIFT BAG LARGE  SET 2 TEA TOWELS I LOVE LONDON
# Invoice
# 536527 0.0 0.0 0.0 0.0 0.0
# 536840 0.0 0.0 0.0 0.0 0.0
# 536861 0.0 0.0 0.0 0.0 0.0
# 536967 0.0 0.0 0.0 0.0 0.0
# 536983 0.0 0.0 0.0 0.0 0.0

Now we do something slightly different from the last step. We write 1 if a product's quantity on an invoice is greater than 0, and 0 otherwise. apply operates on rows or columns; applymap visits every cell, which is exactly what we need here.

df_gr.groupby(['Invoice', 'Description']).agg({"Quantity": "sum"}).unstack().fillna(0).applymap(lambda x: 1 if x > 0 else 0).iloc[0:5, 0:5]
# Description  50'S CHRISTMAS GIFT BAG LARGE  DOLLY GIRL BEAKER  I LOVE LONDON MINI BACKPACK  RED SPOT GIFT BAG LARGE  SET 2 TEA TOWELS I LOVE LONDON
# Invoice
# 536527 0 0 0 0 0
# 536840 0 0 0 0 0
# 536861 0 0 0 0 0
# 536967 0 0 0 0 0
# 536983 0 0 0 0 0

We create a function called create_invoice_product_df. If we pass id=True, it builds the same matrix as above using StockCode; if id=False, it uses Description.

def create_invoice_product_df(dataframe, id=False):
    if id:
        return dataframe.groupby(['Invoice', "StockCode"])['Quantity'].sum().unstack().fillna(0). \
            applymap(lambda x: 1 if x > 0 else 0)
    else:
        return dataframe.groupby(['Invoice', 'Description'])['Quantity'].sum().unstack().fillna(0). \
            applymap(lambda x: 1 if x > 0 else 0)
gr_inv_pro_df = create_invoice_product_df(df_gr)
gr_inv_pro_df.head(20)

gr_inv_pro_df = create_invoice_product_df(df_gr, id=True)
gr_inv_pro_df.head()
def check_id(dataframe, stock_code):
    product_name = dataframe[dataframe["StockCode"] == stock_code][["Description"]].values[0].tolist()
    print(product_name)

check_id(df_gr, 16016)
# ['LARGE CHINESE STYLE SCISSOR']

Probabilities of All Possible Product Combinations

Recall the Support, Confidence, and Lift formulas defined at the beginning of the article.

The apriori function computes the support, the probability of items appearing together, for all itemsets; any itemset whose support falls below the min_support value we pass in is discarded.

frequent_itemsets = apriori(gr_inv_pro_df, min_support=0.01, use_colnames=True)
frequent_itemsets.sort_values("support", ascending=False).head()
#        support  itemsets
# 538   0.818381  (POST)
# 189   0.245077  (22326)
# 1864  0.225383  (POST, 22326)
# 191   0.157549  (22328)
# 1931  0.150985  (22328, POST)

check_id(df_gr, 22328)
# ['ROUND SNACK BOXES SET OF 4 FRUITS ']

(POST is the postage line item added to most invoices shipped to Germany, which is why its support is so high.)

By feeding the frequent itemsets found with Apriori into the association_rules function, we obtain the remaining statistics, such as confidence and lift.

rules = association_rules(frequent_itemsets, metric="support", min_threshold=0.01)

rules.sort_values("support", ascending=False).head()

The full table is too wide to fit, so the output below is condensed. According to it, the support of POST and product 22326, the probability of their appearing together on an invoice, is 0.225383. The confidence, the probability of buying 22326 when POST is bought, is 0.275401. And the lift tells us that buying one raises the probability of buying the other by a factor of 1.123735; note that lift is simply confidence divided by the consequent's support: 0.275401 / 0.245077 ≈ 1.1237.

# antecedents consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
# 2650 (POST) (22326) 0.818381 0.245077 0.225383 0.275401 1.123735 0.024817 1.041850
# 2651 (22326) (POST) 0.245077 0.818381 0.225383 0.919643 1.123735 0.024817 2.260151
# 2784 (22328) (POST) 0.157549 0.818381 0.150985 0.958333 1.171012 0.022049 4.358862
# 2785 (POST) (22328) 0.818381 0.157549 0.150985 0.184492 1.171012 0.022049 1.033038
# 2414 (22328) (22326) 0.157549 0.245077 0.131291 0.833333 3.400298 0.092679 4.529540
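To close the loop on the goal stated at the beginning, suggesting products during the purchasing process, here is a minimal recommender sketch built on the rules table. The function name arl_recommender and the choice to sort by lift are my assumptions, not part of the original article:

def arl_recommender(rules_df, product_id, rec_count=1):
    # Rules with higher lift represent stronger associations, so sort by lift.
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []
    for _, row in sorted_rules.iterrows():
        # If the product appears among the antecedents, suggest its consequents.
        if product_id in row["antecedents"]:
            recommendation_list.extend(list(row["consequents"]))
    # Deduplicate while preserving order, then return the top rec_count items.
    seen = set()
    unique = [x for x in recommendation_list if not (x in seen or seen.add(x))]
    return unique[:rec_count]

arl_recommender(rules, 22326, rec_count=2)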

Thank you for reading from start to finish. I felt the need to write this article to reinforce what I learned in Data Science School. I would like to thank Miuul and my Veri Bilimi Okulu teachers, especially Vahit Keskin, who contributed to my writing this article.


