Predicting Customer Value: Applying BG-NBD and Gamma-Gamma Models in Retail Analytics

Mustafa Germec, PhD
Python in Plain English
15 min read · Apr 17, 2024


Abstract

This project applies the BG-NBD (Beta Geometric/Negative Binomial Distribution) and Gamma-Gamma models to estimate Customer Lifetime Value (CLTV) in a retail context. The dataset contains customer purchase history from an omnichannel retail platform, with variables such as purchase frequency, recency, and monetary value. The BG-NBD model predicts the number of future transactions and the probability of customer churn, while the Gamma-Gamma model estimates the monetary value of those transactions. Together, these models help businesses make informed decisions on sales, marketing, and resource allocation by quantifying the future value of their customer base. The article also details the process of data preparation, model fitting, and CLTV calculation, providing insights into customer behavior and value segmentation.

Business Problem

FLO wants to determine a roadmap for its sales and marketing activities. For the company to make medium- and long-term plans, it needs to estimate the potential value that existing customers will provide in the future.

The dataset consists of information on the past shopping behavior of customers who made their last purchase from FLO via OmniChannel (both online and offline shopping) in 2020–2021.

Number of variables: 12
Number of observations: 19,945
Size of dataset: 2.7 MB

Variables

master_id: Unique customer number
order_channel: Which channel of the shopping platform is used (Android, iOS, Desktop, Mobile)
last_order_channel: Channel where the last purchase was made
first_order_date: The date of the customer’s first purchase
last_order_date: The last shopping date of the customer
last_order_date_online: The last shopping date of the customer on the online platform
last_order_date_offline: The last shopping date of the customer on the offline platform
order_num_total_ever_online: Total number of purchases made by the customer on the online platform
order_num_total_ever_offline: Total number of purchases made by the customer offline
customer_value_total_ever_offline: The total price paid by the customer for offline purchases
customer_value_total_ever_online: The total price paid by the customer for online purchases
interested_in_categories_12: List of categories the customer has shopped in the last 12 months

The BG-NBD (Beta Geometric/Negative Binomial Distribution) model and the Gamma-Gamma model are statistical models used in customer analytics, particularly for predicting customer behavior and calculating Customer Lifetime Value (CLTV). Let’s delve into each model separately:

BG-NBD Model

The BG-NBD model is a probabilistic model that predicts the number of future transactions a customer will make over a certain period. It is based on two types of customer behavior:

Transaction Behavior

The model assumes that the number of transactions a customer makes follows a Negative Binomial Distribution (NBD). This part of the model accounts for the transactional heterogeneity across customers, meaning that different customers have different purchasing patterns.

Dropout Behavior

The model also incorporates a customer’s probability of becoming inactive or “dropping out.” This is modeled using a Beta Geometric (BG) distribution. The BG distribution captures the heterogeneity in the dropout process, recognizing that customers have different propensities to churn.

The combination of these two distributions allows the BG-NBD model to estimate the expected number of transactions for each customer and the probability that a customer is still active. It is particularly useful for businesses with repeat purchase patterns, such as subscription services or regular consumable goods.
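Here is a minimal sketch of fitting the model with the lifetimes library, using a hypothetical toy summary (the frequency, recency, and T values below are made up purely for illustration):

import pandas as pd
from lifetimes import BetaGeoFitter

# Hypothetical RFM summary: frequency = repeat purchases,
# recency = age at last purchase, T = customer age (same time units)
toy = pd.DataFrame({
    'frequency': [5, 1, 12, 3, 8, 2],
    'recency':   [30.0, 2.0, 45.0, 12.0, 38.0, 5.0],
    'T':         [38.0, 20.0, 50.0, 25.0, 42.0, 30.0],
})

bgf_demo = BetaGeoFitter(penalizer_coef=0.001)
bgf_demo.fit(toy['frequency'], toy['recency'], toy['T'])

# Expected transactions per customer over the next 12 time units
expected = bgf_demo.conditional_expected_number_of_purchases_up_to_time(
    12, toy['frequency'], toy['recency'], toy['T']
)

# Probability that each customer is still active (has not dropped out)
p_alive = bgf_demo.conditional_probability_alive(
    toy['frequency'], toy['recency'], toy['T']
)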

Gamma-Gamma Model

The Gamma-Gamma model is often used in conjunction with the BG-NBD model to predict the monetary value of future transactions. While the BG-NBD model focuses on the purchase frequency, the Gamma-Gamma model is concerned with the monetary aspect.

The Gamma-Gamma model assumes that:

1. The monetary value of a customer’s transactions is randomly distributed around their average transaction value.
2. The average transaction value varies across customers but does not vary over time for any given customer.
3. The distribution of average transaction values across customers follows a Gamma distribution.

By fitting the Gamma-Gamma model to historical transaction data, businesses can estimate the expected average profit from a customer’s future transactions. This model is only applicable to customers with repeat purchases, as it requires a history of transaction values to make predictions.
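A corresponding sketch with lifetimes, again on hypothetical repeat buyers (monetary_value here is each customer's average transaction value, which is what the model expects as input):

import pandas as pd
from lifetimes import GammaGammaFitter

# Hypothetical repeat buyers: frequency > 0 and positive average spend
toy = pd.DataFrame({
    'frequency':      [5, 3, 12, 2, 7, 4],
    'monetary_value': [54.0, 120.5, 33.2, 80.0, 61.7, 45.3],
})

ggf_demo = GammaGammaFitter(penalizer_coef=0.01)
ggf_demo.fit(toy['frequency'], toy['monetary_value'])

# Expected average transaction value for each customer
avg_value = ggf_demo.conditional_expected_average_profit(
    toy['frequency'], toy['monetary_value']
)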

Combining BG-NBD and Gamma-Gamma for CLTV

To calculate Customer Lifetime Value (CLTV), businesses often use the BG-NBD model to predict the number of future transactions and the Gamma-Gamma model to predict the average profit per transaction. By combining these two predictions, businesses can estimate the total future profit from a customer over a given time horizon.

The CLTV is then calculated as:

CLTV = Expected Number of Transactions (from BG-NBD) × Expected Average Profit per Transaction (from Gamma-Gamma)
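For example (with purely illustrative numbers), if the BG-NBD model predicts that a customer will make 4 transactions over the next six months and the Gamma-Gamma model estimates an average profit of $50 per transaction, that customer's six-month CLTV estimate is 4 × $50 = $200, before any discounting is applied.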

These models are valuable for businesses because they help in making informed decisions about marketing, customer retention, and resource allocation by understanding the future value of their customer base.

In the lifetimes convention used below, the frequency column is the number of repeat purchases (total purchases minus 1), recency is the customer's age at their most recent purchase (i.e., the time between first and last purchase), T is the customer's age in the chosen time units (the time between first purchase and the end of the observation period), and monetary_value is the average profit per transaction for each customer.
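When starting from a raw transaction log rather than pre-aggregated totals, lifetimes can build this summary table directly. A sketch, where the file name and column names are hypothetical:

import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

# Hypothetical transaction log: one row per purchase
transactions = pd.read_csv('transactions.csv', parse_dates=['order_date'])

summary = summary_data_from_transaction_data(
    transactions,
    customer_id_col='customer_id',
    datetime_col='order_date',
    monetary_value_col='order_value',  # adds the monetary_value column
    observation_period_end='2021-06-01',
    freq='W',  # weekly time units, matching the analysis below
)
# summary now holds frequency, recency, T, and monetary_value per customer

In this project the dataset already ships with per-customer totals, so the summary columns are computed by hand instead.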

Task 1: Understanding and Preparing Data

Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.plotting import plot_period_transactions, plot_transaction_rate_heterogeneity, plot_frequency_recency_matrix
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

Read the flo.csv data and create a copy of the DataFrame.

df_ = pd.read_csv('Datasets/flo.csv')
flo = df_.copy()
flo.head()
                            master_id order_channel last_order_channel first_order_date last_order_date last_order_date_online last_order_date_offline  order_num_total_ever_online  order_num_total_ever_offline  customer_value_total_ever_offline  customer_value_total_ever_online       interested_in_categories_12
0 cc294636-19f0-11eb-8d74-000d3a38a36f Android App Offline 2020-10-30 2021-02-26 2021-02-21 2021-02-26 4.00 1.00 139.99 799.38 [KADIN]
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f Android App Mobile 2017-02-08 2021-02-16 2021-02-16 2020-01-10 19.00 2.00 159.97 1853.58 [ERKEK, COCUK, KADIN, AKTIFSPOR]
2 69b69676-1a40-11ea-941b-000d3a38a36f Android App Android App 2019-11-27 2020-11-27 2020-11-27 2019-12-01 3.00 2.00 189.97 395.35 [ERKEK, KADIN]
3 1854e56c-491f-11eb-806e-000d3a38a36f Android App Android App 2021-01-06 2021-01-17 2021-01-17 2021-01-06 1.00 1.00 39.99 81.98 [AKTIFCOCUK, COCUK]
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f Desktop Desktop 2019-08-03 2021-03-07 2021-03-07 2019-08-03 1.00 1.00 49.99 159.99 [AKTIFSPOR]

Define the outlier_thresholds and replace_with_thresholds functions required to cap (suppress) outliers

def outlier_thresholds(df, feature, q1=0.05, q3=0.95):
    # Calculate the lower and upper percentiles (here the 5th and 95th)
    Q1 = df[feature].quantile(q1)
    Q3 = df[feature].quantile(q3)

    # Calculate the interquartile range between those percentiles
    IQR = Q3 - Q1

    # Determine the outlier cutoffs (rounded to integers)
    lower_bound = int(round(Q1 - 1.5 * IQR, 0))
    upper_bound = int(round(Q3 + 1.5 * IQR, 0))

    return lower_bound, upper_bound

def replace_with_thresholds(dataframe, variable):
    lower_bound, upper_bound = outlier_thresholds(dataframe, variable)
    dataframe.loc[dataframe[variable] < lower_bound, variable] = lower_bound
    dataframe.loc[dataframe[variable] > upper_bound, variable] = upper_bound

Cap the order_num_total_ever_online, order_num_total_ever_offline, customer_value_total_ever_offline, and customer_value_total_ever_online variables if they contain outliers

variables = [col for col in flo.columns if 'ever' in col]
for variable in variables:
    replace_with_thresholds(flo, variable)
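As an optional sanity check, inspecting the capped columns should show maximum values at or below the computed upper bounds:

# Maximums should now sit at or below the computed upper bounds
flo[variables].describe([0.05, 0.95]).T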

Omnichannel means that customers shop on both online and offline platforms. Create new variables for each customer's total number of purchases and total spend.

flo['total_order_num'] = flo['order_num_total_ever_online'] + flo['order_num_total_ever_offline']
flo['total_order_value'] = flo['customer_value_total_ever_offline'] + flo['customer_value_total_ever_online']

Examine the variable types and convert the columns expressing dates to datetime.

flo.dtypes
date_variables = [col for col in flo.columns if 'date' in col]
flo[date_variables] = flo[date_variables].apply(pd.to_datetime)
flo.dtypes
# Before changing the data type
master_id object
order_channel object
last_order_channel object
first_order_date object
last_order_date object
last_order_date_online object
last_order_date_offline object
order_num_total_ever_online float64
order_num_total_ever_offline float64
customer_value_total_ever_offline float64
customer_value_total_ever_online float64
interested_in_categories_12 object
total_order_num float64
total_order_value float64
dtype: object

# After changing the data type
master_id object
order_channel object
last_order_channel object
first_order_date datetime64[ns]
last_order_date datetime64[ns]
last_order_date_online datetime64[ns]
last_order_date_offline datetime64[ns]
order_num_total_ever_online float64
order_num_total_ever_offline float64
customer_value_total_ever_offline float64
customer_value_total_ever_online float64
interested_in_categories_12 object
total_order_num float64
total_order_value float64
dtype: object

Task 2: Creating the CLTV Data Structure

Take the date two days after the last purchase in the dataset as the analysis date.

flo['last_order_date'].max()    # Timestamp('2021-05-30 00:00:00')
analysis_date = dt.datetime(2021, 6, 1)
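Hard-coding the date works here; a more reusable alternative (a small sketch) derives it from the data instead:

# Equivalent, but robust to refreshed data: two days after the last order
analysis_date = flo['last_order_date'].max() + pd.Timedelta(days=2)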

Create a new cltv dataframe containing the customer_id, recency_cltv_weekly, T_weekly, frequency, and monetary_cltv_avg values. Monetary value should be expressed as the average value per purchase, and recency and tenure should be expressed on a weekly basis.

cltv = pd.DataFrame()
cltv['customer_id'] = flo['master_id']
# Weeks between first and last purchase, and weeks of tenure
cltv['recency_cltv_weekly'] = (flo['last_order_date'] - flo['first_order_date']).dt.days / 7
cltv['T_weekly'] = (analysis_date - flo['first_order_date']).dt.days / 7
cltv['frequency'] = flo['total_order_num']
cltv['monetary_cltv_avg'] = flo['total_order_value'] / flo['total_order_num']
# Keep repeat customers only, as the Gamma-Gamma model requires
cltv = cltv[cltv['frequency'] > 1]
                                customer_id  recency_cltv_weekly  T_weekly  frequency  monetary_cltv_avg
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57 5.00 187.87
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86 21.00 95.88
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86 5.00 117.06
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86 2.00 60.98
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43 2.00 104.99
... ... ... ... ...
19940 727e2b6e-ddd4-11e9-a848-000d3a38a36f 41.14 88.43 3.00 133.99
19941 25cd53d4-61bf-11ea-8dd8-000d3a38a36f 42.29 65.29 2.00 195.24
19942 8aea4c2a-d6fc-11e9-93bc-000d3a38a36f 88.71 89.86 3.00 210.98
19943 e50bb46c-ff30-11e9-a5e8-000d3a38a36f 98.43 113.86 6.00 168.29
19944 740998d2-b1f7-11e9-89fa-000d3a38a36f 39.57 91.00 2.00 130.98
[19945 rows x 5 columns]
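Before fitting, it is worth verifying the structural constraints the models expect (an optional check):

# recency can never exceed T, and every customer must be a repeat
# buyer with a positive average basket value
assert (cltv['recency_cltv_weekly'] <= cltv['T_weekly']).all()
assert (cltv['frequency'] > 1).all()
assert (cltv['monetary_cltv_avg'] > 0).all()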

Task 3: Establishing BG/NBD, Gamma-Gamma Models and Calculating CLTV

Fit the BG/NBD model

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(
    cltv['frequency'],
    cltv['recency_cltv_weekly'],
    cltv['T_weekly']
)
<lifetimes.BetaGeoFitter: fitted with 19945 subjects, a: 0.00, alpha: 80.49, b: 0.00, r: 3.83>
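The fitted dropout parameters (a ≈ 0, b ≈ 0) suggest the model detects essentially no dropout in this two-year window, so churn probabilities will be close to zero for everyone. If needed, each customer's probability of still being active can be computed like this:

# Probability that each customer is still active ("alive");
# with a ≈ 0 this will be ~1.0 across the board
prob_alive = bgf.conditional_probability_alive(
    cltv['frequency'],
    cltv['recency_cltv_weekly'],
    cltv['T_weekly']
)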

Estimate the expected purchases from customers within 3 months and add it to the cltv dataframe as expected_3_month.

cltv['expected_3_month'] = bgf.predict(
    4 * 3,  # 12 weeks
    cltv['frequency'],
    cltv['recency_cltv_weekly'],
    cltv['T_weekly']
)
                                customer_id  recency_cltv_weekly  T_weekly  frequency  monetary_cltv_avg  expected_3_month
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57 5.00 187.87 0.95
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86 21.00 95.88 0.98
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86 5.00 117.06 0.66
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86 2.00 60.98 0.69
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43 2.00 104.99 0.40
... ... ... ... ... ...
19940 727e2b6e-ddd4-11e9-a848-000d3a38a36f 41.14 88.43 3.00 133.99 0.48
19941 25cd53d4-61bf-11ea-8dd8-000d3a38a36f 42.29 65.29 2.00 195.24 0.48
19942 8aea4c2a-d6fc-11e9-93bc-000d3a38a36f 88.71 89.86 3.00 210.98 0.48
19943 e50bb46c-ff30-11e9-a5e8-000d3a38a36f 98.43 113.86 6.00 168.29 0.61
19944 740998d2-b1f7-11e9-89fa-000d3a38a36f 39.57 91.00 2.00 130.98 0.41
[19945 rows x 6 columns]

Estimate the expected purchases from customers within 6 months and add it to the cltv dataframe as exp_sales_6_month.

cltv['exp_sales_6_month'] = bgf.predict(
    4 * 6,  # 24 weeks
    cltv['frequency'],
    cltv['recency_cltv_weekly'],
    cltv['T_weekly']
)
                                customer_id  recency_cltv_weekly  T_weekly  frequency  monetary_cltv_avg  expected_3_month  exp_sales_6_month
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57 5.00 187.87 0.95 1.91
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86 21.00 95.88 0.98 1.95
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86 5.00 117.06 0.66 1.33
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86 2.00 60.98 0.69 1.38
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43 2.00 104.99 0.40 0.79
... ... ... ... ... ... ...
19940 727e2b6e-ddd4-11e9-a848-000d3a38a36f 41.14 88.43 3.00 133.99 0.48 0.97
19941 25cd53d4-61bf-11ea-8dd8-000d3a38a36f 42.29 65.29 2.00 195.24 0.48 0.96
19942 8aea4c2a-d6fc-11e9-93bc-000d3a38a36f 88.71 89.86 3.00 210.98 0.48 0.96
19943 e50bb46c-ff30-11e9-a5e8-000d3a38a36f 98.43 113.86 6.00 168.29 0.61 1.21
19944 740998d2-b1f7-11e9-89fa-000d3a38a36f 39.57 91.00 2.00 130.98 0.41 0.82
[19945 rows x 7 columns]
plot_period_transactions(bgf)
plt.show()
plot_frequency_recency_matrix(bgf)
plt.show()
plot_transaction_rate_heterogeneity(bgf)
plt.show()
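For reference: plot_period_transactions compares the actual and model-predicted counts of customers at each repeat-transaction level (a quick goodness-of-fit check), plot_frequency_recency_matrix maps expected future purchases as a function of recency and frequency, and plot_transaction_rate_heterogeneity shows the fitted Gamma distribution of individual transaction rates.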

Examine the 10 customers expected to make the most purchases within 3 and 6 months, and check whether the two lists differ.

cltv.sort_values('expected_3_month', ascending=False).head(10)
                               customer_id  recency_cltv_weekly  T_weekly  frequency  monetary_cltv_avg  expected_3_month  exp_sales_6_month
8328 1902bf80-0035-11eb-8341-000d3a38a36f 28.86 33.29 25.00 97.44 3.04 6.08
15611 4a7e875e-e6ce-11ea-8f44-000d3a38a36f 39.71 40.00 26.00 156.79 2.97 5.94
19538 55d54d9e-8ac7-11ea-8ec0-000d3a38a36f 52.57 58.71 28.00 156.49 2.74 5.49
14373 f00ad516-c4f4-11ea-98f7-000d3a38a36f 38.00 46.43 25.00 152.66 2.73 5.45
6666 53fe00d4-7b7a-11eb-960b-000d3a38a36f 9.71 13.00 17.00 206.93 2.67 5.35
7330 a4d534a2-5b1b-11eb-8dbd-000d3a38a36f 62.71 67.29 28.00 165.70 2.58 5.17
6756 27310582-6362-11ea-a6dc-000d3a38a36f 62.71 64.14 25.00 156.24 2.39 4.78
14054 645b95bc-544e-11ea-b1db-000d3a38a36f 71.43 72.00 26.00 160.99 2.35 4.69
1364 a2c95e4e-5b09-11ea-acac-000d3a38a36f 55.29 67.43 23.00 156.60 2.18 4.35
5759 dd8f7930-615f-11ea-8dd8-000d3a38a36f 32.43 47.00 19.00 65.15 2.15 4.30
cltv.sort_values('exp_sales_6_month', ascending=False).head(10)
                               customer_id  recency_cltv_weekly  T_weekly  frequency  monetary_cltv_avg  expected_3_month  exp_sales_6_month
8328 1902bf80-0035-11eb-8341-000d3a38a36f 28.86 33.29 25.00 97.44 3.04 6.08
15611 4a7e875e-e6ce-11ea-8f44-000d3a38a36f 39.71 40.00 26.00 156.79 2.97 5.94
19538 55d54d9e-8ac7-11ea-8ec0-000d3a38a36f 52.57 58.71 28.00 156.49 2.74 5.49
14373 f00ad516-c4f4-11ea-98f7-000d3a38a36f 38.00 46.43 25.00 152.66 2.73 5.45
6666 53fe00d4-7b7a-11eb-960b-000d3a38a36f 9.71 13.00 17.00 206.93 2.67 5.35
7330 a4d534a2-5b1b-11eb-8dbd-000d3a38a36f 62.71 67.29 28.00 165.70 2.58 5.17
6756 27310582-6362-11ea-a6dc-000d3a38a36f 62.71 64.14 25.00 156.24 2.39 4.78
14054 645b95bc-544e-11ea-b1db-000d3a38a36f 71.43 72.00 26.00 160.99 2.35 4.69
1364 a2c95e4e-5b09-11ea-acac-000d3a38a36f 55.29 67.43 23.00 156.60 2.18 4.35
5759 dd8f7930-615f-11ea-8dd8-000d3a38a36f 32.43 47.00 19.00 65.15 2.15 4.30

Based on the analysis, the top 10 customers by expected purchases are identical for the 3-month and 6-month horizons; this is expected, since with negligible dropout the 6-month forecasts are essentially the 3-month forecasts doubled, leaving the ranking unchanged. These customers have recency values close to their tenure (T_weekly), meaning they have purchased recently, and very high frequencies (17–28 purchases), meaning they buy often. Most also have above-average basket values. They are expected to bring significant revenue to the company in the coming months.

Fit the Gamma-Gamma model. Estimate the expected average transaction value per customer and add it to the cltv dataframe as exp_average_value.

gg = GammaGammaFitter(penalizer_coef=0.01)
gg.fit(
    cltv['frequency'],
    cltv['monetary_cltv_avg']
)
<lifetimes.GammaGammaFitter: fitted with 19945 subjects, p: 4.15, q: 0.47, v: 4.08>

cltv['exp_average_value'] = gg.conditional_expected_average_profit(
    cltv['frequency'],
    cltv['monetary_cltv_avg']
)
                                customer_id  recency_cltv_weekly  T_weekly  frequency  monetary_cltv_avg  expected_3_month  exp_sales_6_month  exp_average_value
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57 5.00 187.87 0.95 1.91 193.63
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86 21.00 95.88 0.98 1.95 96.67
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86 5.00 117.06 0.66 1.33 120.97
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86 2.00 60.98 0.69 1.38 67.32
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43 2.00 104.99 0.40 0.79 114.33
... ... ... ... ... ... ... ...
19940 727e2b6e-ddd4-11e9-a848-000d3a38a36f 41.14 88.43 3.00 133.99 0.48 0.97 141.36
19941 25cd53d4-61bf-11ea-8dd8-000d3a38a36f 42.29 65.29 2.00 195.24 0.48 0.96 210.72
19942 8aea4c2a-d6fc-11e9-93bc-000d3a38a36f 88.71 89.86 3.00 210.98 0.48 0.96 221.78
19943 e50bb46c-ff30-11e9-a5e8-000d3a38a36f 98.43 113.86 6.00 168.29 0.61 1.21 172.65
19944 740998d2-b1f7-11e9-89fa-000d3a38a36f 39.57 91.00 2.00 130.98 0.41 0.82 142.09
[19945 rows x 8 columns]
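One caveat worth checking: the Gamma-Gamma model assumes that purchase frequency and average transaction value are roughly independent. A quick, optional correlation check supports or undermines that assumption:

# A correlation near zero supports the model's independence assumption
cltv[['frequency', 'monetary_cltv_avg']].corr()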

Calculate 6-month CLTV, add it to the dataframe with the name cltv_df, and observe the 20 customers with the highest CLTV values.

cltv_df = gg.customer_lifetime_value(
    bgf,
    cltv['frequency'],
    cltv['recency_cltv_weekly'],
    cltv['T_weekly'],
    cltv['monetary_cltv_avg'],
    time=6,            # horizon in months
    freq='W',          # frequency/recency/T are in weeks
    discount_rate=0.01
)

cltv['cltv_df'] = cltv_df
cltv.sort_values('cltv_df', ascending=False).head(20)
                                customer_id  recency_cltv_weekly  T_weekly  frequency  monetary_cltv_avg  expected_3_month  exp_sales_6_month  exp_average_value  cltv_df
9055 47a642fe-975b-11eb-8c2a-000d3a38a36f 2.86 7.86 4.00 1065.80 1.06 2.13 1101.98 2458.35
13880 7137a5c0-7aad-11ea-8f20-000d3a38a36f 6.14 13.14 11.00 394.09 1.90 3.80 399.09 1591.37
8868 9ce6e520-89b0-11ea-a6e7-000d3a38a36f 3.43 34.43 8.00 601.23 1.23 2.47 611.49 1584.70
6402 851de3b4-8f0c-11eb-8cb8-000d3a38a36f 8.29 9.43 2.00 862.69 0.78 1.56 923.68 1507.20
14858 031b2954-6d28-11eb-99c4-000d3a38a36f 14.86 15.57 3.00 743.59 0.85 1.71 778.05 1392.35
6717 40b4f318-9dfb-11eb-9c47-000d3a38a36f 27.14 33.86 7.00 544.70 1.14 2.27 555.41 1324.24
11694 90f1b7f2-bbad-11ea-a0c9-000d3a38a36f 47.29 48.00 6.00 647.34 0.92 1.84 662.11 1275.11
1853 f02473b0-43c3-11eb-806e-000d3a38a36f 17.29 23.14 2.00 835.88 0.67 1.35 895.04 1267.18
11179 d2e74a36-3228-11eb-860c-000d3a38a36f 1.14 26.29 3.00 750.57 0.77 1.53 785.34 1264.37
7936 ae4ce104-dbd4-11ea-8757-000d3a38a36f 3.71 42.00 3.00 844.35 0.67 1.34 883.29 1239.61
9738 3a27b334-dff4-11ea-acaa-000d3a38a36f 40.00 41.14 3.00 837.06 0.67 1.35 875.67 1237.59
7312 90befc98-925a-11eb-b584-000d3a38a36f 4.14 8.86 6.00 431.33 1.32 2.64 441.40 1222.48
7171 77e66e92-31fa-11eb-860c-000d3a38a36f 16.86 26.29 5.00 566.76 0.99 1.98 582.44 1212.44
15516 9083981a-f59e-11e9-841e-000d3a38a36f 63.57 83.86 4.00 971.50 0.57 1.14 1004.57 1204.68
10876 ae149d98-9b6a-11eb-9c47-000d3a38a36f 6.14 7.14 9.00 317.48 1.76 3.51 322.51 1188.72
6666 53fe00d4-7b7a-11eb-960b-000d3a38a36f 9.71 13.00 17.00 206.93 2.67 5.35 208.74 1170.99
18997 41231c72-566a-11eb-9e65-000d3a38a36f 2.57 4.57 7.00 344.01 1.53 3.05 351.00 1125.00
16087 4031bc1e-c52a-11ea-9dde-000d3a38a36f 38.71 46.43 7.00 512.40 1.02 2.05 522.51 1122.39
1775 020e2b84-5bbb-11eb-8dbd-000d3a38a36f 1.00 18.71 5.00 443.06 1.07 2.14 455.51 1020.57
6857 0515a7ec-d49f-11ea-9838-000d3a38a36f 41.14 43.29 16.00 247.93 1.92 3.84 250.18 1009.14

Task 4: Creating Segments According to cltv_df Value

Divide all customers into 4 groups (segments) according to the 6-month cltv_df value and add the segment names to the dataset.

cltv['cltv_segment'] = pd.qcut(cltv['cltv_df'], 4, labels=['D', 'C', 'B', 'A'])
                                customer_id  recency_cltv_weekly  T_weekly  frequency  monetary_cltv_avg  expected_3_month  exp_sales_6_month  exp_average_value  cltv_df cltv_segment
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57 5.00 187.87 0.95 1.91 193.63 387.52 A
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86 21.00 95.88 0.98 1.95 96.67 197.91 B
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86 5.00 117.06 0.66 1.33 120.97 168.73 B
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86 2.00 60.98 0.69 1.38 67.32 97.46 D
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43 2.00 104.99 0.40 0.79 114.33 95.35 D
... ... ... ... ... ... ... ... ... ...
19940 727e2b6e-ddd4-11e9-a848-000d3a38a36f 41.14 88.43 3.00 133.99 0.48 0.97 141.36 143.86 C
19941 25cd53d4-61bf-11ea-8dd8-000d3a38a36f 42.29 65.29 2.00 195.24 0.48 0.96 210.72 212.09 B
19942 8aea4c2a-d6fc-11e9-93bc-000d3a38a36f 88.71 89.86 3.00 210.98 0.48 0.96 221.78 223.80 B
19943 e50bb46c-ff30-11e9-a5e8-000d3a38a36f 98.43 113.86 6.00 168.29 0.61 1.21 172.65 219.82 B
19944 740998d2-b1f7-11e9-89fa-000d3a38a36f 39.57 91.00 2.00 130.98 0.41 0.82 142.09 121.57 C
[19945 rows x 10 columns]

Make 6-month action recommendations to the management for the segments.

cltv.groupby('cltv_segment').agg({
    'recency_cltv_weekly': ['mean', 'sum'],
    'T_weekly': ['mean', 'sum'],
    'frequency': ['mean', 'sum'],
    'monetary_cltv_avg': ['mean', 'sum'],
    'exp_sales_6_month': ['mean', 'sum'],
    'exp_average_value': ['mean', 'sum'],
    'cltv_df': ['mean', 'sum']
})
             recency_cltv_weekly        T_weekly             frequency      monetary_cltv_avg   exp_sales_6_month   exp_average_value           cltv_df
                mean        sum      mean        sum      mean       sum      mean         sum     mean       sum     mean         sum     mean         sum
cltv_segment
D             138.51  690772.57    161.73  806528.29      3.77  18821.00     92.81   462857.90     0.82   4094.58    98.33   490368.34    80.48   401331.61
C              92.71  462256.00    112.81  562491.86      4.38  21830.00    125.86   627533.90     1.05   5212.31   132.33   659788.87   137.94   687755.47
B              83.01  413870.29    101.33  505242.86      5.13  25592.00    160.53   800382.01     1.19   5953.44   167.85   836887.49   198.43   989390.83
A              66.81  333131.57     82.01  408877.71      6.36  31717.00    228.63  1139936.58     1.50   7496.35   237.88  1186088.00   352.96  1759872.40

The table above shows the mean and total values of each metric for the four segments (D, C, B, A). Let’s analyze the data more closely:

Recency (recency_cltv_weekly): Average recency decreases from segment D (138.5 weeks) to segment A (66.8 weeks). Since recency here measures the weeks between a customer’s first and last purchase, segment A’s purchase histories are more compressed and more recent.

Tenure (T_weekly): Average tenure also decreases from segment D (161.7 weeks) to segment A (82.0 weeks). Segment A customers are therefore newer customers, not longer-standing ones.

Frequency: Average frequency increases from segment D (3.8) to segment A (6.4). Despite their shorter tenure, segment A customers buy markedly more often.

Monetary Value (monetary_cltv_avg): Average purchase value rises from segment D (92.8) to segment A (228.6), so segment A customers also spend more per transaction.

Expected Sales in 6 Months (exp_sales_6_month): The expected number of purchases over the next 6 months rises from 0.82 in segment D to 1.50 in segment A.

Expected Average Value (exp_average_value): The model’s estimate of future average transaction value rises from 98.3 in segment D to 237.9 in segment A.

CLTV (cltv_df): Average 6-month CLTV climbs from 80.5 in segment D to 353.0 in segment A, more than a fourfold difference.

Overall, segment A consists of relatively new but highly engaged customers who purchase frequently and spend heavily; loyalty programs, early-access offers, and retention-focused communication are appropriate to lock in their momentum. Segment D consists of long-tenured but infrequent, low-spending customers, for whom reactivation campaigns and win-back offers make more sense. Segments B and C sit in between and can be nudged upward with cross-sell and frequency-building promotions.

Conclusions

In conclusion, the integration of the BG-NBD and Gamma-Gamma models presents a robust approach to predicting customer behavior and calculating Customer Lifetime Value (CLTV) in retail. By analyzing purchase history data, these models offer valuable insights into customer retention and profitability, guiding strategic business decisions. The study underscores the importance of CLTV as a metric for evaluating marketing efforts and optimizing resource allocation. Future research could explore the models’ applicability across different industries and customer segments, potentially enhancing their predictive accuracy and broadening their utility in diverse market contexts.
