Telco Customer Churn Prediction

This article shows how to predict telco customer churn in Python.

Emine Bozkus
Python in Plain English


Customer churn is a fundamental problem for the telecommunication (Telco) industry. It is defined as the loss of customers who move from one Telco operator to another. If churn can be predicted in advance, answering questions such as “is this customer going to leave us within the next X months?”, Telco operators can apply targeted marketing policies to at-risk customers in order to retain them and grow the customer base. In particular, given millions of Telco customers, reducing the churn rate by even 1% can lead to a significant increase in profit.

We can roughly define churn analysis as the set of analytical studies on “a customer”, “a product or service”, and “the probability of abandonment”. The aim is to detect that a customer is approaching the point of leaving before they actually leave (the customer may not even be aware of it themselves) and then to take preventive action.

Figure 1. Customer Churn

The Telco churn data describes a fictitious telecom company that provided home phone and Internet services to 7,043 California customers in the third quarter. It shows which customers have left, stayed, or signed up for the service.

Figure 2. Architecture and Design of a Platform for Adaptive, Real-time Churn Prediction (Balle et al., 2013)

Business Problem

The goal is to develop a machine learning model that can predict which customers will leave the company. The necessary data analysis and feature engineering steps are carried out before the model is developed.

This project uses the Telco Customer Churn dataset from Kaggle. Each row represents a customer, and each column contains one of the customer’s attributes. The dataset contains 21 columns (variables) and 7,043 rows (customers) with information such as customerID, gender, PhoneService, and InternetService.

Analysis of data columns to identify independent and dependent variables:

X is the set of independent variables: the variables we use to make predictions

  • customerID — unique value identifying customer
  • gender — whether the customer is a male or a female
  • SeniorCitizen — whether the customer is a senior citizen or not (1, 0)
  • Partner — whether the customer has a partner or not (Yes, No)
  • Dependents — whether the customer has dependents or not (Yes, No). A dependent is a person who relies on another as a primary source of income
  • tenure — number of months the customer has stayed with the company
  • PhoneService — whether the customer has a phone service or not (Yes, No)
  • MultipleLines — whether the customer has multiple lines or not (Yes, No, No phone service)
  • InternetService — customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity — whether the customer has online security or not (Yes, No, No internet service)
  • OnlineBackup — whether the customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection — whether the customer has device protection or not (Yes, No, No internet service)
  • TechSupport — whether the customer has tech support or not (Yes, No, No internet service)
  • StreamingTV — whether the customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies — whether the customer has streaming movies or not (Yes, No, No internet service)
  • Contract — type of contract according to duration (Month-to-month, One year, Two year)
  • PaperlessBilling — bills issued in paperless form (Yes, No)
  • PaymentMethod — payment method used by customer (Electronic check, Mailed check, Credit card (automatic), Bank transfer (automatic))
  • MonthlyCharges — amount of charge for service on monthly bases
  • TotalCharges — cumulative charges for service during subscription (tenure) period

y is the dependent variable: the variable we are trying to predict or estimate

  • Churn — whether the customer churned or not (Yes, No); this is the target to predict

Import libraries

First, let’s import the necessary Python libraries:

# for data manipulation
import numpy as np
import pandas as pd

# for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for data splitting, transforming and model training
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Import Warnings
import warnings
warnings.simplefilter(action="ignore")
# Setting Configurations:

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Data Reading

Let’s import the dataset.

The first step of the analysis consists of reading and storing the data in a Pandas data frame using the pandas.read_csv function.

df = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

Let’s look at the first 5 observations of the dataset.

df.head()

The head output shows that our target variable is “Churn” and that the dataset mixes numerical and categorical variables across a fairly small number of columns. Whether there are missing values cannot be judged from five rows, so the helper below prints the column types (via the info function), the shape, the first and last rows, the missing-value counts, and descriptive statistics.

def check_data(dataframe, head=5):
    # Print a quick overview of the dataframe: types, shape, sample rows, missing values, statistics
    print(20*"-" + "Information".center(20) + 20*"-")
    print(dataframe.info())
    print(20*"-" + "Data Shape".center(20) + 20*"-")
    print(dataframe.shape)
    print("\n" + 20*"-" + "The First 5 Data".center(20) + 20*"-")
    print(dataframe.head(head))
    print("\n" + 20 * "-" + "The Last 5 Data".center(20) + 20 * "-")
    print(dataframe.tail(head))
    print("\n" + 20 * "-" + "Missing Values".center(20) + 20 * "-")
    print(dataframe.isnull().sum())
    print("\n" + 40 * "-" + "Describe the Data".center(40) + 40 * "-")
    print(dataframe.describe([0.01, 0.05, 0.10, 0.50, 0.75, 0.90, 0.95, 0.99]).T)

check_data(df)

The “TotalCharges” column is stored as an object (string) column, so we have to convert it to a numeric type.

# Boolean mask of the entries that cannot be parsed as numbers
pd.to_numeric(df.TotalCharges, errors='coerce').isna()
# With errors='coerce', every non-numeric value is replaced with NaN
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
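
To make the effect of errors='coerce' concrete, here is a minimal sketch on a made-up Series. The values are invented for illustration, and the blank entry is only an assumption about how an unparseable value might look in a text column.

```python
import pandas as pd

# Toy stand-in for a text column holding numbers (values are invented).
# The single-space entry plays the role of a value that cannot be parsed.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)          # numeric dtype after conversion
print(converted.isna().sum())   # how many entries became NaN
```

Entries that parse keep their value, while anything that does not parse becomes NaN and shows up later in the missing-value analysis.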

Let’s convert the dependent variable to a binary variable.

First of all, we encode the target variable, Churn, as 0/1 to make it suitable for analysis, because we will examine all other variables according to their relationship with churn.

# To find the number of churners and non-churners in the dataset:
df["Churn"].value_counts()
No     5174
Yes    1869
Name: Churn, dtype: int64

df["Churn"] = df["Churn"].map({'No': 0, 'Yes': 1})
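
As a quick sanity check, the counts above can be turned into an overall churn rate. The sketch below copies the two counts from the value_counts output into a small Series; it does not read the dataset itself.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"No": 5174, "Yes": 1869})

churn_rate = counts["Yes"] / counts.sum()
print(f"Churn rate: {churn_rate:.1%}")

# Once Churn is mapped to 0/1, the mean of the column is the same quantity.
churn_01 = pd.Series([0] * 5174 + [1] * 1869)
print(churn_01.mean())
```

Roughly a quarter of the customers churned, so the two classes are imbalanced, which is worth keeping in mind when reading the model's accuracy later.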

def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Return the names of the categorical, numerical, and categorical-but-cardinal variables in the dataset.
    Note: Numeric-looking variables with few unique values are also counted as categorical.

    Parameters
    ------
    dataframe: DataFrame
        The dataframe from which variable names are to be retrieved
    cat_th: int, optional
        Threshold on the number of unique values for numeric but categorical variables
    car_th: int, optional
        Threshold on the number of unique values for categorical but cardinal variables

    Returns
    ------
    cat_cols: list
        Categorical variable list
    num_cols: list
        Numeric variable list
    cat_but_car: list
        Categorical but cardinal variable list

    Examples
    ------
    You just need to call the function and pass the dataframe:

    --> grab_col_names(df)

    Notes
    ------
    cat_cols + num_cols + cat_but_car = total number of variables
    num_but_cat is inside cat_cols.
    """
    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')

    return cat_cols, num_cols, cat_but_car


cat_cols, num_cols, cat_but_car = grab_col_names(df)
Observations: 7043
Variables: 21
cat_cols: 17
num_cols: 3
cat_but_car: 1
num_but_cat: 2

cat_cols
num_cols
['tenure', 'MonthlyCharges', 'TotalCharges']
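
The threshold logic is easier to follow on a tiny frame. The sketch below is a simplified re-implementation, not the function above: the name classify_columns and the toy columns are hypothetical, and it uses pandas' is_numeric_dtype instead of comparing the dtype with "O".

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Invented toy data with one column of each kind.
toy = pd.DataFrame({
    "customerID": [f"id{i}" for i in range(30)],               # text, 30 unique values
    "Contract": ["Monthly", "One year", "Two year"] * 10,      # text, 3 unique values
    "SeniorCitizen": [0, 1] * 15,                              # numeric, 2 unique values
    "tenure": list(range(30)),                                 # numeric, 30 unique values
})

def classify_columns(frame, cat_th=10, car_th=20):
    """Hypothetical helper mirroring the cat_th / car_th thresholds."""
    text_cols = [c for c in frame.columns if not is_numeric_dtype(frame[c])]
    # Numeric columns with few unique values are treated as categorical.
    num_but_cat = [c for c in frame.columns
                   if is_numeric_dtype(frame[c]) and frame[c].nunique() < cat_th]
    # Text columns with many unique values are "cardinal" (ID-like).
    cat_but_car = [c for c in text_cols if frame[c].nunique() > car_th]
    cat = [c for c in text_cols + num_but_cat if c not in cat_but_car]
    num = [c for c in frame.columns
           if is_numeric_dtype(frame[c]) and c not in num_but_cat]
    return cat, num, cat_but_car

toy_cat, toy_num, toy_car = classify_columns(toy)
print(toy_cat, toy_num, toy_car)
```

Here the ID-like column ends up in the cardinal list, the 0/1 flag joins the categorical list even though it is numeric, and only tenure remains as a true numeric variable.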

Analyze the numerical and categorical variables.

def target_vs_category_visual(dataframe, target, categorical_col):
    # Histogram of the target, split by the categories of one categorical variable
    plt.figure(figsize=(15, 8))
    sns.histplot(x=target, hue=categorical_col, data=dataframe, element="step", multiple="dodge")
    plt.title("State of Categorical Variables according to Churn ")
    plt.show()

for col in cat_cols:
    target_vs_category_visual(df, "Churn", col)

Perform target variable analysis.

We look at the mean of each numeric variable for each value of the target, and at the mean of the target for each category of the categorical variables.

def target_summary_with_num(dataframe, target, numerical_col):
    # Mean of a numeric column for churners (1) and non-churners (0)
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n")

for col in num_cols:
    target_summary_with_num(df, "Churn", col)
       tenure
Churn
0      37.570
1      17.979

       MonthlyCharges
Churn
0              61.265
1              74.441

       TotalCharges
Churn
0          2555.344
1          1531.796

We continue the analysis by examining the relationship of the Churn variable with other variables.

def target_summary_with_cat(dataframe, target, categorical_col):
    # Churn rate (mean of the 0/1 target) within each category
    print(pd.DataFrame({"CHURN_MEAN": dataframe.groupby(categorical_col)[target].mean()}))


for col in cat_cols:
    target_summary_with_cat(df, "Churn", col)

  • There is no notable difference between the churn rates of male and female customers.
  • The churn rate of customers paying by electronic check seems high.
  • Customers with month-to-month contracts seem to have a high churn rate.
  • Customers using fiber optic internet seem to have a high churn rate.
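
The observations above rest on the fact that the mean of a 0/1 target within a group is that group's churn rate. A minimal sketch on invented data (the eight rows below are made up and do not come from the dataset):

```python
import pandas as pd

# Invented toy data: 4 month-to-month customers and 4 two-year customers.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn":    [1, 1, 0, 1,             0, 0, 0, 1],
})

# Mean of the 0/1 target per category = share of churners in that category.
rates = toy.groupby("Contract")["Churn"].mean()
print(rates)
```

In this toy frame three of the four month-to-month customers churned, so their CHURN_MEAN is 0.75, against 0.25 for the two-year group.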

Perform outlier observation analysis.

def outlier_thresholds(dataframe, col_name, q1=0.05, q3=0.95):
    # Limits are 1.5 times the quantile range beyond the chosen lower and upper quantiles
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def check_outlier(dataframe, col_name):
    # True if at least one value lies outside the limits
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    is_outlier = (dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)
    return bool(is_outlier.any())

def grab_outliers(dataframe, col_name, index=False):
    # Print the outlier rows (only the first 5 if there are more than 10)
    low, up = outlier_thresholds(dataframe, col_name)
    outliers = dataframe[(dataframe[col_name] < low) | (dataframe[col_name] > up)]

    if outliers.shape[0] > 10:
        print(outliers.head())
    else:
        print(outliers)

    if index:
        return outliers.index

for col in num_cols:
    print(col, check_outlier(df, col))
tenure False
MonthlyCharges False
TotalCharges False
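
To see what the limits look like in numbers, here is a small worked example with the same 5% / 95% quantiles and the 1.5 multiplier. The series is invented: the integers 1 to 100 plus one extreme value.

```python
import pandas as pd

# Invented data: 1..100 plus a single extreme value.
values = pd.Series(list(range(1, 101)) + [1000])

q_low = values.quantile(0.05)
q_high = values.quantile(0.95)
spread = q_high - q_low

low_limit = q_low - 1.5 * spread
up_limit = q_high + 1.5 * spread

# Values outside [low_limit, up_limit] are treated as outliers.
flagged = values[(values < low_limit) | (values > up_limit)]
print(low_limit, up_limit)
print(flagged.tolist())
```

Because the limits are built from the 5th and 95th percentiles rather than the quartiles, they are wide, so only clearly extreme values are flagged. That is consistent with none of the three numeric columns reporting outliers above.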

Perform a missing observation analysis.

def missing_values_table(dataframe, na_name=False):
    # Columns that contain at least one missing value
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]

    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df, end="\n")

    if na_name:
        return na_columns


missing_values_table(df)
              n_miss  ratio
TotalCharges      11  0.160

Perform correlation analysis.

# Correlation among the numeric variables
corr_matrix = df[num_cols].corr()
corr_matrix

# Correlation of every numeric column with Churn (text columns are excluded)
df.select_dtypes(include="number").corrwith(df["Churn"]).sort_values(ascending=False)
Churn             1.000
MonthlyCharges    0.193
SeniorCitizen     0.151
TotalCharges     -0.199
tenure           -0.352
dtype: float64

# Correlation between all variables (each column is factorized to integer codes first)
plt.figure(5, figsize=(25, 10))
corr = df.apply(lambda x: pd.factorize(x)[0]).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
ax = sns.heatmap(corr, mask=mask, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, linewidths=.2, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()

# Correlation between churn and selected boolean and numeric variables
plt.figure()
yes_no_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
ds_corr = df[['SeniorCitizen', 'Partner', 'Dependents',
              'tenure', 'PhoneService', 'PaperlessBilling',
              'MonthlyCharges', 'TotalCharges']].copy()
# These columns still hold "Yes"/"No" text at this point, so map them to 1/0 before correlating
ds_corr[yes_no_cols] = ds_corr[yes_no_cols].apply(lambda s: s.map({'Yes': 1, 'No': 0}))

correlations = ds_corr.corrwith(df.Churn)
correlations = correlations[correlations != 1]
correlations.plot.bar(
    figsize=(20, 10),
    fontsize=12,
    color='#9BCD9B',
    rot=40, grid=True)

plt.title('Correlation with Churn Rate \n', horizontalalignment="center", fontstyle="normal", fontsize=20, fontfamily="charter")
plt.show()

By looking at the correlation results, we can make the following comments.

  • There is a positive correlation between churn and senior-citizen status: senior citizens churn more often. Perhaps competitors are running campaigns that target the senior population.
  • Logically, longer tenure could also mean more loyalty and less churn risk.
  • It is also logical that higher monthly charges can result in more churn risk.
  • However, it is interesting that total charges show a negative correlation with churn. The explanation may be that total charges also depend on the time the customer has spent with the company (and tenure has a negative correlation with churn). It is also questionable whether TotalCharges is an adequate variable for understanding customer behavior, and whether customers track it at all.
  • A positive correlation between paperless billing and churn needs extra exploration (it is not clear what the drivers of that behavior could be).
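
The point about total charges depending on tenure can be illustrated with simulated data. The sketch below is not based on the Telco dataset: it assumes, purely for illustration, that total charges equal tenure times monthly charges and that tenure and monthly charges are independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Simulated customers (all values are synthetic).
sim = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),            # months, 1..72
    "MonthlyCharges": rng.uniform(20, 120, size=n),   # arbitrary price range
})
# Assumption for this sketch: total = months stayed * monthly price.
sim["TotalCharges"] = sim["tenure"] * sim["MonthlyCharges"]

corr_with_tenure = sim["TotalCharges"].corr(sim["tenure"])
print(round(corr_with_tenure, 3))
```

Under this assumption total charges are strongly correlated with tenure, so a variable that is negatively correlated with tenure, such as churn, tends to be negatively correlated with total charges as well. On the real data the same check could be run with df[["tenure", "TotalCharges"]].corr().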

Take necessary actions for missing and contradictory observations.

na_cols = missing_values_table(df, True)

# Drop the 11 rows with a missing TotalCharges value.
# (Filling them with the median instead would be an alternative; doing both is redundant,
# because nothing is left to fill once the rows have been dropped.)
df.dropna(inplace=True)

df.isnull().sum()
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
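
Dropping rows and filling with the median are two different ways of handling missing values. The following sketch contrasts them on an invented four-row frame; the numbers are made up.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing TotalCharges value.
toy = pd.DataFrame({
    "tenure": [0, 12, 24, 36],
    "TotalCharges": [np.nan, 600.0, 1500.0, 2400.0],
})

# Option 1: drop rows that contain a missing value.
dropped = toy.dropna()

# Option 2: keep every row and fill the gap with the column median.
median_value = toy["TotalCharges"].median()
filled = toy.assign(TotalCharges=toy["TotalCharges"].fillna(median_value))

print(len(dropped), len(filled))
print(filled)
```

Dropping shrinks the dataset, while filling keeps every customer at the cost of an imputed value. With only 11 of 7043 rows affected here, either choice changes little.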

Perform the encoding operations.

Label Encode Binary data: Independent variables for machine learning algorithms can typically only have numerical values. Label encoding is used for all categorical variables with only two unique values.

# Non-numeric columns with exactly two unique values
binary_cols = [col for col in df.columns if df[col].dtype not in [int, float]
               and df[col].nunique() == 2]

def label_encoder(dataframe, binary_col):
    # Replace the two categories of a binary column with 0 and 1
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

for col in binary_cols:
    df = label_encoder(df, col)

def one_hot_encoder(dataframe, categorical_cols, drop_first=True):
    # One indicator column per category; drop_first removes one redundant column per variable
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

# Columns with more than 2 and at most 30 unique values
ohe_cols = [col for col in df.columns if 30 >= df[col].nunique() > 2]

df = one_hot_encoder(df, ohe_cols)
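
The two encodings are easiest to compare on a small example. This sketch uses an invented four-row frame with one binary column and one three-category column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented toy data.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# Label encoding: categories are sorted, so "No" becomes 0 and "Yes" becomes 1.
toy["Partner"] = LabelEncoder().fit_transform(toy["Partner"])

# One-hot encoding with drop_first=True: 3 categories become 2 indicator columns.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The first category in sorted order ("Month-to-month") is dropped and becomes the implicit baseline: a row with both indicator columns set to false belongs to it.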

Standardize the numeric variables.

# Make sure the ID column is not among the columns to be scaled
num_cols = [col for col in num_cols if col != "customerID"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

df[num_cols].head()
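
StandardScaler subtracts the column mean and divides by the column standard deviation. A minimal sketch on made-up tenure values, compared against the same formula written by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented tenure values, shaped as a single column.
tenure = np.array([[1.0], [12.0], [24.0], [48.0], [72.0]])

scaled = StandardScaler().fit_transform(tenure)

# Same transformation by hand; np.std uses the population formula (ddof=0) by default.
manual = (tenure - tenure.mean()) / tenure.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, so variables measured in months and in currency units end up on a comparable scale.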

Let’s build a model.

from sklearn.ensemble import RandomForestClassifier

y = df['Churn']
X = df.drop(["customerID", "Churn"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=17)

rf_model = RandomForestClassifier(random_state=46).fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)
0.7834123222748816
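
Accuracy is only one of the metrics imported at the top. Since about 73% of the customers in the value counts above are non-churners (5174 of 7043), it can help to look at precision, recall, F1, and ROC AUC as well. The sketch below shows how these could be computed. It runs on synthetic data from make_classification, not on the Telco dataset, so the numbers it prints say nothing about the model above; the class weights are an assumption chosen to mimic a similar imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (roughly 73% class 0, 27% class 1).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.73, 0.27], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=17, stratify=y_demo)

model = RandomForestClassifier(random_state=46).fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]   # predicted probability of class 1

scores = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
    "roc_auc": roc_auc_score(y_te, proba),
}

# Accuracy of always predicting the majority class, for comparison.
baseline = max(y_te.mean(), 1 - y_te.mean())

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

On the real data, the same calls would take y_test, y_pred, and rf_model.predict_proba(X_test)[:, 1]. Recall on the churn class is often the most relevant figure, because it measures how many of the actual churners the model catches.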

Conclusion

The telecom market currently faces severe competition. Customer churn prediction has become an important part of customer relationship management for retaining valuable customers. Analyses such as this one help to understand the key factors behind churn and how strongly each of them influences it.

Proper churn management can save a huge amount of money for the company. Thus the economic value of customer retention can be summarized as:

  • satisfied customers can bring new customers
  • long-term customers usually do not get influenced much by competitors
  • long-term customers tend to buy more
  • the company can focus on satisfying existing customers’ needs
  • lost customers share negative experiences and thus will have a negative influence on the image of the company

Thus, customer retention can be viewed as a function of factors such as price, service quality, customer satisfaction, and brand image, and improving these could lead to better customer loyalty.

In future studies, different methods will be tried to compare different algorithms and their model accuracy.

Thanks for reading this article. You can access the detailed codes of the project and other projects on my Github account or Kaggle account. Happy coding!

If you have any feedback, feel free to share it in the comments section or contact me if you need any further information.

References

  1. https://courses.miuul.com/p/feature-engineering
  2. Balle, B., Casas, B., Catarineu, A., Gavaldà, R., Manzano-Macho, D. (2013). The Architecture of a Churn Prediction System Based on Stream Mining. Frontiers in Artificial Intelligence and Applications
  3. Dujmovic, N., (2022). Machine Learning fast-track: Telco Customer Churn Prediction, https://www.linkedin.com/pulse/machine-learning-fast-track-telco-customer-churn-neven-dujmovic/?trk=articles_directory, Date of access: 30 November 2022
