26 minute read

Problem Exposition and Business Value

Marketing campaigns are a call to action for customers and can take many forms. For example, they can involve giving customers a discount, giving customers something for free, collecting points on a membership card, offering free delivery, offering additional credit and so on.

Regardless of the exact nature of any campaign, what is important is that different types of customers respond differently to different types of campaigns; customers in a specific segment may respond well to one campaign and poorly to another.

By finding out in advance what customer segments are more likely to respond well to a given marketing campaign, we can target those customers. This will encourage a higher conversion percentage and, ultimately, higher profitability.

Therefore, this analysis is aimed at finding:

  • different customer segments for this particular campaign
  • which segments respond well and which do not
  • conversion percentages and associated profitability for each customer segment, broken down by customer characteristics such as income, living arrangements, amount spent on meat products and more
  • clusters within this customer data so that subsequent marketing campaigns can target customers more effectively

In this case, our marketing campaign offers a discount on certain products to customers. The cost per contacted customer of such a campaign is $3, while a successful conversion results in per-customer revenue of $11, a net profit of $8 per converted customer.
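Before touching the data, a quick back-of-the-envelope sketch using the figures above shows the conversion rate the campaign needs just to break even:

# break-even conversion rate, assuming the $3 contact cost and
# $11 per-conversion revenue stated above
cost_per_contact = 3
revenue_per_conversion = 11

# expected profit per targeted customer is p * revenue - cost,
# which is zero when p = cost / revenue
break_even_rate = cost_per_contact / revenue_per_conversion
print('Break-even conversion rate: {:.1%}'.format(break_even_rate))  # ~27.3%

Any conversion rate below roughly 27% loses money, which is worth keeping in mind when we compute the campaign's actual rate in section 2.11.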

Data

Data source: This data is obtained from Kaggle and can be found in the Git repository

Observations: The raw data contains 2,240 observations, where each observation represents a customer who converted or did not convert; 2,216 remain after cleaning (section 1.3)

Features: There are 29 columns in total (including the customer ID and the outcome variable) with a mix of interval and categorical data types. A few examples are:

  • Complain - 1 if customer complained in the last 2 years

  • DtCustomer - date of customer’s enrolment with the company

  • Education - customer’s level of education

  • Kidhome - number of small children in customer’s household

  • MntMeatProducts - amount spent on meat products in the last 2 years

  • MntFruits - amount spent on fruits products in the last 2 years

  • NumStorePurchases - number of purchases made directly in stores

  • NumWebPurchases - number of purchases made through company’s web site

Outcome variable: The outcome variable is called Response; it is 1 if the customer accepted the offer in the last campaign and 0 otherwise

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle

# import functions
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler



# change settings for plots
plt.style.use('fivethirtyeight')

1) Load and Clean Data

data = pd.read_csv('/Users/alitaimurshabbir/Desktop/Personal/DataScience/CRM Analytics Datasets/marketing_campaign.csv', sep = ';')
data.head()
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.0 0 0 2012-09-04 58 635 ... 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 2014-03-08 38 11 ... 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 2013-08-21 26 426 ... 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 2014-02-10 26 11 ... 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.0 1 0 2014-01-19 94 173 ... 5 0 0 0 0 0 0 3 11 0

5 rows × 29 columns

1.1) Description of Columns

  • AcceptedCmp1 - 1 if customer accepted the offer in the 1st campaign, 0 otherwise

  • AcceptedCmp2 - 1 if customer accepted the offer in the 2nd campaign, 0 otherwise

  • AcceptedCmp3 - 1 if customer accepted the offer in the 3rd campaign, 0 otherwise

  • AcceptedCmp4 - 1 if customer accepted the offer in the 4th campaign, 0 otherwise

  • AcceptedCmp5 - 1 if customer accepted the offer in the 5th campaign, 0 otherwise

  • Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise

  • Complain - 1 if customer complained in the last 2 years

  • DtCustomer - date of customer’s enrolment with the company

  • Education - customer’s level of education

  • Marital - customer’s marital status

  • Kidhome - number of small children in customer’s household

  • Teenhome - number of teenagers in customer’s household

  • Income - customer’s yearly household income

  • MntFishProducts - amount spent on fish products in the last 2 years

  • MntMeatProducts - amount spent on meat products in the last 2 years

  • MntFruits - amount spent on fruits products in the last 2 years

  • MntSweetProducts - amount spent on sweet products in the last 2 years

  • MntWines - amount spent on wine products in the last 2 years

  • MntGoldProds - amount spent on gold products in the last 2 years

  • NumDealsPurchases - number of purchases made with discount

  • NumCatalogPurchases - number of purchases made using catalogue

  • NumStorePurchases - number of purchases made directly in stores

  • NumWebPurchases - number of purchases made through company’s web site

  • NumWebVisitsMonth - number of visits to company’s web site in the last month

  • Recency - number of days since the last purchase

1.2) Check for Duplicates

sum(data['ID'].duplicated())
0

Using the unique identifier “ID”, we find no duplicates

1.3) Find and deal with missing data

missing_percent = (data.isnull().sum().sum()/np.product(data.shape))*100
print('The proportion of missing data on a cell-by-cell basis is {}%'.format(missing_percent))
The proportion of missing data on a cell-by-cell basis is 0.03694581280788178%

The amount of missing data is minuscule. We can isolate which column contains it and then drop those rows. This will not impact the quality of the dataset since, as stated, very few rows are affected

data.isnull().sum()
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64
# drop rows where "Income" is missing data

data = data[data['Income'].notna()]

1.4) Check for extreme values that could be erroneous

data.describe()
ID Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
count 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 ... 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.0 2216.0 2216.000000
mean 5588.353339 1968.820397 52247.251354 0.441787 0.505415 49.012635 305.091606 26.356047 166.995939 37.637635 ... 5.319043 0.073556 0.074007 0.073105 0.064079 0.013538 0.009477 3.0 11.0 0.150271
std 3249.376275 11.985554 25173.076661 0.536896 0.544181 28.948352 337.327920 39.793917 224.283273 54.752082 ... 2.425359 0.261106 0.261842 0.260367 0.244950 0.115588 0.096907 0.0 0.0 0.357417
min 0.000000 1893.000000 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.0 11.0 0.000000
25% 2814.750000 1959.000000 35303.000000 0.000000 0.000000 24.000000 24.000000 2.000000 16.000000 3.000000 ... 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.0 11.0 0.000000
50% 5458.500000 1970.000000 51381.500000 0.000000 0.000000 49.000000 174.500000 8.000000 68.000000 12.000000 ... 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.0 11.0 0.000000
75% 8421.750000 1977.000000 68522.000000 1.000000 1.000000 74.000000 505.000000 33.000000 232.250000 50.000000 ... 7.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.0 11.0 0.000000
max 11191.000000 1996.000000 666666.000000 2.000000 2.000000 99.000000 1493.000000 199.000000 1725.000000 259.000000 ... 20.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3.0 11.0 1.000000

8 rows × 26 columns

There do not seem to be any obviously erroneous values. However, we see that the earliest year of birth in the data is 1893. This dataset is from 2014, making this particular customer 121 years old at the time of collection. Is this a valid record? It’s possible but highly unlikely.

I will keep this record for now and, just for curiosity’s sake, check how many customers are older than 90 years of age

# to be 90 years of age or more, a customer needs to be born in or before 1924 (2014-1924 = 90)

data[data['Year_Birth'] <= 1924]
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
192 7829 1900 2n Cycle Divorced 36640.0 1 0 2013-09-26 99 15 ... 5 0 0 0 0 0 1 3 11 0
239 11004 1893 2n Cycle Single 60182.0 0 1 2014-05-17 23 8 ... 4 0 0 0 0 0 0 3 11 0
339 1150 1899 PhD Together 83532.0 0 0 2013-09-26 36 755 ... 1 0 0 1 0 0 0 3 11 0

3 rows × 29 columns

2) Exploratory Data Analysis

2.1) Overall Conversion Rates (%), Cost Per Acquisition (CPA) and Profitability

We have a lot of data in terms of different variables. We even have conversion data from the last five marketing campaigns (our most recent campaign is not included in this count; it is our target variable).

Segmenting and dissecting the data in every possible way will lead to an extremely long piece of analysis. Not all of the analysis will be valuable either. After all, we are only interested in insights that have the maximum business impact.

Therefore, to frame the analysis, I am going to restrict the scope to the most recent marketing campaign. In addition, I will focus on the variables I judge to be the most interesting and promising.

2.11) Conversion Rate (%)

# find number of non-responders vs responders, then create donut plot

response_counts = data.Response.value_counts()

no_response = str(round(response_counts[0]/sum(response_counts)*100)) + '%'
yes_response = str(round(response_counts[1]/sum(response_counts)*100)) + '%'

# donut plot

explode = (0, 0.2)
plt.figure(figsize = (8, 8))
plt.pie(response_counts,
        labels = ['Did Not Convert' + ' - ' + no_response,
                  'Converted' + ' - ' + yes_response],
        explode = explode,
        colors = ['indianred', 'teal'])

circle = plt.Circle((0,0), 0.7, color='white')

p = plt.gcf()
p.gca().add_artist(circle)

plt.suptitle('Marketing Campaign has a 15% Conversion Rate', size = 18)
plt.title('Proportions of Responders and Non-Responders to Total Customers',
           size = 15)

2.12) Cost Per Acquisition (CPA) and Profitability

The data shows us that the cost to contact a single customer is fixed at $3 (Z_CostContact) and that each successful conversion brings in $11 (Z_Revenue), a net $8 per converted customer. We can use this information to calculate how profitable our marketing campaign was

# find profit or loss for each customer targeted by campaign

conditions = [data['Response'] == 1, data['Response'] == 0]
outcomes = [8, -3]

data['Profit'] = np.select(conditions, outcomes)

For now we can find a simple total for how profitable our campaign was. Later we will segment profitability by customer characteristics to figure out which of our customers were the most and least valuable

print('The total Profit (+) or Loss (-) of our marketing campaign was {} USD'.format(sum(data['Profit'])))
The total Profit (+) or Loss (-) of our marketing campaign was -2985 USD

Clearly, this marketing campaign was a failure overall, as the business lost $2,985.

Now one of our main aims is to figure out which customers, grouped by characteristics, caused us to lose money (so we can reconsider marketing to them in the future) and which customers were profitable (so we can target them more narrowly next time)

2.2) Income

Here I want to find out how conversion and profitability vary with customer Income.

To be able to do this, I will examine the distribution of the Income variable and create appropriate bins, then plot these against the customer responses

# Let's create a density plot for a smooth graph

plt.figure(figsize = (9, 6))
data['Income'].plot(kind = 'kde', color = 'teal')
plt.xlim([0, 100000])
plt.suptitle('"Income" is Near Normal and Clearly Bimodal - Peaks at \$38,000 & \$65,000', size = 15)
plt.title('Distribution of Income Across Customers', size = 13)
plt.xlabel('Income Level ($)', size = 12)
plt.ylabel('Density', size = 12)

This confirms that the Income distribution is bimodal, with peaks around $38,000 and $65,000

2.21) Number and % of Customer Conversions by Income Group

# Create deciles - 10 bins with "Income" variable equally divided

data['Income Deciles'] = pd.qcut(data.Income, q = 10, precision = 0)

The following graph shows the raw number of customer conversions segmented by income deciles (10 bins of customers by income).

We can clearly see that:

  • The greatest number of conversions is concentrated in the two highest income deciles

  • Those same two deciles also contain the fewest non-converting customers

data.groupby('Income Deciles')['Response'].value_counts().plot(kind = 'barh',
                                                               color = ['indianred','teal'],
                                                               figsize = (10, 8))

plt.suptitle('The Highest Income Earners Have The Greatest Number of Conversions', size = 16)
plt.title('Raw Number of Conversions and Non-Conversions vs Income Deciles', size = 14)
plt.ylabel('Income Deciles ($) and Customer Response (1 = Yes | 0 = No)', size = 13)
plt.xlabel('Number of Customers', size = 13)

Above we calculated the raw number of conversions, but the rate of conversion is a much better measure, which we show below.

By this measure, the highest income groups remain the customer segment with the highest conversion rate

income_vs_conv = pd.crosstab(data['Income Deciles'], data['Response'])

income_vs_conv['Conversion Rate (%)'] = \
              round((income_vs_conv[1]/(income_vs_conv[0]
              + income_vs_conv[1]))*100, 2)

income_vs_conv.sort_values(by = 'Conversion Rate (%)', ascending = False)
Response 0 1 Conversion Rate (%)
Income Deciles
(79844.0, 666666.0] 135 87 39.19
(71819.0, 79844.0] 174 47 21.27
(32011.0, 38198.0] 190 31 14.03
(38198.0, 44529.0] 195 27 12.16
(58482.0, 65248.0] 195 26 11.76
(1729.0, 24118.0] 197 25 11.26
(44529.0, 51382.0] 197 24 10.86
(51382.0, 58482.0] 198 24 10.81
(65248.0, 71819.0] 200 22 9.91
(24118.0, 32011.0] 202 20 9.01

2.22) Marketing Campaign Profitability by Income Deciles

Now that we have looked at raw conversions and conversion rates, let’s assess the profitability of each customer income segment

# assign 'indianred' colour to loss-making segments
# and 'teal' to profitable segments

LC = 'indianred'
PC = 'teal'

data.groupby('Income Deciles')['Profit'].sum().plot(kind = 'barh',
                                                    color = [LC, LC,
                                                             LC, LC,
                                                             LC, LC,
                                                             LC, LC,
                                                             LC, PC],
                                                    figsize = (10, 6))


plt.suptitle("""Customers From Only The Highest Income Group Were Associated with Profitability""",
              size = 14)

plt.title('Profitability (\\$) vs Income Deciles (\\$)', size = 13)

plt.ylabel('Income Deciles ($)', size = 13)
plt.xlabel('Profit/Loss ($)', size = 13)

When segmented by Income level, the marketing campaign resulted in losses for every customer group except the highest-earning decile.

This latter group was responsible for \$291 in profit. Conversely, the greatest loss, approximately \$450, was incurred by the second-lowest income decile

Takeaway

This analysis seems to imply that the marketing campaign appeals the most to high income groups. They have the greatest conversion and, as a result, were the only profitable income group in the context of the marketing campaign.

If we run a future marketing campaign that is similar to this one, it could be a good idea to target high-income customers only, specifically those earning above $79,844 per year

2.3) Age and Education

Next I want to figure out the conversion rates and market campaign profitability when customers are segmented by Age and Education

# Create "Age" variable and plot histogram

data['Age'] = 2020 - data['Year_Birth']
data['Age'].hist(bins = 20, color = 'coral',
                 alpha = 0.5, figsize = (8, 6))
plt.xlim([20, 80])

plt.suptitle('"Age" is near-Normal, has a Mean of 51 and a Mode of 44', size = 16)
plt.title('Distribution of Age of Customers', size = 14)
plt.xlabel('Age Groups', size = 13)
plt.ylabel('Frequency', size = 13)

2.31) Conversion & Profitability by Age Group

# create Age bins

# Youngest customer is 24 years old

data['Age Bins'] = pd.cut(data['Age'],
                          bins = [24, 30, 40,
                                  50, 60, 70,
                                  80, 120])
# plot conversion rates against age groups

age_conv_rate = pd.crosstab(data['Age Bins'], data['Response'])

age_conv_rate['Conversion %'] = \
              round((age_conv_rate[1]/(age_conv_rate[0]+age_conv_rate[1]))*100, 2)

age_conv_rate.reset_index(inplace = True)
age_conv_rate['Age Bins'] = age_conv_rate['Age Bins'].astype(str)

plt.figure(figsize = (8, 6))
plt.bar(age_conv_rate['Age Bins'],
        age_conv_rate['Conversion %'],
        color = 'Coral',
        alpha = 0.8)

plt.xlabel('Age Groups', size = 13)
plt.ylabel('Conversion (%)', size = 13)
plt.suptitle('The Youngest and Oldest Customers Have the Greatest Conversion Rates', size = 15)

plt.title('Conversion Percentage by Age Group', size = 13)

Customers with ages in the intervals of 24-30 and 70-80 have the greatest conversion rates. For every 50 customers targeted with the marketing campaign in these age groups, more than 10 customers convert.

Conversely, middle-aged customers (those between 40 and 70 years of age) have the worst conversion rates (between 12% and 14%)

# sum the customer counts in the youngest (24-30] and oldest (70-80] bins

bin_counts = data.groupby('Age Bins')['ID'].count()
youngest_and_oldest_customers = bin_counts.iloc[0] + bin_counts.iloc[5]

print('The total number of customers targeted belonging '
      'to the oldest and youngest age groups is {}'.format(youngest_and_oldest_customers))
The total number of customers targeted belonging to the oldest and youngest age groups is 165

Out of the total number of 2216 customers targeted in this ad campaign, 165 or about 7.5% belong to the age groups with the greatest conversion rates.

Now, we need to be careful with the conclusions we draw from this analysis. Larger sample sizes generally lead to more valid and accurate estimates. It is quite possible that, if we targeted a larger sample of only the oldest and youngest customers, the conversion percentages would drop from their current levels of > 20%, as the larger sample would be more indicative of reality.

Another way of saying this is that the youngest and oldest age groups might have such high conversion rates simply because there are very few of them compared to the whole customer base, and the small sample size might not reflect the ground truth. The small-sample sketch below illustrates how wide the uncertainty is for groups of this size.
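To make this concrete, here is a minimal sketch of a 95% confidence interval around a conversion rate estimated from a small group. The counts are hypothetical stand-ins, not figures from this dataset:

# hypothetical small segment: 7 conversions out of 30 targeted customers (~23%)
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count = 7, nobs = 30,
                               alpha = 0.05, method = 'wilson')
print('95% CI: {:.1%} - {:.1%}'.format(low, high))  # roughly 12% - 41%

An interval this wide means the true conversion rate of a small segment could plausibly sit far below the observed one.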


Now let’s look at profitability

data.groupby('Age Bins')['Profit'].sum().plot(color = 'coral',
                                              figsize = (10, 6))
plt.ylabel('Loss ($)', size = 12)
plt.xlabel('Age Groups', size = 12)
plt.suptitle('Middle Aged Customers Are Associated with the Greatest Monetary Loss',
             size = 15)
plt.title('Marketing Campaign Profitability by Age Group', size = 13)

Takeaway

While we expected the groups with the worst conversion rates (middle-aged customers) not to be profitable, it is surprising to find that not a single group of customers, when segmented by Age, proved to be profitable in our most recent marketing campaign.

Customers between 40 and 50 years have an associated loss of more than $1000. Interestingly, it is this same age group that has the greatest representation by raw numbers in the overall number of customers

What does this tell us in combination with the graph showing conversions? It tells us that even in those Age Groups where the conversion rate is the highest, it is still not high enough for the marketing campaign to be profitable

Therefore, targeting customers by age does not seem like a good idea for future marketing campaigns

2.32) Education

# find sum of profit per value of "Education"

education_vs_profit = data.groupby('Education')['Profit'].sum()

# create lollipop chart

plt.figure(figsize = (10, 6))

plt.hlines(y = education_vs_profit.index,
           xmin = 0 ,
           xmax = education_vs_profit,
           color='indianred')

plt.plot(education_vs_profit, education_vs_profit.index,
         "o", color = 'black')

plt.xticks(size = 10)
plt.yticks(size = 10)
plt.suptitle('Customers of All Levels of Education are Associated with Losses',
             size = 14)

plt.title('Profitability by Education Level', size = 12)
plt.ylabel('Highest Education Achieved', size = 12)
plt.xlabel('Loss ($)', size = 12)

Takeaway

Much as with the Age Groups, the “middle” Education level, “Graduation”, was associated with the greatest loss incurred, reaching nearly -$1,750. Moreover, every customer segment, when grouped by Education, is associated with losses.

These findings suggest that future marketing campaigns should not target or segment customers on the basis of this characteristic.

2.4) Living Arrangements: Individual vs With Family

Let’s consider:

  • “Individuals” to be customers who are Single and have no children or teenagers living with them

  • “With Family” to be customers who have a child or a teenager living in the same house (implemented below as exactly one kid or one teen at home, regardless of marital status)

# create two dataframes to capture one group each

individuals = data.loc[(data['Marital_Status'] == 'Single') &
                       (data['Kidhome'] == 0) &
                       (data['Teenhome'] == 0)]

families = data.loc[(data['Kidhome'] == 1) |
                    (data['Teenhome'] == 1)]

# create a Series with the total Profit for each group

ind_vs_family_profit = pd.Series([individuals['Profit'].sum(),
                                  families['Profit'].sum()],
                                  index = ['Individual', 'With Family'])

# plot

ind_vs_family_profit.plot(kind = 'barh', color = ['#004643', '#6A2E35'],
                          figsize = (11, 6))
plt.suptitle('Customers Who Live Alone Are Profitable While Those Who Live with Families Incur Heavy Losses',
              size = 16)
plt.title('Total Profits ($) Per Customer by Living Arrangements',
          size = 14)
plt.xlabel('Profit/Loss ($)', size = 11)
plt.ylabel('Living Arrangement', size = 11)
plt.xticks(size = 10)
plt.yticks(size = 10)

# plot conversion percentage

ind_vs_family_conv = pd.Series([round((individuals.Response.sum()/len(individuals))*100, 2),
                                round((families.Response.sum()/len(families))*100, 2)],
                                index = ['Individuals', 'With Family'])

ind_vs_family_conv.plot(kind = 'bar', color = ['#004643', '#6A2E35'],
                        figsize = (9, 7))
plt.suptitle('Conversion Rate of "Individuals" is Nearly Quadruple the Rate of "With Family"',
             size = 15)
plt.title('Conversion Rates (%) by Living Arrangements',
          size = 13)
plt.xlabel('Living Arrangement', size = 11)
plt.ylabel('Conversion Rate (%)', size = 11)
plt.xticks(size = 10)
plt.yticks(size = 10)

With nearly 4 times the conversion rate of the “With Family” group, customers who live alone (“Individuals”) are unsurprisingly associated with profitability (albeit a small profit), while the “With Family” group is associated with a loss of roughly $2,700


Takeaway

So far we see that when we market to customers with the following characteristics, the marketing campaign is effective (as it is profitable):

  • High-income customers, specifically those earning more than $79,000 annually

  • Customers who live alone

This seems to imply that the marketing campaign is effective with wealthy bachelors and bachelorettes

2.5) Date of First Purchase (with some caveats)

I wonder if there is any relationship between the time elapsed since a customer's first purchase and the response rate. This isn't the best measure to use; a better one would be built on, for example, the time elapsed since the last purchase rather than the first. But we do not have much other useful time-related data, so we will make do with what we have.

It is difficult to come up with a hypothesis before this analysis. For example, if we had data on every purchase a customer has made within a given timeframe, we could hypothesize that customers who made purchases with greater frequency in the last X number of days would be significantly more likely to convert due to the marketing campaign than customers who bought less frequently.

But since we do not have that data, let’s just explore the conversion rates segmented by time elapsed since first purchase in months

The last date in the data is 2014-06-29, so I will consider 2014-06-30 to be “today”
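As a quick sanity check (a one-line sketch), the anchor date can be read straight from the data:

# confirm the most recent enrolment date in the dataset
print(pd.to_datetime(data['Dt_Customer']).max())  # 2014-06-29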

# convert date of joining column to datetime format

data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'])

# Find the number of days elapsed since first purchase

data['Days Elapsed'] = datetime(2014, 6, 30) - data['Dt_Customer']
data['Days Elapsed'] = data['Days Elapsed'].dt.days
# create bins

myBins = np.linspace(min(data['Days Elapsed']),
                     max(data['Days Elapsed']),
                     num = 23, endpoint = True)

# create labels

labels = []

for i in range(22):
    time_elapsed = str(i) + ' Months Ago'
    labels.append(time_elapsed)

# add bins to dataframe

data['Time Elapsed Bins'] = pd.cut(data['Days Elapsed'],
                                   myBins, labels = labels,
                                   include_lowest = True)
# create crosstab of labelled bins and Response

time_elapsed_conv_rate = pd.crosstab(data['Time Elapsed Bins'], data['Response'])
time_elapsed_conv_rate

time_elapsed_conv_rate['Conversion %'] = round((time_elapsed_conv_rate[1]/
                                               (time_elapsed_conv_rate[0] +
                                                time_elapsed_conv_rate[1]))*100, 2)
time_elapsed_conv_rate.reset_index(inplace = True)
# plot crosstab

plt.figure(figsize = (12, 6))
plt.barh(time_elapsed_conv_rate['Time Elapsed Bins'],
        time_elapsed_conv_rate['Conversion %'],
        color = '#6D98BA',
        alpha = 0.8)

plt.suptitle('Our "Oldest" Customers Are Most Receptive to this Marketing Campaign', size = 15)
plt.title('Conversion Rates Segmented by Length of Time Elapsed Since First Purchase by Customer', size = 13)
plt.ylabel('Time Elapsed Since Customer\'s First Purchase', size = 12)
plt.xlabel('Conversion (%)', size = 12)
plt.yticks(size = 10)

Takeaway

There seems to be some phenomenon whereby customers who were acquired a long time ago (alternatively, our oldest customers) have a much greater rate of conversion than customers who were acquired recently.

  • Perhaps our oldest customers know more about the product/service we’re offering and so they are more liable to buy

  • Perhaps brand loyalty has come into play; older customers who have remained active with our company are likely to be more loyal than newly acquired customers, otherwise they would not have converted. Similarly, not enough time may have elapsed for newly acquired customers to feel a sense of loyalty to our company

  • Perhaps the business has changed in other ways since our oldest customers joined: our branding, customer service, product and more. This effect is difficult to isolate, but it could be the basis for a future experiment

Regardless, the data suggests that it is a good idea to target our oldest customers with marketing campaigns similar to this one

3) Clustering Analysis

data.columns
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response',
       'Profit', 'Income Deciles', 'Age', 'Age Bins', 'Days Elapsed',
       'Time Elapsed Bins'],
      dtype='object')
data_clustering = data.drop(['ID', 'Year_Birth', 'Dt_Customer',
                             'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
                             'AcceptedCmp4', 'AcceptedCmp5', 'Z_CostContact',
                             'Z_Revenue', 'Income Deciles', 'Age Bins',
                             'Time Elapsed Bins', 'Profit'], axis = 1)
# create dummies for "marital_status" variable, then concatenate to dataframe

marital_status_dummies = pd.get_dummies(data['Marital_Status'])
data_clustering = pd.concat([data_clustering, marital_status_dummies], axis = 1)

# and drop original "marital_status" variable

data_clustering.drop('Marital_Status', axis = 1, inplace = True)

data_clustering.drop(['Absurd', 'YOLO'], axis = 1, inplace = True) # drop irrelevant dummies
# do the same as above for "Education" variable

education_dummies = pd.get_dummies(data['Education'])
data_clustering = pd.concat([data_clustering, education_dummies], axis = 1)
data_clustering.drop('Education', axis = 1, inplace = True)
# place outcome variable, Response, at the end of the dataframe

response_variable = data_clustering['Response']
data_clustering.drop('Response', axis = 1, inplace = True)
data_clustering['Response'] = response_variable

3.2) Correlation heatmap

corr = data_clustering.corr()

plt.figure(figsize = (10, 10))

sns.heatmap(corr,
            vmin = -1, vmax = 1, center = 0,
            cmap = sns.diverging_palette(20, 220, n = 10),
            square = True, linewidths = 5)

We are only interested in the last column. We see that even the features with the greatest correlation with the outcome (Response) variable have only a weak-to-moderate magnitude. But these are the best features we have, so we will have to make do.

Overall, I am selecting only those features whose correlation coefficient with Response has an absolute magnitude greater than 0.12.

This is a fairly arbitrary cut-off that I have chosen by inspecting the correlation coefficients of the whole dataset. My reasoning is that according to this criterion, we choose about 8 features and drop the rest.

If this criterion were harsher, we would have too few features to use. If it were more lax, we would be including features that have even worse correlation or no correlation at all with the outcome variable “Response”.

data_clustering.corr()['Response']
Income                 0.133047
Kidhome               -0.077909
Teenhome              -0.153901
Recency               -0.199766
MntWines               0.246299
MntFruits              0.122443
MntMeatProducts        0.237746
MntFishProducts        0.108145
MntSweetProducts       0.116170
MntGoldProds           0.140332
NumDealsPurchases      0.003451
NumWebPurchases        0.151431
NumCatalogPurchases    0.219914
NumStorePurchases      0.036241
NumWebVisitsMonth     -0.002209
Complain              -0.002029
Age                   -0.023692
Days Elapsed           0.196479
Alone                  0.018862
Divorced               0.054189
Married               -0.079821
Single                 0.108726
Together              -0.075293
Widow                  0.045656
2n Cycle              -0.035496
Basic                 -0.050080
Graduation            -0.039660
Master                 0.003919
PhD                    0.087979
Response               1.000000
Name: Response, dtype: float64
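The criterion can also be applied programmatically; a minimal sketch:

# select features whose absolute correlation with Response exceeds 0.12
corr_with_response = data_clustering.corr()['Response'].drop('Response')
candidates = corr_with_response[corr_with_response.abs() > 0.12]
print(candidates.sort_values(ascending = False))

Note that this cut-off would also flag Recency and MntFruits; the hand-picked list below leaves them out in favour of the eight features I judged most useful.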
relevant_data = data_clustering.loc[:, ['Income', 'Teenhome', 'MntWines',
                                        'MntMeatProducts','MntGoldProds',
                                        'NumCatalogPurchases', 'NumWebPurchases',
                                        'Days Elapsed', 'Response']]

3.3) Scale data

We need to scale our data before clustering. I use MinMaxScaler instead of StandardScaler because the distributions of the selected features are not normal

# create a copy of relevant dataframe

data_clustering_copy = relevant_data.copy()

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data_clustering_copy)

# convert scaled data back into dataframe

data_clustering_copy_scaled = pd.DataFrame(data_scaled)

# re-assign names

column_names = list(data_clustering_copy.columns)

data_clustering_copy_scaled.columns = column_names

3.4) Fit K-Means Clustering Algorithm

3.41) Find optimal number of clusters (elbow method)

# find sum of squared error per number of clusters

sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(data_clustering_copy_scaled)
    sse.append(kmeans.inertia_)

plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()


By eye, it seems 4 clusters is the optimal choice
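As a cross-check on the elbow plot, we could also compare silhouette scores for a few candidate values of k (a quick sketch; higher scores indicate better-separated clusters):

# silhouette scores for candidate numbers of clusters
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters = k).fit_predict(data_clustering_copy_scaled)
    print(k, round(silhouette_score(data_clustering_copy_scaled, labels), 3))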

# fit K-Means with 4 clusters

km = KMeans(n_clusters = 4)
km.fit(data_clustering_copy_scaled)

# assign each row's cluster label back to the
# original dataframe with non-scaled values

data_clustering_copy['clusters'] = km.labels_

Now we wrangle the data in a way that lets us segment the “Response” outcome and other variables by the cluster each datapoint is assigned to

data_clustering_copy["cluster_0"] = data_clustering_copy.clusters == 0
data_clustering_copy["cluster_1"] = data_clustering_copy.clusters == 1
data_clustering_copy["cluster_2"] = data_clustering_copy.clusters == 2
data_clustering_copy["cluster_3"] = data_clustering_copy.clusters == 3
# find how many customers converted in each cluster

data_clustering_copy.groupby("clusters")['Response'].sum()
clusters
0      0
1      0
2    333
3      0
Name: Response, dtype: int64

All 333 customers that converted in the whole dataset belong to cluster 2. This is really interesting; it implies that the traits of customers in this cluster are strongly associated with a higher likelihood of conversion.

In fact, the separation is perfect: every customer who converted is in cluster 2 and, as the cluster means below show (a mean Response of 1.0), every customer in cluster 2 converted. One caveat worth flagging is that Response itself was among the scaled features fed to K-Means, so part of this clean separation is by construction.
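A compact per-cluster summary makes this explicit (a small sketch using the labelled dataframe from above):

# size, number of conversions and conversion rate for each cluster
cluster_summary = data_clustering_copy.groupby('clusters')['Response'] \
                                      .agg(['size', 'sum', 'mean'])
cluster_summary.columns = ['Customers', 'Conversions', 'Conversion Rate']
print(cluster_summary)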

The next step is to find out what these characteristics are because, based on them, we will be able to better isolate relevant customers for our next marketing campaign and make it more targeted. As a result, the next marketing campaign should see much greater conversion and profitability overall

data_clustering_copy.groupby('clusters').mean()
Income Teenhome MntWines MntMeatProducts MntGoldProds NumCatalogPurchases NumWebPurchases Days Elapsed Response cluster_0 cluster_1 cluster_2 cluster_3
clusters
0 50596.891304 1.060201 186.809365 70.006689 26.138796 1.657191 3.568562 222.513378 0.0 1.0 0.0 0.0 0.0
1 56785.689531 0.693141 465.129964 215.568592 66.324910 3.519856 5.485560 556.075812 0.0 0.0 1.0 0.0 0.0
2 60209.675676 0.306306 502.615616 293.765766 61.252252 4.201201 5.072072 449.081081 1.0 0.0 0.0 1.0 0.0
3 46530.607387 0.000000 190.585499 151.778386 33.727770 2.160055 2.997264 266.683995 0.0 0.0 0.0 0.0 1.0

Takeaway

Customers in cluster 2:

  • Have the greatest average income at $60,209

  • Have about 3 teens per 10 homes on average. So while these customers are not completely teen-less, they tend to live with fewer teens than customers in clusters 0 and 1 [cluster 0 averages more than 1 teen per home and cluster 1 about 7 teens per 10 homes], though cluster 3 households have no teens at all

  • Spend the most on wine, at $502 over the last two years

  • Spend the most on meat products, at $294 over the last two years

  • Spend the second most on gold products, at \$61 over the last two years, with customers in Cluster 1 spending the most (\$66)

  • Make the most purchases through catalogues, at 4.2. The time window for the purchase counts is not listed in the meta information of the dataset, but in the absence of information pointing otherwise it is reasonable to assume it is monthly

  • Make the second most purchases through the Web, at 5.1, just behind Cluster 1 (5.5). Again, this assumes a monthly figure

  • Are among our longest-standing customers (in terms of how long they have been with the company, not their Age), with an average of 449 days since their first purchase. However, they are not the oldest group; that distinction belongs to cluster 1 which, it should be reiterated, did not have a single customer convert
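To surface these distinguishing traits more systematically, one option is to compare each cluster's mean against the overall mean for every feature (a minimal sketch; values above 1 mean above average for that cluster):

# each cluster's mean relative to the overall mean, per feature
feature_cols = ['Income', 'Teenhome', 'MntWines', 'MntMeatProducts',
                'MntGoldProds', 'NumCatalogPurchases', 'NumWebPurchases',
                'Days Elapsed']

cluster_means = data_clustering_copy.groupby('clusters')[feature_cols].mean()
relative_profile = cluster_means / data_clustering_copy[feature_cols].mean()
print(relative_profile.round(2))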