RFM Customer Segmentation

Customer Segmentation of AdventureWorks Cycle customer

Click here for Github Repository that contains all the files in this project.
Click here for Executive Summary Presentation File for this project.

Introduction

RFM analysis is a customer segmentation technique that is used to identify groups of customers based on their purchasing behavior. RFM stands for Recency, Frequency, and Monetary Value. These three metrics are used to evaluate customer behavior and to classify customers into different groups or segments, based on their buying habits and customer value.

A brief explanation of each of the RFM components:
1. Recency: How long since the customer last purchase is? (in days)
2. Frequency: How many times the customer made purchase?
3. Monetary: How much the customer spend in total?

In this project I will segment the customer of AdventureWorks Cycle with RFM analysis.

The Data

The data used in this project came from AdventureWorks Database :

  1. AdventureWorks Database - AdventureWorksDB Installation Tutorial
  2. FactInternetSales Dataset (The data that used here, extracted from the database)- GDrive Dataset Download

Software Used

The Software I used is Jupyter Lab, a notebook to write python code and I used multiple libraries as below:

Library
Usage
Pandas
  1. Import/Exporting Data
  2. Manipulating Data
  3. Other Data Analysis Task
Numpy Working with Numerical Value
Matplotlib and Seaborn Creating Visualization

Objective

Complete RFM customer segmentation Analysis of AdventureWorks Cycle and Export the Dataset for future analysis.

Chapter 1 - Preparing the Data

Cleaning FactInternetSales Dataset

Data cleaning is important step in Data Analysis, it deals with correcting errors, inconsistencies, and inaccuracies of data , clean data = clean result.

Import the Library

import pandas as pd
pd.set_option("display.max_column", 100)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Datetime module
from datetime import datetime, date, timedelta

Import Dataset

df = pd.read_csv("FactInternetSales.csv")
df.head()

Get the list of columns

df.columns

Drop unnecessary columns

df = df[['CustomerKey','SalesOrderNumber','SalesAmount', 'OrderDate']].copy()
df.head()

Check the data info

df.info()

Change OrderDate column dtype to datetime

df['OrderDate'] = pd.to_datetime(df['OrderDate'], format="%Y/%m/%d")
df.info()

Checking if the OrderDate column data match the format by returning the min and max of day/month/year

# Check the day column
day_date = df['OrderDate'].dt.day
print("day_min: %i" %day_date.min())
print("day_max: %i \n" %day_date.max())

# Check the month column
month_date = df['OrderDate'].dt.month
print("month_min: %i" %month_date.min())
print("month_max: %i \n" %month_date.max())

# Check the month column
year_date = df['OrderDate'].dt.year
print("year_min: %i" %year_date.min())
print("year_max: %i \n" %year_date.max())

Change column name

# the column name is already pretty good, but I prefer use lowercase and use underscore instead of space, 
# for me, it's easier to read

df.rename(columns={'CustomerKey':'customer_key','SalesOrderNumber':'sales_order_number',
			'SalesAmount':'sales_amount', 'OrderDate':'order_date'}, inplace = True)
df.head()

After cleaning the Data we will have a DataFrame that contains the column of:

  1. customer_key : Customer Key unique ID
  2. sales_order_number : The order ID, used for calculate Frequency
  3. sales_amount : The amount of transaction, used for calculate Monetary
  4. order_date : The date of transaction, used for calculate Recency

Chapter 2 - Creating RFM Table

Groupby

We need to group the rfm by the customer, sales order number, order date to remove duplicates and sum the sales amount

df = df.groupby(['customer_key','sales_order_number','order_date']).sales_amount.sum()
df.head()

Reset Index

I reset the index to get the DataFrame.

df = df.reset_index()
df.head(10)

Create Recency Table

To get the recency, calculate the latest day in dataset, and add 1 day to it and then subtitute it with the latest day of each customer.
I decided to add 1 day to prevent ambigous analysis/calculation in further step
example = df.recency.min() = 0 will be 1

daydiff = (df.order_date.max() + timedelta(days=1)) - df.groupby('customer_key').order_date.max()
recency = daydiff.dt.days
recency = recency.reset_index()
recency = pd.DataFrame(recency)

recency = recency.rename(columns={'order_date':'recency'})
recency.head()

Create Frequency Table

To get the frequency of each customer, group by customer_key column and count the sales_order_number.

freq = df.groupby('customer_key').sales_order_number.count().reset_index()
freq = freq.rename(columns = {'sales_order_number':'frequency'})
freq.head()

Create Monetary Table

To get the monetary, group by customer_key column and sum the sales_amount.

monetary = df.groupby('customer_key').sales_amount.sum().reset_index()
monetary = monetary.rename(columns = {'sales_amount':'monetary'})
monetary.head()

Merge R, F, and M table

Merge the R, F, and M table to get the full rfm table/dataframe

rfm = pd.merge(recency, freq, on = 'customer_key',how = 'inner')
rfm = pd.merge(rfm, monetary, on = 'customer_key',how = 'inner')
rfm.head()

After this step, we will have the rfm dataframe that we can use for segmentation.

Chapter 3 - Calculate RFM Score

Calculate RFM Score

The R and M score will be calculate with quantiles of 4 using pd.qcut so it will be divided into 4 score, lower R will get 4 while higher M will get 4. The Frequency of 1 = 1 score, 2 = 2 score, 3 = 3 score, and > 4 = 4 score.

# R score
rfm["r"] = pd.qcut(rfm["recency"],4,labels=[4,3,2,1])

# F score
f_conditions = [(rfm['frequency'] == 1), (rfm['frequency'] == 2), (rfm['frequency'] == 3), (rfm['frequency'] >= 4)]
f_results = [1,2,3,4]
rfm['f'] = np.select(f_conditions, f_results)

# M score
rfm["m"] = pd.qcut(rfm["monetary"],4,labels=[1,2,3,4])

# RFM Combined Score
rfm["rfm_score"] = rfm["r"].astype(str) +rfm["f"].astype(str) + rfm["m"].astype(str)

rfm.head()

RFM Segmentation

The number of segment is usually determine by the needs or the strategy of the company, I divided into 7 segments and each segment have the score as image below.

rfm_seg = {
	r'444': "Champions",
	r'[2-4][3-4][2-4]': "Loyal Customers",
	r'[2-4][2-4]4': "Big Spender",
	r'4[1-3][1-4]': "Recent Customers",
	r'[2-4][1-3][1-4]': "Needs Attention",
	r'1[2-4][1-4]': "Almost Lost",
	r'11[1-4]': "Lost Customers"
}

rfm['Segment'] = rfm['r'].astype(str) + rfm['f'].astype(str) + rfm['m'].astype(str)
rfm['Segment'] = rfm['Segment'].replace(rfm_seg, regex=True)
rfm.head()

The rfm segmentation is done, after this step we have a dataframe that contains the rfm score and their segment, we can use this for a marketing strategy on what to do with each segment or to do further analysis on the customers.

Chapter 4 - Visualization

Counting each segments

seg_count = rfm.Segment.value_counts()
seg_count

Create dataframe for visualization

seg_count = pd.DataFrame(data = seg_count)
seg_count = seg_count.reset_index()
seg_count = seg_count.rename(columns={'index':'segment','Segment':'cust_count'})
seg_count.head()

Bar Chart

palette_color = sns.color_palette("tab10")
plt.figure(figsize = (15,8))
sns.barplot(x = seg_count.segment,y = seg_count.cust_count, palette = palette_color)
plt.show()

Percentage of each segment

To get the percentage, simply divide cust_count with 18484(Total Customer) and multiply by 100.

percentage_dec = seg_count.cust_count / 18484
seg_percentage = round(percentage_dec, 4)
seg_count['percentage'] = seg_percentage * 100
seg_count

Pie Chart

palette_color = sns.color_palette("tab10")
plt.pie(seg_count.cust_count, labels = seg_count.segment, colors = palette_color, autopct='%1.1f%%')
plt.show()

Export dataframe for further analysis

rfm.to_csv("rfm_segmented.csv", index = False)

Conclusion

After finishing this segmentation I came to conclusion that RFM Segmentation is a tool that can help business to gain deeper understanding of their customers. By analyzing three key factors - Recency, Frequency, and Monetary - RFM allows companies to divide their customers into different segments, each with their own distinct characteristics and needs.

From this segmentation there are several points and insights we can take.

  1. "Lost Customers" is second highest segment with 19%. Customer in this segment have very low Recency and Frequency, it means the customers in This segment more likely to be churn or lost.
  2. "Needs Attention" occupies 35.6% of the customers, it's the highest segment. Although it's completely normal to have many medium to low RFM score customer, because they are basically regular customer, but it's better to knock those number down to have more reliable customer.
  3. It will be interesting to analyze more on what is the characteristics of the customers of each segment, maybe they came from certain area, or specific age groups, or countries, etc.

After this, usually we can apply different strategy for each segment, here are what I think they should do with their marketing strategy with different segment.

With more information such as favorite product/category of each segments, where do they live, day of transactions, etc. we could definitely be more precise and specific with the strategy.