🤖Machine Learning: Building an Unsupervised Learning Model Using K-Means Clustering

Gursewak Singh
9 min read · Apr 30, 2023


In this article, we will create an Unsupervised Learning model using the K-means clustering algorithm for an E-commerce website 🛍️

Let’s start by understanding what Unsupervised Learning is. Unsupervised Learning is all about finding patterns in datasets and grouping similar data points together, and one of the most commonly used algorithms for this is K-means clustering.

Now before we go ahead, you need to remember ONE last thing:

If you like my work, click the GREEN FOLLOW BUTTON on the top right side. It keeps me motivated and writing for amazing readers like you 🤩.

You can download the Jupyter notebook with the data from 👉 here 👈. You can visit this 🌐website to learn more about the data itself.

Problem Statement

Online Retail is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

In short, it’s one year of data for a UK-based company 💂‍♂️ selling unique all-occasion gifts.

So basically, we want to understand how many customer segments there are and which of them are most valuable to the business. 💸

Before we do something, let’s map our journey with the steps we will follow. 🚎

Steps Involved

  1. Read and understand the data
  2. Clean the data
  3. Prepare the data for modeling
  4. Modeling
  5. Final analysis and recommendations

📖Read and understand the data

Importing the necessary packages and libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime as dt

import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

Read the dataset

# read the dataset
retail_df = pd.read_csv("Online+Retail.csv", sep=",", encoding="ISO-8859-1", header=0)
retail_df.head()

Let’s understand the data at hand.

  1. InvoiceNo: a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation.
  2. StockCode: Product (item) code, a 5-digit integral number uniquely assigned to each distinct product.
  3. Description: Product (item) name.
  4. CustomerID: Customer number, a 5-digit integral number uniquely assigned to each customer.

The first few rows have the same InvoiceNo, which means they all belong to the same transaction (items bought together by a particular customer).
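Beyond head(), it’s also worth running a quick sanity check on the shape, data types, and summary statistics. A minimal sketch using standard pandas calls:

# quick sanity checks on the raw data
print(retail_df.shape)     # number of rows and columns
retail_df.info()           # column dtypes and non-null counts
retail_df.describe()       # summary statistics for numeric columns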

Now that we’ve understood what to do with the dataset, it’s always a good idea to start with cleaning the data first. 🧹🧹

🫧Clean the data

Checking for missing values in the dataframe

# missing values
round(100*(retail_df.isnull().sum())/len(retail_df), 2)

CustomerID has a huge number of missing values (about 25%). If we drop this column, there is no point in doing the modeling, as the whole analysis is about segmenting customers. Therefore we go ahead and drop the rows with missing values in Description and CustomerID (0.27% and 25% missing, respectively).

# drop all rows having missing values
retail_df = retail_df.dropna()
retail_df.shape

#==> (406829, 8)

Before we move on to data preparation, we need a column called amount (which is Quantity * UnitPrice), i.e., the total amount spent on a particular line item.

# new column: amount 
retail_df['amount'] = retail_df['Quantity']*retail_df['UnitPrice']
retail_df.head()

Data Preparation

Before we get into data preparation, we need to understand that clustering customers (especially in e-commerce) requires looking at the data in terms of Recency, Frequency, and Monetary value (RFM).

  • R (Recency): Number of days since last purchase
  • F (Frequency): Number of transactions
  • M (Monetary): Total amount of transactions (revenue contributed)

So basically, we want four columns: CustomerID, recency, frequency, and monetary. Let’s do that.

🍀Create column monetary

# monetary
grouped_df = retail_df.groupby('CustomerID')['amount'].sum()
grouped_df = grouped_df.reset_index()
grouped_df.head()

🍀Create column frequency

# frequency
frequency = retail_df.groupby('CustomerID')['InvoiceNo'].count()
frequency = frequency.reset_index()
frequency.columns = ['CustomerID', 'frequency']
frequency.head()

Before we move on to create the recency column, let’s merge the two dataframes that we created for monetary and frequency.

# merge the two dfs
grouped_df = pd.merge(grouped_df, frequency, on='CustomerID', how='inner')
grouped_df.head()

Now we need to work out the recency part. To get this column, we focus on the InvoiceDate column. We take the maximum (most recent) date in the data as a reference date, compute how much time has passed since each invoice, and then, for each customer, keep the minimum difference, i.e., the gap between the reference date and that customer’s most recent purchase.

For example, if the reference date is 4/29/2023 and a particular customer placed orders on 4/10/2023, 4/20/2023, and 4/25/2023, we will take 4/25/2023, because the difference between the reference date and that most recent purchase is the minimum: 4 days.
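To make the idea concrete, here is a tiny sketch with made-up dates (not from the dataset) showing how the minimum difference becomes the recency:

# toy illustration of recency (hypothetical dates, not from the dataset)
import pandas as pd

reference_date = pd.Timestamp('2023-04-29')
purchases = pd.to_datetime(['2023-04-10', '2023-04-20', '2023-04-25'])

# smallest gap between the reference date and any purchase = recency
recency_days = (reference_date - purchases).days.min()
print(recency_days)   # 4 -> the most recent purchase was 4 days before the reference date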

Let’s first convert the InvoiceDate column (which is of object type) into DateTime type.

# recency
# convert to datetime
retail_df['InvoiceDate'] = pd.to_datetime(retail_df['InvoiceDate'],
                                          format='%d-%m-%Y %H:%M')

Let’s see our dataframe

retail_df.head()

Get the max date or our reference date

# compute the max date
max_date = max(retail_df['InvoiceDate'])
max_date

Get the difference between max_date and each invoice date

# compute the diff
retail_df['diff'] = max_date - retail_df['InvoiceDate']
retail_df.head()

Now, for each customer, we will find the minimum difference (their most recent purchase).

# recency
last_purchase = retail_df.groupby('CustomerID')['diff'].min()
last_purchase = last_purchase.reset_index()
last_purchase.head()

Combine all the derived columns into a single dataframe

# merge
grouped_df = pd.merge(grouped_df, last_purchase, on='CustomerID', how='inner')
grouped_df.columns = ['CustomerID', 'amount', 'frequency', 'recency']
grouped_df.head()

We are only interested in the number of days, so let’s keep just the days.

# number of days only
grouped_df['recency'] = grouped_df['recency'].dt.days
grouped_df.head()

At this point, we need to handle the outliers and bring all the columns to the same scale. Notice that amount is in the 1000s whereas the others (frequency and recency) are on the scale of 1s or 100s. This matters because otherwise amount will overpower the other columns.

Let’s first handle the outliers.

plt.boxplot(grouped_df['recency'])

Since we don’t have domain-specific knowledge, we will remove the outliers statistically. We keep only data points that fall within whiskers computed from the 5th and 95th percentiles (an IQR-style rule), as done in the code below.

Handling outliers for amount, frequency, and recency columns.

# removing (statistical) outliers
Q1 = grouped_df.amount.quantile(0.05)
Q3 = grouped_df.amount.quantile(0.95)
IQR = Q3 - Q1
grouped_df = grouped_df[(grouped_df.amount >= Q1 - 1.5*IQR) & (grouped_df.amount <= Q3 + 1.5*IQR)]

# outlier treatment for recency
Q1 = grouped_df.recency.quantile(0.05)
Q3 = grouped_df.recency.quantile(0.95)
IQR = Q3 - Q1
grouped_df = grouped_df[(grouped_df.recency >= Q1 - 1.5*IQR) & (grouped_df.recency <= Q3 + 1.5*IQR)]

# outlier treatment for frequency
Q1 = grouped_df.frequency.quantile(0.05)
Q3 = grouped_df.frequency.quantile(0.95)
IQR = Q3 - Q1
grouped_df = grouped_df[(grouped_df.frequency >= Q1 - 1.5*IQR) & (grouped_df.frequency <= Q3 + 1.5*IQR)]
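Since the same rule is applied three times, it could also be wrapped in a small helper function. A sketch of that refactor (same logic as above, just reusable):

# optional refactor: the whisker rule as a reusable helper
def remove_outliers(df, col, low=0.05, high=0.95):
    Q1 = df[col].quantile(low)
    Q3 = df[col].quantile(high)
    IQR = Q3 - Q1
    return df[(df[col] >= Q1 - 1.5*IQR) & (df[col] <= Q3 + 1.5*IQR)]

for col in ['amount', 'recency', 'frequency']:
    grouped_df = remove_outliers(grouped_df, col)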

Now let’s go ahead and complete the preprocessing with standardization. We will standardize the data using StandardScaler, so that each column has a mean of 0 and a standard deviation of 1.

# 2. rescaling
rfm_df = grouped_df[['amount', 'frequency', 'recency']]

# instantiate
scaler = StandardScaler()

# fit_transform
rfm_df_scaled = scaler.fit_transform(rfm_df)
rfm_df_scaled.shape

Convert rfm_df_scaled (a NumPy array) back to a dataframe.

rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['amount', 'frequency', 'recency']
rfm_df_scaled.head()
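As a small design note, the same result can be achieved in one step while keeping the original column names; a sketch assuming the same scaler and rfm_df from above:

# equivalent one-step version that keeps the column names
rfm_df_scaled = pd.DataFrame(scaler.fit_transform(rfm_df), columns=rfm_df.columns)
rfm_df_scaled.head()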

Modeling

Now let’s begin the modeling part by creating the clusters using scikit-learn’s KMeans. We will start with an arbitrary number of clusters (in our case 4) and a maximum of 50 iterations, and then fit the data.

# k-means with some arbitrary k
kmeans = KMeans(n_clusters=4, max_iter=50)
kmeans.fit(rfm_df_scaled)

Finding optimal number of clusters

This is K-means in its simplest form, but it’s not recommended to use an arbitrary number of clusters. To find the optimal number of clusters, we use two techniques: the elbow curve method and the silhouette score method.

First, the elbow curve method, which uses the SSD (sum of squared distances).

# importing module
import warnings
warnings.filterwarnings('ignore' )

# elbow-curve/SSD
ssd = []
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50)
    kmeans.fit(rfm_df_scaled)
    ssd.append(kmeans.inertia_)

# plot the SSDs for each n_clusters
# ssd
plt.plot(range_n_clusters, ssd);

Notice that when we go from 2 to 3 clusters there is a significant drop in SSD, but from 4 clusters onward there is no noticeable drop. So the optimal number of clusters to start with is either 3 or at most 4.
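If reading the elbow off the plot feels subjective, the relative drop in SSD per added cluster can also be printed. A small sketch reusing the ssd list and range_n_clusters computed above:

# percentage drop in SSD for each additional cluster
for k, prev, curr in zip(range_n_clusters[1:], ssd[:-1], ssd[1:]):
    print(f"k={k}: SSD drops by {100*(prev - curr)/prev:.1f}% vs k={k-1}")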

Let’s see another method

Silhouette Score

It’s a score that ranges from -1 to 1, where 1 is the best and -1 is the worst.

# silhouette analysis
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]

for num_clusters in range_n_clusters:

    # initialise kmeans
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50)
    kmeans.fit(rfm_df_scaled)

    cluster_labels = kmeans.labels_

    # silhouette score
    silhouette_avg = silhouette_score(rfm_df_scaled, cluster_labels)
    print("For n_clusters={0}, the silhouette score is {1}".format(num_clusters, silhouette_avg))

From here on, the choice of the number of clusters is business driven. Even though the silhouette score with 2 clusters is among the largest, we will go with 3 clusters, as it gives more meaning and context from a business point of view.

🧐Cluster Analysis

Finalise the model with 3 clusters.

# final model with k=3
kmeans = KMeans(n_clusters=3, max_iter=50)
kmeans.fit(rfm_df_scaled)
kmeans.labels_

Now, we need to assign the cluster IDs that we generated to each of the data points we have. Let’s go ahead and do that.

# assign the label
grouped_df['cluster_id'] = kmeans.labels_
grouped_df.head()

Let’s analyse the clusters with box plots, starting with amount.

# plot
sns.boxplot(x='cluster_id', y='amount', data=grouped_df)

From the above illustration, cluster 0 constitutes the higher-value customers, whereas cluster 2 contains the lowest-value customers in terms of the amount they spend on purchases.

Checking frequency with a box plot

# plot
sns.boxplot(x='cluster_id', y='frequency', data=grouped_df)

Cluster 0 again shows the high-value customers, who purchase more frequently compared to customers in clusters 1 and 2.

Taking a look at recency

# plot
sns.boxplot(x='cluster_id', y='recency', data=grouped_df)

Here clusters 0, 1, and 2 show the reverse trend compared to the plots above. Cluster 0 contains the high-value customers (they buy more and spend more money) who are also the most recent ones (on average they made a purchase about 10 days ago). On the other end, cluster 2 represents the low-value customers (they buy less and spend less).
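Instead of reading three separate box plots, the same story can be summarized in a single table of per-cluster averages. A quick sketch on grouped_df:

# per-cluster averages of the RFM features
grouped_df.groupby('cluster_id')[['amount', 'frequency', 'recency']].mean().round(1)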

Summary

We saw how to create clusters using the K-means algorithm in Python with the analysis of the Online Store data set. We wanted to group the customers of the store into different clusters based on their purchasing habits. The different steps involved were:

  • Missing values treatment
  • Data transformation
  • Outlier treatment
  • Data standardisation
  • Finding the optimal value of K
  • Implementing K Means algorithm
  • Analysing the clusters of customers to obtain business insights

The only ambiguous point you may notice here is that you need to decide the number of clusters beforehand and, in practice, run the algorithm multiple times with different values of K before you can figure out the optimal number of clusters.
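One practical note on running K-Means multiple times: scikit-learn’s KMeans already restarts the algorithm from several random initialisations (the n_init parameter) and keeps the best run, and fixing random_state makes the results reproducible. A sketch of the final model with these parameters made explicit (the values here are assumptions, not from the original notebook):

# final model with explicit restarts and a fixed seed (assumed values)
kmeans = KMeans(n_clusters=3, max_iter=50, n_init=10, random_state=42)
kmeans.fit(rfm_df_scaled)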

So finally, this is how we use K-Means clustering to solve a business problem 🤩.

Moreover, you can go on and experiment with it by changing the number of clusters.

Big Shoutout to all Authors 🚀

That’s all, folks. Cheers 🍻!

If you LIKE my article, kindly SHARE it with your colleagues and peers. Make sure to CLAP (up to 50!), and follow me on 👉 Medium to stay updated with my new articles.

I write about new technologies, data science, machine learning, DevOps, and programming-related stuff. My goal is to make life easier for fellow developers and engineers. I try my best to bring clarity to readers through my articles. It’s not possible without the support of amazing readers like you. 🎋
