Customer Segmentation Example with Machine Learning Clustering and Principal Component Analysis
T. Carrico, Ailantic, LLC
October 2021
Reading time: 12 minute(s) @ 200 WPM
Customer Segmentation
This document demonstrates the use of R programming to perform a simple market segmentation analysis. The data source is a Starbucks customer survey posted as open data on Kaggle (http://kaggle.com), comprising 113 respondents and 32 survey questions. We will explore the data, apply machine learning clustering techniques and examine the survey variables that most uniquely identify customers.
The survey data are shown below.
Data Preparation
Before modeling, the data require formatting, transformation and initial analysis. This is done to:
- Clean, format and identify missing data
- Understand sampling and possible bias
- Cast values into formats required by numerical methods
- Remove data that is uninformative
Encoding categorical features
Non-numeric values (e.g., Student) require transformation to numeric for our analysis. Since our machine learning process operates on numeric values and distances, it is important that we do not simply encode ordinal numbers like “student = 1, employee = 2…” This works if the order is progressively meaningful, but otherwise in most cases we transform these with a process called One Hot Encoding. This creates a new column, or feature, for each category and fills it with a 1 (true) or 0 (false). The encoded survey results are shown below. In these data all of the survey responses are assigned integer categories.
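As a minimal sketch, one-hot encoding can be done in base R with model.matrix(); the data frame `survey` and column `status` below are hypothetical stand-ins for the actual survey fields:

```r
# One-hot encoding sketch in base R; 'survey' and 'status' are
# hypothetical stand-ins for the actual survey data and column names.
survey <- data.frame(status = factor(c("Student", "Employed", "Self-employed")))

# model.matrix() expands a factor into one 0/1 indicator column per level;
# the "- 1" drops the intercept so every level gets its own column.
encoded <- model.matrix(~ status - 1, data = survey)
head(encoded)
```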
Removing uninformative data
For any type of classification we need to find columns in our data with little or no variance. These columns are uninformative to our classification process and can dilute results by adding noise where those features are common to all respondents. Below we see 13 features where the responses are identical, i.e., zero variance. We remove these.
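A sketch of how zero-variance columns might be identified and dropped, assuming the encoded survey lives in a numeric data frame `survey_num`:

```r
# Identify and drop zero-variance columns; 'survey_num' is an assumed
# numeric data frame holding the encoded survey responses.
variances  <- sapply(survey_num, var)
zero_var   <- names(variances[variances == 0])
survey_num <- survey_num[, !(names(survey_num) %in% zero_var)]
```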
Normalizing and measuring variability
To prepare our data for clustering, we need to scale it. Scaling ensures all questions are equally weighted. Below we scale these data and then calculate the Euclidean distance between respondents. This gives us an understanding of how close or far our respondents are from each other. In the heat map below we can see some of the similar and different responses to the survey.
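A sketch of the scaling and distance steps, continuing from the assumed `survey_num` above:

```r
# Scale each question to mean 0, sd 1, then compute pairwise Euclidean
# distances between respondents ('survey_num' assumed from above).
survey_scaled <- scale(survey_num)
dist_matrix   <- dist(survey_scaled, method = "euclidean")

# Visualize the pairwise distances as a heat map
heatmap(as.matrix(dist_matrix), symm = TRUE)
```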
Customer Analysis
Clustering similar customers
To identify the segments of customers we will use K-Means clustering. K-Means is a form of unsupervised machine learning. This technique requires us to provide the number of clusters. To determine this we create a function, iterate with different numbers of clusters and compare the within-cluster sum of squares (withinss), a measure of how tightly the clusters encapsulate our groups. Withinss is the sum of squared distances between the customers in a cluster and the cluster's center. This number should be as low as possible, representing tighter clustering around features. Below we create a function and call it with different numbers of clusters, then plot to find the point of diminishing returns from additional clusters. There is a subjective element to this work in market segmentation: we desire fewer clusters, but also want to find unique features that make them identifiable, and therefore addressable, segments with marketing techniques.
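A sketch of that elbow-method iteration, using the assumed `survey_scaled` from above:

```r
# Elbow method sketch: run kmeans() for k = 1..10 and record the total
# within-cluster sum of squares ('survey_scaled' assumed from above).
set.seed(42)
withinss <- sapply(1:10, function(k) {
  kmeans(survey_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, withinss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```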
In the plot below we can see that around 5 clusters we begin to have diminishing returns in our withinss measure.
We now run the K-means algorithm for 5 clusters and plot the clusters in a reference frame that represents the most important features. In this plotting approach, when there are more than 2 features (we have 33), we plot the first two principal components. K-means is an unsupervised method, unlike other machine learning techniques where we have a response variable. The plot gives us insight into how alike customers are within the clusters.
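One way this might look in code, with the cluster plot drawn on the first two principal components via the factoextra package (an assumption; any PCA-based plot would do):

```r
# Fit K-means with 5 clusters and plot on the first two principal
# components; factoextra's fviz_cluster() is one convenient option.
library(factoextra)
set.seed(42)
km <- kmeans(survey_scaled, centers = 5, nstart = 25)
fviz_cluster(km, data = survey_scaled)
```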
The plot above is just 2 dimensions. In the table below we can view the average values for the customers in each cluster. This lets us start analyzing customer characteristics: what customers have in common with, and how they differ from, other customers.
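A sketch of how such a per-cluster summary might be produced:

```r
# Average (unscaled) feature values per cluster, for interpretation;
# 'survey_num' and 'km' are assumed from the steps above.
aggregate(survey_num, by = list(cluster = km$cluster), FUN = mean)
```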
Hierarchical clustering
Hierarchical clustering is an iterative process that does not require selecting the number of clusters. The algorithm iterates through all of the data, joining the closest groups and creating a dendrogram. The process takes the distance matrix we computed above as input. In the graph below the height represents how close customers are to each other.
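A sketch of the hierarchical clustering step, using the assumed `dist_matrix` from above:

```r
# Hierarchical clustering on the distance matrix from the scaling step.
hc <- hclust(dist_matrix)
plot(hc, xlab = "Customer", ylab = "Height")  # draws the dendrogram
```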
We can then generate groups from this dendrogram and view their members. The customer IDs for a selected branch of the tree are shown below.
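For example, cutting the tree into groups and listing one branch's members might look like this (k = 5 is an illustrative choice):

```r
# Cut the dendrogram into k groups and list the customer IDs in one branch.
groups <- cutree(hc, k = 5)
which(groups == 1)
```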
Respondent variability
A heat map, as seen above in the distance calculation, can provide more insight into our data. For our survey we may want to look at those respondents who do not vary their answers from one question to the next. Maybe the survey was all numeric and they selected 5 out of 10 for every question to get the free cup of coffee! To find these we can produce a heat map, combined with a dendrogram, based on the standard deviation across the questions.
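A sketch of how respondents might be ranked by that variability, again assuming `survey_scaled` from above:

```r
# Rank respondents by the standard deviation of their (scaled) answers,
# then draw a heat map with the least-variable respondents on the left.
resp_sd <- apply(survey_scaled, 1, sd)
ord     <- order(resp_sd)               # ascending: least variable first
heatmap(t(survey_scaled[ord, ]), Rowv = NA, Colv = NA)
```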
Below we can see that the left columns are the customers with less variance in their answers. In these plots red indicates minimum variance, and the columns are ordered left to right with the most variable respondents on the right.
Alternatively, we can look at the variance in the customers themselves. Below we see the least variance in the customers' income. This is a good check for bias in our survey results. We can also look at the variance in their ratings, where we see higher variance in the service and ambiance ratings from customers.
Principal Component Analysis
Principal Component Analysis (PCA) is a process for finding the most informative features in our data, which allows us to reduce the redundancy of highly correlated values and reduce data dimensionality. A way to think of this is to consider which features, or questions, in our survey data are the most unique and important in determining a customer's group. It may be, for example, that income is not important. Our goal is to preserve the informative content of our data while reducing its dimensions for computation, visualization and comprehension.
The concept behind PCA is mathematically complex, but conceptually simple. Fortunately we have several available libraries to assist with the math part.
The math behind PCA involves orthogonal transformations of matrices. Matrix math allows us to create an n-dimensional space and measure the variance of the data relative to straight vectors passing through this space. If the data points did not vary at all, they would sit on these vectors. Imagine a line from the sun into space with all of the planets lined up. Then imagine them not aligned. By drawing these vectors into the space we can measure how aligned or unaligned the data are. Variable data along a vector is more descriptive than aligned data, since we are seeking to understand the features in our data that discriminate one observation from another. These “arrows in space” along which we measure data alignment are called eigenvectors. The measure of the data variance along these vectors is called the eigenvalue.
By transforming the matrix we can isolate the directions with the highest variability. Over the entire data set this allows us to find the features that describe the highest percentage of the variation - the principal components. We get an ordered list of which components best describe the entire data set. In some cases 2 or 3 components alone can describe a data set of 100 or more features. Fortunately, these matrix math methods are readily available in our libraries.
Principal Component Analysis on our data set is achieved by running the code below (note that we previously scaled our data; if not, we can scale it in our function call). All of the 18 features in our data are shown in the table below with their eigenvalue, the variance explained by that PC and the cumulative variance percentage. We can see the first four components describe about 50% of the variance in our data.
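A sketch of what that call might look like with base R's prcomp() (the original code is not shown, so this is an assumption):

```r
# PCA via prcomp(); the data were scaled earlier, otherwise pass
# scale. = TRUE here. 'survey_scaled' is assumed from above.
pca <- prcomp(survey_scaled)
summary(pca)   # proportion and cumulative proportion of variance
pca$sdev^2     # the eigenvalues (variance of each component)
```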
Recall that the principal components are not the same as the features (i.e., survey questions). These components are the result of a data-driven approach that found them in the data. We can look at how the questions contribute to a component through the rotation matrix. The table below shows, for each customer question, how much it contributes to each component.
To begin interpreting these results we look at how much a feature contributes to a component, via the highest or lowest numbers in the principal components. PC1 has the largest contributions in productRate (0.36), serviceRate (0.33), chooseRate (0.34) and loyal (-0.30). These are ratings filled in by the customers, as opposed to demographic information about them. We can consider PC1 to be our overall satisfaction rating since it represents these selections. PC2 represents more demographic information on gender, age and income. We can consider this, for now, as the customer demographics rating.
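This loadings table can be printed from the `pca` object assumed above:

```r
# The rotation (loadings) matrix: each row is a survey question, each
# column a principal component.
pca$rotation
```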
## PC1 PC2 PC3 PC4 PC5
## gender -0.03106238 -0.29900028 0.29616323 -0.00401134 0.27556571
## age 0.11820991 0.32801935 0.15652774 -0.24448167 -0.41685500
## status 0.17986826 0.15773463 0.36554813 -0.33649621 -0.06775950
## income 0.10878036 0.41779144 0.10832462 -0.25212516 -0.11513181
## visitNo -0.22858075 -0.22625958 0.01372881 -0.24953792 0.02663407
## method -0.04003336 -0.16828009 0.54329585 0.14043172 -0.23111321
## timeSpend 0.05908316 0.25596450 -0.43886936 -0.27993484 0.22725346
## location -0.14764654 0.04473421 -0.31076635 0.18867695 -0.56763426
## membershipCard -0.26812532 -0.16413764 -0.17185629 0.12728265 -0.37601347
## spendPurchase 0.19335539 0.33840860 0.12012221 0.27363037 -0.01894841
## PC6 PC7 PC8 PC9 PC10
## gender -0.20633601 0.374156897 -0.47969379 0.26955809 -0.26370638
## age 0.09103555 0.263841710 0.05259550 0.27962288 -0.05098357
## status -0.12250783 0.069061189 -0.30577379 -0.35640726 0.30588858
## income 0.26745141 -0.089732302 -0.06751608 0.27167683 -0.12636593
## visitNo 0.47081517 0.445353342 0.01411595 -0.10844980 0.22154936
## method -0.29421739 -0.228601250 0.08339011 -0.01994242 -0.01980002
## timeSpend -0.30136808 0.007744439 -0.04010632 -0.08721118 -0.04865335
## location -0.16969415 -0.049320569 -0.41484559 0.14149037 0.23503402
## membershipCard -0.13891324 0.356156447 -0.01985786 -0.09748890 -0.11770837
## spendPurchase -0.24454165 0.248445265 0.13569099 0.01635038 -0.32326041
## PC11 PC12 PC13 PC14 PC15
## gender 0.25846212 -0.01080653 -0.061252816 -0.27218963 -0.09237527
## age -0.11376943 -0.40605737 -0.303506112 -0.15418000 0.36128414
## status -0.22874481 0.27590906 0.366060138 -0.05901339 0.06564380
## income 0.06037652 0.28137273 -0.003612662 0.06900860 -0.53053262
## visitNo 0.11189395 -0.37377377 0.247103697 0.27812292 -0.18979643
## method -0.16599318 -0.31510041 -0.173794111 0.24609556 -0.43972768
## timeSpend -0.08495036 -0.38206957 0.040777111 -0.26640234 -0.44498149
## location 0.40566605 0.01723121 0.125211532 0.15666155 -0.07988022
## membershipCard -0.57644978 0.14629462 0.074910677 -0.20337634 -0.13433772
## spendPurchase 0.15437167 -0.13428078 0.545721134 0.09960360 0.12636045
## PC16 PC17 PC18
## gender -0.04586585 -0.01281388 -0.18850055
## age 0.09157289 -0.07816560 -0.14212579
## status 0.17099966 -0.21516760 -0.10803630
## income -0.39105509 0.14287690 -0.09353801
## visitNo -0.07306379 -0.04102535 0.13658469
## method 0.07120557 -0.13912850 0.14075330
## timeSpend 0.27965084 -0.03312741 0.05268776
## location 0.14770829 -0.08359065 -0.05441842
## membershipCard -0.19869961 0.29260374 -0.01824713
## spendPurchase -0.15689695 0.03435924 0.34981488
We can also look at each customer and see how their responses align with the principal components. The table below shows all customers (one per row) and their score on each principal component. Customer #1 scores 0.85 on PC1 compared to customer #2 at 0.07. These scores are abstractions of the questions themselves, since the components (arrows in space) are defined by the data, but they do show proportional differences between customers.
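These scores might be retrieved as follows, assuming the `pca` object above:

```r
# Each customer's coordinates (scores) on the principal components.
head(pca$x, 10)
```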
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 0.8529661 -1.77036629 -0.75371744 -0.8521789 1.9743031 0.45239286
## 2 0.0672139 -2.20158359 0.92824128 0.1360574 0.4268041 0.64073833
## 3 0.7221831 0.08873405 -1.09567555 -0.2312862 -0.1104122 -0.02501779
## 4 -3.9589085 -1.02217937 0.33309816 -0.2373517 -0.6948346 -0.88667282
## 5 -1.8722242 1.26098939 -0.05392417 1.7734130 0.3264122 -1.25619758
## 6 0.7025237 -2.27717905 -1.94851037 -0.3863118 -0.5428095 -0.17092736
## 7 2.9631473 -1.66004998 -0.61864795 0.4695165 1.7007916 0.83945648
## 8 -0.8821823 1.78369895 -0.51628074 -0.3949705 0.3195202 1.12152920
## 9 1.4705346 -0.75181835 -0.26343371 1.9184342 0.1366377 0.01261399
## 10 -0.1246025 0.52273150 0.34242932 1.8681059 -1.1889317 -0.69501540
## PC7 PC8 PC9 PC10 PC11 PC12
## 1 0.12187894 0.76403388 1.12190916 0.20925179 0.4144088 -0.66386200
## 2 -0.74702288 0.44297057 1.30968436 -0.16698097 0.7529483 -0.87319091
## 3 -1.22200506 -0.01615925 -0.19812180 1.22652068 0.3726233 0.41213361
## 4 0.02391782 0.01697528 0.03272142 -0.86975906 0.7950617 -0.09239584
## 5 0.32591702 1.84518528 -0.16434309 -0.13402648 -0.6872136 -0.10382441
## 6 0.88980381 -0.47023190 0.55244542 0.07899229 0.6290469 -0.23892989
## 7 1.56415576 0.26322055 0.59399329 -0.75014203 0.3491163 -0.03019637
## 8 -0.80803693 -0.51806396 -0.41375599 1.47414131 0.5929043 0.05778179
## 9 0.89185639 -0.09072060 0.84250425 -0.75051299 1.5398838 -1.06318429
## 10 -0.31792495 -0.35743723 -0.98100672 0.74133093 -1.0050094 0.16266349
## PC13 PC14 PC15 PC16 PC17 PC18
## 1 -0.28868784 -0.8461393 0.46247395 -0.40819731 0.21559296 0.22892007
## 2 -0.01317355 0.1163454 -0.08648667 0.95247399 0.88549905 -0.02797806
## 3 0.65192868 -0.4815671 0.71413363 1.04041956 -0.24515137 -0.58899529
## 4 -0.73284999 0.1263621 0.14103232 -0.56130397 -0.37781989 0.21630649
## 5 -0.55726490 0.4017822 -0.39204356 0.67988232 0.66021295 0.10891472
## 6 -0.24693562 -0.6011324 -0.05556375 -0.18261301 1.20191761 0.24068965
## 7 -0.64860411 1.0279029 0.83009486 -0.24996988 0.46010085 0.84101819
## 8 0.72572364 0.1406796 -0.35655482 -0.06240683 -0.01423836 -1.10453569
## 9 1.17385607 1.0209409 0.38723312 -0.30299889 -0.23994262 0.10151588
## 10 0.41666402 0.1401775 0.25833695 -0.33792257 -0.49644905 0.33574888
The scree plot graphs the eigenvalues ordered from largest to smallest and shows the individual and cumulative components. The cumulative proportion shows how far down the list we need to go to reach the level of data description desired.
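One way to draw it, assuming factoextra as before:

```r
# Scree plot of the variance explained by each component.
fviz_eig(pca, addlabels = TRUE)
```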
Next we look at a circular plot for our PCA. In the plot below we are looking at the first two principal components and how they relate to the questions in the survey. The questions with the longest vectors are most aligned with PC1 or PC2 in the positive direction. Vectors that are closely grouped are correlated, and vectors on opposite sides are negatively correlated.
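A sketch of that variable correlation circle, again via factoextra:

```r
# Variable correlation circle for the first two principal components;
# repel = TRUE keeps the question labels from overlapping.
fviz_pca_var(pca, repel = TRUE)
```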
We interpret those vectors in the upper right corner as being strongly represented by PC1 and PC2. Interestingly, we see that those in the lower right quadrant are high in dimension 1 and low in dimension 2. High in dimension 1 means they are filling out the ratings with high marks, but low in dimension 2 means they are not in the higher income, older age group. They are younger with less income. With the higher WiFi and ambiance ratings also correlated to these, it could mean these are people who enjoy spending more time in the cafe. Are the older, higher income people primarily carrying out and not using the WiFi? This type of breakdown affords us the opportunity to develop insight and more specific questions.
Next we add customers to this plot. Below we can see customer ID numbers plotted along with our principal components. This gives us the opportunity to pull up individual responses.
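A biplot of customers and questions together might be drawn like this:

```r
# Biplot: customers (by row ID) and survey questions on PC1/PC2.
fviz_pca_biplot(pca, repel = TRUE)
```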
Understanding customers
Let's examine customer #29 in the upper right quadrant. We can see our analysis supports our assessment of this type of customer. They are over 40, make a high income and use the drive-thru. Clearly ambiance and WiFi service are unimportant to this customer. And we deduced this from the evidence in the data!
| question | response |
|---|---|
| gender | Male |
| age | 40 and above |
| status | Self-employed |
| income | More than RM150,000 |
| visitNo | Weekly |
| method | Drive-thru |
| timeSpend | Below 30 minutes |
| location | 1km - 3km |
| membershipCard | Yes |
|  | Coffee |
| spendPurchase | More than RM40 |
|  | 4 |
|  | 5 |
|  | 3 |
|  | 3 |
|  | 3 |
|  | 4 |
|  | 5 |
|  | Starbucks Website/Apps;Social Media;Through friends and word of mouth;In Store displays |
| loyal | Yes |
Let's look at customer #67 in the lower right quadrant. We can see they have a lower income and dine in the cafe. WiFi and ambiance are important to them and rated highly. These characteristics matter more to this customer than to the previous one.
| question | response |
|---|---|
| gender | Female |
| age | From 20 to 29 |
| status | Self-employed |
| income | RM25,000 - RM50,000 |
| visitNo | Monthly |
| method | Dine in |
| timeSpend | Below 30 minutes |
| location | within 1km |
| membershipCard | Yes |
|  | Coffee;Cold drinks;Pastries;Sandwiches |
| spendPurchase | Around RM20 - RM40 |
|  | 4 |
|  | 3 |
|  | 4 |
|  | 5 |
|  | 4 |
|  | 4 |
|  | 4 |
|  | Starbucks Website/Apps;Social Media;Emails;Through friends and word of mouth |
| loyal | Yes |
Now customer #23… yes, we all have a customer #23. Customer #23 is less satisfied and answers that they will not come back. We can see they are low in Dim 1 (ratings) and Dim 2 (demographics). This means they make less income, are younger and provided lower satisfaction ratings. We also note they stay for 3 hours in the cafe and are most interested in sales and promotions. How do we respond to this information? Seeing they do not hold a membership card, maybe this is an opportunity to promote savings through a membership that may be meaningful to this type of customer. We are definitely very interested in these types of customers and responses. Or, is this not our market? The point is we now have actionable information and understand our customers in the context of their responses, demographics and other customers. We have segmented our customers for market analysis.
| question | response |
|---|---|
| gender | Male |
| age | From 20 to 29 |
| status | Employed |
| income | RM25,000 - RM50,000 |
| visitNo | Rarely |
| method | Dine in |
| timeSpend | More than 3 hours |
| location | within 1km |
| membershipCard | No |
|  | Coffee |
| spendPurchase | Less than RM20 |
|  | 5 |
|  | 2 |
|  | 5 |
|  | 5 |
|  | 2 |
|  | 4 |
|  | 3 |
|  | Through friends and word of mouth |
| loyal | No |
Market segments
Next we want to bring our cluster analysis back in by seeing how the 5 clusters we found align with the principal components. Below we plot the 5 clusters from our K-Means clustering in an XY scatter plot of the first 2 principal components. This allows us to understand where our clustering algorithm identifies market segments. We can visualize how clusters 1, 2 and 3 are defined along PC1 and PC2. But clusters 4 and 5 are not well discriminated by PC1 and PC2. That is OK, since the additional principal components we saw above account for other differences. But it is consistent to see that our first 3 clusters can be represented by PC1 and PC2.
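A sketch of that overlay, combining the assumed `km` and `pca` objects from above:

```r
# Plot PC1/PC2 scores colored by K-means cluster assignment.
library(ggplot2)
scores <- data.frame(pca$x[, 1:2], cluster = factor(km$cluster))
ggplot(scores, aes(PC1, PC2, color = cluster)) +
  geom_point(size = 2) +
  labs(title = "K-means clusters on the first two principal components")
```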
Customer features
We can also look at a specific feature in our data to see where it falls on the principal components. Below we examine how gender falls along our components. We note that we have good representation along PC1 (ratings), but our survey included females with less income and lower age than the males in our survey. This may reveal bias in our survey collection method and/or the fact that our female customers tend to be younger with less income.
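One way this might be drawn, assuming the original (pre-encoding) gender column is available as a factor; `survey_raw$gender` is a hypothetical stand-in:

```r
# Color the individual PC scores by gender; 'survey_raw$gender' is an
# assumed stand-in for the original categorical column.
fviz_pca_ind(pca, habillage = survey_raw$gender, addEllipses = TRUE)
```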
Summary
Customer segmentation analysis with existing descriptive data (e.g., from Salesforce) or augmented with survey data can provide marketing decision information on your customers. Understanding data-derived segments and the features of those customers illuminates characteristics that may be too subtle to detect in current business operations and reports. Identifying the features that define good or desirable customers can help focus marketing activities to capture more customers like them, or identify new or under-engaged segments.
The combination of clustering and Principal Component Analysis provides us a set of data-driven models that we can further query and filter to create meaningful quantitative visualizations and numerical summaries. With this approach we can:
- Define customer segments
- Identify survey representation and bias
- Measure the variability of customers and responses
- Identify opportunities to expand segments and improve service
- Identify important features in the segments
- Gather insight into customer segment demographics, satisfaction and service value
This customer segmentation analysis is a repeatable process that can include survey data along with other available customer data. For example, features from Salesforce can also be clustered and analyzed for principal components to gain insight into what the true, data-defined segments are and which features of those customers define their segment. With this we can identify opportunities to expand by specific features (e.g., annual revenue) or measure gaps in under-represented market segments.