We now have our clusters created either by a CHAID model (post 2a), or from the Kohonen clustering algorithm (post 2b). The next step is to see which of these clusters is the most valuable – i.e. which one generates the most revenue.
There are two parts to this:
- Calculate how long these individuals are likely to stick with you (this post),
- Calculate the revenue from each point in time that they’re with you and combine with 1. above (next post).
In this post, I’ll run through a simple approach to calculating how long customers are likely to stick with you, based on historical data. Of course, if you have no historical data, then it’s anyone’s guess! Also, the work on this page can either be used with the clusters previously defined or you can just treat everyone as the same cohort, and apply this analysis to the whole group (for example, if you only have a handful of customers, the clustering process might be overkill).
The method I’m going to use for this first part of the lifetime value calculation is the Kaplan-Meier Estimator. I was going to show this in SPSS, but I think it’s actually simpler to use a spreadsheet – it’s a pretty simple algorithm based on very basic principles, so it’s more useful to show its inner workings in a spreadsheet (unlike, say CHAID, where showing the inner workings wouldn’t add anything, except a headache).
The only data needed for this process, is the following, for each row (i.e. customer) in your data table:
- When did the customer start using your service?
- Is the customer still using it?
- If she isn’t, when did she stop using it?
But, why do we need something as seemingly obscure as the Kaplan-Meier Estimator? Why not just do something simple, such as working out the average time it takes for people to leave?
The problem is one known as right-censoring. If you’re using recent historical data, then the chances are you still have some customers on your service (hopefully!). Combine this with the fact that people will have joined your service at all sorts of different times (e.g. some might have just joined yesterday) and you hit the problem of right-censoring. Essentially, you know the the start date for each customer, but, for a decent proportion of that group, you don’t know their end date (i.e. when they will leave, in the future). This problem is called the right-censoring problem – for example, if you were trying to figure out how many people left your service after 3 months, then there will be a group of people who joined your service in the last 2 months. Should these people be involved in this calculation? How? The Kaplan-Meier Estimator allows you to do this:
In this diagram, hollow diamonds on the right hand side show where customers are right-censored (rows D, F and H) because they were still with your service at the end of the measure period (i.e. now!) – we don’t know what happen to them in the future, but we want to use their data for as long as we can in our calculations.
NB: Row E is also right-censored, because it dropped out of the experiment for some other reason (e.g. they stopped paying their bills, so you ejected them from the service), and row A is left-censored, because we don’t know when they started the service (perhaps you bought the online service from someone else, and they lost their records!).
With this in mind, the process for calculating lifetime length graphs (or survival graphs, as they’re more commonly known), is actually very simple:
- For every point in time, t, find the number of people who are still around at that point, n, and then the number of people who quit your service, d, between t and t+1,
- Then, for time t, calculate the proportion of people who left that month as d/n,
- Calculate the conditional probability of surviving at time t as 1- this value, i.e. 1-d/n,
- Calculate the Survival function up to time, t, as the product of all these probabilities up to that point.
Using a simple example to illustrate this, if on the day you started your service, you had 100 customers then, during the first month, you lost 10 of those customers, then the proportion who left is 10%. One minus this value is 90%, and represents the proportion of customers who make it through to month 2. Then, at the start of month 2, you have 90 customers left, and you find that 9 of them disappear. Here, the proportion who left is 9/90 or 10% again. Hence the conditional probability of getting through month 2, given that you made it through month 1, is 90%. Finally, to work out the survival function – i.e. the chance that you’ll get to the end of month 2, if you start the service at all, you multiply 90% x 90%, giving you 81%.
The spreadsheet below carries out this calculation for our example dataset. I’ve added some extra columns to the input data, and I’ve also appended some cluster classifications from the previous posts.
I’ll walk through the steps needed to understand this sheet below, then provide some further explanation as to why this might be interesting!
Understanding the Spreadsheet
1. Once downloaded, open the document and look at sheet “Source Data”. In particular the last few columns in the table, AE:AI – explanations of these columns are below the diagram:
Columns A to AD: This is the original example dataset.
Column AE (Cluster_ID): The cluster ID assigned by the Kohonen Network clustering app.
Column AF (StartDate): The date the customer started as a customer (in this example, with personal loans, more generally, when they started on the service).
Column AG (OutcomeDate): For any customer, two things can happen – they either stay with you, the service provider, in which case there is not “Outcome” (nothing has happened!). Hence this column is empty. If they leave you, then this column contains the date they left.
Column AH (OutcomePeriod): The number of months between the customer starting on the service and leaving (obviously blank if they’ve never left).
Column AI (LifetimePeriod): If the customer has left, then this is the same value as OutcomePeriod. If the customer hasn’t left, then this is the number of months from when they started to, effectively, now.
NB: Columns AH and AI are calculated from previous columns.
2. Moving to the sheet “Survival Calculations” – this uses these columns to carry out the lifetime calculations outlined above:
a) Row 5 – a row simply indicating the number of months of service use, t.
b) Row 6 – the first calculated row. This row calculates, for each time period, the number of customers that were around at that point. As described above, when calculating the probability of a customer staying from time t to time t+1, you need to know how many customers were around at time t, for the denominator. This is calculated in row 6, using the data in column AI, from the previous sheet.
c) Row 7 – In this row, we count the number of customers who left in the given time period.
d) Row 9 – Here we calculate the percentage of customers that left this month. For example, if there were 100 customers at time t, and 10 left, then this value would be 10%.
e) Row 10 – one minus the figure above.
f) Row 11 – the final survival function. This is calculated by finding the product of all of the values in row 10 from the far left (cell B10), to the current time, t.
g) Also worth mentioning cell B2 – this is used to select a cluster so that the graph is only shown for that specific cluster. More info below.
These values are then plotted out in the graph below:
Interpretation of the Graph
A few points can be made about this chart:
- Almost nobody leaves before 12 months. This is because this dataset was for customers taking out loans with an initial 1 year tie-in period (i.e. if they left within 1 year, they would incur heavy penalties).
- At the 12 month break, a lot of people do start leaving quite rapidly – there’s a real discontinuity here, obviously indicating that some people wanted to disappear as soon as they could get out of the loan tie-in period.
- But – the attrition soon slows right down after a few months, levelling off by around month 16. This indicates that there is a proportion of people (around 60%) who are perfectly happy continuing with the loan after the tie-in period has expired.
- The chart does only go up to t = 19 months. Our service has only been running for 19 months, so we simply cannot say anything about what will happen after 19 months. Of course we could project forward, but, personally, I’m very skeptical about any sort of future forecasting in this sort of non-parametric environment – we have no idea if there was some feature of the loan that meant everyone would stop disappearing after 20 months, 24 months, or whatever.
What is also interesting however, is if you select two different clusters for comparison:
As can be seen, customers in cluster 7 are far more likely to stick with the service (80% likely rather only 60% for cluster 5). Looking at the cluster definitions for 5 and 7:
As can be seen, the LoanAmount value for cluster 5 is much higher than for cluster 7 (almost 4 times higher). Generally, customers with higher loan amounts are more price-sensitive (because they’re paying higher premiums) – this makes them much more likely to shop around and move their loan elsewhere when they can. In contrast, if you have a smaller loan (cluster 7), then it’s unlikely to be worth your while moving your loan elsewhere.
Similarly, you might find, when analysing survival rates for your cluster groups, that more price-sensitive groups paying you more in monthly revenue, are more likely to churn. Of course, up until that point of churn they are (in theory) paying you more (see next post on calculating this). However, if they then starting churning heavily at the first available opportunity, you have to examine your customer acquisition costs very carefully – any profit gained during the service period could easily be wiped out by heavy upfront marketing costs as you desperately try to feed the top of the pipeline, as older customers are churning out the other end.
Next time (and in the final part of this series – phew!), I’ll look at calculating the revenue at each stage (i.e. at each time, t), then combining that with the charts above to finally, get a calculation for Lifetime Value (LTV) for each customer segment. Armed with this, and knowledge about how expensive it is to acquire people in each group, you can then start directing your marketing efforts to the more profitable customer segments – the ones that are likely to stick with you for a long time.
PS Just as I was about to post this, I found another great article about “Cancellation rates” – i.e. the calculations of how many people churn from your service at different times. Find it at: