User retention is one of the most commonly asked questions in digital analytics, and it is usually answered with 'days since last visit' or a cohort table. However, days since last visit is overly simplistic and not very useful. The cohort table looks better and more promising, but it does not scale; imagine a 52-week cohort table. More critically, when we want to analyse retention on a post-login portal or app, these cookie-based reporting approaches cannot give us a true view of user retention.
Let’s think again about what we usually want to know about user retention. One question is ‘how often do users access the portal/app’, and the number of visits per week/month could help answer it. But 10 visits in a single day is very different from 10 visits spread over 10 days, so the question usually becomes ‘how many days have users accessed the portal/app in a week/month’.
I tried creating a calculated metric in Adobe Analytics using the approximate count distinct function on ‘Date’ and reporting it in a freeform table against the user ID dimension. It works, but I could not go further to get the average, let alone a monthly trend. So it was time to get the raw data again.
Using data feed raw data is absolute overkill for this task, as it only requires two fields: user ID and date. The same data can be retrieved easily with a Data Warehouse request, where the data file is much smaller and arrives quickly. However, as I already have the data feed raw data, why bother creating another Data Warehouse request to retrieve the data again?
The main objective is to calculate the access rate per logged-in user, i.e. the number of days a user logs in each week or month.
I have an eVar storing the user ID and the standard ‘date_time’ column for the date. I also need the day of the week, the week of the year, and the month for different groupings, all of which can be extracted from ‘date_time’ with a few pyspark functions.
import pyspark.sql.functions as F

date_user = (
    hit_data
    .withColumn('date', F.split(hit_data['date_time'], ' ').getItem(0))    # YYYY-MM-DD
    .withColumn('month', F.substring(hit_data['date_time'], 1, 7))         # YYYY-MM
    .withColumn('dayofweek', F.dayofweek(hit_data['date_time']))
    .withColumn('weekofyear', F.weekofyear(hit_data['date_time']))
    .withColumnRenamed('post_evarXX', 'user_id')
    .select('date_time', 'date', 'month', 'weekofyear', 'dayofweek', 'user_id')
)
Two notes: the new ‘month’ column is YYYY-MM rather than the month alone, and pyspark’s ‘weekofyear’ function follows the ISO-8601 convention where weeks start on Monday, with no way to change it. That is not quite what I want, so I am not going to use the week number in actual reporting.
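To double-check these conventions before relying on them, here is a minimal sanity check (assuming a SparkSession named spark and the F alias from above; the sample dates are made up purely for illustration):
# 2023-01-01 was a Sunday and 2023-01-02 a Monday
spark.createDataFrame([('2023-01-01',), ('2023-01-02',)], ['d']).select(
    'd',
    F.dayofweek('d').alias('dayofweek'),    # 1 = Sunday ... 7 = Saturday
    F.weekofyear('d').alias('weekofyear')   # ISO-8601 week, starting on Monday
).show()
# 2023-01-01 -> dayofweek 1, weekofyear 52 (still the last ISO week of 2022)
# 2023-01-02 -> dayofweek 2, weekofyear 1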
The resulting date_user data frame is as follows, and we can start our access rate calculations.
Let’s start with the weekly access rate as a demonstration. Even though I am not fond of the week starting on Monday, the weekly calculation is easier because every week has a consistent 7 days, unlike months, which have anywhere from 28 to 31 days.
First, we get the distinct combinations of user_id, weekofyear, and date from the hit-based date_user data frame, then group by user_id and pivot from long to wide form with the count of days, and finally fill in 0 for the weeks a user didn’t log in. This assumes at least one user logs in every week; otherwise the missing weeks will not appear in the data frame at all.
weekly_user = (
date_user
.select('user_id', 'weekofyear', 'date').distinct()
.groupBy('user_id').pivot('weekofyear').count().fillna(value=0)
)
Then we need to turn it back into long form, as the wide form is not friendly for further calculation and charting. This wide-to-long transformation is a bit more involved because there is no built-in function; instead we use the pyspark explode function, which takes an array or map column and creates a new row for each item.
weekly_user = weekly_user.withColumn('combined', F.create_map(sum([[F.lit(_), weekly_user[_]] for _ in weekly_user.schema.names[1:]], [])))
weekly_user = weekly_user.select('user_id', F.explode(weekly_user['combined']).alias('weekofyear', 'count'))
weekly_user = weekly_user.withColumn('weekofyear', weekly_user['weekofyear'].cast('int'))
The first line above creates a map column combining all the weekly columns, with the week number as key and the count as value; the second line explodes the map column into rows; and the last line casts weekofyear back to an integer.
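If the map-and-explode trick looks opaque, here is the same transformation on a tiny made-up frame (the user IDs and week numbers are hypothetical, just to show the shape change):
wide = spark.createDataFrame([('u1', 3, 0), ('u2', 1, 2)], ['user_id', '1', '2'])
combined = wide.withColumn(
    'combined',
    F.create_map(sum([[F.lit(_), wide[_]] for _ in wide.schema.names[1:]], []))
)
long = combined.select('user_id', F.explode('combined').alias('weekofyear', 'count'))
# Each (user_id, week) pair is now its own row:
# ('u1', '1', 3), ('u1', '2', 0), ('u2', '1', 1), ('u2', '2', 2)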
Finally, we can calculate the overall weekly access rate and access rate per user.
weekly_access_rate = weekly_user.groupBy('weekofyear').agg(F.avg('count').alias('access_rate'))
user_access_rate = weekly_user.groupBy('user_id').agg(F.avg('count').alias('access_rate'))
Of course, we will visualise the data to better understand the distribution and trend.
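As a minimal sketch (assuming matplotlib is installed and the aggregated data is small enough to bring to the driver with toPandas()), the trend and distribution could be plotted like this:
import matplotlib.pyplot as plt

weekly_pdf = weekly_access_rate.orderBy('weekofyear').toPandas()
user_pdf = user_access_rate.toPandas()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(weekly_pdf['weekofyear'], weekly_pdf['access_rate'])   # weekly trend
ax1.set_xlabel('week of year')
ax1.set_ylabel('avg days with access')
ax2.hist(user_pdf['access_rate'], bins=20)                      # per-user distribution
ax2.set_xlabel('avg days with access per week')
ax2.set_ylabel('number of users')
plt.tight_layout()
plt.show()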
After the weekly access rate, we can get the monthly access rate in a similar way with a different grouping. The ‘access rate’ above is not really a rate but a number of days. Since months have different numbers of days, we need a true rate: the number of days with access divided by the number of days in the corresponding month.
monthly_user = (
date_user
.select('user_id', 'month', 'date')
.distinct()
.groupBy('user_id')
.pivot('month')
.count()
.fillna(value=0)
)
monthly_user = monthly_user.withColumn(
'combined',
F.create_map(
sum(
[
[F.lit(_), monthly_user[_]]
for _
in monthly_user.schema.names[1:]
],
[]
)
)
)
monthly_user = (
monthly_user
.select(
'user_id',
F.explode(monthly_user['combined']).alias('month', 'count')
)
)
import pandas as pd
from calendar import monthrange

# `dates` is the list of dates covered by the data feed (defined outside this
# snippet). pandas cannot build a DataFrame directly from a set, so the
# de-duplicated (month, days) pairs are converted to a sorted list first.
month_days = spark.createDataFrame(
    pd.DataFrame(
        sorted({
            (_.strftime('%Y-%m'), monthrange(_.year, _.month)[1])
            for _
            in dates
        }),
        columns=['month', 'days'])
)
monthly_user = (
    monthly_user
    .join(month_days, monthly_user['month'] == month_days['month'])
    .select('user_id', monthly_user['month'], 'count', 'days')
    .withColumn('access_rate', F.col('count') / F.col('days'))   # days accessed / days in month
)
monthly_access_rate = monthly_user.groupBy('month').agg(F.avg('access_rate').alias('access_rate'))
user_access_rate = monthly_user.groupBy('user_id').agg(F.avg('access_rate').alias('access_rate'))
The month_days data frame contains the number of days in each month and is joined to monthly_user to calculate the access_rate; an access_rate of 1 means the user accessed the portal/app on every single day of the corresponding month.
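As an aside, if the list of dates is not handy, the same lookup could be derived entirely in Spark from date_user using last_day and dayofmonth; a sketch of that alternative (month_days_alt is just an illustrative name):
month_days_alt = (
    date_user
    .select(
        'month',
        F.dayofmonth(F.last_day('date')).alias('days')   # number of days in that month
    )
    .distinct()
)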
From the monthly access rate above, we can see that users’ access rates change over time. Looking at a whole year, the access rate at the beginning of the year may not be as important as the rate late in the year, so calculating an individual user’s overall access rate by simply averaging the monthly rates cannot reflect the latest behaviour.
To give recent months more weight than earlier ones, we apply a time-decayed weighting: the latest month gets a weight of 1, and each earlier month gets half the weight of the one after it, i.e. 50% for the second latest and 25% for the third latest.
def decayAccessRate(df):
    # Latest month first, so the divisors 1, 2, 4, 8... line up with recency
    df = df.sort_values('month', ascending=False)
    df['weight'] = [2**_ for _ in range(len(df))]
    df['decay_rate'] = df['access_rate'] / df['weight']
    # Weighted average: sum(access_rate / 2^i) / sum(1 / 2^i)
    return pd.DataFrame({
        'user_id': df['user_id'].unique(),
        'decay_rate': sum(df['decay_rate']) / sum(1 / df['weight'])
    })
monthly_decay_rate = (
monthly_user
.groupBy('user_id')
.applyInPandas(decayAccessRate, schema='user_id string, decay_rate float')
)
This is calculated by applying a UDF with the applyInPandas function after grouping by user_id. For each user_id group, we sort the dataframe by month in descending order, create a weight column with the sequence 1, 2, 4, 8…, and then calculate the decay_rate as the weighted average of the monthly access rates.
The new monthly_decay_rate data frame gives a decay_rate that puts more emphasis on recent access. The 1/2 decay weighting may not suit everyone’s taste, and it can easily be changed in the UDF to whatever weighting is appropriate, as sketched below.
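For instance, a lightly generalised version of the UDF (a sketch, with a hypothetical decay parameter; 0.5 reproduces the behaviour above) could look like this:
from functools import partial

def decayAccessRateWith(decay, df):
    df = df.sort_values('month', ascending=False)
    df['weight'] = [decay ** _ for _ in range(len(df))]   # 1, decay, decay**2, ...
    return pd.DataFrame({
        'user_id': df['user_id'].unique(),
        'decay_rate': (df['access_rate'] * df['weight']).sum() / df['weight'].sum()
    })

monthly_decay_rate_alt = (
    monthly_user
    .groupBy('user_id')
    .applyInPandas(partial(decayAccessRateWith, 0.75),
                   schema='user_id string, decay_rate float')
)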