Testing and personalisation are crucial aspects of modern business and marketing strategies. In today’s rapidly evolving digital landscape, where competition is fierce and consumer expectations are high, businesses need to ensure their products, services, customer experience, and marketing efforts are effective and efficient.
Every testing and personalisation activity should have a clear goal of what it wants to achieve, one or more KPIs with proper tracking, and a statistical calculation to identify which variant performed significantly better against that goal.
Using Adobe Target alone, we can generate reports within Adobe Target with predefined measurements to identify the winning experience. Integrating with Adobe Analytics via A4T (Analytics for Target) opens up the possibility of identifying winning experiences with any metric available in Adobe Analytics, without configuring it in Adobe Target first. We can select (almost) any metric in the A4T panel for the performance analysis of any testing and personalisation activity.
However, there is still a shortfall in the A4T integration: we cannot use calculated metrics in the A4T panel. That could be very useful in many cases, such as measuring success by the page views of a set of pages.
Moreover, the default line chart in the A4T panel shows each experience’s performance at daily, weekly, or other configured granularity, which is less interesting than seeing the accumulated performance from the start of the activity.
How A4T calculates success
Adobe has documentation on the dimensions/metrics that become available from the A4T integration, and even on how to perform offline calculations, in How Do I Use Reporting in A4T? | Adobe Target. There is also a page detailing the statistical calculation of Welch’s t-test being used, How Can I Learn About the Statistical Calculations Used in A/B Test Activities? | Adobe Target. Both pages have links to download an Excel file to perform such calculations offline.
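For reference, Welch’s t-test compares the means of two samples without assuming equal variances; the test statistic and its approximate degrees of freedom (Welch–Satterthwaite) are:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
$$

where $\bar{x}_i$, $s_i^2$, and $n_i$ are the mean, variance, and sample size of each experience.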
Unfortunately, things are not that easy, and the process is less than perfect.
The objective is to calculate testing results based on some calculated metrics. Those calculated metrics need to be associated with individual visitors by visitor ID. There is currently no way to export both pieces of data from Adobe Analytics at the same time: calculated metrics are only available via the Adobe Analytics UI and API, where no visitor ID is available, while visitor IDs are only available via Data Warehouse and Data Feed, which do not support calculated metrics. Therefore, the only option is an offline calculation using raw or semi-raw data with the following steps:
- Get visitor-level data with Visitor IDs using Data Warehouse or Data Feed.
- Perform the calculated metrics logic offline.
- Perform the statistics calculation for the Target result.
The calculation
First, let’s try to do this with Data Warehouse.
Configuring a Data Warehouse request is simple: select all the information required to identify individual visitors, to recalculate the calculated metrics, and to identify the Target activity/experience. Moreover, since I want to calculate a running result from the start of the Target activity, I also configure the Data Warehouse request with daily granularity so the date column is included.
In general, I prefer to look at testing results at the visitor level, instead of the visit or impression levels available as normalizing metrics in the A4T panel. This is simply because the same visitor will see the same allocated experience within the Target profile expiry period.
We should use “Visitor ID” to identify each visitor instead of the “Experience Cloud ID” (ECID). The “Visitor ID” is the final ID used by Adobe Analytics, whereas the ECID has only been available since the Experience Cloud Identity Service launched. However, since the Identity Service launched quite some time ago, I believe most Adobe customers have been using it for quite a while as well. The two should be almost identical, but let’s stick with the recommendation anyway.
In this example, I want to measure the sample testing activity by the page views on some specific pages, so the Data Warehouse output has the following columns:
- Date
- Target Activity > Experience
- Visitor_ID
- Page
- Page Views
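As a first step, the Data Warehouse report can be loaded into a Pandas data frame. A minimal sketch, assuming the export is a CSV file; the file name is hypothetical, and the renaming simply maps the five export columns to the shorter names used in the code below:

import pandas as pd

# Hypothetical export file name from the Data Warehouse request.
df = pd.read_csv('dw_export.csv')
# Map the five export columns to the shorter names used below.
df.columns = ['date', 'experience', 'visitor_id', 'page', 'pv']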
# Keep only the relevant testing experiences and flag the page views
# belonging to the pages being measured (target_pages is the
# predefined list of those specific pages).
df['date'] = pd.to_datetime(df['date'])
df = df[df['experience'].isin(['Default', 'New'])]
df = df[df['page'].notna()]
df['match_target'] = df['page'].str.contains('|'.join(target_pages))
df['pv'] = df['pv'].mask(~df['match_target'], 0)
Apply some simple data operations to keep only the relevant testing experiences (“Default” and “New”), and to count page views only for the specific pages being measured, as specified in the target_pages list.
There is no need to implement Welch’s t-test from the ground up, as the Python module scipy.stats already has it implemented. We need to:
- Loop through each day the testing activity has been running.
- On each iteration, group the data into only three columns: experience, visitor ID, and page views.
- Perform the t-test to check if there is a significant difference between experiences.
import numpy as np
from scipy import stats

df_result = pd.DataFrame()
for date in df.date.unique():
    # Accumulate data from the start of the activity up to this date.
    df_target_result = df[df.date <= date]
    df_target_result = (
        df_target_result
        .groupby(['experience', 'visitor_id'])
        .agg({'pv': 'sum'})
        .reset_index()
    )
    pv_default = df_target_result[
        df_target_result['experience'] == 'Default']['pv'].tolist()
    pv_new = df_target_result[
        df_target_result['experience'] == 'New']['pv'].tolist()
    # Welch's t-test (unequal variances assumed).
    test_result = stats.ttest_ind(pv_default, pv_new, equal_var=False)
    # 95% confidence interval of the mean page views per experience.
    default_ci = stats.t.interval(
        .95,
        df=len(pv_default) - 1,
        loc=np.mean(pv_default),
        scale=np.std(pv_default) / np.sqrt(len(pv_default))
    )
    new_ci = stats.t.interval(
        .95,
        df=len(pv_new) - 1,
        loc=np.mean(pv_new),
        scale=np.std(pv_new) / np.sqrt(len(pv_new))
    )
    df_result = pd.concat(
        [
            df_result,
            pd.DataFrame({
                'date': [date],
                'default_n': [len(pv_default)],
                'default_pv': [np.sum(pv_default)],
                'new_n': [len(pv_new)],
                'new_pv': [np.sum(pv_new)],
                'lift': [(np.average(pv_new) - np.average(pv_default)) / np.average(pv_default)],
                'pvalue': [test_result[1]],
                'default_low': [default_ci[0]],
                'default_high': [default_ci[1]],
                'new_low': [new_ci[0]],
                'new_high': [new_ci[1]]
            })
        ],
        ignore_index=True
    )
df_result['confidence'] = 1 - df_result['pvalue']
Some additional information is created alongside the t-test result: the lift and the confidence intervals of page views, for better visualization of the testing activity at the end, and running totals of visitors and page views as a sanity check that the data looks right.
With the df_result, we can plot the result over time.
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 5))

# Upper chart: lift (left axis) and confidence (right axis) over time.
ax1.plot(df_result['date'], df_result['lift'], label='Lift', color='blue')
ax1.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1, decimals=0))
ax1.set_ylabel('Lift')
ax1.legend(loc='upper left', bbox_to_anchor=(0.0, 1.2))
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)

ax1_confidence = ax1.twinx()
ax1_confidence.plot(df_result['date'], df_result['confidence'], label='Confidence', color='red')
ax1_confidence.axhline(y=0.95, color='red', linestyle='dotted')  # 95% confidence level
ax1_confidence.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1, decimals=0))
ax1_confidence.set_ylabel('Confidence')
ax1_confidence.legend(loc='upper right', bbox_to_anchor=(1, 1.2))
ax1_confidence.xaxis.set_visible(False)
ax1_confidence.yaxis.set_visible(False)

# Lower chart: confidence intervals of page views per experience.
ax2.fill_between(df_result['date'], df_result['default_low'], df_result['default_high'], color='blue', alpha=0.3, label='default')
ax2.fill_between(df_result['date'], df_result['new_low'], df_result['new_high'], color='orange', alpha=0.3, label='new')
ax2.set_ylabel('pageviews')
ax2.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=2)
ax2.xaxis.set_visible(False)
ax2.yaxis.set_visible(False)

plt.tight_layout()
plt.show()
All x- and y-axes are hidden for sensitivity. The dotted line on the upper chart is the 95% confidence level. We can see this testing activity has a promising result, where the new experience outperforms the default experience.
Turning it into a binary calculation
Here comes a question: should we simply compare these two experiences by the number of page views? The new experience in this test is a new UI design that hopes to drive more visitors to those specific pages. It seems irrelevant how many of those pages a visitor viewed, as long as they viewed them at all. This turns it into a binary conversion test.
To start, we can change the page views to 1 if there is more than 0, and keep 0 otherwise.
df_target_result["pv"] = df_target_result["pv"].apply(lambda x: 1 if x > 0 else x)
This makes the result column binary. Then run the same looping, calculation, and plotting again to see if there are any changes.
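For context, a minimal sketch of where that conversion sits: inside the daily loop, right after the groupby step, with the rest of the t-test code running unchanged.

for date in df.date.unique():
    df_target_result = df[df.date <= date]
    df_target_result = (
        df_target_result
        .groupby(['experience', 'visitor_id'])
        .agg({'pv': 'sum'})
        .reset_index()
    )
    # Per-visitor page views become a 0/1 conversion flag; the t-test,
    # confidence intervals, and concat from above follow unchanged.
    df_target_result['pv'] = df_target_result['pv'].apply(lambda x: 1 if x > 0 else x)
    ...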
There are changes, and the result is even more prominent. The gap between the confidence intervals of the default and new experiences is wider, and the confidence flattens at the top earlier. Note that the confidence interval is now on the conversion rate instead of page views.
Using a proportion test
Turning the testing objective from a numeric value into a binary value changes the fundamental question: we are now comparing proportions rather than means. The difference can be calculated via a proportion test instead of a t-test. We could perform the proportion test on Adobe Analytics data directly, as there is no need for individual visitor information. However, doing this in Adobe Analytics could be troublesome, so let’s continue with the offline calculation.
We first produce a crosstab of experience and converted/non-converted counts, then run the z-test (standard score test).
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

df_result = pd.DataFrame()
for date in df.date.unique():
    # Accumulate data from the start of the activity up to this date.
    df_target_result = df[df.date <= date]
    df_target_result = (
        df_target_result
        .groupby(['experience', 'visitor_id'])
        .agg({'pv': 'sum'})
        .reset_index()
    )
    df_target_result.columns = ['experience', 'visitor_id', 'converted']
    df_target_result['converted'] = df_target_result['converted'].apply(
        lambda x: 'yes' if x > 0 else 'no')
    num_current = df_target_result[df_target_result.experience == 'Default'].shape[0]
    num_new = df_target_result[df_target_result.experience == 'New'].shape[0]
    # Converted visitor counts per experience; the crosstab rows are
    # sorted alphabetically, so index 0 is 'Default' and index 1 is 'New'.
    converted = pd.crosstab(
        df_target_result.experience,
        df_target_result.converted)['yes'].tolist()
    # Two-sided z-test on the two conversion proportions.
    zstats, pvalue = proportions_ztest(
        converted,
        [num_current, num_new],
        alternative='two-sided'
    )
    # 95% confidence interval of each conversion rate.
    default_ci = proportion_confint(converted[0], num_current)
    new_ci = proportion_confint(converted[1], num_new)
    df_result = pd.concat(
        [
            df_result,
            pd.DataFrame({
                'date': [date],
                'default_n': [num_current],
                'default_converted': [converted[0]],
                'new_n': [num_new],
                'new_converted': [converted[1]],
                'lift': [(converted[1] / num_new - converted[0] / num_current) / (converted[0] / num_current)],
                'pvalue': [pvalue],
                'default_low': [default_ci[0]],
                'default_high': [default_ci[1]],
                'new_low': [new_ci[0]],
                'new_high': [new_ci[1]]
            })
        ],
        ignore_index=True
    )
df_result['confidence'] = 1 - df_result['pvalue']
The z-test is performed with the Python module statsmodels.stats.proportion. The df_result data frame shares the same structure as the t-test example above, so we can plot the running result with the same code.
The result is almost identical to the binary t-test. The counts, conversions, and lift remain the same, as they do not depend on the statistical model. The p-value and confidence intervals are so close that the differences from the binary t-test are only about 0.01%, which we can’t tell apart in the plot.
Using Data Feed
Using Data Warehouse data could run into file size limitations if the testing runs for a long time, or if the calculated metrics involve lots of dimensions/metrics. In that case, we turn to the data feed as the data source for the calculation.
I am not going to repeat the data feed configuration here. As a reminder, the data feed is the clickstream raw data of Adobe Analytics. It is huge, usually managed on a data lake, and accessed with Apache Spark or Databricks. Its structure is also different from what we get from the data warehouse. There is no t-test or proportion test in the PySpark module, so we can only use PySpark to access the data and compute the calculated metrics required, convert the final, small dataset into a Pandas data frame, and then perform the statistical calculations and tests with the scipy and statsmodels modules as above.
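A minimal PySpark sketch of that flow might look like the following. The storage path is hypothetical, the column names (post_visid_high, post_visid_low, post_tnt, post_pagename, post_page_event, date_time) are the standard data feed columns I would expect to use here, and target_pages_regex stands in for the same list of measured pages used earlier; adjust all of these to your own feed and calculated-metric logic.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical location of the data feed files on the data lake.
feed = spark.read.parquet("/datalake/adobe/datafeed/")

df_spark = (
    feed
    # Keep only hits associated with a Target activity.
    .filter(F.col("post_tnt").isNotNull() & (F.col("post_tnt") != ""))
    # post_page_event == 0 denotes a standard page view hit.
    .filter(F.col("post_page_event") == "0")
    .withColumn("visitor_id",
                F.concat_ws("_", "post_visid_high", "post_visid_low"))
    .withColumn("date", F.to_date("date_time"))
    # Count page views on the measured pages per visitor per day.
    .withColumn("pv",
                F.when(F.col("post_pagename").rlike(target_pages_regex), 1)
                 .otherwise(0))
    .groupBy("date", "post_tnt", "visitor_id")
    .agg(F.sum("pv").alias("pv"))
)

# The aggregated result is small enough to bring back into Pandas
# for the same scipy/statsmodels calculations as above.
df = df_spark.toPandas()

The experience still needs to be derived from the post_tnt value, which is covered next.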
The thing we need to know is where to find the A4T information in the data feed, and Adobe has that well documented too, in Where Can I Find Information About Classifications with A4T? | Adobe Target. We only need the “post_tnt” column from the data feed to understand which Adobe Target activity is associated with the hit. The “post_tnt_action” column provides a bit more interesting information with the “event”, i.e. whether it is an activity impression, visit, conversion, etc. The “post_tnt” value is coded with the activity ID, which can be found in the Adobe Target UI. The experience ID is simply a zero-based serial number, with 0 as the first experience defined in Adobe Target.
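As a rough illustration, assuming each post_tnt entry is colon-separated with the activity ID first and the experience ID second (and comma-separated when a hit belongs to several activities), the sample activity could be isolated in the Pandas data frame like this; the activity ID 123456 and the experience labels are placeholders:

# 123456 is a placeholder for the activity ID from the Adobe Target UI.
df = df[df['post_tnt'].str.contains('123456:')]
# Assumed entry format: "activityId:experienceId:...".
df['experience_id'] = df['post_tnt'].str.extract(r'123456:(\d+)')[0].astype(int)
# Experience IDs are zero-based in the order defined in Adobe Target.
df['experience'] = df['experience_id'].map({0: 'Default', 1: 'New'})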
It is important to identify the type of testing and then the appropriate type of statistical test to get the result. When the measurement is about numeric values, such as order size, revenue, or time spent on pages, we can use the basic t-test. However, if the testing is more about conversion and a yes/no outcome, a binary t-test or proportion test could be more suitable and provide a better result.