Converting event_list And post_event_list In Adobe Analytics Data Feed

The event_list and post_event_list columns in Adobe Analytics Data Feed contain all events sent into Adobe Analytics in a concatenated and encoded format like:

20200=1609173893270,20201=1609173893457,20202=1609173893457,20203=1609173893457,20204=1609173893457,20205=1609173893485,20206=1609173893485,20207=1609173893503,20208=1609173893513,20209=1609173893625,20210=1609173894224,20211=1609173894224,20212=1609173894238,20213=1609173895613,20214=1609173895614,20215=1609173895629,20216=187,20217=187,20218=187,20219=187,20220=215,20221=215,20222=233,20223=243,20224=355,20225=954,20226=954,20227=968,20228=2343,20229=2344,20230=2359,100,101,102,103,104,105,106,112

They are difficult to use directly: the raw code listing is unnatural to read, it must be decoded with the event.tsv lookup file, and further mapping is required for custom dimensions and metrics.

It will be much easier to use in any further analysis if we map those codes to meaningful, interpretable names and convert them into individual columns per event.


To start the mapping, we have to read in the event.tsv as a pandas DataFrame.

import pandas as pd

event_tsv_pd = pd.read_csv(
    "event.tsv",
    sep='\t',
    names=['event_code', 'event_name'],
    dtype={'event_code': str, 'event_name': str},
)

The event.tsv file has no header row, so we assign the column names event_code and event_name. We also cast event_code to str so that all later manipulation and lookups treat the code as a string; since event_code is never used in any calculation or numerical context, treating it as a string is the better choice.

The resulting event_tsv_pd DataFrame can be inspected with:

event_tsv_pd

Then we create a custom function to split the event_list content and map it using event_tsv_pd. The first parameter selects which column to convert ('event_list' or 'post_event_list'), and the second parameter allows dropping any events from the list that do not require conversion.

@F.udf
def convert_event_list(evt_list, drop=None):
    events = pd.DataFrame([event.split('=') for event in evt_list.split(',')])  # split by ',' then '='
    if events.shape[1] == 1:
        events[1] = None  # no event carried a value; add an empty value column
    events.columns = ['event_code', 'event_value']
    if drop is not None:
        events = events[~events['event_code'].isin(drop)]
    events = (
        events
        .merge(event_tsv_pd)  # inner join on event_code
    )

    return ','.join(events[['event_name', 'event_value']].apply(lambda x: '='.join(list(filter(None, x))), axis=1))

Line 3 splits the event_list content by ',' and '=' into event code and event value. Lines 4 to 5 ensure the events DataFrame has two columns even in the case where no event carries an assigned value.

Lines 7 to 8 drop any event codes that are not required for conversion.

Lines 9 to 12 merge the events DataFrame with event_tsv_pd on event_code, the common column of both DataFrames.

Finally, line 14 returns the reassembled event_list content using the newly mapped names and the original event values.
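The core mapping logic can be exercised outside Spark with plain pandas. The sketch below is a hypothetical, Spark-free version of the same function, with a made-up two-row lookup table standing in for the real event.tsv:

```python
import pandas as pd

# Hypothetical lookup table mimicking event.tsv (codes and names are made up).
event_tsv_pd = pd.DataFrame({
    'event_code': ['20200', '100'],
    'event_name': ['Custom Event 201', 'Instance of eVar1'],
})

def convert_event_list_plain(evt_list, drop=None):
    # Split "code=value" pairs; events without '=' yield a single element,
    # which pandas pads with None in the second column.
    events = pd.DataFrame([event.split('=') for event in evt_list.split(',')])
    if events.shape[1] == 1:
        events[1] = None  # no event carried a value at all
    events.columns = ['event_code', 'event_value']
    if drop is not None:
        events = events[~events['event_code'].isin(drop)]
    events = events.merge(event_tsv_pd)  # inner join on event_code
    # Reassemble "name=value", omitting '=' when there is no value.
    return ','.join(
        events[['event_name', 'event_value']]
        .apply(lambda x: '='.join(list(filter(None, x))), axis=1)
    )

print(convert_event_list_plain('20200=1609173893270,100'))
# Custom Event 201=1609173893270,Instance of eVar1
```

Testing the logic this way is handy because a pandas inner merge preserves the order of the left keys, so the events come back in their original sequence.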

The function can be used in DataFrame.select like:

(
    hit_data
    .select(convert_event_list('event_list'), convert_event_list('post_event_list'))
    .show(truncate=False, vertical=True)
)

and produces more meaningful and interpretable event_list content like:

Custom Event 201=1609173893270,Custom Event 202=1609173893457,Custom Event 203=1609173893457,Custom Event 204=1609173893457,Custom Event 205=1609173893457,Custom Event 206=1609173893485,Custom Event 207=1609173893485,Custom Event 208=1609173893503,Custom Event 209=1609173893513,Custom Event 210=1609173893625,Custom Event 211=1609173894224,Custom Event 212=1609173894224,Custom Event 213=1609173894238,Custom Event 214=1609173895613,Custom Event 215=1609173895614,Custom Event 216=1609173895629,Custom Event 217=187,Custom Event 218=187,Custom Event 219=187,Custom Event 220=187,Custom Event 221=215,Custom Event 222=215,Custom Event 223=233,Custom Event 224=243,Custom Event 225=355,Custom Event 226=954,Custom Event 227=954,Custom Event 228=968,Custom Event 229=2343,Custom Event 230=2344,Custom Event 231=2359,Instance of eVar1,Instance of eVar2,Instance of eVar3,Instance of eVar4,Instance of eVar5,Instance of eVar6,Instance of eVar7,Instance of eVar13

It would be even better if we could further convert those 'Custom Event XXX' and 'Instance of eVarYYY' entries to the names shown in the Adobe Analytics user interface. This requires mapping them to the names defined in Adobe Analytics for the corresponding report suite.

Of course, we have to prepare the additional mapping tables for custom events and custom dimensions. We only use custom events and eVars (no props), and since the custom events and eVars are standardized across all report suites, exactly two mapping tables are needed.

To keep things simple and consistent, the mapping tables for custom events and eVars are also headerless tsv files, and can be read into custom_evar_pd and custom_event_pd with code similar to the above.

custom_evar_pd = pd.read_csv(
    "evar.tsv",
    sep='\t',
    names=['custom_evar_code', 'custom_evar_name'],
    dtype={'custom_evar_code': str, 'custom_evar_name': str},
)
# This is the custom event name mapping, a separate file from the
# Data Feed's own event.tsv lookup read earlier.
custom_event_pd = pd.read_csv(
    "custom_event.tsv",
    sep='\t',
    names=['custom_event_code', 'custom_event_name'],
    dtype={'custom_event_code': str, 'custom_event_name': str},
)
The resulting custom_evar_pd DataFrame looks like:

custom_evar_pd

However, reading the mapping tables is not enough: custom_evar_code and custom_event_code are plain codes that cannot be matched against 'Custom Event XXX' and 'Instance of eVarYYY'. To make the DataFrames joinable, we create a mapping column whose contents match 'Custom Event XXX' and 'Instance of eVarYYY'.

custom_evar_pd['event_name'] = [f"Instance of eVar{v}" for v in custom_evar_pd['custom_evar_code']]
custom_evar_pd['custom_evar_name'] = custom_evar_pd['custom_evar_name'] + ' (v' + custom_evar_pd['custom_evar_code'] + ')'
custom_event_pd['event_name'] = [f"Custom Event {e}" for e in custom_event_pd['custom_event_code']]
custom_event_pd['custom_event_name'] = custom_event_pd['custom_event_name'] + ' (e' + custom_event_pd['custom_event_code'] + ')'
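The effect of these four lines can be seen on a toy table. The sketch below uses made-up eVar names; the real values come from the tsv files:

```python
import pandas as pd

# Toy mapping table with made-up names, standing in for evar.tsv.
custom_evar_pd = pd.DataFrame({
    'custom_evar_code': ['1', '13'],
    'custom_evar_name': ['Site Section', 'Experience'],
})

# Build the join key matching the "Instance of eVarYYY" strings produced by
# the earlier mapping, and append the code so the final name stays unambiguous.
custom_evar_pd['event_name'] = [f"Instance of eVar{v}" for v in custom_evar_pd['custom_evar_code']]
custom_evar_pd['custom_evar_name'] = (
    custom_evar_pd['custom_evar_name'] + ' (v' + custom_evar_pd['custom_evar_code'] + ')'
)

print(custom_evar_pd)
```

After this step the event_name column can be merged directly against the output of the first mapping pass.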

After adding the new columns, custom_evar_pd becomes:

custom_evar_pd

Then convert_event_list is amended to add the additional mappings against the new custom_evar_pd and custom_event_pd DataFrames.

@F.udf
def convert_event_list(evt_list, drop=None):
    events = pd.DataFrame([event.split('=') for event in evt_list.split(',')])
    if events.shape[1] == 1:
        events[1] = None

    events.columns = ['event_code', 'event_value']
    if drop is not None:
        events = events[~events['event_code'].isin(drop)]
    events = (
        events
        .merge(event_tsv_pd)
        .merge(custom_evar_pd.drop('custom_evar_code', axis=1), how='left')
        .merge(custom_event_pd.drop('custom_event_code', axis=1), how='left')
    )
    events['final'] = events['custom_evar_name'].fillna(events['custom_event_name']).fillna(events['event_name'])

    return ','.join(events[['final', 'event_value']].apply(lambda x: '='.join(list(filter(None, x))), axis=1))

Lines 10 to 15 expand the merges with the additional mappings against the new DataFrames, dropping the original custom_evar_code and custom_event_code columns, which are no longer necessary.

Line 16 chains two fillna calls to get the final event name: at most one of the mapped custom_evar_name and custom_event_name has a value, and both are empty when the event is an Adobe default event.
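The fillna chain's precedence can be checked on a toy DataFrame. The rows and names below are hypothetical; each row illustrates one of the three cases (custom event match, eVar match, Adobe default event):

```python
import pandas as pd

# After the two left merges, each row has at most one custom name set,
# or both are missing for an Adobe default event.
events = pd.DataFrame({
    'event_name':        ['Custom Event 201', 'Instance of eVar1', 'Purchase'],
    'custom_event_name': ['Revenue (e201)',   None,                None],
    'custom_evar_name':  [None,               'Site Section (v1)', None],
})

# Chained fillna picks the first non-null name, falling back to event_name.
events['final'] = (
    events['custom_evar_name']
    .fillna(events['custom_event_name'])
    .fillna(events['event_name'])
)
print(list(events['final']))
# ['Revenue (e201)', 'Site Section (v1)', 'Purchase']
```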

So the final mapped event_list content becomes the following, with the names of the actual custom dimensions and events masked for privacy.

N…t (e201)=1609173893270,F…t (e202)=1609173893457,D…t (e203)=1609173893457,D…d (e204)=1609173893457,C…t (e205)=1609173893457,C…d (e206)=1609173893485,R…t (e207)=1609173893485,R…t (e208)=1609173893503,R…d (e209)=1609173893513,D…g (e210)=1609173893625,D…e (e211)=1609173894224,D…t (e212)=1609173894224,D…d (e213)=1609173894238,D…e (e214)=1609173895613,L…t (e215)=1609173895614,L…d (e216)=1609173895629,F…t (e217)=187,D…t (e218)=187,D…t (e219)=187,C…t (e220)=187,C…t (e221)=215,R…t (e222)=215,R…t (e223)=233,R…t (e224)=243,D…t (e225)=355,D…t (e226)=954,D…t (e227)=954,D…t (e228)=968,D…t (e229)=2343,L…t (e230)=2344,L…t (e231)=2359,S…e (v1),S…e (v2),S…y (v3),S…s (v4),P…e (v5),P…L (v6),T…X (v7),E…e (v13)

That is not the end yet. As mentioned at the beginning, mapping is only half the job; it is better to break the event_list content into individual columns, one per event.

Let’s forget all the mapping above and start with the originally coded event_list.

from pyspark.sql.types import FloatType, MapType, StringType

@F.udf(returnType=MapType(StringType(), FloatType()))
def split_event_list(event_list, prefix=''):
    result = {}
    for event in event_list.split(','):
        if "=" in event:
            event_parts = event.split('=')
            result[f'{prefix}{event_parts[0].strip()}'] = float(event_parts[1])
        else:
            # No explicit value means the event fired once.
            result[f'{prefix}{event.strip()}'] = 1.0

    return result

This custom function converts the event_list column from a string column into a Map column with the event code as key and the event value as value, assigning 1.0 to events without an explicitly assigned value. The prefix parameter distinguishes the 'event_list' and 'post_event_list' columns when necessary.
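The parsing logic is plain Python under the UDF decorator, so it can be sketched and tested without Spark. The input string below is a shortened, made-up fragment of an event_list:

```python
# The same parsing logic as split_event_list, as a plain function.
def split_event_list_plain(event_list, prefix=''):
    result = {}
    for event in event_list.split(','):
        if '=' in event:
            code, value = event.split('=')
            result[f'{prefix}{code.strip()}'] = float(value)
        else:
            # No explicit value means the event fired once.
            result[f'{prefix}{event.strip()}'] = 1.0
    return result

print(split_event_list_plain('20200=187,100,101', prefix='pe_'))
# {'pe_20200': 187.0, 'pe_100': 1.0, 'pe_101': 1.0}
```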

(
    hit_data
    .withColumn('pel', split_event_list('post_event_list', F.lit('pe_')))
    .show(truncate=False, vertical=True)
)

The 'pel' column content will become:

{pe_20213 -> 1.60917396E12, pe_20214 -> 1.60917396E12, pe_20215 -> 1.60917396E12, pe_20216 -> 187.0, pe_20210 -> 1.60917396E12, pe_20211 -> 1.60917396E12, pe_20212 -> 1.60917396E12, pe_20230 -> 2359.0, pe_112 -> 1.0, pe_20217 -> 187.0, pe_20218 -> 187.0, pe_20219 -> 187.0, pe_20224 -> 355.0, pe_20202 -> 1.60917396E12, pe_20225 -> 954.0, pe_20203 -> 1.60917396E12, pe_20226 -> 954.0, pe_20204 -> 1.60917396E12, pe_20227 -> 968.0, pe_20205 -> 1.60917396E12, pe_20220 -> 215.0, pe_20221 -> 215.0, pe_20222 -> 233.0, pe_20200 -> 1.60917396E12, pe_20223 -> 243.0, pe_20201 -> 1.60917396E12, pe_102 -> 1.0, pe_200 -> 1.0, pe_101 -> 1.0, pe_104 -> 1.0, pe_103 -> 1.0, pe_106 -> 1.0, pe_105 -> 1.0, pe_20228 -> 2343.0, pe_20206 -> 1.60917396E12, pe_20229 -> 2344.0, pe_20207 -> 1.60917396E12, pe_100 -> 1.0, pe_20208 -> 1.60917396E12, pe_20209 -> 1.60917396E12}

Then we need to 'cast' the Map column into separate columns, one per key.

keys = (
    hit_data
    .select(F.explode('pel'))
    .select('key')
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect()
)

(
    hit_data
    .withColumns(dict([(k, F.col("pel").getItem(k).alias(k)) for k in keys]))
    .show()
)

There is no direct way to 'cast' a Map column into multiple columns.

Lines 1 to 8 consolidate the distinct keys of the Map column to determine how many output columns are needed and what their names are.

Lines 10 to 14 use those keys to build a dictionary of (key, Column) pairs for the withColumns function, which creates the columns in the DataFrame.
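The collect-keys-then-widen idea can be mirrored with plain Python dicts. The sketch below assumes the per-row maps have already been collected locally (the codes and values are made up):

```python
# Two hypothetical rows of the 'pel' Map column, collected as dicts.
rows = [
    {'pe_20200': 187.0, 'pe_100': 1.0},
    {'pe_100': 1.0, 'pe_101': 1.0},
]

# Step 1: consolidate the distinct keys (the future column names),
# mirroring the explode/distinct/collect chain.
keys = sorted({k for row in rows for k in row})

# Step 2: one value per key per row, None where the event did not fire,
# mirroring getItem in withColumns.
wide = [{k: row.get(k) for k in keys} for row in rows]
print(keys)
print(wide)
```

The Spark version does the same thing distributed: explode finds the keys, and getItem returns null for rows where a key is absent.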

Finally, combining the mapping and the splitting is straightforward: use convert_event_list to map the event_list to meaningful, interpretable names first, then split it into multiple columns with the approach above.

hit_data = (
    hit_data
    .select(convert_event_list('post_event_list').alias('post_event_list'))
    .withColumn('pel', split_event_list('post_event_list'))
)

keys = (
    hit_data
    .select(F.explode('pel'))
    .select('key')
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect()
)

(
    hit_data
    .withColumns(dict([(k, F.col("pel").getItem(k).alias(k)) for k in keys]))
    .show()
)

Now, the data is clean and easy to use for further analysis.
