import pandas as pd, numpy as np
df = pd.read_csv('data/Final_Project_Baseline_Values.csv',
index_col = False,
header = None,
names= ['Metric_Definition', 'Baseline_Val'] )
df
# Labelling the metrics for ease of reference
df['Metric'] = ['Cookies', 'Clicks', 'Enrollments (Clicks * Gross_Conversion)',
'CTP (Clicks/Cookies)',
'Gross_Conversion (Enrollments/Clicks)',
'Retention (Paid/Enrollments)','Net_Conversion(Paid/Clicks)']
df
# Add Practical Significance level column
df['dmin'] = [3000,240,50,0.01,0.01,0.01,0.0075]
df
For all the calculations that follow, we need to scale our collected count estimates of the metrics to the sample size we specified for variance estimation: in this case, from 40,000 unique cookies visiting the course overview page per day down to 5,000.
In order to estimate variance analytically, we can assume that metrics which are probabilities ($\hat{p}$) are binomially distributed, so we can use this formula for the standard deviation:
$$ SD = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$
This assumption is only valid when the unit of diversion of the experiment is equal to the unit of analysis (the denominator of the metric formula). When this is not the case, the actual variance might be different and it is recommended to estimate it empirically.
For each metric, we need to plug two variables into the formula:
$ \hat{p} $ = baseline probability of the event to occur
n = sample size
#Predetermined from baseline values
ctp = 0.08
gross_conv = 0.206250
retention = 0.53
net_conv = 0.109313
#scale data based on sample size 5000:
sum_cookies = 5000
sum_clicks = ctp * sum_cookies # n size for Gross Conversion & Net Conversion
sum_enrolled = sum_clicks * gross_conv # n size for Retention
# Calculate SD for evaluation metrics:
sd_gross_conv = round(np.sqrt((gross_conv * (1-gross_conv))/sum_clicks),4)
sd_retention = round(np.sqrt((retention * (1-retention))/sum_enrolled),5)
sd_net_conv = round(np.sqrt((net_conv * (1-net_conv))/sum_clicks),4)
df["SD"] = ['NA','NA', 'NA', 'NA', sd_gross_conv, sd_retention, sd_net_conv]
df
To calculate the sample size required for the experiment, we take the largest sample size required by any of the evaluation metrics. I plug the following values into the online calculator for sample size:
df_eval = df.iloc[4:, 1:].copy()
df_eval.set_index('Metric', inplace = True)
df_eval.rename(columns = {'Baseline_Val': 'p'}, inplace = True)
# Sample size using the online calculator
df_eval['sample_size'] = [25835, 39155, 27413]
df_eval
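As an optional cross-check (not part of the original workflow), the calculator values can be approximated analytically with the standard two-proportion sample-size formula, assuming alpha = 0.05 and power = 0.8; the calculator's procedure differs slightly, so expect small discrepancies.
# Optional cross-check (assumes alpha = 0.05, power = 0.8 were used in the online calculator)
from scipy.stats import norm

def approx_sample_size(p, dmin, alpha=0.05, power=0.8):
    z_alpha = norm.ppf(1 - alpha / 2)                             # two-sided critical value
    z_beta = norm.ppf(power)
    sd_null = np.sqrt(2 * p * (1 - p))                            # under H0: both groups at p
    sd_alt = np.sqrt(p * (1 - p) + (p + dmin) * (1 - p - dmin))   # under H1: p and p + dmin
    return int(np.ceil(((z_alpha * sd_null + z_beta * sd_alt) / dmin) ** 2))

for metric, p, dmin in [('Gross_Conversion', gross_conv, 0.01),
                        ('Retention', retention, 0.01),
                        ('Net_Conversion', net_conv, 0.0075)]:
    print(metric, approx_sample_size(p, dmin))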
Retention
Total cookies required in order to have 39155 enrollments per group (control and experiment):
$$ \frac{Enrollments \times 2}{GrossConversion \times ctp} $$
Net Conversion
Total cookies required in order to have 27413 clicks per group (control and experiment):
$$ \frac{Clicks \times 2}{ctp} $$
pageviews_gc = round(25835 * 2 / 0.08)
pageviews_ret = round(39155 * 2 / (0.206250 * 0.08))
pageviews_nc = round(27413 * 2 / 0.08)
df_eval['page_views'] = [pageviews_gc, pageviews_ret, pageviews_nc]
df_eval
Over 4 million page views is significantly beyond the estimated 40K views we get on average daily: it would take us at least 100 days to collect the data, and any experiment taking longer than a few weeks is typically not reasonable, hence I decided to drop Retention as a metric. Of the remaining two evaluation metrics, Net Conversion requires the larger number of page views.
Now, let's calculate the duration at different exposure rates:
Duration100 = round(pageviews_nc / 40000)
Duration75 = round(pageviews_nc / (40000 * 0.75))
Duration50 = round(pageviews_nc / (40000 * 0.5))
print ('Duration at 100% exposure: {} days'.format(Duration100))
print ('Duration at 75% exposure: {} days'.format(Duration75))
print ('Duration at 50% exposure: {} days'.format(Duration50))
I decided to go with 75% exposure, since a three-week duration for running the experiment is a reasonable length. A 50% exposure rate, with its month-plus duration, is not necessary because the risk is low: we do not expect a big drop in net conversion that would impact the company's revenue. I personally try to avoid 100% exposure, as business risks or technology issues sometimes arise from running an experiment, and it is always good to hold back some traffic from the change.
There are similar analyses that use different exposure rates than mine, with good justifications as well, so definitely do your own reasoning to choose the right exposure.
Before analyzing the results of the experiment, sanity checks should be performed. These checks help to verify that the experiment was conducted as expected and that other factors did not influence the data we collected. They also make sure that data collection was correct.
For the invariant metrics we expect equal diversion into the experiment and control groups. We will test this at the 95% confidence level.
Two of these metrics are simple counts like number of cookies or number of clicks and the third is a probability (CTP). We will use two different ways of checking whether these observed values are within expectations.
# Load experiment results into dataframe
df_ctr = pd.read_csv('data/Final_Project_Results_Control.csv')
df_ctr.head()
df_exp = pd.read_csv('data/Final_Project_Results_Experiment.csv')
df_exp.head()
# Assign variable names to the total of each column in the experiment results
pageviews_exp = df_exp['Pageviews'].sum()
pageviews_ctr = df_ctr['Pageviews'].sum()
clicks_exp = df_exp['Clicks'].sum()
clicks_ctr = df_ctr['Clicks'].sum()
I use a binomial distribution with p = 0.5 to determine whether the observed split of Number of Cookies and Number of Clicks is within the margin of error at the 95% confidence level, since cookies are randomly assigned to either the control or the experiment group.
What we want to test is whether our observed fraction, $\hat{p}$ (the number of samples in the control or experiment group divided by the total number of samples in both groups), is not significantly different from p = 0.5. If the observed $\hat{p}$ is within the margin of error acceptable at a 95% confidence level, we pass the sanity checks! =)
df_inv = pd.DataFrame({
'Experiment': [pageviews_exp, clicks_exp],
'Control': [pageviews_ctr, clicks_ctr]
}, index = ['Number of Cookies', 'Number of Clicks'])
df_inv
p = 0.5
alpha = 0.05
#Standard Deviation using binomial probability p = 0.5
sd_cookies = np.sqrt((0.5*(1-0.5))/(pageviews_exp + pageviews_ctr))
sd_clicks = np.sqrt((0.5*(1-0.5))/(clicks_exp + clicks_ctr))
df_inv['SD'] = [sd_cookies, sd_clicks]
#Margin of Error = Z score * Standard Deviation
#z score is 1.96 at 95% confidence interval
moe_cookies = 1.96 * sd_cookies
moe_clicks = 1.96 * sd_clicks
df_inv['MOE'] = [1.96 * sd_cookies, 1.96 * sd_clicks]
#Lower and Upper Bound (p +- MOE)
df_inv['Lower_Bound'] = [0.5 - moe_cookies, 0.5 - moe_clicks]
df_inv['Upper_Bound'] = [0.5 + moe_cookies, 0.5 + moe_clicks]
df_inv
# observed fraction, p_observed using either experiment or control group (I ran the calcs with experiment group)
p_observed_cookies = pageviews_exp/(pageviews_exp+pageviews_ctr)
p_observed_clicks = clicks_exp/(clicks_exp+clicks_ctr)
df_inv['p_observed'] = [p_observed_cookies,p_observed_clicks]
df_inv["Pass_Sanity"] = df_inv.apply(lambda x: (x['p_observed'] > x['Lower_Bound'])
and (x['p_observed'] < x['Upper_Bound']),axis = 'columns' )
df_inv
# Same calc as above but using control group to calc observed fraction , p_observed :
p_observed_cookies = pageviews_ctr/(pageviews_exp+pageviews_ctr)
p_observed_clicks = clicks_ctr/(clicks_exp+clicks_ctr)
df_inv['p_observed'] = [p_observed_cookies,p_observed_clicks]
df_inv["Pass_Sanity"] = df_inv.apply(lambda x: (x['p_observed'] > x['Lower_Bound'])
and (x['p_observed'] < x['Upper_Bound']),axis = 'columns' )
df_inv
I ran both sets of calculations to show that the result is the same: the observed fraction is within bounds for both metrics, so they pass the sanity checks.
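As an optional extra check (not part of the original analysis), the same 50/50-split assumption can be tested with an exact two-sided binomial test using scipy.stats.binomtest (SciPy >= 1.7); a p-value above 0.05 is consistent with an even split.
# Optional: exact two-sided binomial test of the 50/50 diversion assumption
from scipy.stats import binomtest
print('Cookies p-value:', binomtest(int(pageviews_exp), int(pageviews_exp + pageviews_ctr), p=0.5).pvalue)
print('Clicks p-value:', binomtest(int(clicks_exp), int(clicks_exp + clicks_ctr), p=0.5).pvalue)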
Click-through-probability of the Free Trial Button
In this case, we want to make sure the proportion of clicks given a pageview (our observed CTP) is about the same in both groups, since this was not expected to change due to the experiment. To check this, we calculate the CTP in each group and a confidence interval for the expected difference between them. In other words, we expect to see no difference ($CTP_{exp} - CTP_{cont} = 0$), within an acceptable margin of error dictated by our calculated confidence interval. The change to note is in the calculation of the standard error, which in this case is a pooled standard error.
$$ SD_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\right)} $$
$$ \hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}} $$
We should understand that CTP is a proportion in a population (the number of events $X$ out of a population of size $N$), like the number of clicks out of the number of pageviews.
# CTP probability per group
ctp_ctr = round(clicks_ctr/pageviews_ctr,4)
ctp_exp = round(clicks_exp/pageviews_exp,4)
ctp_diff = round(ctp_exp - ctp_ctr,4)
#pooled CTP probability
ctp_pool = round((clicks_ctr + clicks_exp) / (pageviews_ctr + pageviews_exp),4)
SD_pool = round(np.sqrt ( (ctp_pool*(1-ctp_pool)/pageviews_ctr) + (ctp_pool*(1-ctp_pool)/pageviews_exp)),4)
MOE = round(1.96* SD_pool,4)
df_ctp = pd.DataFrame({
'CTP_Experiment': [ctp_exp],
'CTP_Control': [ctp_ctr],
'Ppool': [ctp_pool],
'Diff_in_CTP': [ctp_diff],
'SDpool':[SD_pool],
'MOEpool': [MOE]
}, index = ['Click through Probability'])
df_ctp
# Lower & Upper Bound centred on one group's CTP; the check passes if the other
# group's CTP falls within the bounds (let's centre on the control group first)
df_ctp['Lower_Bound'] = round(ctp_ctr - MOE, 4)
df_ctp['Upper_Bound'] = round(ctp_ctr + MOE, 4)
df_ctp['Pass_Sanity'] = df_ctp.apply(lambda x: (x['CTP_Experiment'] > x['Lower_Bound'])
                                     and (x['CTP_Experiment'] < x['Upper_Bound']), axis = 'columns')
df_ctp
# If we centre on the experiment group instead and check the control CTP:
df_ctp['Lower_Bound'] = round(ctp_exp - MOE, 4)
df_ctp['Upper_Bound'] = round(ctp_exp + MOE, 4)
df_ctp['Pass_Sanity'] = df_ctp.apply(lambda x: (x['CTP_Control'] > x['Lower_Bound'])
                                     and (x['CTP_Control'] < x['Upper_Bound']), axis = 'columns')
df_ctp
We passed the sanity check for the click-through-probability metric: whichever group we centre the interval on, the other group's CTP falls within the bounds.
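For completeness, here is a minimal equivalent formulation that centres the interval on the expected difference of zero, using the ctp_diff and MOE values computed above:
# Equivalent check: the observed CTP difference should lie within 0 +/- MOE
print('CTP difference within margin of error:', -MOE < ctp_diff < MOE)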
The next step is looking at the changes between the control and experiment groups with regard to our evaluation metrics, to make sure the difference exists, that it is statistically significant, and, most importantly, practically significant (i.e. the difference is "big" enough to make the tested change beneficial to the company).
Now, all that is left is to measure for each evaluation metric, the difference between the values from both groups. Then, we compute the confidence interval for that difference and test whether or not this confidence interval is both statistically and practically significant.
df_ctr.notnull().sum()
df_exp.notnull().sum()
Based on the experiment results, we have 23 days of enrollment and payment data, so to calculate the probability of the evaluation metrics, we should use the corresponding pageviews and clicks from those days, and not all of them.
cond = (df_exp['Enrollments'].notnull()) & (df_exp['Payments'].notnull())
df_exp23 = df_exp[cond]
cond = (df_ctr['Enrollments'].notnull()) & (df_ctr['Payments'].notnull())
df_ctr23 = df_ctr[cond]
df_23 = pd.DataFrame({
'Experiment': df_exp23[['Pageviews','Clicks','Enrollments','Payments']].sum(),
'Control': df_ctr23[['Pageviews','Clicks','Enrollments','Payments']].sum()
} )
df_23
df_23.loc['Gross_Conversion'] = df_23.loc['Enrollments']/df_23.loc['Clicks']
df_23.loc['Net_Conversion'] = df_23.loc['Payments']/df_23.loc['Clicks']
df_23['Total'] = df_23['Experiment'].iloc[:4] + df_23['Control'].iloc[:4]
df_23
Just in case you need to refer back to the formulas for the pooled probability and the pooled standard deviation:
$$ SD_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\right)} $$
$$ \hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}} $$
# Add Ppool as a column:
df_23['Ppool'] = np.nan
Ppool_gc = df_23.loc['Enrollments', 'Total'] / df_23.loc['Clicks', 'Total']
Ppool_nc = df_23.loc['Payments', 'Total'] / df_23.loc['Clicks', 'Total']
df_23.loc['Gross_Conversion', 'Ppool'] = Ppool_gc
df_23.loc['Net_Conversion', 'Ppool'] = Ppool_nc
df_23['dmin'] = [3000,240,50,np.nan,0.01,0.0075]
# Std Deviation pool
SDpool_gc = round(np.sqrt( (Ppool_gc*(1-Ppool_gc)/df_23.loc['Clicks']['Experiment']) +
(Ppool_gc*(1-Ppool_gc)/df_23.loc['Clicks']['Control'])
),6)
SDpool_nc = round(np.sqrt( (Ppool_nc*(1-Ppool_nc)/df_23.loc['Clicks']['Experiment']) +
(Ppool_nc*(1-Ppool_nc)/df_23.loc['Clicks']['Control'])
),6)
df_23['SDpool'] = [np.nan,np.nan,np.nan,np.nan, SDpool_gc, SDpool_nc]
# Add Margin of Error at 95% Confidence Interval, z-score is 1.96
df_23['MOE'] = 1.96 * df_23['SDpool']
df_23
To determine practical significance, the magnitude of the difference in probability between the experiment and control groups (Pdiff) has to be larger than dmin.
Compute difference between Gross Conversion Experiment and Control group. Repeat the same for Net Conversion.
df_23['Pdiff'] = df_23['Experiment'] - df_23['Control']
# The difference is only meaningful for the probability metrics, so blank out the count rows
df_23.loc[['Pageviews', 'Clicks', 'Enrollments', 'Payments'], 'Pdiff'] = np.nan
df_23
The Gross Conversion metric is practically significant: the difference between the experiment and control groups, Pdiff, is about -2%, whose magnitude exceeds the 1% dmin.
The Net Conversion metric is NOT practically significant: Pdiff is about -0.4%, whose magnitude is below the 0.75% dmin.
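The same check can be stated programmatically (a purely illustrative addition): flag a metric as practically significant when the absolute difference exceeds its dmin.
# Illustrative: practical significance where |Pdiff| > dmin
df_23['Practically_Significant'] = abs(df_23['Pdiff']) > df_23['dmin']
df_23.loc[['Gross_Conversion', 'Net_Conversion'], ['Pdiff', 'dmin', 'Practically_Significant']]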
Statistically significant means a result is unlikely due to chance. The p-value is the probability of obtaining the difference we saw from a sample (or a larger one) if there really isn’t a difference for all users.
Statistical significance doesn’t mean practical significance. Only by considering context can we determine whether a difference is practically significant; that is, whether it requires action.
A result is statistically significant if the confidence interval for the difference does not include 0 (equivalently, if the p-value is below the chosen significance level, here 0.05).
With large sample sizes, you're virtually certain to see statistically significant results; in such situations it's important to interpret the size of the difference.
Small sample sizes often do not yield statistical significance; when they do, the differences themselves tend also to be practically significant; that is, meaningful enough to warrant action.
Refer here for further details: https://measuringu.com/statistically-significant/
# Find the confidence interval range
df_23['Lower'] = df_23['Pdiff'] - df_23['MOE']
df_23['Upper'] = df_23['Pdiff'] + df_23['MOE']
df_23
The Gross Conversion metric is statistically significant: the observed difference is -0.02 and the 95% confidence interval, [-0.029124, -0.011986], does not include 0.
The Net Conversion metric is NOT statistically significant: the observed difference is only -0.004 and the 95% confidence interval, [-0.011604, 0.001857], does include 0.
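For reference (the original analysis reports only the confidence intervals), the corresponding two-sided z-test p-values can be obtained from the observed differences and the pooled standard errors computed above:
# Illustrative: two-sided z-test p-values from Pdiff and the pooled standard errors
from scipy.stats import norm
z_gc = df_23.loc['Gross_Conversion', 'Pdiff'] / SDpool_gc
z_nc = df_23.loc['Net_Conversion', 'Pdiff'] / SDpool_nc
print('Gross Conversion p-value:', 2 * norm.sf(abs(z_gc)))
print('Net Conversion p-value:', 2 * norm.sf(abs(z_nc)))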
In a sign test, we check if the trend of change we observed (increase or decrease) was evident in the daily data.
Based on the analysis above, I expect the experiment group to have lower gross conversion and net conversion rates than the control group. We compute Gross Conversion and Net Conversion daily for each group, then count the number of days on which each metric was lower in the experiment group; this count is the number of successes for the binomial test used to calculate the two-tailed p-value.
I use an online binomial tool to calculate the two-tailed p-value. You can implement the calculation behind it by referring to Tammy Rotem's Kaggle solution.
# Merge both groups on Date, using the 23 days' worth of data established previously (_x = experiment, _y = control)
df_sign = pd.merge(df_exp23, df_ctr23, on = 'Date')
#Experiment group:
df_sign['GC_exp'] = df_sign['Enrollments_x']/df_sign['Clicks_x']
df_sign['NC_exp'] = df_sign['Payments_x']/df_sign['Clicks_x']
#Control group:
df_sign['GC_ctr'] = df_sign['Enrollments_y']/df_sign['Clicks_y']
df_sign['NC_ctr'] = df_sign['Payments_y']/df_sign['Clicks_y']
df_sign.head()
#Select only relevant columns for easier read across:
df_sign = df_sign[['Date','GC_exp', 'GC_ctr', 'NC_exp', 'NC_ctr']]
df_sign.head()
# Gross conversion sign test: True (pass or success) if GC Experiment is lower
df_sign['GC_sign_result'] = df_sign['GC_exp'] < df_sign['GC_ctr']
df_sign['GC_sign_result'].value_counts()
# Net conversion sign test: True (pass or success) if NC Experiment is lower
df_sign['NC_sign_result'] = df_sign['NC_exp'] < df_sign['NC_ctr']
df_sign['NC_sign_result'].value_counts()
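Alternatively (the original analysis uses the online tool), the two-tailed p-values can be computed directly from the daily counts with scipy.stats.binomtest; the results should closely match the values quoted below.
# Alternative to the online calculator: exact two-tailed sign-test p-values
from scipy.stats import binomtest
n_days = len(df_sign)
print('GC sign test p-value:', binomtest(int(df_sign['GC_sign_result'].sum()), n_days, p=0.5).pvalue)
print('NC sign test p-value:', binomtest(int(df_sign['NC_sign_result'].sum()), n_days, p=0.5).pvalue)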
The two-tailed p-value of the Gross Conversion sign test is 0.0026. Since this is well below the 0.05 significance level, the daily trend is unlikely to have occurred by chance (it is statistically significant), and **Gross Conversion passes the sign test**.
The two-tailed p-value of the Net Conversion sign test is 0.6776. Since this is far above the 0.05 significance level, **Net Conversion does NOT pass the sign test**; the experiment does not have a statistically significant impact on Net Conversion.