CoCalc -- 2022-08-21-134510.ipynb

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Project: Untitled 2022-08-21

Path: 2022-08-21-134510.ipynb

Views: ³⁵³
Image: ubuntu2004

Kernel: Python 3 (system-wide)

In [0]:

In [0]:

#Group Member: Alyssa, Meraal

In [46]:

#1 
%matplotlib inline
import matplotlib as mat
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:

OatBran=[3.84,5.57, 5.85, 4.80, 3.68, 2.96, 4.41, 3.72, 3.49, 3.84, 5.26, 3.73, 1.84, 4.14]

In [4]:

dataframe1 = pd.read_csv("oatbran.txt")
# storing this dataframe in a csv file
dataframe1.to_csv('oatbran.txt', index = None)

dataframe1

Out[4]:

In [5]:

#1A:visualize the data by group using an appropriate plot. Explain your choice.
# We will use a beeswarm plot because it will show the distribution of all different types of advertisments against eachother, since we have a small amount of data points with little overlapping. Then we will use A histogram to see the shape of the distribution.
p=sns.displot(data=dataframe1, kde=True, color="green")

Out[5]:

In [6]:

p=sns.swarmplot(data=dataframe1)
p.set(xlabel="Oatbran and Corn Flakes")+p.set(ylabel="treatment")

Out[6]:

[Text(0.5, 0, 'Oatbran and Corn Flakes'), Text(0, 0.5, 'treatment')]

In [0]:

In [7]:

diet1 = list(dataframe1["CornFlakes"])
diet2 = list(dataframe1["Oatbran"])
display("CornFlakes List", diet1, "OatBranList", diet2);

Out[7]:

'CornFlakes List'

[4.61,
42,
4,
54,
98,
82,
01,
34,
8,
56,
35,
89,
25,
24]

'OatBranList'

[3.84,
57,
85,
8,
68,
96,
41,
72,
49,
84,
26,
73,
84,
14]

In [11]:

#1 b:Describe the appropriate measure to summarize the groups, measure to compare the groups, sample size, null hypothesis, and NHST box model method. Include your reasonsfor each decision.

#We will use Median for group summary as the data is not uniform, it is skewed and one distribution is not uni modal.he median gives us the midpoint of the data which is actual information whereas the mean does not. We will not be able to use the mean since the mean can only be used for data that is symmetric, thin tailed and unimodal. Since the data we are observing does not fit this standard bell curve distribution we can not use mean to summarize the data.
#We will not be using the mode since the mode does not tell us about the data except for it's most common value which can often be misleading and hence we will not use this method.

#Null Hypothsis
# The null hypothesis is that the two groups are not different and their medians are same.

# We will use decentered two box model as the data is not uniform nor are they the same shape and variation

In [12]:

#1c Show how to calculate the appropriate measure to compare the groups (effect size) by hand, using the observed data (either write on paper or type as text, not code). Explain what you’re doing in each step. Confirm this using Python.

#To calculate median, we need to first sort the array in ascending order.  Then we find number of items in the list.  Divide the number by 2 to get the middle element of the list. This is point where 50% of data points are less value than the value at this point and 50% data points have higher value that this point. Since the total number of items in this list is even (14), we median falls between 7th and 8th element.  We take average ot 7th and 8th value and this is our median

#The effect size is the difference between the medians of two datasets

dsort = diet1
dsort.sort()                                   # sort values in ascending order
mid_i = int(len(dsort)/2-1)                    # find the middle element of the list.
median1 = (dsort[mid_i] + dsort[mid_i+1])/2    # take average of two middle values since the list has even items
print('Median for Cornflakes = ', median1)

dsort = diet2
dsort.sort()
mid_i = int(len(dsort)/2-1)
median2 = (dsort[mid_i] + dsort[mid_i+1])/2
print('Median for Oatbran = ', median2)

print('Effect size = difference in median of two diets =', median1 - median2)

print('Python calculated median for diet 1', np.median(diet1))
print('Python calculated median for diet 2', np.median(diet2))
print('Difference in median of two diets =', np.median(diet1) - np.median(diet2))

Out[12]:

Median for Cornflakes =  4.4399999999999995
Median for Oatbran =  3.84
Effect size = difference in median of two diets = 0.5999999999999996
Python calculated median for diet 1 4.4399999999999995
Python calculated median for diet 2 3.84
Difference in median of two diets = 0.5999999999999996

In [9]:

#data size
print('Size of Cornflakes subjects:', len(diet1))
print('Size of Oatbran subjects:', len(diet2))

Out[9]:

Size of Cornflakes subjects: 14
Size of Oatbran subjects: 14

In [13]:

Median_1 = np.median(diet1)
Median_2 = np.median(diet2)
display("Median of Corn Flakes List", Median_1, "Median of Oatbran List", Median_2)

Out[13]:

'Median of Corn Flakes List'

4.4399999999999995

'Median of Oatbran List'

3.84

In [14]:

MAD_set1 = []
MAD_set2 = []
absolute1 = []
absolute2 = []
MAD_1 = []
MAD_2 = []
differenceMedian = Median_1 - Median_2
for i in diet1:
    MAD_set1.append(i-Median_1)
for i in diet2:
    MAD_set2.append(i-Median_2)
absolute1.append(np.abs(MAD_set1))
absolute2.append(np.abs(MAD_set2))
MAD_1.append(np.median(absolute1))
MAD_2.append(np.median(absolute2))
display("MAD for diet 1 List",MAD_1,"MAD for diet 2 List", MAD_2, "Observed Difference", differenceMedian)

Out[14]:

'MAD for diet 1 List'

[0.5599999999999998]

'MAD for diet 2 List'

[0.45999999999999996]

'Observed Difference'

0.5999999999999996

In [0]:

#1D Calculate p-value for NHST using the appropriate method. Include your histogram of simulation results with observed result indicated. Comment on any lines that are substantially different from the standard Big Box method (i.e., different in more than variable name from what we did in Lab 4). If appropriate, determine which groups are significantly different from others.

In [15]:

diet1_centered = np.array(diet1) - Median_1
diet2_centered = np.array(diet2) - Median_2
total = 10000
diet_difference = np.zeros(total)
differenceMedian = Median_1 - Median_2
other_limit = -differenceMedian
for i in range(total):
    Random_diet1 = np.random.choice(diet1_centered, len(diet1_centered))
    Random_diet2 = np.random.choice(diet2_centered, len(diet2_centered))
    diet_difference[i] = np.median(Random_diet1) - np.median(Random_diet2)
p = sns.distplot(diet_difference, kde=False, axlabel="Difference in diet 1 and diet 2 Medians")
p.set(ylabel="Count")
p.axvline(differenceMedian, color="red");
p.axvline(other_limit, color="red");
p.axvline(np.median(diet_difference), color="blue");
p.set_title("Difference in Median LDL cholestrol levels When Using Two different diets");
pvalue = (sum(diet_difference>=differenceMedian)+sum(diet_difference<=other_limit))/total
display("P Value", pvalue)

Out[15]:

/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

'P Value'

0.1285

In [0]:

In [18]:

#1E Calculate the 99% confidence interval for the effect size(s) using the appropriate method. Include the histogram of simulation results with observed result and limits of confidence interval indicated. Comment on any lines that are substantially different from the standard confidence interval method (as we did in Lab 4).
diff_median_obs = np.median(diet1) - np.median(diet2)
sim_diff = []
for i in range(10000):
    d1_random = np.random.choice(diet1,len(diet1))
    d2_random = np.random.choice(diet2,len(diet2))
    diff_i = np.median(d1_random) - np.median(d2_random)
    sim_diff.append(diff_i)

sim_diff.sort()
M_low = sim_diff[49] # 50-1 = 49th index
M_up = sim_diff[9949]# 9950-1 = 9949th index
CI_low = 2*diff_median_obs - M_up
CI_up = 2*diff_median_obs - M_low

print ("The median for difference is",diff_median_obs, ", with 99% confidence interval [", CI_low, ",", CI_up, "]")

Out[18]:

The median for difference is 0.5999999999999996 , with 99% confidence interval [ -0.19500000000000028 , 2.064999999999999 ]

In [19]:

p = sns.displot(sim_diff, kde = False)
p.set(title="Histogram of Simulation Results",xlabel = "Median for Difference", ylabel = "Count");
plt.axvline(CI_low, color = "green") #Draw the line of lower limit of 99% CI
plt.axvline(CI_up, color = "green") #Draw the line of upper limit of 99% CI
plt.axvline(diff_median_obs, color = "red")

Out[19]:

<matplotlib.lines.Line2D at 0x7f5b17060c40>

In [0]:

In [20]:

p = sns.displot(sim_diff, kde = False)
p.set(title="Histogram of Simulation Results",xlabel = "Median for Difference", ylabel = "Count");
plt.axvline(CI_low, color = "green") #Draw the line of lower limit of 99% CI
plt.axvline(CI_up, color = "green") #Draw the line of upper limit of 99% CI
plt.axvline(diff_median_obs, color = "red")

Out[20]:

<matplotlib.lines.Line2D at 0x7f5b16b930d0>

In [21]:

#1F:Interpret your results in terms you could publish in a scientific journal, including any numbers and references to figures that should be reported. Use ⍺ = 0.01.

#Methods
#Statistical Comparisons:  We used resampling methods to test Null Hypthosis that two groups have same median values.
# By subtracting the median values from each of two groups, we created two data sets with the same median values, and same variabilities as the actual data set.  We then resampled from these two datasets to estimate how large a difference we could expect under Null Hypothesis.  This was repeated 10000 times and p-value of the simulation was calculated as number of simulations producing results same or more extreme than observed value. This was divided by 10000
#We also calculated the Confidence Interval by resampling the two original groups and calculating the difference of their medians. This was repeated 10000 times and Confidence Interval was calculated for 99% confidence.

#Results:
#Our calculated p-value from the simulation is 0.13 which is greater than the ⍺ value of 0.01 chosen for this study. Therefore we fail to reject the Null Hypothsis that the groups are statistically different. Our calculated 99% Confidence Index (-0.62 , 1.54) on the difference of medians of the two data sets also contains 0 meaning no effect size. Therefore we conclude that oatbran cereal does not helps to lower serum cholesterol.

In [0]:

#1G:Interpret your results in terms your friend with no training in statistics might understand.
#We conducted a 4 week study to see if oatbran cereal helps to lower serum cholesterol levels. We select a group of fourteen male individuals with high cholesterol were randomly placed on a diet that included either oat bran or corn flakes and then we measured their their LDL cholesterol levels after two weeks.  Each individual was then switched to the other diet; after two weeks, the LDL levels were recorded. We conducted statistical methods to determine if oatbran made a difference in their serum cholesterol levels.

#We carried out statitical simulations to check this. Our results showed that there are about 13% simulations showing any differences in these two cases is purely due to chance. This is much higher than the recognized rate of acceptence. Therefore we cannot prove that oatbran cereal helps to lower chelesterol, as compared to cornflakes.

In [22]:

#1H:Explain in terms your friend with no training in statistics would understand why you had each individual use both the oat bran and cornflake diets, rather than just have each patient do one or the other.



#We have to make sure that we use the same individuals so that we are able to control several other factors that may affect a person's reaction to the cereal. For example, different people may have different metabolisms and rates of cholesterol intake which would disrupt the results of the study and make it less fair. We also had to have each indiviual use both the oat and cornflake diets inorder for this to be a cross over design(the most perferred paired design). We had to have two sepreate treatments where the randomized indiviuals do one and then the same individuals do the alternate.

In [23]:

#2
sales= pd.read_csv("grocery-sales.csv")
sales
n= 10

In [24]:

#2A: visualize the data
# We will use a beeswarm plot because it will show the distribution of all different types of advertisments against eachother, since we have a small amount of data points with little overlapping.
p = sns.swarmplot(data=sales, orient = "vertical", color="purple")

Out[24]:

In [0]:

#We chose beeswarm as our main plot because its show the distribution of the dataset the best. We will also use a histogram to look at the shape of our distribution.

In [25]:

ad1 = list(sales["Ad1"])
ad2 = list(sales["Ad2"])
ad3 = list(sales["Ad3"])
p=sns.displot(data=ad1, kde=True, bins=4)
p=sns.displot(data=ad2, kde=True, color="green", bins=4)
p=sns.displot(data=ad3, kde=True, color="orange", bins=4)

Out[25]:

In [26]:

#2B:Describe the appropriate measure to summarize the groups, measure to compare the groups, sample size, null hypothesis, and NHST box model method. Include your reasonsfor each decision.

#We will be using the median and MAD as our meausre to summarize the groups.This method is free from any type of disortion and not limited so a specific type of distribution.The median gives us the midpoint of the data which is actual information whereas the mean does not. We will not be able to use the mean since the mean can only be used for data that is symmetric, thin tailed and unimodal. Since the data we are observing does not fit this standard bell curve distribution we can not use mean to summarize the data.
#We will not be using the mode since the mode does not tell us about the data except for it's most common value which can often be misleading and hence we will not use this method.
# we will not use standard deviation or variance since it might distort the data since it involves squaring the data, but this distortion does not happen when using the MAD.
#We will also use the F statistic to see if there are any differences across groups relative to the difference of the within group, but we will have to use HOC for multiple comparison correction.

#Each sample size for the three advertismentsis 10, our null hypothesis for this data set could be that there is no relationship between the different advertisments and the mean sales for the store. We also decided to use the box model for this dataset since shape and variation between all groups is very similier

#2C Show how to calculate the appropriate measure to compare the groups (effect size) by hand, using the observed data (either write on paper or type as text, not code). Explain what you’re doing in each step. Confirm this using Python. #We are going to compute the f-like statistic which is basically across differences/within differences. #First we choose the box model which in this case we chose big box resampling, next we repeating resampling and calculate our f statistic many times. Then lastly we calculate p-value by comparing fobs to the null distribution

#to calculate the median, we would arrange the data in order from the smallest to largest and find the value in the centre of the data. For our values, the centre of the data for the Ad1 is 58.5, Ad2 is 67.5 and Ad3 is 41.0. To find the grand median, we will compute all of the means for each data set and then divide the total by the number of groups. For the F-statistic, we are trying to find the difference across group, relative to the difference within groups. We will apply the F-statistics formula to divide the variation between both the groups.

In [27]:

medians1= sales.median()#finding each groups individual median
medians1

Out[27]:

Ad1    58.0
Ad2    67.5
Ad3    41.0
dtype: float64

In [28]:

grand_median = np.median(sales)#finding grand median for F statistic
grand_median

Out[28]:

61.0

In [29]:

#n = len(sales)#finding the numerator for the f-statistic
#variation_between_groups = (n*abs(medians["Ad1"]-grand_median))+(n*abs(medians["Ad2"]-grand_median))+(n*abs(medians["Ad3"]-grand_median))
#variation_between_groups

In [30]:

variation_between_groups = np.sum(np.abs(medians1 - grand_median)*n)
variation_between_groups

Out[30]:

295.0

In [31]:

variation_within_groups = sum(np.sum(abs(sales-medians1)))#finding the denominator for the f-statistic
variation_within_groups

Out[31]:

445.0

In [32]:

F_observed = (variation_between_groups)/(variation_within_groups)
F_observed

Out[32]:

0.6629213483146067

In [33]:

#2 D:Calculate p-value for NHST using the appropriate method. Include your histogram of simulation results with observed result indicated. Comment on any lines that are substantially different from the standard Big Box method (i.e., different in more than variable name from what we did in Lab 4). If appropriate, determine which groups are significantly different from others.
#creating my function that will compute the F-statistic
def F_like_function(data):
    n = len(data)
    variation_within_groups = sum(np.sum(abs(data-data.median())))
    variation_between_groups = np.sum(np.abs(medians1 - grand_median)*n)
    #variation_between_groups = (n*abs(data.median()["Ad1"]-np.median(data)))+(n*abs(data.median()["Ad2"]-np.median(data)))+(n*abs(data.median()["Ad3"]-np.median(data)))
    F_like = (variation_between_groups)/(variation_within_groups)
    return(F_like)

In [34]:

F_like_function(sales)

Out[34]:

0.6629213483146067

In [35]:

alldata = np.concatenate([sales["Ad1"], sales["Ad2"], sales["Ad3"]])#put all data from the samples into a 1D-array
alldata

Out[35]:

array([63, 39, 60, 38, 63, 46, 56, 75, 42, 62, 62, 80, 62, 43, 76, 66, 98,
       69, 86, 40, 84, 81, 44, 59, 20, 73, 26, 38, 19, 31])

In [0]:

In [36]:

a_random = np.random.choice(alldata,len(sales["Ad1"]))#sampling 3 random groups and assigning varibales
b_random = np.random.choice(alldata,len(sales["Ad2"]))
c_random = np.random.choice(alldata,len(sales["Ad3"]))
display("New Ad1 Group", a_random,"New Ad2 Group", b_random,"New Ad3 Group", c_random)

Out[36]:

'New Ad1 Group'

array([84, 81, 62, 75, 69, 38, 80, 40, 63, 60])

'New Ad2 Group'

array([42, 56, 39, 62, 43, 19, 31, 62, 73, 44])

'New Ad3 Group'

array([80, 98, 69, 38, 63, 69, 76, 69, 42, 44])

In [37]:

column = np.column_stack((a_random,b_random,c_random))#data frame for the three samples
random_data_frame = pd.DataFrame(column)
random_data_frame

Out[37]:

In [38]:

F_like_function(random_data_frame)# f-statistic for the resampled data

Out[38]:

0.7662337662337663

In [39]:

F_like = []
alldata = np.concatenate([sales["Ad1"], sales["Ad2"], sales["Ad3"]])
total =10
for i in range(total):
    a_random = np.random.choice(alldata,len(sales["Ad1"]))
    b_random = np.random.choice(alldata,len(sales["Ad2"]))
    c_random = np.random.choice(alldata,len(sales["Ad3"]))
    column = np.column_stack((a_random,b_random,c_random))
    random_data_frame = pd.DataFrame(column)
    new_F_like = F_like_function(random_data_frame)
    F_like.append(new_F_like)
F_like

Out[39]:

[0.7662337662337663,
667420814479638,
5673076923076923,
48360655737704916,
6704545454545454,
7040572792362768,
6171548117154811,
6357758620689655,
5750487329434698,
6371490280777538]

In [40]:

F_like = []
alldata = np.concatenate([sales["Ad1"], sales["Ad2"], sales["Ad3"]])
total = 10000
for i in range(total):
    a_random = np.random.choice(alldata,10)
    b_random = np.random.choice(alldata,10)
    c_random = np.random.choice(alldata,10)
    column = np.column_stack((a_random,b_random,c_random))
    random_data_frame = pd.DataFrame(column)
    new_F_like = F_like_function(random_data_frame)
    F_like.append(new_F_like)
pvalue = (sum(F_like>=F_observed))/total
display("P-Value", pvalue)

Out[40]:

'P-Value'

0.4077

In [41]:

#2 E: Calculate the 99% confidence interval for the effect size(s) using the appropriate method. Include the histogram of simulation results with observed result and limits of confidence interval indicated. Comment on any lines that are substantially different from thestandard confidence interval method (as we did in Lab 4).

numsims = 10000
medarray_ad1=np.zeros(numsims) #create array of zeros to store results
array_ad1=np.array(ad1)
med_ad1 = np.median(array_ad1)
for i in range(numsims):
    #for loop to perform 10,000 simulations
    resample = np.random.choice(array_ad1, len(array_ad1)) #resample, with replacement
    resample_median = np.median(resample) #calculate median
    medarray_ad1[i] = resample_median #store your result
medarray_ad1.sort() #sort your list
M_low = medarray_ad1[49] # 50-1 = 49th index
M_up = medarray_ad1[9949]# 9950-1 = 9949th index
CI_low = 2*med_ad1-M_up
CI_up = 2*med_ad1-M_low

print ("The median for Ad One",med_ad1, ", with 99% confidence interval [", CI_low, ",", CI_up, "]")

Out[41]:

The median for Ad One 58.0 , with 99% confidence interval [ 53.0 , 77.0 ]

In [42]:

p = sns.displot(medarray_ad1, kde = False)
p.set(title="Histogram of Simulation Results",xlabel = "Median for Ad Oone", ylabel = "Count");
plt.axvline(CI_low, color = "green") #Draw the line of lower limit of 99% CI
plt.axvline(CI_up, color = "green") #Draw the line of upper limit of 99% CI
plt.axvline(med_ad1, color = "red")

Out[42]:

<matplotlib.lines.Line2D at 0x7f5b16895f70>

In [48]:

numsims = 10000
medarray_ad2=np.zeros(numsims) 
array_ad2=np.array(ad2)
med_ad2 = np.median(array_ad2)
for i in range(numsims):
    resample = np.random.choice(array_ad2, len(array_ad2)) 
    resample_median = np.median(resample)
    medarray_ad2[i] = resample_median
medarray_ad2.sort()
M_low = medarray_ad2[49] 
M_up = medarray_ad2[9949]
CI_low = 2*med_ad1-M_up
CI_up = 2*med_ad1-M_low

print ("The median for Ad two",med_ad2, ", with 99% confidence interval [", CI_low, ",", CI_up, "]")

Out[48]:

The median for Ad two 67.5 , with 99% confidence interval [ 30.0 , 73.0 ]

In [49]:

p = sns.displot(medarray_ad2, kde = False, color="purple")
p.set(title="Histogram of Simulation Results",xlabel = "Median for Ad Two", ylabel = "Count");
plt.axvline(CI_low, color = "green") 
plt.axvline(CI_up, color = "green") 
plt.axvline(med_ad2, color = "red")

Out[49]:

<matplotlib.lines.Line2D at 0x7f5b166eed90>

In [51]:

numsims = 10000
medarray_ad3=np.zeros(numsims) 
array_ad3=np.array(ad3)
med_ad3 = np.median(array_ad3)
for i in range(numsims):
    resample = np.random.choice(array_ad3, len(array_ad3)) 
    resample_median = np.median(resample)
    medarray_ad3[i] = resample_median
medarray_ad3.sort()
M_low = medarray_ad3[49] 
M_up = medarray_ad3[9949]
CI_low = 2*med_ad3-M_up
CI_up = 2*med_ad3-M_low

print ("The median for Ad three",med_ad2, ", with 99% confidence interval [", CI_low, ",", CI_up, "]")

Out[51]:

The median for Ad three 67.5 , with 99% confidence interval [ 1.0 , 62.0 ]

In [52]:

p = sns.displot(medarray_ad3, kde = False, color="orange")
p.set(title="Histogram of Simulation Results",xlabel = "Median for Ad Two", ylabel = "Count");
plt.axvline(CI_low, color = "green") 
plt.axvline(CI_up, color = "green") 
plt.axvline(med_ad3, color = "red")

Out[52]:

<matplotlib.lines.Line2D at 0x7f5b1666dd90>

In [0]:

#2 F:Interpret your results in terms you could publish in a scientific journal, including any numbers and references to figures that should be reported. Use ⍺ = 0.01.

#Statistical Comparisons
# Because the data did not meet the requirements of normality, multi-group comparison were made using resampling methods.  Briefly we considered the ratio of across group differences to within group differences in the data, and then compared that ratio to similarly calculated ratio observed under the Null Hypothysis of no difference amoung the groups as realize by 10,000 randomized versions of the same data.  Differnces were considered not signficant statistically as p is 0.409, which is greater than our chosen alpha value of 0.01.

#Results
#The Omnibus test for the multi-group comparison did not proved to be significant (p=0.409) against ⍺ = 0.01 indicating that there were no significant differences amoung the groups.  We did not carry out the Post Hoc Analysis as the omnibus test was negative.


#Supplemental Methods
#1 The Median was chosen for the descriptors of the three groups because dataset was not normally distributed
#2 We carried out the Omnibus test of the three groups. For across group differences we summed the absolute vales of the differnces of the Group Medians from the Grand Median and for within-group differences, we summed the absolute values of the differences between each datapoint and its group median. We defined F as a ratio of across group to within group difference and compared F as calculated from the data to 10,000 values of F calculated from randomized versions of the same data.
#We found that the Omnibus test for the multi-group comparison was not significant.  The p-value of our simulation was 0.409 which means that almost 40 percent of our simulations results were equal or less than the observed F value.  The Omnibus test is negative meaning there is no significant differences in the three types of advertisment

In [0]:

#2 G:Interpret your results in terms your friend with no training in statistics might understand.

#The grocery chain wanted to know if three different types of advertisements affect their sales differently. We took the data for each type of advertisement from 10 different stores for one month and measured total sales for each store at the end of the month. We carried out statistical analysis simulations to check if there was a link between type of advertisment and change in sales volumes.  Our analysis showed that there are about 40% simulations showing that any difference is purely by chance.  This is much higher than the accepted rate of acceptance.  There we cannot prove that there is a difference in the sales from different type of advertisment

In [0]:

#3a
#We can accept 0.05 alpha value since we are only conducting small number of pair-wise comparisons. Since this test is only being conducted 6 times, there is a lower chance of the results being false positive due to chance, which means that a higher alpha value would be acceptable. If the number of comparisons were higher, we could consider a smaller alpha value. If the number of tests were higher (more pairwise comparisons), the chances of a false positive would increase and we could consider a smaller aplha value. 

#Since the cost of a false positive while finding the wrong probability of sunscreen preference is low, we can afford a higher alpha value.

In [0]:

#3b
#A pairwise comparison refers to determining if the differences between two groups is significant. First, we conduct the omnibus test as a preliminary screening to compare three or more groups and look at the total agregate differences amongst the groups to determine if there is a statistically significant aggregate difference. Since in this case omnibus test passed, we then compare each group with others and determine which groups are statistically different from other. The pairwise comparison in this case will mean determining the effect size of the data in pairs-S1 against S2, S2 against S3, etc. We will run simulations for each pair in this group (6) to determine the p-value of the observed effect size and determine if the difference is statistically significant. We should also use Benjimini-Hochberg Correction to reduce false positive rate.

In [53]:

#3c - Benjimin-Hochberg correction code

pvals=[0.23,0.021,0.017,0.04,0.004,0.32]
pvals.sort()
m=6
alpha=0.05
for k in range(6):
    ak=alpha*(k+1)/m
    p=pvals[k]*m/(k+1)
    if (pvals[k] < ak):
        print(pvals[k],p,ak,'True')
    else:
        print(pvals[k],p,ak,'False')

Out[53]:

004 0.024 0.008333333333333333 True
017 0.051000000000000004 0.016666666666666666 False
021 0.042 0.025000000000000005 True
04 0.06 0.03333333333333333 False
23 0.276 0.041666666666666664 False
32 0.32 0.05000000000000001 False

In [0]:

#From our code for Benjimini-Hochberg Correction, the alphaBH was 0.008 (kBh=0.024) as this is the only pair that has lower p than the corrected alpha (p* =< alpha (0.05). This means that only difference between S2 vs S4 (p=0.004) is statistically significant. Differences between other pairs is below the (kBH) and cannot be considered statistically difference.  Hence, there is no difference when comparing other sunscreens with each other.

#Hand calculations (below) also show same result

In [54]:

#3c Python function to confirm
import statsmodels.stats.multitest as smm
pvals=[0.23,0.021,0.017,0.04,0.004,0.32]
smm.multipletests(pvals,alpha=0.05,method='fdr_bh')

Out[54]:

(array([False,  True,  True, False,  True, False]),
 array([0.276, 0.042, 0.042, 0.06 , 0.024, 0.32 ]),
 0.008512444610847103,
 0.008333333333333333)

In [0]:

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Product

Resources

Company

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more, all in one place. Commercial Alternative to JupyterHub.

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.