Contact Us!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutSign UpSign In

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

| Download
Views: 353
Image: ubuntu2004
Kernel: Python 3 (system-wide)
#Group Member: Alyssa, Meraal
#1 %matplotlib inline import matplotlib as mat import seaborn as sns import numpy as np import matplotlib.pyplot as plt import pandas as pd
OatBran=[3.84,5.57, 5.85, 4.80, 3.68, 2.96, 4.41, 3.72, 3.49, 3.84, 5.26, 3.73, 1.84, 4.14]
dataframe1 = pd.read_csv("oatbran.txt") # storing this dataframe in a csv file dataframe1.to_csv('oatbran.txt', index = None) dataframe1
#1A:visualize the data by group using an appropriate plot. Explain your choice. # We will use a beeswarm plot because it will show the distribution of all different types of advertisments against eachother, since we have a small amount of data points with little overlapping. Then we will use A histogram to see the shape of the distribution. p=sns.displot(data=dataframe1, kde=True, color="green")
Image in a Jupyter notebook
p=sns.swarmplot(data=dataframe1) p.set(xlabel="Oatbran and Corn Flakes")+p.set(ylabel="treatment")
[Text(0.5, 0, 'Oatbran and Corn Flakes'), Text(0, 0.5, 'treatment')]
Image in a Jupyter notebook
diet1 = list(dataframe1["CornFlakes"]) diet2 = list(dataframe1["Oatbran"]) display("CornFlakes List", diet1, "OatBranList", diet2);
'CornFlakes List'
[4.61, 6.42, 5.4, 4.54, 3.98, 3.82, 5.01, 4.34, 3.8, 4.56, 5.35, 3.89, 2.25, 4.24]
'OatBranList'
[3.84, 5.57, 5.85, 4.8, 3.68, 2.96, 4.41, 3.72, 3.49, 3.84, 5.26, 3.73, 1.84, 4.14]
#1 b:Describe the appropriate measure to summarize the groups, measure to compare the groups, sample size, null hypothesis, and NHST box model method. Include your reasonsfor each decision. #We will use Median for group summary as the data is not uniform, it is skewed and one distribution is not uni modal.he median gives us the midpoint of the data which is actual information whereas the mean does not. We will not be able to use the mean since the mean can only be used for data that is symmetric, thin tailed and unimodal. Since the data we are observing does not fit this standard bell curve distribution we can not use mean to summarize the data. #We will not be using the mode since the mode does not tell us about the data except for it's most common value which can often be misleading and hence we will not use this method. #Null Hypothsis # The null hypothesis is that the two groups are not different and their medians are same. # We will use decentered two box model as the data is not uniform nor are they the same shape and variation
#1c Show how to calculate the appropriate measure to compare the groups (effect size) by hand, using the observed data (either write on paper or type as text, not code). Explain what you’re doing in each step. Confirm this using Python. #To calculate median, we need to first sort the array in ascending order. Then we find number of items in the list. Divide the number by 2 to get the middle element of the list. This is point where 50% of data points are less value than the value at this point and 50% data points have higher value that this point. Since the total number of items in this list is even (14), we median falls between 7th and 8th element. We take average ot 7th and 8th value and this is our median #The effect size is the difference between the medians of two datasets dsort = diet1 dsort.sort() # sort values in ascending order mid_i = int(len(dsort)/2-1) # find the middle element of the list. median1 = (dsort[mid_i] + dsort[mid_i+1])/2 # take average of two middle values since the list has even items print('Median for Cornflakes = ', median1) dsort = diet2 dsort.sort() mid_i = int(len(dsort)/2-1) median2 = (dsort[mid_i] + dsort[mid_i+1])/2 print('Median for Oatbran = ', median2) print('Effect size = difference in median of two diets =', median1 - median2) print('Python calculated median for diet 1', np.median(diet1)) print('Python calculated median for diet 2', np.median(diet2)) print('Difference in median of two diets =', np.median(diet1) - np.median(diet2))
Median for Cornflakes = 4.4399999999999995 Median for Oatbran = 3.84 Effect size = difference in median of two diets = 0.5999999999999996 Python calculated median for diet 1 4.4399999999999995 Python calculated median for diet 2 3.84 Difference in median of two diets = 0.5999999999999996
#data size print('Size of Cornflakes subjects:', len(diet1)) print('Size of Oatbran subjects:', len(diet2))
Size of Cornflakes subjects: 14 Size of Oatbran subjects: 14
Median_1 = np.median(diet1) Median_2 = np.median(diet2) display("Median of Corn Flakes List", Median_1, "Median of Oatbran List", Median_2)
'Median of Corn Flakes List'
4.4399999999999995
'Median of Oatbran List'
3.84
MAD_set1 = [] MAD_set2 = [] absolute1 = [] absolute2 = [] MAD_1 = [] MAD_2 = [] differenceMedian = Median_1 - Median_2 for i in diet1: MAD_set1.append(i-Median_1) for i in diet2: MAD_set2.append(i-Median_2) absolute1.append(np.abs(MAD_set1)) absolute2.append(np.abs(MAD_set2)) MAD_1.append(np.median(absolute1)) MAD_2.append(np.median(absolute2)) display("MAD for diet 1 List",MAD_1,"MAD for diet 2 List", MAD_2, "Observed Difference", differenceMedian)
'MAD for diet 1 List'
[0.5599999999999998]
'MAD for diet 2 List'
[0.45999999999999996]
'Observed Difference'
0.5999999999999996
#1D Calculate p-value for NHST using the appropriate method. Include your histogram of simulation results with observed result indicated. Comment on any lines that are substantially different from the standard Big Box method (i.e., different in more than variable name from what we did in Lab 4). If appropriate, determine which groups are significantly different from others.
diet1_centered = np.array(diet1) - Median_1 diet2_centered = np.array(diet2) - Median_2 total = 10000 diet_difference = np.zeros(total) differenceMedian = Median_1 - Median_2 other_limit = -differenceMedian for i in range(total): Random_diet1 = np.random.choice(diet1_centered, len(diet1_centered)) Random_diet2 = np.random.choice(diet2_centered, len(diet2_centered)) diet_difference[i] = np.median(Random_diet1) - np.median(Random_diet2) p = sns.distplot(diet_difference, kde=False, axlabel="Difference in diet 1 and diet 2 Medians") p.set(ylabel="Count") p.axvline(differenceMedian, color="red"); p.axvline(other_limit, color="red"); p.axvline(np.median(diet_difference), color="blue"); p.set_title("Difference in Median LDL cholestrol levels When Using Two different diets"); pvalue = (sum(diet_difference>=differenceMedian)+sum(diet_difference<=other_limit))/total display("P Value", pvalue)
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
'P Value'
0.1285
Image in a Jupyter notebook
#1E Calculate the 99% confidence interval for the effect size(s) using the appropriate method. Include the histogram of simulation results with observed result and limits of confidence interval indicated. Comment on any lines that are substantially different from the standard confidence interval method (as we did in Lab 4). diff_median_obs = np.median(diet1) - np.median(diet2) sim_diff = [] for i in range(10000): d1_random = np.random.choice(diet1,len(diet1)) d2_random = np.random.choice(diet2,len(diet2)) diff_i = np.median(d1_random) - np.median(d2_random) sim_diff.append(diff_i) sim_diff.sort() M_low = sim_diff[49] # 50-1 = 49th index M_up = sim_diff[9949]# 9950-1 = 9949th index CI_low = 2*diff_median_obs - M_up CI_up = 2*diff_median_obs - M_low print ("The median for difference is",diff_median_obs, ", with 99% confidence interval [", CI_low, ",", CI_up, "]")
The median for difference is 0.5999999999999996 , with 99% confidence interval [ -0.19500000000000028 , 2.064999999999999 ]
p = sns.displot(sim_diff, kde = False) p.set(title="Histogram of Simulation Results",xlabel = "Median for Difference", ylabel = "Count"); plt.axvline(CI_low, color = "green") #Draw the line of lower limit of 99% CI plt.axvline(CI_up, color = "green") #Draw the line of upper limit of 99% CI plt.axvline(diff_median_obs, color = "red")
<matplotlib.lines.Line2D at 0x7f5b17060c40>
Image in a Jupyter notebook
p = sns.displot(sim_diff, kde = False) p.set(title="Histogram of Simulation Results",xlabel = "Median for Difference", ylabel = "Count"); plt.axvline(CI_low, color = "green") #Draw the line of lower limit of 99% CI plt.axvline(CI_up, color = "green") #Draw the line of upper limit of 99% CI plt.axvline(diff_median_obs, color = "red")
<matplotlib.lines.Line2D at 0x7f5b16b930d0>
Image in a Jupyter notebook
#1F:Interpret your results in terms you could publish in a scientific journal, including any numbers and references to figures that should be reported. Use ⍺ = 0.01. #Methods #Statistical Comparisons: We used resampling methods to test Null Hypthosis that two groups have same median values. # By subtracting the median values from each of two groups, we created two data sets with the same median values, and same variabilities as the actual data set. We then resampled from these two datasets to estimate how large a difference we could expect under Null Hypothesis. This was repeated 10000 times and p-value of the simulation was calculated as number of simulations producing results same or more extreme than observed value. This was divided by 10000 #We also calculated the Confidence Interval by resampling the two original groups and calculating the difference of their medians. This was repeated 10000 times and Confidence Interval was calculated for 99% confidence. #Results: #Our calculated p-value from the simulation is 0.13 which is greater than the ⍺ value of 0.01 chosen for this study. Therefore we fail to reject the Null Hypothsis that the groups are statistically different. Our calculated 99% Confidence Index (-0.62 , 1.54) on the difference of medians of the two data sets also contains 0 meaning no effect size. Therefore we conclude that oatbran cereal does not helps to lower serum cholesterol.
#1G:Interpret your results in terms your friend with no training in statistics might understand. #We conducted a 4 week study to see if oatbran cereal helps to lower serum cholesterol levels. We select a group of fourteen male individuals with high cholesterol were randomly placed on a diet that included either oat bran or corn flakes and then we measured their their LDL cholesterol levels after two weeks. Each individual was then switched to the other diet; after two weeks, the LDL levels were recorded. We conducted statistical methods to determine if oatbran made a difference in their serum cholesterol levels. #We carried out statitical simulations to check this. Our results showed that there are about 13% simulations showing any differences in these two cases is purely due to chance. This is much higher than the recognized rate of acceptence. Therefore we cannot prove that oatbran cereal helps to lower chelesterol, as compared to cornflakes.
#1H:Explain in terms your friend with no training in statistics would understand why you had each individual use both the oat bran and cornflake diets, rather than just have each patient do one or the other. #We have to make sure that we use the same individuals so that we are able to control several other factors that may affect a person's reaction to the cereal. For example, different people may have different metabolisms and rates of cholesterol intake which would disrupt the results of the study and make it less fair. We also had to have each indiviual use both the oat and cornflake diets inorder for this to be a cross over design(the most perferred paired design). We had to have two sepreate treatments where the randomized indiviuals do one and then the same individuals do the alternate.
#2 sales= pd.read_csv("grocery-sales.csv") sales n= 10
#2A: visualize the data # We will use a beeswarm plot because it will show the distribution of all different types of advertisments against eachother, since we have a small amount of data points with little overlapping. p = sns.swarmplot(data=sales, orient = "vertical", color="purple")
Image in a Jupyter notebook
#We chose beeswarm as our main plot because its show the distribution of the dataset the best. We will also use a histogram to look at the shape of our distribution.
ad1 = list(sales["Ad1"]) ad2 = list(sales["Ad2"]) ad3 = list(sales["Ad3"]) p=sns.displot(data=ad1, kde=True, bins=4) p=sns.displot(data=ad2, kde=True, color="green", bins=4) p=sns.displot(data=ad3, kde=True, color="orange", bins=4)
Image in a Jupyter notebookImage in a Jupyter notebookImage in a Jupyter notebook
#2B:Describe the appropriate measure to summarize the groups, measure to compare the groups, sample size, null hypothesis, and NHST box model method. Include your reasonsfor each decision. #We will be using the median and MAD as our meausre to summarize the groups.This method is free from any type of disortion and not limited so a specific type of distribution.The median gives us the midpoint of the data which is actual information whereas the mean does not. We will not be able to use the mean since the mean can only be used for data that is symmetric, thin tailed and unimodal. Since the data we are observing does not fit this standard bell curve distribution we can not use mean to summarize the data. #We will not be using the mode since the mode does not tell us about the data except for it's most common value which can often be misleading and hence we will not use this method. # we will not use standard deviation or variance since it might distort the data since it involves squaring the data, but this distortion does not happen when using the MAD. #We will also use the F statistic to see if there are any differences across groups relative to the difference of the within group, but we will have to use HOC for multiple comparison correction. #Each sample size for the three advertismentsis 10, our null hypothesis for this data set could be that there is no relationship between the different advertisments and the mean sales for the store. We also decided to use the box model for this dataset since shape and variation between all groups is very similier

#2C Show how to calculate the appropriate measure to compare the groups (effect size) by hand, using the observed data (either write on paper or type as text, not code). Explain what you’re doing in each step. Confirm this using Python. #We are going to compute the f-like statistic which is basically across differences/within differences. #First we choose the box model which in this case we chose big box resampling, next we repeating resampling and calculate our f statistic many times. Then lastly we calculate p-value by comparing fobs to the null distribution

#to calculate the median, we would arrange the data in order from the smallest to largest and find the value in the centre of the data. For our values, the centre of the data for the Ad1 is 58.5, Ad2 is 67.5 and Ad3 is 41.0. To find the grand median, we will compute all of the means for each data set and then divide the total by the number of groups. For the F-statistic, we are trying to find the difference across group, relative to the difference within groups. We will apply the F-statistics formula to divide the variation between both the groups.

medians1= sales.median()#finding each groups individual median medians1
Ad1 58.0 Ad2 67.5 Ad3 41.0 dtype: float64
grand_median = np.median(sales)#finding grand median for F statistic grand_median
61.0
#n = len(sales)#finding the numerator for the f-statistic #variation_between_groups = (n*abs(medians["Ad1"]-grand_median))+(n*abs(medians["Ad2"]-grand_median))+(n*abs(medians["Ad3"]-grand_median)) #variation_between_groups
variation_between_groups = np.sum(np.abs(medians1 - grand_median)*n) variation_between_groups
295.0
variation_within_groups = sum(np.sum(abs(sales-medians1)))#finding the denominator for the f-statistic variation_within_groups
445.0
F_observed = (variation_between_groups)/(variation_within_groups) F_observed
0.6629213483146067
#2 D:Calculate p-value for NHST using the appropriate method. Include your histogram of simulation results with observed result indicated. Comment on any lines that are substantially different from the standard Big Box method (i.e., different in more than variable name from what we did in Lab 4). If appropriate, determine which groups are significantly different from others. #creating my function that will compute the F-statistic def F_like_function(data): n = len(data) variation_within_groups = sum(np.sum(abs(data-data.median()))) variation_between_groups = np.sum(np.abs(medians1 - grand_median)*n) #variation_between_groups = (n*abs(data.median()["Ad1"]-np.median(data)))+(n*abs(data.median()["Ad2"]-np.median(data)))+(n*abs(data.median()["Ad3"]-np.median(data))) F_like = (variation_between_groups)/(variation_within_groups) return(F_like)
F_like_function(sales)
0.6629213483146067
alldata = np.concatenate([sales["Ad1"], sales["Ad2"], sales["Ad3"]])#put all data from the samples into a 1D-array alldata
array([63, 39, 60, 38, 63, 46, 56, 75, 42, 62, 62, 80, 62, 43, 76, 66, 98, 69, 86, 40, 84, 81, 44, 59, 20, 73, 26, 38, 19, 31])
a_random = np.random.choice(alldata,len(sales["Ad1"]))#sampling 3 random groups and assigning varibales b_random = np.random.choice(alldata,len(sales["Ad2"])) c_random = np.random.choice(alldata,len(sales["Ad3"])) display("New Ad1 Group", a_random,"New Ad2 Group", b_random,"New Ad3 Group", c_random)
'New Ad1 Group'
array([84, 81, 62, 75, 69, 38, 80, 40, 63, 60])
'New Ad2 Group'
array([42, 56, 39, 62, 43, 19, 31, 62, 73, 44])
'New Ad3 Group'
array([80, 98, 69, 38, 63, 69, 76, 69, 42, 44])
column = np.column_stack((a_random,b_random,c_random))#data frame for the three samples random_data_frame = pd.DataFrame(column) random_data_frame
F_like_function(random_data_frame)# f-statistic for the resampled data
0.7662337662337663
F_like = [] alldata = np.concatenate([sales["Ad1"], sales["Ad2"], sales["Ad3"]]) total =10 for i in range(total): a_random = np.random.choice(alldata,len(sales["Ad1"])) b_random = np.random.choice(alldata,len(sales["Ad2"])) c_random = np.random.choice(alldata,len(sales["Ad3"])) column = np.column_stack((a_random,b_random,c_random)) random_data_frame = pd.DataFrame(column) new_F_like = F_like_function(random_data_frame) F_like.append(new_F_like) F_like
[0.7662337662337663, 0.667420814479638, 0.5673076923076923, 0.48360655737704916, 0.6704545454545454, 0.7040572792362768, 0.6171548117154811, 0.6357758620689655, 0.5750487329434698, 0.6371490280777538]
F_like = [] alldata = np.concatenate([sales["Ad1"], sales["Ad2"], sales["Ad3"]]) total = 10000 for i in range(total): a_random = np.random.choice(alldata,10) b_random = np.random.choice(alldata,10) c_random = np.random.choice(alldata,10) column = np.column_stack((a_random,b_random,c_random)) random_data_frame = pd.DataFrame(column) new_F_like = F_like_function(random_data_frame) F_like.append(new_F_like) pvalue = (sum(F_like>=F_observed))/total display("P-Value", pvalue)
'P-Value'
0.4077
#2 E: Calculate the 99% confidence interval for the effect size(s) using the appropriate method. Include the histogram of simulation results with observed result and limits of confidence interval indicated. Comment on any lines that are substantially different from thestandard confidence interval method (as we did in Lab 4). numsims = 10000 medarray_ad1=np.zeros(numsims) #create array of zeros to store results array_ad1=np.array(ad1) med_ad1 = np.median(array_ad1) for i in range(numsims): #for loop to perform 10,000 simulations resample = np.random.choice(array_ad1, len(array_ad1)) #resample, with replacement resample_median = np.median(resample) #calculate median medarray_ad1[i] = resample_median #store your result medarray_ad1.sort() #sort your list M_low = medarray_ad1[49] # 50-1 = 49th index M_up = medarray_ad1[9949]# 9950-1 = 9949th index CI_low = 2*med_ad1-M_up CI_up = 2*med_ad1-M_low print ("The median for Ad One",med_ad1, ", with 99% confidence interval [", CI_low, ",", CI_up, "]")
The median for Ad One 58.0 , with 99% confidence interval [ 53.0 , 77.0 ]
p = sns.displot(medarray_ad1, kde = False) p.set(title="Histogram of Simulation Results",xlabel = "Median for Ad Oone", ylabel = "Count"); plt.axvline(CI_low, color = "green") #Draw the line of lower limit of 99% CI plt.axvline(CI_up, color = "green") #Draw the line of upper limit of 99% CI plt.axvline(med_ad1, color = "red")
<matplotlib.lines.Line2D at 0x7f5b16895f70>
Image in a Jupyter notebook
numsims = 10000 medarray_ad2=np.zeros(numsims) array_ad2=np.array(ad2) med_ad2 = np.median(array_ad2) for i in range(numsims): resample = np.random.choice(array_ad2, len(array_ad2)) resample_median = np.median(resample) medarray_ad2[i] = resample_median medarray_ad2.sort() M_low = medarray_ad2[49] M_up = medarray_ad2[9949] CI_low = 2*med_ad1-M_up CI_up = 2*med_ad1-M_low print ("The median for Ad two",med_ad2, ", with 99% confidence interval [", CI_low, ",", CI_up, "]")
The median for Ad two 67.5 , with 99% confidence interval [ 30.0 , 73.0 ]
p = sns.displot(medarray_ad2, kde = False, color="purple") p.set(title="Histogram of Simulation Results",xlabel = "Median for Ad Two", ylabel = "Count"); plt.axvline(CI_low, color = "green") plt.axvline(CI_up, color = "green") plt.axvline(med_ad2, color = "red")
<matplotlib.lines.Line2D at 0x7f5b166eed90>
Image in a Jupyter notebook
numsims = 10000 medarray_ad3=np.zeros(numsims) array_ad3=np.array(ad3) med_ad3 = np.median(array_ad3) for i in range(numsims): resample = np.random.choice(array_ad3, len(array_ad3)) resample_median = np.median(resample) medarray_ad3[i] = resample_median medarray_ad3.sort() M_low = medarray_ad3[49] M_up = medarray_ad3[9949] CI_low = 2*med_ad3-M_up CI_up = 2*med_ad3-M_low print ("The median for Ad three",med_ad2, ", with 99% confidence interval [", CI_low, ",", CI_up, "]")
The median for Ad three 67.5 , with 99% confidence interval [ 1.0 , 62.0 ]
p = sns.displot(medarray_ad3, kde = False, color="orange") p.set(title="Histogram of Simulation Results",xlabel = "Median for Ad Two", ylabel = "Count"); plt.axvline(CI_low, color = "green") plt.axvline(CI_up, color = "green") plt.axvline(med_ad3, color = "red")
<matplotlib.lines.Line2D at 0x7f5b1666dd90>
Image in a Jupyter notebook
#2 F:Interpret your results in terms you could publish in a scientific journal, including any numbers and references to figures that should be reported. Use ⍺ = 0.01. #Statistical Comparisons # Because the data did not meet the requirements of normality, multi-group comparison were made using resampling methods. Briefly we considered the ratio of across group differences to within group differences in the data, and then compared that ratio to similarly calculated ratio observed under the Null Hypothysis of no difference amoung the groups as realize by 10,000 randomized versions of the same data. Differnces were considered not signficant statistically as p is 0.409, which is greater than our chosen alpha value of 0.01. #Results #The Omnibus test for the multi-group comparison did not proved to be significant (p=0.409) against ⍺ = 0.01 indicating that there were no significant differences amoung the groups. We did not carry out the Post Hoc Analysis as the omnibus test was negative. #Supplemental Methods #1 The Median was chosen for the descriptors of the three groups because dataset was not normally distributed #2 We carried out the Omnibus test of the three groups. For across group differences we summed the absolute vales of the differnces of the Group Medians from the Grand Median and for within-group differences, we summed the absolute values of the differences between each datapoint and its group median. We defined F as a ratio of across group to within group difference and compared F as calculated from the data to 10,000 values of F calculated from randomized versions of the same data. #We found that the Omnibus test for the multi-group comparison was not significant. The p-value of our simulation was 0.409 which means that almost 40 percent of our simulations results were equal or less than the observed F value. The Omnibus test is negative meaning there is no significant differences in the three types of advertisment
#2 G:Interpret your results in terms your friend with no training in statistics might understand. #The grocery chain wanted to know if three different types of advertisements affect their sales differently. We took the data for each type of advertisement from 10 different stores for one month and measured total sales for each store at the end of the month. We carried out statistical analysis simulations to check if there was a link between type of advertisment and change in sales volumes. Our analysis showed that there are about 40% simulations showing that any difference is purely by chance. This is much higher than the accepted rate of acceptance. There we cannot prove that there is a difference in the sales from different type of advertisment
#3a #We can accept 0.05 alpha value since we are only conducting small number of pair-wise comparisons. Since this test is only being conducted 6 times, there is a lower chance of the results being false positive due to chance, which means that a higher alpha value would be acceptable. If the number of comparisons were higher, we could consider a smaller alpha value. If the number of tests were higher (more pairwise comparisons), the chances of a false positive would increase and we could consider a smaller aplha value. #Since the cost of a false positive while finding the wrong probability of sunscreen preference is low, we can afford a higher alpha value.
#3b #A pairwise comparison refers to determining if the differences between two groups is significant. First, we conduct the omnibus test as a preliminary screening to compare three or more groups and look at the total agregate differences amongst the groups to determine if there is a statistically significant aggregate difference. Since in this case omnibus test passed, we then compare each group with others and determine which groups are statistically different from other. The pairwise comparison in this case will mean determining the effect size of the data in pairs-S1 against S2, S2 against S3, etc. We will run simulations for each pair in this group (6) to determine the p-value of the observed effect size and determine if the difference is statistically significant. We should also use Benjimini-Hochberg Correction to reduce false positive rate.
#3c - Benjimin-Hochberg correction code pvals=[0.23,0.021,0.017,0.04,0.004,0.32] pvals.sort() m=6 alpha=0.05 for k in range(6): ak=alpha*(k+1)/m p=pvals[k]*m/(k+1) if (pvals[k] < ak): print(pvals[k],p,ak,'True') else: print(pvals[k],p,ak,'False')
0.004 0.024 0.008333333333333333 True 0.017 0.051000000000000004 0.016666666666666666 False 0.021 0.042 0.025000000000000005 True 0.04 0.06 0.03333333333333333 False 0.23 0.276 0.041666666666666664 False 0.32 0.32 0.05000000000000001 False
#From our code for Benjimini-Hochberg Correction, the alphaBH was 0.008 (kBh=0.024) as this is the only pair that has lower p than the corrected alpha (p* =< alpha (0.05). This means that only difference between S2 vs S4 (p=0.004) is statistically significant. Differences between other pairs is below the (kBH) and cannot be considered statistically difference. Hence, there is no difference when comparing other sunscreens with each other. #Hand calculations (below) also show same result
#3c Python function to confirm import statsmodels.stats.multitest as smm pvals=[0.23,0.021,0.017,0.04,0.004,0.32] smm.multipletests(pvals,alpha=0.05,method='fdr_bh')
(array([False, True, True, False, True, False]), array([0.276, 0.042, 0.042, 0.06 , 0.024, 0.32 ]), 0.008512444610847103, 0.008333333333333333)