CoCalc -- RealGroupProject.ipynb

²¹⁷ views
ubuntu2004

Kernel: Python 3 (system-wide)

In [0]:

#Group Members: Yufei Li-2E, Yufei Wang-2D, Chris Carmichael-2C
#all other contributions were group work.

In [2]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mtpt

In [9]:

#Question 1

Corn=[4.61, 6.42, 5.4, 4.54, 3.98, 3.82, 5.01, 4.34, 3.8, 4.56, 5.35, 3.89, 2.25, 4.24]
Oat=[3.84, 5.57, 5.85, 4.8, 3.68, 2.96, 4.41, 3.72, 3.49, 3.84, 5.26, 3.73, 1.84, 4.14]
#We had each individual switch their diet to create a cross-over test, this allows us to see the effect of each diet on the individual while at the same time allowing us a smaller sample group (14 individuals in this case). Cross-over tests also allow for more precise p-values as well as reduce within-group differences in our study design. Additionally, such cross-over tests allow us to decrease the effect of individual factors on the experiment, as in theory we should only see the change due to our treatments (assuming the individuals are living a similar life during both periods of treatment [dieting]); This essentially allows us to effectively control outside factors, as well as having each individual linked to their own data, allowing us to have a better understanding of the effects of our treatment.
def slopePlot(data, labels=["", ""], line_color="gray", point_color="black"):
    import matplotlib.pyplot as plt
    from numpy import array

    dataArr = array(data)
    fig, ax = plt.subplots(figsize=(4, 3))

    x1=0.8
    x2=1.2
    n = dataArr.shape[0]
    for i in range(n):
        ax.plot([x1, x2], [dataArr[i,0], dataArr[i,1]], color=line_color)

        # Plot the points
        ax.scatter(n*[x1-0.01], dataArr[:,0], color=point_color, s=25, label=labels[0])
        ax.scatter(n*[x2+0.01], dataArr[:,1], color=point_color, s=25, label=labels[1])

    # Fix the axes and labels
    ax.set_xticks([x1, x2])
    _ = ax.set_xticklabels(labels, fontsize='x-large')

    return ax

In [10]:


data = {'Corn':  [4.61, 6.42, 5.4, 4.54, 3.98, 3.82, 5.01, 4.34, 3.8, 4.56, 5.35, 3.89, 2.25, 4.24],
        'Oat': [3.84, 5.57, 5.85, 4.8, 3.68, 2.96, 4.41, 3.72, 3.49, 3.84, 5.26, 3.73, 1.84, 4.14],}

df = pd.DataFrame(data)

print (df)

Out[10]:

In [14]:

slopePlot(df, labels=["Corn", "Oat"], line_color="red", point_color="black")

Out[14]:

<AxesSubplot:>

In [15]:

#Question 1A, I used a slope plot to visualize the data as we have a paired design where each participant was given Corn-Oat or vice versa. In such a graph I can see the distribution of the two group as well as the effects of the diets on each individual.

In [21]:

#Question 1B

#we think that median and MAD would be most appropriate to summarise the groups, due to the fact that the groups seem to not fall under a perfect bell curve.
#The measre we can use is the F statistic of the data, through an Omnibus test as there are 2 seperate groups.
#the null hypothesis is that there is no difference in diets on the patients cholestorol levels.
#our sample size is 14, as there are 14 individuals, however in total we have 28 data points. As each individual parook in the trial for both Oat and Corn.

#reasons for choosing NHST box model method: We choose the de-centered multiple-box model because we have 10 data points in each group greater than 7, so we do not use ranks. Also, the distributions of these data are asymmetric, the shape and variability (MAD) are not the same, so we cannot use the big-box model. Instead, we use the de-centered multiple-box model here.


DCorn=df["Corn"]
DOat=df["Oat"]

MSC=np.median(DCorn)
MSO=np.median(DOat)
ASC=np.mean(DCorn)
ASO=np.mean(DOat)

print("Corn median=", MSC)
print("Oat median=", MSO)

print("Corn mean=", ASC)
print("Oat mean=", ASO)

def MAD(thing):
    result=[]
    med=np.median(thing)
    for i in thing:
        result.append(np.abs(i-med))
    return(np.median(result))

print("The MAD for Ad1 is= ",MAD(DCorn))
print("The MAD for Ad2 is= ",MAD(DOat))

Out[21]:

Corn median= 4.4399999999999995
Oat median= 3.84
Corn mean= 4.443571428571429
Oat mean= 4.0807142857142855
The MAD for Ad1 is=  0.5599999999999998
The MAD for Ad2 is=  0.45999999999999996

In [23]:

print(len(Oat))

Out[23]:

14

In [0]:

#Question 1C

In [0]:

#Question 1D

In [0]:

#Question 1E

In [0]:

#Question 1F

In [0]:

#Question 1G

In [0]:

#Question 1H

In [5]:

Sample=pd.read_csv("grocery-sales.csv")
pd.read_csv("grocery-sales.csv")

Out[5]:

In [16]:

#Question 2A
bob=sns.swarmplot(data=Sample)
bob.set(xlabel="Group", ylabel="Sales $1000/month", title="Beeswarm Plot of Data of Sales")

Out[16]:

[Text(0.5, 0, 'Group'),
 Text(0, 0.5, 'Sales $1000/month'),
 Text(0.5, 1.0, 'Beeswarm Plot of Data of Sales')]

In [0]:

#We like the beeswarm plot b/c our sample size is less than 20 in each group, this leaves us with a dot plot or beeswarm plot as they show each data point separately. This allows us to see the distribution and approximate the skew of each graph.

In [37]:

#Question 2B
#we think that median and MAD would be most appropriate to summarise the groups, due to the fact that AD1 is left-skewed, AD2 and AD3 are right-skewed, as such mean is not appropriate in this situation, so we revert to median as it better represents the central tendencies of the stores.
#The measre we can use is the F statistic of the data, through an Omnibus test as there are 3 seperate groups.
#the null hypothesis is that there is no difference in the monthly sales of the stores between the three differnt types of advertisement.
#our sample size is 10, as there are 10 stores in each group, however in total we have 30 sotores. 10 will be re-sampled at a time.
#reasons for choosing NHST box model method: We choose the de-centered multiple-box model because we have 10 data points in each group greater than 7, so we do not use ranks. Also, the distributions of these data are asymmetric, the shape and variability (MAD) are not the same, so we cannot use the big-box model. Instead, we use the de-centered multiple-box model here.

In [39]:

#Question 2C
#To get the median for Ad1 we have a sorted order of 38,39,42,46,56,60,62,63,63,75 with 56 and 60 being in the middle. (56+60)/2 = 58
#To get the median for Ad2 we have a sorted order of 40,43,62,62,66,69,76,80,86,98, with 66 and 69 being in the middle (66+69)/2 = 67.5
#To get the median for Ad2 we have a sorted order of 19,20,26,31,38,44,59,73,81,84, with the middle valus being 38 and 44 (38+44)/2 =41

SA1=Sample["Ad1"]
SA2=Sample["Ad2"]
SA3=Sample["Ad3"]
MSA1=np.median(SA1)
MSA2=np.median(SA2)
MSA3=np.median(SA3)

print("Ad1 median=", MSA1)
print("Ad2 median=", MSA2)
print("Ad3 median=", MSA3)

Out[39]:

Ad1 median= 58.0
Ad2 median= 67.5
Ad3 median= 41.0

In [75]:

#to find the MAD we have to take the calculated median above and subtract it from each point, and find the absolute value of the difference... sort thease values and then find the median of them

# for MAD of Ad1 we have 38,39,42,46,56,60,62,63,63,75 abs(-58) to get 20, 19, 16, 12, 2, 2, 4, 5, 5, 17. Sorted we get 2, 2, 4, 5, 5, 12, 16, 17, 19, 20. between 5 and 12/2 we get 8.5 

# for MAD of Ad2 we have 40,43,62,62,66,69,76,80,86,98, abs(-67.5) to get 27.5, 24.5, 5.5, 5.5, 1.5, 1.5, 6.5, 12.5, 18.5, 30.5 sorted- 1.5, 1.5, 5.5, 5.5, 6.5, 13.5, 18.5, 24.5, 27.5, 30.5. Between 6.5 and 13.5/2 =10.5

#for MAD of Ad3 we have 19,20,26,31,38,44,59,73,81,84 abs(-41) to get 22, 21, 15, 10, 3, 3, 18, 32, 40, 43 sorted we get- 3, 3, 10, 15, 18, 21, 22, 32, 40, 43 18+21/2=19.5

def MAD(thing):
    result=[]
    med=np.median(thing)
    for i in thing:
        result.append(np.abs(i-med))
    return(np.median(result))

print("The MAD for Ad1 is= ",MAD(SA1))
print("The MAD for Ad2 is= ",MAD(SA2))
print("The MAD for Ad3 is= ",MAD(SA3))

Out[75]:

The MAD for Ad1 is=  8.5
The MAD for Ad2 is=  10.5
The MAD for Ad3 is=  19.5

In [76]:

print("The sample size for Ad1 is= ", len(SA1))
print("The sample size for Ad2 is= ",len(SA2))
print("The sample size for Ad3 is= ",len(SA2))

Out[76]:

The sample size for Ad1 is=  10
The sample size for Ad2 is=  10
The sample size for Ad3 is=  10

In [61]:

#Question 2d
def F_statistics(x): # a function to calculate F-statistic for comparing 3 or more groups
    G_median=np.median(x) # median for data in al groups
    amonggroups=sum(abs(x.median()-G_median))*len(x) # among-group differences
    withingroups=sum(np.sum(abs(x-x.median()))) # within-group differences
    F=amonggroups/withingroups
    return F
Fobs=F_statistics(Sample)
decen_Ad1 = np.array(Sample["Ad1"]) - np.median(Sample["Ad1"]) # de-center data by substracting the median
decen_Ad2 = np.array(Sample["Ad2"]) - np.median(Sample["Ad2"])
decen_Ad3 = np.array(Sample["Ad3"]) - np.median(Sample["Ad3"])
sim=[]
for i in range(10000):
    re_A=np.random.choice(decen_Ad1,10) # resample the data for Ad1 from the de-cemtered Ad1
    re_B=np.random.choice(decen_Ad2,10) # resample the data for Ad2 from the de-cemtered Ad2
    re_C=np.random.choice(decen_Ad3,10) # resample the data for Ad3 from the de-cemtered Ad3
    re_dataframe=pd.DataFrame(np.column_stack((re_A,re_B,re_C))) # store the resampled data in a dataframe
    sim.append(F_statistics(re_dataframe))
plot=sns.displot(data=sim)
plot.set(xlabel="F statistic", ylabel="Count", title="Simulations for NHST")
plt.axvline(Fobs, color="red")
extreme=np.sum(np.array(sim) >= Fobs) # one-tailed p-value for F-statistic
p_value=extreme/10000
p_value

Out[61]:

0.1814

In [70]:

#Question 2e
import matplotlib.pyplot as mtpt

def CI(x,y): #take in two pandaz
    simlist=[]
    differenceMedian=x.median()-y.median() #Find the difference of the medians
    for i in range(10000): #10,000 times
        re_x=np.random.choice(x,len(x)) #take random choice of x
        re_y=np.random.choice(y,len(y)) #take random choice of y
        P_mediandiff=np.median(re_x)-np.median(re_y) #find the median difference of the groups that we just created
        simlist.append(P_mediandiff) #append this new median difference
    simlist #now e have our values
    simlist.sort() #sort them numerically
    Lower=simlist[49] #our lower is the 49th element
    Upper=simlist[9949] #our upper limit is 9950th limit
    CI_L=2*differenceMedian-Upper #find our Upper limit
    CI_R=2*differenceMedian-Lower #find out Lower limit
    results=sns.histplot(data=simlist) #make a hisgogram
    results.set(xlabel="Difference", ylabel="# of instances")   
    mtpt.axvline(differenceMedian, color="red") #have an axis for mediandiffernce
    mtpt.axvline(Lower, color="blue") #have a line for our lower limit
    mtpt.axvline(Upper,color="blue") #have a line for our upper limmit
    print("(",CI_L,CI_R,")") #print out limits
print("the 99% CI of Ad1 and Ad2 groups are:")
CI(SA1, SA2)

Out[70]:

the 99% CI of Ad1 and Ad2 groups are:
( -29.5 18.5 )

In [71]:

print("the 99% CI of Ad1 and Ad3 groups are:")
CI(SA1,SA3)

Out[71]:

the 99% CI of Ad1 and Ad3 groups are:
( -5.0 64.0 )

In [72]:

print("the 99% CI of Ad2 and Ad3 groups are:")
CI(SA2,SA3)

Out[72]:

the 99% CI of Ad2 and Ad3 groups are:
( -1.0 69.0 )

In [0]:

#as all of our 99% confidence intervals have 0 within them we can determine that none of our groups are significantly differnet from others.

In [56]:

#Question 2F
#An Omnibus test for the multi-group comparison proved not significant (p = 0.181), indicating that there were no significant differences among the groups. So, we do not need a post hoc analysis.

#All calculated 99% confidence intervals(Ad1&Ad2: (-29.5,18.5); Ad1&Ad3: (-5.0, 64.5); Ad2&Ad3 (-1.0,69.0)) contain zero, which means that it's no significant different between these three groups. Three different types of advertisements don’t affect mean sales.

#With our high p-value and our confidence intervals having 0's we can conclude that we fail to reject the null hypothesis. Thus the advertisements had no significant change in the monthly income due to the advertisement campaigns.

In [65]:

#Question 2G
#So the p-value of an event is the probability of that event (or more extreme) happening purely by chance (i.e. if the null hypothesis is true). looking at our p-value of .18 there is an 18% chance that our result is within the null hypothesis. Normally a p-value of 0.05 or 0.01 is used as this would be a 5% or 1% chance of happening by chance.
#the confidence interval tells us how much we can expect our values to vary, from the upper to lower confidence interval. For us we calculated a 99% CI that tells us with 99% confidence that our actual median is within said interval. Having a confidence interval that contains 0 indicates that there is no difference between the two groups, as such we can tell that the groups are not statistically different confirming our null hypothesis.
#As such with an 18% chance that our results are due to random chance, and our confidence intervals indicating no difference between the groups we can assuredly say that the groups have no significant difference and as such fail to reject the null hypothesis meaning that the advertisement campaigns have no effect of sales.

In [66]:

#Question 3A
#From the standpoint of sunscreen sellers, false negatives are more harmful than false positives. As false negatives reduce sales.
#Both Type 1 and Type 2 errors cannot cause terrible effects on the human body, as such we know there is no need to pay close attention to false positives; allowing us to use relatively larger ⍺-values like 0.05.

In [67]:

#Question 3B
#Pairwise comparisons were made to see which groups were significantly different from each other. Comparisons were made on the basis of positive omnibus tests
#Each two groups would be compared through big box model,  de-centered two box model, or the rank based model.

In [74]:

#Question 3C
#our p-values listed in order are       .004, 0.017, 0.021, 0.04, 0.23, 0.32.
#the corrected 𝛂* values are .05*k/6 = .00833, .0166, .025, .033, .04166, .05. comparing to the original we see that 0.00833<0.004, 0.017>0.0166 but .021<.025 making the 0.017 result significant.
#As such using the Benjamini-Hochberg method we see that (S1 vs S4), (1 VS S3) and (S2 vs S4) are all significat.

# 0.23 (S1 vs S2), 0.021 (S1 vs S3), 0.017 (S1 vs S4), 0.04 (S2 vs S3), 0.004 (S2 vs S4), and 0.32 (S3 vs S4)

import statsmodels.stats.multitest as smm
pvals = [.004, 0.017, 0.021, 0.04, 0.23, 0.32]
smm.multipletests(pvals, alpha=0.05, method='fdr_bh')

Out[74]:

(array([ True,  True,  True, False, False, False]),
 array([0.024, 0.042, 0.042, 0.06 , 0.276, 0.32 ]),
 0.008512444610847103,
 0.008333333333333333)

In [0]:

#Using the Benjamini-Hochberg correction, we can determine that S1 vs S3, S1 vs S4, and S2 vs S4 are significantly different from each other using ⍺ = 0.05.

Product

Resources

Company