Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
1&2, adde
Q 1a
measure used: proportion of babies born with anomalies
null hypothesis: Any difference in the proportions of babies born to vaccinated mothers with anomalies and babies born to unvaccinated mothers with anomalies is due to pure randomness.
sample size: 6731, number of babies with anomalies born to mothers that have been vaccinated.
1D
The p-value of 0.87 does not allow us to reject the null hypothesis. It means that, if the null hypothesis were true, a difference between the proportions of babies with anomalies in the vaccinated and unvaccinated groups at least as large as the one we observed would arise by pure randomness about 87% of the time. We used an alpha of 0.01, as we usually do in this class, and our two-sided p-value far exceeded it. We used a two-sided p-value because we are interested in differences in either direction: the proportion of babies with anomalies born to vaccinated mothers could be lower or higher than the proportion born to unvaccinated mothers.
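The two-sided test described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name and the counts in the usage lines are hypothetical placeholders, and it assumes the null is simulated by drawing both groups from one pooled ("big box") anomaly rate.

```python
import numpy as np

def two_sided_p_value(x_a, n_a, x_b, n_b, n_sims=10_000, seed=0):
    """Resampling test for a difference in two proportions (two-sided)."""
    rng = np.random.default_rng(seed)
    observed = abs(x_a / n_a - x_b / n_b)
    # Under the null, both groups share one rate: the pooled proportion.
    pooled = (x_a + x_b) / (n_a + n_b)
    # Drawing n babies with replacement from the pooled box yields a
    # binomial count of anomalies, so we sample the counts directly.
    sim_a = rng.binomial(n_a, pooled, size=n_sims) / n_a
    sim_b = rng.binomial(n_b, pooled, size=n_sims) / n_b
    # Two-sided: count simulated differences at least as large as the
    # observed one, in either direction.
    return float(np.mean(np.abs(sim_a - sim_b) >= observed))

# Hypothetical counts, NOT the study's data:
# 45 anomalies in 2000 births vs 66 anomalies in 3000 births.
p = two_sided_p_value(45, 2000, 66, 3000)
print(f"two-sided p-value: {p:.3f}")
```

Taking the absolute value of each simulated difference is what makes the test two-sided; dropping it and comparing signed differences would give a one-sided p-value instead.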
The confidence interval, shown between the green lines on the histogram above (at about 0.0183 and 0.0289), tells us that if we repeated the sampling many times and built an interval like this each time, about 99% of those intervals would contain the true proportion of babies with anomalies born to unvaccinated mothers; our 10,000 simulations put that proportion between 0.0175 and 0.0277. This range surrounds the observed value (the red line), which is consistent with the very high p-value. Since we failed to reject the null hypothesis that the difference in the proportions of babies with anomalies born to vaccinated versus unvaccinated mothers is due to randomness, it makes sense that the confidence interval contains values both higher and lower than the observed value, which further suggests that the difference is due to pure randomness.
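One common way to obtain an interval like this is a percentile bootstrap. The sketch below assumes that approach; the write-up does not say exactly how its interval was computed, and the function name and the counts are hypothetical placeholders rather than the study's numbers.

```python
import numpy as np

def bootstrap_ci_proportion(successes, n, level=0.99, n_sims=10_000, seed=0):
    """Percentile bootstrap confidence interval for a single proportion."""
    rng = np.random.default_rng(seed)
    p_hat = successes / n
    # Resampling n observations with replacement from the sample gives a
    # binomial count with success probability p_hat.
    boot = rng.binomial(n, p_hat, size=n_sims) / n
    # For a 99% interval, cut 0.5% off each tail.
    tail = (1 - level) / 2 * 100
    lower, upper = np.percentile(boot, [tail, 100 - tail])
    return float(lower), float(upper)

# Hypothetical counts, NOT the study's data: 150 anomalies in 6000 births.
lo, hi = bootstrap_ci_proportion(150, 6000)
print(f"99% CI: ({lo:.4f}, {hi:.4f})")
```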
1E
Because the p-value we calculated was so high, which is consistent with the discrepancy in proportions being due to random chance, we would expect the 99% confidence interval for the vaccinated mothers to be very similar to the one for the unvaccinated mothers. Since we fail to reject the null, we have no evidence that the two population proportions differ, so we expect their confidence intervals to be similar.
2f
This simulation produced a p-value of 0.038, which is not statistically significant at an alpha of 0.01, the alpha generally used in this course. We used resampling with replacement because the data were not normally distributed. We ran the simulation 10,000 times, each time drawing random values from the original data set to form a pseudo-cornflakes group and a pseudo-oatbran group, to test the null hypothesis that any difference in cholesterol levels between the cornflakes and oatbran groups was due to pure randomness rather than an effect of either cereal. As the paired slope plot shows, some individuals' cholesterol increased and others' decreased after switching cereals, indicating that the change in cereal did not have a clear effect on the cholesterol levels of the people in the study. The 99% confidence interval for the effect size runs from -0.445 to 1.83; intervals built this way would contain the true effect size about 99% of the time. Since the interval includes 0, the value under the null hypothesis, it agrees with our failure to reject the null hypothesis.
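The interval in this answer could be produced along the following lines. This is a sketch under stated assumptions: it takes the effect size to be a difference in medians, resamples each group separately and ignores the pairing (a paired design might instead resample each person's difference), and uses made-up cholesterol values rather than the study's data. The function name is hypothetical.

```python
import numpy as np

def bootstrap_ci_median_diff(group_a, group_b, level=0.99,
                             n_sims=10_000, seed=0):
    """Percentile bootstrap CI for median(group_a) - median(group_b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Resample each group with replacement, keeping the original sizes.
    boot_a = rng.choice(a, size=(n_sims, a.size), replace=True)
    boot_b = rng.choice(b, size=(n_sims, b.size), replace=True)
    diffs = np.median(boot_a, axis=1) - np.median(boot_b, axis=1)
    tail = (1 - level) / 2 * 100
    lower, upper = np.percentile(diffs, [tail, 100 - tail])
    return float(lower), float(upper)

# Made-up cholesterol values for illustration only.
cornflakes = [4.8, 5.2, 4.4, 6.1, 5.0, 4.7, 5.5, 4.1]
oatbran = [4.3, 5.0, 4.6, 5.4, 4.9, 4.0, 5.1, 4.2]
lower, upper = bootstrap_ci_median_diff(cornflakes, oatbran)
# If the interval contains 0, "no difference" remains plausible.
print(f"99% CI: ({lower:.3f}, {upper:.3f}); contains 0: {lower <= 0 <= upper}")
```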
2g
We created a "big box" that includes all values in the study, unlabeled as to whether they belonged to the cornflake or oatbran group. Then we redrew a fake cornflake group and a fake oatbran group from the big box to see whether the difference in medians (middle values) that we saw in the original data set could happen by pure randomness or whether it was because of the different cereals each person ate. We repeated this 10,000 times in a simulation, keeping track of the difference in medians each time. We found about a 4% chance of seeing a difference as large as the one originally observed by chance alone. While this value seems small at first, it means that about 1 time in 25 a difference this large occurs by chance: in 10,000 simulations, we saw results as extreme as or more extreme than the observed one about 400 times. We then used this knowledge to construct an interval of values that the difference in medians is likely to fall between: it has a 99% chance of falling between -0.445 and 1.83. A difference in medians of 0 would mean that the type of cereal each person ate played no role in their cholesterol levels. Since the interval includes 0, it is possible that the type of cereal plays no part in the differences in cholesterol levels observed.
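The big-box procedure described above can be sketched like this. It is an illustration only: the function name is hypothetical, the cholesterol values are made up rather than taken from the study, and the two-sided comparison is an assumption about how "as extreme or more extreme" was counted.

```python
import numpy as np

def big_box_median_test(group_a, group_b, n_sims=10_000, seed=0):
    """Return (observed median difference, two-sided big-box p-value)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = np.median(a) - np.median(b)
    # The "big box": every value from both groups, with labels dropped.
    box = np.concatenate([a, b])
    # Redraw fake groups of the original sizes, with replacement.
    fake_a = rng.choice(box, size=(n_sims, a.size), replace=True)
    fake_b = rng.choice(box, size=(n_sims, b.size), replace=True)
    sim_diffs = np.median(fake_a, axis=1) - np.median(fake_b, axis=1)
    # Fraction of simulations at least as extreme as what was observed.
    p_value = float(np.mean(np.abs(sim_diffs) >= abs(observed)))
    return float(observed), p_value

# Made-up cholesterol values for illustration only.
cornflakes = [4.8, 5.2, 4.4, 6.1, 5.0, 4.7, 5.5, 4.1]
oatbran = [4.3, 5.0, 4.6, 5.4, 4.9, 4.0, 5.1, 4.2]
obs, p = big_box_median_test(cornflakes, oatbran)
print(f"observed difference in medians: {obs:.3f}, p-value: {p:.3f}")
```

A p-value of about 0.04 from such a run would correspond to roughly 400 of the 10,000 simulated differences being at least as extreme as the observed one.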
As shown in both the work done by hand and in Python, the first three p-values (s2 vs s4, s1 vs s4, and s1 vs s3) remain statistically significant after multiple testing corrections.
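The write-up does not name the correction it used, so the sketch below shows two common options, Bonferroni and Benjamini-Hochberg, purely as an assumption. The p-values are hypothetical placeholders, not the results of the comparisons above, and the alpha of 0.01 is carried over from the earlier questions.

```python
def bonferroni(p_values, alpha=0.01):
    """Flag each p-value that passes the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.01):
    """Flag each p-value that passes the Benjamini-Hochberg procedure."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant

# Hypothetical p-values for six pairwise comparisons (placeholders only).
p_vals = [0.0004, 0.0010, 0.0030, 0.0200, 0.2000, 0.6000]
print("Bonferroni:        ", bonferroni(p_vals))
print("Benjamini-Hochberg:", benjamini_hochberg(p_vals))
```

Bonferroni compares every p-value to the same strict threshold, while Benjamini-Hochberg loosens the threshold for larger ranks, so it can keep more tests significant on the same inputs.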