Intro to Data Analytics

Summary Statistics Tutorial

Summary statistics for continuous data fall into two categories: central tendency and dispersion. The mean and median describe central tendency. The standard deviation, variance and range describe the spread (dispersion) of the data.

input.dat <- read.csv('data.csv');
str(input.dat);
'data.frame': 569 obs. of 33 variables:
 $ id                     : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
 $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ radius_mean            : num 18 20.6 19.7 11.4 20.3 ...
 $ texture_mean           : num 10.4 17.8 21.2 20.4 14.3 ...
 $ perimeter_mean         : num 122.8 132.9 130 77.6 135.1 ...
 $ area_mean              : num 1001 1326 1203 386 1297 ...
 $ smoothness_mean        : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
 $ compactness_mean       : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
 $ concavity_mean         : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
 $ concave.points_mean    : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
 $ symmetry_mean          : num 0.242 0.181 0.207 0.26 0.181 ...
 $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
 $ radius_se              : num 1.095 0.543 0.746 0.496 0.757 ...
 $ texture_se             : num 0.905 0.734 0.787 1.156 0.781 ...
 $ perimeter_se           : num 8.59 3.4 4.58 3.44 5.44 ...
 $ area_se                : num 153.4 74.1 94 27.2 94.4 ...
 $ smoothness_se          : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
 $ compactness_se         : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
 $ concavity_se           : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
 $ concave.points_se      : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
 $ symmetry_se            : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
 $ fractal_dimension_se   : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
 $ radius_worst           : num 25.4 25 23.6 14.9 22.5 ...
 $ texture_worst          : num 17.3 23.4 25.5 26.5 16.7 ...
 $ perimeter_worst        : num 184.6 158.8 152.5 98.9 152.2 ...
 $ area_worst             : num 2019 1956 1709 568 1575 ...
 $ smoothness_worst       : num 0.162 0.124 0.144 0.21 0.137 ...
 $ compactness_worst      : num 0.666 0.187 0.424 0.866 0.205 ...
 $ concavity_worst        : num 0.712 0.242 0.45 0.687 0.4 ...
 $ concave.points_worst   : num 0.265 0.186 0.243 0.258 0.163 ...
 $ symmetry_worst         : num 0.46 0.275 0.361 0.664 0.236 ...
 $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
 $ X                      : logi NA NA NA NA NA NA ...

Let's calculate the mean, median, standard deviation, variance and range of the radius_worst variable.

mean(input.dat$radius_worst); # calculates the mean
16.2691898066784
median(input.dat$radius_worst); # calculates the median
14.97
sd(input.dat$radius_worst); # calculates the standard deviation
4.83324158046932
var(input.dat$radius_worst); # calculates the variance
23.3602241751776
range(input.dat$radius_worst); # returns the minimum and maximum (not their difference)
  1. 7.93
  2. 36.04
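
To make the relationships between these statistics concrete, here is a minimal sketch on a small made-up vector. The values in `x` and all variable names are assumptions chosen for illustration and are not taken from data.csv. It shows that the standard deviation is the square root of the variance, that `var()` uses the sample (n - 1) denominator, and that the width of the range can be obtained with `diff(range(x))`.

```r
# Toy vector (hypothetical values, not taken from data.csv)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

x.mean   <- mean(x)        # arithmetic mean
x.median <- median(x)      # middle value of the sorted data
x.var    <- var(x)         # sample variance (divides by n - 1)
x.sd     <- sd(x)          # sample standard deviation
x.range  <- range(x)       # c(minimum, maximum)
x.width  <- diff(x.range)  # maximum - minimum

# The sample variance written out from its definition
x.var.manual <- sum((x - x.mean)^2) / (length(x) - 1)

# The standard deviation is the square root of the variance
sd.matches.var <- isTRUE(all.equal(x.sd, sqrt(x.var)))

print(c(mean = x.mean, median = x.median, width = x.width))
print(c(var = x.var, var.manual = x.var.manual, sd = x.sd))
```

Because the mean is sensitive to extreme values while the median is not, comparing the two (as with radius_worst above, where the mean exceeds the median) gives a quick hint that a distribution may be right-skewed.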

Correlation Tutorial

Correlation measures the strength of the linear relationship between two numeric vectors. The correlation coefficient conveys both the magnitude and the direction of the relationship. It ranges from -1 to 1, with -1 indicating a perfect negative relationship, +1 a perfect positive relationship, and 0 no linear relationship. By default, cor() computes the Pearson correlation coefficient.

input.dat <- read.csv('data.csv'); # read in our input data
cor(input.dat$smoothness_mean, input.dat$compactness_mean); # correlation between mean smoothness and mean compactness
0.659123215215923

With a coefficient of about 0.66, there is a fairly strong positive correlation between the mean smoothness and mean compactness measures.

cor(input.dat$texture_mean, input.dat$symmetry_mean); # correlation between mean texture and mean symmetry
0.071400980483317

However, with a coefficient of about 0.07, there is little to no correlation between the mean texture and mean symmetry measures.
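
As a rough illustration of what the coefficient measures, the sketch below uses small made-up vectors (all names and values here are assumptions, not part of data.csv). It checks the two extremes of the scale and then reproduces `cor()` from the definition of Pearson's r, the covariance divided by the product of the standard deviations.

```r
# Toy vectors (hypothetical values, not taken from data.csv)
x      <- c(1, 2, 3, 4, 5)
y.up   <- 2 * x + 1    # exact increasing linear function of x
y.down <- -3 * x + 10  # exact decreasing linear function of x
z      <- c(2, 1, 4, 3, 5)  # increases with x, but not perfectly

r.up   <- cor(x, y.up)    # perfect positive relationship
r.down <- cor(x, y.down)  # perfect negative relationship

# Pearson's r from its definition: cov(x, z) / (sd(x) * sd(z))
r.manual  <- cov(x, z) / (sd(x) * sd(z))
r.builtin <- cor(x, z)

print(c(r.up = r.up, r.down = r.down, r.manual = r.manual, r.builtin = r.builtin))
```

Pearson's r only captures linear association, so a coefficient near 0 does not rule out a nonlinear relationship between the two variables.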

Hypothesis Testing (Student's T-Test) Tutorial

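To test whether the difference in mean measurements between two groups is significantly different from 0, we can use a t-test to calculate the p-value. Note that R's t.test() does not assume equal variances by default, so it runs Welch's variant of the two-sample t-test, as the output below shows.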

input.dat <- read.csv('data.csv'); # read in our input data
# tests whether the mean of the worst smoothness measures differs between
# patients with benign and malignant tumours
t.test(smoothness_worst ~ diagnosis, data = input.dat);
	Welch Two Sample t-test

data:  smoothness_worst by diagnosis
t = -10.82, df = 412.57, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02349864 -0.01627284
sample estimates:
mean in group B mean in group M 
      0.1249595       0.1448452 
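
To show where the t, df and p-value in such output come from, here is a minimal sketch that recomputes them by hand on simulated data. The two groups, their sizes and their parameters are assumptions made up for this example; they are not the tutorial's data. The degrees of freedom use the Welch-Satterthwaite approximation, which is why df is generally not a whole number.

```r
# Simulated groups (hypothetical; not the tumour data)
set.seed(1)
group.a <- rnorm(30, mean = 0.12, sd = 0.02)
group.b <- rnorm(25, mean = 0.14, sd = 0.03)

n.a <- length(group.a)
n.b <- length(group.b)

# Squared standard error of each group mean
se2.a <- var(group.a) / n.a
se2.b <- var(group.b) / n.b

# Welch t statistic: difference in means over its standard error
t.manual <- (mean(group.a) - mean(group.b)) / sqrt(se2.a + se2.b)

# Welch-Satterthwaite approximation of the degrees of freedom
df.manual <- (se2.a + se2.b)^2 /
  (se2.a^2 / (n.a - 1) + se2.b^2 / (n.b - 1))

# Two-sided p-value from the t distribution
p.manual <- 2 * pt(-abs(t.manual), df = df.manual)

# t.test() uses var.equal = FALSE by default, i.e. Welch's test
result <- t.test(group.a, group.b)

print(c(t = t.manual, df = df.manual, p = p.manual))
print(result)
```

Passing `var.equal = TRUE` to `t.test()` would instead run the classical Student's t-test with a pooled variance estimate.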

Correlation Plot Tutorial

# load necessary packages
library(ggplot2);
library(corrplot);
corrplot 0.84 loaded
input.dat <- read.csv('data.csv'); # read in our input data
# identify which column has NAs
colnames(input.dat)[apply(input.dat, 2, anyNA)]; 

# remove feature that has missing data
input.dat <- subset(input.dat, select = -c(X));

# convert diagnosis to factors
input.dat$diagnosis <- as.factor(input.dat$diagnosis);
'X'
# calculate feature-feature correlation
corr.matrix <- cor(input.dat[,3:ncol(input.dat)]);

# create a heatmap of the correlation matrix using hierarchical clustering
# order = "hclust": order the features by hierarchical clustering
# tl.cex = 0.5: set the size of the text labels to 0.5
# addrect = 8: draw 8 rectangles around the clusters in the correlation plot
corrplot(corr.matrix, order = "hclust", tl.cex = 0.5, addrect = 8);
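
The preparation steps above (finding columns with missing values, dropping them, and building the correlation matrix) can be tried without data.csv or corrplot. The sketch below uses a small made-up data frame; `toy` and all of its columns are hypothetical, with an all-NA column `X` standing in for the one in the real data.

```r
# Toy data frame (hypothetical values); X mimics the all-NA column above
toy <- data.frame(
  id = 1:6,
  a  = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  b  = c(2.2, 3.9, 6.1, 8.0, 9.7, 12.3),
  c  = c(5.0, 3.1, 4.4, 1.2, 2.8, 0.9),
  X  = NA
)

# Names of the columns that contain at least one NA
na.cols <- names(toy)[sapply(toy, anyNA)]

# Drop the identifier and the NA columns, keeping only numeric features
toy.clean <- toy[, !(names(toy) %in% c("id", na.cols))]

# Feature-feature Pearson correlation matrix
toy.corr <- cor(toy.clean)

print(na.cols)
print(round(toy.corr, 2))

# A correlation matrix is symmetric with ones on the diagonal
print(isSymmetric(toy.corr))
```

Dropping the NA column first matters because `cor()` returns NA for any pair involving missing values unless its `use` argument is changed from the default.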