Kernel: R (R-Project)

Summary Statistics Tutorial

Summary statistics for continuous data fall into two categories: central tendency and dispersion. The mean and median describe central tendency. The standard deviation, variance and range describe the spread of the data.

input.dat <- read.csv('data.csv');
str(input.dat);
'data.frame':	569 obs. of  33 variables:
 $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
 $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
 $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
 $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
 $ area_mean              : num  1001 1326 1203 386 1297 ...
 $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
 $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
 $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
 $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
 $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
 $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
 $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
 $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
 $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
 $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
 $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
 $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
 $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
 $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
 $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
 $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
 $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
 $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
 $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
 $ area_worst             : num  2019 1956 1709 568 1575 ...
 $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
 $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
 $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
 $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
 $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
 $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
 $ X                      : logi  NA NA NA NA NA NA ...

Let's calculate the mean, median, standard deviation, variance and range of the radius_worst variable.

mean(input.dat$radius_worst); # calculates the mean
16.2691898066784
median(input.dat$radius_worst); # calculates the median
14.97
sd(input.dat$radius_worst); # calculates the standard deviation
4.83324158046932
var(input.dat$radius_worst); # calculates the variance
23.3602241751776
range(input.dat$radius_worst); # returns the minimum and maximum
  1. 7.93
  2. 36.04
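
As a small self-contained illustration of how these functions relate to each other, the sketch below uses a made-up vector `x` (hypothetical values, not taken from data.csv). It checks that the variance is the square of the standard deviation and uses `diff()` to turn the minimum and maximum returned by `range()` into a single width.

```r
# Hypothetical measurements, used only for illustration (not from data.csv)
x <- c(12.5, 14.0, 15.5, 17.0, 21.0)

x.mean   <- mean(x)         # arithmetic mean
x.median <- median(x)       # middle value of the sorted data
x.sd     <- sd(x)           # sample standard deviation (n - 1 denominator)
x.var    <- var(x)          # sample variance
x.limits <- range(x)        # c(minimum, maximum)
x.width  <- diff(x.limits)  # range as a single number: maximum - minimum

# the variance is the square of the standard deviation
stopifnot(isTRUE(all.equal(x.var, x.sd^2)))

print(c(mean = x.mean, median = x.median, sd = x.sd, width = x.width))
```

Here the largest value pulls the mean above the median, which is one reason to report both.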

Correlation Tutorial

Correlation measures the strength of the linear relationship between two numeric vectors. The correlation coefficient tells us both the magnitude and the direction of that relationship. It ranges from -1 to +1, where -1 indicates a perfect negative relationship, +1 a perfect positive relationship, and values near 0 little or no linear relationship.
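
To make the coefficient less of a black box, here is a minimal sketch that computes Pearson's r by hand and compares it with `cor()`, whose default method is Pearson. The vectors `x` and `y` are hypothetical values chosen only for illustration.

```r
# Hypothetical paired measurements, used only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson's r: co-variation of x and y scaled by the variation of each
r.manual <- sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r.builtin <- cor(x, y)  # method = "pearson" is the default

stopifnot(isTRUE(all.equal(r.manual, r.builtin)))

# negating one variable flips the sign of r but not its magnitude
r.flipped <- cor(x, -y)

print(c(manual = r.manual, builtin = r.builtin, flipped = r.flipped))
```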

input.dat <- read.csv('data.csv'); # read in our input data
cor(input.dat$smoothness_mean, input.dat$compactness_mean); # correlation between mean smoothness and mean compactness
0.659123215215923

There is a fairly strong positive correlation (about 0.66) between the mean smoothness and mean compactness measures.

cor(input.dat$texture_mean, input.dat$symmetry_mean); # correlation between mean texture and mean symmetry
0.071400980483317

However, there is little to no correlation (about 0.07) between the mean texture and mean symmetry measures.

Hypothesis Testing (Student's T-Test) Tutorial

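To test whether the difference in mean measurements between two groups is significantly different from 0, we can use a t-test to calculate the p-value. Note that R's t.test() does not assume equal variances by default, so it runs Welch's variant of the test rather than the classic Student's t-test.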

input.dat <- read.csv('data.csv'); # read in our input data
# calculates whether there is a difference in the mean of worst smoothness measures between
# patients with benign or malignant tumours
t.test(smoothness_worst ~ diagnosis, input.dat);
	Welch Two Sample t-test

data:  smoothness_worst by diagnosis
t = -10.82, df = 412.57, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02349864 -0.01627284
sample estimates:
mean in group B mean in group M 
      0.1249595       0.1448452 
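
The output is labelled "Welch" because `t.test()` uses `var.equal = FALSE` unless told otherwise. The sketch below, run on simulated data (the vectors `group.b` and `group.m` are hypothetical and not drawn from data.csv), recomputes the Welch t statistic by hand and shows how to request the pooled-variance Student's test instead.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical groups, simulated only for illustration
group.b <- rnorm(30, mean = 0.125, sd = 0.02)
group.m <- rnorm(20, mean = 0.145, sd = 0.02)

# Welch t statistic: difference in means over its unpooled standard error
se <- sqrt(var(group.b) / length(group.b) + var(group.m) / length(group.m))
t.manual <- (mean(group.b) - mean(group.m)) / se

welch   <- t.test(group.b, group.m)                    # var.equal = FALSE by default
student <- t.test(group.b, group.m, var.equal = TRUE)  # pooled-variance Student's test

stopifnot(isTRUE(all.equal(t.manual, unname(welch$statistic))))

# Welch adjusts the degrees of freedom; Student's test uses n1 + n2 - 2
print(c(welch.df = unname(welch$parameter), student.df = unname(student$parameter)))
```

When the two groups have similar variances the two tests tend to agree closely; Welch's version is the safer default when they do not.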

Correlation Plot Tutorial

# load necessary packages
library(ggplot2);
library(corrplot);
corrplot 0.84 loaded
input.dat <- read.csv('data.csv'); # read in our input data
# identify which column has NAs
colnames(input.dat)[apply(input.dat, 2, anyNA)];

# remove feature that has missing data
input.dat <- subset(input.dat, select = -c(X));

# convert diagnosis to factors
input.dat$diagnosis <- as.factor(input.dat$diagnosis);
'X'
# calculate feature-feature correlation
corr.matrix <- cor(input.dat[,3:ncol(input.dat)]);

# create a heatmap of the correlation matrix using hierarchical clustering
# order = "hclust": using hierarchical clustering order
# tl.cex = 0.5: setting the size of the font to 0.5
# addrect = 8: adding 8 rectangles to correlation plot
corrplot(corr.matrix, order = "hclust", tl.cex = 0.5, addrect = 8);
Image in a Jupyter notebook
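
The cleaning step matters because `cor()` returns NA for any pair involving a column with missing values under its default settings. A minimal base-R sketch of the same workflow on a made-up data frame follows. The name `toy` and its values are hypothetical, and `vapply()` is used here as an alternative to `apply()` that avoids converting the data frame to a matrix.

```r
# Hypothetical data frame, used only for illustration; X mimics the empty column
toy <- data.frame(
    a = c(1, 2, 3, 4, 5),
    b = c(2, 4, 6, 8, 11),
    c = c(5, 3, 4, 1, 2),
    X = NA
)

# names of columns containing at least one missing value
na.cols <- names(toy)[vapply(toy, anyNA, logical(1))]

# keep only the complete columns, then compute pairwise correlations
toy.clean <- toy[, !(names(toy) %in% na.cols), drop = FALSE]
toy.corr <- cor(toy.clean)

print(na.cols)
print(round(toy.corr, 2))
```

The resulting matrix is symmetric with ones on the diagonal, which is the form `corrplot()` is given above.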

Other Plots Tutorial

library(ggplot2);
input.dat <- read.csv('data.csv'); # read in our input data
qplot(
    input.dat$texture_mean, # plotting mean texture
    bins = 100, # break continuous data into 100 bins for visualization
    ylab = 'Count', # changing y-axis and x-axis labels
    xlab = ''
    );
Image in a Jupyter notebook
ggplot(
    input.dat,
    aes(x=texture_mean) # plotting mean texture
    ) + geom_density(aes(fill = '', colour = '') # enabling fill and colour of density curve but no labels
    ) + scale_fill_manual(values = 'darkgray' # changing fill colour to gray
    ) + scale_colour_manual(values = 'darkgray') +
    theme(legend.position="none"); # removing legend
Image in a Jupyter notebook
ggplot(
    input.dat,
    aes(x=diagnosis) # plotting the number of observations in each diagnosis category
    ) + geom_bar(stat="count"); # bar chart of counts
Image in a Jupyter notebook
ggplot(
    input.dat,
    aes(y = texture_mean, x = diagnosis) # comparing mean texture between diagnosis categories
    ) + geom_boxplot(); # boxplot
Image in a Jupyter notebook
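
To see roughly which numbers sit behind the bar chart and the boxplot, the sketch below tabulates counts and quartiles per category in base R. The data frame `toy` is hypothetical and only borrows the column names used above. The exact whisker and outlier rules of `geom_boxplot()` are not reproduced here.

```r
# Hypothetical data, used only for illustration
toy <- data.frame(
    diagnosis = factor(c("B", "B", "B", "B", "M", "M", "M", "M")),
    texture_mean = c(14, 16, 18, 20, 19, 21, 23, 25)
)

# observations per category: the bar heights in a count bar chart
group.counts <- table(toy$diagnosis)

# minimum, quartiles and maximum per category: the core of a boxplot
group.quartiles <- lapply(split(toy$texture_mean, toy$diagnosis), quantile)

print(group.counts)
print(group.quartiles)
```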