Wine Data Investigation

  • Compare distributions of high and low quality wines to determine important measurements
  • Self documenting plots
  • Save and document workflow

Raw data from the UC Irvine Machine Learning Repository

https://archive.ics.uci.edu/ml/datasets/wine

Read the semicolon-delimited CSV file into a data frame.

In [1]:
wine_data = read.table('winequality-red.csv', header=TRUE, sep=';')
In [2]:
colnames(wine_data)
Out[2]:
  1. 'fixed.acidity'
  2. 'volatile.acidity'
  3. 'citric.acid'
  4. 'residual.sugar'
  5. 'chlorides'
  6. 'free.sulfur.dioxide'
  7. 'total.sulfur.dioxide'
  8. 'density'
  9. 'pH'
  10. 'sulphates'
  11. 'alcohol'
  12. 'quality'
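
The dotted column names above are what read.table produces when a header contains spaces. A minimal sketch of that behaviour, assuming a header written with spaces and quotes; the tiny temporary file and the name demo are made up for illustration and are not the real data:

```r
# Hypothetical stand-in for winequality-red.csv: a tiny semicolon-delimited file
tmp <- tempfile(fileext = ".csv")
writeLines(c('"fixed acidity";"pH";"quality"',
             '7.4;3.51;5',
             '7.8;3.20;8'), tmp)

demo <- read.table(tmp, header = TRUE, sep = ";")

# read.table converts header text into syntactically valid R names,
# so "fixed acidity" becomes fixed.acidity
colnames(demo)

# str() shows the type read.table inferred for each column
str(demo)
```

Passing check.names=FALSE to read.table would keep the original header text, at the cost of needing backticks or quoted names when selecting columns.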
In [3]:
colnames(wine_data[-12])
Out[3]:
  1. 'fixed.acidity'
  2. 'volatile.acidity'
  3. 'citric.acid'
  4. 'residual.sugar'
  5. 'chlorides'
  6. 'free.sulfur.dioxide'
  7. 'total.sulfur.dioxide'
  8. 'density'
  9. 'pH'
  10. 'sulphates'
  11. 'alcohol'
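
Dropping column 12 by position works as long as quality stays the last column. A small sketch of a name-based alternative, using a made-up three-column data frame (demo) in place of wine_data:

```r
# Made-up stand-in for wine_data with the same "quality last" layout
demo <- data.frame(pH = c(3.5, 3.2), alcohol = c(9.4, 12.1), quality = c(5L, 8L))

# drop by position, as in the cell above
by_position <- colnames(demo[-3])

# drop by name, which does not depend on column order
by_name <- setdiff(colnames(demo), "quality")

identical(by_position, by_name)
```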

The following steps create two new data frames, each a subset of the full data set. One contains all wines with quality greater than or equal to eight, and the other all wines with quality less than or equal to four. We'll compare the distribution of each measurement between the two groups to see whether that measurement is related to quality.

In [4]:
# high quality wines
hi_qual = wine_data[ which(wine_data[,'quality'] >= 8), ]

# low quality wines
lo_qual = wine_data[ which(wine_data[,'quality'] <= 4), ]
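
Before comparing the groups it helps to know how many wines landed in each one, since small groups give noisy density estimates. A rough sketch on made-up data (demo, hi and lo are illustrative names; the scores are invented):

```r
# Made-up stand-in for wine_data; only the quality column matters here
set.seed(1)
demo <- data.frame(alcohol = round(rnorm(12, 10, 1), 1),
                   quality = c(3, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 5))

# logical indexing; equivalent to the which() form when quality has no NAs
hi <- demo[demo$quality >= 8, ]
lo <- demo[demo$quality <= 4, ]

# how many wines fall at each score, and in each group
table(demo$quality)
c(high = nrow(hi), low = nrow(lo))
```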

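Save in R's native serialized (RDS) format, which is useful for sharing or future work.

In [5]:
# saveRDS writes a single object per .rds file; reload it with readRDS()
saveRDS(hi_qual, file="highquality-red.rds")
saveRDS(lo_qual, file="lowquality-red.rds")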
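
Files with an .rds extension are conventionally written with saveRDS() and read back with readRDS(), while save() writes the RData format that load() reads. A minimal round-trip sketch, assuming a made-up two-row data frame (demo_hi) and a temporary path rather than the real files:

```r
# Made-up stand-in for hi_qual
demo_hi <- data.frame(citric.acid = c(0.45, 0.39), quality = c(8L, 8L))

# tempfile() is used only so the sketch leaves no files behind
path <- tempfile(fileext = ".rds")
saveRDS(demo_hi, file = path)

# readRDS returns the object itself, so it can be bound to any name
restored <- readRDS(path)
identical(demo_hi, restored)
```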

Compare distributions for several columns of data in the list.

In [6]:
# compare parameters in the list
# compare = c('fixed.acidity', 'volatile.acidity', 'citric.acid', 'residual.sugar', 'chlorides', 'pH', 'sulphates', 'alcohol')
# compare all columns except the last
compare = colnames(wine_data[-12])
In [7]:
# arrange the plots in rows and columns
par(mfrow=c(4,3))

# repeat the plot for every column in the list
for (col in compare) {
    # calculate density before plotting, to assist in finding min, max, axes, etc.
    dens_hi = density(hi_qual[,col])
    dens_lo = density(lo_qual[,col])

    # find the taller of the two peaks so both curves fit on the y axis
    max_hi = max(dens_hi$y)
    max_lo = max(dens_lo$y)
    y_max = max(c(max_hi, max_lo))   # named y_max to avoid masking base::max

    # plot
    plot(dens_hi, ylim=c(0, y_max), col="blue", main="Quality Comparison", xlab=col)
    lines(dens_lo, col="red", lty=2)
    legend( "topright", inset=0.02, legend=c("High", "Low"),
      col=c("blue", "red"), lty=1:2, cex=0.6)
}
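Out[7]:
[Figure: grid of density plots, one panel per measurement, with high quality in solid blue and low quality in dashed red]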

Repeat the plot, but save rather than display.

In [8]:
png("winecomparison.png")
par(mfrow=c(4,3))

for (col in compare) {
    dens_hi = density(hi_qual[,col])
    dens_lo = density(lo_qual[,col])

    max_hi = max(dens_hi$y)
    max_lo = max(dens_lo$y)
    y_max = max(c(max_hi, max_lo))   # named y_max to avoid masking base::max

    plot(dens_hi, ylim=c(0, y_max), col="blue", main="Quality Comparison", xlab=col)
    lines(dens_lo, col="red", lty=2)
    legend( "topright", inset=0.02, legend=c("High", "Low"),
      col=c("blue", "red"), lty=1:2, cex=0.6)
}
dev.off()
Out[8]:
png: 2
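
The display and save cells repeat the same loop, so one option is to wrap it in a function and call it once per output device. This is only a sketch: plot_quality_comparison is a hypothetical helper name, the data frames are made up, and the png size is an arbitrary choice.

```r
# Hypothetical helper: draws one density-comparison panel per column
plot_quality_comparison <- function(hi, lo, cols) {
    old_par <- par(mfrow = c(4, 3))
    on.exit(par(old_par))   # restore the previous layout when done
    for (col in cols) {
        dens_hi <- density(hi[, col])
        dens_lo <- density(lo[, col])
        y_max <- max(dens_hi$y, dens_lo$y)
        plot(dens_hi, ylim = c(0, y_max), col = "blue",
             main = "Quality Comparison", xlab = col)
        lines(dens_lo, col = "red", lty = 2)
        legend("topright", inset = 0.02, legend = c("High", "Low"),
               col = c("blue", "red"), lty = 1:2, cex = 0.6)
    }
    invisible(NULL)
}

# Made-up data standing in for hi_qual and lo_qual
set.seed(42)
demo_hi <- data.frame(alcohol = rnorm(20, 12, 1), pH = rnorm(20, 3.3, 0.1))
demo_lo <- data.frame(alcohol = rnorm(60, 10, 1), pH = rnorm(60, 3.4, 0.1))

# Save to a temporary file; in the notebook, calling the helper
# without png()/dev.off() would display the plots inline instead
out_file <- tempfile(fileext = ".png")
png(out_file, width = 900, height = 1200)
plot_quality_comparison(demo_hi, demo_lo, c("alcohol", "pH"))
dev.off()
```

With the real data the call would presumably be plot_quality_comparison(hi_qual, lo_qual, compare), once for display and once between png() and dev.off().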
The citric acid distributions for the two groups certainly look different, which suggests this measurement matters. Let's try a t-test on the two groups.
In [9]:
t.test(hi_qual[,'citric.acid'], lo_qual[,'citric.acid'])
Out[9]:
	Welch Two Sample t-test

data:  hi_qual[, "citric.acid"] and lo_qual[, "citric.acid"]
t = 4.042, df = 28.375, p-value = 0.0003683
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1073205 0.3276002
sample estimates:
mean of x mean of y 
0.3911111 0.1736508 
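
The same test could be run on every measurement instead of one picked by eye. The sketch below is one way to do that, not part of the original analysis: the data are made up, and the Holm correction is an assumed choice for handling several tests at once.

```r
# Made-up data standing in for hi_qual and lo_qual (measurement columns only)
set.seed(7)
demo_hi <- data.frame(citric.acid = rnorm(20, 0.39, 0.2),
                      alcohol     = rnorm(20, 12.1, 1.2))
demo_lo <- data.frame(citric.acid = rnorm(60, 0.17, 0.2),
                      alcohol     = rnorm(60, 10.2, 0.9))

cols <- colnames(demo_hi)

# Welch t-test per column; keep only the p-value
p_raw <- sapply(cols, function(col) t.test(demo_hi[, col], demo_lo[, col])$p.value)

# Holm correction, since several tests are run at once
p_adj <- p.adjust(p_raw, method = "holm")

data.frame(p_raw, p_adj)
```

With the real data, cols would presumably be the compare vector, and demo_hi and demo_lo would be hi_qual and lo_qual.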