Kernel: R (R-Project)
Cancer Classifier
Part 1: Load all the required libraries
In [10]:
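The code for this cell is not preserved in the export; below is a minimal sketch consistent with the startup messages that follow. The `caret` attachment is inferred from the lattice/ggplot2 messages, and the load order explains why ggplot2's `margin` masks the one from randomForest.

```r
library(randomForest)   # random forest classifier (Part 6)
library(caret)          # data partitioning and accuracy metrics; loads lattice and ggplot2
```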
Loading required package: lattice
Loading required package: ggplot2
Attaching package: ‘ggplot2’
The following object is masked from ‘package:randomForest’:
margin
Part 2: Read and prepare the dataset
In [2]:
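The source of this cell is not shown; a plausible sketch, assuming the Wisconsin breast cancer data is available as a local CSV (the file name is an assumption):

```r
# Read the Wisconsin breast cancer data: 30 numeric measurements plus a
# diagnosis column (B = benign, M = malignant). File name is assumed.
data <- read.csv("breast_cancer.csv", stringsAsFactors = TRUE)
data$diagnosis <- as.factor(data$diagnosis)   # ensure the label is a factor
head(data)
```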
radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | symmetry_mean | fractal_dimension_mean | ⋯ | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst | diagnosis |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ⋯ | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | M |
20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ⋯ | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | M |
19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ⋯ | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | M |
11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ⋯ | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | M |
20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ⋯ | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | M |
12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.1578 | 0.08089 | 0.2087 | 0.07613 | ⋯ | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 | M |
Part 3: Prepare the train and test data
In [3]:
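A sketch of a stratified 70/30 split using caret's `createDataPartition`; the seed and the exact call are assumptions, chosen to match the 398/171 row counts reported below.

```r
set.seed(123)                                   # seed value is an assumption
idx   <- createDataPartition(data$diagnosis, p = 0.7, list = FALSE)
train <- data[idx, ]
test  <- data[-idx, ]
dim(train)   # 398 rows x 31 columns
dim(test)    # 171 rows x 31 columns
```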
- 398
- 31
- 171
- 31
Part 4a: Logistic Regression | Model Training
Now that we have created our training and testing datasets, we can start training models. We will start with a simple logistic regression model that uses a single feature, radius_mean.
In [4]:
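The call is reconstructed from the `Call:` line in the summary output below; the object name `model1` is an assumption.

```r
model1 <- glm(diagnosis ~ radius_mean, family = binomial, data = train)
summary(model1)
```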
Call:
glm(formula = diagnosis ~ radius_mean, family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4735 -0.4527 -0.1456 0.1361 2.8655
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -15.3566 1.6072 -9.555 <2e-16 ***
radius_mean 1.0290 0.1119 9.200 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 523.17 on 397 degrees of freedom
Residual deviance: 223.77 on 396 degrees of freedom
AIC: 227.77
Number of Fisher Scoring iterations: 6
Now let's add a second feature, texture_mean.
In [5]:
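As before, reconstructed from the `Call:` line in the output (the object name `model2` is an assumption).

```r
model2 <- glm(diagnosis ~ radius_mean + texture_mean, family = binomial, data = train)
summary(model2)
```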
Call:
glm(formula = diagnosis ~ radius_mean + texture_mean, family = binomial,
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0280 -0.3640 -0.1176 0.1032 2.7925
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -19.43883 2.08013 -9.345 < 2e-16 ***
radius_mean 1.03511 0.11899 8.699 < 2e-16 ***
texture_mean 0.20180 0.04352 4.637 3.53e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 523.17 on 397 degrees of freedom
Residual deviance: 200.01 on 395 degrees of freedom
AIC: 206.01
Number of Fisher Scoring iterations: 7
Let's add one more feature, perimeter_mean.
In [6]:
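Reconstructed from the `Call:` line in the output below (the object name `model3` is an assumption).

```r
model3 <- glm(diagnosis ~ radius_mean + texture_mean + perimeter_mean,
              family = binomial, data = train)
summary(model3)
```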
Call:
glm(formula = diagnosis ~ radius_mean + texture_mean + perimeter_mean,
family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.10313 -0.23575 -0.08194 0.06157 3.14037
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -17.92977 2.37981 -7.534 4.92e-14 ***
radius_mean -6.18087 1.19503 -5.172 2.31e-07 ***
texture_mean 0.22497 0.05435 4.139 3.49e-05 ***
perimeter_mean 1.08816 0.18610 5.847 5.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 523.17 on 397 degrees of freedom
Residual deviance: 146.06 on 394 degrees of freedom
AIC: 154.06
Number of Fisher Scoring iterations: 7
What happens if we fit all the features?
In [7]:
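Fitting all 30 features, reconstructed from the `Call:` line below (the object name `model4` is an assumption). Note the warnings that follow: with this many correlated predictors the training data is perfectly separated, so the coefficient estimates and their standard errors are not meaningful.

```r
model4 <- glm(diagnosis ~ ., family = binomial, data = train)
summary(model4)
```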
Warning message:
“glm.fit: algorithm did not converge”
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
Call:
glm(formula = diagnosis ~ ., family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.326e-04 -2.000e-08 -2.000e-08 2.000e-08 5.132e-04
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.135e+02 1.824e+06 0.000 1.000
radius_mean -1.700e+02 4.607e+05 0.000 1.000
texture_mean 1.614e+01 7.689e+03 0.002 0.998
perimeter_mean -9.951e+00 5.647e+04 0.000 1.000
area_mean 1.001e+00 1.390e+03 0.001 0.999
smoothness_mean -1.510e+03 4.299e+06 0.000 1.000
compactness_mean -3.167e+03 2.812e+06 -0.001 0.999
concavity_mean 1.223e+03 1.656e+06 0.001 0.999
concave.points_mean 7.190e+03 1.858e+06 0.004 0.997
symmetry_mean 1.585e+03 2.449e+06 0.001 0.999
fractal_dimension_mean -4.329e+03 1.345e+07 0.000 1.000
radius_se 8.921e+02 2.075e+05 0.004 0.997
texture_se -1.875e+00 2.243e+04 0.000 1.000
perimeter_se -1.063e+02 8.032e+04 -0.001 0.999
area_se -1.092e+00 4.558e+03 0.000 1.000
smoothness_se -3.885e+04 8.735e+06 -0.004 0.996
compactness_se 6.737e+03 5.579e+06 0.001 0.999
concavity_se -6.120e+03 2.281e+06 -0.003 0.998
concave.points_se 4.000e+04 5.828e+06 0.007 0.995
symmetry_se -1.191e+03 5.073e+06 0.000 1.000
fractal_dimension_se -9.371e+04 2.451e+07 -0.004 0.997
radius_worst 9.169e+00 1.039e+05 0.000 1.000
texture_worst 2.808e+00 8.918e+03 0.000 1.000
perimeter_worst 1.459e+01 2.528e+04 0.001 1.000
area_worst 3.234e-01 5.441e+02 0.001 1.000
smoothness_worst 2.641e+03 1.988e+06 0.001 0.999
compactness_worst -3.039e+02 1.448e+06 0.000 1.000
concavity_worst 7.525e+02 8.701e+05 0.001 0.999
concave.points_worst -3.675e+03 5.892e+05 -0.006 0.995
symmetry_worst 4.836e+02 1.066e+06 0.000 1.000
fractal_dimension_worst 7.385e+03 5.186e+06 0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5.2317e+02 on 397 degrees of freedom
Residual deviance: 7.4831e-07 on 367 degrees of freedom
AIC: 62
Number of Fisher Scoring iterations: 25
Part 4b: Logistic Regression | Prediction
Now that we have examined the model fit and error profile, how well does each of the models above predict on the test dataset? We will use the models trained in the previous section to predict the class labels in the test set.
In [8]:
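The prediction cell is not preserved; a sketch of one way to turn the fitted probabilities into B/M labels at a 0.5 cutoff, matching the pred1.f ... pred4.f factors summarized below (the helper function and the cutoff are assumptions).

```r
# predict() with type = "response" returns P(diagnosis = "M"); threshold at 0.5
to_class <- function(model, newdata, cutoff = 0.5) {
  p <- predict(model, newdata = newdata, type = "response")
  factor(ifelse(p > cutoff, "M", "B"), levels = c("B", "M"))
}
pred1.f <- to_class(model1, test)
pred2.f <- to_class(model2, test)
pred3.f <- to_class(model3, test)
pred4.f <- to_class(model4, test)
summary(pred1.f)
summary(pred2.f)
summary(pred3.f)
summary(pred4.f)
```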
pred1.f
B M
121 50
pred2.f
B M
122 49
pred3.f
B M
114 57
pred4.f
B M
103 68
Part 4c: Logistic Regression | Accuracy
Let's quantify the accuracy of the predictions on the test set.
In [11]:
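The output of this cell is not included in the export; a sketch of how the accuracy could be computed, using the simple agreement rate plus caret's `confusionMatrix` for a fuller breakdown:

```r
# Fraction of test cases where the predicted label matches the true diagnosis
mean(pred1.f == test$diagnosis)
mean(pred2.f == test$diagnosis)
mean(pred3.f == test$diagnosis)
mean(pred4.f == test$diagnosis)
confusionMatrix(pred4.f, test$diagnosis)   # also reports sensitivity and specificity
```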
Part 5: Decision Trees
In this exercise, we will visualize a single decision tree using the R library "party".
In [55]:
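The cell source is not shown; a sketch using party's `ctree`, with the formula assumed (a small tree on the features used above keeps the plot readable):

```r
library(party)   # conditional-inference trees
tree_model <- ctree(diagnosis ~ radius_mean + texture_mean + perimeter_mean,
                    data = train)
plot(tree_model)   # draws the tree with split conditions and leaf class proportions
```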
Part 6a: Random Forest | Training
In the previous exercise, we visualized a single decision tree. A Random Forest is an ensemble of many decision trees. Let's train Random Forests with varying numbers of trees and see how the number of trees affects the accuracy of the model.
In [56]:
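A sketch of training several forests with an increasing `ntree`; the specific tree counts and object names are assumptions.

```r
set.seed(123)
rf_10  <- randomForest(diagnosis ~ ., data = train, ntree = 10)
rf_50  <- randomForest(diagnosis ~ ., data = train, ntree = 50)
rf_100 <- randomForest(diagnosis ~ ., data = train, ntree = 100)
rf_500 <- randomForest(diagnosis ~ ., data = train, ntree = 500)
rf_500   # printing the model reports the out-of-bag (OOB) error estimate
```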
Part 6b: Random Forest | Prediction
We can use the models trained above to predict on the test set, using the same predict() function we used for logistic regression.
In [57]:
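A sketch of the prediction step; for a classification forest, `predict()` returns class labels directly, matching the pred5 ... pred8 factors summarized below (the mapping of model to prediction name is an assumption).

```r
pred5 <- predict(rf_10,  newdata = test)
pred6 <- predict(rf_50,  newdata = test)
pred7 <- predict(rf_100, newdata = test)
pred8 <- predict(rf_500, newdata = test)
summary(pred5)
summary(pred6)
summary(pred7)
summary(pred8)
```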
pred5
B M
112 59
pred6
B M
111 60
pred7
B M
111 60
pred8
B M
110 61
Part 6c: Random Forest | Accuracy
In [58]:
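As in Part 4c, the output is not preserved; a sketch of computing test-set accuracy for each forest:

```r
mean(pred5 == test$diagnosis)
mean(pred6 == test$diagnosis)
mean(pred7 == test$diagnosis)
mean(pred8 == test$diagnosis)
```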
Part 6d: Random Forest | Importance Score
In [59]:
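A sketch of extracting variable-importance scores from the largest forest (object name as assumed above):

```r
importance(rf_500)    # mean decrease in Gini impurity per feature
varImpPlot(rf_500)    # dot plot of the importance scores
```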