GitHub Repository: rasbt/machine-learning-book
Path: blob/main/ch06/wdbc.names.txt
1. Title: Wisconsin Diagnostic Breast Cancer (WDBC)

2. Source Information

a) Creators:

    Dr. William H. Wolberg, General Surgery Dept., University of
    Wisconsin, Clinical Sciences Center, Madison, WI 53792
    [email protected]

    W. Nick Street, Computer Sciences Dept., University of
    Wisconsin, 1210 West Dayton St., Madison, WI 53706
    [email protected] 608-262-6619

    Olvi L. Mangasarian, Computer Sciences Dept., University of
    Wisconsin, 1210 West Dayton St., Madison, WI 53706
    [email protected]

b) Donor: Nick Street

c) Date: November 1995

3. Past Usage:

first usage:

    W.N. Street, W.H. Wolberg and O.L. Mangasarian
    Nuclear feature extraction for breast tumor diagnosis.
    IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science
    and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

OR literature:

    O.L. Mangasarian, W.N. Street and W.H. Wolberg.
    Breast cancer diagnosis and prognosis via linear programming.
    Operations Research, 43(4), pages 570-577, July-August 1995.

Medical literature:

    W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
    Machine learning techniques to diagnose breast cancer from
    fine-needle aspirates.
    Cancer Letters 77 (1994) 163-171.

    W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
    Image analysis and machine learning applied to breast cancer
    diagnosis and prognosis.
    Analytical and Quantitative Cytology and Histology, Vol. 17
    No. 2, pages 77-87, April 1995.

    W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
    Computerized breast cancer diagnosis and prognosis from fine
    needle aspirates.
    Archives of Surgery 1995;130:511-516.

    W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
    Computer-derived nuclear features distinguish malignant from
    benign breast cytology.
    Human Pathology, 26:792-796, 1995.

See also:
    http://www.cs.wisc.edu/~olvi/uwmp/mpml.html
    http://www.cs.wisc.edu/~olvi/uwmp/cancer.html

Results:

    - predicting field 2, diagnosis: B = benign, M = malignant
    - sets are linearly separable using all 30 input features
    - best predictive accuracy obtained using one separating plane
      in the 3-D space of Worst Area, Worst Smoothness and
      Mean Texture. Estimated accuracy 97.5% using repeated
      10-fold crossvalidations. Classifier has correctly
      diagnosed 176 consecutive new patients as of November
      1995.
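
The accuracy estimate above is based on repeated 10-fold cross-validation.
As a rough sketch of how such splits could be generated for the 569
instances, the following uses only the Python standard library. The helper
name repeated_kfold_indices, the number of repeats, and the seed are
illustrative assumptions; this file does not describe the exact splitting
procedure used in the cited papers.

```python
import random

def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper for illustration only; n_repeats and seed are
    assumptions, not values taken from the original study.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # draw a fresh random partition for every repeat
        # Deal shuffled indices round-robin so fold sizes differ by at most 1.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, train_idx, test_idx

splits = list(repeated_kfold_indices(569))
fold_sizes = sorted({len(test_idx) for _, _, _, test_idx in splits})
print(len(splits))   # 30 (3 repeats x 10 folds)
print(fold_sizes)    # [56, 57]
```

Each instance lands in exactly one test fold per repeat, so averaging the
per-fold accuracies over all repeats gives one overall estimate.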

4. Relevant information

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
A few of the images can be found at
http://www.cs.wisc.edu/~street/images/

The separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
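
To give a sense of the size of that search space, the sketch below counts
the candidate configurations. It assumes that every subset of 1 to 4 of the
30 input features is paired with every choice of 1 to 3 planes; whether the
original search pruned any of these combinations is not stated here.

```python
from math import comb

N_FEATURES = 30                # input features described in this file
subset_sizes = range(1, 5)     # subsets of 1-4 features
plane_counts = range(1, 4)     # 1-3 separating planes

# Number of distinct feature subsets for each subset size.
subsets_per_size = {k: comb(N_FEATURES, k) for k in subset_sizes}
total_subsets = sum(subsets_per_size.values())

# Assumption: each subset is tried with each plane count.
total_configurations = total_subsets * len(plane_counts)

print(subsets_per_size)        # {1: 30, 2: 435, 3: 4060, 4: 27405}
print(total_subsets)           # 31930
print(total_configurations)    # 95790
```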

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

    ftp ftp.cs.wisc.edu
    cd math-prog/cpo-dataset/machine-learn/WDBC/

5. Number of instances: 569

6. Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)
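
A record with this layout could be parsed as sketched below. The comma
delimiter is an assumption (this file does not specify the data file's
format), the names Record and parse_record are hypothetical, and the sample
line is synthetic rather than a real measurement.

```python
from typing import List, NamedTuple

class Record(NamedTuple):
    id_number: str
    diagnosis: str           # "M" = malignant, "B" = benign
    features: List[float]    # the 30 real-valued input features

def parse_record(line: str) -> Record:
    """Parse one record; the comma delimiter is an assumption."""
    fields = line.strip().split(",")
    if len(fields) != 32:
        raise ValueError(f"expected 32 fields, got {len(fields)}")
    diagnosis = fields[1]
    if diagnosis not in ("M", "B"):
        raise ValueError(f"unexpected diagnosis code: {diagnosis!r}")
    return Record(fields[0], diagnosis, [float(v) for v in fields[2:]])

# Synthetic line for illustration only; these are not real measurements.
sample_line = "0000001,B," + ",".join(f"{0.1 * i:.1f}" for i in range(1, 31))
record = parse_record(sample_line)
print(record.diagnosis, len(record.features))   # B 30
```

Keeping the ID as a string avoids treating it as a numeric feature by
accident.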

7. Attribute information

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)

Ten real-valued features are computed for each cell nucleus:

    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry
    j) fractal dimension ("coastline approximation" - 1)

Several of the papers listed above contain detailed descriptions of
how these features are computed.
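
As a small worked example of the compactness expression in item f), the
sketch below evaluates it for two idealized shapes. It reads the expression
as (perimeter^2 / area) - 1.0 and applies it literally; the stored feature
values may use a different normalization, so only the relative behavior is
meant to be illustrative.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Item f) read literally: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Idealized outlines (hypothetical shapes, not data from the images).
r = 5.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)   # 4*pi - 1
s = 5.0
square = compactness(4 * s, s ** 2)                       # 16 - 1

print(round(circle, 3))   # 11.566
print(square)             # 15.0
```

The expression does not depend on the size of the shape, and a circle gives
the smallest value, so more irregular outlines score higher.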

The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
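
The field numbering and the three summary statistics can be sketched as
follows. Two assumptions are made: within each block of ten fields the
features follow the order a) to j) above (consistent with the examples for
fields 3, 13 and 23), and the standard error is the sample standard
deviation divided by sqrt(n). The helper names and the input values are
hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

FEATURES = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave points", "symmetry",
            "fractal dimension"]
STATS = ["mean", "SE", "worst"]

def field_number(feature: str, stat: str) -> int:
    """1-based field number; fields 1 and 2 are ID and diagnosis."""
    return 3 + STATS.index(stat) * len(FEATURES) + FEATURES.index(feature)

def summarize(values):
    """Return (mean, standard error, worst) for one feature of one image."""
    se = stdev(values) / sqrt(len(values))   # assumed definition of SE
    worst = mean(sorted(values)[-3:])        # mean of the three largest
    return mean(values), se, worst

print(field_number("radius", "mean"))    # 3
print(field_number("radius", "SE"))      # 13
print(field_number("radius", "worst"))   # 23

# Hypothetical per-nucleus radii for a single image.
radius_mean, radius_se, radius_worst = summarize([10.0, 12.0, 14.0, 16.0, 18.0])
print(radius_mean, radius_worst)         # 14.0 16.0
```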

All feature values are recoded with four significant digits.
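
For readers who want to apply the same precision to derived values, one way
to round to four significant digits in Python is the "g" format type. This
is a sketch with made-up inputs, not the procedure used to prepare the
data; the function name is hypothetical.

```python
def to_four_significant_digits(x: float) -> float:
    """Round x to four significant digits via the 'g' format type."""
    return float(f"{x:.4g}")

# Made-up inputs for illustration.
print(to_four_significant_digits(0.118437))   # 0.1184
print(to_four_significant_digits(17.9912))    # 17.99
print(to_four_significant_digits(1001.26))    # 1001.0
```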

8. Missing attribute values: none

9. Class distribution: 357 benign, 212 malignant
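
The class counts imply a majority-class baseline against which any
classifier accuracy can be compared. The short calculation below uses only
the counts stated in this file; the variable names are illustrative.

```python
n_benign, n_malignant = 357, 212
n_total = n_benign + n_malignant      # should match the 569 instances

benign_share = n_benign / n_total
malignant_share = n_malignant / n_total

# Accuracy of always predicting the more frequent class.
majority_baseline = max(benign_share, malignant_share)

print(n_total)                        # 569
print(f"{benign_share:.1%}")          # 62.7%
print(f"{malignant_share:.1%}")       # 37.3%
```

Always predicting "benign" would therefore be right about 62.7% of the
time, well below the 97.5% estimate quoted in the Results section.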