Classifying Cell Samples (SVM)

Support Vector Machine (SVM) is a classification and regression technique that is particularly suitable for wide datasets. A wide dataset is one with a large number of predictors, such as might be encountered in the field of bioinformatics (the application of information technology to biochemical and biological data).

A medical researcher has obtained a dataset containing characteristics of a number of human cell samples extracted from patients who were believed to be at risk of developing cancer. Analysis of the original data showed that many of the characteristics differed significantly between benign and malignant samples. The researcher wants to develop an SVM model that can use the values of these cell characteristics in samples from other patients to give an early indication of whether their samples might be benign or malignant.

This example uses the stream named svm_cancer.str, available in the Demos folder under the streams subfolder. The data file is cell_samples.data. See the topic Demos Folder for more information.

The example is based on a dataset that is publicly available from the UCI Machine Learning Repository . The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:

Field name Description
ID Patient identifier
Clump Clump thickness
UnifSize Uniformity of cell size
UnifShape Uniformity of cell shape
MargAdh Marginal adhesion
SingEpiSize Single epithelial cell size
BareNuc Bare nuclei
BlandChrom Bland chromatin
NormNucl Normal nucleoli
Mit Mitoses
Class Benign or malignant

For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record.

Next