GUEST BLOG BY MICHELLE HUCHETTE:
My name is Michelle Huchette and I am a rising fourth year at the University of Virginia studying Computer Science and Statistics. This summer I was fortunate enough to be a part of the IBM Summit Program as a Technical Sales Intern. In this role I was able to experience what it is like to be an IBM seller by attending customer events and working on various tasks and projects over the course of 11 weeks. A few weeks ago I was challenged with creating a Proof of Technology lab that would interest customers in the field of machine learning. This is a brief overview of the creation and utilization of the model I created to diagnose breast cancer tumors.
The data set used for the lab came from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29). It contains information about breast cancer tumors along with features that help predict each tumor's diagnosis as malignant or benign. The data set records 10 measurements of each cell nucleus, captured from digitized images of cell nuclei gathered in a fine needle aspiration (FNA) procedure. The mean, standard error, and extreme ("worst") values across all nuclei in the tumor sample were computed for each of the following features:
- Radius (mean distance from the center to points on the perimeter)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness (local variation in radius lengths)
- Compactness (perimeter²/area - 1.0)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry
- Fractal dimension ("coastline approximation" - 1)
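Concretely, the file encodes three statistics (mean, standard error, and "worst") for each of the ten base measurements, giving 30 feature columns in total. The column names below are our own convention for illustration; the UCI file ships without a header row:

```python
# Sketch: generate the 30 feature-column names used throughout the lab.
# The naming scheme (e.g. "radius_mean") is an assumption for this post,
# not something the raw UCI file provides.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
statistics = ["mean", "se", "worst"]

feature_columns = [f"{feat}_{stat}" for feat in base_features for stat in statistics]
print(len(feature_columns))  # 10 measurements x 3 statistics = 30 columns
```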
With a data set in hand, a machine learning model could be created to diagnose breast cancer tumors. To do so, we set up a Watson Studio account on the IBM Cloud platform (https://console.bluemix.net/registration/). Within Watson Studio (https://console.bluemix.net/catalog/services/watson-studio) we created a Jupyter Notebook and wrote Python code to work with the data set, create a model, and make predictions, all using Apache Spark as the analytics engine.
The first step in creating the machine learning model was determining which type of model would best fit the data. Spark supports several common model types, including Naïve Bayes, decision trees, random forests, and regression models. Because Naïve Bayes requires a strong independence assumption between the features, it was ruled out. Ultimately, a logistic regression model was chosen: it is the standard choice for binary categorical outcomes (exactly what we're dealing with when diagnosing a tumor as malignant or benign), and it measures the relationship between the label and the features well.
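To make the binary-outcome point concrete, here is a minimal pure-Python sketch of what a logistic regression model computes at prediction time. The weights, bias, and feature values are made up purely for illustration; the real model learns them from the training data:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: squashes any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights and one sample's features (illustration only)
weights = [0.8, -0.3, 1.2]
features = [1.5, 2.0, 0.7]
bias = -0.5

# Linear score, then sigmoid -> probability of the "malignant" class
score = sum(w * x for w, x in zip(weights, features)) + bias
p_malignant = sigmoid(score)

# Classify with the usual 0.5 threshold
prediction = "malignant" if p_malignant >= 0.5 else "benign"
```

This is why logistic regression fits the problem so naturally: the output is a probability of one of exactly two classes, thresholded into a binary decision.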
To start out, the logistic regression model was given its default parameters so that an initial model could be created and improved upon if needed. Once the model was defined, a pipeline was set up containing a sequence of stages to be run in a specific order. Within a pipeline, each stage is either a transformer, which converts one DataFrame into another, or an estimator, whose fit() method trains a model. Many different stages can be included in a pipeline: tokenizers, hashing transformers, normalizers, and so on. For this data set, and for the sake of an easy-to-follow lab, our pipeline started with a StringIndexer to turn the label (diagnosis) into a form SparkML can use, encoding the input column as a column of indices ordered by frequency. A VectorAssembler then combined the feature columns into a single vector column to be used in training the model. A Normalizer was added to scale each feature vector into a standard form, which helps the algorithm. Lastly, our logistic regression instance was added, and an IndexToString stage converted the model's predictions back into human-readable form.
With the pipeline defined, the model could be trained and tested. The data set was split with the standard 70/30 ratio into training and test sets, respectively. The training set was used to fit the pipeline and train the model, and the model's performance was measured using the area under the Receiver Operating Characteristic (ROC) curve for binary classifiers. The ROC curve plots the true positive rate (recall, or probability of detection) against the false positive rate (fall-out, or probability of a false alarm) at various thresholds. An area under the curve close to 1 suggests the model performs very well, whereas a value close to 0.5 is about as good as flipping a coin. Once the model was trained and evaluated, the test set was used to make predictions. The logistic regression model created in the steps above achieved an area under the ROC curve of 0.989, meaning it was able to predict tumor diagnoses very well.
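To make the metric concrete (Spark's BinaryClassificationEvaluator reports it as areaUnderROC), the area under the ROC curve equals the probability that a randomly chosen positive example out-scores a randomly chosen negative one. A minimal pure-Python sketch, with made-up labels and scores:

```python
def area_under_roc(labels, scores):
    """AUC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count half).
    This is equivalent to integrating the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: higher score = more likely malignant
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

auc = area_under_roc(labels, scores)  # 8/9 for these made-up scores
```

A perfect ranking gives 1.0; a classifier that scores every example the same gives 0.5, the coin-flip baseline mentioned above.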
Even though the model was already shown to diagnose tumors accurately, it could still be improved. Hyperparameter tuning uses model selection tools that try different parameter values for the pipeline and find the best combination. Spark offers two main model selection tools: CrossValidator and TrainValidationSplit. For this project we used a CrossValidator: although it can be more expensive on large data sets, it is more reliable when the data set is not especially large, because it evaluates each parameter combination k times rather than just once.
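The k-fold idea behind a CrossValidator can be sketched in a few lines of plain Python (k_fold_indices is a hypothetical helper for illustration, not part of Spark):

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n-1 into k folds; each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    pairs = []
    for test_fold in folds:
        test = set(test_fold)
        train = [i for i in range(n_samples) if i not in test]
        pairs.append((train, test_fold))
    return pairs

# 10 folds -> 10 train/test pairs; every sample lands in exactly one test fold
pairs = k_fold_indices(100, 10)
```

Each candidate parameter combination is evaluated on all k pairs and the k scores are averaged, which is what makes cross-validation more reliable than a single train/validation split.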
A CrossValidator first splits the data set into "folds," which are used as separate training and test set pairs. We set the number of folds to 10, so the CrossValidator generated 10 training/test pairs, all of which are used to evaluate each parameter combination. The performance across the 10 runs is averaged and compared against the other parameter values tested.

We then defined a paramGrid stating the values to try for the pipeline's parameters. For this pipeline we could set values for maxIter, elasticNetParam, and regParam, which belong to the logistic regression model, or for the Normalizer's parameter. Our paramGrid for the lab included values for elasticNetParam, which must be between 0 and 1. This parameter is important because it moves the model toward a lasso regression (coefficients that are not relevant are set to 0) as it approaches 1, or toward a ridge regression (the impact of irrelevant coefficients is minimized without setting them to 0) as it approaches 0. The values tested for elasticNetParam were therefore 0, 0.5, and 1, to see which type of regression would suit the data best. The second parameter in the paramGrid was the Normalizer's, whose value sets the p-norm used for normalization. The default value of 2 had been used earlier, so the grid tested values of 1 and 3.
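Both grid parameters can be illustrated directly in plain Python. Spark ML's logistic regression regularizes with regParam * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2), where alpha is elasticNetParam, and the Normalizer scales each vector by its p-norm. The coefficient vector below is made up:

```python
def p_norm(v, p):
    """p-norm used by Spark's Normalizer: (sum |v_i|^p)^(1/p)."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def elastic_net_penalty(weights, reg_param, alpha):
    """Spark ML's regularization term for logistic regression:
    regParam * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2),
    where alpha is elasticNetParam."""
    l1 = sum(abs(w) for w in weights)
    l2_sq = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_sq)

w = [0.5, -1.0, 2.0]  # made-up coefficients for illustration

# alpha = 1 -> pure lasso (L1); alpha = 0 -> pure ridge (L2)
lasso = elastic_net_penalty(w, reg_param=0.1, alpha=1.0)  # 0.1 * 3.5
ridge = elastic_net_penalty(w, reg_param=0.1, alpha=0.0)  # 0.1 * 5.25 / 2

# Dividing a vector by its p-norm yields a unit vector under that norm,
# which is what the Normalizer stage does to each feature vector
unit = [x / p_norm(w, 2) for x in w]
```

Intermediate alpha values blend the two penalties, which is exactly what letting the grid search try 0, 0.5, and 1 explores.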
After the CrossValidator found the best parameter map, the resulting model was refit on the training set and evaluated on the test set. Hyperparameter tuning improved the area under the ROC curve from 0.989 to 0.995.
With an almost perfect predictive model defined, the last step was to grab the undiagnosed tumors from the original data set and use the model to predict their diagnoses with a high level of confidence.
The creation of this highly accurate model shows the power of machine learning to better lives worldwide. It can augment breast cancer diagnosis and help ensure that doctors see the patients in most dire need of medical attention. Models like this can help detect cancer earlier, and in more individuals, than doctors can alone. Machine learning is already being applied in oncology to diagnose tumors, in pathology to analyze bodily fluids, and in detecting rare genetic diseases using facial recognition and deep learning. Machine learning serves many purposes, from chatbots to augmenting the medical diagnostic process, and with continued advances in technology and AI its applications are sure to expand even further.