Addressing South Africa’s Cancer Reporting Delay with Machine Learning

Could you make a critical health policy decision using four-year-old data?

Cancer registries hold vital, tightly encrypted data sets containing demographic information, medical history, diagnostics and therapy. Oncologists and health officials use this data to understand diagnosed cancer cases and incidence rates nationally. The ultimate goal is to inform public health planning and intervention programs. While real-time updates are not practical, multi-year delays make it hard for officials to understand the impact of cancer in the country and allocate resources accordingly.

Cancer reporting lag

Unstructured pathology reports contain tumor-specific data and are the main source of information collected by cancer registries. Human experts label the pathology reports using International Classification of Diseases for Oncology (ICD-O) codes spanning 42 different cancer types. The combination of manual processes and the volume of reports received annually leads to a four-year reporting lag for South Africa. In comparison, the delay in the United States is nearly two years.

Promising results

In 2016, when we inaugurated our new IBM Research lab in Johannesburg, we took on this challenge and are reporting our first promising results at Health Day at the KDD Data Science Conference in London this month.

Waheeda Saib

Our goal from the beginning was to apply deep learning to automate cancer pathology report labeling and so speed up the reporting process. Working with the National Cancer Registry in South Africa, we used 2,201 de-identified, free-text pathology reports, and I am proud to report that our paper demonstrates 74 percent accuracy, an improvement over current benchmark models. We believe we can get to 95 percent accuracy with more data.

We employed hierarchical classification with convolutional neural networks, although this was not our first choice. We started by exploring multiclass and binary convolutional neural network models, but the results were not promising and I nearly quit in frustration. Eventually, with the advice and support of my colleagues, we cleaned up the text, refined the feature engineering process and improved accuracy to 60 percent. This result was an improvement, but we knew we needed 90-95 percent to make the model trustworthy enough for the real world.

After more research and exploration, we thought about reducing the complexity of the multiclass problem, which led us to create a state-of-the-art hierarchical deep learning classification method based on the hierarchical structure of the ICD-O coding system. We used a combined approach to identify the class hierarchy and validated it with expert knowledge. The result performs better than a flat multiclass model at classifying free-text pathology reports.
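As a rough illustration of the hierarchical idea, the sketch below routes a report through a top-level classifier and then through a classifier trained only on that branch. It is not the model from our paper: a tiny naive Bayes text classifier stands in for the convolutional neural networks, and the class names, the four toy reports, the two-level label map and the code-shaped labels are all invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Tiny multinomial naive Bayes over whitespace tokens.

    Assumption: this is only a lightweight stand-in for the CNNs.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            # Add-one smoothing so unseen words do not zero out a class.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class HierarchicalClassifier:
    """Predict a parent class first, then a leaf class within that parent."""

    def fit(self, texts, leaf_labels, parent_of):
        parents = [parent_of[leaf] for leaf in leaf_labels]
        self.root = NaiveBayesText().fit(texts, parents)
        self.children = {}
        for parent in set(parents):
            idx = [i for i, p in enumerate(parents) if p == parent]
            self.children[parent] = NaiveBayesText().fit(
                [texts[i] for i in idx], [leaf_labels[i] for i in idx]
            )
        return self

    def predict(self, text):
        parent = self.root.predict(text)
        return parent, self.children[parent].predict(text)


# Invented toy reports; labels are only shaped like topography codes.
texts = [
    "breast mass upper outer quadrant invasive ductal carcinoma",
    "breast lesion upper inner quadrant ductal carcinoma",
    "lung biopsy upper lobe adenocarcinoma",
    "lung resection lower lobe squamous carcinoma",
]
leaf_labels = ["C50.4", "C50.2", "C34.1", "C34.3"]
parent_of = {"C50.4": "C50", "C50.2": "C50", "C34.1": "C34", "C34.3": "C34"}

model = HierarchicalClassifier().fit(texts, leaf_labels, parent_of)
print(model.predict("breast mass upper outer quadrant"))
```

Each child classifier only has to separate the few classes under its parent, which is the sense in which a hierarchy reduces the complexity of a flat multiclass problem. The trade-off is that a wrong decision at the top level cannot be corrected further down.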

Our work is of course not done yet. We need to reach above 95 percent accuracy, and we think this is possible with more data, which our partners at the National Cancer Registry will provide. Once we get there, we think South Africa can be the best in the world in cancer reporting. That matters all the more because it has been reported that my country will see a 78 percent increase in cancer cases by 2030.

Hierarchical Deep Learning Ensemble to Automate the Classification of Breast Cancer Pathology Reports by ICD-O Topography, Waheeda Saib, David Moinina Sengeh, Gciniwe Dlamini and Elvira Singh