
Addressing South Africa’s Cancer Reporting Delay with Machine Learning


Could you make a critical health policy decision using four-year-old data?

Cancer registries hold vital data sets, kept tightly encrypted, containing demographic information, medical history, diagnostics and therapy. Oncologists and health officials use the data to understand diagnosed cancer cases and incidence rates nationally. The ultimate goal is to use this data to inform public health planning and intervention programs. While real-time updates are not practical, multi-year delays make it challenging for officials to understand the impact of cancer in the country and allocate resources accordingly.

Cancer reporting lag

Unstructured pathology reports contain tumor-specific data and are the main source of information collected by cancer registries. Human experts label the pathology reports using International Classification of Diseases for Oncology (ICD-O) codes spanning 42 different cancer types. The combination of manual processes and the volume of reports received annually leads to a four-year reporting lag in South Africa. In comparison, the delay in the United States is nearly two years.

Promising results

In 2016, when we inaugurated our new IBM Research lab in Johannesburg, we took on this challenge and are reporting our first promising results at Health Day at the KDD Data Science Conference in London this month.


Waheeda Saib

Our goal from the beginning was to apply deep learning to automate cancer pathology report labeling and speed up the reporting process. Working with the National Cancer Registry in South Africa, we used 2,201 de-identified, free-text pathology reports. I am proud to report that our paper demonstrates 74 percent accuracy, an improvement over current benchmark models. We believe we can reach 95 percent accuracy with more data.

We employed hierarchical classification with convolutional neural networks, although this was not our first choice. We initially explored multiclass and binary convolutional neural network models, but the results were not promising and I nearly quit in frustration. Eventually, with the advice and support of my colleagues, we cleaned up the text, refined the feature engineering process and improved accuracy to 60 percent. This result was an improvement, but we knew we needed 90-95 percent to make the model trustworthy enough for the real world.

After more research and exploration, we thought about reducing the complexity of the multiclass problem, which led us to create a state-of-the-art hierarchical deep learning classification method based on the hierarchical structure of the ICD-O coding system. We used a combined approach to identify the class hierarchy and validated it with expert knowledge, achieving better performance than a flat multiclass model in classifying free-text pathology reports.
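
To make the two-level idea concrete, here is a minimal Python sketch of hierarchical classification: one classifier picks a parent group, then a second classifier trained only on that group picks the final label. It is an illustration under stated assumptions, not the model from our paper. The labels, the example reports and the `KeywordClassifier` (a toy word-count scorer standing in for a convolutional neural network) are all hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class KeywordClassifier:
    """Toy word-count scorer; a hypothetical stand-in for a CNN text classifier."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        return self

    def predict(self, text):
        tokens = tokenize(text)

        def score(label):
            words = self.counts[label]
            # Fraction of the label's training words matched by the query.
            return sum(words[t] for t in tokens) / sum(words.values())

        # Sorting the labels makes tie-breaking deterministic.
        return max(sorted(self.counts), key=score)


class HierarchicalClassifier:
    """Predict a parent group first, then a leaf label within that group."""

    def __init__(self, parent_of):
        self.parent_of = parent_of  # maps leaf label -> parent group

    def fit(self, texts, leaf_labels):
        parents = [self.parent_of[label] for label in leaf_labels]
        self.root = KeywordClassifier().fit(texts, parents)
        self.children = {}
        for group in set(parents):
            rows = [i for i, p in enumerate(parents) if p == group]
            self.children[group] = KeywordClassifier().fit(
                [texts[i] for i in rows], [leaf_labels[i] for i in rows]
            )
        return self

    def predict(self, text):
        group = self.root.predict(text)
        return group, self.children[group].predict(text)


# Hypothetical two-level hierarchy and made-up training reports.
PARENT_OF = {
    "upper_outer": "upper",
    "upper_inner": "upper",
    "lower_outer": "lower",
    "lower_inner": "lower",
}
REPORTS = [
    ("mass in upper outer quadrant of left breast", "upper_outer"),
    ("lesion in upper inner quadrant of right breast", "upper_inner"),
    ("tumour in lower outer quadrant of left breast", "lower_outer"),
    ("nodule in lower inner quadrant of right breast", "lower_inner"),
]

texts, labels = zip(*REPORTS)
model = HierarchicalClassifier(PARENT_OF).fit(list(texts), list(labels))
print(model.predict("carcinoma upper outer quadrant"))  # ('upper', 'upper_outer')
```

Each child classifier only has to separate the few labels inside its own group, which is the sense in which a hierarchy reduces the complexity of a flat multiclass problem. The trade-off is that a mistake at the parent level cannot be recovered at the leaf level.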

Our work is of course not done yet; we need to reach above 95 percent accuracy, and we think this is possible with more data, which will be provided by our partners at the National Cancer Registry. Once we get this, we think South Africa can be the best in the world in terms of cancer reporting, which is significant particularly because it’s been reported that my country will see a 78 percent increase in cancer by 2030.

Hierarchical Deep Learning Ensemble to Automate the Classification of Breast Cancer Pathology Reports by ICD-O Topography, Waheeda Saib, David Moinina Sengeh, Gciniwe Dlamini and Elvira Singh
