One of the best ways to fight cancer is early detection, when the disease is still confined and can be fully excised surgically or treated pharmacologically. Cancer screening programs, that is, the practice of testing for the presence of cancer in people who have no symptoms, have been medicine's tool of choice for the earliest detection.
The first screening test to be widely used for cancer was the Pap test for detecting cervical cancer. Since its introduction as a widely used test in the 1960s, the cervical cancer death rate in the United States has declined by about 70%1. Similarly, breast cancer screening came into wide use in the 1970s and has been shown to decrease mortality in multiple randomized controlled trials1. Screening for breast cancer is done using mammography exams, in which radiologists scrutinize x-ray images of the breast for the possible presence of cancer. Mammography screening, on average, finds only 7 out of 8 asymptomatic breast cancers2, although this sensitivity has been improving in recent years. At the other end of the spectrum are the false positives. Out of 1,000 women screened, about 100 are recalled for additional diagnostic imaging, and of these 100 women, only 4 or 5 are ultimately diagnosed with breast cancer2 (see Figure 1). These false-positive exams lead to preventable harms, including patient anxiety, benign biopsies, and unnecessary intervention or treatment. Furthermore, high false-positive rates contribute significantly to the $7.8 billion spent annually on mammography screening in the U.S.3
Figure 1. Out of 1,000 women going for a screening mammogram, 100 will be recalled for further studies, but only about 5 of them will actually have cancer. This corresponds to a false-positive rate of 9.5%.
In this day and age, in which we see the remarkable success of applying deep learning to speech recognition, visual object detection and recognition, and natural language processing, to name a few, it is natural to ask whether deep learning can help improve radiologists' interpretation of mammograms. At IBM Research, we are working hard to apply deep learning to create technology that could support radiologists as they work to detect breast cancer at screening4 – an effort that many others across the AI and health communities are also focused on5.
We believe AI could help improve the scalability of breast cancer screening and ameliorate the shortage of mammography professionals around the world. In a paper published on March 2nd, 2020 in JAMA Network Open6, we showed that progress is being made in this direction.
In this paper, we report the results of an international challenge to assess whether deep learning algorithms could reach radiologists' performance in mammography interpretation. The competition was one of the DREAM Challenges7, a series of biomedical competitions that one of us (GS) founded in 2006. In this particular challenge, which IBM co-organized in close collaboration with Sage Bionetworks, Kaiser Permanente Washington and others, with funding from the Arnold Foundation, we wanted to leverage crowdsourcing as an innovation engine to develop algorithms that help determine the likelihood of a breast cancer diagnosis within one year of the screening exam. The competition ran for most of 2017, with 126 teams from 44 countries participating.
Model-to-data practices protect sensitive datasets in crowdsourced competitions
The success of deep learning algorithms requires large datasets for training, and statistical power requires large datasets for testing. However, for diverse and legitimate reasons, data sharing continues to be the exception rather than the norm in biomedical research8, with many datasets siloed inside hospitals, companies, and universities. The mission of the DREAM Challenges is to promote collaboration to solve important biomedical problems in a way that is respectful of privacy and security concerns. In the Digital Mammography DREAM Challenge, Kaiser Permanente Washington (KPW) generously provided a large dataset of de-identified digital mammograms that could be used by the community, provided it was never downloaded by or directly accessible to challenge participants. To this end we created what we call a model-to-data system9, in which participants submitted their models as Docker containers to the IBM Cloud10, and a system developed by the study organizers automatically ran the models on the data. This approach avoids distributing the data to participants, protects patient privacy, and mitigates the risk of sensitive patient data being released.
Challenge organizers then ran the models, either for training or for inference on a test dataset, securely behind a firewall. A second dataset was provided by the Karolinska Institute (KI) in Sweden; it included 165,000 examinations from 68,000 Swedish women, with 780 cancer-positive cases. The KI dataset was used only for testing algorithms already trained on the KPW training dataset. All the top-performing methods used deep learning models.
Objective benchmarking of algorithms
One of the most important features of the DREAM Challenges is that they uncouple algorithm development from algorithm evaluation, thus preventing what we have called the self-assessment trap11, in which algorithm developers judge their own algorithms, becoming judge, jury, and executioner. In this challenge the evaluation was done by the organizers, using test datasets that were hidden from the algorithm developers. Participants were allowed to use a special "leaderboard dataset," common to all participants, to monitor the progress of their algorithm development over the course of the challenge, but the number of times they could score against the leaderboard dataset was limited to nine to avoid overfitting. This allowed for an objective benchmarking and comparison of algorithms8. This facet of the work makes the results more robust and rigorous, avoiding the traps of overfitting and lack of generalization. This is clearly shown in Figure 2: the area under the receiver operating characteristic curve (AUC) of algorithms trained and evaluated on the KPW dataset generalized very well to the KI dataset. Indeed, the performances of the same algorithms on the two datasets were highly correlated (correlation coefficient of 0.98), even though the datasets were completely independent, acquired at different facilities in different countries.
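The AUC used throughout the evaluation has a simple probabilistic reading: it is the probability that a randomly chosen cancer-positive exam receives a higher score than a randomly chosen negative one. A minimal sketch of that computation (illustrative only, not the challenge's scoring code):

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney identity:
    the fraction of positive/negative pairs ranked correctly,
    with ties counting as half a correct ranking."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one positive exam is out-ranked by a negative one,
# so 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # 0.75
```

A perfect ranking gives an AUC of 1.0 and random scoring gives 0.5, which is why the 0.858 reported below counts as acceptable but not spectacular.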
Collaborating after Competing and Augmented AI
Figure 3. The Challenge Ensemble Method (CEM) was created by integrating the algorithms of the eight best-performing teams, refined during the community phase. The CEM+R method resulted from integrating the predictions of the eight teams with the radiologists' assessment.
At the end of the competitive phase of the DREAM Challenge, we found the sobering result that no single algorithm had a lower false-positive rate than the radiologists on the same dataset. For example, the best algorithm on the KPW dataset had a specificity of 66.3% (equivalent to a false-positive rate of 33.7%), whereas for the same cohort the radiologists had a higher specificity of 90.5% (a false-positive rate of 9.5%). The area under the receiver operating characteristic curve (AUC) of the best-performing team on the KPW dataset was acceptable (0.858), but not spectacular. Similar results were found for the KI dataset (see Tables 1 and 2). But while that was the end of the road for the competition, it was the starting point for the ensuing collaboration. As is customary in the DREAM Challenges, at the end of the competitive phase we invited the best-performing teams to start what we call the community phase, in which competitors turn into collaborators and work together to refine their algorithms, learning from each other's successes and failures during the competitive phase. The results of this community phase were very encouraging. During this phase the algorithms from each of the eight participating teams were integrated into a new algorithm that we called the Challenge Ensemble Method, or "CEM" for short. The CEM was further integrated with the radiologists' assessment into another ensemble model called "CEM+R" (see Figure 3). The CEM and CEM+R ensemble models were trained using the training data from the competitive phase, with performance assessed on the KPW and KI final evaluation datasets.
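The structure of the two ensembles can be sketched in a few lines. Note this is a simplified stand-in: the published CEM learned how to combine the eight teams' outputs from training data, whereas the sketch uses a plain average, and the 50/50 blend with the radiologist's recall decision is an illustrative choice, not the weighting from the paper:

```python
def cem(team_scores):
    """Pool the teams' per-exam confidences. The published CEM was a
    trained combination; a plain mean is the simplest stand-in."""
    return sum(team_scores) / len(team_scores)

def cem_plus_r(team_scores, radiologist_recall, weight=0.5):
    """Fold in the radiologist's binary recall decision (0 or 1).
    The equal weighting here is hypothetical, for illustration only."""
    return weight * cem(team_scores) + (1 - weight) * radiologist_recall
```

The point of the sketch is only the structure the study used: team outputs are pooled first, and the radiologist's signal is then layered on top of the pooled score.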
Tables 1 and 2 show the specificity and AUC results for both the KPW and KI datasets. Because in the KI dataset each mammogram undergoes double reading by two radiologists, we used the first KI reader's interpretation for a direct comparison with the single-radiologist interpretation of the KPW dataset. The CEM had better specificity and AUC than the best-performing team, but its false-positive rate was still much higher than the radiologists' at the same sensitivity, for both the KPW and KI datasets. We reasoned that although the CEM still performed below the radiologists, the AI methods might have strengths complementary to the radiologists', and therefore integrating the algorithmic results with the radiologists' assessment might yield better results than the radiologists' assessment alone. This was indeed the case. The false-positive rate of the CEM+R algorithm was 8% for the KPW dataset (1.5 percentage points below the radiologists' 9.5%) and 1.5% for the KI dataset (1.8 percentage points below the radiologists' 3.3%). This statistically significant reduction of 1.5 percentage points in the false-positive rate in the KPW dataset (and more in the KI dataset) may not seem like much. However, considering that roughly 40 million women are screened for breast cancer each year in the U.S. alone, if a 1.5% reduction in false positives could be replicated across all populations and testing methodologies, about 600,000 women per year could be spared the unnecessary diagnostic work-up, anxiety, and of course cost associated with a recall for further examination. Confirmation of these estimates will require additional validation and testing in clinical settings.
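The 600,000 figure is just the back-of-the-envelope product of the two numbers quoted above:

```python
screened_per_year = 40_000_000   # approximate U.S. screening mammograms per year
fp_rate_reduction = 0.015        # 1.5 percentage-point drop in false positives
spared = round(screened_per_year * fp_rate_reduction)
print(spared)  # 600000 women per year spared a false-positive recall
```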
We recognize that the proposition to combine radiologist interpretation and AI algorithms is at this point theoretical, and there is much work ahead to study how a human interpreter interacts with AI algorithm results and how AI might influence radiologists' final assessments. What is clear is that the rapid development of AI in general, and deep learning in particular, and its implementation into routine clinical imaging could transform the practice of radiology. The results of the Digital Mammography DREAM Challenge indicate that there is promise for the future of breast radiologists' interpretation to be augmented with AI technology.
Table 1. The specificity of the best-performing algorithm in the competitive phase, the Challenge Ensemble Method (CEM), and the CEM integrated with the radiologists' assessment (CEM+R), all calculated at the radiologists' sensitivity, for each of the Kaiser Permanente Washington (KPW) and Karolinska Institute (KI) datasets.
Table 2. The area under the receiver operating characteristic curve (AUC) for the best-performing algorithm in the competitive phase, the Challenge Ensemble Method (CEM), and the CEM integrated with the radiologists' assessment (CEM+R), for the Kaiser Permanente Washington (KPW) and Karolinska Institute (KI) datasets.
1. Nelson HD, Tyne K, Naik A, Bougatsos C, Chan BK, Humphrey L; U.S. Preventive Services Task Force. Screening for breast cancer: an update for the U.S. Preventive Services Task Force. Ann Intern Med. 2009;151(10):727-737, W237-42.
2. Lehman CD, Arao RF, Sprague BL, et al. National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium. Radiology. 2017;283(1):49-58. doi:10.1148/radiol.2016161174
3. Haas JS. The complexity of achieving the promise of precision breast cancer screening. J Natl Cancer Inst. 2017;109(5). doi:10.1093/jnci/djw301
4. Akselrod-Ballin A, Chorev M, Shoshan Y, Spiro A, Hazan A, Melamed R, Barkan E, Herzel E, Naor S, Karavani E, Koren G, Goldschmidt Y, Shalev V, Rosen-Zvi M, Guindy M., Predicting Breast Cancer by Applying Deep Learning to Linked Health Records and Mammograms. Radiology. 2019 Aug;292(2):331-342. doi: 10.1148/radiol.2019182622. Epub 2019 Jun 18.
5. https://medium.com/syncedreview/three-papers-in-the-eye-of-the-ai-breast-cancer-detection-storm-a63d2a2480ea; also: Kim, H.E., Kim, H.H., Han, B.K., Kim, K.H., Han, K., Nam, H., Lee, E.H. and Kim, E.K., 2020. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multi-reader study, The Lancet Digital Health, February 6, 2020, https://doi.org/10.1016/S2589-7500(20)30003-0
6. Thomas Schaffter, Diana S. M. Buist, Christoph I. Lee, et al., Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms, JAMA Network Open. 2020;3(3):e200265. doi:10.1001/jamanetworkopen.2020.0265
8. Marx, V. Bench pressing with genomics benchmarkers. Nat Methods (2020). https://doi.org/10.1038/s41592-020-0768-1
9. Lisa M. Federer, Ya-Ling Lu, Douglas J. Joubert, Judith Welsh, and Barbara Brandys, Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff, PLoS One. 10(6): e0129506 (2015).
10. Ellrott K, et al. Reproducible biomedical benchmarking in the cloud: lessons from crowd-sourced data challenges. Genome Biology. 2019;20:195.
11. Raquel Norel, John Jeremy Rice, and Gustavo Stolovitzky, The self-assessment trap: can we all be better than average?, Mol Syst Biol. 2011; 7: 537. doi: 10.1038/msb.2011.70
About DREAM Challenges
Challenge leadership was provided by IBM Research through the DREAM Challenges and by Sage Bionetworks. The DREAM (Dialogue for Reverse Engineering Assessment and Methods) Challenges, founded in 2006 by IBM researcher Gustavo Stolovitzky, are a community-driven, open-science crowdsourcing effort that has hosted about 60 challenges so far, addressing questions that range from predictive models for disease progression to models of cell signaling networks. DREAM Challenges are organized by groups of scientists from academia, industry, government, and non-profit organizations working together to implement each challenge. In 2013 DREAM partnered with Sage Bionetworks, a non-profit research organization that seeks to accelerate health research through the creation of platforms that empower patients and researchers to share and interpret data on a massive scale. One of the technologies developed by Sage Bionetworks is Synapse, a collaborative platform where all DREAM Challenges take place. The partnership between the DREAM Challenges and Sage Bionetworks has allowed this open science community effort to take on tougher and more complex challenges like the Digital Mammography DREAM Challenge.