IBM AI algorithms can read chest X-rays at resident radiologist levels

The growth in imaging orders and the availability of high-resolution scanners have led to large workloads for radiologists [1,2]. Coupled with lower-resolution imaging, cognitive biases, and system factors, these workloads are leading to overcalls or missed findings by radiologists [3].

A case in point is the chest X-ray, the most common imaging modality read by radiologists in hospitals and tele-radiology practices today. Despite their widespread use, chest X-rays are a lower-resolution modality and are not easy to interpret. To read them correctly, it is important to understand the viewing limitations due to patient position, image quality, and tissue overlays [4].

Figure 1 illustrates the difficulty of interpreting chest X-rays across the variety of conditions they are used to evaluate. A typical interpretation process starts with checking for technical problems during capture to confirm that they do not hinder interpretation. Next, the viewpoint and patient position need to be assessed, such as antero-posterior (AP), postero-anterior (PA), or lordotic. An AP view is typically taken of patients in hospitals, who may have tubes and lines inserted; in that case, radiologists need to check whether the tubes and lines are properly positioned. Similarly, any devices such as cardiac pacemakers or wires may have to be noted. Finally, radiologists systematically look for anomalies in the various anatomical regions and thereby infer the disease.

Figure 1. Illustration of the variety of findings radiologists look for in preparing a report.

Chest X-ray interpretation by AI Algorithms

Recently, a number of AI algorithms have emerged to offer assistance by performing preliminary interpretations on a limited number of findings. After several institutions (NIH [5], MIMIC [6], Stanford [7]) released large public datasets of chest X-rays capturing about 14 labels, several articles have reported machine learning models achieving radiologist-level performance for different chest radiograph findings [5,8–10]. A few FDA-approved AI algorithms also now exist for a limited number of findings [11]. Recent work is beginning to target not only the detection of core findings, but also finer granularities such as the severity of the findings [12,13].

However, the chest X-ray read problem is far from solved. Unless an algorithm covers enough findings to support a full-fledged preliminary read, its coverage is insufficient.

Several issues add to this. First, there has been no systematic effort in the radiology community itself to catalog the spectrum of possible findings seen in chest radiographs. Second, many of the methods have been trained on datasets from a single hospital, so their generalizability to different demographics across the world is questionable. Third, the evaluation methodology used for assessing these algorithms relies on popular AI metrics such as area under the curve (AUC), or F-scores for label-based precision and recall evaluations.

For actual use in a clinical workflow, however, what matters more in accepting such preliminary reports is minimizing the number of overcalls and misses on a per-image basis. This means we have perhaps been evaluating these methods on the wrong metrics; for clinical purposes they should perhaps be evaluated on image-based sensitivity, positive predictive value (PPV), and specificity rather than label-based measures. Finally, in order to claim that the problem is solved, and that AI algorithms can read at least as well as radiology residents, objective comparison methods are needed. Existing studies report model performance for a limited number of labels against a ground truth obtained by a loose consensus process, such as a majority vote among a few board-certified radiologists [14–17].
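
As a rough illustration of what an image-based evaluation might look like, the sketch below scores each image over the full label set and then averages across images. This is one plausible reading of "image-based" metrics, assumed here for illustration; the function names, labels, and reads are hypothetical and do not come from the study.

```python
from __future__ import annotations

from statistics import mean
from typing import Optional


def image_metrics(truth: set, predicted: set, all_labels: set) -> dict:
    """Sensitivity, PPV and specificity for one image over the full label set."""
    tp = len(truth & predicted)               # findings correctly reported
    fp = len(predicted - truth)               # overcalls
    fn = len(truth - predicted)               # misses
    tn = len(all_labels - truth - predicted)  # findings correctly left out

    def ratio(num: int, den: int) -> Optional[float]:
        return num / den if den else None     # undefined when the denominator is 0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
    }


def mean_image_metrics(cases: list, all_labels: set) -> dict:
    """Average each metric over the images where it is defined."""
    per_image = [image_metrics(t, p, all_labels) for t, p in cases]
    result = {}
    for name in ("sensitivity", "ppv", "specificity"):
        values = [m[name] for m in per_image if m[name] is not None]
        result[name] = mean(values) if values else None
    return result


# Toy data with hypothetical reads, not taken from the study.
LABELS = {"cardiomegaly", "pulmonary edema", "pneumothorax", "enlarged hilum"}
CASES = [
    ({"cardiomegaly", "pulmonary edema"}, {"cardiomegaly"}),   # one miss
    ({"pneumothorax"}, {"pneumothorax", "cardiomegaly"}),      # one overcall
]
summary = mean_image_metrics(CASES, LABELS)
```

Scoring per image means a single overcall or miss lowers that image's score directly, instead of being diluted across thousands of images in a per-label tally.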

Our team of researchers at the IBM Research-Almaden lab in California has been pursuing an ambitious challenge: building machines that can perform a preliminary read of chest X-rays provably at the level of at least entry-level radiologists. In a recent publication in JAMA Network Open [18], we reported the results of a diagnostic study comparing five third-year radiology residents with our AI algorithm, using a study data set of 1,998 AP frontal chest radiographs. The ground truth for these images was assembled through a triple consensus with adjudication process, covering a comprehensive set of 72 chest radiograph findings seen in AP chest X-rays.

We found no statistically significant difference in image-based sensitivity between the AI algorithm and the radiology residents, but the image-based specificity and positive predictive value were significantly higher for the AI algorithm. Specifically, using the analysis of variance (ANOVA) test to compare the distributions, our results were:

  • The mean image-based sensitivity was 0.716 (95% CI, 0.704-0.729) for the AI algorithm and 0.720 (95% CI, 0.709-0.732) for the radiology residents (P = .66).
  • The positive predictive value (PPV) was 0.730 (95% CI, 0.718-0.742) for the AI algorithm and 0.682 (95% CI, 0.670-0.694) for the radiology residents (P < .001).
  • Specificity was 0.980 (95% CI, 0.980-0.981) for the AI algorithm and 0.973 (95% CI, 0.971-0.974) for the radiology residents (P < .001).
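
For readers unfamiliar with ANOVA, the sketch below computes the one-way F statistic that such a test is built on, using only the Python standard library. The scores are made up for illustration and are not study data. Turning F into a P value needs the F distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.f_oneway`) would normally be used for that step.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for two or more groups of numbers.

    Assumes the groups are not all constant (within-group variance > 0).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up per-image sensitivities, for illustration only.
ai_scores = [0.70, 0.75, 0.80, 0.65]
resident_scores = [0.72, 0.74, 0.78, 0.68]
f_stat, df_b, df_w = one_way_anova_f(ai_scores, resident_scores)
```

A small F means the difference between the two group means is small relative to the spread within each group, which is the situation a non-significant result such as P = .66 describes.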

These findings suggest that well-trained AI algorithms can indeed reach performance levels similar to radiology residents even when covering the large breadth of findings seen in AP frontal chest radiographs.

So how did we achieve this?  

We assembled a large team of clinicians, radiologists, machine learning researchers, software engineers, statisticians, and project managers to complete this project in about two years. In particular, our clinical partners were involved throughout the design and execution of the study and played an integral role in the development of the AI algorithms.

Cataloging the findings

The first order of business was to assemble a comprehensive list of the findings seen in AP chest X-rays. For this, we ran a multi-disciplinary effort to build a chest X-ray lexicon. This lexicon now consists of 11,977 vocabulary terms mined from over 200,000 radiology reports and curated by clinicians using an active domain learning assistant (see Figure 2). It captures over 78 core findings and over 457 fine-grained findings in chest X-rays that characterize anatomical location, laterality, size, severity, shape, appearance, and other features. The resulting lexicon can itself be valuable for other informatics and clinical purposes and is being described in an upcoming paper at AMIA 2020 [19].
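
To make the idea of a lexicon concrete, here is a minimal sketch of how vocabulary terms could map to a core finding plus typed modifiers. The entries, field names, and structure are hypothetical stand-ins chosen for illustration; the actual lexicon format is not described in this post.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical toy entries: several vocabulary terms can name one core finding.
CORE_FINDING_TERMS = {
    "cardiomegaly": "cardiomegaly",
    "enlarged heart": "cardiomegaly",
    "pulmonary edema": "pulmonary edema",
}

# Hypothetical modifier vocabulary: term -> (modifier type, normalized value).
MODIFIER_TERMS = {
    "mild": ("severity", "mild"),
    "severe": ("severity", "severe"),
    "left": ("laterality", "left"),
    "right": ("laterality", "right"),
}


@dataclass(frozen=True)
class FineGrainedFinding:
    """A core finding refined by zero or more (type, value) modifiers."""
    core_finding: str
    modifiers: tuple = ()


def build_finding(finding_term: str, modifier_terms=()) -> FineGrainedFinding:
    """Normalize raw report terms into a fine-grained finding."""
    core = CORE_FINDING_TERMS[finding_term.lower()]
    # Sort so the same modifiers always produce the same finding.
    mods = tuple(sorted(MODIFIER_TERMS[t.lower()] for t in modifier_terms))
    return FineGrainedFinding(core, mods)


finding = build_finding("Enlarged heart", ["mild"])
```

Mapping synonyms to one core finding is what lets differently worded reports produce the same label.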

Figure 2: Illustration of chest X-ray lexicon assembly.

Creating labeled chest X-ray dataset for fine-grained labels

The next problem was to obtain a set of labeled images for these findings. At the time we started the study, only the NIH dataset was available; it was labeled for only 14 findings using a text labeling method, and the original reports were not provided.

To obtain our labels, we launched a massive effort to collect fresh radiology reads for these images using a cloud-based, crowd-sourced annotation system called MedNet, developed and deployed to over 45 radiologists around the country. Even so, only 17,000 reports could be collected in a period of a few months.

To increase the set of labeled data, we obtained a dataset from MIMIC-4 through an LCP consortium agreement that allowed advanced access to radiology reports for over 320,000 images. Since our goal was to extract fine-grained findings, and no existing text analytics could achieve this, we developed a new algorithm for sentence analysis that combines natural language dependency parsing with finding detection and phrasal grouping.

This approach is being described in an upcoming paper at AMIA [20] and is illustrated in Figure 3.

Using this algorithm and the reports provided in the MIMIC dataset, we were able to label 342,000 images with over 97 percent precision and recall, with about 2 percent error in detecting negative findings and 1 percent error in associating modifiers with findings. The total labeling time dropped dramatically to one week, including verification, making this a scalable approach to labeling images in the future. This high accuracy was possible thanks to both the large chest X-ray lexicon we created and the combination of phrasal grouping with natural language dependency parsing.
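
The sketch below shows the general shape of turning report text into positive and negative finding labels. It is a deliberately simple stand-in that uses term matching and a cue-word negation check within each sentence; it does not implement the dependency parsing and phrasal grouping described above, and the term lists are hypothetical.

```python
import re

# Hypothetical toy vocabulary: report term -> core finding.
FINDING_TERMS = {
    "cardiomegaly": "cardiomegaly",
    "enlarged heart": "cardiomegaly",
    "pneumothorax": "pneumothorax",
    "pleural effusion": "pleural effusion",
}

# A negation cue anywhere earlier in the same sentence negates the finding.
NEGATION_CUE = re.compile(r"\b(?:no|without|negative for|free of)\b")


def label_report(report: str) -> dict:
    """Map each finding mentioned in the report to True (present) or False (negated)."""
    labels = {}
    for sentence in re.split(r"[.;\n]+", report.lower()):
        for term, finding in FINDING_TERMS.items():
            position = sentence.find(term)
            if position == -1:
                continue
            negated = NEGATION_CUE.search(sentence[:position]) is not None
            # A positive mention anywhere in the report wins over a negated one.
            labels[finding] = labels.get(finding, False) or not negated
    return labels


labels = label_report("Mild cardiomegaly. No pneumothorax or pleural effusion.")
```

A matcher like this misses phrasings such as "the heart is enlarged" and cannot tell which modifier belongs to which finding, which is the kind of gap a parsing-based approach is meant to close.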

Figure 3. Illustration of assembly of fine-grained labels

Designing deep learning networks for large number of labels

Since such a large label set had not previously been used to develop a deep learning network for chest X-rays, we experimented with several designs. Ultimately, we came to the surprising conclusion that a single network covering all labels worked best, provided we incorporated principles of good network design, as shown in Figure 4.
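
The post does not spell out the output layer or loss of the network. A common choice for a single network predicting many non-exclusive labels is one independent sigmoid output per label trained with binary cross-entropy, and the sketch below assumes that setup purely for illustration. The logits and label names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_labels(logits: dict, threshold: float = 0.5) -> set:
    """Each label is decided independently, so an image can carry several findings."""
    return {label for label, z in logits.items() if sigmoid(z) >= threshold}


def binary_cross_entropy(logits: dict, targets: set) -> float:
    """Mean per-label binary cross-entropy, computed directly from logits."""
    total = 0.0
    for label, z in logits.items():
        y = 1.0 if label in targets else 0.0
        # Stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical network outputs for one image.
logits = {"cardiomegaly": 2.0, "pneumothorax": -3.0, "pulmonary edema": 0.4}
predicted = predict_labels(logits)
loss = binary_cross_entropy(logits, {"cardiomegaly", "pulmonary edema"})
```

With independent outputs, adding a label only adds one output unit, which is one reason a single shared network can scale to a large label set.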

Figure 4. Deep learning architecture used to learn large scale chest X-ray labels.

Figure 5 shows the predicted labels by our deep learning network for sample chest X-rays.

Figure 5. Upper row, sample chest X-rays depicting different findings. Second row shows the actual labels used for training and prediction. The last row gives an appropriate sentence to describe the finding in words.

Conducting clinical studies

Once the AI algorithm was ready, it was time to test both the AI algorithm and the residents on an independent dataset that neither had seen. Since the goal was to compare performance on AP chest X-rays, a separate dataset of 1,998 AP chest X-rays was assembled, drawing from the AP views of unique patients in the NIH dataset. A triple consensus with adjudication process was used to establish the ground truth for the dataset: three board-certified radiologists with three to five years of experience first labeled the images independently.

The discrepancies were then resolved through in-person discussions to arrive at consensus labels. The same labels used for training the AI algorithm were shown in special clinical study user interfaces developed for this task. The ground truth assembly process took several months. Since we started from AP chest X-rays, the resulting prevalence distribution was considerably skewed towards tubes and lines labels; although unintended, this made the dataset a difficult test of the generalizability of the AI algorithm.
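
The bookkeeping behind a triple consensus with adjudication can be sketched as follows: labels chosen by all three readers are accepted, and any label on which the readers disagree is set aside for discussion. This is a minimal sketch under assumptions; the function names and data layout are hypothetical, and the in-person discussion is represented only by a dictionary of final decisions.

```python
def split_consensus(reads: list) -> tuple:
    """Return (unanimous, disputed) labels from independent reads of one image."""
    unanimous = set.intersection(*reads)   # chosen by every reader
    mentioned = set.union(*reads)          # chosen by at least one reader
    return unanimous, mentioned - unanimous


def adjudicate(unanimous: set, disputed: set, decisions: dict) -> set:
    """Combine unanimous labels with the outcome of the discussion."""
    missing = disputed - decisions.keys()
    if missing:
        raise ValueError(f"No decision recorded for: {sorted(missing)}")
    return unanimous | {label for label in disputed if decisions[label]}


# Hypothetical reads of a single image by three radiologists.
reads = [
    {"cardiomegaly", "central line"},
    {"cardiomegaly"},
    {"cardiomegaly", "central line", "pleural effusion"},
]
unanimous, disputed = split_consensus(reads)
final = adjudicate(
    unanimous,
    disputed,
    {"central line": True, "pleural effusion": False},
)
```

Unlike a majority vote, every disagreement is looked at explicitly, so a label picked by two of three readers is not accepted automatically.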

Figure 6. Illustration of user interface designed to collect objective reads from radiology residents.

Five radiology residents were selected from academic medical centers around the country after passing a reading adequacy test on five unrelated chest radiographs. Each resident independently evaluated a non-overlapping set of about 400 images from the test set using the special web-based structured form interface shown in Figure 6, in which they were asked to select from the same 72 possible findings. This objective capture avoided the accidental ‘forgets’ typical of free-form reporting. The residents were trained on the finding definitions beforehand, so that selecting abnormalities from the list shown in the user interface did not artificially constrain them.
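
One simple way to produce non-overlapping reading sets of roughly equal size is to shuffle the image list and deal it out in turn, as sketched below. The post does not say how images were actually assigned, so the random assignment, the seed, and the reader names are assumptions made for illustration.

```python
import random


def assign_non_overlapping(image_ids, readers, seed=0):
    """Shuffle the images and deal them out so no image goes to two readers."""
    shuffled = list(image_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    # Reader i receives every len(readers)-th image starting at position i.
    return {
        reader: shuffled[i::len(readers)]
        for i, reader in enumerate(readers)
    }


readers = [f"resident_{n}" for n in range(1, 6)]   # hypothetical names
assignment = assign_non_overlapping(range(1998), readers)
set_sizes = sorted(len(images) for images in assignment.values())
```

With 1,998 images and five readers, this gives three sets of 400 images and two of 399, in line with the "about 400" images per resident mentioned above.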

Interpretation of results

The results reported in our JAMA Network Open paper [18] showed close performance between the AI algorithm and the residents. Overall, the AI algorithm performed similarly to residents for tubes and lines and for normal reads, and generally outperformed them for high-prevalence labels such as cardiomegaly, pulmonary edema, subcutaneous air, and hyperaeration. Conversely, the AI algorithm generally performed worse for lower-prevalence findings that are also more difficult to interpret, such as masses/nodules and enlarged hilum.

So what does this all mean?

Our chest X-ray challenge started with the goal of building machines that can perform a preliminary read of chest X-rays provably at the level of at least entry-level radiologists. While we have reached this goal, the systematic development process we used along the way can be instructive for the AI and radiology communities in several ways.

Currently, we see a trend in which radiologists are leveraging deep learning techniques to build their own brand of models, while AI researchers are using clinicians primarily in a supportive capacity. This work shows that it takes a multidisciplinary team effort, in which clinicians, machine learning researchers, medical imaging specialists, software developers, and statisticians all work together on a systematic end-to-end study design and execution.

From the study, we learned that well-trained artificial intelligence algorithms can reach performance levels similar to radiology residents in covering the breadth of chest X-ray findings seen in AP frontal chest radiographs. This opens up the potential for their future use in generating full-fledged preliminary textual reports, which could further expedite radiology reads and address resource scarcity.


  1. Zha N, Patlas MN DR. Radiologist burnout is not just isolated to the United States: perspectives from Canada. J Am Coll Radiol. 2019;16(1):121-123.
  2. Kane L. Medscape National Physician Burnout, Depression & Suicide Report 2019. medscape.com/slideshow/2019-lifestyle-burnout-depression-6011056. Published online 2019.
  3. Lee CS, Nagy PG, Weaver SJ, Newman-Toker DE. Cognitive and system factors contributing to diagnostic errors in radiology. Am J Roentgenol. 2013;201(3):611-617. doi:10.2214/AJR.12.10375
  4. Delrue L, Gosselin R, Ilsen B, Landeghem A, De Mey J, Duyck P. Difficulties in the Interpretation of Chest Radiography. In: Comparative Interpretation of CT and Standard Radiography of the Chest. 2011:27-49. doi:10.1007/978-3-540-79942-9_2
  5. Wang X, Peng Y, Lu L, Lu Z, Bagheri M SR. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. ; 2017:2097-2106.
  6. Johnson AEW, Pollard TJ, Berkowitz S, et al. MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv Prepr arXiv190107042. Published online 2019.
  7. Irvin J, Rajpurkar P, Ko M, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. ; 2019:590-597.
  8. Taylor AG, Mielke C MJ. Automated detection of moderate and large pneumothorax on frontal chest X-rays using deep convolutional neural networks: A retrospective study. PLoS Med. 2018;15(11).
  9. Rajpurkar P, Irvin J, Zhu K et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv Prepr arXiv171105225.
  10. Pan I, Cadrin-Chênevert A CP. Tackling the Radiological Society of North America Pneumonia Detection Challenge. Am J Roentgenol. 2019;213(3):568-574.
  11. FDA-approved A.I.-based algorithm. Published 2020.
  12. Syeda-Mahmood T, Wong KCL, Gur Y, et al. Chest X-Ray Report Generation Through Fine-Grained Label Learning. In: Springer, Cham; 2020:561-571. doi:10.1007/978-3-030-59713-9_54
  13. Chauhan G, Liao R, Wells W, et al. Joint Modeling of Chest Radiographs and Radiology Reports for Pulmonary Edema Assessment. In: Springer, Cham; 2020:529-539. doi:10.1007/978-3-030-59713-9_51
  14. Irvin J, Rajpurkar P, Ko M et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. ; 2019:590-597.
  15. Majkowska A, Mittal S, Steiner DF et al. Chest radiograph interpretation with deep learning models: assessment with radiologist-adjudicated reference standards and population-adjusted evaluation. Radiology. 2020;294(2):421-431.
  16. Annarumma M, Withey SJ, Bakewell RJ, Pesce E, Goh V MG. Automated triaging of adult chest radiographs with deep artificial neural networks. Radiology. 2019;291(1):196-202.
  17. Wong KCL, Moradi M, Wu J S-MT. Identifying disease-free chest x-ray images with deep transfer learning. In: Proc. SPIE Computer-Aided Diagnosis. ; 2019.
  18. Wu JT, Wong KCL, Gur Y, et al. Comparison of Chest Radiograph Interpretations by Artificial Intelligence Algorithm vs Radiology Residents. JAMA Netw Open. 2020;3(10):e2022779. doi:10.1001/jamanetworkopen.2020.22779
  19. Wu J, et al. AI Accelerated Human-in-the-loop Structuring of Radiology Report. In: American Medical Informatics Association (AMIA) Annual Symposium. ; 2020.
  20. Syeda-Mahmood T, et al. Extracting and Learning Fine-grained Labels from Chest Radiographs. In: American Medical Informatics Association (AMIA) Annual Symposium. ; 2020.
  21. Yu F K V. Multi-scale context aggregation by dilated convolutions. arXiv Prepr arXiv151107122.
  22. He K, Zhang X, Ren S SJ. Identity mappings in deep residual networks. In: Proc. European Conference on Computer Vision. ; 2016:630-645.
  23. Wu Y HK. Group normalization. In: Proc. European Conference on Computer Vision. ; 2018:3-19.
  24. Brownlee J. A Gentle Introduction to the Rectified Linear Unit (ReLU). Deep Learning Performance. Published 2019. Accessed June 26, 2020.
  25. Lin T-Y MS. Improved bilinear pooling with CNNs. arXiv Prepr arXiv170706772.



IBM Fellow, Chief Scientist, Medical Sieve Radiology Grand Challenge
