Explainable, Radiologist Mimicking, Deep-Learning for Detection of Acute Intracranial Haemorrhage from Small CT Datasets



Our paper reports the construction of understandable deep-learning algorithms for accurate, highly sensitive detection and classification of intracranial hemorrhage (ICH) on unenhanced head CT scans, using a small dataset from fewer than 1000 patients [Fig 1].

We selected ICH as our use case because it is a high-impact, potentially life-threatening condition. It requires rapid recognition for prompt, appropriate treatment of patients presenting with acute stroke symptoms, to prevent or mitigate major disability or death. As many facilities do not have access to subspecialty-trained neuroradiologists - especially at night, on weekends, or in developing countries - non-expert healthcare providers are often required to diagnose or exclude acute hemorrhage. The availability of a reliable “virtual” second opinion, trained by neuroradiologists, could help make healthcare providers more efficient and confident. Moreover, such a platform could also help to prioritize CT scans for readout by alerting the care team to the presence of hemorrhage. Noteworthy features of our approach include (1) EXPLAINABILITY, (2) RADIOLOGIST MIMICKING, and (3) the use of SMALL DATASETS [Fig 2].

Our team consists of dedicated clinical radiologists and imaging data scientists, who had a terrific time collaborating on this effort [Fig 3]. One of the most important lessons we learned early on was that, to develop radiologist-level “artificial” intelligence, especially with small data, it was essential that we leverage the domain expertise of our “human” intelligence. “Platinum level” reference standard annotation and labeling proved to be essential for both training and testing.

Fig 3: Top row: Hyunkwang Lee, Sehyo Yune MD, Mohammad Mansouri MD, Myeongchan Kim MD, Shahein H. Tajmir MD; Middle row: Claude E. Guerrier MD, Sarah A. Ebert MD, Stuart R. Pomerantz MD, Javier M. Romero MD; Bottom row: Shahmir Kamalian MD, Ramon G. Gonzalez MD PhD, Michael H. Lev MD, and Synho Do PhD.

The origins of our speculation on how convolutional neural networks (CNNs - although we didn’t call them that back then) might “mimic” radiologist behavior date back over 30 years, to a casual conversation in the mid 1980s at the corner of Brown and Meeting Streets in Providence RI, between then medical students Michael Lev and Michael Shadlen (the latter a newly minted Stanford PhD, currently an HHMI professor of neuroscience at Columbia University). Dr. Shadlen, through some clever back-of-napkin sketches of a 3x3 matrix, explained how associative memory could be modeled as the “weights” of the synaptic connections between an “input” layer of neurons (such as in the retina) and an “output” layer of neurons (such as in the occipital visual cortex). Moreover, even this simple, nine-interneuron model could quickly “learn” by updating the synaptic weights through iterative feedback with error correction [Fig 4].
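
The error-correcting update described above can be sketched in a few lines of Python. This is only an illustration, assuming a simple delta rule on a 3x3 weight matrix; the learning rate, the training patterns, and the function names are hypothetical and were not part of the original napkin sketch.

```python
# Minimal sketch of a 3x3 associative memory trained by error correction
# (the delta rule). Learning rate, patterns, and epoch count are
# illustrative assumptions, not details from the original conversation.

def recall(weights, pattern):
    """Output-layer activity: each output neuron sums its weighted inputs."""
    return [sum(w * x for w, x in zip(row, pattern)) for row in weights]


def train(pairs, rate=0.2, epochs=200):
    """Learn a 3x3 weight matrix that maps input patterns to target outputs."""
    weights = [[0.0] * 3 for _ in range(3)]
    for _ in range(epochs):
        for pattern, target in pairs:
            output = recall(weights, pattern)
            for i in range(3):
                error = target[i] - output[i]  # feedback: how wrong was output i?
                for j in range(3):
                    # Strengthen or weaken each "synapse" in proportion to the error.
                    weights[i][j] += rate * error * pattern[j]
    return weights


# Hypothetical input -> output associations for the nine-weight model.
pairs = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([0.0, 1.0, 0.0], [0.0, 0.0, 1.0]),
    ([0.0, 0.0, 1.0], [1.0, 0.0, 0.0]),
]
weights = train(pairs)
```

After training, calling `recall(weights, pattern)` on any of the stored input patterns returns a vector very close to its paired target, which is the sense in which the weights act as a memory.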

Fast forward to fall 2015, when IBM Watson acquired Merge Healthcare, signaling what has since become an explosive growth in medical AI in general, and imaging AI in particular. Although recent advances in image recognition have achieved physician-level performance in various tasks, “big data” and the “black box” problem increasingly present substantial obstacles to the development of medical deep-learning systems and their adoption into clinical practice. Specifically, collecting and labeling the very large (tens to hundreds of thousands of images), well-annotated datasets required for training, validation, and testing is an expensive, time-consuming, and often infeasible task. Moreover, despite their success, the inner workings and decision-making processes of machine learning algorithms remain opaque. Indeed, the FDA requires clinical decision support software to explain the rationale for its decisions, to enable users to independently review the basis for its recommendations. Overcoming these two major barriers to training (i.e., big data) and implementation (i.e., black box) of our AI algorithms has been our overarching goal, and has been largely accomplished through our “radiologist mimicking” approach.

Specifically, we achieved high performance using a relatively small dataset by mimicking the workflow of radiologists. Our approach includes: (a) encoding images into a multichannel composite with standard and customized “window & level” display settings; (b) using slice interpolation to reduce the false positive impact of volume averaging and intracranial calcifications (without the need for manually segmented data from thin-slice CT images); and (c) using axial 2D datasets, rather than volumetric 3D datasets, which not only significantly improves our mean average precision (mAP), but also avoids the “curse of dimensionality” (i.e., the amount of data required to train a deep-learning model scales exponentially with the dimensionality of the data).
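
To make steps (a) and (b) more concrete, here is a minimal Python sketch. It assumes pixel values are already in Hounsfield units (HU), and the window settings, the linear blending scheme, and all function names are illustrative assumptions rather than the settings or code used in the paper.

```python
# Illustrative sketch only: the (level, width) pairs below are hypothetical
# and the linear blend is an assumed interpolation scheme; neither is taken
# from the paper.

def apply_window(hu, level, width):
    """Map one Hounsfield-unit value to [0, 1] for a given window level and width."""
    low = level - width / 2.0
    scaled = (hu - low) / width
    return min(1.0, max(0.0, scaled))  # clip values outside the window


# Hypothetical (level, width) pairs in HU, one per output channel.
WINDOWS = [(40, 80), (80, 200), (600, 2800)]


def encode_multichannel(slice_hu, windows=WINDOWS):
    """Encode a 2D slice (rows of HU values) as one windowed channel per setting.

    The result is channel-first: channels[c][row][col].
    """
    return [
        [[apply_window(hu, level, width) for hu in row] for row in slice_hu]
        for level, width in windows
    ]


def interpolate_slices(lower, upper, fraction=0.5):
    """Linearly blend two adjacent axial slices (an assumed interpolation scheme)."""
    return [
        [(1.0 - fraction) * a + fraction * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(lower, upper)
    ]


# Toy 2x2 "slice": air, water, and two soft-tissue-range values.
slice_hu = [[-1000.0, 0.0], [40.0, 80.0]]
channels = encode_multichannel(slice_hu)
blended = interpolate_slices([[0.0, 10.0]], [[10.0, 30.0]])
```

Each channel plays the role of one display setting a radiologist might toggle through, so a single 2D input carries several views of the same slice. A real pipeline would vectorize these loops with an array library.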

Also noteworthy is our creation of an “explainable” prediction basis. Although our platform provides quantitative hemorrhage localization by attention maps, these do not fully explain the reasons for the decisions made by the model. We found that our models learned invariant features of various appearances and locations of ICH, so we incorporated retrieval of representative training images, which contain significant features relevant to each test image, into our platform as a supplement to the attention maps, thus providing the user with an explanation for the model’s predictions [Fig 5].
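
One way to picture this kind of retrieval is as a nearest-neighbor search over feature vectors. The sketch below assumes each training image has already been reduced to a short feature vector and ranks them by cosine similarity to the test image. The similarity measure, the image identifiers, and the feature values are all hypothetical; the post does not specify how retrieval is implemented.

```python
import math

# Illustrative sketch only: cosine similarity, the identifiers, and the
# feature values are assumptions, not the platform's actual method or data.


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_similar(test_features, atlas, top_k=2):
    """Return the ids of the top_k training images most similar to the test image."""
    ranked = sorted(
        atlas,
        key=lambda item: cosine_similarity(test_features, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in ranked[:top_k]]


# Hypothetical atlas of (image id, feature vector) pairs from the training set.
atlas = [
    ("train_001_subdural", [0.9, 0.1, 0.0]),
    ("train_002_subarachnoid", [0.1, 0.9, 0.1]),
    ("train_003_subdural", [0.8, 0.2, 0.1]),
]
matches = retrieve_similar([1.0, 0.1, 0.0], atlas)
```

Showing the retrieved training images next to the attention map lets a reader compare the test scan with labeled examples that the model treats as similar.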

We hope that our explainable AI system offers a practical tool that could stimulate greater physician adoption. There is currently a widespread belief that the answers to many crucial questions can be found from big data using deep learning. A large portion of healthcare big data, however, is unstructured and unconfirmed, making it potentially unsuitable for building deep-learning models. The balance between data size and quality, plus careful data processing tailored to each application, may be a key to developing high-performance deep-learning imaging algorithms in the future.

Michael H Lev, MD FAHA FACR

Director Emergency Radiology / ED Neuroradiology, Massachusetts General Hospital / HMS