Transformer-based mass detection in digital mammograms

Betancourt Tarifa, A. S.; Marrocco, C.; Molinara, M.; Tortorella, F.; Bria, A.

doi:10.1007/s12652-023-04517-9

In the last decade, Convolutional Neural Networks (CNNs) have been the de facto approach for automated detection in medical images. Recently, Vision Transformers have emerged in computer vision as an alternative to CNNs. Specifically, the Shifted Window (Swin) Transformer is a general-purpose backbone that learns attention-based hierarchical features and achieves state-of-the-art performance in a variety of vision tasks. In this work, for the first time, we design and evaluate transformer-based models for mass detection in digital mammograms, using the Swin Transformer as a multiscale backbone feature extractor. Experiments on OMI-DB, the largest publicly available mammography image database, yield a True Positive Rate (TPR) of 75.7 % at 0.1 False Positives per Image (FPpI) for the best transformer model, a 2.5 % TPR improvement over its convolutional counterpart and a 7.4 % TPR improvement over the state of the art. We also combine transformer- and convolution-based detectors with weighted box fusion, which adds a further 2.4 % TPR and reaches 78.1 % TPR at 0.1 FPpI.
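As an illustration of the reported metric, the following minimal sketch computes the TPR at a fixed FPpI operating point from a list of scored detections. It is not the authors' evaluation code: the function name `tpr_at_fppi` and the input layout are hypothetical, and it assumes each detection has already been matched to a ground-truth mass (or marked as a false positive) by some criterion that the abstract does not specify.

```python
def tpr_at_fppi(detections, n_lesions, n_images, target_fppi=0.1):
    """Return the highest TPR reachable without exceeding target_fppi.

    detections: iterable of (score, lesion_id) pairs, where lesion_id is
    None for a false positive (assumed to be matched beforehand).
    """
    # Lowering the score threshold corresponds to walking this list in order.
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    found = set()          # distinct ground-truth masses detected so far
    false_positives = 0
    best_tpr = 0.0
    i = 0
    while i < len(ordered):
        # Detections sharing a score enter together at the same threshold.
        score = ordered[i][0]
        j = i
        while j < len(ordered) and ordered[j][0] == score:
            lesion_id = ordered[j][1]
            if lesion_id is None:
                false_positives += 1
            else:
                found.add(lesion_id)
            j += 1
        # Stop once the false positive budget per image is exceeded.
        if false_positives / n_images > target_fppi:
            break
        best_tpr = len(found) / n_lesions
        i = j
    return best_tpr
```

For example, with 10 images, 4 annotated masses, and detections `[(0.9, "a"), (0.8, None), (0.7, "b"), (0.6, None), (0.5, "c")]`, the function returns 0.5: two masses are found before the second false positive pushes the rate above 0.1 FPpI.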