A MHA-based integrated diagnosis and segmentation method for COVID-19 pandemic
Journal of Image and Graphics; 27(12):3651-3662, 2022.
Article in Chinese | Scopus | ID: covidwho-2203674
ABSTRACT
Objective  To help contain the COVID-19 (coronavirus disease 2019) pandemic, the first priority is to identify and isolate infectious patients in time. The traditional PCR (polymerase chain reaction) screening method is costly and time-consuming. Emerging AI (artificial intelligence)-based deep learning networks have been applied in medical imaging for COVID-19 diagnosis and pathological lung segmentation. However, current networks are mostly restricted to experimental datasets with a limited number of chest X-ray (CXR) images, and each addresses only a single task, either diagnosis or segmentation. Moreover, most networks are based on the convolutional neural network (CNN), whose convolution operation extracts local features from neighboring pixels and is constrained in explicitly modeling long-range dependencies. We develop a vision transformer network (ViTNet) in which the multi-head attention (MHA) mechanism models long-range dependencies between pixels.

Method  We built a novel transformer network, ViTNet, for joint diagnosis and segmentation. ViTNet is composed of three parts: a dual-path feature embedding, a transformer module, and a segmentation-oriented feature decoder. 1) The dual-path feature embedding encodes the input CXR in two ways. The first applies a 2D convolution whose sliding step (stride) equals its kernel size, dividing the CXR into non-overlapping patches and building an input vector for each patch. The second uses a pre-trained ResNet34 backbone to extract a deep CXR feature map. 2) The transformer module is composed of six encoders and one cross-attention module. The vector sequence generated by the 2D convolution is the input to the transformer encoders; because these inputs are extracted directly from image pixels, they can be regarded as shallow, intuitive CXR features. The six encoders run in sequence, transforming the shallow features into high-level global features. The cross-attention module takes the outputs of the backbone and of the transformer encoders as its inputs, so the network combines the deep features with the encoded shallow features, absorbing global information from the encoded shallow features and local information from the deep features. 3) The segmentation-oriented feature decoder doubles the feature-map size stage by stage and produces the segmentation result. Because the network must handle classification and segmentation simultaneously, a hybrid loss function is employed to balance the training effort between the two tasks: the classification loss is the sum of a contrastive loss and a multi-class cross-entropy loss, and the segmentation loss is a binary cross-entropy loss (a sketch of this objective is given below). Furthermore, a new five-class CXR dataset is compiled, comprising 2 951 COVID-19 CXRs, 16 964 healthy CXRs, 6 103 bacterial pneumonia CXRs, 5 725 viral pneumonia CXRs, and 6 723 CXRs with lung opacity. All COVID-19 CXRs in this dataset are labeled with masks of the COVID-19-infected lung regions. During training, input images are resized to 448 × 448 pixels, the learning rate is initialized to 2 × 10⁻⁴ and decreased gradually in a self-adaptive manner, the total number of iterations is 200, and Adam optimization is run on four Tesla K80 GPUs.
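The patch embedding (a 2D convolution whose stride equals its kernel size) and the hybrid objective are described precisely enough to sketch. The following PyTorch snippet is a minimal illustration under stated assumptions, not the authors' implementation: the embedding dimension, patch size, the NT-Xent-style form of the contrastive term, and the seg_weight balancing factor are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedding(nn.Module):
    """Embed a CXR as a token sequence: a 2D convolution whose stride
    equals its kernel size yields one vector per non-overlapping patch."""
    def __init__(self, in_ch=1, embed_dim=768, patch_size=16):  # assumed sizes
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, 448, 448)
        x = self.proj(x)                     # (B, D, 28, 28)
        return x.flatten(2).transpose(1, 2)  # (B, 784, D): one token per patch

def hybrid_loss(cls_logits, labels, seg_logits, masks, z_i, z_j,
                temperature=0.5, seg_weight=1.0):
    """Hybrid objective: multi-class cross-entropy plus a contrastive term
    for classification, and binary cross-entropy for segmentation."""
    ce = F.cross_entropy(cls_logits, labels)
    # NT-Xent-style contrastive term between two embeddings (z_i, z_j) of
    # the same image; the paper's exact contrastive formulation may differ.
    sim = F.cosine_similarity(z_i.unsqueeze(1), z_j.unsqueeze(0), dim=-1)
    contrastive = F.cross_entropy(sim / temperature,
                                  torch.arange(len(z_i), device=z_i.device))
    bce = F.binary_cross_entropy_with_logits(seg_logits, masks)
    return ce + contrastive + seg_weight * bce
```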
Result  In the classification experiments, we compared ViTNet with a general transformer network and five popular CNN deep-learning models (i.e., ResNet18, ResNet50, VGG16 (Visual Geometry Group), Inception_v3, and the deep layer aggregation network (DLAN)) in terms of overall prediction accuracy, recall, F1 score, and the kappa coefficient. Our model performs best with 95.37% accuracy, followed by Inception_v3 and DLAN with 95.17% and 94.40% accuracy, respectively, while VGG16 reaches 94.19%. For recall, F1 score, and the kappa coefficient, our model likewise outperforms the other networks. In the segmentation experiments, ViTNet is compared with four commonly used segmentation networks: the pyramid scene parsing network (PSPNet), U-Net, U-Net+, and the context encoder network (CE-Net). The evaluation indicators are accuracy, sensitivity, specificity, the Dice coefficient, and the area under the ROC (receiver operating characteristic) curve (AUC), as sketched below. The experimental results show that our model performs best in accuracy and AUC, and its sensitivity is second best, inferior only to U-Net+. More specifically, our model achieves 95.96% accuracy, 78.89% sensitivity, 97.97% specificity, 98.55% AUC, and a Dice coefficient of 76.68%. In terms of efficiency, our model processes one CXR in 0.56 s. In addition, the segmentation results of six COVID-19 CXR images obtained by all the segmentation networks are illustrated (Fig. 5), where our model shows the best segmentation performance. A limitation of our model is that it can incorrectly classify a COVID-19 case as healthy, which is unacceptable in screening. The PCR method for COVID-19 is probably more trustworthy than the deep-learning method, but its test results typically take one or two days.

Conclusion  A novel ViTNet method is developed that simultaneously performs automatic diagnosis on CXR and segmentation of the COVID-19-infected lung regions. ViTNet delivers superior diagnostic performance and demonstrates promising segmentation ability.
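For concreteness, the pixel-wise indicators above can be computed from the confusion matrix between a predicted binary mask and the ground-truth mask. The NumPy sketch below is illustrative only and not taken from the paper; AUC additionally requires the continuous prediction scores and is omitted.

```python
import numpy as np

def segmentation_metrics(pred_mask, true_mask, eps=1e-7):
    """Pixel-wise accuracy, sensitivity, specificity and Dice for binary masks."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    tp = np.sum(pred & true)    # infected pixels correctly detected
    tn = np.sum(~pred & ~true)  # background pixels correctly rejected
    fp = np.sum(pred & ~true)   # false alarms
    fn = np.sum(~pred & true)   # missed infected pixels
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
        "dice":        2 * tp / (2 * tp + fp + fn + eps),
    }
```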
© 2022 Editorial and Publishing Board of JIG. All rights reserved.
Full text: Available | Collection: Databases of international organizations | Database: Scopus | Language: Chinese | Journal: Journal of Image and Graphics | Year: 2022 | Document Type: Article
