A Vision Transformer Model Using Self-Supervised Task for Scoliosis Identification From X-Ray Images Based on Internet of Medical Things

Document Type

Article

Publication Date

10-28-2025

Identifier/URL

43079315 (Pure)

Abstract

In recent years, the rapid advancement and application of deep learning in medical imaging have demonstrated its effectiveness in reducing physicians’ workload and lowering the risk of misdiagnosis in pathological spine diagnosis. Nevertheless, deep learning–based models for pathological spine diagnosis have not yet matured to the level required for clinical deployment. Several challenges contribute to this limitation. First, the availability of spinal X-ray images for training is limited, and the class distribution of samples is often imbalanced. Second, conventional deep learning models rely on convolutional kernels that primarily capture local features in X-ray images, while overlooking the global morphological characteristics of the spine. To address these issues, we propose ViTST, a Vision Transformer (ViT)–based model with a self-supervised learning task for scoliosis classification. ViTST incorporates a masked strategy–based self-supervised pretext task to mitigate the challenges posed by limited training data and leverages the ViT architecture to capture global structural features of spinal X-ray images. This design enables more effective modeling of inter-regional relationships and variations within the spine. Moreover, by jointly optimizing reconstruction loss and cross-entropy loss, ViTST learns robust image representations even from relatively small datasets. In addition, we introduce a healthcare Internet of Medical Things (IoMT) architecture to enable the practical deployment of ViTST in clinical environments. Through this IoMT platform, clinicians can monitor patients’ conditions in real time and adapt treatment plans dynamically, thereby enhancing clinical decision-making and accelerating patient recovery. Finally, we conducted extensive experiments on a real-world pathological spine image dataset to validate the effectiveness of the proposed model. Experimental results demonstrate that ViTST achieved a Precision of 0.975, an Accuracy of 0.979, and an F1-score of 0.975, confirming its strong potential for application in clinical practice.
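The abstract describes jointly optimizing a reconstruction loss (from the masked pretext task) with a cross-entropy loss (for scoliosis classification). A minimal sketch of such a joint objective is shown below; the function names, the MAE-style masked mean-squared error over patches, and the weighting factor `lam` are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def reconstruction_loss(pred_patches, true_patches, mask):
    """MAE-style masked MSE: average squared error over masked patches only.

    pred_patches, true_patches: (num_patches, patch_dim) arrays.
    mask: (num_patches,) binary array, 1 where the patch was masked.
    """
    per_patch_mse = ((pred_patches - true_patches) ** 2).mean(axis=-1)
    return (per_patch_mse * mask).sum() / mask.sum()

def cross_entropy_loss(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def joint_loss(pred_patches, true_patches, mask, logits, label, lam=1.0):
    """Weighted sum of the two objectives; lam is an assumed trade-off
    hyperparameter balancing reconstruction against classification."""
    return (reconstruction_loss(pred_patches, true_patches, mask)
            + lam * cross_entropy_loss(logits, label))
```

With perfect reconstruction the masked MSE term vanishes and the joint loss reduces to the classification term alone, so `lam` controls how strongly the pretext task regularizes the classifier.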

DOI

10.1109/JIOT.2025.3625928
