A Vision Transformer Model Using Self-Supervised Task for Scoliosis Identification From X-Ray Images Based on Internet of Medical Things
Document Type
Article
Publication Date
10-28-2025
Identifier/URL
43079315 (Pure)
Abstract
In recent years, the rapid advancement and application of deep learning in medical imaging have demonstrated its effectiveness in reducing physicians’ workload and lowering the risk of misdiagnosis in pathological spine diagnosis. Nevertheless, deep learning–based models for pathological spine diagnosis have not yet matured to the level required for clinical deployment. Several challenges contribute to this limitation. First, the availability of spinal X-ray images for training is limited, and the class distribution of samples is often imbalanced. Second, conventional deep learning models rely on convolutional kernels that primarily capture local features in X-ray images, while overlooking the global morphological characteristics of the spine. To address these issues, we propose ViTST, a Vision Transformer (ViT)–based model with a self-supervised learning task for scoliosis classification. ViTST incorporates a masked strategy–based self-supervised pretext task to mitigate the challenges posed by limited training data and leverages the ViT architecture to capture global structural features of spinal X-ray images. This design enables more effective modeling of inter-regional relationships and variations within the spine. Moreover, by jointly optimizing reconstruction loss and cross-entropy loss, ViTST learns robust image representations even from relatively small datasets. In addition, we introduce a healthcare Internet of Medical Things (IoMT) architecture to enable the practical deployment of ViTST in clinical environments. Through this IoMT platform, clinicians can monitor patients’ conditions in real time and adapt treatment plans dynamically, thereby enhancing clinical decision-making and accelerating patient recovery. Finally, we conducted extensive experiments on a real-world pathological spine image dataset to validate the effectiveness of the proposed model. 
Experimental results demonstrate that ViTST achieved a Precision of 0.975, an Accuracy of 0.979, and an F1-score of 0.975, confirming its strong potential for application in clinical practice.
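The joint objective described in the abstract (reconstruction loss plus cross-entropy loss) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the masked-patch MSE formulation and the weighting factor `lam` are assumptions, since the abstract does not specify how the two terms are combined.

```python
import numpy as np

def joint_loss(pred_patches, true_patches, mask, logits, label, lam=1.0):
    """Joint objective: masked-patch reconstruction MSE + cross-entropy.

    pred_patches, true_patches: (num_patches, patch_dim) arrays
    mask: boolean (num_patches,) array, True where a patch was masked
    logits: (num_classes,) raw class scores
    label: int, ground-truth class index
    lam: assumed weighting between the two terms (not given in the abstract)
    """
    # Reconstruction loss: MSE computed only over the masked patches,
    # as is typical in masked-image-modeling pretext tasks.
    diff = pred_patches[mask] - true_patches[mask]
    recon = np.mean(diff ** 2)

    # Cross-entropy from raw logits via a numerically stable log-softmax.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    ce = -log_probs[label]

    return recon + lam * ce
```

In practice the two terms would be backpropagated through the ViT encoder jointly, so the shared representation serves both the pretext reconstruction task and the scoliosis classification head.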
Repository Citation
Yang, C., Huang, K., Zhao, X., Lu, X., Pan, S., Song, S., Wu, Z., Wu, H., & Lan, C. (2025). A Vision Transformer Model Using Self-Supervised Task for Scoliosis Identification From X-Ray Images Based on Internet of Medical Things. IEEE Internet of Things Journal.
https://corescholar.libraries.wright.edu/ee/142
DOI
10.1109/JIOT.2025.3625928
