Mining Sequence Classifiers for Early Prediction
Document Type
Conference Proceeding
Publication Date
4-2008
Abstract
Supervised learning on sequence data, also known as sequence classification, has been well recognized as an important data mining task with many significant applications. Since temporal order is important in sequence data, in many critical applications of sequence classification such as medical diagnosis and disaster prediction, early prediction is a highly desirable feature of sequence classifiers. In early prediction, a sequence classifier should use a prefix of a sequence as short as possible to make a reasonably accurate prediction. To the best of our knowledge, early prediction on sequence data has not been studied systematically.
In this paper, we identify the novel problem of mining sequence classifiers for early prediction. We analyze the problem and the challenges. As the first attempt to tackle the problem, we propose two interesting methods. The sequential classification rule (SCR) method mines a set of sequential classification rules as a classifier. A so-called early-prediction utility is defined and used to select features and rules. The generalized sequential decision tree (GSDT) method adopts a divide-and-conquer strategy to generate a classification model. We conduct an extensive empirical evaluation on several real data sets. Interestingly, our two methods achieve accuracy comparable to that of the state-of-the-art methods, but typically need to use only very short prefixes of the sequences. The results clearly indicate that early prediction is highly feasible and effective.
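The idea of early prediction described in the abstract can be illustrated with a small sketch. It is not the paper's SCR or GSDT method: the rule format (a subsequence pattern paired with a class label), the function names `is_subsequence` and `early_predict`, and the first-match policy are all assumptions made for illustration. The sketch reads a sequence one symbol at a time and returns a label as soon as any rule matches the prefix seen so far, together with the prefix length it needed.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical rule format: (pattern, class label). A rule fires when its
# pattern occurs as a (not necessarily contiguous) subsequence of the prefix.
Rule = Tuple[Sequence[str], str]


def is_subsequence(pattern: Sequence[str], prefix: Sequence[str]) -> bool:
    """Return True if pattern's symbols appear in prefix in the same order."""
    it = iter(prefix)
    # `symbol in it` advances the iterator, so order is respected.
    return all(symbol in it for symbol in pattern)


def early_predict(
    sequence: Sequence[str],
    rules: List[Rule],
    default: Optional[str] = None,
) -> Tuple[Optional[str], int]:
    """Scan growing prefixes and predict as soon as some rule matches.

    Returns (label, prefix_length_used). If no rule ever matches, returns
    (default, len(sequence)). Rules are tried in list order (an assumption;
    a real classifier would rank rules by some quality measure).
    """
    for length in range(1, len(sequence) + 1):
        prefix = sequence[:length]
        for pattern, label in rules:
            if is_subsequence(pattern, prefix):
                return label, length
    return default, len(sequence)


if __name__ == "__main__":
    # Toy rules, invented for this example only.
    rules: List[Rule] = [(["a", "b"], "pos"), (["c"], "neg")]
    print(early_predict(["x", "a", "y", "b", "z", "z"], rules))
```

In this toy setup the prediction is made after four of the six symbols, which is the kind of saving early prediction aims for. How rules are mined and ranked, which is where the early-prediction utility mentioned in the abstract would come in, is deliberately left out of the sketch.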
Repository Citation
Xing, Z., Pei, J., Dong, G., & Yu, P. S. (2008). Mining Sequence Classifiers for Early Prediction. Proceedings of the 2008 SIAM International Conference on Data Mining, 644-655.
https://corescholar.libraries.wright.edu/knoesis/390
DOI
10.1137/1.9781611972788.59
Comments
Presented at the Society for Industrial and Applied Mathematics' International Conference on Data Mining, Atlanta, GA, April 24-26, 2008.