Publication Date


Document Type


Committee Members

Guozhu Dong (Advisor)

Degree Name

Doctor of Philosophy (PhD)


As a revolutionary technology, microarrays have great potential to provide genome-wide patterns of gene expression, to make accurate medical diagnoses, and to explore the genetic causes underlying diseases. It is commonly believed that suitable analysis of microarray datasets can help achieve these goals. While much has been done in microarray data mining, few previous studies, if any, have focused on multiple datasets at the comparative level. This dissertation aims to fill this gap by developing tools and methods for set-based comparative microarray data mining. Specifically, we mine highly differentiative gene groups (HDGGs) from given datasets/classes, evaluate the concordance of datasets generated from different platforms/laboratories, investigate the impact of variability in microarray datasets on data mining results, provide tools and algorithms for the above tasks, and identify reliable invariant HDGG patterns for a better understanding of diseases. Discovering high-quality discriminating (emerging) patterns from high-dimensional microarray datasets is a major challenge. We develop a novel feature-group selection method to help discover HDGGs, especially signature HDGGs that completely characterize some disease classes. In addition to giving insights into the diseases, HDGG-based classifiers achieve better classification results than other existing classifiers. As microarray datasets are often generated from different platforms/laboratories, it is necessary to evaluate their concordance/consistency before they can be studied together. We provide measures and techniques to quantitatively test such concordance at the comparative level. In addition to applying measures to evaluate the degree of variability in microarray datasets, we also develop a novel algorithm called C-loocv to effectively minimize the variability.
As an indicator of the utility of C-loocv, classifiers trained on C-loocv-refined datasets become more robust and predict test samples with significantly higher accuracy than classifiers trained on the original datasets. Based on the variability minimization algorithm, we provide a novel strategy to mine invariant patterns from multiple datasets concerning a common disease. As a demonstration, invariant patterns are identified from two datasets concerning lung cancer; these patterns may shed light on the mechanism underlying the pathogenesis of lung cancer. Our methods are generic and can be applied to microarrays concerning any human disease.

Page Count


Department or Program

Department of Computer Science and Engineering

Year Degree Awarded