Publication Date


Document Type


Committee Members

Guozhu Dong (Advisor)

Degree Name

Doctor of Philosophy (PhD)


Large document repositories need to be organized and summarized to make them more accessible and understandable. Such needs exist in many applications, including web search, e-rulemaking (electronic rulemaking) and document archiving. Even though much has been done in the areas of document clustering and summarization, many new challenges remain as repositories become larger, more prevalent and more dynamic. In this dissertation, we investigate more informative ways to organize and summarize large document repositories, especially e-rulemaking feedback repositories (ERFRs), so that they can be managed and digested more efficiently and effectively. Specifically, we consider the following four tasks: 1) identifying important aspects of ERFRs, 2) constructing cluster descriptions for document clustering, 3) clustering ERFRs with simultaneous construction of succinct cluster descriptions, and 4) selecting representative arguments for ERFR clustering.

We propose to organize and summarize e-rulemaking feedback based on three major aspects of the rulemaking process, in order to meet the different needs of rule-writers and analysts: opinions (O), issues (I) and stakeholders (S). We introduce an OIS-based approach to producing an informative summaritive digest (SD) for a given ERFR. In addition, we introduce several novel concepts, approaches and algorithms, including the CDD measure, active feature selection (AFS), and Pagoda search algorithms.

An SD, simply put, consists of a document clustering, along with succinct cluster descriptions (SCDs) and representative arguments (RAs) for each cluster in the clustering. The clustering of an SD can be constructed in either a flat or a hierarchical manner. For hierarchical clustering, each level of the hierarchy can be constructed by emphasizing one of the O, I and S aspects, and different orders of O, I and S can be used for the levels of the hierarchy. Different clusterings can be used to meet the needs of different users. Given a goodness measure, a "best" clustering can be recommended to the user. An SCD consists of a set of carefully selected terms along with some statistics, and the RAs are typical arguments selected from each cluster. An RA should be a statement in which certain major stakeholders have expressed opinions on some of the important issues. Collectively, an SD provides an informative navigation aid that helps rule-writers and analysts manage and digest large ERFRs.
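
To make the structure of an SD concrete, the following is a minimal Python sketch of a flat digest: each cluster carries an SCD (terms with a simple statistic) and a few RAs. All names here (`ClusterSummary`, `build_summary`, `build_digest`) are hypothetical, and the term and argument selection shown is a plain document-frequency heuristic chosen only for illustration. It is not the CDD measure, AFS, or Pagoda search from the dissertation, and it assumes the clustering has already been computed.

```python
from __future__ import annotations

import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClusterSummary:
    """One cluster of a flat digest (hypothetical structure)."""
    documents: list[str]
    scd: list[tuple[str, int]]            # (term, number of cluster documents containing it)
    representative_arguments: list[str]   # comments chosen to typify the cluster


def tokenize(text: str) -> list[str]:
    # Lowercase alphabetic tokens; very short words are dropped as a crude stop-word filter.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3]


def build_summary(documents: list[str], num_terms: int = 3,
                  num_arguments: int = 1) -> ClusterSummary:
    # Document frequency of each term within the cluster.
    doc_freq: Counter[str] = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))

    # SCD: most widely shared terms, ties broken alphabetically for determinism.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    scd = ranked[:num_terms]
    scd_terms = {term for term, _ in scd}

    # RAs: comments covering the most SCD terms (stable sort keeps input order on ties).
    by_coverage = sorted(documents,
                         key=lambda d: -len(scd_terms & set(tokenize(d))))
    return ClusterSummary(documents, scd, by_coverage[:num_arguments])


def build_digest(clustering: dict[str, list[str]], **kwargs) -> dict[str, ClusterSummary]:
    # A flat digest: one summary per cluster label.
    return {label: build_summary(docs, **kwargs) for label, docs in clustering.items()}
```

A hierarchical SD could nest such digests, for example by clustering on opinions at the top level and on issues within each opinion cluster; that extension is left out of this sketch.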

We conduct an experimental evaluation of our approaches using publicly available ERFRs. The results suggest that the SD not only helps users browse the feedback, but also gives them a high-level sense of the feedback before they dig into individual comments. The results also show that our approaches are efficient and scalable for managing large document repositories.

Although we devoted special attention to e-rulemaking, we believe that most of the ideas are generic and can be readily applied to other types of repositories, including digital archives.

Page Count


Department or Program

Department of Computer Science and Engineering

Year Degree Awarded