Publication Date
2012
Document Type
Thesis
Committee Members
Guozhu Dong (Committee Member), Pascal Hitzler (Committee Chair), Krishnaprasad Thirunarayan (Committee Member)
Degree Name
Master of Science (MS)
Abstract
The Semantic Web and OWL are relatively new and growing concepts in the World Wide Web. Because these concepts are so new, there are relatively few applications or tools that exploit their potential. Although there are many components to the Semantic Web, this thesis focuses on the research question, "How do we go about developing a web crawler for the Semantic Web that locates and retrieves OWL documents?" Specifically, we hypothesize that by giving priority to URIs of OWL documents, including all URIs found within those OWL documents, over other types of references, we will locate more OWL documents than with any other type of traversal. We reason that OWL documents contain proportionally more references to other OWL documents than non-OWL documents do, so giving them priority should yield more OWL files by the time the crawl terminates than any other traversal method.
To build such an OWL priority queue, we needed heuristics that predict which URIs refer to OWL documents while Semantic Web documents are being parsed in real time. These heuristics are based on filename extensions and OWL language constructs, neither of which can determine a document's type with certainty before retrieval. However, if our reasoning is correct, URIs found in an OWL document will likely lead to more OWL documents, so that when the crawl ends upon reaching a maximum document limit, we will have retrieved more OWL documents than with other methods such as breadth-first or load-balanced traversal. We conclude with an evaluation of our results to test the validity of our hypothesis and to judge whether it merits future research.
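The prioritized traversal described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the extension list, the function names `looks_like_owl` and `crawl`, the `fetch` callback, and the toy link graph `TOY_WEB` are all assumptions made for illustration. The heuristic here uses only the filename extension and whether the URI was found inside an OWL document; it does not model OWL language constructs.

```python
import heapq
from itertools import count

# Assumed extensions that hint at an OWL document (illustrative, not the thesis's list).
OWL_EXTENSIONS = (".owl", ".rdf")


def looks_like_owl(uri, found_in_owl=False):
    """Guess, before retrieval, whether a URI refers to an OWL document."""
    # Ignore any fragment or query string when checking the extension.
    path = uri.split("#", 1)[0].split("?", 1)[0].lower()
    return found_in_owl or path.endswith(OWL_EXTENSIONS)


def crawl(seeds, fetch, max_docs):
    """Retrieve up to max_docs documents, dequeuing predicted-OWL URIs first.

    fetch(uri) is a hypothetical callback returning (is_owl, links),
    or None when the document cannot be retrieved.
    """
    tie = count()      # insertion counter: FIFO order within a priority level
    frontier = []      # heap of (priority, insertion order, uri)
    seen = set()

    def push(uri, found_in_owl):
        if uri not in seen:
            seen.add(uri)
            priority = 0 if looks_like_owl(uri, found_in_owl) else 1
            heapq.heappush(frontier, (priority, next(tie), uri))

    for seed in seeds:
        push(seed, False)

    retrieved, owl_docs = [], []
    while frontier and len(retrieved) < max_docs:
        _, _, uri = heapq.heappop(frontier)
        result = fetch(uri)
        if result is None:
            continue
        is_owl, links = result
        retrieved.append(uri)
        if is_owl:
            owl_docs.append(uri)
        # URIs found inside an OWL document inherit OWL priority.
        for link in links:
            push(link, found_in_owl=is_owl)
    return retrieved, owl_docs


# Toy link graph standing in for the web: uri -> (is_owl, outgoing links).
TOY_WEB = {
    "http://example.org/index.html": (False, ["http://example.org/a.html",
                                              "http://example.org/onto.owl"]),
    "http://example.org/a.html": (False, ["http://example.org/b.html"]),
    "http://example.org/b.html": (False, []),
    "http://example.org/onto.owl": (True, ["http://example.org/more",
                                           "http://example.org/other.owl"]),
    "http://example.org/more": (True, []),
    "http://example.org/other.owl": (True, []),
}

retrieved, owl_docs = crawl(["http://example.org/index.html"], TOY_WEB.get, max_docs=4)
print(owl_docs)
```

On this toy graph, with a limit of four documents, the prioritized crawl reaches all three OWL documents, whereas a plain breadth-first order would spend two of its four retrievals on `a.html` and `b.html`. This shows only the mechanism; it is not evidence for the hypothesis itself.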
Page Count
90
Department or Program
Department of Computer Science
Year Degree Awarded
2012
Copyright
Copyright 2012, some rights reserved. My ETD may be copied and distributed only for non-commercial purposes and may be modified only if the modified version is distributed with these same permissions. All use must give me credit as the original author.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.