Tanvi Banerjee (Committee Member), Derek Doran (Committee Chair), John Gallagher (Committee Member)
Master of Science (MS)
With an ever-increasing amount of data shared and posted on the Web, the desire and necessity to automatically glean this information has led to an increase in the sophistication and volume of software agents called web robots or crawlers. Recent measurements, including our own across the entire logs of Wright State University Web servers over the past two years, suggest that at least 60% of all requests originate from robots rather than humans. Web robots display different statistical and behavioral patterns in their traffic compared to humans, yet present Web server optimizations presume that traffic exhibits predominantly human-like characteristics. Robots may thus be silently degrading the performance and scalability of our web systems. This thesis investigates a new take on a classic performance tool, namely web caches, to mitigate the impact of robot traffic on web server operations. It proposes a cache system architecture that: (i) services robot and human traffic in separate physical memory stores, with separate policies; (ii) uses an adaptable policy for admitting robot-related resources; and (iii) combines a deep neural network with Bayesian models to improve request prediction. Experiments with real data demonstrate (i) a significant reduction in bandwidth usage for prefetching and (ii) improvements in hit rate for human-driven traffic compared to a number of baselines, especially in configurations where web caches have limited size.
Department or Program
Department of Computer Science and Engineering
Year Degree Awarded
Copyright 2016, all rights reserved. My ETD will be available under the "Fair Use" terms of copyright law.