Data Analytics Over a Search Engine's Corpus and Aggregate Suppression Open Access
Downloadable ContentDownload PDF
Search engines over document corpora typically provide keyword-search interfaces. Examples include search engines over the web as well as those over enterprise and government websites. The corpus of such a search engine forms a rich source of information with analytical interest to third parties, and the only available access to such information is by issuing search queries through its interface. To support data analytics over a search engine's corpus, one needs to address two main problems, the sampling of documents (for offline analytics) and the direct (online) estimation of aggregates, while issuing a small number of queries through the keyword-search interface. Existing work on sampling produces samples with unknown bias and may incur an extremely high query cost. Existing aggregate estimation technique suffers from a similar problem -- the estimation error and query cost can both be large for certain aggregates. In this dissertation, I propose novel techniques which produce unbiased samples as well as unbiased aggregate estimates with small variances while incurring a query cost an order of magnitude smaller than the existing techniques. Although current search engine's corpus mining techniques can be effective and efficient, they share a significant drawback that almost all of them require prior knowledge of a query pool - i.e., a pre-compiled small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement, as we shall show in this dissertation, is that it is extremely difficult, if not impossible, to create a ``good'' query pool unless one has very specific knowledge (e.g., size, topic, special terms used, etc.,) about the underlying corpus. On the other hand, the only existing pool-free technique was designed to be a theoretical demonstration with extremely high query costs and only AVG support (i.e., no SUM or COUNT queries). In this dissertation, I develop QG-SAMPLER and QG-ESTIMATOR, two novel pool-free techniques for estimating AVG, COUNT, and SUM queries over a search engine's corpus. The recent development of search engine's corpus mining techniques allows third parties to get big-picture aggregate information about the underlying corpus. This contradicts the traditionally belief that, because of its highly-restrictive interface (i.e., keyword search only, no SQL-style queries), a search engine serves its purpose of answering individual keyword-search queries without disclosing big-picture aggregates. The disclosure of such information may incur significant privacy concerns. I propose to consider a novel problem of suppressing sensitive aggregates for enterprise search engines while maintaining the quality of answers provided to individual keyword-search queries. In short, in this dissertation, I first propose novel techniques to support data analytics over a search engine's corpus, then I try to solve a common drawback for both of the current techniques - query pool creation. At last, from the perspective of a search engine's owner, I develop methods to preserve sensitive aggregate information while providing key word search service. I demonstrate the effectiveness and efficiency of these novel techniques through theoretical analysis and extensive experimental studies.