Aggregate Estimations Over The Deep Web Open Access
Downloadable ContentDownload PDF
Many databases on the web are “hidden” behind (i.e., accessible only through) their restrictive search interfaces. Some of them allow a user to specify the desired values for one or a few attributes (i.e., form a conjunctive search query), and return to the user a small number (bounded by a constant k which can be 50 or 100) of tuples that match the user-specified query, selected and ranked according to a proprietary scoring function. Examples of such databases include Yahoo! Autos, Amazon.com, eBay.com, CareerBuilder.com, etc.Location based service (LBS) is another kind of hidden databases, which becomes very popular in recent years. They range from map services (e.g., Google Maps) that store geographic locations of points of interests, to online social networks (e.g., WeChat, Sina Weibo, FourSquare) that leverage user geographic locations to enable various recommendation functions. The public query interfaces of these services may be abstractly modeled as a kNN interface over a database of two dimensional points on a plane: given an arbitrary query point, the system returns the k points in the database that are nearest to the query point.In this dissertation we consider two problems. The first one is to obtain approximate estimates of SUM and COUNT aggregates by only querying these two types of databases (dynamic traditional hidden web databases and LBS databases) via their restrictive public interfaces. Theoretical analysis and extensive real-world experiments demonstrate the effectiveness of our proposed algorithms and their superiority over baseline solutions. The second one is to infer private attributes of real-world tuples. We demonstrate the effectiveness and efficiency of our techniques through theoretical analysis, extensive experiments over real-world datasets, as well as successful online attacks over websites with tens to hundreds of millions of users - e.g., Amazon Goodreads and Renren.com.