Electronic Thesis/Dissertation

 

Privacy Preservation and Data Exploration on Databases Open Access

Downloadable Content

Download PDF

Privacy-preserving data publishing aims to publish a {\em microdata} table for research and statistical analysis, without disclosing sensitive information at the individual level. A large number of such microdata tables are published regularly, especially in the healthcare and census industry. Privacy-preservation problem may not only occur at the individual privacy level, but also can be caused by disclosing aggregate information. For example, many web databases are hidden behind restrictive form-like interfaces which allow users to execute search queries over the underlying hidden database. While it is important to support such search queries, many hidden database owners also want to maintain a certain level of privacy for aggregate information (e.g,., SUM, COUNT) over their databases, for reasons including business secrets and homeland security.This dissertation contains two parts to address the above problems. The first part studies the problem of traditional privacy-preserving data publishing. Regarding the data publishing algorithm design, when an adversary learns the details of an algorithm, s/he may utilize this knowledge to reverse-engineer the published table to compromise additional private information. We aim to eliminate algorithm-based disclosure from existing privacy-preserving data publishing algorithms via minor alterations. Regarding real-world privacy requirements for data publishing, we demonstrate that it is insufficient for the existing quasi-identifier/sensitive-attribute (QI-SA) framework on modeling them. We then propose a novel {\em versatile publishing} scheme with which privacy requirements can be specified as an arbitrary set of privacy rules over attributes in the microdata table. To enable versatile publishing, we introduce the {\em Guardian Normal Form} (GNF), a novel method of publishing multiple sub-tables such that each sub-table is anonymized by an existing QI-SA publishing algorithm, while the combination of all published tables guarantees all privacy rules.The second part focuses on the problem of privacy-preservation for aggregate information. Many deep web sites provide restrictive form-like interfaces which allow users to execute search queries on the underlying hidden databases. To demonstrate the importance of study in this area, we consider the problem of estimating the size (or other aggregates) of a hidden database through its web interface. We propose novel techniques which use a small number of queries to produce unbiased estimates with small variance. These techniques can also be used for approximate query processing over hidden databases.We observe that many deep web sites have restrictive form-like interfaces which may or may not provide domain information for an attribute. When attribute domains are not available, domain discovery becomes a critical challenge facing the application of a broad range of existing techniques on third-party analytical and mash-up applications over hidden databases. Then, we consider the problem of domain discovery over a hidden database through its web interface.We finally consider the problem of suppressing SUM and COUNT queries over a hidden database. In particular, we develop randomized generalization, a novel technique which provides rigid aggregate-suppression guarantee while maintaining the utility of individual search queries.

Author Language Keyword Date created Type of Work Rights statement GW Unit Degree Advisor Committee Member(s) Persistent URL
License

Relationships

Items