Faster Sampling Over Online Social Networks Open Access
Downloadable ContentDownload PDF
Many online social networks feature restrictive web interfaces that only allow the query of a user's local neighborhood through the interface. To enable analytics over such an online social network through its restrictive web interface, many recent efforts reuse the existing Markov Chain Monte Carlo methods such as random walks to sample the social network and support analytics based on the samples. The problem with such an approach, however, is the large amount of queries often required (i.e., a long ``mixing time'') for a random walk to reach a desired (stationary) sampling distribution. In this dissertation, we targeted the problem by shortening or eliminating the mixing time and leveraging the history of random walks. To reduce the mixing time, we have developed Modified TOpology (MTO)-Sampler which, by using only information exposed by the restrictive web interface, constructs a ``virtual'' overlay topology of the social network while performing a random walk, and ensures that the random walk follows the modified overlay topology (with shorter mixing time) rather than the original one. We then extend the idea of MTO from SRW to more general random walks like Metropolis-Hastings Random Walk (MTO-MHRW) and General Random Walks(MTO-GRW). To eliminate the mixing time, we also introduced WALK-ESTIMATE (WE)-Sampler, which starts with a much shorter random walk, and then proactively estimate the sampling probability for the node taken before using acceptance-rejection sampling to adjust the probability to the predetermined target distribution. We demonstrated the superiority of MTO-Sampler and WE-Sampler over traditional random walks through theoretical analysis and extensive experiments over real world online social networks. To leverage the history of random walks, we construct a higher-ordered Markov chain. We develop two algorithms, Circulated Neighbors and GroupBy Neighbors Random Walk (CNRW and GNRW) and prove that, no matter what the social network topology is, CNRW and GNRW offer better efficiency than baseline random walks while achieving the same stationary distribution.