Notes on Sampling OSN Data

Source: Internet | Editor: 程序博客网 | Time: 2024/05/20 15:40

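Note: this article is my own reading notes. So far there are relatively few papers in China on how different sampling methods affect social network analysis, apart from the USDSG paper from Tsinghua that I have seen. Sampling is the first step of social network analysis.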

In recent years, the user population of Online Social Networks (OSNs) has grown explosively. Twitter, for example, had attracted almost 591 million users by May 2012, as counted by Twopcharts [1]. The worldwide spread of OSNs has motivated a large number of academics and researchers to model and analyze the structures and characteristics of OSNs. However, the complete dataset is typically not available, to some extent for privacy and economic reasons. A relatively small but representative sample is therefore desirable for studying the properties of these OSNs and testing algorithms on them, and various graph-sampling algorithms have been proposed for producing a representative sample of OSN users.

Currently, the graph-sampling algorithms for crawling OSNs can be roughly divided into two main categories: graph traversal techniques and random walks. In graph traversal techniques, each node in the connected component is visited exactly once if we let the process run to completion. The graph traversal techniques include Breadth-First Search (BFS), Depth-First Search (DFS), Forest Fire (FF) and snowball sampling [2, 3]. Snowball sampling randomly selects one seed node and performs a breadth-first search until the number of selected nodes reaches the desired sampling ratio, so it is often regarded as a BFS that ends early. BFS, in particular, is a basic technique that has been used extensively for sampling OSNs in past research [6, 7, 8]. There are also other sampling methods that assume the whole network is known; these are investigated in [4].
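The description of snowball sampling as a BFS that stops early can be sketched as follows. This is only a minimal illustration, assuming the graph fits in memory as an adjacency dictionary; the names `snowball_sample`, `toy_graph` and the `ratio` parameter are made up for this example and do not come from the cited papers.

```python
import random
from collections import deque


def snowball_sample(graph, ratio, seed=None, rng=None):
    """Run a BFS from one seed and stop once ratio * |V| nodes are collected.

    graph: dict mapping each node to a list of its neighbours (assumed format).
    ratio: desired sampling ratio in (0, 1].
    seed:  starting node; picked uniformly at random when omitted.
    """
    rng = rng or random.Random()
    if seed is None:
        seed = rng.choice(sorted(graph))
    target = max(1, int(ratio * len(graph)))

    visited = [seed]          # nodes in the order they were discovered
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < target:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                visited.append(neighbour)
                queue.append(neighbour)
                if len(visited) >= target:
                    break     # sampling ratio reached: end the BFS early
    return visited


# Hypothetical six-node undirected graph, used only for illustration.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}

sample = snowball_sample(toy_graph, ratio=0.5, seed="a")
print(sample)  # ['a', 'b', 'c']
```

With `ratio=1.0` the same function degenerates into a complete BFS of the seed's connected component, which is the sense in which each node is visited exactly once.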

Among the above algorithms, BFS, snowball sampling, DFS and the classic random walk are all biased sampling algorithms. BFS leads to a bias towards high-degree nodes [4] and underestimates the level of symmetry [16]. Furthermore, this bias has not been analyzed so far for arbitrary graphs. To remove this bias, effort is usually put into running the BFS to completion. In social network graphs, collecting samples via snowball sampling has been shown to underestimate the power-law coefficient but to match other metrics more closely, including the overall clustering coefficient [4]. Snowball sampling is also efficient at collecting data that form connected graphs. An example comparison between snowball-sampled networks and the complete network can be seen in [7].
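To make the bias of the classic random walk concrete: on a connected, non-bipartite undirected graph, a simple random walk visits a node with long-run probability proportional to its degree, deg(v) / (2|E|). The sketch below checks this numerically on a tiny graph by repeatedly applying the walk's transition rule. The function name `random_walk_stationary` and the four-node `toy_graph` are hypothetical and chosen only for this illustration.

```python
def random_walk_stationary(graph, steps=2000):
    """Approximate the stationary distribution of a simple random walk.

    graph: dict mapping each node to a list of its neighbours (undirected,
           connected and non-bipartite, so the iteration converges).
    """
    nodes = sorted(graph)
    prob = {v: 1.0 / len(nodes) for v in nodes}   # start from uniform
    for _ in range(steps):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            # The walker at v moves to each neighbour with equal probability.
            share = prob[v] / len(graph[v])
            for w in graph[v]:
                nxt[w] += share
        prob = nxt
    return prob


# Hypothetical graph: a triangle a-b-c with one extra node d attached to c.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

stationary = random_walk_stationary(toy_graph)
two_e = sum(len(nbrs) for nbrs in toy_graph.values())      # equals 2|E|
expected = {v: len(nbrs) / two_e for v, nbrs in toy_graph.items()}

for v in sorted(toy_graph):
    print(v, round(stationary[v], 4), expected[v])
```

Node c, with degree 3, is visited three times as often as node d, with degree 1, so any statistic averaged over the raw walk is skewed towards high-degree nodes unless it is corrected.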

When analyzing the impact of sampling algorithms on measuring and analyzing OSNs, the choice of baseline is important. BFS and RW (random walk) can be used as baselines, but the ideal one is the complete OSN, which is difficult to obtain. The most widely used baseline, which is itself a sampling algorithm, is UNI (uniform sampling of userIDs), often called the “ground truth”. Take Facebook as an example, in which each user (node) is assigned a unique 32-bit userID. UNI obtains uniformly random users by generating uniformly random 32-bit IDs: if a generated ID exists, we keep it; otherwise we discard it. This simple method is a textbook technique known as rejection sampling [19], and in general it allows sampling from any distribution of interest. But this algorithm has limitations. First, the ID space must not be sparse for this operation to be efficient. Second, the OSN must support the operations that let us verify that a user exists and retrieve that user's list of friends. A proof that the resulting sample is uniform can be found in [17].
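The UNI procedure described above can be sketched as a short rejection-sampling loop. This is an illustrative sketch under stated assumptions, not the crawler used in [17]: `user_exists` stands in for whatever lookup the OSN offers, and `uni_sample`, `id_bits` and `max_tries` are hypothetical names. The demo uses a made-up 8-bit ID space instead of real 32-bit userIDs so that it runs instantly.

```python
import random


def uni_sample(user_exists, n_samples, id_bits=32, rng=None, max_tries=1_000_000):
    """Draw userIDs uniformly at random by rejection sampling.

    user_exists: callable(int) -> bool; a placeholder for the OSN lookup.
    Returns the accepted IDs (drawn with replacement, so repeats can occur)
    and the number of candidate IDs that had to be generated.
    """
    rng = rng or random.Random()
    accepted = []
    tries = 0
    while len(accepted) < n_samples and tries < max_tries:
        tries += 1
        candidate = rng.getrandbits(id_bits)   # uniform over the whole ID space
        if user_exists(candidate):
            accepted.append(candidate)         # ID is allocated: keep it
        # otherwise the candidate is simply discarded
    return accepted, tries


# Toy stand-in for an OSN: 100 allocated IDs in a 256-value (8-bit) space.
existing_ids = set(range(0, 200, 2))

ids, tries = uni_sample(existing_ids.__contains__, n_samples=50,
                        id_bits=8, rng=random.Random(0))
print(len(ids), "accepted out of", tries, "candidates")
```

The ratio of accepted IDs to candidates reflects how densely the ID space is populated, which is why a sparse ID space makes this method inefficient.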

Conclusion:

We have listed the most commonly used sampling algorithms for OSNs, together with some considerations. It is best to use the complete data of an OSN, and UNI can be used in specific circumstances. BFS and snowball sampling are used for biased sampling in undirected OSNs, while RWRW (Re-Weighted Random Walk) and MHRW (Metropolis-Hastings Random Walk) are used for unbiased sampling in undirected OSNs, and USDSG is the unbiased sampling algorithm for directed OSNs. When crawling data in parallel using MHRW, RWRW and USDSG, we have to consider convergence.

[1] http://twopcharts.com/twitter500million.php 2012-5-26

[2] The term snowball sampling is from Steven K. Thompson, Sampling (John Wiley & Sons, Inc., New York, 2002).

[3] M. E. J. Newman, Soc. Networks 25, 83 (2003).

[4] Sang Hoon Lee, Pan-Jun Kim and Hawoong Jeong. “Statistical properties of sampled networks”. Phys. Rev. E, 73:016102, 2006.

[5] L. Lovász. “Random walks on graphs: A survey”. Combinatorics, 1993.

[6] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel and S. Bhattacharjee. “Measurement and Analysis of Online Social Networks”. IMC, 2007.

[7] Y. Ahn, S. Han, H. Kwak, S. Moon and H. Jeong. “Analysis of Topological Characteristics of Huge Online Social Networking Services”. WWW, 2007.

[8] C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy and B. Y. Zhao. “User interactions in social networks and their implications”. EuroSys, 2009.

[9] A. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger and D. Stutzbach. “Respondent-driven sampling for characterizing unstructured overlays”. INFOCOM Mini-Conference, April 2009.

[10] D. Heckathorn. “Respondent-driven sampling: A new approach to the study of hidden populations”. Social Problems, vol. 44, pp. 174-199, 1997.

[11] M. Hansen and W. Hurwitz. “On the theory of sampling from finite populations”. Annals of Mathematical Statistics, vol. 14, 1943.

[12] M. Salganik and D. Heckathorn. “Sampling and estimation in hidden populations using respondent-driven sampling”. Sociological Methodology, vol. 34, pp. 193-239, 2004.

[13] E. Volz and D. D. Heckathorn. “Probability based estimation theory for respondent-driven sampling”. Journal of Official Statistics, 2008.

[14] N. Metropolis, M. Rosenbluth, A. Rosenbluth, A. Teller and E. Teller. “Equation of state calculations by fast computing machines”. J. Chem. Physics, vol. 21, pp. 1087-1092, 1953.

[15] W. Gilks, S. Richardson and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1996.

[16] L. Becchetti, C. Castillo, D. Donato and A. Fazzone. “A Comparison of Sampling Techniques for Web Graph Characterization”. LinkKDD, 2006.

[17] Minas Gjoka, Maciej Kurant, Carter T. Butts and Athina P. Markopoulou. “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”. INFOCOM, pp. 2498-2506, 2010.

[18] Tianyi Wang, Yang Chen, Zengbin Zhang, Peng Sun, Beixing Deng and Xing Li. “Unbiased sampling in directed social graph”. ACM SIGCOMM, pp. 401-402, 2010.

[19] A. Leon-Garcia. “Probability, Statistics, and Random Processes For Electrical Engineering”. Prentice Hall, 2008.

[20] J. Geweke. “Evaluating the accuracy of sampling-based approaches to calculating posterior moments”. Bayesian Statist. 4, 1992.

[21] A. Gelman and D. Rubin. “Inference from iterative simulation using multiple sequences”. Statist. Sci., vol. 7, 1992.